ISBN 978-3-319-51516-8 ISBN 978-3-319-51517-5 (eBook)

Library of Congress Control Number: 2016961664

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature

The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Dedicated to Cindy

Preface

Since the publication of my first book, Design and Implementation of the MTX Operating System, by Springer in 2015, I have received inquiries from many enthusiastic readers about how to run the MTX OS on their ARM-based mobile devices, such as iPods and iPhones, which motivated me to write this book.

The purpose of this book is to provide a suitable platform for teaching and learning the theory and practice of embedded and real-time operating systems. It covers the basic concepts and principles of operating systems, and it shows how to apply them to the design and implementation of complete operating systems for embedded and real-time systems. In order to do this in a concrete and meaningful way, it uses the ARM toolchain for program development, and it uses ARM virtual machines to demonstrate the design principles and implementation techniques.

Due to its technical contents, this book is not intended for entry-level courses that teach only the concepts and principles of operating systems without any programming practice. It is intended for technically oriented Computer Science/Engineering courses on embedded and real-time systems that emphasize both theory and practice. The book's evolutional style, coupled with detailed source code and complete working sample systems, makes it especially suitable for self-study.

Undertaking this book project has proved to be yet another very demanding and time-consuming endeavor, but I enjoyed the challenges. While preparing the book manuscripts for publication, I have been blessed with encouragement and help from numerous people, including many of my former TaiDa EE60 classmates. I would like to take this opportunity to thank all of them. I am also grateful to Springer International Publishing AG for allowing me to disclose the source code of this book to the public for free; it is available for download at http://www.eecs.wsu.edu/~cs460/ARMhome.
Special thanks go to Cindy for her continuing support and inspirations, which have made this book possible. Last but not least, I would like to thank my family again for bearing with me with endless excuses of being busy all the time.

K.C. Wang received the BSEE degree from National Taiwan University in 1960 and the Ph.D. degree in Electrical Engineering from Northwestern University, Evanston, Illinois, in 1965. He is currently a Professor in the School of Electrical Engineering and Computer Science at Washington State University. His academic interests are in Operating Systems, Distributed Systems and Parallel Computing.

1 Introduction

1.1 About This Book

This book is about the design and implementation of embedded and real-time operating systems (Gajski et al. 1994). It covers the basic concepts and principles of operating systems (OS) (Silberschatz et al. 2009; Stallings 2011; Tanenbaum and Woodhull 2006; Wang 2015), embedded systems architecture (ARM Architecture 2016), embedded systems programming (ARM Programming 2016), real-time system concepts and real-time system requirements (Dietrich and Walker 2015). It shows how to apply the theory and practice of OS to the design and implementation of operating systems for embedded and real-time systems.

1.2 Motivations of This Book

In the early days, most embedded systems were designed for special applications. An embedded system usually consists of a microcontroller and a few I/O devices, and is designed to monitor some input sensors and generate signals to control external devices, such as turning on LEDs or activating switches. For this reason, the control programs of early embedded systems were also very simple; they were usually written in the form of a super-loop or a simple event-driven program structure. However, as the computing power of embedded systems has increased in recent years, embedded systems have undergone a tremendous leap in both complexity and areas of application. As a result, the traditional approaches to software design for embedded systems are no longer adequate. In order to cope with the ever-increasing system complexity and demands for extra functionality, embedded systems need more powerful software. As of now, many embedded systems are in fact high-power computing machines with multicore processors, gigabytes of memory and multi-gigabyte storage devices. Such systems are intended to run a wide range of application programs. In order to fully realize their potential, modern embedded systems need the support of multi-functional operating systems. A good example is the evolution of earlier cell phones to current smart phones. Whereas the former were designed only to perform the simple task of placing or receiving phone calls, the latter may use multicore processors and run adapted versions of Linux, such as Android, to perform multitasking. The current trend of embedded system software design is clearly moving in the direction of developing multi-functional operating systems suitable for the future mobile environment. The purpose of this book is to show how to apply the theory and practice of operating systems to develop OS for embedded and real-time systems.

1.3 Objective and Intended Audience

The objective of this book is to provide a suitable platform for teaching and learning the theory and practice of embedded and real-time operating systems. It covers embedded system architecture, embedded system programming, basic concepts and principles of operating systems (OS) and real-time systems. It shows how to apply these principles and programming techniques to the design and implementation of real OS for both embedded and real-time systems. This book is intended for computer science students and computer professionals who wish to study the internal details of embedded and real-time operating systems. It is suitable as a textbook for courses on embedded and real-time systems in technically oriented Computer Science/Engineering curriculums that strive for a balance between theory and practice. The book's evolutional style, coupled with detailed example code and complete working sample systems, makes it especially suitable for self-study by computer enthusiasts.

The book covers the entire spectrum of software design for embedded and real-time systems, ranging from simple super-loop and event-driven control programs for uniprocessor (UP) systems to complete Symmetric Multiprocessing (SMP) operating systems on multicore systems. It is also suitable for advanced study of embedded and real-time operating systems.

1.4 Unique Features of This Book

This book has many unique features, which distinguish it from other books.

1. This book is self-contained. It includes all the foundation and background information for studying embedded systems, real-time systems and operating systems in general. These include the ARM architecture, ARM instructions and programming (ARM architecture 2016; ARM926EJ-S 2008), the toolchain for developing programs (ARM toolchain 2016), virtual machines (QEMU Emulators 2010) for software implementation and testing, program execution image, function call conventions, run-time stack usage and linking C programs with assembly code.

2. Interrupts and interrupts processing are essential to embedded systems. This book covers interrupt hardware and interrupts processing in great detail. These include non-vectored interrupts, vectored interrupts (ARM PL190 2004), non-nested interrupts, nested interrupts (Nesting Interrupts 2011) and programming the Generic Interrupt Controller (GIC) (ARM GIC 2013) in ARM MPcore (ARM Cortex-A9 MPCore 2012) based systems. It shows how to apply the principles of interrupts processing to develop interrupt-driven device drivers and event-driven embedded systems.

3. The book presents a general framework for developing interrupt-driven device drivers, with an emphasis on the interaction and synchronization between interrupt handlers and processes. For each device, it explains the principles of operation and programming techniques before showing the actual driver implementation, and it demonstrates the device drivers by complete working sample programs.

4. The book shows the design and implementation of complete OS for embedded systems in incremental steps. First, it develops a simple multitasking kernel to support process management and process synchronization. Then it incorporates the Memory Management Unit (MMU) (ARM MMU 2008) hardware into the system to provide virtual address mappings and extends the simple kernel to support user mode processes and system calls. Then it adds process scheduling, signal processing, file system and user interface to the system, making it a complete operating system. The book's evolutional style helps the reader better understand the material.

5. Chapter 9 covers Symmetric Multiprocessing (SMP) (Intel 1997) embedded systems in detail. First, it explains the requirements of SMP systems and compares the ARM MPcore architecture (ARM11 2008; ARM Cortex-A9 MPCore 2012) with the SMP system architecture of Intel. Then it describes the SMP features of ARM MPcore processors, which include the SCU and GIC for interrupts routing and interprocessor communication and synchronization by Software Generated Interrupts (SGIs). It uses a series of programming examples to show how to start up the ARM MPcore processors and points out the need for synchronization in an SMP environment. Then it explains the ARM LDREX/STREX instructions and memory barriers and uses them to implement spinlocks, mutexes and semaphores for process synchronization in SMP systems. It presents a general methodology for SMP kernel design and shows how to apply the principles to adapt a uniprocessor (UP) kernel for SMP. In addition, it also shows how to use parallel algorithms for process and resource management to improve the concurrency and efficiency in SMP systems.

6. Chapter 10 covers real-time operating systems (RTOS). It introduces the concepts and requirements of real-time systems. It covers the various kinds of task scheduling algorithms in RTOS, and it shows how to handle priority inversion and task preemption in real-time systems. It includes case studies of several popular RTOS and formulates a set of general guidelines for RTOS design. It shows the design and implementation of a UP_RTOS for uniprocessor (UP) systems. Then it extends the UP_RTOS to SMP_RTOS for SMP, which supports nested interrupts, preemptive task scheduling, priority inheritance and inter-processor synchronization by SGI.

7. Throughout the book, it uses completely working sample systems to demonstrate the design principles and implementation techniques. It uses the ARM toolchain under Ubuntu (15.10) Linux to develop software for embedded systems, and it uses emulated ARM virtual machines under QEMU as the platform for implementation and testing.

1.5 Book Contents

This book is organized as follows.

Chapter 2 covers the ARM architecture, ARM instructions, ARM programming and the development of programs for execution on ARM virtual machines. These include ARM processor modes, register banks in different modes, instructions and basic programming in ARM assembly. It introduces the ARM toolchain under Ubuntu (15.10) Linux and emulated ARM virtual machines under QEMU. It shows how to use the ARM toolchain to develop programs for execution on the ARM Versatilepb virtual machine by a series of programming examples. It explains the function call convention in C and shows how to interface assembly code with C programs. Then it develops a simple UART driver for I/O on serial ports, and an LCD driver for displaying both graphic images and text. It also shows the development of a generic printf() function for formatted printing to output devices that support the basic print-char operation.

Chapter 3 covers interrupts and exceptions processing. It describes the operating modes of ARM processors, exception types and exception vectors. It explains the functions of interrupt controllers and the principles of interrupts processing in detail. Then it applies the principles of interrupts processing to the design and implementation of interrupt-driven device drivers. These include drivers for timers, keyboard, UARTs and SD cards (SDC 2016), and it demonstrates the device drivers by example programs. It explains the advantages of vectored interrupts over non-vectored interrupts. It shows how to configure the Vectored Interrupt Controllers (VICs) for vectored interrupts, and demonstrates vectored interrupts processing by example programs. It also explains the principles of nested interrupts and demonstrates nested interrupts processing by example programs.

Chapter 4 covers models of embedded systems. First, it explains and demonstrates the simple super-loop system model and points out its shortcomings. Then it discusses the event-driven model and demonstrates both periodic and asynchronous event-driven systems by example programs. In order to go beyond the simple super-loop and event-driven system models, it justifies the need for processes or tasks in embedded systems. Then it introduces the various kinds of process models, which are used as the models for developing embedded systems in the book. Lastly, it presents formal methodologies for embedded systems design. It illustrates the Finite State Machine (FSM) (Katz and Borriello 2005) model by a complete design and implementation example in detail.

Chapter 5 covers process management. It introduces the process concept and the basic principle of multitasking. It demonstrates the technique of multitasking by context switching. It shows how to create processes dynamically and discusses the goals, policy and algorithms of process scheduling. It covers process synchronization and explains the various kinds of process synchronization mechanisms, which include sleep/wakeup, mutexes and semaphores. It shows how to use process synchronization to implement event-driven embedded systems. It discusses the various schemes for inter-process communication, which include shared memory, pipes and message passing. It shows how to integrate these concepts and techniques to implement a uniprocessor (UP) kernel for process management, and it demonstrates the system requirements and programming techniques for both non-preemptive and preemptive process scheduling. The UP kernel serves as the foundation for developing complete operating systems in later chapters.

Chapter 6 covers the ARM memory management unit (MMU) and virtual address space mappings. It explains the ARM MMU in detail and shows how to configure the MMU for virtual address mapping using both one-level and two-level paging. In addition, it also explains the distinction between low VA space and high VA space mappings and their implications for system implementations. Rather than only discussing the principles of memory management, it demonstrates the various kinds of virtual address mapping schemes by complete working example programs.

Chapter 7 covers user mode processes and system calls. First it extends the basic UP kernel of Chapter 5 to support additional process management functions, which include dynamic process creation, process termination, process synchronization and wait for child process termination. Then it extends the basic kernel to support user mode processes. It shows how to use memory management to provide each process with a private user mode virtual address space that is isolated from other processes and protected by the MMU hardware. It covers and demonstrates the various kinds of memory management schemes, which include one-level sections and two-level static and dynamic paging. It covers the advanced concepts and techniques of fork, exec, vfork and threads. In addition, it shows how to use SD cards for storing both kernel and user mode image files in an SDC file system. It also shows how to boot up the system kernel from SDC partitions. This chapter serves as a foundation for the design and implementation of general purpose OS for embedded systems.

Chapter 8 presents a fully functional general purpose OS (GPOS), denoted by EOS, for uniprocessor (UP) ARM-based embedded systems. The following is a brief summary of the organization and capabilities of the EOS system.

1. System Images: Bootable kernel image and User mode executables are generated from a source tree by the ARM toolchain under Ubuntu (15.10) Linux and reside in an EXT2 (EXT2 2001) file system on an SDC partition. The SDC contains stage-1 and stage-2 booters for booting up the kernel image from the SDC partition. After booting up, the EOS kernel mounts the SDC partition as the root file system.

2. Processes: The system supports NPROC = 64 processes and NTHRED = 128 threads per process; both can be increased if needed. Each process (except the idle process P0) runs in either Kernel mode or User mode. Memory management of process images is by 2-level dynamic paging. Process scheduling is by dynamic priority and time slice. It supports inter-process communication by pipes and message passing. The EOS kernel supports the process management functions of fork, exec, vfork, threads, exit and wait for child termination.

3. Device drivers: It contains device drivers for the most commonly used I/O devices, which include LCD display, timer, keyboard, UART and SDC.

4. File system: EOS supports an EXT2 file system that is totally Linux compatible. It shows the principles of file operations, the control path and data flow from user space to kernel space down to the device driver level. It shows the internal organization of file systems, and it describes the implementation of a complete file system in detail.

5. Timer service, exceptions and signal processing: It provides timer service functions, and it unifies exception handling with signal processing, which allows users to install signal catchers to handle exceptions in User mode.

6. User Interface: It supports multi-user logins to the console and UART terminals. The command interpreter sh supports executions of simple commands with I/O redirections, as well as multiple commands connected by pipes.

7. Porting: The EOS system runs on a variety of ARM virtual machines under QEMU, mainly for convenience. It should also run on real ARM-based system boards that support suitable I/O devices. Porting EOS to some popular ARM-based systems, e.g. Raspberry PI-2, is currently underway. The plan is to make it available for readers to download as soon as it is ready.

Chapter 9 covers multiprocessing in embedded systems. It explains the requirements of Symmetric Multiprocessing (SMP) systems and compares the approach to SMP of Intel with that of ARM. It lists ARM MPcore processors and describes the components and functions of ARM MPcore processors in support of SMP. All ARM MPcore based systems depend on the Generic Interrupt Controller (GIC) for interrupts routing and inter-processor communication. It shows how to configure the GIC to route interrupts and demonstrates GIC programming by examples. It shows how to start up ARM MPcores and points out the need for synchronization in an SMP environment. It shows how to use the classic test-and-set or equivalent instructions to implement atomic updates and critical regions and points out their shortcomings. Then it explains the new features of ARM MPcores in support of SMP. These include the ARM LDREX/STREX instructions and memory barriers. It shows how to use the new features of ARM MPcore processors to implement spinlocks, mutexes and semaphores for process synchronization in SMP. It defines conditional spinlocks, mutexes and semaphores and shows how to use them for deadlock prevention in SMP kernels. It also covers the additional features of the ARM MMU for SMP. It presents a general methodology for adapting a uniprocessor OS kernel to SMP. Then it applies the principles to develop a complete SMP_EOS for embedded SMP systems, and it demonstrates the capabilities of the SMP_EOS system by example programs.

Chapter 10 covers real-time operating systems (RTOS). It introduces the concepts and requirements of real-time systems. It covers the various kinds of task scheduling algorithms in RTOS, which include RMS, EDF and DMS. It explains the problem of priority inversion due to preemptive task scheduling and shows how to handle priority inversion and task preemption. It includes case studies of several popular real-time OS and presents a set of general guidelines for RTOS design. It shows the design and implementation of a UP_RTOS for uniprocessor (UP) systems. Then it extends the UP_RTOS to an SMP_RTOS, which supports nested interrupts, preemptive task scheduling, priority inheritance and inter-processor synchronization by SGI.
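The spinlock idea summarized for Chapter 9 can be sketched in portable C11 atomics. This is a host-side illustration under stated assumptions, not the book's actual SMP code; on ARM MPcore, a compiler implements the atomic exchange below with an LDREX/STREX retry loop, which is the mechanism the chapter describes.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_init(spinlock_t *s) { atomic_init(&s->locked, 0); }

static void spin_lock(spinlock_t *s)
{
    /* Atomic exchange is the test-and-set: on ARM the compiler emits
       an LDREX/STREX loop for this operation. */
    while (atomic_exchange_explicit(&s->locked, 1, memory_order_acquire))
        ;   /* busy-wait (spin) until the current holder releases */
}

static void spin_unlock(spinlock_t *s)
{
    atomic_store_explicit(&s->locked, 0, memory_order_release);
}
```

The acquire/release memory orders correspond to the memory-barrier requirements the chapter discusses: writes made inside the critical region must be visible to the next CPU that acquires the lock.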

1.6 Use This Book as a Textbook for Embedded Systems

This book is suitable as a textbook for technically oriented courses on embedded and real-time systems in Computer Science/Engineering curricula that strive for a balance between theory and practice. A one-semester course based on this book may include the following topics.

The problems section of each chapter contains questions designed to review the concepts and principles presented in the chapter. While some of the questions involve only simple modifications of the example programs to let the students experiment with alternative design and implementation, many other questions are suitable for advanced programming projects.

1.8 Use This Book for Self-study

Judging from the large number of OS development projects and many popular sites on embedded and real-time systems posted on the Internet, and their enthusiastic followers, there is a tremendous number of computer enthusiasts who wish to learn the practical side of embedded and real-time operating systems. The evolutional style of this book, coupled with ample code and demonstration system programs, makes it especially suitable for self-study. It is hoped that this book will be useful and beneficial to such readers.

2 ARM Architecture and Programming
ARM (ARM Architecture 2016) is a family of Reduced Instruction Set Computing (RISC) microprocessors developed specifically for mobile and embedded computing environments. Due to their small size and low power requirements, ARM processors have become the most widely used processors in mobile devices, e.g. smart phones, and embedded systems. Currently, most embedded systems are based on ARM processors. In many cases, embedded system programming has become almost synonymous with ARM processor programming. For this reason, we shall also use ARM processors for the design and implementation of embedded systems in this book. Depending on their release time, ARM processors can be classified into classic cores and the more recent (since 2005) Cortex cores. Depending on their capabilities and intended applications, ARM Cortex cores can be classified into three categories (ARM Cortex 2016).

- The Cortex-M series: these are microcontroller-oriented processors intended for Micro Controller Unit (MCU) and System on Chip (SoC) applications.
- The Cortex-R series: these are embedded processors intended for real-time signal processing and control applications.
- The Cortex-A series: these are application processors intended for general purpose applications, such as embedded systems with full-featured operating systems.

The ARM Cortex-A series processors are the most powerful ARM processors, which include the Cortex-A8 (ARM Cortex-A8 2010) single core and the Cortex-A9 MPcore (ARM Cortex A9 MPcore 2016) with up to 4 CPUs. Because of their advanced capabilities, most recent embedded systems are based on the ARM Cortex-A series processors. On the other hand, there are also a large number of embedded systems intended for dedicated applications that are based on the classic ARM cores, which have proved to be very cost-effective. In this book, we shall cover both the classic and Cortex-A series processors. Specifically, we shall use the classic ARM926EJ-S core (ARM926EJ-ST 2008, ARM926EJ-ST 2010) for single CPU systems and the Cortex-A9 MPcore for multiprocessor systems. A primary reason for choosing these ARM cores is that they are available as emulated virtual machines (VMs). A major goal of this book is to show the design and implementation of embedded systems in an integrated approach. In addition to covering the theory and principles, it also shows how to apply them to the design and implementation of embedded systems by programming examples. Since most readers may not have access to real ARM-based systems, we shall use emulated ARM virtual machines under QEMU for implementation and testing. In this chapter, we shall cover the following topics.

2.2 ARM CPU Registers

2.2.1 General Registers

Figure 2.1 shows the organization of general registers in the ARM processor.

Fig. 2.1 Register banks in ARM processor


Fig. 2.2 Status register of ARM processor

User and System modes share the same set of registers. Registers R0-R12 are the same in all modes, except for the FIQ mode, which has its own separate registers R8-R12. Each mode has its own stack pointer (R13) and link register (R14). The Program Counter (PC or R15) and the Current Program Status Register (CPSR) are the same in all modes. Each privileged mode (SVC to FIQ) has its own Saved Program Status Register (SPSR).

2.2.2 Status Registers

In all modes, the ARM processor has the same Current Program Status Register (CPSR). Figure 2.2 shows the contents of the CPSR register. In the CPSR register, NZCV are the condition bits, I and F are the IRQ and FIQ interrupt mask bits, T = Thumb state, and M[4:0] are the processor mode bits, which define the processor mode as

USR: 10000 (0x10)
FIQ: 10001 (0x11)
IRQ: 10010 (0x12)
SVC: 10011 (0x13)
ABT: 10111 (0x17)
UND: 11011 (0x1B)
SYS: 11111 (0x1F)
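For illustration, the M[4:0] mode field of a CPSR value can be decoded in C. The helper below is not from the book, but the mode encodings it uses (USR 0x10, FIQ 0x11, IRQ 0x12, SVC 0x13, ABT 0x17, UND 0x1B, SYS 0x1F) are architectural.

```c
#include <stdint.h>

/* Map the M[4:0] mode field of a CPSR value to a mode name. */
static const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & 0x1F) {          /* M[4:0] are the low 5 bits */
    case 0x10: return "USR";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "SVC";
    case 0x17: return "ABT";
    case 0x1B: return "UND";
    case 0x1F: return "SYS";
    default:   return "???";        /* reserved encoding */
    }
}
```

For example, a CPSR value of 0x92 decodes to IRQ mode with the I bit set, and 0xD3 (the usual value after reset) decodes to SVC mode with both I and F set.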

2.2.3 Change ARM Processor Mode

All the ARM modes are privileged except the User mode, which is unprivileged. Like most other CPUs, the ARM processor changes mode in response to exceptions or interrupts. Specifically, it changes to FIQ mode when an FIQ interrupt occurs. It changes to IRQ mode when an IRQ interrupt occurs. It enters the SVC mode when power is turned on, following reset, or when executing a SWI instruction. It enters the Abort mode when a memory access exception occurs, and it enters the UND mode when it encounters an undefined instruction. An unusual feature of the ARM processor is that, while in a privileged mode, it can change mode freely by simply altering the mode bits in the CPSR, using the MSR and MRS instructions. For example, when the ARM processor starts or following a reset, it begins execution in SVC mode. While in SVC mode, the system initialization code must set up the stack pointers of the other modes. To do this, it simply changes the processor to the appropriate mode and initializes the stack pointer (R13_mode) and the saved program status register (SPSR) of that mode. The following code segment shows how to switch the processor to IRQ mode while preserving the other bits, e.g. the F and I bits, in the CPSR.

MRS r0, cpsr        // get cpsr into r0
BIC r0, r0, #0x1F   // clear the 5 mode bits in r0
ORR r0, r0, #0x12   // change the mode bits to IRQ mode
MSR cpsr, r0        // write r0 back to cpsr

If we do not care about the CPSR contents other than the mode field, e.g. during system initialization, changing to IRQ mode can be done by writing a value to the CPSR directly, as in

MSR cpsr, #0x92 // IRQ mode with I bit=1

A special usage of SYS mode is to access User mode registers, e.g. R13 (sp) and R14 (lr), from a privileged mode. In an operating system, processes usually run in the unprivileged User mode. When a process makes a system call (by SWI), it enters the system kernel in SVC mode. While in kernel mode, the process may need to manipulate its User mode stack and the return address to the User mode image. In this case, the process must be able to access its User mode sp and lr. This can be done by switching the CPU to SYS mode, which shares the same set of registers with User mode. Likewise, when an IRQ interrupt occurs, the ARM processor enters IRQ mode to execute an interrupt service routine (ISR) to handle the interrupt. If the ISR allows nested interrupts, it must switch the processor from IRQ mode to a different privileged mode to handle nested interrupts. We shall demonstrate this later in Chap. 3 when we discuss exceptions and interrupts processing in ARM-based systems.

2.3 Instruction Pipeline

The ARM processor uses an internal pipeline to increase the rate of instruction flow to the processor, allowing several operations to be undertaken simultaneously, rather than serially. In most ARM processors, the instruction pipeline consists of 3 stages, FETCH-DECODE-EXECUTE, as shown below.

PC → FETCH      Fetch instruction from memory
     DECODE     Decode the instruction
     EXECUTE    Execute the instruction

The Program Counter (PC) actually points to the instruction being fetched, rather than the instruction being executed. This has implications for function calls and interrupt handlers. When calling a function using the BL instruction, the return address is actually PC-4, which is adjusted by the BL instruction automatically. When returning from an interrupt handler, the return address is also PC-4, which must be adjusted by the interrupt handler itself, unless the ISR is defined with the __attribute__((interrupt)) attribute, in which case the compiled code will adjust the link register automatically. For some exceptions, such as Abort, the return address is PC-8, which points to the original instruction that caused the exception.

2.4 ARM Instructions

2.4.1 Condition Flags and Conditions

In the CPSR of ARM processors, the highest 4 bits, NZCV, are the condition flags, or simply the condition code, where

N = negative, Z = zero, C = carry bit out, V = overflow

Condition flags are set by comparison and TST operations. By default, data processing operations do not affect the condition flags. To cause the condition flags to be updated, an instruction can be postfixed with the S symbol, which sets the S bit in the instruction encoding. For example, both of the following instructions add two numbers, but only the ADDS instruction affects the condition flags.

ADD r0, r1, r2 ; r0 = r1 + r2

ADDS r0, r1, r2 ; r0 = r1 + r2 and set condition flags
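As an illustration of what ADDS computes, the NZCV results of a 32-bit addition can be modeled in host-side C. This is a model of the flag rules, not ARM code; the function name and struct are ours.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool n, z, c, v; } flags_t;

/* Model of how a 32-bit ADDS sets the NZCV condition flags. */
static uint32_t adds(uint32_t a, uint32_t b, flags_t *f)
{
    uint32_t r = a + b;                      /* wraps modulo 2^32 */
    f->n = (r >> 31) & 1;                    /* N: bit 31 of the result */
    f->z = (r == 0);                         /* Z: result is zero */
    f->c = (r < a);                          /* C: carry out of the unsigned add */
    f->v = ((~(a ^ b) & (a ^ r)) >> 31) & 1; /* V: operands of equal sign, result of different sign */
    return r;
}
```

For instance, adding 1 to 0x7FFFFFFF sets N and V (signed overflow into a negative result), while adding 1 to 0xFFFFFFFF sets Z and C (unsigned wrap to zero).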

In the ARM 32-bit instruction encoding, the leading 4 bits [31:28] represent the various combinations of the condition flag bits, which form the condition field of the instruction (if applicable). Based on the various combinations of the condition flag bits, conditions are defined mnemonically as EQ, NE, LT, GT, LE, GE, etc. The following shows some of the most commonly used conditions and their meanings.

A rather unique feature of the ARM architecture is that almost all instructions can be executed conditionally. An instruction may contain an optional condition suffix, e.g. EQ, NE, LT, GT, GE, LE, etc., which determines whether the CPU will execute the instruction based on the specified condition. If the condition is not met, the instruction will not be executed at all, without any side effects. This eliminates the need for many branches in a program, which tend to disrupt the instruction pipeline. To execute an instruction conditionally, simply postfix it with the appropriate condition. For example, an unconditional ADD instruction has the form:

ADD r0, r1, r2 ; r0 = r1 + r2

To execute the instruction only if the zero flag is set, append EQ to the instruction.

ADDEQ r0, r1, r2 ; If zero flag is set then r0 = r1 + r2

Similarly for other conditions.

2.4.2 Branch Instructions

Branching instructions have the form

B{<cond>} label ; branch to label

BL{<cond>} subroutine ; branch to subroutine with link

The Branch (B) instruction causes a direct branch to an offset relative to the current PC. The Branch with Link (BL) instruction is for subroutine calls. It writes PC-4 into the LR of the current register bank and replaces PC with the entry address of the subroutine, causing the CPU to enter the subroutine. When the subroutine finishes, it returns via the saved return address in the link register R14. Most other processors implement subroutine calls by saving the return address on the stack. Rather than saving the return address on the stack, the ARM processor simply copies PC-4 into R14 and branches to the called subroutine. If the called subroutine does not call other subroutines, it can use the LR to return to the calling place quickly. To return from a subroutine, the program simply copies LR (R14) into PC (R15), as in

MOV PC, LR        or        BX LR

However, this works only for one-level subroutine calls. If a subroutine intends to make another call, it must save and restore the LR register explicitly, since each subroutine call changes the current LR. Instead of MOV, the MOVS form can also be used when returning from an exception; it restores the original flags by copying the saved SPSR back into the CPSR.

2.4.3 Arithmetic Operations

The syntax of arithmetic operations is:

<Operation>{<cond>}{S} Rd, Rn, Operand2

The instruction performs an arithmetic operation on two operands and places the result in the destination register Rd. The first operand, Rn, is always a register. The second operand can be either a register or an immediate value, which is routed through the barrel shifter to generate the actual operand value. Examples:

ADD r0, r1, r2 ; r0 = r1 + r2

SUB r3, r3, #1 ; r3 = r3 - 1

2.4.4 Comparison Operations

Comparison operations update the condition flag bits in the status register without writing a result:

CMP: operand1 - operand2, result not written
TST: operand1 AND operand2, result not written
TEQ: operand1 EOR operand2, result not written

The updated condition flags can then be used as conditions in subsequent instructions.

2.4.6 Data Movement Operations

MOV{<cond>}{S} Rd, Operand2

MOV r0, r1 ; r0 = r1 : Always execute

2.4.7 Immediate Value and Barrel Shifter

The barrel shifter is another unique feature of ARM processors. It is used to perform shift operations and to generate immediate operands inside the ARM processor. The ARM processor does not have separate shift instructions. Instead, it has a barrel shifter, which performs shifts as part of other instructions. Shift operations include the conventional shift left, shift right and rotate, as in

MOV r0, r0, LSL #1 ; shift r0 left by 1 bit (multiply r0 by 2)

Most other processors allow loading CPU registers with immediate values, which form part of the instruction stream, making the instruction length variable. In contrast, all ARM instructions are 32 bits long and do not use the instruction stream as data. This presents a challenge when using immediate values in instructions. The data processing instruction format has 12 bits available for operand2. If used directly, this would only give a range of 0-4095. Instead, the 12 bits store a 4-bit rotate value and an 8-bit constant in the range 0-255. The 8-bit constant can be rotated right by an even number of positions (i.e. ROR by 0, 2, 4, ..., 30). This gives a much larger set of values that can be loaded directly. For example, to load r0 with the immediate value 4096, use

MOV r0, #0x40, 26 ; generate 4096 (0x1000) by 0x40 ROR 26

To make this feature easier to use, the assembler will convert a constant to this form automatically when given an instruction such as

MOV r0, #4096

The assembler will generate an error if the given value cannot be converted this way. Alternatively, the LDR pseudo-instruction allows loading an arbitrary 32-bit value into a register, e.g.

LDR rd, =numeric_constant

If the constant can be constructed with either a MOV or an MVN, that instruction will actually be generated. Otherwise, the assembler generates an LDR with a PC-relative address to read the constant from a literal pool.

2.4.8 Multiply Instructions

MUL{<cond>}{S} Rd, Rm, Rs ; Rd = Rm * Rs

MLA{<cond>}{S} Rd, Rm, Rs, Rn ; Rd = (Rm * Rs) + Rn

2.4.9 Load and Store Instructions

The ARM processor is a load/store architecture: data must be loaded into registers before use, and there are no memory-to-memory data processing operations. The ARM processor has three sets of instructions which interact with memory. These are:

• Single register data transfer (LDR/STR).

• Block data transfer (LDM/STM).
• Single data swap (SWP).

The basic load and store instructions load and store a word or a byte:

LDR / STR / LDRB / STRB

2.4.10 Base Register

Load/store instructions may use a base register as an index to specify the memory location to be accessed. The index may include an offset in either pre-index or post-index addressing mode. An example of using an index register is

STR r0, [r1] ; store r0 to location pointed by r1.

2.4.11 Block Data Transfer

The base register determines where the memory access occurs. Four different addressing modes allow increment and decrement, inclusive or exclusive of the base register location. The base register can optionally be updated after the transfer by appending a '!' symbol to it. These instructions are very efficient for saving and restoring execution context, e.g. using a memory area as a stack, or moving large blocks of data in memory. It is worth noting that, when using these instructions to save/restore multiple CPU registers to/from memory, the register order in the instruction does not matter: lower-numbered registers are always transferred to/from lower addresses in memory.

2.4.12 Stack Operations

A stack is a memory area which grows as new data is "pushed" onto the "top" of the stack, and shrinks as data is "popped" off the top of the stack. Two pointers are used to define the current limits of the stack.

• A base pointer: points to the "bottom" of the stack (the first location).
• A stack pointer: points to the current "top" of the stack.

A stack is called descending if it grows downward in memory, i.e. the last pushed value is at the lowest address, and ascending if it grows upward in memory. The ARM processor supports both descending and ascending stacks. In addition, it allows the stack pointer to point either to the last occupied address (a Full stack) or to the next unoccupied address (an Empty stack). In ARM, stack operations are implemented by the STM/LDM instructions. The stack type is determined by the postfix of the STM/LDM instructions:

• STMFD/LDMFD: Full Descending stack
• STMFA/LDMFA: Full Ascending stack
• STMED/LDMED: Empty Descending stack
• STMEA/LDMEA: Empty Ascending stack

The C compiler always uses a Full Descending stack. The other forms are rare and almost never used in practice. For this reason, we shall use only Full Descending stacks throughout this book.

2.4.13 Stack and Subroutines

A common use of the stack is to create temporary workspace for subroutines. When a subroutine begins, any registers that must be preserved can be pushed onto the stack. When the subroutine ends, it restores the saved registers by popping them off the stack before returning to the caller. The general pattern is to push the working registers and the LR on entry, e.g. STMFD sp!, {r0-r12, lr}, and to pop them on exit with the saved LR going into the PC, e.g. LDMFD sp!, {r0-r12, pc}.

If the pop instruction has the S bit set (written with the '^' symbol), then loading the PC while in a privileged mode also copies the saved SPSR into the CPSR, returning the CPU to the mode it was in before the exception (e.g. SWI or IRQ).

2.4.14 Software Interrupt (SWI)

In ARM, the SWI instruction is used to generate a software interrupt. After executing an SWI instruction, the ARM processor switches to SVC mode and executes from the SVC vector address 0x08, causing it to execute the SWI handler, which is usually the entry point of system calls to the OS kernel. We shall demonstrate system calls in Chap. 5.

2.4.15 PSR Transfer Instructions

The MRS and MSR instructions transfer the contents of the CPSR/SPSR between the appropriate status register and a general purpose register: MRS reads a status register into a general purpose register, and MSR writes a general purpose register into a status register. Either the entire status register or only the flag bits can be transferred. These instructions are used mainly to change the processor mode while in a privileged mode.

MRS{<cond>} Rd, <psr> ; Rd = <psr>

MSR{<cond>} <psr>, Rm ; <psr> = Rm

2.4.16 Coprocessor Instructions

The ARM architecture treats many hardware components, e.g. the Memory Management Unit (MMU), as coprocessors, which are accessed by special coprocessor instructions. We shall cover coprocessors in Chap. 6 and later chapters.

Fig. 2.3 Toolchain components

2.5 ARM Toolchain

A toolchain is a collection of programming tools for program development, from source code to binary executable files. A toolchain usually consists of an assembler, a compiler, a linker, some utility programs, e.g. objcopy for file conversions, and a debugger. Figure 2.3 depicts the components and data flows of a typical toolchain. A toolchain runs on a host machine and generates code for a target machine. If the host and target architectures are different, the toolchain is called a cross toolchain, or simply a cross compiler. Quite often, the toolchain used for embedded system development is a cross toolchain. In fact, this is the standard way of developing software for embedded systems. If we develop code on a Linux machine based on the Intel x86 architecture but the code is intended for an ARM target machine, we need a Linux-based ARM-targeting cross compiler. There are many different versions of Linux-based toolchains for the ARM architecture (ARM toolchains 2016). In this book, we shall use the arm-none-eabi toolchain under Ubuntu Linux versions 14.04/15.10. The reader can install the toolchain, as well as qemu-system-arm for ARM virtual machines, on Ubuntu Linux as follows.

sudo apt-get install gcc-arm-none-eabi

sudo apt-get install qemu-system-arm

In the following sections, we shall demonstrate how to use the ARM toolchain and ARM virtual machines under QEMUby programming examples.

2.6 ARM System Emulators

QEMU supports many emulated ARM machines (QEMU Emulator 2010). These include the ARM Integrator/CP board, the ARM Versatile baseboard, the ARM RealView baseboard and several others. The supported ARM CPUs include the ARM926E, ARM1026E, ARM946E, ARM1136 and Cortex-A8. All of these are uniprocessor (UP), or single CPU, systems. To begin with, we shall consider only uniprocessor (UP) systems. Multiprocessor (MP) systems will be covered later in Chap. 9. Among the emulated ARM virtual machines, we shall choose the ARM Versatilepb baseboard (ARM926EJ-S 2016) as the platform for implementation and testing.

QEMU loads the t.bin file to 0x10000 in RAM and executes it directly. This is very convenient, since it eliminates the need for storing the system image in a flash memory and relying on a dedicated booter to boot up the system image.

2.7 ARM Programming

2.7.1 ARM Assembly Programming Example 1

We begin ARM programming with a series of example programs. For ease of reference, we shall label the example programs C2.x, where C2 denotes the chapter number and x denotes the program number. The first example program, C2.1, consists of a ts.s file in ARM assembly. The following shows the steps of developing and running the example program.

The program code loads the CPU register r0 with the value 1 and r1 with the value 2. Then it adds r0 to r1 and stores the result into the memory location labeled result. Before continuing, it is worth noting the following. First, in an assembly program, instructions are case insensitive. An instruction may use uppercase, lowercase or even mixed cases. For better readability, the coding style should be consistent, either all lowercase or all uppercase. However, other symbols, e.g. memory locations, are case sensitive. Second, as shown in the program, we may use the symbol @ or // to start a comment line, or include comments in matched pairs of /*

and */. Which kind of comment to use is a matter of personal preference. In this book, we shall use // for single comment lines, and matched pairs of /* and */ for comment blocks that may span multiple lines, both of which are applicable to assembly code and C programs.

(2). The mk script file:

A sh script, mk, is used to (cross) compile-link ts.s into an ELF file. It then uses objcopy to convert the ELF file into a binary executable image named t.bin.

arm-none-eabi-objcopy -O binary t.elf t.bin   # convert t.elf to t.bin

The reader may include all the above commands in a mk script, which will compile-link and run the binary executable with a single script command.

Fig. 2.4 Register contents of program C2.1

(4). Check Results: To check the results of running the program, enter the QEMU monitor commands:

info registers : display CPU registers

xp /wd [address] : display memory contents in 32-bit words

Figure 2.4 shows the register contents after running the C2.1 program. As the figure shows, register R2 contains 0x0001001C, which is the address of result. Alternatively, the command line arm-none-eabi-nm t.elf in the mk script also shows the locations of symbols in the program. The reader may enter the QEMU monitor command

xp /wd 0x1001C

to display the contents of result, which should be 3. To exit QEMU, enter Control-a x, or Control-C to terminate the QEMUprocess.

2.7.2 ARM Assembly Programming Example 2

The next example program, denoted C2.2, uses ARM assembly code to compute the sum of an integer array. It shows how to use a stack to call a subroutine. It also demonstrates the indirect and post-indexed addressing modes of ARM instructions. For the sake of brevity, we only show the ts.s file. All other files are the same as in the C2.1 program. C2.2: ts.s file:

Fig. 2.5 Register contents of program C2.2

ldmfd sp!, {r0-r4, pc} // pop stack, return to caller

The program computes the sum of an integer array. The number of array elements (10) is defined in the memory location labeled N, and the array elements are defined in the memory area labeled Array. The sum is computed in R0, which is saved into the memory location labeled Result. As before, run the mk script to generate a binary executable t.bin. Then run t.bin under QEMU. When the program stops, use the monitor commands info and xp to check the results. Figure 2.5 shows the results of running the C2.2 program. As the figure shows, register R0 contains the computed result 0x37 (55 in decimal). The reader may use the command

arm-none-eabi-nm t.elf

to display symbols in an object code file. It lists the memory locations of the global symbols in the t.elf file, such as

0001004C N
00010050 Array
00010078 Result

Then, use xp /wd 0x10078 to see the contents of Result, which should be 55 in decimal.

2.7.3 Combine Assembly with C Programming

Assembly programming is indispensable, e.g. when accessing and manipulating CPU registers, but it is also very tedious. In systems programming, assembly code should be used as a tool to access and control low-level hardware, rather than as a means of general programming. In this book, we shall use assembly code only when absolutely necessary. Whenever possible, we shall implement program code in the high-level language C. In order to integrate assembly and C code in the same program, it is essential to understand program execution images and the calling convention of C.

2.7.3.1 Execution Image

An executable image (file) generated by a compiler-linker consists of three logical parts:

Text section: also called the Code section, containing executable code
Data section: initialized global and static variables, static constants
BSS section: uninitialized global and static variables (BSS is not in the image file)

During execution, the executable image is loaded into memory to create a run-time image, which looks like the following.

(Low address)                                         (High address)
|   Code   |   Data   |   BSS   |   Heap   |   Stack   |

A run-time image consists of five (logically) contiguous sections. The Code and Data sections are loaded directly from the executable file. The BSS section is created from the BSS section size in the executable file header; its contents are usually cleared to zero. In the execution image, the Code, Data and BSS sections are fixed and do not change. The Heap area is for dynamic memory allocation within the execution image. The Stack is for function calls during execution. It is logically at the high (address) end of the execution image, and it grows downward, i.e. from high addresses toward low addresses.

2.7.3.2 Function Call Convention in C

The function call convention of C consists of the following steps between the calling function (the caller) and the called function (the callee).

---------------------------------- Caller ----------------------------------

(1). Load the first 4 parameters in r0-r3; push any extra parameters on the stack.
(2). Transfer control to the callee by BL callee.

When calling the function func(), the caller must pass six parameters (a, b, c, d, e, f) to the called function. The first 4 parameters (a, b, c, d) are passed in registers r0-r3. Any extra parameters are passed via the stack. When control enters the called function, the stack top contains the extra parameters (in reverse order). For this example, the stack top contains the two extra parameters e and f. The initial stack looks like the following.

In the diagram, the (byte) offsets are relative to the location pointed to by the FP register. While execution is inside a function, the extra parameters (if any) are at [fp, +offset], and local variables and saved parameters are at [fp, -offset], all referenced using FP as the base register. From the assembly code lines (3) and (4), which save the first 4 parameters passed in r0-r3 and assign values to the local variables x, y, z, we can see that the stack contents become as shown in the next diagram.

Although the stack is a piece of contiguous memory, logically each function can only access a limited area of the stack. The stack area visible to a function is called the stack frame of the function, and FP (r11) is called the stack frame pointer. At assembly code line (5), the function calls g(x, y) with only two parameters. It loads x into r0, y into r1 and then executes BL g. At assembly code line (6), it computes a + e as the return value in r0. At assembly code line (7), it deallocates the space in the stack and pops the saved FP and LR into FP and PC, causing execution to return to the caller. The ARM C compiler generated code uses only r0-r3 for scratch values. If a function defines any register variables, they are assigned the registers r4-r11, which are saved on the stack first and restored when the function returns. If a function does not call out, there is no need to save/restore the link register LR. In that case, the ARM C compiler generated code does not save and restore the LR, allowing faster entry/exit of function calls.

2.7.3.3 Long Jump

In a sequence of function calls, such as

main() -> A() -> B() -> C();

when a called function finishes, it normally returns to its calling function, e.g. C() returns to B(), which returns to A(), etc. It is also possible to return directly to an earlier function in the calling sequence by a long jump. The following program demonstrates long jump in Unix/Linux.

printf("back to main() via long jump, r=%d a=%d\n", r, a);

In the above longjump.c program, the main() function first calls setjmp(), which saves the current execution environment in a jmp_buf structure and returns 0. It then proceeds to call A(), which calls B(). While in the function B(), if the user chooses not to return by long jump, the functions show the normal return sequence. If the user chooses to return by longjmp(env, 1234), execution returns to the last saved environment with a nonzero value. In this case, it causes B() to return to main() directly, bypassing A(). The principle of long jump is very simple. When a function finishes, it returns via the (callerLR, callerFP) in the current stack frame, as shown in the following diagram.

If we replace (callerLR, callerFP) with the (savedLR, savedFP) of an earlier function in the calling sequence, execution would return to that function directly. For example, we may implement setjmp(int env[2]) and longjmp(int env[2], int value) in assembly as follows.

ldmfd sp!, {fp, pc} // return via REPLACED LR and FP

A long jump can be used to abort a function in a calling sequence, causing execution to resume in a known environment saved earlier. In addition to (savedLR, savedFP), setjmp() may also save other CPU registers and the caller's SP, allowing longjmp() to restore the complete execution environment of the original function. Although rarely used in user mode programs, long jump is a common technique in systems programming. For example, it may be used in a signal catcher to bypass a user mode function that caused an exception or trap error. We shall demonstrate this technique later in Chap. 8 on signals and signal processing.

2.7.3.4 Call Assembly Function from C

The next example program, C2.3, shows how to call an assembly function from C. The main() function in C calls the assembly function sum() with 6 parameters, which returns the sum of all the parameters. In accordance with the calling convention of C, the main() function passes the first 4 parameters a, b, c, d in r0-r3 and the remaining parameters e, f on the stack. Upon entry to the called function, the stack top contains the parameters e, f in order of increasing addresses. The called function first establishes the stack frame by saving LR and FP on the stack and letting FP point at the saved LR. The parameters e and f are then at FP + 4 and FP + 8, respectively. The sum function simply accumulates all the parameters in r0 and returns to the caller.

Note that in the C2.3 program, the sum() function does not save r0-r3 but uses them directly. Therefore, the code may be more efficient than that generated by the ARM GCC compiler. Does this mean we should write all programs in assembly? The answer is, of course, a resounding NO. It should be easy for the reader to figure out the reasons.

2.7.3.6 Inline Assembly

In the above examples, we have written assembly code in separate files. Most ARM toolchains are based on GCC. The GCC compiler supports inline assembly, which is often used in C code for convenience. The basic format of inline assembly is

__asm__("assembly code"); or simply asm("assembly code");

If the assembly code has more than one line, the statements are separated by \n\t, as in

In the above code segment, %0 refers to a, %1 refers to b and %%r0 refers to the r0 register. The constraint operator "r" means to use a register for the operand. The clobber list also tells the GCC compiler that the r0 register will be modified by the inline code. Although we may insert fairly complex inline assembly code in a C program, overdoing it may compromise the readability of the program. In practice, inline assembly should be used only if the code is very short, e.g. a single assembly instruction, or if the intended operation involves a CPU control register. In such cases, inline assembly code is not only clearer but also more efficient than calling an assembly function.

2.8 Device Drivers

The emulated ARM Versatilepb board is a virtual machine. It behaves just like a real hardware system, but there are no drivers for the emulated peripheral devices. In order to do any meaningful programming, whether on a real or a virtual system, we must implement device drivers to support basic I/O operations. In this book, we shall develop drivers for the most commonly used peripheral devices through a series of programming examples. These include drivers for UART serial ports, timers, the LCD display, the keyboard and the multimedia SD card, which will be used later as a storage device for file systems. A practical device driver should use interrupts. We shall show interrupt-driven device drivers in Chap. 3 when we discuss interrupts and interrupt processing. In the following, we shall show a simple UART driver by polling and an LCD driver, neither of which uses interrupts. In order to do this, it is necessary to know the ARM Versatile system architecture.

2.8.1 System Memory Map

The ARM system architecture uses memory-mapped I/O. Each I/O device is assigned a block of contiguous memory in the system memory map. The internal registers of each I/O device are accessed as offsets from the device base address. Table 2.1 shows the (condensed) memory map of the ARM Versatile/926EJ-S board (ARM 926EJ-S 2016). In the memory map, I/O devices occupy a 2 MB area beginning at 256 MB.

2.8.2 GPIO Programming

Most ARM based system boards provide General Purpose Input-Output (GPIO) pins as an I/O interface to the system. Some of the GPIO pins can be configured for input, others for output. In many beginning-level embedded system courses, the programming assignments and course projects usually involve programming the GPIO pins of a small embedded system board to interface with real devices, such as switches, sensors, LEDs and relays. Compared with other I/O devices, GPIO programming is relatively simple. A GPIO interface, e.g. that of the LPC2129 MCU used in many early embedded system boards, consists of four 32-bit registers:

GPIODIR: set pin direction; 0 for input, 1 for output
GPIOSET: set pin voltage level to high (3.3 V)
GPIOCLR: set pin voltage level to low (0 V)
GPIOPIN: reading this register returns the states of all pins

The GPIO registers can be accessed as word offsets from a (memory-mapped) base address. In the GPIO registers, each bit corresponds to a GPIO pin. Depending on the direction setting in GPIODIR, each pin can be connected to an appropriate I/O device. As a specific example, assume that we want to use GPIO pin0 for input, connected to a de-bounced switch, and pin1 for output, connected to the ground side of an LED with its own +3.3 V voltage source and a current-limiting resistor. We can program the GPIO registers as follows.

GPIODIR: bit0=0 (input), bit1=1 (output)
GPIOSET: all bits=0 (no pin is driven high)
GPIOCLR: bit1=1 (pin1 set to LOW or ground)
GPIOPIN: read pin states; check bit0 for any input
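The switch-and-LED logic above can be sketched in C. On real hardware, gpio would be a volatile pointer cast from the GPIO base address; here it is an ordinary struct so the logic can be exercised anywhere, and the register layout is the simplified four-register one just described (the assumption that a closed switch reads 1 depends on the wiring):

```c
#include <stdint.h>

/* Simplified GPIO register block, as described in the text. On real
 * hardware: volatile gpio_t *gpio = (volatile gpio_t *)GPIO_BASE; */
typedef struct {
    uint32_t dir;   /* GPIODIR: 0 = input, 1 = output, per bit */
    uint32_t set;   /* GPIOSET: write 1 to drive a pin high    */
    uint32_t clr;   /* GPIOCLR: write 1 to drive a pin low     */
    uint32_t pin;   /* GPIOPIN: read back the pin states       */
} gpio_t;

#define SWITCH_BIT (1u << 0)    /* pin0: de-bounced input switch    */
#define LED_BIT    (1u << 1)    /* pin1: LED, active low (grounded) */

void gpio_init(gpio_t *g) {
    g->dir = LED_BIT;           /* pin1 output; pin0 (and rest) input */
}

/* Poll the switch and update the LED: driving pin1 low grounds the
 * LED's cathode (LED on); driving it high turns the LED off. */
void led_update(gpio_t *g) {
    if (g->pin & SWITCH_BIT)    /* assume a closed switch reads 1 */
        g->clr = LED_BIT;       /* switch closed -> LED on  */
    else
        g->set = LED_BIT;       /* switch open   -> LED off */
}
```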

Similarly, we may program other pins for desired I/O functions. Programming the GPIO registers can be done in either assembly code or C. Given the GPIO base address and the register offsets, it should be fairly easy to write a GPIO control program which

• turn on the LED if the input switch is pressed or closed, and

• turn off the LED if the input switch is released or open.

We leave this and other GPIO programming cases as exercises in the Problem section. In some systems, the GPIO interface may be more sophisticated, but the programming principle remains the same. For example, on the ARM Versatile-PB board, GPIO interfaces are arranged in separate groups called ports (Port0 to Port2), at the base addresses 0x101E4000-0x101E6000. Each port provides 8 GPIO pins, which are controlled by an (8-bit) GPIODIR register and an (8-bit) GPIODATA register. Instead of checking the input pin states, GPIO inputs may use interrupts. Although interesting and inspiring to students, GPIO programming can only be performed on real hardware systems. Since the emulated ARM VMs do not have GPIO pins, we can only describe the general principles of GPIO programming. However, all the ARM VMs support a variety of other I/O devices. In the following sections, we shall show how to develop drivers for such devices.

2.8.3 UART Driver for Serial I/O

Relying on the QEMU monitor commands to display register and memory contents is very tedious. It would be much better if we could develop device drivers to do I/O directly. In the next example program, we shall write a simple UART driver for I/O on emulated serial terminals. The ARM Versatile board supports four PL011 UART devices for serial I/O (ARM PL011 2016). Each UART device has a base address in the system memory map. The base addresses of the 4 UARTs are 0x101F1000 (UART0), 0x101F2000 (UART1), 0x101F3000 (UART2) and 0x10009000 (UART3).

(1). Write to the baud rate divisor registers to set the baud rate.
(2). Write to the Line Control register to specify the number of bits per char and the parity, e.g. 8 bits per char with no parity.
(3). Write to the Interrupt Mask register to enable/disable RX and TX interrupts.

When using the emulated ARM Versatilepb board, QEMU appears to use default values for both the baud rate and the line control parameters, making steps (1) and (2) either optional or unnecessary. In fact, it is observed that writing any value to the integer divisor register (0x24) would work, but this is not the norm for UARTs in real systems. For the emulated Versatilepb board, all we need to do is program the Interrupt Mask register (if using interrupts) and check the Flag register during serial I/O. To begin with, we shall implement the UART I/O by polling, which only checks the Flag register. Interrupt-driven device drivers will be covered in Chap. 3 when we discuss interrupts and interrupt processing. When developing device drivers, we may need assembly code in order to access CPU registers and the interface hardware. However, we shall use assembly code only when absolutely necessary. Whenever possible, we shall implement the driver code in C, keeping the amount of assembly code to a minimum. The UART driver and test program, C2.5, consists of the following components.

(1). ts.s file: when the ARM CPU starts, it is in the Supervisor (SVC) mode. The ts.s file sets the SVC mode stack pointer and calls main() in C.

(3). uart.c file: this file implements a simple UART driver. The driver uses the UART data register to input/output chars, and it checks the flag register for device readiness. The following lists the meanings of the UART register contents.

For more serial ports, add -serial /dev/pts/1 -serial /dev/pts/2, etc. to the command line. Under Linux, open xterms as pseudo terminals. Enter the Linux ps command to see the pts/n numbers of the pseudo terminals, which must match the pts/n numbers in the -serial /dev/pts/n options of QEMU. On each pseudo terminal there is a Linux sh process running, which will grab all inputs to the terminal. To use a pseudo terminal as a serial port, the Linux sh process must be made inactive. This can be done by entering the Linux sh command

sleep 1000000

which lets the Linux sh process sleep for a large number of seconds. Then the pseudo terminal can be used as a serial port ofQEMU.

2.8.3.1 Demonstration of UART Driver

In the uart.c file, each UART device is represented by a UART data structure. As of now, the UART structure contains only a base address and a unit ID number. During UART initialization, the base address of each UART structure is set to the physical address of the UART device. The UART registers are accessed as *(up->base + OFFSET) in C. The driver consists of 2 basic I/O functions, ugetc() and uputc().

(1). int ugetc(UART *up): this function returns a char from the UART port. It loops until the UART flag register no longer has RXFE set, indicating there is a char in the data register. Then it reads the data register, which clears the RXFF bit and sets the RXFE bit in the FR, and returns the char.
(2). int uputc(UART *up, int c): this function outputs a char to the UART port. It loops until the UART flag register no longer has TXFF set, indicating the UART is ready to transmit another char. Then it writes the char to the data register for transmission.
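In outline, the two functions look like this (a sketch of the driver just described; the offsets and flag bits follow the PL011 layout: data register at offset 0x00, flag register at 0x18, RXFE = 0x10, TXFF = 0x20):

```c
#define UDR  0x00   /* data register offset         */
#define UFR  0x18   /* flag register offset         */
#define RXFE 0x10   /* flag bit: receive FIFO empty */
#define TXFF 0x20   /* flag bit: transmit FIFO full */

typedef struct uart {
    volatile char *base;   /* base address of the UART */
    int n;                 /* unit ID number           */
} UART;

int ugetc(UART *up) {
    while (*(up->base + UFR) & RXFE);  /* wait for a char in RX FIFO  */
    return *(up->base + UDR);          /* read it from the data reg   */
}

int uputc(UART *up, int c) {
    while (*(up->base + UFR) & TXFF);  /* wait until TX FIFO not full */
    *(up->base + UDR) = (char)c;       /* write char for transmission */
    return c;
}
```

On the board, base would be initialized to one of the UART base addresses; the busy-wait loops are what make this a polling driver.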

The functions ugets() and uprints() are for I/O of strings or lines. They are based on ugetc() and uputc(). This is the typical way in which I/O functions are developed. For example, with gets(), we can implement an int atoi(char *s) function which converts a sequence of numerical digits into an integer. Similarly, with putc(), we can implement a printf() function for formatted printing, etc. We shall develop and demonstrate the printf() function in the next section on the LCD driver. Figure 2.6 shows the outputs of running the C2.5 program, which demonstrates the UART driver.
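The digit-to-integer conversion mentioned above can be sketched as below. The function name str2int is ours (the text calls it atoi()); negative numbers and overflow are ignored to keep the sketch short.

```c
/* Minimal atoi()-style helper: converts a string of decimal digits,
   e.g. a line collected by ugets(), into an integer. */
int str2int(char *s)
{
  int n = 0;
  while (*s >= '0' && *s <= '9')
    n = n * 10 + (*s++ - '0');   /* shift accumulated value, add digit */
  return n;
}
```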

2.8.3.2 Use TCP/IP Telnet Session as UART Port

In addition to pseudo terminals, QEMU also supports TCP/IP telnet sessions as serial ports. First run the program as

qemu-system-arm -M versatilepb -m 128M -kernel t.bin \

-serial telnet:localhost:1234,server

When QEMU starts, it will wait until a telnet connection is made. From another (X-window) terminal, enter telnet localhost 1234 to connect. Then, enter lines from the telnet terminal.

Fig. 2.6 Demonstration of UART driver program


2.8.4 Color LCD Display Driver

The ARM Versatile board supports a color LCD display, which uses the ARM PL110 Color LCD controller (ARM PrimeCell Color LCD Controller PL110, ARM Versatile Application Baseboard for ARM926EJ-S). On the Versatile board, the LCD controller is at the base address 0x10120000. It has several timing and control registers, which can be programmed to provide different display modes and resolutions. To use the LCD display, the controller's timing and control registers must be set up properly. ARM's Versatile Application Baseboard manual provides the following timing register settings for VGA and SVGA modes.

The LCD's frame buffer address register must point to a frame buffer in memory. With 24 bits per pixel, each pixel is represented by a 32-bit integer, in which the low 3 bytes are the BGR values of the pixel. For VGA mode, the needed frame buffer size is 1200 KB. For SVGA mode, the needed frame buffer size is 1875 KB. In order to support both VGA and SVGA modes, we shall allocate a frame buffer size of 2 MB. Assuming that the system control program runs in the lowest 1 MB of physical memory, we shall allocate the memory area from 2 to 4 MB for the frame buffer. In the LCD control register (0x1012001C), bit0 is LCD enable and bit11 is power-on; both must be set to 1. Other bits are for byte order, number of bits per pixel, mono or color mode, etc. In the LCD driver, bits 3–1 are set to 101 for 24 bits per pixel; all other bits are 0s for little-endian byte order by default. The reader may consult the LCD technical manual for the meanings of the various bits. It should be noted that, although the ARM manual lists the LCD control register at offset 0x1C, it is in fact at offset 0x18 on the emulated Versatilepb board of QEMU. The reason for this discrepancy is unknown.
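Composing the control register value described above can be sketched as follows. The function name lcd_control_value is ours; the actual register write, shown only as a comment, needs the real (emulated) hardware, and the bit positions are those stated in the text (bit0 enable, bits 3–1 = 101 for 24 bpp, bit11 power-on).

```c
/* Sketch of building the LCD control register value: bit0 = LCD
   enable, bits[3:1] = 101 (binary) selects 24 bits per pixel,
   bit11 = power-on; all other bits left 0 for the defaults. */
typedef unsigned int u32;

u32 lcd_control_value(void)
{
  u32 v = 0;
  v |= 1u << 0;        /* bit0: LCD enable        */
  v |= 5u << 1;        /* bits[3:1] = 101: 24 bpp */
  v |= 1u << 11;       /* bit11: power-on         */
  return v;
  /* on the QEMU versatilepb board the driver would then do:
     *(volatile u32 *)(0x10120000 + 0x18) = v;  */
}
```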

2.8.4.1 Display Image Files

As a memory mapped display device, the LCD can display both images and text. It is actually much easier to display images than text. The principle of displaying images is rather simple. An image consists of H (height) by W (width) pixels, where H<=480 and W<=640 (for VGA mode). Each pixel is speciﬁed by a 3-byte RGB color value. To display an image, simply extract the RGB values of each pixel and write them to the corresponding pixel location in the display frame buffer. There are many different image ﬁle formats, such as BMP, JPG, PNG, etc. Applications for Microsoft Windows typically use BMP images. JPG images are popular with Internet Web pages due to their smaller size. Each image ﬁle has a header which contains information about the image. In principle, it should be fairly easy to read the ﬁle header and then extract the pixels of the image ﬁle. However, many image ﬁles are in compressed formats, e.g. JPG ﬁles, which must be uncompressed ﬁrst. Since our purpose here is to show the LCD display driver, rather than the manipulation of image ﬁles, we shall only use 24-bit color BMP ﬁles due to their simple image format. Table 2.2 shows the format of BMP ﬁles.

A 24-bit color BMP image ﬁle is uncompressed. It begins with a 14-byte ﬁle header, in which the ﬁrst two bytes are the BMP ﬁle signature 'B' and 'M', indicating that it is a BMP ﬁle. Following the ﬁle header is a 40-byte image header, which contains the width (W) and height (H) of the image in number of pixels at the byte offsets 18 and 22, respectively. The image header also contains other information, which can be ignored for simple BMP ﬁles. Immediately following the image header are 3-byte BGR values of the image pixels arranged in H rows. In a BMP ﬁle, the image is stored upside down: the ﬁrst row in the image ﬁle is actually the bottom row of the image. Each row contains (W*3) bytes rounded up to a multiple of 4. The example program reads BMP images and displays them to the LCD screen.
Since the LCD can only display 640x480 pixels in VGA mode, larger images can be displayed in reduced size, e.g. 1/2 or 1/4 of their original size.
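The header layout described above can be sketched as follows. The helper names (le32, bmp_width, bmp_height, bmp_rowsize) are ours, not from the example program; the offsets (W at ﬁle offset 18, H at 22, both 32-bit little-endian) and the round-up of each row to a multiple of 4 bytes are as stated in the text.

```c
/* Sketch of pulling the dimensions out of a 24-bit BMP header and
   computing the padded row size. Offsets are from the start of the
   file (14-byte file header + 4 bytes into the image header). */
typedef unsigned int u32;

static u32 le32(unsigned char *p)        /* little-endian 32-bit read */
{
  return p[0] | (p[1] << 8) | (p[2] << 16) | ((u32)p[3] << 24);
}

u32 bmp_width (unsigned char *hdr) { return le32(hdr + 18); }
u32 bmp_height(unsigned char *hdr) { return le32(hdr + 22); }

u32 bmp_rowsize(u32 w)                   /* W*3 rounded up to 4 bytes */
{
  return (w * 3 + 3) & ~3u;
}
```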

2.8.4.2 Include Binary Data Sections

Raw image ﬁles can be included as binary data sections in an executable image. Assume that IMAGE is a raw image ﬁle. The following steps show how to include it as a binary data section in an executable image.

#---- linker script file t.ld -------

2.8.4.3 Programming Example C2.6: LCD Driver

The example program C2.6 implements an LCD driver which displays raw image ﬁles. The program consists of the following components.

(1). The ts.s ﬁle: Since the driver program does not use interrupts, nor tries to handle any exceptions, there is no need to install the exception vectors. Upon entry, it sets up the SVC mode stack and calls main() in C.

/*********** ts.s file of C2.6 *********/

(2). The vid.c ﬁle: This is the LCD driver. It initializes the LCD registers to VGA mode with 640x480 resolution and sets the frame buffer at 2 MB. It also includes code for SVGA mode with 800x600 resolution, but it is commented out.

(3). uart.c ﬁle: This is the same UART driver as in Example C2.5, except that it uses the basic uputc() function to implement a uprintf() function for formatted printing.
(4). The t.c ﬁle: This ﬁle contains the main() function, which calls the show_bmp() function to display images. In the linker script, two image ﬁles, image1 and image2, are included as binary data sections in the executable image. The start positions of the image ﬁles can be accessed by the symbols _binary_image1_start and _binary_image2_start generated by the linker.

(5). The mk script ﬁle: The mk script generates object code for the image ﬁles, which are included as binary data sections in the executable image.

# mk and run script ﬁle of C2.6: the only thing new is to convert images to object ﬁles.

2.8.4.4 Demonstration of Display Images on LCD

Figure 2.7 shows the sample outputs of running the C2.6 program in VGA mode. The top part of Fig. 2.7 shows the UART port I/O. The bottom part of the ﬁgure shows the LCD display. When the program starts, it ﬁrst displays image1 at (row = 0, col = 80) on the LCD, and it also prints the image size to UART0. Entering an input key from UART0 will let it display image2 at (row = 120, col = 0), etc. Displaying image ﬁles can be very interesting; it is the basis of computer animation. Variations of the image displaying program are listed as exercises in the Problems section for interested readers.

2.8.4.5 Display Text

In order to display text, we need a font ﬁle, which speciﬁes the fonts or bit patterns of the ASCII chars. The font ﬁle, font.bin, is a raw bitmap of 128 ASCII chars, in which each char is represented by an 8x16 bitmap, i.e. each char is represented by 16 bytes, and each byte speciﬁes the pixels of a scan line of the char. To display a char, use the char's ASCII code value (multiplied by 16) as an offset to access its bytes in the bitmap. Then scan the bits in each byte. For each 0 bit, write BGR = 0x000000 (black) to the corresponding pixel. For each 1 bit, write BGR = 0xFFFFFF (white) to the pixel. Instead of black and white, each char can also be displayed in color by writing different RGB values to the pixels.

Like image ﬁles, raw font ﬁles can be included as binary data sections in the executable image. An alternative way is to convert bitmaps to char maps ﬁrst. As an example, the following program, bitmap2charmap.c, converts a font bitmap to a char map.
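The char-drawing loop described above can be sketched as follows. The function name dchar_sketch is ours, the small fbuf[] array stands in for the real 2 MB frame buffer so the sketch runs on a host, and the assumption that the leftmost pixel of a scan line is the most significant bit of the font byte is ours (the book's font.bin is not available here).

```c
/* Sketch of displaying one char from an 8x16 font bitmap: for each
   of the 16 scan-line bytes, a 1 bit sets the pixel to white
   (0xFFFFFF) and a 0 bit to black, at frame-buffer index
   (y + row)*WIDTH + (x + bit). */
typedef unsigned int u32;
#define WIDTH 640

u32 fbuf[640 * 480];                 /* stand-in frame buffer */

void dchar_sketch(unsigned char *fontmap, int ch, int x, int y)
{
  unsigned char *bp = fontmap + ch * 16;      /* 16 bytes per char */
  for (int row = 0; row < 16; row++)
    for (int bit = 0; bit < 8; bit++) {
      u32 color = ((bp[row] >> (7 - bit)) & 1) ? 0xFFFFFF : 0x000000;
      fbuf[(y + row) * WIDTH + (x + bit)] = color;
    }
}
```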

Fig. 2.7 Demonstration of display images on LCD

Unlike raw bitmap ﬁles, which must be converted to object ﬁles ﬁrst, char map ﬁles are larger in size but they can be included directly in the C code.

2.8.4.6 Color LCD Display Driver Program

The example program C2.7 demonstrates an LCD driver for displaying text. The program consists of the following components.

(1). ts.s ﬁle: The ts.s ﬁle is the same as in the example program C2.6.
(2). vid.c ﬁle: The vid.c ﬁle implements a driver for the ARM PL110 LCD display [ARM PL110, 2016]. On the Versatilepb board the base address of the color LCD is 0x10120000. Other registers are at (u32) offsets from the base address.

2.8.4.7 Explanations of the LCD Driver Code

The LCD screen may be regarded as a rectangular box consisting of 480x640 pixels. Each pixel has a (x,y) coordinate on the screen. Correspondingly, the frame buffer, u32 fbuf[ ], is a memory area containing 480*640 u32 integers, in which the low 24 bits of each integer represent the BGR values of a pixel. The linear address or index of a pixel in fbuf[ ] at the coordinate (x,y) = (column, row) is given by the Mailman's algorithm (Chap. 2, Wang 2015).

pixel index = x + y*640;

The basic display functions of the LCD driver are

(1). setpix(x,y): set the pixel at (x, y) to BGR values (by a global color variable).
(2). clrpix(x,y): clear the pixel at (x, y) by setting the BGR to the background color (black).
(3). dchar(char, x, y): display a char at the coordinate (x, y). Each char is represented by an 8x16 bitmap. For a given char value (0 to 127), dchar() fetches the 16 bytes of the char from the bitmap. For each bit in a byte, it calls clrpix(x+bitNum, y+byteNum) to clear the pixel ﬁrst. This erases the old char, if any, at (x, y). Otherwise, it would display the composite bit patterns of the chars, making them unreadable. Then it calls setpix(x+bitNum, y+byteNum) to set the pixel if the bit is 1.
(4). erasechar(): erase the char at (x, y). For a memory mapped display device, once a char is written to the frame buffer, it will remain there (and hence be rendered on the screen) until it is erased. For ordinary chars, dchar() automatically erases the original char. For special chars like the cursor, it is necessary to erase it ﬁrst before moving it to a different position. This is done by the erasechar() operation.
(5). kputc(char c): display a char at the current (row, col) and move the cursor, which may cause the screen to scroll up.
(6). scroll(): scroll the screen up or down by one line.
(7). The cursor: When displaying text, the cursor allows the user to see where the next char will be displayed. The ARM LCD controller does not have a cursor generator. In the LCD driver, the cursor is simulated by a special char (ASCII code 127) in which all the pixels are 1's, which deﬁnes a solid rectangular box as the cursor. The cursor may be made to blink if it is turned on/off periodically, e.g. every 0.5 s, which requires a timer. We shall show the blinking cursor later in Chap. 3 when we implement timers with timer interrupts. The putcursor() function draws the cursor at the current (row, col) position on the screen, and the erasecursor() function erases the cursor from its current position.
(8). The printf() function: For any output device that supports the basic print-char operation, we can implement a printf() function for formatted printing. The following shows how to develop such a generic printf() function, which can be used for both the UART and the LCD display. First, we implement a printu() function, which prints unsigned integers.

char *ctable = "0123456789ABCDEF";

The function rpu(x) generates the digits of x % 10 in ASCII recursively and prints them on the return path. For example, if x=123, the digits are generated in the order '3', '2', '1', which are printed as '1', '2', '3', as they should be. With printu(), writing a printd() function to print signed integers becomes trivial. By setting BASE to 16, we can print in hex. Assume that we have prints(), printd(), printu() and printx() already implemented. Then we can write a generic printf(char *fmt, …) function for formatted printing.
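The recursive printu() logic can be sketched as below. To make it runnable on a host, the sketch collects output into a buffer through an emit() helper of ours, where the book's version calls uputc() or kputc() directly; everything else (ctable, BASE, rpu recursing first and printing on the return path, the trailing space) follows the text.

```c
/* Sketch of printu(): rpu() peels off the digits of x with x % BASE,
   recursing on x / BASE first, so the digits are emitted most
   significant first on the return path. Setting BASE to 16 prints hex. */
#define BASE 10
static char *ctable = "0123456789ABCDEF";

static char obuf[64];                 /* host-side output collector */
static int  opos;
static void emit(char c) { obuf[opos++] = c; obuf[opos] = 0; }

static void rpu(unsigned int x)
{
  if (x) {
    rpu(x / BASE);                    /* recurse on the high digits */
    emit(ctable[x % BASE]);           /* print this digit on return */
  }
}

void printu(unsigned int x)
{
  if (x == 0) emit('0');              /* rpu() alone prints nothing for 0 */
  else rpu(x);
  emit(' ');
}
```

From here, printd() only needs to emit a '-' and negate before calling printu(), and a variadic printf(char *fmt, …) walks fmt with va_arg, dispatching %c, %s, %d, %u, %x to the corresponding helpers.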

(5). The t.c ﬁle: The t.c ﬁle contains the main() function and the show_bmp() function. It ﬁrst initializes both the UART and LCD drivers. The UART driver is used for I/O from the serial port. For demonstration purposes, it displays outputs to both the serial port and the LCD. When the program starts, it displays a small logo image at the top of the screen. The scroll upper limit is set to a line below the logo image, so that the logo will remain on the screen when the screen is scrolled upward.

(6). t.ld ﬁle: The linker script includes the object ﬁles of a font and an image as binary data sections, similar to that of the example program C2.6.
(7). mk and run script ﬁle: This is similar to that of C2.6.

2.8.4.8 Demonstration of LCD Driver Program

Figure 2.8 shows the sample outputs of running the example program C2.7. It uses the LCD driver to display both images and text on the LCD screen. In addition, it also uses the UART driver to do I/O from the serial port.

Fig. 2.8 Demonstration of display text on LCD

2.9 Summary

This chapter covers the ARM architecture, ARM instructions, programming in ARM assembly and the development of programs for execution on ARM virtual machines. These include ARM processor modes, banked registers in different modes, instructions and basic programming in ARM assembly. Since most software for embedded systems is developed by cross-compiling, it introduces the ARM toolchain, which allows us to develop programs for execution on emulated ARM virtual machines. We choose the Ubuntu (14.04/15.0) Linux as the program development platform because it supports the most complete ARM toolchains. Among the ARM virtual machines, we choose the emulated ARM Versatilepb board under QEMU because it supports many commonly used peripheral devices found in real ARM based systems. Then it shows how to use the ARM toolchain to develop programs for execution on the ARM Versatilepb virtual machine by a series of programming examples. It explains the function call convention in C and shows how to interface assembly code with C programs. Then it develops a simple UART driver for I/O on serial ports, and an LCD driver for displaying both graphic images and text. It also shows the development of a generic printf() function for formatted printing to output devices that support the basic print-char operation.

List of Sample Programs
C2.1: ARM assembly programming
C2.2: Sum of integer array in assembly
C2.3: Call assembly function from C
C2.4: Call C function from assembly
C2.5: UART driver
C2.6: LCD driver for displaying images
C2.7: LCD driver for displaying text

Write a GPIO control program in both assembly and C to perform the following tasks.
(1). Program the GPIO pins as speciﬁed in Sect. 2.8.1.
(2). Determine the state of the GPIO pins.
(3). Modify the control program to make the LED blink while the input switch is closed.
5. The example program C2.6 assumes that every image size is height<=480 and width<=640 pixels. Modify the program to handle BMP images of larger sizes by
(1). Cropping: display at most 480x640 pixels.
(2). Shrinking: reduce the image size by a factor, e.g. 2, but keep the same 4:3 aspect ratio.
6. Modify the example program C2.6 to display a sequence of slightly different images to do animation.
7. Many image ﬁles, e.g. JPG images, are compressed and must be uncompressed ﬁrst. Modify the example program C2.6 to display (compressed) images of different formats, e.g. JPG image ﬁles.
8. In the LCD display driver program C2.7, deﬁne tab_size = 8. Each tab key (\t) expands to 8 spaces. Modify the LCD driver to support tab keys. Test the modiﬁed LCD driver by including \t in printf() calls.
9. In the LCD driver of program C2.7, scroll-up by one line is implemented by simply copying the entire frame buffer. Devise a more efﬁcient way to implement the scroll operation. HINT: The display memory may be regarded as a circular buffer. To scroll up one line, simply increment the frame buffer pointer by the line size.
10. Modify the example program C2.7 to display text with different fonts.
11. Modify the example program C2.8 to implement long jump by using the code of Sect. 2.7.3.3. Use UART0 to get user inputs but display outputs to the LCD. Verify that long jump works.
12. In the LCD driver, the generic printf() function is deﬁned as
int printf(char *fmt, …); // note the 3 dots
In the implementation of printf(), it assumes that all the parameters are adjacent on the stack, so that they can be accessed linearly. This seems to be inconsistent with the calling convention of ARM C, which passes the ﬁrst 4 parameters in r0–r3 and extra parameters, if any, on the stack. Compile the printf() function code to generate an assembly code ﬁle. Examine the assembly code to verify that the parameters are indeed adjacent on the stack.

In every computer system, the CPU is designed to continually execute instructions. An exception is an event recognized by the CPU, which diverts the CPU from its normal executions to do something else, called exception processing. An interrupt is an external event, which diverts the CPU from its normal executions to do interrupt processing. In a broader sense, interrupts are special kinds of exceptions. The only difference between exceptions and interrupts is that the former may originate from the CPU itself but the latter always originate from external sources. Interrupts are essential to every computer system. Without interrupts, a computer system would be unable to respond to external events, such as user inputs, timer events and requests for service from I/O devices, etc. Most embedded systems are designed to respond to external events and handle such events when they occur. For this reason, interrupts and interrupts processing are especially important to embedded systems. In this chapter, we shall discuss exceptions, interrupts and interrupts processing in ARM based systems.

In Chap. 2, we developed simple drivers for the LCD display and UARTs. The LCD is a memory mapped device, which does not use interrupts. UARTs support interrupts, but the simple UART driver uses polling, not interrupts, for I/O. The main disadvantage of I/O by polling is that it does not use the CPU efﬁciently. While the CPU is doing I/O by polling, it is constantly busy and can't do anything else. In a computer system, I/O should be done by interrupts whenever possible. In this chapter, we shall show how to apply the principles of interrupts processing to design and implement interrupt-driven device drivers.

3.1 ARM Exceptions

3.1.1 ARM Processor Modes

The ARM processor has seven different operating modes, which are determined by the 5 mode bits [4:0] in the current processor status register (CPSR) (ARM Architecture 2016; ARM Processor Architecture 2016). Table 3.1 shows the seven modes of the ARM processor.

Among the seven modes, only the User mode is non-privileged. All other modes are privileged. An unusual feature of the ARM architecture is that, while the CPU is in a privileged mode, it can change to any other mode by simply altering the mode bits in the CPSR. When the CPU is in the un-privileged User mode, the only way to change to a privileged mode is through exceptions, interrupts, or the SWI instruction. Each privileged mode has its own banked registers, except System mode, which shares the same set of registers with the User mode, e.g. they have the same stack pointer (R13) and the same link register (R14).

3.1.2 ARM Exceptions

An exception is an event recognized by the processor, which diverts the processor from its normal executions to handle the exception. In a general sense, interrupts are also exceptions. In ARM, there are seven exception types (excluding the Reserved type) (ARM Processor Architecture 2016), which are shown in Table 3.2. When an exception occurs, the ARM processor does the following.

Table 3.1 ARM Processor Modes

Table 3.2 ARM Exceptions

(1). Copy the CPSR into the SPSR of the mode in which the exception is to be handled.
(2). Change the CPSR mode bits to the appropriate mode, map in the banked registers and disable interrupts. IRQ is always disabled; FIQ is disabled only when an FIQ occurs and on reset.
(3). Set the LR_mode register to the return address.
(4). Set the Program Counter (PC) to the vector address of the exception. This forces a branch to the appropriate exception handler.
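The four steps above can be modeled in plain C as a conceptual sketch (this is not real hardware code; the struct, function name and the IRQ-style return-address offset are our assumptions, and the exact offset varies per exception type, as Table 3.4 later shows).

```c
/* Conceptual model of ARM exception entry: save CPSR to the target
   mode's SPSR, switch mode bits and mask IRQ, record the return
   address in the mode's LR, and load the PC with the vector address.
   Mode numbers follow the CPSR encoding (e.g. SVC = 0x13, IRQ = 0x12). */
typedef unsigned int u32;
#define I_BIT 0x80                  /* IRQ mask bit in the CPSR */

typedef struct {
  u32 cpsr, pc;
  u32 spsr[32], lr[32];             /* banked per mode, indexed by mode bits */
} Cpu;

void take_exception(Cpu *c, u32 mode, u32 vector)
{
  c->spsr[mode] = c->cpsr;          /* (1) copy CPSR into SPSR_mode       */
  c->cpsr = mode | I_BIT;           /* (2) switch mode, disable IRQ       */
  c->lr[mode] = c->pc + 4;          /* (3) return address (IRQ-style +4)  */
  c->pc = vector;                   /* (4) branch to the exception vector */
}
```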

If multiple exceptions occur simultaneously, they are handled in accordance with the priority order shown in Table 3.2. The following is a list of the exception events and how they are handled by the ARM processor.

• A Reset event occurs when the processor is powering up. This is the highest priority event and shall be taken whenever it is signaled. Upon entry to the reset handler, the CPSR is in SVC mode and both IRQ and FIQ interrupts are masked out. The task of the reset handler is to initialize the system. This includes setting up the stacks of the various modes, conﬁguring the memory and initializing device drivers, etc.
• A Data Abort (DAB) event occurs when the memory controller or MMU indicates that an invalid memory address has been accessed. For example, if there is no physical memory for an address, or the processor does not have access permission to a region of memory, the data abort exception is raised. Data aborts have the second highest priority. This means that the processor will handle data abort exceptions ﬁrst, before handling any interrupts.
• An FIQ interrupt occurs when an external peripheral asserts the (active low) nFIQ pin. An FIQ interrupt is the highest priority interrupt. Upon entry to the FIQ handler, both IRQ and FIQ interrupts are disabled. This means that while handling an FIQ interrupt, no other interrupts can occur unless they are explicitly enabled by the software. In ARM based systems, FIQ is usually used to handle interrupts from a single interrupt source of extreme urgency. Allowing multiple FIQ sources would defeat the purpose of FIQ.

• An IRQ interrupt occurs when an external peripheral device asserts the IRQ pin. An IRQ interrupt is the second highest priority interrupt. The processor will handle an IRQ interrupt if there is no FIQ interrupt or data abort exception. Upon entry to the IRQ handler, IRQ interrupts are masked out. The CPSR's I-bit should remain set until the current interrupt source has been cleared.
• A Pre-fetch Abort (PFA) event occurs when an attempt to load an instruction results in a memory fault. The exception occurs if the instruction reaches the execution stage of the pipeline and none of the higher exceptions/interrupts have been raised. Upon entry to the PFA handler, IRQ is disabled but FIQ remains enabled, so that any FIQ interrupt will be taken immediately while processing a PFA exception.
• A SWI interrupt occurs when the SWI instruction has been fetched and decoded successfully and none of the other higher priority exceptions/interrupts have been raised. Upon entry to the SWI handler, the CPSR is set to SVC mode. SWI interrupts are usually used to implement system calls from User mode to an OS kernel in SVC mode.
• An Undeﬁned Instruction event occurs when an instruction not in the ARM/Thumb instruction set has been fetched and decoded successfully, and none of the other exceptions/interrupts have been flagged. In an ARM based system with coprocessors, the ARM processor will poll the coprocessors to see if they can handle the instruction. If no coprocessor claims the instruction, then an undeﬁned instruction exception is raised. SWI and Undeﬁned Instruction have the same priority since they cannot occur at the same time. In other words, the instruction being executed cannot be both a SWI and an undeﬁned instruction. In practice, undeﬁned instructions can be used to provide software breakpoints when debugging ARM programs.

3.1.3 Exceptions Vector Table

The ARM processor uses a vector table to handle exceptions and interrupts. Table 3.3 shows the ARM vector table contents. The vector table deﬁnes the entry points of the exception and interrupt handlers. The vector table is located at the physical address 0. Many ARM based systems begin execution from a flash memory or ROM, which may be remapped to 0xFFFF0000 during booting. If the initial vector table is not at address 0 in SRAM, it must be copied to SRAM before being remapped to 0x00000000. This is normally done during system initialization.

3.1.4 Exception Handlers

Each vector table entry contains an ARM instruction (B, BL or LDR), which causes the processor to load the PC with the entry address of an exception handler routine. The following code segment shows the typical vector table contents, in which each LDR instruction loads the PC with the entry address of an exception handler function. For the reserved vector entry (0x14), a branch-to-itself loop is sufﬁcient since the exception can never occur.

When a vector uses LDR to load the handler entry address, the handler is called indirectly. LDR must load a constant located within 4 KB of the vector table, but it can branch to a full 32-bit address. A B (branch) instruction goes directly to the handler, but it can only branch within the range of its 24-bit offset ﬁeld. Since the FIQ vector is the last entry in the vector table, the FIQ handler code can be placed at the FIQ vector location directly, allowing the FIQ handler to be executed quickly.

3.1.5 Return from Exception Handlers

Assume that exception/interrupt handlers are entered by LDR PC instructions, as in

LDR PC, handler_entry_address

Upon entry to an exception/interrupt handler routine, the processor automatically stores the return address in the link register r14 of the current mode. Due to the instruction pipeline in the ARM processor, the return address stored in the link register includes an offset, which must be subtracted from the link register in order to return to the correct location prior to the exception or interrupt. Table 3.4 shows the program counter offsets of the different exceptions.

For Data Abort, the return address is PC-8, which points to the original instruction that caused the exception. For interrupts and Prefetch Abort, the return address is PC-4 since the instruction to return to is at the current PC-4. For SWI and Undeﬁned instruction, the return address is the current LR because the ARM processor has already executed the SWI or undeﬁned instruction. To most beginners, the non-uniform treatment of the IRQ and SWI return addresses is often a source of confusion. The 'S' sufﬁx at the end of the MOV instruction speciﬁes that, if the destination registers involve loading the PC, the CPSR shall also be restored from the saved SPSR. A typical method of returning from an interrupt handler is to execute the following instruction at the end of the interrupt handler

Table 3.4 Program Counter Offsets


SUBS pc, r14_irq, #4

which loads the PC with r14_irq – 4, assuming that r14_irq has not been altered in the interrupt handler. Alternatively, the link register can be adjusted at the beginning of the interrupt handler, as in

SUB lr, lr, #4

<handler code> MOVS pc, lr

A more widely used method is as follows.

SUB lr, lr, #4 // subtract 4 from LR

The interrupt handler ﬁrst subtracts 4 from the link register, saves it in the stack, and then executes the handler code. When the handler ﬁnishes, it returns to the interrupted point by the LDMFD instruction with the ^ symbol, which loads the PC with the saved LR and restores the SPSR, causing a return to the previous mode prior to the interrupt. Instead of subtracting 4 from the link register manually, an interrupt handler may be written in C with the interrupt attribute, as in

void __attribute__((interrupt)) handler() { /* actual handler code */ }

In that case, the compiler generated code will adjust the link register automatically.

3.2 Interrupts and Interrupts Processing

3.2.1 Interrupt Types

The ARM processor accepts only two external interrupt requests, FIQ and IRQ. Both are level-sensitive active low signals to the processor. For an interrupt to be accepted by the CPU, the appropriate interrupt mask (I or F) bit in the CPSR must be cleared to 0. FIQ has a higher priority than IRQ, so FIQ will be handled ﬁrst when multiple interrupts occur. Handling an FIQ causes IRQs and subsequent FIQs to be disabled, preventing them from being taken until after the FIQ handler exits or explicitly enables them. This is usually done by restoring the CPSR from the SPSR at the end of the FIQ handler.

The FIQ vector is the last entry in the vector table. The FIQ handler code can be placed directly at the vector location and run sequentially from that address. This avoids a branch instruction and its associated delay. If the system has a cache memory, the vector table and FIQ handler might all be locked down in one block within the cache. This is important because FIQ is designed to handle the interrupt as quickly as possible. Each privileged mode has its own banked registers r13, r14 and SPSR. The FIQ mode has 5 extra banked registers (r8–r12), which can be used to hold information between calls to the FIQ handler, further increasing the execution speed of the FIQ handler.

3.2.2 Interrupt Controllers

An ARM based system usually supports only one FIQ interrupt source, but it may support many IRQ requests from different sources. In order to support multiple IRQ interrupts, an interrupt controller is necessary, which sorts out the different IRQ sources and presents only one IRQ request to the CPU. Most ARM boards include a Vectored Interrupt Controller (VIC), which is either the ARM PL190 or PL192 (ARM PL190, PL192 2016). A VIC provides the following functions.

• Prioritize the interrupt sources
• Support vectored interrupts

3.2.2.1 ARM PL190/192 Interrupt Controller

Figure 3.1 shows the block diagram of the PL190 VIC. It supports 16 vectored interrupts. The PL192 VIC is similar, but it supports 32 vectored interrupts.

3.2.2.2 Vectored and Non-vectored IRQs

The VIC takes all the interrupt requests from different sources and arranges them into three categories: FIQ, vectored IRQ and non-vectored IRQ. FIQ has the highest priority. In the case of the PL190 VIC, it takes 16 interrupt requests and prioritizes them within sixteen vectored IRQ slots, denoted by IRQ0-IRQ15. Each vectored interrupt has a vector address, denoted by VectAddr0-VectAddr15.

3.2.2.3 Interrupt Priorities

Among the vectored interrupts, IRQ0 has the highest priority and IRQ15 has the lowest priority. Non-vectored IRQs have the lowest priority. The VIC ORs the requests from both the vectored and non-vectored IRQs to generate the IRQ signal to the ARM core, which is shown as the nVICIRQ line in Fig. 3.1.

Fig. 3.1 ARM PL190 VIC


3.2.3 Primary and Secondary Interrupt Controllers

Some ARM boards may contain more than one VIC. For example, the ARM926EJ-S board has two VICs, a primary VIC (PIC) and a secondary VIC (SIC), which are shown in Fig. 3.2. Most inputs to the PIC are dedicated to high priority interrupts, such as timers, GPIO and UARTs. Low priority interrupt sources, such as USB, Ethernet, keyboard and mouse, are fed to the SIC. Some of these interrupts may be routed to the PIC at IRQs 21 to 26. Lower priority interrupts, such as the touch-screen, keyboard and mouse, are collectively routed to IRQ 31 of the PIC.

Fig. 3.2 VICs in ARM926EJ-S Board


3.3 Interrupt Processing

3.3.1 Vector Table Contents

Interrupt vectors are in the exception vector table. Each interrupt vector location contains an instruction which loads the PC with the entry address of an interrupt handler. For FIQ and IRQ interrupts, the vector contents are

1. LR_irq = address of the next instruction to be executed + 4.

3.3.3 Interrupts Control in Software

When discussing interrupts and interrupt processing, there are some commonly used terms which warrant clarification.

3.3.3.1 Enable/Disable Interrupts

Each device has a control register or, in some cases, a separate interrupt control register, which can be programmed to either allow or disallow the device to generate interrupt requests. If a device is to use interrupts, the device interrupt control register must be configured with interrupts enabled. If needed, device interrupts can be disabled explicitly. Thus, the terms enable/disable interrupts should be applied only to devices.

3.3.3.2 Interrupt Masking

When a device raises an interrupt to the CPU, the CPU may either accept or not accept the interrupt immediately, depending on the interrupt masking bits in the CPU's status register. For an IRQ interrupt, the ARM CPU accepts the interrupt if the I bit in the CPSR register is 0, meaning that the CPU has IRQ interrupts unmasked, or masked in. It does not accept the interrupt while the CPSR's I bit is 1, meaning that the CPU has IRQ interrupts masked out. Masked-out interrupts are not lost. They are kept pending until the CPSR's I bit is changed to 0, at which time the CPU will accept the interrupt. Thus, when applied to the CPU, enable/disable interrupts really means mask-in/mask-out interrupts. In most of the literature these terms are used interchangeably, but the reader should be aware of their differences.

3.3.3.3 Clear Device Interrupt Request

When the CPU accepts an IRQ interrupt, it starts to execute the interrupt handler for that device. At the end of the interrupt handler it must clear the interrupt request, which causes the device to drop its interrupt request, allowing it to generate the next interrupt. This is usually done by accessing some of the device interface registers. For example, reading the data register of an input device clears the device interrupt request. For some output devices, it may be necessary to disable the device interrupt explicitly when there is no more data to output.

3.3.3.4 Send EOI to Vectored Interrupt Controller

In a system with multiple interrupt sources, a Vectored Interrupt Controller (VIC) is usually used to prioritize the device interrupts, each with a dedicated vector address. At the end of handling the current interrupt, the interrupt handler must inform the VIC that it has finished processing the current interrupt (of the highest priority), allowing the VIC to re-prioritize pending interrupt requests. This is referred to as sending an End-of-Interrupt (EOI) to the interrupt controller. For the ARM PL190, this is done by writing an arbitrary value to the VIC's vector address register at base + 0x30. The ARM processor has a simple way to enable/disable (unmask/mask) interrupts while in privileged mode. The following code segments show how to enable/disable IRQ interrupts of the ARM processor.

To enable IRQ interrupts, first copy CPSR into a working register and clear the I bit (bit 7) in the working register to 0. Then copy the updated register back to CPSR, which enables IRQ interrupts. Similarly, setting CPSR's I bit disables IRQ interrupts. Similar code segments can be used to enable/disable FIQ interrupts (by clearing/setting bit 6 of CPSR).
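The sequence described above can be sketched in ARM assembly as follows (the labels int_on and int_off are illustrative names, not from the text):

```asm
int_on:                     @ enable (unmask) IRQ interrupts
        MRS     r0, cpsr        @ copy CPSR into a working register
        BIC     r0, r0, #0x80   @ clear the I bit (bit 7)
        MSR     cpsr, r0        @ write back: IRQ now unmasked
        MOV     pc, lr

int_off:                    @ disable (mask) IRQ interrupts
        MRS     r0, cpsr
        ORR     r0, r0, #0x80   @ set the I bit
        MSR     cpsr, r0
        MOV     pc, lr
```

Replacing the constant 0x80 with 0x40 gives the corresponding FIQ enable/disable routines.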

3.3.4 Interrupt Handlers

Interrupt handlers are also known as Interrupt Service Routines (ISRs). Interrupt handlers can be classiﬁed into threedifferent types.

3.3.4.1 Interrupt Handler Types

• Non-nested interrupt handler: handles one interrupt at a time. Interrupts are not enabled until execution of the current ISR has finished.
• Nested interrupt handler: while inside an ISR, enables IRQ to allow IRQ interrupts of higher priorities to occur. This implies that while executing the current ISR, it may be interrupted to execute another ISR of higher priority.
• Re-entrant interrupt handler: enables IRQ as soon as possible, allowing the same ISR to be executed again.

3.3.5 Non-nested Interrupt Handler

When the ARM CPU accepts an IRQ interrupt, it enters the IRQ mode with IRQ interrupts masked out, preventing it from accepting other IRQ interrupts. The CPU sets PC to point to the IRQ entry in the vector table and executes that instruction. The instruction loads PC with the entry address of the interrupt handler, causing execution to enter the interrupt handler. The interrupt handler first saves the execution context before the interrupt. Then it determines the interrupt source and calls the appropriate ISR. After servicing the interrupt, it restores the saved context and sets the PC to point back to the instruction that was about to execute at the time of the interruption. Then it returns to the original place of the interruption. The simplest interrupt handler serves only one interrupt at a time. While executing the interrupt handler, IRQ interrupts are masked out until control is returned to the interrupted point. The algorithm and control flow of a non-nested interrupt handler are as follows.

The following code segment shows the organization of a simple IRQ interrupt handler. It assumes that an IRQ mode stack has been set up properly, which is typically done in the reset handler during system initialization.

SUB lr, lr, #4
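A minimal sketch of such a handler, assuming a C-level dispatcher named IRQ_handler() (the name used later in this chapter), might be organized as:

```asm
irq_handler:
        SUB     lr, lr, #4              @ adjust return address
        STMFD   sp!, {r0-r3, r12, lr}   @ save context on the IRQ stack
        BL      IRQ_handler             @ call the C dispatcher
        LDMFD   sp!, {r0-r3, r12, pc}^  @ restore context; ^ restores CPSR from SPSR
```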

The first instruction adjusts the link register (r14) for return to the interrupted point. The STMFD instruction saves the context at the point of interruption by pushing the CPU registers that must be preserved onto the stack. The time taken to execute a STMFD or LDMFD instruction is proportional to the number of registers being transferred. To reduce interrupt processing latency, a minimum number of registers should be saved. When writing an ISR in a high-level programming language, such as C, it is important to know the calling convention of the compiler-generated code, as this will affect the decision on which registers should be saved on the stack. For instance, ARM compiler-generated code preserves r4-r11 across function calls, so there is no need to save these registers unless they are going to be used by the interrupt handler. Once the registers have been saved, it is safe to call C functions to process the interrupt. At the end of the interrupt handler the LDMFD instruction restores the saved context and returns from the interrupt handler. The '^' symbol at the end of the LDMFD instruction means that the CPSR will be restored from the saved SPSR. As noted in Chap. 2 on ARM instructions, the '^' restores the saved SPSR only if the PC is loaded at the same time. Otherwise, it only restores the banked registers of the previous mode, excluding the saved SPSR. This special feature can be used to access User mode registers while in a privileged mode. The organization of the simple interrupt handler is suitable for handling both FIQ and IRQ interrupts one at a time without interrupt nesting. After saving the execution context, the interrupt handler must determine the interrupt source. In a simple ARM system that does not use vectored interrupts, the interrupt source is in the interrupt status register, denoted by IRQstatus, located at a known (memory-mapped) address.
To determine the interrupt source, simply read the IRQstatus register and scan the contents for any bit or bits that are set. Each nonzero bit stands for an active interrupt request. The interrupt handler may scan the bits in a specific order, which determines the interrupt processing priority in software. With the above background information on interrupts and interrupt processing, we are ready to write some real programs that use interrupts. In the following, we shall show how to write interrupt handlers for I/O devices. Device drivers using interrupts are called interrupt-driven device drivers. Specifically, we shall show how to implement interrupt-driven drivers for the timer, keyboard, UART and Secure Digital Card (SDC).
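The bit-scanning step can be sketched in C. The function name irq_source is illustrative; the status word is passed in as a plain value rather than read from a memory-mapped register:

```c
typedef unsigned int u32;

/* Scan a (copied) interrupt status word from bit 0 upward and return
   the lowest-numbered active IRQ, or -1 if none is pending. Scanning
   from bit 0 to bit 31 fixes the priority order in software. */
int irq_source(u32 irqstatus)
{
    for (int i = 0; i < 32; i++)
        if (irqstatus & (1u << i))
            return i;
    return -1;
}
```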

3.4 Timer Driver

3.4.1 ARM Versatile 926EJS Timers

The ARM Versatile 926EJS board contains two ARM SP804 dual-timer modules [ARM Timers 2004]. Each timer module contains two timers, which are driven by the same clock. The base addresses of the timers are at

Timer0: 0x101E2000, Timer1: 0x101E2020

Timer2: 0x101E3000, Timer3: 0x101E3020

Timer0 and Timer1 interrupt at IRQ4. Timer2 and Timer3 interrupt at IRQ5, both on the primary Vectored Interrupt Controller (VIC). To begin with, we shall not use vectored interrupts. Vectored interrupts will be discussed later. From a programming point of view, the most important timer registers are the control and counter registers. The following lists the meanings of the timer control register bits.
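As a sketch, the control register bits can be written as C macros. The bit assignments below are taken from the ARM SP804 dual-timer documentation, not from this text, and the macro and function names are illustrative:

```c
/* SP804 timer control register bits (per the ARM SP804 TRM) */
#define TIMER_EN       (1 << 7)  /* timer enable                 */
#define TIMER_PERIODIC (1 << 6)  /* periodic mode (vs free-run)  */
#define TIMER_INTEN    (1 << 5)  /* interrupt enable             */
#define TIMER_32BIT    (1 << 1)  /* 32-bit mode (vs 16-bit)      */
#define TIMER_ONESHOT  (1 << 0)  /* one-shot mode                */

/* A typical configuration: a 32-bit periodic timer with interrupts. */
unsigned int timer_config(void)
{
    return TIMER_EN | TIMER_PERIODIC | TIMER_INTEN | TIMER_32BIT;
}
```

On real hardware this value would be written to the timer's control register at its base address.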

When the program starts, QEMU loads the executable image, t.bin, to 0x10000. Upon entry to ts.s, execution begins from the label reset_handler. First, it sets the SVC mode stack pointer and calls copy_vector() in C to copy the vectors to address 0. It switches to IRQ mode to set up the IRQ stack pointer. Then it switches back to SVC mode with IRQ interrupts enabled and calls main() in C. The main program normally runs in SVC mode. It enters IRQ mode only to handle interrupts. Since we are not using FIQ, nor trying to deal with any exceptions at this moment, all other exception handlers (in the exceptions.c file) are while(1) loops.

int timer_clearInterrupt(int n) // clear interrupt of timer n: 0, 1, etc.

Figure 3.3 shows the LCD screen when running the program C3.1. Each timer displays a wall clock at the top-right corner of the screen. In the LCD driver, the scroll-up limit is set to a line below the logo and the wall clocks, so that they will not be affected during scroll-up. The wall clocks are updated every second. As exercises, the reader may change the starting values of the wall clocks to display local time in different time zones, or change the timer counter values to generate timer interrupts at different frequencies.

3.5 Keyboard Driver

3.5.1 ARM PL050 Mouse-Keyboard Interface

The ARM Versatile board includes an ARM PL050 Mouse-Keyboard Interface (MKI) which provides support for a mouse and a PS/2 compatible keyboard [ARM PL050 MKI 1999]. The keyboard's base address is at 0x10006000. It has several 32-bit registers, which are at offsets from the base address.

3.5.2 Keyboard Driver

In this section, we shall develop a simple interrupt-driven driver for the ARM Versatile keyboard. In order to use interrupts, the keyboard's control register must be initialized to 0x14, i.e. bit 2 enables the keyboard and bit 4 enables Rx (input) interrupts. The keyboard interrupts at IRQ3 on the Secondary VIC, which is routed to IRQ31 on the Primary VIC. Instead of ASCII code, the keyboard generates scan codes. A complete listing of scan codes is included in the keyboard driver. Translation of scan codes to ASCII is done by mapping tables in software. This allows the same keyboard to be used for different languages. For each key typed, the keyboard generates two interrupts: one when the key is pressed and another when the key is released. The scan code of a key release is 0x80 + the scan code of the key press, i.e. bit 7 is 0 for key press and 1 for key release. When the keyboard interrupts, the scan code is in the data register (at offset 0x08). The interrupt handler must read the data register to get the scan code, which also clears the keyboard interrupt. Some special keys generate escape key sequences, e.g. the UP arrow key generates 0xE048, where 0xE0 is the escape key itself. The following shows the mapping tables for translating scan codes into ASCII. The keyboard has 105 keys. Scan codes above 0x39 (57) are special keys, which cannot be mapped directly, so they are not shown in the key maps. Such special keys are recognized by the driver and handled accordingly. Figure 3.4 shows the key mapping tables.
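The press/release and table-lookup logic can be sketched in C. The table below is only a hypothetical excerpt of the first few entries of a standard PC scan-code map, and the names ltab and kbd_translate are illustrative:

```c
typedef unsigned char u8;

/* Hypothetical excerpt of a scan-code -> ASCII map; the driver's
   full table covers codes up to 0x39. Unmapped slots hold 0. */
static const char ltab[] = {
    0,  0, '1','2','3','4','5','6','7','8','9','0','-','=', 0, 0,
    'q','w','e','r','t','y','u','i','o','p'
};

/* Translate one scan code. Key releases (bit 7 set) and codes
   outside the mapped range yield 0, so the caller can ignore them. */
char kbd_translate(u8 scode)
{
    if (scode & 0x80)            /* key release = press code + 0x80 */
        return 0;
    if (scode >= sizeof(ltab))   /* special key: not mapped here    */
        return 0;
    return ltab[scode];
}
```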

3.5.3 Interrupt-Driven Driver Design

Every interrupt-driven device driver consists of three parts: a lower-half part, which is the interrupt handler; an upper-half part, which is called by the application program; and a common data area containing a buffer for data and control variables for synchronization, which are shared by the lower and upper parts. Figure 3.5 shows the organization of the keyboard driver. The top part of the figure shows kbd_init(), which initializes the KBD driver when the system starts. The middle part shows the control and data flow path from the KBD device to a program. The bottom part shows the lower-half, input buffer, and upper-half organization of the KBD driver. When the main program starts, it must initialize the keyboard driver control variables. When a key is pressed, the KBD generates an interrupt, causing the interrupt handler to be executed. The interrupt handler fetches the scan code from the KBD data port. For normal key presses, it translates the scan code into ASCII, enters the ASCII char into an input buffer, buf[N],

Fig. 3.4 Key mapping tables


Fig. 3.5 KBD driver organization

and notifies the upper half of the input char. When the program side needs an input char, it calls getc() of the upper-half driver, trying to get a char from buf[N]. The program waits if there is no char in buf[N]. The control variable, data, is used to synchronize the interrupt handler and the main program. The choice of the control variable depends on the synchronization tool used. The following shows the C code of a simple KBD driver. The driver handles only lower case keys. Extending the driver to handle upper case and special keys is left as an exercise in the Problem section.

3.5.4 Keyboard Driver Program

The sample program C3.2 demonstrates a simple interrupt-driven keyboard driver. It consists of the following components.

(1). t.ld file: The linker script file is the same as in C3.1.
(2). ts.s file: The ts.s file is almost the same as in C3.1, except that it adds the lock and unlock functions for enabling/disabling IRQ interrupts.

kbd_init(); // initialize keyboard driver

In order to use interrupts, the devices must be configured to generate interrupts. This is done in the device initialization code, in which each device is initialized with interrupts enabled. In addition, both the Primary and Secondary Interrupt Controllers (PIC and SIC) must be configured to enable the device interrupts. The keyboard interrupts at IRQ3 on the SIC, which is routed to IRQ31 on the PIC. To enable keyboard interrupts, both bit 3 of the SIC_INTENABLE and bit 31 of the VIC_INTENABLE registers must be set to 1. These are done in main(). After initializing the devices for interrupts, the main program executes a while(1) loop, in which it prompts for an input line from the KBD and prints the line to the LCD display.
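The two enable-bit writes can be sketched as follows. The registers are modeled as plain variables here; on real hardware they are the memory-mapped SIC_INTENABLE and VIC_INTENABLE registers, and the function name kbd_irq_enable is illustrative:

```c
typedef unsigned int u32;

/* Stand-ins for the memory-mapped interrupt-enable registers. */
static u32 vic_intenable, sic_intenable;

/* Enable keyboard interrupts: IRQ3 on the SIC, whose output is
   routed to IRQ31 on the PIC. */
void kbd_irq_enable(void)
{
    sic_intenable |= (1u << 3);   /* KBD at IRQ3 on the SIC      */
    vic_intenable |= (1u << 31);  /* SIC routed to IRQ31 on PIC  */
}
```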

irq_handler first subtracts 4 from the link register lr_irq. The adjusted lr is the correct return address to the point of interruption. It pushes registers r0-r12 and lr onto the IRQ stack. Then it calls IRQ_handler() in C. Upon return from IRQ_handler(), it pops the stack, which loads PC with the saved lr and also restores the SPSR, causing control to return to the original point of interruption. To speed up interrupt processing, irq_handler may save only the registers that must be preserved. Since our interrupt handler is written in C, not in assembly, it suffices to save only r0-r3, r12 and lr (the link register).

(5).2. IRQ_handler(): IRQ_handler() first reads the status registers of both the PIC and SIC. The interrupt handler must scan the bits of the status register to determine the interrupt source. Each bit = 1 in the status register represents an active interrupt. The scanning order should follow the interrupt priority order, i.e. from bit 0 to bit 31.

(5).3. kbd_handler(): The kbd_handler() reads the scan code from the KBD data register, which clears the KBD interrupt. It ignores any key releases, so the driver can only handle lower case keys and no special keys at all. For each key pressed, it prints a "kbd interrupt key" message to the LCD display. As noted before, we have adapted the generic printf() function for formatted printing to the LCD screen. Then, it maps the scan code to a (lower case) ASCII char and enters the char into the input buffer. The control variable, data, represents the number of chars in the input buffer.
(5).4. kgetc() and kgets() functions: kgetc() is for getting an input char from the keyboard. kgets() is for getting an input line ended with the \r key. The simple KBD driver is intended mainly to illustrate the design principle of an interrupt-driven input device driver. The driver's buffer and control variables form a critical region since they are accessed by both the main program and the KBD interrupt handler. When the interrupt handler executes, the main program is logically not executing, so the main program cannot interfere with the interrupt handler. However, while the main program executes, interrupts may occur, which divert the program to execute the interrupt handler, which may interfere with the main program. For this reason, when a program calls kgetc(), which may modify the shared variables in the driver, it must mask out interrupts to prevent keyboard interrupts from occurring. In kgetc(), the main program first enables interrupts, which is optional if the program is already running with interrupts enabled. Then it loops until the variable kp->data is nonzero, meaning that there are chars in the input buffer. Then it disables interrupts, gets a char from the input buffer and updates the shared variables. A code segment which ensures that shared variables can only be updated by one execution entity at a time is commonly known as a Critical Region or Critical Section (Silberschatz et al. 2009; Stallings 2011; Wang 2016). Finally, it enables interrupts and returns a char. On the ARM CPU it is not possible to mask out only the keyboard interrupts. The lock() operation masks out all IRQ interrupts, which is a little overkill but it gets the job done. Alternatively, we may write to the keyboard's control register to explicitly disable/enable keyboard interrupts. The disadvantage is that this must access memory-mapped locations, which is much slower than masking interrupts via the CPU's CPSR register.
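The kgetc() pattern can be sketched as a self-contained model. Here lock()/unlock() merely record the IRQ mask state instead of touching CPSR, the lower half is called directly rather than from a real interrupt, and the driver state is a global struct rather than the kp pointer of the real driver; all names besides kgetc, lock, unlock, buf and data are illustrative:

```c
#define N 128

typedef struct {
    char buf[N];
    int  head, tail;
    volatile int data;    /* chars currently in buf[]; shared variable */
} KBD;

static KBD kbd;
static int irq_masked;    /* models the CPSR I bit */

static void lock(void)   { irq_masked = 1; }   /* mask IRQ   */
static void unlock(void) { irq_masked = 0; }   /* unmask IRQ */

/* lower half: runs in interrupt context with IRQ already masked */
void kbd_put(char c)
{
    kbd.buf[kbd.head] = c;
    kbd.head = (kbd.head + 1) % N;
    kbd.data++;
}

/* upper half: called by the main program */
char kgetc(void)
{
    char c;
    unlock();                    /* ensure IRQ is unmasked        */
    while (kbd.data == 0)        /* busy-wait for the lower half  */
        ;
    lock();                      /* enter the critical region     */
    c = kbd.buf[kbd.tail];
    kbd.tail = (kbd.tail + 1) % N;
    kbd.data--;
    unlock();                    /* leave the critical region     */
    return c;
}
```

The lock()/unlock() pair around the buffer update is what makes the shared variables a critical region.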

Figure 3.6 shows the KBD driver using interrupts. As the figure shows, the main program only prints complete lines, but each input key generates an interrupt and prints a "kbd interrupt" message, along with the input char in ASCII.

3.6 UART Driver

3.6.1 ARM PL011 UART Interface

The ARM Versatile board supports four PL011 UART devices for serial I/O (ARM PL011 2005). Each UART device has a base address in the system memory map.

Fig. 3.6 KBD driver using interrupts

The first three UARTs, UART0 to UART2, are adjacent in the system memory map. They interrupt at IRQ12 to IRQ14 on the primary VIC. UART3 is located at 0x10009000 and it interrupts at IRQ6 on the SIC. In general, a UART must be initialized by the following steps.

0x4 = 115200, 0xC = 38400, 0x18 = 19200, 0x20 = 14400, 0x30 = 9600

(2). Write to the Line Control register to specify the number of bits per char and the parity, e.g. 8 bits per char with no parity.
(3). Write to the Interrupt Mask register to enable/disable RX and TX interrupts.

When using the emulated ARM Versatilepb board under QEMU, it seems that QEMU automatically uses default values for both the baud rate and line control parameters, making steps (1) and (2) either optional or unnecessary. In fact, it is observed that writing any value to the integer divisor register (0x24) would work, but the reader should be aware that this is not the norm for UARTs in real systems. In this case, we only need to program the Interrupt Mask register (if using interrupts) and check the Flag register during serial I/O.
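For a real PL011, the integer baud divisor is UARTCLK / (16 x baud rate). Assuming a 7.3728 MHz UART reference clock (an assumption, but one consistent with the divisor values listed above), the divisors can be computed as:

```c
/* PL011 integer baud-rate divisor: IBRD = UARTCLK / (16 * baud).
   The 7.3728 MHz reference clock is an assumed value consistent
   with the divisor table above. */
#define UARTCLK 7372800u

unsigned int uart_divisor(unsigned int baud)
{
    return UARTCLK / (16u * baud);
}
```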

3.6.2 UART Registers

Each UART interface contains several 32-bit registers. The most important UART registers are

Some of the UART registers were already explained in the program C2.3 of Chap. 2. Here, we shall focus on the registers that are related to interrupts. The ARM UART interface supports many kinds of interrupts. For simplicity, we shall only consider Rx (receive) and Tx (transmit) interrupts for data transfer and ignore such interrupts as modem status and error conditions. To allow UART Rx and Tx interrupts, bits 4 and 5 of the interrupt mask register (UARTIMSC) must be set to 1. When a UART interrupts, the masked interrupt status register (UARTMIS) contains the interrupt identification, e.g. bit 4 = 1 if it is an Rx interrupt and bit 5 = 1 if it is a Tx interrupt. Depending on the interrupt type, the interrupt handler can branch to a corresponding ISR to handle the interrupt.
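The MIS decoding step can be sketched in C; the status word is passed in as a copied value, and the enum and function names are illustrative:

```c
typedef unsigned int u32;

enum uart_irq { UART_NONE = 0, UART_RX = 1, UART_TX = 2 };

/* Decode a (copied) UARTMIS value: bit 4 = RX interrupt,
   bit 5 = TX interrupt. Both may be pending at once, so the
   result is a bit-OR of the two flags. */
int uart_irq_type(u32 mis)
{
    int t = UART_NONE;
    if (mis & (1u << 4)) t |= UART_RX;
    if (mis & (1u << 5)) t |= UART_TX;
    return t;
}
```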

3.6.3 Interrupt-Driven UART Driver Program

This section shows the design and implementation of an interrupt-driven UART driver for serial I/O. In order to keep the driver simple, we shall only use UART0 and UART1, but the same code is also applicable to the other UARTs. The UART driver program is denoted by C3.3, which is organized as follows.

3.6.3.1 The uart.c File

This file implements the UART driver using interrupts. Each UART is represented by a UART structure. It contains the UART base address, a unit number, an input buffer, an output buffer and control variables. Both buffers are circular, with head pointers for entering chars and tail pointers for removing chars. Among the control variables, data is the number of chars in the buffer and room is the number of empty spaces in the buffer. For outputs, txon is a flag indicating whether the UART is already in the transmission state. To use interrupts, the UART Interrupt Mask Set/Clear register (IMSC) must be set up properly. The UART supports many kinds of interrupts. For simplicity, we shall only consider TX (bit 5) and RX (bit 4) interrupts. In uart_init(), the C statement

*(up->base+IMSC) |= 0x30; // bits 4,5 = 1

sets bits 4 and 5 of IMSC to 1, which enables the TX and RX interrupts of the UART. For simplicity, we shall only use UART0 and UART1. UART0 interrupts at IRQ12 and UART1 interrupts at IRQ13, both on the primary VIC. In order to allow UART interrupts, bits 12 and 13 of the PIC must be set to 1. These are done in main() by the statements

VIC_INTENABLE |= (1<<12); // UART0 at bit12

VIC_INTENABLE |= (1<<13); // UART1 at bit13

A special feature of the ARM PL011 UART is that it supports FIFO buffers in hardware for both send and receive operations. It can be programmed to raise interrupts when the FIFO buffers are at different levels between FULL and EMPTY. In order to keep the UART driver simple, we shall not use the hardware FIFO buffers. They are disabled by the statement

*(up->base+CNTL) &= ~0x10; // disable UART FIFO

This makes the UART operate in single-char mode. Also, a TX interrupt is triggered only after writing a char to the data register. When the system has no more output for a UART, it must disable the TX interrupt. The following shows the uart.c driver code.

3.6.3.2 Explanations of the UART Driver Code

(1). uart_handler(UART *up): The UART interrupt handler reads the MIS register to determine the interrupt type. It is an RX interrupt if bit 4 of MIS is 1. It is a TX interrupt if bit 5 of MIS is 1. Depending on the interrupt type, it calls do_rx() or do_tx() to handle the interrupt.
(2). do_rx(): This handles an input interrupt. It reads the ASCII char from the data register, which clears the RX interrupt of the UART. Then it enters the char into the circular input buffer and increments the data variable by 1. The data variable represents the number of chars in the input buffer.
(3). do_tx(): This handles an output interrupt, which is triggered when transmission of the last char written to the output data register has finished. The handler checks whether there are any chars in the output buffer. If the output buffer is empty, it disables the UART TX interrupt and returns. If the TX interrupt were not disabled, it would cause an infinite sequence of TX interrupts. If the output buffer is not empty, it takes a char from the buffer and writes it to the data register for transmission.
(4). ugetc(): ugetc() is for the main program to get a char from a UART port. Its logic and synchronization with the RX interrupt handler are the same as kgetc() of the keyboard driver, so we shall not repeat them here.
(5). uputc(): uputc() is for the main program to output a char to a UART port. If the UART port is not transmitting (the txon flag is off), it writes the char to the data register, enables the TX interrupt and sets the txon flag. Otherwise, it enters the char into the output buffer, updates the data variable and returns. The TX interrupt handler outputs the chars from the output buffer on each successive interrupt.
(6). Formatted printing: uprintf(UART *up, char *fmt, …) is for formatted printing to a UART port. It is based on uputc().
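The interplay of uputc(), do_tx() and the txon flag can be sketched as a self-contained model. Writes to the data register are captured in a log instead of going to hardware, the interrupt-mask manipulation is omitted, and the UART structure is reduced to the output-side fields; field and helper names other than uputc, do_tx, txon and the buffer counters are illustrative:

```c
#define OBUF 64

typedef struct {
    char obuf[OBUF];
    int  ohead, otail;
    volatile int odata;   /* chars waiting in obuf            */
    volatile int txon;    /* 1 while a transmission is active */
} UART;

static UART uart0;

/* Simulated data register: writes are logged for inspection. */
static char txlog[256];
static int  txn;
static void write_dr(UART *up, char c) { (void)up; txlog[txn++] = c; }

void uputc(UART *up, char c)
{
    if (!up->txon) {              /* line idle: start transmitting  */
        write_dr(up, c);          /* (would also enable TX irq)     */
        up->txon = 1;
        return;
    }
    up->obuf[up->ohead] = c;      /* line busy: queue the char      */
    up->ohead = (up->ohead + 1) % OBUF;
    up->odata++;
}

void do_tx(UART *up)              /* TX interrupt: last char done   */
{
    if (up->odata == 0) {         /* nothing left: stop, else the   */
        up->txon = 0;             /* TX irq would fire forever      */
        return;
    }
    write_dr(up, up->obuf[up->otail]);
    up->otail = (up->otail + 1) % OBUF;
    up->odata--;
}
```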

3.6.3.3 Demonstration of KBD and UART Drivers

The t.c file contains the IRQ_handler() and the main() function. The main() function first initializes the UART devices and the VIC for interrupts. Then it tests the UART driver by issuing serial I/O on the UART terminals. For each IRQ interrupt, the IRQ_handler() determines the interrupt source and calls an appropriate handler to handle the interrupt. For clarity, UART-related code is shown in boldface lines.

Figure 3.7 shows the outputs of the interrupt-driven KBD and UART drivers. The figure shows that UART inputs are handled by RX interrupts and outputs by TX interrupts.

3.7 Secure Digital (SD) Cards

For most embedded systems, the primary mass storage devices are Secure Digital (SD) cards (SDC 2016) due to their compact size, low power consumption and compatibility with other kinds of mobile devices. Many embedded systems may not have any mass storage device to provide file system support, but they usually start up from either a flash memory card or an SD card. A good example is the Raspberry Pi (Raspberry_Pi 2016). It requires an SD card to boot up an operating system, which is usually a version of Linux, called Raspbian, adapted for the ARM architecture. Most ARM based systems include the ARM PrimeCell PL180/PL181 multimedia card interface (ARM PL180 1998; ARM PL181 2001) to provide support for

Fig. 3.7 Demonstration of KBD and UART drivers

both multimedia and SD cards. The emulated ARM Versatilepb virtual machine under QEMU also includes the PL180 multimedia interface, but it only supports SD cards.

3.7.1 SD Card Protocols

The simplest SD card protocol is the Serial Peripheral Interface (SPI) (SPI 2016). The SPI protocol requires the host machine to have an SPI port, which is available in many ARM based systems. For host machines without SPI ports, SD cards must use the native SD protocol [SD specification 2016], which is more capable and therefore more complex than the SPI protocol. QEMU's multimedia interface supports SD cards in native mode but not in SPI mode. For this reason, we shall develop an SD card driver that operates in the native SD mode.

3.7.2 SD Card Driver

The sample program C3.4 implements an interrupt-driven SD card driver. It demonstrates the SD driver by writing to sectors of an SD card and then reading the sectors back to verify the results. The program consists of the following components.

(1). sdc.h: This header file defines the PL180 Multi-Media Card (MMC) registers and bit masks. For the sake of brevity, we only show the PL180 registers.

In general, every command except CMD0 expects a response of a particular type. After sending a command, the responses are in the response registers, but they are ignored in the SD driver. It is observed that in the emulated PL180 MMC of QEMU, the Relative Card Address (RCA) assigned to the SDC is hard-coded as 0x4567. In fact, each CMD3 command increases the RCA by 0x4567. The reason for this rather peculiar behavior is probably a typo in the PL180 emulator of QEMU. It should set the RCA once by RCA = 0x4567, rather than RCA += 0x4567, which increments it by 0x4567 on each CMD3 command. After initialization, the driver may issue a CMD17 to read a block or a CMD24 to write a block. For the SDC, the default block (sector) size is 512 bytes. The SDC also supports reading/writing multiple sectors by CMD18 and CMD25, respectively. Data transfer may use one of three different schemes: polling, interrupts or DMA. DMA is suitable for transferring large amounts of data. In order to keep the driver code simple, the SDC driver only uses interrupts, not DMA, to read/write one sector of 512 bytes at a time. It will be extended to read/write multiple sectors later. SDC interrupts are enabled by setting bits in the Interrupt Mask register MASK0. In sdc_init(), the interrupt mask bits are set to RxDataAvail (bit 21) and TxBufEmpty (bit 18). The MMC will generate an SDC interrupt when either there are data in the input buffer or there is room in the output buffer. It disables and ignores other kinds of SDC interrupts, e.g. interrupts due to error conditions.

(2). get_sector(int sector, char *buf): get_sector() is for reading a sector of 512 bytes from the SDC. The algorithm of get_sector() is as follows.
• Set global rxbuf = buf and rxdone = 0 for the Rx interrupt handler to use;
• Set DataTimer to the default and DataLength to 512;
• Set DataCntl to 0x93 (block size = 2**9, respR1, SDC to host, and enable);
• Send CMD17 with argument = sector*512 (byte address on the SDC);
• (Busy) wait for the Rx interrupt handler to finish reading the data.

Upon receiving a CMD17, the PL180 MultiMedia Controller (MMC) starts to transfer data from the SDC to its internal input buffer. The MMC has a 16 x 32-bit FIFO input data buffer. When data become available, it generates an Rx interrupt, causing sdc_handler() to be executed, which actually transfers the data from the MMC to rxbuf. After sending CMD17, the main program busily waits for a volatile rxdone flag, which will be set by the interrupt handler when the data transfer completes.

(3). put_sector(int sector, char *buf): put_sector() is for writing a block of data to the SDC. The algorithm of put_sector() is as follows.
• Set global txbuf = buf and txdone = 0 for the Tx interrupt handler to use;
• Set DataTimer to the default and DataLength to 512;
• Send CMD24 with argument = sector*512 (byte address on the SDC);
• Set DataCntl to 0x91 (block size = 2**9, respR1, host to SDC, enable);
• (Busy) wait for the Tx interrupt handler to finish writing the data.

Upon receiving a CMD24, the PL180 MMC starts to transfer data. The MMC has a 16 x 32-bit FIFO output data buffer. If the Tx buffer is empty, it generates an SDC interrupt, causing sdc_handler() to be executed, which actually transfers the data from buf to the MMC. After sending CMD24, the main program busily waits for a volatile txdone flag, which will be set by the interrupt handler when the data transfer completes.
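The DataCntl values 0x93 and 0x91 used above can be derived from the register's field layout. The layout assumed here (bit 0 = enable, bit 1 = direction with 1 meaning card-to-host, bits 7:4 = log2 of the block size) follows the ARM PL180 documentation, and the function name data_ctrl is illustrative:

```c
/* Assumed PL180 DataCtrl fields: bit 0 = Enable,
   bit 1 = Direction (1 = card-to-host),
   bits 7:4 = log2 of the block size. */
unsigned int data_ctrl(int log2_blksize, int card_to_host)
{
    unsigned int v = 1u;                   /* enable               */
    if (card_to_host) v |= 2u;             /* read direction       */
    v |= (unsigned int)log2_blksize << 4;  /* block size = 2^n     */
    return v;
}
```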

(4). sdc_handler(): This is the SDC interrupt handler. It first checks the status register to determine the interrupt source. If it is an RxDataAvail interrupt (bit 21 is set), it transfers data from the MMC controller to rxbuf by a loop.

while (!err && rxcount) {

Barring any errors, each iteration of the loop reads a u32 (4 bytes) from the MMC's FIFO input buffer and decrements rxcount by 4 until rxcount reaches 0. Then it sets the rxdone flag to 1, allowing the main program in get_sector() to continue. If the SDC interrupt is a Tx interrupt (bit 18 is set), it writes data from txbuf to the MMC's FIFO by a loop.

Barring any errors, each iteration of the loop writes a u32 (4 bytes) to the MMC's FIFO output buffer and decrements txcount by 4 until txcount reaches 0. Then it sets the txdone flag to 1, allowing the main program in put_sector() to continue.
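The Rx side of the loop can be sketched as a self-contained model, with the MMC FIFO register replaced by a simulated array and the error check reduced to a local flag; rxbuf32, fifo_sim and read_fifo are illustrative names standing in for the driver's rxbuf and the memory-mapped FIFO register:

```c
typedef unsigned int u32;

/* Simulated MMC FIFO register: successive reads return successive
   words, as repeated reads of the real FIFO register would. */
static u32 fifo_sim[16];
static int fifo_idx;
static u32 read_fifo(void) { return fifo_sim[fifo_idx++ % 16]; }

static u32 *rxbuf32;          /* set up by get_sector()          */
static int  rxcount;          /* bytes still expected            */
static volatile int rxdone;

/* Rx side of the SDC interrupt handler: drain the FIFO one u32
   (4 bytes) at a time until the whole transfer is done. */
void do_rx(void)
{
    int err = 0;
    while (!err && rxcount) {
        *rxbuf32++ = read_fifo();
        rxcount -= 4;
    }
    if (rxcount == 0)
        rxdone = 1;           /* release the waiting get_sector() */
}
```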

(5). t.c file: The t.c file is the same as in C3.3, except for the added code for SDC initialization and testing. For clarity, the modified lines of t.c are shown in boldface letters.

for (sector=0; sector < N; sector++){

The ARM Versatile user manual specifies that the MMCI0 interrupts at IRQ22 on both the VIC and SIC. However, in the PL180 emulated by QEMU, it actually interrupts at IRQ22 of the SIC, which is routed to IRQ31 of the VIC. The reason for this discrepancy is unknown. Other than this minor discrepancy, the emulated PL180 works as expected. Figure 3.8 shows the outputs of running the SDC driver program C3.4.

3.7.3 Improved SDC Driver

In the SDC driver, the interrupt handler performs all the data transfer on a single interrupt. Since data transfer from the MMC to the SDC may be slow, the interrupt handler must execute the data transfer loop many times while waiting for the MMC to become ready to provide or accept data. The drawback of this scheme is that it is essentially the same as, or even worse than, I/O by polling. In general, an interrupt handler should complete as soon as possible. Any excessive checking and waiting inside an interrupt handler must be avoided or eliminated. It is therefore desirable to minimize the number of interrupts and maximize the amount of data transferred on each interrupt. This leads us to an improved SDC driver, in which we program the MMC to generate interrupts only when the Rx FIFO is full or the Tx FIFO is empty, and transfer 16 u32 data items on each interrupt. The following code segments show the improved SDC driver, in which the modifications are shown in boldface.
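The effect of the improved scheme can be sketched in simulation: each (simulated) Rx-full interrupt drains a whole 16-word FIFO, so the byte count drops by 64 per interrupt and a 512-byte sector takes only 8 interrupts. The function names and the counting are illustration only, not the driver's actual code.

```c
typedef unsigned int u32;

static volatile int rxcount;
static int interrupts_taken;

/* one Rx-full interrupt: drain the entire 16-word FIFO */
static void rx_full_interrupt(u32 *dst, const u32 *fifo)
{
    interrupts_taken++;
    for (int i = 0; i < 16; i++)
        dst[i] = fifo[i];
    rxcount -= 64;                 /* 16 u32 = 64 bytes per IRQ */
}

/* read one 512-byte sector; returns the number of interrupts taken */
static int read_sector_sim(u32 *dst, const u32 *card)
{
    rxcount = 512;
    interrupts_taken = 0;
    while (rxcount > 0) {
        rx_full_interrupt(dst, card);  /* simulated Rx-full IRQ  */
        dst += 16; card += 16;
    }
    return interrupts_taken;       /* 512/64 = 8 interrupts      */
}
```

Compare this with the original driver, where the interrupt count is the same but each interrupt spins inside the handler moving one word at a time.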

Figure 3.9 shows the outputs of running the improved SDC driver program. As the figure shows, each interrupt transfers 16 4-byte data items, so that the byte transfer count decrements by 64 on each interrupt. As a further improvement, the reader may program the MMC to generate interrupts when the Rx FIFO is half-full and the Tx FIFO is half-empty. In that case, each interrupt can transfer 8 4-byte data items. This improves the data transfer rate at the expense of more interrupts, and hence more overhead due to interrupt processing.

3.7.4 Multi-sector Data Transfer

The above SD drivers transfer one sector (512 bytes) of data at a time. An embedded system may support file systems, which usually use a 1 or 4 KB file block size. In that case, it would be more efficient to transfer data from/to SD cards in multi-sector units that match the file block size. The following code segments show the modified SD driver, which transfers data in multi-sector units. To read multiple sectors, issue the command CMD18. To write multiple sectors, issue the command CMD25. In both cases, the data length is the file block size. For multi-sector data transfers, data transmission must be terminated by a stop transmission command, CMD12, which is issued in the interrupt handler when the byte count (rxcount or txcount) reaches 0. The modified lines of the driver are shown in boldface. In the code segment, FBLK_SIZE is defined as 4096. Each get_block()/put_block() call reads/writes a (file) block of 4 KB of data.
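The command sequence of a multi-sector read can be sketched as follows. This is a simulation of the protocol only: send_command() just records the last command, a byte array plays the card, and the CMD12 that the real handler issues when rxcount hits 0 is issued inline here.

```c
#include <string.h>

#define SECTOR_SIZE 512
#define FBLK_SIZE   4096          /* file block = 8 sectors        */
enum { CMD12 = 12, CMD18 = 18, CMD25 = 25 };

static int last_cmd;
static void send_command(int cmd, int arg) { (void)arg; last_cmd = cmd; }

static char card[FBLK_SIZE * 2];  /* fake SD card contents         */

/* read one file block (8 sectors) starting at 'sector' */
int get_block(int sector, char *buf)
{
    send_command(CMD18, sector * SECTOR_SIZE); /* start multi-read  */
    memcpy(buf, card + sector * SECTOR_SIZE, FBLK_SIZE);
    /* in the driver, the handler issues CMD12 when rxcount == 0 */
    send_command(CMD12, 0);                    /* stop transmission */
    return 0;
}
```

A put_block() would follow the same shape with CMD25 in place of CMD18.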

The reader may replace the SDC driver in the sample program C3.4 with the above code and test-run the program to verify multi-sector data transfers. The reader may also change FBLK_SIZE to suit other block sizes, which must be a multiple of the sector size (512).

3.8 Vectored Interrupts

So far, all the example programs use non-vectored interrupts. The disadvantage of non-vectored interrupts is that the interrupt handler must scan the interrupt status register for nonzero bits to determine the interrupt sources, which is time-consuming. In many other computer systems, such as Intel x86 based PCs, interrupts are vectored by hardware. In the vectored interrupt scheme, each interrupt is assigned a vector number determined by the interrupt priority. When an interrupt occurs, the CPU can get the vector number of the interrupt from the interrupt controller hardware and use it to invoke a corresponding interrupt service routine directly. The ARM PL190 Vectored Interrupt Controller (VIC) also has this capability. In this section, we show how to program the PL190 VIC for vectored interrupt processing.

3.8.1 ARM PL190 Vectored Interrupt Controller (VIC)

The PL190 VIC of the ARM Versatile/926EJ-S board supports vectored interrupts. The VIC technical manuals [ARM PL190 2004] contain the following information on how to program the VIC for vectored interrupts.

- VectorAddr Register (0x30): contains the ISR address of the current active IRQ. At the end of the current ISR, write a value to this register to clear the current interrupt. The PL192 VIC has the additional capability of prioritizing the IRQ sources by writing values 0-15 to the IntPriority registers. At the end of the current ISR, writing to the VectorAddr register allows the VIC to re-prioritize pending IRQs.
- DefaultVecAddr Register (0x34): contains the ISR address of a default interrupt, e.g. for any spurious interrupt.
- VectorAddress Registers [0-15] (0x100-0x13C): each of these registers contains the ISR address of one of IRQ0 to IRQ15. The PL192 has 32 VectorAddress registers for 32 ISRs.
- VectorControl Registers [0-15] (0x200-0x23C): each of these registers contains the interrupt source (bits 4-0) and an Enable bit (bit 5).

In order to use vectored interrupts, each device must be enabled for interrupts both at the device level and on the VIC. We demonstrate vectored interrupts with the sample program C3.5 for the following devices.

3.8.3 Vectored Interrupts Handlers

C3.5.2. Rewrite IRQ_handler() for Vectored Interrupts: When using vectored interrupts, every IRQ interrupt still comes to IRQ_handler() as usual. However, we must rewrite IRQ_handler() to use vectored interrupts. Upon entry to IRQ_handler(), we must first read the VectorAddr register to acknowledge the interrupt. Unlike the non-vectored interrupt case, there is no need to read the status registers to determine the interrupt source. Instead, we can get the address of the current IRQ handler from the VectorAddr register directly and simply invoke the handler by its entry address. In addition, the interrupt source can also be determined from the VectorStatus register. Upon return from the handler, send an EOI to the VIC by writing a (any) value to the VectorAddr register, allowing it to re-prioritize pending interrupt requests. The following shows the modified IRQ_handler() function for vectored interrupts.
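The setup-and-dispatch flow can be modeled in host-testable C. This is a sketch under simulation assumptions: plain arrays stand in for the VIC register file, an unsigned long holds the ISR address only so a host function pointer fits (the real registers are 32 bits wide), and timer0_isr and the IRQ assignment are hypothetical.

```c
typedef unsigned long reg_t;   /* wide enough for a host fn pointer */

static reg_t vect_addr[16];    /* VectorAddress[0-15], 0x100-0x13C  */
static reg_t vect_cntl[16];    /* VectorControl[0-15], 0x200-0x23C  */
static reg_t vector_addr;      /* VectorAddr at 0x30                */

static int timer0_hits;
static void timer0_isr(void) { timer0_hits++; }

/* install isr in priority slot n for interrupt source irq */
void vic_set_vector(int n, int irq, void (*isr)(void))
{
    vect_addr[n] = (reg_t)isr;    /* ISR entry address              */
    vect_cntl[n] = 0x20 | irq;    /* bit 5 = enable, bits 4-0 = src */
}

/* model of a vectored IRQ_handler(); slot 0 is assumed active */
void irq_dispatch(void)
{
    vector_addr = vect_addr[0];        /* read VectorAddr: ACK      */
    ((void (*)(void))vector_addr)();   /* invoke handler directly   */
    vector_addr = 0;                   /* write any value: EOI      */
}
```

On the real VIC, reading VectorAddr both acknowledges the interrupt and yields the highest-priority pending ISR address; no status-register scan is needed.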

has finished. This implies that interrupts can only be handled one at a time. The disadvantage of this scheme is that it may lead to interrupt priority inversion, in which processing a low priority interrupt may block or delay the processing of higher priority interrupts. Interrupt priority inversion may increase the system response time to interrupts, which is undesirable in embedded systems with critical timing requirements. To remedy this, embedded systems should allow nested interrupts. In the nested interrupt scheme, a higher priority interrupt may preempt the processing of lower priority interrupts, i.e. before the current interrupt handler finishes, it can accept and handle interrupts of higher priorities, thereby reducing interrupt processing latency and improving the system response to interrupts.

3.9.2 Nested Interrupts in ARM

The ARM processor is not designed to support nested interrupts efficiently, due to the following properties of the ARM processor architecture. When the ARM CPU accepts an IRQ interrupt, it switches to IRQ mode, which has its own banked registers lr_irq, sp_irq and spsr_irq. The CPU saves the return address (with a +4 offset) into lr_irq, saves the previous mode's CPSR into spsr_irq and enters the interrupt handler to handle the current interrupt. In order to support nested interrupts, the interrupt handler must unmask interrupts at some point to allow interrupts of higher priorities to occur. However, this creates two problems.

(1). Accepting another interrupt may corrupt the link register: Assume that, after enabling IRQ interrupts, the interrupt handler calls an ISR to handle the current interrupt, as in

When calling the ISR to handle the current interrupt, the link register lr_irq contains the return address to the label HERE. While executing the ISR, if another interrupt occurs, the CPU would re-enter irq_handler, which changes lr_irq to the return address of the new interrupt point. This corrupts the original link register lr_irq, causing the ISR to return to the wrong address when it finishes.

(2). Overwriting the saved CPSR: When the ARM processor accepts an interrupt, it saves the CPSR of the interrupted point in the (banked) SPSR_irq. The saved SPSR may be in USER or SVC mode if the interrupted code was executing in USER or SVC mode. While executing an ISR in IRQ mode, if another interrupt occurs, the CPU would overwrite SPSR_irq with the CPSR of IRQ mode, which would cause the first ISR to return to the wrong mode when it finishes.

It is fairly easy to deal with problem (2). Upon entry to the interrupt handler, but before enabling IRQ for further interrupts, we can save the SPSR into the IRQ mode stack. When the ISR finishes, we restore the saved SPSR from the stack. However, if we allow nested interrupts in IRQ mode, there is no way to overcome problem (1); it would cause an infinite loop, since every ISR would return to the beginning of the interrupt handler again. The only way to alleviate this problem is to steer the CPU away from IRQ mode. For this reason, ARM introduced the SYS mode, which is a privileged mode but has a different link register than IRQ mode. Assume that, before enabling further IRQ interrupts, we switch the CPU to SYS mode and call the ISR in SYS mode. If another interrupt occurs, it would alter the IRQ mode link register lr_irq but not the link register lr_sys in SYS mode. This allows the ISR to return to the correct address when it finishes. So the scheme for handling nested interrupts is as follows.
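Why the mode switch works can be shown with a tiny model of the banked link registers. This is not ARM code; it only simulates the rule that a newly accepted IRQ overwrites lr_irq and nothing else, so an ISR running in SYS mode keeps a usable return address while one running in IRQ mode does not.

```c
/* model of why the SYS-mode switch works: each mode has its own
   banked link register, and a new IRQ clobbers only lr_irq      */
enum mode { MODE_IRQ, MODE_SYS };

static unsigned lr_irq, lr_sys;       /* banked link registers    */

/* accepting an interrupt always overwrites lr_irq */
static void take_irq(unsigned return_pc) { lr_irq = return_pc; }

/* call an ISR in mode m that is itself interrupted once;
   returns the address the ISR would return to                    */
static unsigned run_isr_in(enum mode m, unsigned call_return_pc)
{
    if (m == MODE_IRQ)
        lr_irq = call_return_pc;      /* BL sets the mode's lr    */
    else
        lr_sys = call_return_pc;
    take_irq(0xBAD);                  /* a nested IRQ arrives     */
    return (m == MODE_IRQ) ? lr_irq : lr_sys;
}
```

In SYS mode the ISR still returns to its caller; in IRQ mode it would return to the bogus address left by the nested interrupt.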

Many ARM processors require 8-byte aligned stacks. When switching the CPU to SYS mode, it may be necessary to check and adjust the SYS mode stack for proper alignment first. Since the SYS mode stack begins at an 8-byte boundary, we may assume that the checking and adjustment are unnecessary. The following lists the irq_handler code, which implements the above algorithm.

ARM recommends handling nested interrupts in SYS mode (Nesting Interrupts 2011), but it can also be done in SVC mode, which is used in the demonstration program.

3.9.4 Demonstration of Nested Interrupts

C3.6.1. ts.s file: First, we show the modified irq_handler for nested interrupts. Instead of SYS mode, it handles nested interrupts in SVC mode. The get_cpsr() function returns the processor mode in the CPSR. It is used to display the current mode of the CPU.

C3.6.2. t.c file: The t.c file is the same as in C3.5, except for the added functions enterINT() and exitINT(). In the C3.5 program, which uses vectored interrupts, the interrupt priorities are (from high to low) Timer0, UART0, UART1, KBD. Without nested interrupts, each interrupt is processed from start to finish without preemption. With nested interrupts, processing a low priority interrupt may be preempted by a higher priority interrupt. In order to demonstrate this, we add the following code to the C3.6 program.

(1). In the irq_handler code, before enabling IRQ interrupts and calling the ISR for the current interrupt, we let the interrupt handler call enterINT(), which reads the VICstatus register to determine the interrupt source. If it is a KBD interrupt, it sets a volatile global inKBD flag to 1, clears a volatile global tcount to 0 and prints an enterKBD message. If it is a timer interrupt in the middle of handling a KBD interrupt, it increments tcount by 1.
(2). When an ISR returns to irq_handler, it calls exitINT(). If the current interrupt is from the KBD, it prints the tcount value and resets tcount to 0.

The tcount value represents the number of high priority timer interrupts serviced while executing the low priority KBD handler. The reader may uncomment the statements labeled (1) and (2) in the irq_handler code to verify the effect of nested (timer) interrupts. The following lists the code of enterINT() and exitINT().
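The bookkeeping done by enterINT()/exitINT() can be sketched as follows. This is a simulation: the interrupt source is passed in as an argument instead of being read from the VICstatus register, printing is omitted, and exitINT() returns tcount so the behavior is checkable.

```c
/* hypothetical sketch of the enterINT()/exitINT() bookkeeping */
enum { TIMER0_IRQ, KBD_IRQ };

static volatile int inKBD, tcount;

void enterINT(int src)
{
    if (src == KBD_IRQ) {          /* entering the slow KBD handler */
        inKBD = 1;
        tcount = 0;
    } else if (src == TIMER0_IRQ && inKBD) {
        tcount++;                  /* nested timer IRQ during KBD   */
    }
}

int exitINT(int src)
{
    int n = 0;
    if (src == KBD_IRQ) {          /* leaving the KBD handler       */
        n = tcount;                /* timer IRQs serviced meanwhile */
        inKBD = 0;
        tcount = 0;
    }
    return n;
}
```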

Figure 3.11 shows the outputs of running the example program C3.6, which demonstrates nested interrupts. As the figure shows, many timer interrupts may occur while handling a single KBD interrupt.

Fig. 3.11 Demonstration of nested interrupts

3.10 Nested Interrupts and Process Switch

The ARM scheme of handling nested interrupts in SYS or SVC mode works only if every interrupt handler returns to the original point of interruption, so that no process switch occurs after an interrupt. It would not work if an interrupt may cause a context switch to a different process. This is because part of the context belonging to the switched-out process still remains in the IRQ stack, which may be overwritten when the new process handles another interrupt. If this happens, the switched-out process would never be able to resume running, due to a corrupted or lost execution context. In order to prevent this, the IRQ stack contents must be transferred to the SVC stack of the switched-out process (and the IRQ stack pointer reset to prevent it from growing out of bounds). A possible way to avoid transferring stack contents is to allocate a separate IRQ stack for each process. But this would require a lot of memory space dedicated to processes as IRQ stacks, which basically nullifies the advantage of using a single IRQ stack for interrupt processing. Also, in the ARM architecture it is not possible to use the same memory area as both the SVC and IRQ stacks of a process, due to the separate stack pointers in SVC and IRQ modes. These necessitate transferring IRQ stack contents during context switch, which seems to be an inherent weakness of the ARM processor architecture in terms of multitasking.

3.11 Summary

Interrupts and interrupt processing are essential to embedded systems. This chapter covers exception and interrupt processing. It describes the operating modes of ARM processors, exception types and exception vectors. It explains the functions of interrupt controllers and the principles of interrupt processing in detail. Then it applies the principles of interrupt processing to the design and implementation of interrupt-driven device drivers, including drivers for timers, keyboard, UARTs and SD cards, and demonstrates the device drivers by example programs. It explains the advantages of vectored interrupts, shows how to configure the VIC for vectored interrupts, and demonstrates vectored interrupt processing. It also explains the principles of nested interrupts and demonstrates nested interrupt processing by example programs.

1. In the example program C3.1, the vector table is at the end of the ts.s file.
(1). In the ts.s file, comment out the line BL copy_vector. Recompile and run the program again. The program should NOT work. Explain why.

(2). Move the vector table to the beginning of the ts.s ﬁle, as in

vectors_start:   // vector table
vectors_end:
reset_handler:

Change the entry point in t.ld to vectors_start. Recompile and run the program again. It should also work. Explain why.

2. In the example program C3.2, the irq_handler saves all the registers in the IRQ stack. Modify it to save a MINIMUM number of registers. Figure out which registers must be saved to make the program still work.
3. Modify the KBD driver in program C3.2 to support uppercase letters and the special control keys Control-C and Control-D.
4. Modify the UART driver program C3.3 to support UART2 and UART3.
5. Modify the UART driver program C3.3 to support the internal FIFO buffers of the UART.
6. The ARM VIC interrupt controller assigns fixed IRQ priorities in the order of IRQ0 (highest) to IRQ31 (lowest). Vectored interrupts allow reordering of the IRQ priorities. In the example program C3.5, the priorities of vectored interrupts are assigned in their original order. Modify vectorInt_init() to assign different vectored interrupt priorities, e.g. in the order of KBD, UART1, UART0, timer0, from high to low. Test whether vectored interrupts still work. Discuss the implications of such an assignment of interrupt priorities.
7. The example program C3.6 handles nested interrupts in SVC mode. Modify it to handle nested interrupts in SYS mode, as suggested by ARM.
8. In the example program C3.6, which supports nested interrupts, there are two lines in the irq_handler:

LDR r1, =vectorAddr

LDR r0, [r1] // read VIC vectAddr to ACK interrupt

Comment out these lines to see what would happen, and explain why.

9. In embedded systems, the SD card is often used as a booting device for booting up an operating system. During booting, the booter code may read sectors from the SD card by polling, since it must wait for the data anyway. Instead of using interrupts, rewrite the SD driver using polling.
10. Rewrite the SD driver using DMA to transfer large amounts of data.

4.1 Program Structures of Embedded Systems

In the early days, most embedded systems were designed for specific applications. Such an embedded system usually consists of a microcontroller, which is used to monitor a few sensors and generate signals to control a few external devices, e.g. to turn on LEDs, or to activate relays and servo motors to control a robot. For this reason, the control programs of early embedded systems were also very simple; they were written in the form of either a super-loop or an event-driven program structure. However, as computing power and the demand for multi-functional systems have increased in recent years, embedded systems have undergone a tremendous leap in both applications and complexity. To cope with the ever increasing demands for extra functionality and the resulting system complexity, the super-loop and event-driven program structures are no longer adequate. Modern embedded systems need more powerful software. As of now, many embedded systems are in fact high-powered computing machines capable of running full-fledged operating systems. A good example is smart phones, which use ARM cores with gigabytes of internal memory and multi-gigabyte micro SD cards for storage, and run adapted versions of Linux, such as Android (Android 2016). The current trend in embedded systems design is clearly moving in the direction of developing multi-functional operating systems suitable for the mobile environment. In this chapter, we shall discuss the various program structures and programming models that are suitable for current and future embedded systems.

4.2 Super-Loop Model

A super-loop is a program structure composed of an infinite loop, with all the tasks of the system contained in the loop. The general form of a super-loop program is

After system initialization, the program executes an infinite loop, in which it checks the status of a system component, such as an input device. When the device indicates there is input data, the program collects the input, processes it and generates outputs in response. Then it repeats the loop.
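The general form just described can be sketched in a few lines of C. This is a simulation, not the book's C4.1 code: a scripted input string replaces the UART status and data registers, and the loop is allowed to exit when the script runs out (a real super-loop never returns).

```c
/* minimal super-loop sketch over a scripted "device" */
static const char *script = "abc";   /* pretend device input  */
static char output[8];
static int  outp;

static int  device_ready(void) { return *script != '\0'; }
static char device_input(void) { return *script++; }
static void device_output(char c) { output[outp++] = c; }

void super_loop(void)
{
    /* initialize(); */
    for (;;) {                        /* the infinite loop     */
        if (!device_ready())          /* poll the device status */
            break;                    /* (simulation only)      */
        char c = device_input();      /* collect the input      */
        device_output((char)(c - 'a' + 'A')); /* process: to uppercase */
    }
}
```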

4.2.1 Super-Loop Program Examples

We illustrate the super-loop program structure with examples. In the first example program, denoted by C4.1, we assume that an embedded system controls a UART for I/O. Our goal here is to develop a control program which continually checks whether there is any input from the UART port. Whenever a key is pressed, it gets the input key, processes the input and generates an output, e.g. turning on an LED, flipping a switch, etc. When running the program on an emulated ARM virtual machine, which does not have any LEDs or switches, we shall simply echo the input key to simulate processing the input and generating an output response. The program checks a UART for inputs. For each alphabetic key in lowercase, it converts the key to uppercase and displays it to the UART port. In addition, it also handles the return key by outputting a newline char to produce the right visual effect. The program's assembly code is the same as in Chap. 3, so we only show the C code of the program.

When running the C4.1 program on an ARM virtual machine under QEMU, it echoes each alphabetic key to UART0 in uppercase. Instead of a single device, the program can be generalized to monitor and control several devices, all in the same loop. We demonstrate this technique with the next program, C4.2, which monitors and controls two devices, a UART and a keyboard.

Example Program C4.2: The program monitors and controls 2 devices in a super-loop.

When running the program C4.2, it echoes UART0 inputs in uppercase, and keyboard inputs in lowercase, all to UART0.
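The two-device variant can be sketched in the same simulated style as before. Scripted strings stand in for the UART and KBD data registers, and the loop exits when both scripts are exhausted; all names here are illustration only, not the C4.2 code.

```c
#include <ctype.h>

/* two scripted devices polled in the same super-loop */
static const char *uart_in = "hi";   /* pretend UART input */
static const char *kbd_in  = "OK";   /* pretend KBD input  */
static char out[16];
static int  outn;

static void emit(char c) { out[outn++] = c; }

void super_loop2(void)
{
    for (;;) {
        int busy = 0;
        if (*uart_in) {              /* poll device 1       */
            emit((char)toupper((unsigned char)*uart_in++));
            busy = 1;
        }
        if (*kbd_in) {               /* poll device 2       */
            emit((char)tolower((unsigned char)*kbd_in++));
            busy = 1;
        }
        if (!busy)                   /* simulation only: a  */
            break;                   /* real loop never ends */
    }
}
```

Note that the loop services both devices per pass: UART keys are echoed in uppercase and KBD keys in lowercase, as in C4.2.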

4.3 Event-Driven Model

4.3.1 Shortcomings of Super-Loop Programs

The drawback of the super-loop program model is that the program must continually check the status of each and every device, even if a device is not ready. This not only wastes CPU time but also causes excessive power consumption. In an embedded system, it is often more desirable to reduce power consumption than to increase CPU utilization. Rather than continually checking the status of every device, an alternative is to wait for the device to become ready. For example, in the program C4.1, we may replace the device-status-checking statements with a busy-wait loop, as in

while (device has no data);

But this does not remedy the problem, since the CPU still continually executes the busy-wait loop. Another drawback of this scheme is that the program would be unable to respond to any KBD inputs while it is waiting for UART inputs, and vice versa.

4.3.2 Events

In a programming environment, an event is something that is generated by a source and recognized by a recipient, causing the latter to take action to handle the event. Instead of continually checking for inputs, an embedded system can be designed to be event-driven, i.e. it takes action only in response to events. For this reason, event-driven systems (Cheong et al. 2003; Dunkels et al. 2006) are also called reactive systems. Events can be synchronous, i.e. they occur in a predictable manner, or asynchronous, i.e. they may occur at any time and in any order. Examples of synchronous events are periodic events from a timer, e.g. when the timer count has reached a certain value. Examples of asynchronous events are user inputs, such as pressing a key, clicking a mouse button or flipping a switch. Because of their unpredictability, the simple super-loop program structure is unsuited to dealing with asynchronous events. In the event-driven programming model, the main program may execute in a loop or sit in an idle state, waiting for any event to occur. When an event occurs, an event catcher recognizes the event and notifies the main program, causing it to take an appropriate action to handle the event. In an embedded system, events are usually associated with interrupts from hardware devices. In this case, an event-driven program becomes a simple

interrupt-driven program. We illustrate the interrupt-driven program structure with two examples. The first example handles periodic events and the second handles asynchronous events.

4.3.3 Periodic Event-Driven Program

In this example, we assume that an embedded system consists of a timer and a display device, e.g. an LCD. The timer is programmed to generate 60 interrupts per second. The system must react to the following periodic timing events: every second, it displays a wall clock in hh:mm:ss format on the LCD; every 5 s, it displays a message string, also on the LCD. There are two possible ways to implement a control program that meets these periodic timing requirements. If the tasks to be performed are short, they can be performed by the timer interrupt handler directly. In this case, after initialization, the main program can execute an idle loop. To reduce power consumption, the idle loop can use the Wait-For-Interrupt (WFI) or an equivalent instruction, which puts the CPU into a power-saving state, waiting for interrupts. The ARM926EJ-S board does not support the WFI instruction, but most ARM Cortex-5 processors (ARM Cortex-5 2010) implement a WFI mode by writing to the coprocessor CP15. The following shows the program code of the first version of Example C4.3. The tasks performed by the timer interrupt handler are shown in boldface.

In general, an interrupt handler should be as short as possible. If the periodic tasks are long, e.g. longer than a timer tick, it is undesirable to perform the timer-dependent tasks inside the interrupt handler, unless the system supports nested interrupts. As shown in Chap. 3, the ARM architecture cannot handle nested interrupts directly. In this case, it is better to let the main program perform all the tasks. As before, the main program executes in a loop. When there are no events from the timer, it enters a power-saving state, waiting for the next interrupt. When a timer event occurs, the timer interrupt handler simply sets a global volatile flag variable. After waking up from the power-saving state, the main program checks the flag variables and takes the appropriate actions. In the second version of the example program C4.3, the task of displaying the wall clock is rewritten as a function, which is executed by the main program every second. The following shows the second version of the C4.3 program.
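The structure of this second version can be sketched as follows. This is a simulation: the timer "interrupt" is a direct call, WFI is elided, and drawing the clock and message is reduced to incrementing counters; all names are hypothetical.

```c
/* handler only sets flags; the main loop does the work */
static volatile int onesec_flag, fivesec_flag;
static int tick, seconds;

static void timer_handler(void)        /* one call per tick     */
{
    if (++tick == 60) {                /* 60 ticks = 1 second   */
        tick = 0;
        seconds++;
        onesec_flag = 1;               /* wake the main loop    */
        if (seconds % 5 == 0)
            fivesec_flag = 1;
    }
}

static int clocks_drawn, msgs_drawn;   /* stand-ins for LCD work */

void main_loop(int nticks)             /* bounded for simulation */
{
    while (nticks--) {
        timer_handler();               /* simulated timer IRQ    */
        /* after WFI wakes up, check the event flags */
        if (onesec_flag)  { onesec_flag = 0;  clocks_drawn++; }
        if (fivesec_flag) { fivesec_flag = 0; msgs_drawn++;   }
    }
}
```

The handler stays trivially short; all timer-dependent work migrates to the main loop, exactly the division of labor the text argues for.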

Fig. 4.1 Periodic Event-Driven Program

4.3.4 Asynchronous Event-Driven Program

Asynchronous events are non-periodic in nature; they may occur at any time and in any order. The next program, denoted by C4.4, demonstrates non-periodic or asynchronous events. In this example, we assume that the program monitors and controls two input devices: a UART and a keyboard. The main program tries to get an input line from either the UART or the KBD and echoes the line. The program is a condensed version of the example program C3.3 of Chap. 3, which implements interrupt-driven UART and KBD drivers. First, it initializes the (volatile) global flag variables uline and kline to 0. Then it repeatedly checks the flag variables, which will be set by the interrupt handlers. When keys are pressed on the UART terminal or the keyboard, the UART or KBD interrupt handler gets the keys, echoes them and enters them into an input buffer. When an ENTER key is pressed on either device, the interrupt handler turns on the corresponding flag variable to signal the occurrence of an event, allowing the main program to continue. When the main program detects an event, it extracts a line from the device driver's input buffer and echoes it to the LCD (for KBD inputs) or the UART. Then it clears the flag variable and continues the loop. Due to the asynchronous nature of events, it is necessary to disable interrupts when clearing the flag variables, in order to prevent race conditions between the main program and the interrupt handlers.
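The flag protocol for one device can be sketched as follows. This is a simulation of the uline handshake only: the "interrupt handler" is a direct call, and int_off()/int_on() merely model disabling and re-enabling IRQ around the flag clear; all names are hypothetical, not the C4.4 code.

```c
#include <string.h>

static volatile int uline;         /* set by UART handler on ENTER */
static char ubuf[64];              /* driver's input line buffer   */
static int irq_disabled;           /* models the CPSR I-bit        */

static void int_off(void) { irq_disabled = 1; }
static void int_on(void)  { irq_disabled = 0; }

/* interrupt-handler side: a complete line has arrived */
static void uart_line_ready(const char *line)
{
    strcpy(ubuf, line);
    uline = 1;                     /* signal the event             */
}

/* main-program side: consume one line if the event occurred */
int main_check_uart(char *out)
{
    if (!uline)
        return 0;                  /* no event yet                 */
    strcpy(out, ubuf);             /* extract the line             */
    int_off();
    uline = 0;                     /* clear flag with IRQ disabled */
    int_on();
    return 1;
}
```

Clearing uline with interrupts disabled is what prevents the race in which the handler sets the flag for a new line just as the main program clears it.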

Fig. 4.2 Asynchronous Event-Driven Program


4.4 Event Priorities

In an event-driven system, some events may be more urgent than others. Events can be assigned different priorities according to their urgency and importance, and should be handled in their priority order. There are several ways to prioritize events. First, interrupt-related events can be assigned different priorities by an interrupt controller. The event-processing order of the main program should be consistent with the interrupt priorities. Second, event handlers can be implemented as independent execution entities, called processes or tasks, which can be scheduled to run by priority. We shall explain the process programming model in the next section. Here, we briefly justify the need for processes first. The example programs C4.3 and C4.4 have three main shortcomings. First, for fast response to interrupts, interrupt handlers should be as short as possible. This is especially important for timer interrupts, in order not to lose timer ticks or require nested interrupts. Even periodic events should be handled by the main program, not in the timer interrupt handler. Second, while in the power-saving state, the main program still needs to wake up and execute the loop on each and every interrupt, even though the awaited events, e.g. a complete input line, may not have occurred yet. It would be better if the program came out of the power-saving state only when needed; this way, it would not have to poll every event when it runs. Third, events are not limited to user inputs or device interrupts. They may originate from other execution entities in the system as a means of synchronization and communication. In order to accomplish all this, it is necessary to incorporate the notion of processes or tasks into the system. This leads us to the process model for embedded systems.

4.5 Process Models

In the process model, an embedded system comprises many concurrent processes. A process is an execution entity which can be scheduled to run, suspended from running (yielding the CPU to other processes), resumed to run again, etc. Each process is an independent execution unit designed to perform a specific task. Depending on the execution environment of the processes, the process model can be classified into several sub-models.

4.5.1 Uniprocessor Process Model

A uniprocessor (UP) system consists of only one CPU. In a UP system, processes run on the same CPU concurrently throughmultitasking.

4.5.2 Multiprocessor Process Model

A multiprocessor (MP) system consists of multiple CPUs, including multi-core processors. In an MP system, processes may run on different CPUs in parallel. In addition, each CPU or processor core may also run processes through multitasking. MP systems will be covered in Chap. 9.

4.5.3 Real Address Space Process Model

In the real address space model, the system either is not equipped with, or does not utilize, memory management hardware, e.g. due to timing constraints. Without memory management hardware to provide address mapping, all processes run in the same real address space as the system kernel. The drawback of this model is the lack of memory protection. Its main advantages are simplicity, lower hardware resource requirements and high efficiency.

4.5.4 Virtual Address Space Process Model

In the virtual address space model, the system uses memory management hardware to provide each process with a unique virtual address space through address mapping. Processes may run in either kernel mode or user mode. While in kernel mode, all processes share the same address space of the kernel. While in user mode, each process has a distinct virtual address space that is isolated and protected from other processes.

4.5.5 Static Process Model

In the static process model, all processes are created when the system starts, and they remain in the system permanently. Each process may be periodic or event-driven. Process scheduling is usually by static process priority without preemption, i.e. each process runs until it gives up the CPU voluntarily.

4.5.6 Dynamic Process Model

In the dynamic process model, processes can be created dynamically to perform specific tasks on demand. When a process completes its task, it terminates and releases all its resources back to the system for reuse.

4.5.7 Non-preemptive Process Model

In the non-preemptive process model, each process runs until it gives up the CPU voluntarily, e.g. when a process goes to sleep, suspends itself or explicitly yields the CPU to another process.

4.5.8 Preemptive Process Model

In the preemptive process model, the CPU can be taken away from a process to run another process at any time.

The above classifications of process models are not all mutually exclusive. Depending on the application, an embedded system may be designed as a mixture of appropriate process models. For instance, most existing embedded systems can be classified as one of the following types.

4.6 Uniprocessor (UP) Kernel Model

In this model, the system has only one CPU, with no memory management hardware for address mapping. All processes run in the same address space as the kernel. Processes can be either static or dynamic, and are scheduled by static priority without preemption. Most simple embedded systems fit this model. The resulting system is equivalent to the non-preemptive kernel of an operating system. We shall discuss this system model in more detail in Chap. 5.

4.7 Uniprocessor (UP) Operating System Model

This is an extension of the UP kernel model. In this model, the system uses memory management hardware to support address mapping, thus providing each process with a unique virtual address space. Each process runs in either kernel mode or user mode. While in kernel mode, all processes run in the same address space of the kernel. While in user mode, each process executes in a private address space, which is isolated and protected from other processes. Processes share common data objects only in the protected kernel space. While in kernel mode, a process runs until it gives up the CPU voluntarily without preemption. While in user mode, a process can be preempted to yield the CPU to another process of higher priority. Such a system is equivalent to a general purpose UP operating system. We shall discuss general purpose OS in more detail in Chap. 7.

4.8 Multiprocessor (MP) System Model

In this model, the system consists of multiple CPUs or processor cores, which share the same physical memory. In an MP system, processes may run on different CPUs in parallel. Compared with UP systems, MP systems require advanced concurrent programming techniques and tools for process synchronization and protection. We shall discuss MP systems in Chap. 9.

4.9 Real-Time (RT) System Model

In general, all embedded systems are designed with some timing requirements, such as quick responses to interrupts and short interrupt processing completion time, etc. An embedded system may only use these timing requirements as guidelines, but it does not guarantee that the requirements are always attainable. In contrast, an embedded system intended for real-time applications must meet very stringent timing requirements, such as guaranteed minimal response time to interrupts and completing every requested service within a prescribed time limit. This environment is equivalent to a real-time system. We shall discuss real-time embedded systems in Chap. 10.

4.10 Design Methodology of Embedded System Software

As embedded systems become ever more complex, the traditional ad hoc approach to embedded software design is no longer adequate. As a result, many formal design methodologies have been proposed for embedded system software design, which include the following.

4.10.1 High-Level Language Support for Event-Driven Programming

This design approach focuses on using the events and exceptions support of high-level programming languages, such as Java and C++, as a model to develop event-driven programs for embedded systems. A representative work in this area is the task model for event-driven programming (Fischer et al. 2007).

4.10.2 State Machine Model

This design method treats embedded system software as a finite state machine (FSM) (Edwards et al. 1997; Gajski et al. 1994). A finite state machine (FSM) is a system

FSM = {S, X, O, f}, where

S = a finite set of states,
X = a finite set of inputs,
O = a finite set of outputs,
f = a state transition function, which maps S x X into S x O.

For each pair of (state, input) = (s, x), f(s, x) = (s', o), where s' is the next state of s, and o is the output generated during the state transition. A FSM is fully specified if f(s, x) is defined for every pair of (s, x). A FSM is deterministic if, for every pair of (s, x), f(s, x) is unique. A FSM is of the Mealy model (Katz and Borriello 2005) if the outputs depend on both states and inputs. A FSM is of the Moore model if the outputs depend only on states.

The state machine design method models the specifications of embedded systems by fully specified and deterministic Mealy model FSMs, which allow for formal verification of the resulting systems. It also exploits programming language features to translate state machines into program code. We illustrate the state machine design model by an example.

Example Program C4.5: Assume that comment lines in C programs begin with two adjacent / symbols and end on the same line. Design an embedded system which takes C program source files as inputs and removes comment lines from the C programs. Design and implementation of such a system based on the FSM model consists of three steps.

Step 1: Construct a FSM State Table: The system can be modeled by a FSM with 5 states.

S0 = initial state, which has not seen any input
S1 = has not seen any / symbol yet
S2 = has seen the first / symbol
S3 = has seen 2 adjacent // symbols
S4 = final or termination state

Although each input is a single char, we shall classify the input chars into different cases, which are treated as distinct inputs to the system. Thus, we define the inputs as

x1 = '/'
x2 = '\n'
x3 = not in {'/', '\n', EOF}
x4 = EOF (end-of-file)

While the system is in a state, each input causes a state transition to a next state and generates an output (string). A FSM can be represented by a state table, which specifies the state transitions and outputs due to each input. For this example, the initial state table of the FSM is shown in Table 4.1, in which the output symbol - denotes the null string.

In the state table, S0 is the initial state, which represents the condition that the system has not seen any input yet, and S4 is the final or termination state, in which the system has completed its task and halts. The initial state table is constructed in accordance with the problem specification. Starting from the initial state S0, if the input is a '/', the system goes to state S2, which represents the condition that it has seen the first '/', and generates a null output string. This is because this '/' may be the start of a comment line; if so, it should not be emitted as part of the output. If the input is a '\n', it goes to state S1 and generates the output string "\n". If the input is not '/', '\n' or EOF, it goes to S1 also and generates an output string containing the same input char. If the input is EOF, it goes to the final state S4 with a null output string and terminates. While in state S2, if the input x is not a '/', '\n' or EOF, it goes back to S1 and generates the output string "/x". This is because a '/' followed by an ordinary char is not a comment line, so both chars must be part of the output. Other entries of the state table are constructed in a similar way.

Step 2: State Table Minimization: When constructing an initial state table from a problem specification, depending on how the states are defined as perceived by the system designer, the number of states may be more than actually needed. Thus, the initial state table may not be minimal. For instance, if we regard the final state S4 as a default condition for the system to terminate, then S4 is redundant and can be eliminated. An initial state table may also contain many states that are actually equivalent. The second step in the FSM design model is to minimize the state table by eliminating redundant and equivalent states. In order to do this, we first clarify what is meant by equivalent states.

(1) Equivalence relation: An equivalence relation R is a binary relation on a set of objects, which is

Reflexive: for any object x, x R x is true.
Symmetric: for any objects x and y, if x R y then y R x.
Transitive: for any objects x, y and z, if x R y and y R z then x R z.

For example, the = relation of real numbers is an equivalence relation. Similarly, for any integer N > 0, the modulo-N (% N) relation of nonnegative integers is also an equivalence relation.

(2) Equivalence classes: An equivalence relation can be used to partition (divide) a set into equivalence classes such that all the objects in the same class are equivalent. As a result, each equivalence class can be represented by a single object of that class.

Example: When applying the % 10 relation to the set of nonnegative integers, it partitions the set into the equivalence classes {0}–{9}. Each class {i} consists of all integers which yield a remainder of i when divided by 10. We may use 0–9 to represent the equivalence classes {0}–{9}.

(3) Equivalent states: In a FSM, two states Si and Sj are equivalent if, for every input x, their outputs are identical and their next states are equivalent.

Note that the definition of equivalent states only requires that, for each input, their outputs must be identical but not their next states, which only need to be equivalent. This seems to create a chicken-and-egg problem, but we can handle it easily, as will be shown shortly.

(4) State table minimization: This step tries to reduce the number of states in a FSM state table to a minimum. While it may be hard to find all the equivalent states in a state table directly, it is very easy to spot state pairs that cannot be equivalent based on their outputs. In elementary logic, we know that

"if A then B" is equivalent to "if [not B] then [not A]"

When trying to prove something like "if A then B", we may either attack from the front by showing that "if A is true, then B must be true", or from the rear by showing that "if B is not true, then A cannot be true". The strategy of attacking from the rear is a proof by contraposition, closely related to proof by contradiction, which is used in almost all proofs in computability theory in computer science. So, rather than trying to identify equivalent states in a state table, we shall use a strategy which tries to identify and eliminate all non-equivalent states. The scheme is implemented by an Implication Chart, which is a table containing all pairs of states in a state table. In the implication chart, each cell corresponds to a state-pair (Si, Sj). Since an implication chart is symmetrical and all diagonal cells (Si, Si) are obviously equivalent states, it suffices to show only the lower half of the chart, without the diagonal cells. The algorithm for identifying non-equivalent state-pairs in an implication chart is as follows.

(1) Use the outputs of the states to cross out any cell (Si, Sj) that cannot be equivalent.
(2) For each non-crossed-out cell (Si, Sj), examine their next state pairs (Si', Sj') under each input. Cross out the cell (Si, Sj) if any of their next state-pairs (Si', Sj') has been crossed out.
(3) Repeat (2) until there are no more state-pair cells that can be crossed out.

When the algorithm ends, each non-crossed-out cell (Si, Sj) identifies a pair of equivalent states. Then, use the transitive property of equivalent state-pairs to construct equivalence classes.

For our example, we first construct an implication chart and cross out all the cells of the state-pairs that are non-equivalent. For example, it is obvious that S0 and S2 cannot be equivalent since their outputs are not the same for every input. So we cross out the cell of (S0, S2). For the same reason, we can cross out the cell of (S0, S3). Likewise, we can cross out the cells of (S1, S2), (S1, S3) and (S2, S3). Table 4.2 shows the implication chart after applying (1) to cross out the cells of non-equivalent state-pairs.

Table 4.2 Initial Implication Chart of FSM

Table 4.3 Implication Chart of FSM

From the initial implication chart of Table 4.2, we fill each non-crossed-out cell with its next state pairs, as shown in the cell of (S0, S1) of Table 4.3. Then we apply step (2) of the algorithm, trying to cross out any cells containing state pairs that are already crossed out. In this case, there are none, so the algorithm terminates. The final implication chart of Table 4.3 reveals that (S0, S1) are equivalent states, which can be combined into a single state.

The process of identifying and eliminating equivalent states in state tables is known as the FSM minimization problem, which has been thoroughly studied in the design of finite state machines (Katz and Borriello 2005). It suffices to say that we can always reduce a fully specified and deterministic FSM state table to a minimal form, which is unique up to isomorphism (by renaming the states). Furthermore, the algorithm requires only polynomial computing time. For this example, the minimal state table is shown in Table 4.4, which has only 3 non-equivalent states.

Table 4.4 Minimal State Table of FSM

The state diagram of a FSM is a directed graph, in which each node represents a state and an arc from Si to Sj, denoted by Si->Sj, represents a state transition from state Si to state Sj. The arc is marked with all the input/output pairs that cause the state transition. State tables and state diagrams are equivalent in the sense that they convey exactly the same information. The reader may draw a state diagram for the state table shown in Table 4.4. This is left as an exercise.

Step 3: Translate State Table/State Diagram into Code: A state table or state diagram can be translated into C code almost directly. Using the switch-case statements of C, each state corresponds to a distinct case in an outer switch statement, and each input corresponds to a distinct case in an inner switch statement. We illustrate the translation by a complete C program, which simulates the intended embedded system.

The reader may compile and run the above C4.5 program under Linux on C source files that use // as comment lines. The outputs should show that it removes all comment lines from C source files. The reader may also consult Problem 4.2 to handle a minor design flaw of the program.

It is noted that, when translating a FSM state table or state diagram into code, the resulting C code may not be very pretty nor efficient (in terms of code size), but the translation process is almost mechanical and can be automated if needed. It makes the coding step an engineering endeavor rather than an art of programming. This is the strongest asset of the FSM model. However, the FSM model does have its limitations in that the number of states cannot be too large. Whereas it may be quite easy to handle problems with only a few states, it would be too difficult to manage state tables or state diagrams with hundreds of states. For this reason, it is impractical and nearly impossible to design and implement a complete operating system by the FSM model.

4.10.3 StateChart Model

The StateChart model (Franke B 2016) is based on the finite state machine model. It adds concurrency and communication among the concurrent execution entities. It is intended for modeling complex embedded systems with concurrent tasks. Since this model involves the advanced concepts of concurrent processes, process synchronization and inter-process communication, we shall not discuss it any further.

4.11 Summary

This chapter covers models of embedded systems. It explained the simple super-loop system model and pointed out its shortcomings. It discussed the event-driven model and demonstrated the periodic and asynchronous event-driven system models by example programs. Then it justified the need for processes or tasks in embedded systems, and discussed the various process models. Lastly, it introduced some of the formal design methodologies for embedded systems, and it illustrated the FSM model by a detailed design and implementation example.

PROBLEMS

1. In the example program C4.3, after initializing the system the main() function executes a while(1) loop:

(1) Comment out the asm line. Run the program again to see what would happen.
(2) In terms of CPU power consumption, what difference does the asm statement make?

2. In the Example Program C4.5, it is assumed that a comment line starts with 2 adjacent / symbols and extends to the end of the line. However, string constants enclosed in matched pairs of double quotes may contain any number of / symbols but are not comment lines, e.g. printf("this // is not a /// comment line\n"); Modify the state table of program C4.5 to handle this case. Translate the modified state table or state diagram into C code and run the modified program to test whether it works correctly.

3. Assume that comment blocks in C programs begin with /* and end with */. Nested comment blocks are not allowed and should result in an error. Write a C program which detects and removes comment blocks from C program source files.

(1) Model the program as a FSM and construct a state diagram for the FSM.
(2) Write C code to implement the FSM as an event-driven system.

5.1 Multitasking

In general, multitasking refers to the ability of performing several independent activities at the same time. For example, we often see people talking on their mobile phones while driving. In a sense, these people are doing multitasking, although a very dangerous kind. In computing, multitasking refers to the execution of several independent tasks at the same time. In a single CPU or uniprocessor (UP) system, only one task can execute at a time. Multitasking is achieved by multiplexing the CPU's execution time among different tasks, i.e. by switching the CPU from one task to another. If the switching is fast enough, it gives the illusion that all the tasks are executing simultaneously. This logical parallelism is called concurrency. In a multiprocessor (MP) system, tasks can execute on different CPUs in parallel in real time. In addition, each processor may also do multitasking by executing different tasks concurrently. Multitasking is the basis of all operating systems, as well as the foundation of concurrent programming in general. For simplicity, we shall consider uniprocessor (UP) systems first. MP systems will be covered later in Chap. 9 on Multiprocessor Systems.

5.2 The Process Concept

A multitasking system supports concurrent executions of many processes. The heart of a multitasking system is a control program, known as the operating system (OS) kernel, which provides functions for process management. In a multitasking system, processes are also called tasks. For all practical purposes, the terms process and task can be used interchangeably.

First, we define an execution image as a memory area containing the execution's code, data and stack. Formally, a process is the execution of an image. It is a sequence of executions regarded as a single entity by the OS kernel for using system resources. System resources include memory space, I/O devices and, most importantly, CPU time. In an OS kernel, each process is represented by a unique data structure, called the Process Control Block (PCB) or Task Control Block (TCB), etc. In this book, we shall simply call it the PROC structure. Like a personal record, which contains all the information of a person, a PROC structure contains all the information of a process. In a single CPU system, only one process can be executing at a time. The OS kernel usually uses a global PROC pointer, running or current, to point at the PROC that is currently executing. In a real OS, the PROC structure may contain many fields and be quite large. To begin with, we shall define a very simple PROC structure to represent processes.

typedef struct proc{
    struct proc *next;
    int  *ksp;
    int  kstack[1024];
}PROC;

In the PROC structure, the next field is a pointer pointing to the next PROC structure. It is used to maintain PROCs in dynamic data structures, such as linked lists and queues. The ksp field is the saved stack pointer of a process when it is not executing, and kstack is the execution stack of a process. As we expand the OS kernel, we shall add more fields to the PROC structure later.

5.3 Multitasking and Context Switch

5.3.1 A Simple Multitasking Program

We begin to demonstrate multitasking by a simple program, denoted by C5.1. It consists of a ts.s file in ARM assembly code and a t.c file in C.

(1). ts.s file: the ts.s file defines the program's entry point reset_handler, in which it
(1). sets the SVC stack pointer to the high end of proc0.kstack[ ];
(2). adds a tswitch() function in assembly code for task switching.

PROC proc0, *running; // proc0 structure and running pointer

Use the ARM toolchain (2016) to compile-link ts.s and t.c to generate a binary executable t.bin as usual. Then run t.bin on the Versatilepb VM (Versatilepb 2016) under QEMU, as in

qemu-system-arm -M versatilepb -m 128M -kernel t.bin

During booting, QEMU loads t.bin to 0x10000 and jumps there to execute the loaded image. When execution starts in ts.s, it sets the SVC mode stack pointer to the high end of proc0's kstack. This makes proc0's kstack area the initial stack. Up to this point, the system has no notion of any process because there is none. The assembly code then calls main() in C. When control enters main(), we have an image in execution. By the definition of a process, which is the execution of an image, we have a process in execution, although the system still does not know which process is executing. In main(), after setting running = &proc0, the system is now executing the process proc0. This is how a typical OS kernel starts to run an initial process when it begins: the initial process is handcrafted or created by brute force. Starting from main(), the run-time behavior of the program can be traced and explained by the execution diagram of Fig. 5.1, in which the key steps are labeled (1) to (6).

At (1), it lets running point to proc0, as shown on the right-hand side of Fig. 5.1. Since we assume that running always points at the PROC of the currently executing process, the system is now executing the process proc0.
At (2), it calls tswitch(), which loads LR (r14) with the return address and enters tswitch.
At (3), it executes the SAVE part of tswitch(), which saves the CPU registers into the stack and saves the stack pointer sp into proc0.ksp.
At (4), it calls scheduler(), which sets running to point at proc0 again. For now, this is redundant since running already points at proc0. Then it executes the RESUME part of tswitch().
At (5), it sets sp to proc0.ksp, which is again redundant since they are already the same. Then it pops the stack, which restores the saved CPU registers.
At (6), it executes MOV pc, lr at the end of RESUME, which returns to the calling place of tswitch().

Fig. 5.1 Execution diagram of proc0

116 5 Process Management in Embedded Systems

5.3.2 Context Switching

Besides printing a few messages, the program seems useless since it does practically nothing. However, it is the basis of all multitasking programs. To see this, assume that we have another PROC structure, proc1, which called tswitch() and executed the SAVE part of tswitch() before. Then proc1's ksp must point to its stack area, which contains saved CPU registers and a return address from where it called tswitch(), as shown in Fig. 5.2.

In scheduler(), if we let running point to proc1, as shown on the right-hand side of Fig. 5.2, the RESUME part of tswitch() would change sp to proc1's ksp. Then the RESUME code would operate on the stack of proc1. This would restore the saved registers of proc1, causing proc1 to resume execution from where it called tswitch() earlier. This changes the execution environment from proc0 to proc1.

Context Switching: Changing the execution environment of one process to that of another is called context switching, which is the basic mechanism of multitasking.

With context switching, we can create a multitasking environment containing many processes. In the next program, denoted by C5.2, we define NPROC = 5 PROC structures. Each PROC has a unique pid number for identification. The PROCs are initialized as follows.

running -> P0 -> P1 -> P2 -> P3 -> P4 -+
           ^                           |
           +---------------------------+

P0 is the initial running process. All the PROCs form a circular linked list for simple process scheduling. Each of the PROCs, P1 to P4, is initialized in such a way that it is ready to resume running from a body() function. Since the initialization of the PROC stack is crucial, we explain the steps in detail.

Although the processes never existed before, we may pretend that they not only existed before but also ran before. The reason why a PROC is not running now is because it called tswitch() to give up the CPU earlier. If so, the PROC's ksp must point to its stack area containing saved CPU registers and a return address, as shown in Fig. 5.3, where the index -i means SSIZE-i.

Since the PROC never really ran before, we may assume that its stack was initially empty, so that the return address, rPC=LR, is at the very bottom of the stack. What should the rPC be? It may point to any executable code, e.g. the entry address of a body() function. What about the "saved" registers? Since the PROC never ran before, the register values do not matter, so they can all be set to 0. Accordingly, we initialize each of the PROCs, P1 to P4, as shown in Fig. 5.4.

Fig. 5.2 Execution diagram of proc1

Fig. 5.3 Process stack contents


Fig. 5.4 Initial stack contents of process

With this setup, when a PROC becomes running, i.e. when running points to the PROC, it would execute the RESUME part of tswitch(),

LDMFD sp!, {r0-r12, lr}

MOV pc, lr

which restores the "saved" CPU registers, followed by MOV pc, lr, causing the process to execute the body() function.

After initialization, P0 calls tswitch() to switch process. In tswitch(), P0 saves the CPU registers into its own stack, saves the stack pointer in its PROC.ksp and calls scheduler(). We modify the scheduler() function by letting running point to the next PROC, i.e.

running = running->next;

So P0 switches to P1. P1 begins by executing the RESUME part of tswitch(), causing it to resume to the body() function. While in body(), the running process prints its pid and prompts for an input char. Then it calls tswitch() to switch to the next process, etc. Since the PROCs are in a circular linked list, they take turns to run. The following lists the assembly and C code of C5.2.

5.3.3 Demonstration of Multitasking

Figure 5.5 shows the outputs of running the C5.2 multitasking program. It uses the process pid to display in different colors, just for fun. Before continuing, it is worth noting the following.

(1). In the C5.2 multitasking program, none of the processes, P1 to P4, actually calls the body() function. What we have done is to convince each process that it called tswitch() from the entry address of body() to give up the CPU earlier, and that is where it shall resume to when it begins to run. Thus, we can fabricate an initial environment for each process to start in. The process has no choice but to obey. This is the power (and joy) of systems programming.

Fig. 5.5 Demonstration of multitasking


(2). All the processes, P1 to P4, execute the same body() function, but each executes in its own environment. For instance, while executing the body() function, each process has its own local variable c in the process stack. This shows the difference between processes and functions. A function is just a piece of passive code, which has no life. Processes are executions of functions, which make the function code alive.

(3). When a process first enters the body() function, the process stack is logically empty. As soon as execution starts, the process stack will grow (and shrink) by the function calling sequence, as described in Sect. 2.7.3.2 of Chap. 2.

(4). The per-process kstack size is defined as 4KB. This implies that the maximal length of the function call sequence (and the associated local variable spaces) of every process must never exceed the kstack size. Similar remarks also apply to other privileged mode stacks, e.g. the IRQ mode stack for interrupt processing. All of these are under the planning and control of the kernel designer, so stack overflow should never occur in kernel mode.

5.4 Dynamic Processes

In the program C5.2, P0 is the initial process, and all other processes are created statically by P0 in kernel_init(). In the next program, denoted by C5.3, we shall show how to create processes dynamically.

5.4.1 Dynamic Process Creation

(1). First, we add a status and a priority field to the PROC structure, and define the PROC linked lists freeList and readyQueue, which are explained below.

. freeList = a (singly) linked list containing all FREE PROCs. When the system starts, all PROCs are in the freeList initially. When creating a new process, we allocate a free PROC from the freeList. When a process terminates, we deallocate its PROC and release it back to the freeList for reuse.
. readyQueue = a priority queue of PROCs that are ready to run. PROCs with the same priority are ordered First-In-First-Out (FIFO) in the readyQueue.

(2). In the queue.c ﬁle, we implement the following functions for list and queue operations.

(3). In the kernel.c file, kernel_init() initializes the kernel data structures, such as the freeList and readyQueue. It also creates P0 as the initial running process. The function

int pid = kfork(int func, int priority)

creates a new process to execute a function func() with the specified priority. In the example program, every new process begins execution from the same body() function. When a task has completed its work, it may terminate by calling the function

void kexit()

which releases its PROC structure back to the freeList for reuse. The function scheduler() is for process scheduling. The following lists the C code of the kernel.c and t.c files.

Fig. 5.6 Demonstration of dynamic processes

In the t.c file, it first initializes the LCD display and the KBD driver. Then it initializes the kernel to run the initial process P0, which has the lowest priority 0. P0 creates a new process P1 and enters it into the readyQueue. Then P0 calls tswitch() to switch to run P1. Every new process resumes to execute the body() function. While a process runs, the user may enter 's' to switch process, 'f' to create a new process and 'x' to terminate, etc.

5.4.2 Demonstration of Dynamic Processes

Figure 5.6 shows the screen of running the C5.3 program. As the figure shows, an 'f' input causes P1 to kfork a new process P2 into the readyQueue. An 's' input causes P1 to switch to run P2, which resumes to execute the same body() function. While P2 runs, the reader may enter commands to let P2 switch process or kfork a new process, etc. While a process runs, an 'x' input causes the process to terminate.

5.5 Process Scheduling

5.5.1 Process Scheduling Terminology

In a multitasking operating system, there are usually many processes ready to run. The number of runnable processes is in general greater than the number of available CPUs. Process scheduling decides when, and on which CPU, to run the processes in order to achieve good overall system performance. Before discussing process scheduling, we first clarify the following terms, which are usually associated with process scheduling.

(1). I/O-bound vs. compute-bound processes:

A process is considered I/O-bound if it suspends itself frequently to wait for I/O operations. I/O-bound processes are usually from interactive users who expect fast response time. A process is considered compute-bound if it uses CPU time extensively. Compute-bound processes are usually associated with lengthy computations, such as compiling programs and numerical computations, etc.

(2). Response time vs. throughput:

Response time refers to how fast a system can respond to an event, such as a key entered from the keyboard. Throughput is the number of processes completed per unit time.

(3). Round-robin vs. dynamic priority scheduling:

In round-robin scheduling, processes take turns to run. In dynamic priority scheduling, each process has a priority, which changes dynamically (over time), and the system tries to run the process with the highest priority.

(4). Preemption vs. non-preemption:

Preemption means the CPU can be taken away from a running process at any time. Non-preemption means a process runs until it gives up the CPU by itself, e.g. when the process finishes, goes to sleep or becomes blocked.

(5). Real-time vs. time-sharing:

A real-time system must respond to external events, such as interrupts, within a minimum response time, often on the order of a few milliseconds. In addition, the system may also need to complete the processing of such events within a specified time limit. In a time-sharing system, each process runs with a guaranteed time slice so that all processes receive their fair share of CPU time.

5.5.2 Goals, Policy and Algorithms of Process Scheduling

Process scheduling is intended to achieve the following goals.

. high utilization of system resources, especially CPU time,

. fast response to interactive or real-time processes,

. guaranteed completion time of real-time processes,

. fairness to all processes for good throughput, etc.

It is easy to see that some of these goals conflict with one another. For example, fast response time and high throughput usually cannot be achieved at the same time. A scheduling policy is a set of rules by which a system tries to achieve all or some of the goals. For a general purpose operating system, the scheduling policy usually tries to achieve good overall system performance by striving for a balance among the conflicting goals. For embedded and real-time systems, the emphases are usually on fast response to external events and guaranteed process execution time. A scheduling algorithm is a set of methods that implements a scheduling policy. In an OS kernel, the various components, i.e. data structures and code used to implement the scheduling algorithm, are collectively known as the process scheduler. It is worth noting that in most OS kernels there is not a single piece of code or module that can be identified as the scheduler. The functions of a scheduler are implemented in many places inside the OS kernel, e.g. when a running process suspends itself or terminates, when a suspended process becomes runnable again and, most notably, in the timer interrupt handler.

5.5.3 Process Scheduling in Embedded Systems

In an embedded system, processes are created to perform specific tasks. Depending on the importance of its task, each process is assigned a priority, which is usually static. Processes run either periodically or in response to external events. The primary goal of process scheduling is to ensure quick response to external events and guarantee process execution time. Resource utilization and throughput are relatively unimportant. For these reasons, the process scheduling policy is usually based on process priority, or round-robin for processes with the same priority. In most simple embedded systems, processes usually execute in the same address space. In this case, the scheduling policy is usually non-preemptive. Each process runs until it gives up the CPU voluntarily, e.g. when the process goes to sleep, becomes suspended or explicitly yields control to another process. Preemptive scheduling is more complex for the following reasons. With preemption, many processes may run concurrently in the same address space. If a process is in the middle of modifying a shared data object, it must not be preempted unless the shared data object is protected in a critical region. Otherwise, the shared data object may be corrupted by other processes. Protection of critical regions will be discussed in the next section on process synchronization.

5.6 Process Synchronization

When multiple processes execute in the same address space, they may access and modify shared (global) data objects. Process synchronization refers to the rules and mechanisms used to ensure the integrity of shared data objects in a concurrent process environment. There are many kinds of process synchronization tools. For a detailed list of such tools, their implementation and usage, the reader may consult (Wang 2015). In the following, we shall discuss some simple synchronizing tools that are suitable for embedded systems. In addition to discussing the principles of process synchronization, we shall also show how to apply them to the design and implementation of embedded systems through example programs.

5.6.1 Sleep and Wakeup

The simplest mechanism for process synchronization is the sleep/wakeup operations, which are used in the original Unix kernel. When a process must wait for something, e.g. a resource, that is currently unavailable, it goes to sleep to suspend itself and give up the CPU, allowing the system to run other processes. When the needed resource becomes available, another process or an interrupt handler wakes up the sleeping processes, allowing them to continue. Assume that each PROC structure has an added event field. The algorithms of sleep/wakeup are as follows.

In order for the mechanism to work, sleep() and wakeup() must be implemented properly. First, each operation must be atomic (indivisible) from the process point of view. For instance, when a process executes sleep(), it must complete the sleep operation before anyone else tries to wake it up. In a non-preemptive UP kernel, only one process runs at a time, so processes cannot interfere with one another. However, while a process runs, it may be diverted to handle interrupts, which may interfere with the process. To ensure the atomicity of sleep and wakeup, it suffices to disable interrupts. Thus, we may implement sleep() and wakeup() as follows.

Note that wakeup() wakes up ALL processes, if any, that are sleeping on an event. If no process is sleeping on the event, wakeup has no effect, i.e. it amounts to a NOP and does nothing. It is also worth noting that interrupt handlers can never sleep or wait (Wang 2015). They can only issue wakeup calls to wake up sleeping processes.

5.6.2 Device Drivers Using Sleep/Wakeup

In Chap. 3, we developed several device drivers using interrupts. The organization of these device drivers exhibits a common pattern. Every interrupt-driven device driver consists of three parts: a lower-half part, which is the interrupt handler; an upper-half part, which is called by a main program; and a data area containing an I/O buffer and control variables, which are shared by the lower and upper parts. Even with interrupts, the main program still must use busy-wait loops to wait for data or room in the I/O buffer, which is essentially the same as polling. In a multitasking system, I/O by polling does not use the CPU effectively. In this section, we shall show how to use processes and sleep/wakeup to implement interrupt-driven device drivers without busy-wait loops.

5.6.2.1 Input Device Drivers

In Chap. 3, the KBD driver uses interrupts but the upper half uses polling. When a process needs an input key, it executes a busy-wait loop until the interrupt handler puts a key into the input buffer. Our goal here is to replace the busy-wait loop with sleep/wakeup. First, we show the original driver code, which uses polling. Then, we modify it to use sleep/wakeup for synchronization.

(1). KBD structure: The KBD structure is the middle part of the driver. It contains an input buffer and control variables, e.g. data = number of keys in the input buffer.

lock(); // disable IRQ interrupts

(2). kgetc(): We assume that the main program is now run as a process. When a process needs an input key, it calls kgetc(), trying to get a key from the input buffer. Without any means of synchronization, the process must rely on a busy-wait loop

while (kp->data == 0); // busy-wait for data;

which continually checks the data variable for any key in the input buffer.

(3). kbd_handler(): The interrupt handler is the lower-half of the KBD driver.

For each key press, the interrupt handler maps the scan code to a (lowercase) ASCII char, stores the char in the input buffer and updates the counting variable data. Again, without any means of synchronization, that's all the interrupt handler can do. For instance, it cannot notify the process of available keys directly. Consequently, the process must check for input keys by continually polling the driver's data variable. In a multitasking system, the busy-wait loop is undesirable. We can use sleep/wakeup to eliminate the busy-wait loop in the KBD driver as follows.

(1). KBD structure: no need to change.

(2). kgetc(): rewrite kgetc() to let the process sleep for data if there are no keys in the input buffer. In order to prevent race conditions between the process and the interrupt handler, the process disables interrupts first. Then it checks the data variable and modifies the input buffer with interrupts disabled, but it must enable interrupts before going to sleep. The modified kgetc() function is

(3). kbd_handler(): rewrite the KBD interrupt handler to wake up sleeping processes, if any, that are waiting for data. Since processes cannot interfere with the interrupt handler, there is no need to protect the data variables inside the interrupt handler.

5.6.2.2 Output Device Drivers

An output device driver also consists of three parts: a lower half, which is the interrupt handler; an upper half, which is called by processes to output data; and a middle part containing a data buffer and control variables, which are shared by the lower and upper halves. The major difference between an output device driver and an input device driver is that the roles of the process and the interrupt handler are reversed. In an output device driver, a process writes data to the data buffer. If the data buffer is full, the process goes to sleep to wait for room in the data buffer. The interrupt handler extracts data from the buffer and outputs them to the device. Then it wakes up any process that is sleeping for room in the data buffer. A second difference is that for most output devices the interrupt handler must explicitly disable the device interrupts when there are no more data to output. Otherwise, the device will keep generating interrupts, resulting in an infinite loop. The third difference is that it is usually acceptable for several processes to share the same output device, but an input device can only allow one active process at a time. Otherwise, processes may get random inputs from the same input device.

5.7 Event-Driven Embedded Systems Using Sleep/Wakeup

With dynamic process creation and process synchronization, we can implement event-driven multitasking systems without busy-wait loops. We demonstrate such a system by the example program C5.4.

Example Program C5.4: We assume that the system hardware consists of three devices: a timer, a UART and a keyboard. The system software consists of three processes, each of which controls a device. For convenience, we also include an LCD for displaying outputs from the timer and keyboard processes. Upon starting up, each process waits for a specific event. A process runs only when the awaited event has occurred. In this case, events are timer counts and I/O activities. At each second, the timer process displays a wall clock on the LCD. Whenever an input line is entered from the UART, the UART process gets the line and echoes it to the serial terminal. Similarly, whenever an input line is entered from the KBD, the KBD process gets the line and echoes it to the LCD. As before, the system runs on an emulated ARM virtual machine under QEMU. The system's startup sequence is identical to that of C5.3. We shall only show how to set up the system to run the required processes and their reactions to events. The system operates as follows.

(1). Initialization: copy vectors; configure the VIC and SIC for interrupts; run the initial process P0, which has the lowest priority 0; initialize the drivers for the LCD, timer, UART and KBD; start the timer.

(2). Create tasks: P0 calls kfork(NAME_task, priority) to create the timer, UART and KBD processes and enter them into the readyQueue. Each process executes its own NAME_task() function with a (static) priority, ranging from 3 to 1.

(3). Then P0 executes a while(1) loop, in which it switches process whenever the readyQueue is non-empty.

(4). Each process resumes to execute its own NAME_code() function, which is an infinite loop. Each process calls sleep(event) to sleep on a unique event value (the address of the device data structure).

(5). When an event occurs, the device interrupt handler calls wakeup(event) to wake up the corresponding process. Upon waking up, each process resumes running to handle the event. For example, the timer interrupt handler no longer displays the wall clock; that is performed by the timer process on each second.

5.7.1 Demonstration of Event-Driven Embedded System Using Sleep/Wakeup

Figure 5.7 shows the output screens of running the example program C5.4, which demonstrates an event-driven multitasking system. As Fig. 5.7 shows, the timer task displays a wall clock on the LCD on each second. The uart task prints a line to UART0 only when there is an input line from the UART0 port, and the kbd task prints a line to the LCD only when there is an input line from the keyboard. While these tasks are sleeping for their awaited events, the system runs the idle process P0, which is diverted to handle all the interrupts. As soon as a task is woken up and entered into the readyQueue, P0 switches process to run the newly awakened task.

Fig. 5.7 Event-driven multitasking system using sleep/wakeup

5.8 Resource Management Using Sleep/Wakeup

In addition to replacing busy-wait loops in device drivers, sleep/wakeup may also be used for general process synchronization. A typical usage of sleep/wakeup is resource management. A resource is something that can be used by only one process at a time, e.g. a memory region for updating, a printer, etc. Each resource is represented by a res_status variable, which is 0 if the resource is FREE, and nonzero if it is BUSY. Resource management consists of the following functions

int acquire_resource(); // acquire a resource for exclusive use

int release_resource(); // release a resource after use

When a process needs a resource, it calls acquire_resource(), trying to get the resource for exclusive use. In acquire_resource(), the process tests res_status first. If res_status is 0, the process sets it to 1 and returns OK for success. Otherwise, it goes to sleep, waiting for the resource to become FREE. While the resource is BUSY, any other process calling acquire_resource() would also go to sleep on the same event value. When the process that holds the resource calls release_resource(), it clears res_status to 0 and issues wakeup(&res_status) to wake up ALL processes that are waiting for the resource. Upon waking up, each process must try to acquire the resource again. This is because when an awakened process runs, the resource may no longer be available. The following code segment shows the resource management algorithm using sleep/wakeup.

5.8.1 Shortcomings of Sleep/Wakeup

Sleep and wakeup are simple tools for process synchronization, but they also have the following shortcomings.

. An event is just a value. It does not have a memory location to record the occurrence of the event. A process must go to sleep first before another process or an interrupt handler tries to wake it up. The sleep-first-wakeup-later order can always be achieved in a UP system, but not necessarily in MP systems. In an MP system, processes may run on different CPUs simultaneously (in parallel). It is impossible to guarantee the execution order of the processes. Therefore, sleep/wakeup are suitable only for UP systems.

. When used for resource management, if a process goes to sleep to wait for a resource, it must retry to get the resource after waking up, and it may have to repeat the sleep-wakeup-retry cycle many times before succeeding (if ever). The repeated retry loops mean poor efficiency due to excessive overhead in context switching.

5.9 Semaphores

A better mechanism for process synchronization is the semaphore, which does not have the above shortcomings of sleep/wakeup. A (counting) semaphore is a data structure

typedef struct semaphore{

In the semaphore structure, the spinlock field ensures that any operation on a semaphore can only be performed as an atomic operation by one process at a time, even if processes may run in parallel on different CPUs. The spinlock is needed only for multiprocessor systems. For UP systems, it is not needed and can be omitted. The most well-known operations on semaphores are P and V, which are defined (for UP kernels) as follows.

Binary semaphores may be regarded as a special case of counting semaphores. Since counting semaphores are more general, we shall not use, nor discuss, binary semaphores.

5.10 Applications of Semaphores

Semaphores are powerful synchronizing tools which can be used to solve all kinds of process synchronization problems in both UP and MP systems. The following lists the most common usages of semaphores. To simplify the notation, we shall denote s.value = n by s = n, and P(&s)/V(&s) by P(s)/V(s), respectively.

5.10.1 Semaphore Lock

A critical region (CR) is a sequence of operations on shared data objects which can only be executed by one process at a time. Semaphores with an initial value = 1 can be used as locks to protect CRs of long duration. Each CR is associated with a semaphore s = 1. Processes access the CR by using P/V as lock/unlock, as in

   struct semaphore s = 1;
   Processes:
      P(s);   // acquire semaphore to lock the CR
              // CR protected by lock semaphore s
      V(s);   // release semaphore to unlock the CR

With the semaphore lock, the reader may verify that only one process can be inside the CR at any time.

5.10.2 Mutex Lock

A mutex (Pthreads 2015) is a lock semaphore with an additional owner field, which identifies the current owner of the mutex lock. When a mutex is created, its owner field is initialized to 0, i.e. no owner. When a process acquires a mutex by mutex_lock(), it becomes the owner. A locked mutex can only be unlocked by its owner. When a process unlocks a mutex, it clears the owner field to 0 if there are no processes waiting on the mutex. Otherwise, it unblocks a waiting process from the mutex queue, which becomes the new owner, and the mutex remains locked. Extending P/V on semaphores to lock/unlock on mutexes is trivial. We leave it as an exercise for the reader. A major difference between mutexes and semaphores is that, whereas mutexes are strictly for locking, semaphores can be used for both locking and process cooperation.

5.10.3 Resource Management using Semaphore

A semaphore with initial value n > 0 can be used to manage n identical resources. Each process tries to get a unique resource for exclusive use. This can be achieved as follows.

As long as s > 0, a process can succeed in P(s) to get a resource. When all the resources are in use, requesting processes will be blocked at P(s). When a resource is released by V(s), a blocked process, if any, will be allowed to continue to use a resource. At any time the following invariants hold.

s >= 0 : s = the number of resources still available;

s < 0 : |s| = number of processes waiting in s queue

5.10.4 Wait for Interrupts and Messages

A semaphore with initial value 0 is often used to convert an external event, such as a hardware interrupt, the arrival of a message, etc., into unblocking a process that is waiting for the event. When a process waits for an event, it uses P(s) to block itself in the semaphore's waiting queue. When the awaited event occurs, another process or an interrupt handler uses V(s) to unblock a process from the semaphore queue, allowing it to continue.

5.10.5.1 Producer-Consumer Problem

A set of producer and consumer processes share a finite number of buffers. Each buffer contains a unique item at a time. Initially, all the buffers are empty. When a producer puts an item into an empty buffer, the buffer becomes full. When a consumer gets an item from a full buffer, the buffer becomes empty, etc. A producer must wait if there are no empty buffers. Similarly, a consumer must wait if there are no full buffers. Furthermore, waiting processes must be allowed to continue when their awaited events occur. Figure 5.8 shows a solution to the Producer-Consumer problem using semaphores. In Fig. 5.8, processes use mutex semaphores to access the circular buffer as a CR. Producer and consumer processes cooperate with one another through the semaphores full and empty.

5.10.5.2 Reader-Writer Problem

A set of reader and writer processes share a common data object, e.g. a variable or a file. The requirements are: an active writer must exclude all others. However, readers should be able to read the data object concurrently if there is no active writer. Furthermore, neither readers nor writers should wait indefinitely (starve). Figure 5.9 shows a solution to the Reader-Writer Problem using semaphores.

In Fig. 5.9, the semaphore rwsem enforces FIFO order of all incoming readers and writers, which prevents starvation. The (lock) semaphore rsem is for readers to update the nreader variable in a critical region. The first reader in a batch of readers locks wsem to prevent any writer from writing while there are active readers. On the writer side, at most one writer can be either actively writing or waiting in the wsem queue. In either case, new writers will be blocked in the rwsem queue. Assume that there is no writer blocked at rwsem. All new readers can pass through both P(rwsem) and P(rsem), allowing them to read the data concurrently. When the last reader finishes, it issues V(wsem) to allow any writer blocked at wsem to continue. When the writer finishes, it unlocks both wsem and rwsem. As soon as a writer waits at rwsem, all newcomers will be blocked at rwsem also. This prevents readers from starving writers.

5.10.6 Advantages of Semaphores

As a process synchronization tool, semaphores have many advantages over sleep/wakeup.

(1). Semaphores combine a counter, testing the counter and making a decision based on the testing outcome all in a single indivisible operation. The V operation unblocks only one waiting process, if any, from the semaphore queue. After passing through the P operation on a semaphore, a process is guaranteed to have a resource. It does not have to retry to get the resource again, as in the case of using sleep and wakeup.

(2). The semaphore's value records the number of times an event has occurred. Unlike sleep/wakeup, which must obey the sleep-first-wakeup-later order, processes can execute P/V operations on semaphores in any order.

5.10.7 Cautions of Using Semaphores

Semaphores use a locking protocol. If a process cannot acquire a semaphore in P(s), it becomes blocked in the semaphore queue, waiting for someone else to unblock it via a V(s) operation. Improper usage of semaphores may lead to problems. The most well-known problem is deadlock (Silberschatz et al. 2009; Tanenbaum et al. 2006). Deadlock is a condition in which a set of processes mutually wait for one another forever, so that none of the processes can proceed. In multitasking systems, deadlocks must not be allowed to occur. Methods of dealing with deadlocks include deadlock prevention, deadlock avoidance, and deadlock detection and recovery. Among the various methods, only deadlock prevention is practical and used in real operating systems. A simple but effective way to prevent deadlocks is to ensure that processes request different semaphores in a unidirectional order, so that crossed or circular locking can never occur. The reader may consult (Wang 2015) for how to deal with deadlocks in general.

Fig. 5.8 Producer-consumer problem solution

Fig. 5.9 Reader-writer problem solution

5.10.8 Use Semaphores in Embedded Systems

We demonstrate the use of semaphores in embedded systems by the following examples.

5.10.8.1 Device Drivers Using Semaphores

In the keyboard driver of Sect. 5.6.2.1, instead of using sleep/wakeup, we may use a semaphore for synchronization between processes and the interrupt handler. To do this, we simply redefine the KBD driver's data variable as a semaphore with the initial value 0.

P(&kp->data); // P on KBD's data semaphore

(3). kbd_handler(): Rewrite the KBD interrupt handler to unblock a process, if any. Since processes cannot interfere with the interrupt handler, there is no need to protect the data variables inside the interrupt handler.

Note that the interrupt handler only issues V() to unblock a waiting process, but it should never block or wait. If the input buffer is full, it simply discards the current input key and returns. As can be seen, the logic of the new driver using a semaphore is much clearer and the code size is also reduced significantly.

5.10.8.2 Event-Driven Embedded System Using Semaphore

The Example Program C5.4 uses sleep/wakeup for process synchronization. In the next example program, C5.5, we shall use P/V on semaphores for process synchronization. For the sake of brevity, we only show the modified KBD driver and the kbd process code. For clarity, the modifications are shown in bold faced lines.

Figure 5.10 shows the outputs of running the C5.5 program.

5.11 Other Synchronization Mechanisms

Many OS kernels use other mechanisms for process synchronization. These include

Fig. 5.10 Event-driven multitasking system using semaphores

5.11.1 Event Flags in OpenVMS

OpenVMS (formerly VAX/VMS) (OpenVMS 2014) uses event flags for process synchronization. In its simplest form, an event flag is a single bit, which is in the address spaces of many processes. Either by default or by explicit syscall, each event flag is associated with a specific set of processes. OpenVMS provides service functions for processes to manipulate their associated event flags by

set_event(b) : set b to 1 and wakeup waiter(b) if any;

Naturally, access to an event flag must be mutually exclusive. The differences between event flags and Unix events are:

. A Unix event is just a value, which does not have a memory location to record the occurrence of the event. A process must sleep on an event first before another process tries to wake it up later. In contrast, each event flag is a dedicated bit, which can record the occurrence of an event. Therefore, when using event flags the order of set_event and wait_event does not matter. Another difference is that, while Unix events are only available to processes in kernel mode, event flags in OpenVMS can be used by processes in user mode.

. Event flags in OpenVMS are in clusters of 32 bits each. A process may wait for a specific bit, or any or all of the events in an event cluster. In Unix, a process can only sleep for a single event.

. As in Unix, wakeup(e) in OpenVMS also wakes up all waiters on an event.

5.11.2 Event Variables in MVS

Each event variable e can be awaited by at most one process at a time. However, a process may wait for any number of event variables. When a process calls wait(e) to wait for an event, it does not wait if the event has already occurred (post bit p = 1). Otherwise, it turns on the w bit and waits for the event. When an event occurs, another process uses post(e) to post the event by turning on the p bit. If the event's w bit is on, it unblocks the waiting process if all its awaited events have been posted.

5.11.3 ENQ/DEQ in MVS

In addition to event variables, IBM's MVS (2010) also uses ENQ/DEQ for resource management. In their simplest form, ENQ(resource) allows a process to acquire exclusive control of a resource. A resource can be specified in a variety of ways, such as a memory area, the contents of a memory area, etc. A process blocks if the resource is unavailable. Otherwise, it gains exclusive control of the resource until it is released by a DEQ(resource) operation. As with event variables, a process may call ENQ(r1, r2, … rn) to wait for all or a subset of multiple resources.

5.12 High-Level Synchronization Constructs

Although P/V on semaphores are powerful synchronization tools, their usage in concurrent programs is scattered. Any misuse of P/V may lead to problems, such as deadlocks. To help remedy this problem, many high-level process synchronization mechanisms have been proposed.

5.12.1 Condition Variables

In Pthreads (Buttlar et al. 1996; Pthreads 2015), threads may use condition variables for synchronization. To use a condition variable, first create a mutex, m, for locking a CR containing shared variables, e.g. a counter. Then create a condition variable, con, associated with the mutex. When a thread wants to access the shared variable, it locks the mutex first. Then it checks the variable. If the counter value is not as expected, the thread may have to wait, as in

int count; // shared variable of threads

pthread_cond_wait(con, m) blocks the calling thread on the condition variable, which automatically and atomically unlocks the mutex m. While a thread is blocked on the condition variable, another thread may use pthread_cond_signal(con) to unblock a waiting thread, as in

When an unblocked thread runs, the mutex m is automatically and atomically locked, allowing the unblocked thread to resume in the CR of the mutex m. In addition, a thread may use pthread_cond_broadcast(con) to unblock all threads that are waiting for the same condition variable, which is similar to wakeup in Unix. Thus, while the mutex is strictly for locking, condition variables may be used for thread cooperation.

5.12.2 Monitors

A monitor (Hoare 1974) is an Abstract Data Type (ADT), which includes shared data objects and all the procedures that operate on the shared data objects. Like an ADT in object-oriented programming (OOP) languages, instead of scattered code in different processes, all code which operates on the shared data objects is encapsulated inside a monitor. Unlike an ADT in OOP, a monitor is a CR which allows only one process to execute inside the monitor at a time. Processes can only access shared data objects of a monitor by calling monitor procedures, as in

MONITOR m.procedure(parameters);

The concurrent programming language compiler translates monitor procedure calls into entering the monitor CR, and provides run-time protection automatically. When a process finishes executing a monitor procedure, it exits the monitor, which automatically unlocks the monitor, allowing another process to enter. While executing inside a monitor, if a process becomes blocked, it automatically exits the monitor first. As usual, a blocked process will be eligible to enter the monitor again when it is signaled by another process. Monitors are similar to condition variables but without an explicit mutex lock, which makes them somewhat more "abstract" than condition variables. The goal of monitors and other high-level synchronization constructs is to help users write "synchronization correct" concurrent programs. The idea is similar to that of using strong type-checking languages to help users write "syntactically correct" programs. These high-level synchronizing tools are used mostly in concurrent programming but rarely in real operating systems.

5.13 Process Communication

Process communication refers to schemes or mechanisms that allow processes to exchange information. Process communication can be accomplished in many different ways, all of which depend on process synchronization.

5.13.1 Shared Memory

The simplest way for processes to communicate is through shared memory. In most embedded systems, all processes run in the same address space. It is both natural and easy to use shared memory for process communication. To ensure that processes access the shared memory exclusively, we may use either a lock semaphore or a mutex to protect the shared memory as a critical region. If some processes only read but do not modify the shared memory, we may use the reader-writer algorithm to allow concurrent readers. When using shared memory for process communication, the mechanism only guarantees that processes read/write shared memory in a controlled manner. It is entirely up to the user to define and interpret the meaning of the shared memory contents.

5.13.2 Pipes

Pipes are unidirectional inter-process communication channels for processes to exchange streams of data. A pipe has a readend and a write end. Data written to the write end of a pipe can be read from the read end of the pipe. Since their debut in theoriginal Unix, pipes have been incorporated into almost all OS, with many variations. Some systems allow pipes to bebidirectional, in which data can be transmitted in both directions. Ordinary pipes are for related processes. Named pipes areFIFO communication channels between unrelated processes. Reading and writing pipes are usually synchronous andblocking. Some systems support non-blocking and asynchronous read/write operations on pipes. For simplicity, we shallconsider a pipe as a ﬁnite-sized FIFO communication channel between a set of processes. Reader and writer processes of apipe are synchronized in the following manner. When a reader reads from a pipe, if the pipe has data, the reader reads asmuch as it needs (up to the pipe size) and returns the number of bytes read. If the pipe has no data but still has writers, the5.13 Process Communication 143

reader waits for data. When a writer writes data to a pipe, it wakes up the waiting readers, allowing them to continue. If the pipe has no data and also no writer, the reader returns 0. Since readers wait for data if the pipe still has writers, the 0 return value means only one thing, namely the pipe has no data and also no writer. In that case, the reader can stop reading from the pipe. When a writer writes to a pipe, if the pipe has room, it writes as much as it needs to or until the pipe is full. If the pipe has no room but still has readers, the writer waits for room. When a reader reads data from the pipe to create more room, it wakes up the waiting writers, allowing them to continue. However, if a pipe has no more readers, the writer must detect this as a broken pipe error and abort.

5.13.2.1 Pipes in Unix/Linux

In Unix/Linux, pipes are an integral part of the file system, just like I/O devices, which are treated as special files. Each process has three standard file streams: stdin for inputs, stdout for outputs and stderr for displaying error messages, which is usually associated with the same device as stdout. Each file stream is identified by a file descriptor of the process, which is 0 for stdin, 1 for stdout and 2 for stderr. Conceptually, a pipe is a two-ended FIFO file which connects the stdout of a writer process to the stdin of a reader process. This is done by replacing the file descriptor 1 of the writer process with the write end of the pipe, and replacing the file descriptor 0 of the reader process with the read end of the pipe. In addition, the pipe uses state variables to keep track of the status of the pipe, allowing it to detect abnormal conditions, such as no more writers and broken pipe, etc.
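As a concrete user-space illustration of this file-descriptor plumbing, the following sketch connects a writer child to a reader parent through a pipe, replacing fd 1 of the writer and fd 0 of the reader exactly as described above. Error handling is omitted for brevity:

```c
#include <unistd.h>
#include <sys/wait.h>

/* Returns the number of bytes the reader received through the pipe. */
int pipe_demo(char *out, int max)
{
    int pd[2];
    pipe(pd);                 /* pd[0] = read end, pd[1] = write end */
    if (fork() == 0) {        /* child acts as the writer process */
        close(pd[0]);
        dup2(pd[1], 1);       /* fd 1 (stdout) now writes into the pipe */
        close(pd[1]);
        write(1, "hello", 5);
        _exit(0);
    }
    close(pd[1]);             /* parent acts as the reader process */
    dup2(pd[0], 0);           /* fd 0 (stdin) now reads from the pipe */
    close(pd[0]);
    int n = read(0, out, max);
    wait(0);
    return n;
}
```

This is the same redirection a shell performs for a pipeline such as `a | b` before exec'ing each command.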

5.13.2.2 Pipes in Embedded Systems

Most embedded systems either do not support a file system or the file system may not be Unix-compatible. Therefore, processes in an embedded system may not have opened files and file descriptors. Despite this, we can still implement pipes for process communication in embedded systems. In principle, pipes are similar to the producer-consumer problem, except for the following differences.

(1). In the producer-consumer problem, a blocked producer process can only be signaled by a consumer process, and vice versa. Pipes use state variables to keep track of the numbers of reader and writer processes. When a pipe writer detects that the pipe has no more readers, it returns with a broken pipe error. When a reader detects that the pipe has no more writers and also no data, it returns 0.
(2). The producer-consumer algorithm uses semaphores for synchronization. Semaphores are suitable for processes to write/read data of the same size. In contrast, pipe readers and writers do not have to read/write data of the same size. For example, writers may write lines but readers read chars, and vice versa.
(3). The V operation on a semaphore unblocks at most one waiting process. Although rare, a pipe may have multiple writers and readers at both ends. When a process at either end changes the pipe status, it should unblock all waiting processes at the other end. In this case, sleep/wakeup is more suitable than P/V on semaphores. For this reason, pipes are usually implemented using sleep/wakeup for synchronization.

In the following, we shall show how to implement a simplified pipe for process communication. The simplified pipe behaves like named pipes in Linux. It allows processes to write/read a sequence of bytes through the pipe, but it does not check or handle abnormal conditions, such as broken pipe. Full implementation of pipes as file streams will be shown later in Chap. 8 when we discuss general purpose embedded operating systems. The simplified pipe is implemented as follows.

When the system starts, all the pipe objects are initialized to FREE.

(1). PIPE *create_pipe(): this creates a PIPE object in the (shared) address space of all the processes. It allocates a free PIPE object, initializes it and returns a pointer to the created PIPE object.
(2). Read/write pipe: for each pipe, the user must designate a process as either a writer or a reader, but not both. Writer processes call

int write_pipe(PIPE *pipePtr, char buf[ ], int n);

to write n bytes from buf[ ] to the pipe. The return value is the number of bytes written to the pipe. Reader processes call

int read_pipe(PIPE *pipePtr, char buf[ ], int n);

which tries to read n bytes from the pipe. The return value is the actual number of bytes read. The following shows the pipe read/write algorithms, which use sleep/wakeup for process synchronization.

        sleep(&p->room);   // no room in the pipe: writer sleeps for room
    }
}

Note that when a process tries to read n bytes from a pipe, it may return less than n bytes. If the pipe has data, it reads either n bytes or the number of available bytes in the pipe, whichever is smaller. It waits only if the pipe has no data. Thus, each read returns at most PSIZE bytes.
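Since the in-kernel listing is only excerpted above, here is a user-space sketch of the same read_pipe/write_pipe logic, with a pthread mutex and condition variables standing in for the kernel's interrupt masking and sleep/wakeup. The PIPE field names are assumptions, not the book's exact code:

```c
#include <pthread.h>

#define PSIZE 16   /* pipe buffer size, as in the sample system C5.6 */

typedef struct pipe_s {
    char buf[PSIZE];
    int head, tail, data;        /* data = number of bytes now in the pipe */
    pthread_mutex_t lock;
    pthread_cond_t havedata;     /* stands in for wakeup on the data event */
    pthread_cond_t haveroom;     /* stands in for wakeup on the room event */
} PIPE;

void init_pipe(PIPE *p)
{
    p->head = p->tail = p->data = 0;
    pthread_mutex_init(&p->lock, 0);
    pthread_cond_init(&p->havedata, 0);
    pthread_cond_init(&p->haveroom, 0);
}

int read_pipe(PIPE *p, char *buf, int n)
{
    int r = 0;
    pthread_mutex_lock(&p->lock);
    while (p->data == 0)                      /* no data: sleep for data */
        pthread_cond_wait(&p->havedata, &p->lock);
    while (n-- > 0 && p->data > 0) {          /* read up to n bytes */
        buf[r++] = p->buf[p->tail];
        p->tail = (p->tail + 1) % PSIZE;
        p->data--;
    }
    pthread_cond_broadcast(&p->haveroom);     /* wake up ALL waiting writers */
    pthread_mutex_unlock(&p->lock);
    return r;                                 /* at most PSIZE bytes */
}

int write_pipe(PIPE *p, char *buf, int n)
{
    int w = 0;
    pthread_mutex_lock(&p->lock);
    while (w < n) {
        while (p->data == PSIZE)              /* pipe full: sleep for room */
            pthread_cond_wait(&p->haveroom, &p->lock);
        while (w < n && p->data < PSIZE) {    /* write until done or full */
            p->buf[p->head] = buf[w++];
            p->head = (p->head + 1) % PSIZE;
            p->data++;
        }
        pthread_cond_broadcast(&p->havedata); /* wake up ALL waiting readers */
    }
    pthread_mutex_unlock(&p->lock);
    return w;
}
```

Note that broadcast (wake up all waiters) rather than signal (wake up one) mirrors the sleep/wakeup choice explained earlier: a status change at one end should unblock every process waiting at the other end.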

(3). When a pipe is no longer needed, it may be freed by destroy_pipe(PIPE *pipePtr), which deallocates the PIPE object and wakes up all the processes sleeping on the pipe.

5.13.2.3 Demonstration of Pipes

The sample system C5.6 demonstrates pipes in an embedded system with static processes. When the system starts, the initialization code creates a pipe pointed to by kpipe. When the initial process P0 runs, it creates two processes: P1 as the pipe writer, and P2 as the pipe reader. For demonstration purposes, we set the pipe's buffer size to a rather small value, PSIZE=16, so that if the writer tries to write more than 16 bytes, it will wait for room. After reading from the pipe, the reader wakes up the writer, allowing it to continue. In the demonstration program, P1 first gets a line from the UART0 port. Then it tries to write the line to the pipe. It waits for room if the pipe is full. P2 reads from the pipe and displays the bytes read. Although each time P2 tries to read 20 bytes, the actual number of bytes read is at most PSIZE. For the sake of brevity, we only show the t.c file of the sample program.

Figure 5.11 shows the sample outputs of running the pipe program C5.6.

Fig. 5.11 Demonstration of pipe

5.13 Process Communication 147

5.13.3 Signals

Like interrupts to a CPU, signals are (software) interrupts to a process (Unix 1990), which divert the process from its normal execution to do signal processing. In an ordinary OS, processes execute in one of two distinct modes: kernel mode or user mode. The CPU checks for pending interrupts at the end of each instruction, which is invisible to processes executing on the CPU. Similarly, a process checks for pending signals only in kernel mode, which is invisible to the process in user mode. In most embedded systems, all processes execute in the same address space, so they do not have a separate user mode. If we use signals for process communication, each process must check for pending signals explicitly in its processing loop, which is equivalent to polling for events. Thus, for embedded systems with only a single address space, signals are unsuited to process communication.

5.13.4 Message Passing

Message passing allows processes to communicate by exchanging messages. Message passing has a wide range of applications. In operating systems, it is a general form of Inter-Process Communication (IPC) (Accetta et al. 1986). In computer networks, it is the basis of server-client oriented programming. In distributed computing, it is used by parallel processes to exchange data and synchronization information. In operating system design, it is the basis of the so-called microkernel, etc. In this section, we shall show the design and implementation of several message passing schemes using semaphores. The goal of message passing is to allow processes to communicate by exchanging messages. If processes have distinct (user mode) address spaces, they cannot access each other's memory area directly. In that case, message passing must go through the kernel. If all processes execute in the same address space of a kernel, message passing allows processes to exchange information in a controlled manner while hiding the synchronization details from the processes. The contents of a message can be designed to suit the needs of the communicating processes. For simplicity, we shall assume that message contents are text strings of finite length, e.g. 128 bytes. To accommodate the transfer of messages, we assume that the kernel has a finite set of message buffers, which are defined as

typedef struct mbuf{

Initially, all message buffers are in a free mbufList. To send a message, a process must first get a free mbuf. After receiving a message, it releases the mbuf for reuse. Since the mbufList is accessed by many processes, it is a critical region (CR), which must be protected. So we define a semaphore mlock = 1 for processes to access the mbufList exclusively. The algorithm of get_mbuf() and put_mbuf() is as follows.
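The get_mbuf()/put_mbuf() pair can be sketched in user space as follows, with a pthread mutex playing the role of P/V on the binary semaphore mlock. The MBUF field layout and pool size are assumptions for illustration:

```c
#include <pthread.h>

#define NMBUF 10

typedef struct mbuf {
    struct mbuf *next;
    char text[128];       /* message contents: finite-length text */
} MBUF;

static MBUF mbuf_pool[NMBUF];
static MBUF *mbufList;                /* free mbuf list */
static pthread_mutex_t mlock = PTHREAD_MUTEX_INITIALIZER;  /* mlock = 1 */

void init_mbufs(void)
{
    mbufList = 0;
    for (int i = 0; i < NMBUF; i++) {     /* link all mbufs into the free list */
        mbuf_pool[i].next = mbufList;
        mbufList = &mbuf_pool[i];
    }
}

MBUF *get_mbuf(void)
{
    pthread_mutex_lock(&mlock);       /* P(mlock): enter CR */
    MBUF *mp = mbufList;
    if (mp)
        mbufList = mp->next;          /* detach the first free mbuf */
    pthread_mutex_unlock(&mlock);     /* V(mlock): exit CR */
    return mp;                        /* 0 if no free mbuf */
}

void put_mbuf(MBUF *mp)
{
    pthread_mutex_lock(&mlock);
    mp->next = mbufList;              /* return the mbuf to the free list */
    mbufList = mp;
    pthread_mutex_unlock(&mlock);
}
```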

Instead of a centralized message queue, we assume that each PROC has a private message queue, which contains mbufs delivered to, but not yet received by, the process. Initially, every PROC's mqueue is empty. The mqueue of each process is also a CR because it is accessed by all the sender processes as well as the process itself. So we define another semaphore PROC.mlock = 1 for protecting the process message queue.

5.13.4.1 Asynchronous Message Passing

In the asynchronous message passing scheme, both send and receive operations are non-blocking. If a process cannot send or receive a message, it returns a failed status, in which case the process may retry the operation later. Asynchronous communication is intended mainly for loosely-coupled systems, in which interprocess communication is infrequent, i.e. processes do not exchange messages on a planned or regular basis. For such systems, asynchronous message passing is more suitable due to its greater flexibility. The algorithms of the asynchronous send-receive operations are as follows.
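A user-space sketch of the non-blocking a_send/a_recv pair might look as follows. The extra pid parameter to a_recv stands in for "the running process", and malloc/free stand in for get_mbuf/put_mbuf to keep the sketch short; all names are illustrative:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NPROC 4

typedef struct mbuf {
    struct mbuf *next;
    char text[128];
} MBUF;

typedef struct proc {
    MBUF *mqueue;                 /* private message queue (FIFO) */
    pthread_mutex_t mlock;        /* protects this PROC's mqueue */
} PROC;

static PROC procs[NPROC];

static MBUF *get_mbuf(void) { return calloc(1, sizeof(MBUF)); }
static void  put_mbuf(MBUF *m) { free(m); }

void msg_init(void)
{
    for (int i = 0; i < NPROC; i++) {
        procs[i].mqueue = 0;
        pthread_mutex_init(&procs[i].mlock, 0);
    }
}

int a_send(char *msg, int pid)            /* non-blocking send */
{
    MBUF *mp = get_mbuf();
    if (!mp) return -1;                   /* no free mbuf: fail, retry later */
    strncpy(mp->text, msg, 127);
    PROC *p = &procs[pid];
    pthread_mutex_lock(&p->mlock);        /* enter CR of the target mqueue */
    MBUF **q = &p->mqueue;                /* enqueue at the end (FIFO order) */
    while (*q) q = &(*q)->next;
    *q = mp;
    pthread_mutex_unlock(&p->mlock);
    return 0;
}

int a_recv(char *msg, int pid)            /* non-blocking receive */
{
    PROC *p = &procs[pid];
    pthread_mutex_lock(&p->mlock);
    MBUF *mp = p->mqueue;
    if (!mp) { pthread_mutex_unlock(&p->mlock); return -1; }  /* no message */
    p->mqueue = mp->next;
    pthread_mutex_unlock(&p->mlock);
    strcpy(msg, mp->text);
    put_mbuf(mp);                         /* release mbuf for reuse */
    return 0;
}
```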

The above algorithms work under normal conditions. However, if all processes only send but never receive, or a malicious process repeatedly sends messages, the system may run out of free message buffers. When that happens, the message facility would come to a halt since no process can send anymore. One good thing about the asynchronous protocol is that there cannot be any deadlocks because it is non-blocking.

5.13.4.2 Synchronous Message Passing

In the synchronous message passing scheme, both send and receive operations are blocking. A sending process must "wait" if there is no free mbuf. Similarly, a receiving process must "wait" if there is no message in its message queue. In general, synchronous communication is more efficient than asynchronous communication. It is well suited to tightly-coupled systems

in which processes exchange messages on a planned or regular basis. In such a system, processes can expect messages to come when they are needed, and the usage of message buffers is carefully planned. Therefore, processes can wait for messages or free message buffers rather than relying on retries. To support synchronous message passing, we define additional semaphores for process synchronization and redesign the send-receive algorithms as follows.
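A sketch of s_send/s_recv using POSIX counting semaphores in the roles of nmbuf (number of free mbufs) and PROC.nmsg (number of delivered messages) might look like this; the field names and sizes are assumptions:

```c
#include <semaphore.h>
#include <pthread.h>
#include <string.h>

#define NPROC 2
#define NMBUF 4

typedef struct mbuf { struct mbuf *next; char text[128]; } MBUF;
typedef struct proc {
    MBUF *mqueue;
    pthread_mutex_t mlock;   /* protects mqueue */
    sem_t nmsg;              /* messages in mqueue, initially 0 */
} PROC;

static MBUF pool[NMBUF];
static MBUF *freeList;
static pthread_mutex_t flock = PTHREAD_MUTEX_INITIALIZER;
static sem_t nmbuf;          /* free mbufs, initially NMBUF */
static PROC procs[NPROC];

void msg_init(void)
{
    freeList = 0;
    for (int i = 0; i < NMBUF; i++) { pool[i].next = freeList; freeList = &pool[i]; }
    sem_init(&nmbuf, 0, NMBUF);
    for (int i = 0; i < NPROC; i++) {
        procs[i].mqueue = 0;
        pthread_mutex_init(&procs[i].mlock, 0);
        sem_init(&procs[i].nmsg, 0, 0);
    }
}

void s_send(char *msg, int pid)
{
    sem_wait(&nmbuf);                      /* P(nmbuf): wait for a free mbuf */
    pthread_mutex_lock(&flock);
    MBUF *mp = freeList; freeList = mp->next;
    pthread_mutex_unlock(&flock);
    strncpy(mp->text, msg, 127);
    mp->next = 0;
    PROC *p = &procs[pid];
    pthread_mutex_lock(&p->mlock);         /* deliver to the target mqueue */
    MBUF **q = &p->mqueue;
    while (*q) q = &(*q)->next;
    *q = mp;
    pthread_mutex_unlock(&p->mlock);
    sem_post(&p->nmsg);                    /* V(p->nmsg): signal the receiver */
}

void s_recv(char *msg, int pid)
{
    PROC *p = &procs[pid];
    sem_wait(&p->nmsg);                    /* P(nmsg): wait for a message */
    pthread_mutex_lock(&p->mlock);
    MBUF *mp = p->mqueue; p->mqueue = mp->next;
    pthread_mutex_unlock(&p->mlock);
    strcpy(msg, mp->text);
    pthread_mutex_lock(&flock);            /* release the mbuf for reuse */
    mp->next = freeList; freeList = mp;
    pthread_mutex_unlock(&flock);
    sem_post(&nmbuf);                      /* V(nmbuf) */
}
```

Because both sem_wait calls block, this version exhibits exactly the deadlock risks analyzed next.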

The above s_send/s_recv algorithm is correct in terms of process synchronization, but there are other problems. Whenever a blocking protocol is used, there are chances of deadlock. Indeed, the s_send/s_recv algorithm may lead to the following deadlock situations.

(1). If processes only send but do not receive, all processes would eventually be blocked at P(nmbuf) when there are no more free mbufs.
(2). If no process sends but all try to receive, every process would be blocked at its own nmsg semaphore.

(3). A process Pi sends a message to another process Pj and waits for a reply from Pj, which does exactly the opposite. Then Pi and Pj would mutually wait for each other, which is the familiar cross-locked deadlock.

As for how to handle deadlocks in message passing, the reader may consult Chap. 6 of (Wang 2015), which also contains a server-client based message passing protocol.


5.14 Uniprocessor (UP) Embedded System Kernel

An embedded system kernel comprises dynamic processes, all of which execute in the same address space of the kernel. The kernel provides functions for process management, such as process creation, synchronization, communication and termination. In this section, we shall show the design and implementation of uniprocessor (UP) embedded system kernels. There are two distinct types of kernels: non-preemptive and preemptive. In a non-preemptive kernel, each process runs until it gives up the CPU voluntarily. In a preemptive kernel, a running process can be preempted either by priority or by time-slice.

5.14.1 Non-preemptive UP Kernel

A uniprocessor (UP) kernel is non-preemptive if each process runs until it gives up the CPU voluntarily. While a process runs, it may be diverted to handle interrupts, but control always returns to the point of interruption in the same process at the end of interrupt processing. This implies that in a non-preemptive UP kernel only one process runs at a time. Therefore, there is no need to protect data objects in the kernel from the concurrent executions of processes. However, while a process runs, it may be diverted to execute an interrupt handler, which may interfere with the process if both try to access the same data object. To prevent interference from interrupt handlers, it suffices to disable interrupts when a process executes a piece of critical code. This simplifies the system design. The example program C5.8 demonstrates the design and implementation of a non-preemptive kernel for uniprocessor embedded systems. We assume that the system hardware consists of two timers, which can be programmed to generate timer interrupts with different frequencies, a UART, a keyboard and an LCD display. The system software consists of a set of concurrent processes, all of which execute in the same address space but with different priorities. Process scheduling is by non-preemptive priority. Each process runs until it goes to sleep, blocks itself or terminates. Timer0 maintains the time-of-day and displays a wall clock on the LCD. Since the task of displaying the wall clock is short, it is performed by the timer0 interrupt handler directly. Two periodic processes, timer_task1 and timer_task2, each call the pause(t) function to suspend themselves for a number of seconds. After registering a pause time in the PROC structure, the process changes status to PAUSE, enters itself into a pauseList and gives up the CPU. On each second, Timer2 decrements the pause time of every process in the pauseList by 1. When the time reaches 0, it makes the paused process ready to run again.
Fig. 5.12 Demonstration of message passing

Although this can be accomplished by the sleep-wakeup mechanism, it is intended to show that periodic tasks can be implemented in the general framework of timer service functions. In addition, the system supports two sets of cooperative processes, which implement the producer-consumer problem to demonstrate process synchronization using semaphores. Each producer process tries to get an input line from UART0. Then it deposits the chars into a shared buffer, char pcbuffer[N], of size N bytes. Each consumer process tries to get a char from pcbuffer[N] and displays it on the LCD. Producer and consumer processes share the common data buffer as a pipe. For brevity, we only show the relevant code segments of the system.
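The pause/timer-tick bookkeeping described above can be sketched as follows; the names kpause and timer_tick are assumptions, and enqueueing into the readyQueue is reduced to a status change:

```c
#define NPROC 4

typedef enum { READY, PAUSE } pstatus;
typedef struct proc {
    struct proc *next;
    pstatus status;
    int pause_time;              /* remaining pause time in seconds */
} PROC;

PROC proc[NPROC];
PROC *pauseList = 0;

void kpause(PROC *p, int t)      /* register pause; caller then gives up CPU */
{
    p->pause_time = t;
    p->status = PAUSE;
    p->next = pauseList;         /* enter the pauseList */
    pauseList = p;
}

void timer_tick(void)            /* called by the timer once per second */
{
    PROC **pp = &pauseList;
    while (*pp) {
        PROC *p = *pp;
        if (--p->pause_time <= 0) {  /* time expired: make it ready again */
            *pp = p->next;           /* remove from the pauseList */
            p->status = READY;       /* real kernel: enqueue(&readyQueue, p) */
        } else {
            pp = &p->next;
        }
    }
}
```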

5.14.2 Demonstration of Non-preemptive UP Kernel

Figure 5.13 shows the outputs of running the program C5.8, which demonstrate a non-preemptive UP kernel.

Fig. 5.13 Demonstration of non-preemptive UP kernel

5.14.3 Preemptive UP Kernel

In a preemptive UP kernel, while a process runs, the CPU can be taken away from it to run another process. Process preemption is triggered by events that require rescheduling of processes, for example, when a higher priority process becomes ready to run or, if using time-sliced process scheduling, when a process has exhausted its time quantum. The preemption policy can be either restrictive or nonrestrictive (fully preemptive). In restrictive preemption, while a process is executing a piece of critical code that cannot be interfered with by other processes, the kernel may disable interrupts or the process scheduler to prevent process switch, thus deferring process preemption until it is safe to do so. In nonrestrictive preemption, process switch takes place immediately, regardless of what the current running process is doing. This implies that, in a fully preemptive UP kernel, processes run logically in parallel. As a result, all shared kernel data structures must be protected as critical regions. This makes a fully preemptive UP kernel logically equivalent to an MP kernel, since both must support the concurrent executions of multiple processes. The only difference between an MP kernel and a fully preemptive UP kernel is that, whereas processes in the former may run on different CPUs in parallel, processes in the latter can only run concurrently on the same CPU, but their logical behaviors are the same. In the following, we shall only consider fully preemptive UP kernels. MP kernels will be covered in Chap. 9 on multiprocessor systems. We demonstrate the design and implementation of a fully preemptive UP kernel by an example. The system hardware components are the same as in the sample program C5.8. The system software consists of a set of concurrent processes with different priorities, all of which execute in the same address space of the kernel. The process scheduling policy is fully preemptive priority. In order to support full preemption, we first identify the shared data structures in the kernel that must be protected. These include

(1). PROC *freeList: which is used for dynamic task creation and termination.
(2). PROC *readyQueue: which is used for process scheduling.
(3). PROC *sleepList, *pauseList: which are used for sleep/wakeup operations.

For each shared kernel data structure, we implement its access functions as critical regions, each protected by a mutex lock. In addition, we define the following global variables to control process preemption.

(4). int swflag: process switch flag; cleared to 0 when a process is scheduled to run, set to 1 whenever a reschedule event occurs, e.g. when a ready process is added to the readyQueue.
(5). int intnest: IRQ interrupt nesting counter; initially 0, incremented by 1 when entering an IRQ handler, decremented by 1 when exiting an IRQ handler. We assume IRQ interrupts are processed in IRQ handlers directly, i.e. not by pseudo interrupt processing tasks. Process switch may occur only at the end of interrupt processing. For nested interrupts, process switch is deferred until the end of all nested interrupt processing.

Explanations of the ts.s ﬁle of Program C5.9:

Reset_handler: As usual, reset_handler is the entry point. It sets the SVC mode stack pointer to the high end of proc[0] and copies the vector table to address 0. Next, it changes to IRQ mode to set the IRQ mode stack pointer. Then it calls main() in SVC mode. During system operation, all processes run in SVC mode in the same address space of the kernel.
Irq_handler: Process switch is usually triggered by interrupts, which may wake up sleeping processes, make a blocked process ready to run, etc. Thus, irq_handler is the most important piece of assembly code relevant to process preemption, so we only show the irq_handler code. As pointed out in Chap. 3, the ARM CPU cannot handle nested interrupts in IRQ mode. To handle nested interrupts, interrupt processing must be done in a different privileged mode. In order to support process preemption due to interrupts, we choose to handle IRQ interrupts in SVC mode. The reader may consult Chap. 3 for how to handle nested interrupts. The irq_handler code interacts with irq_chandler() in C to support preemption.

/************* back to irq_handler in assembly *******/

(8). Issue EOI for interrupt

(2). Modiﬁed kernel functions for preemption

For brevity, we only show the modified kernel functions that support process preemption. These include kwakeup, V on semaphores and mutex_unlock, all of which may make a sleeping or blocked process ready to run and change the readyQueue. In addition, kfork may also create a new process with a higher priority than the current running process. All these functions call reschedule(), which may switch process immediately or defer the process switch until the end of IRQ interrupt processing.
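The deferral logic in reschedule() can be sketched in isolation as follows, with tswitch() stubbed to a counter so the swflag/intnest interplay is visible; the function signatures are assumptions, not the book's exact code:

```c
/* Global preemption-control variables, as described above. */
int swflag = 0;     /* set to 1 when a reschedule event occurs */
int intnest = 0;    /* IRQ nesting level; 0 = not inside any IRQ handler */

int switches = 0;   /* stand-in for the real tswitch(): count context switches */
void tswitch(void) { switches++; swflag = 0; }

/* Called by kwakeup, V, mutex_unlock, kfork, etc. whenever the readyQueue
   may now hold a higher priority process than the running one. */
void reschedule(int higher_priority_ready)
{
    if (!higher_priority_ready)
        return;
    if (intnest > 0)        /* inside an IRQ handler: defer the switch */
        swflag = 1;
    else
        tswitch();          /* safe to preempt immediately */
}

/* End-of-IRQ path: when all nested IRQs are done, honor a deferred switch. */
void irq_exit(void)
{
    if (--intnest == 0 && swflag)
        tswitch();
}
```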

5.14.4 Demonstration of Preemptive UP Kernel

The sample system C5.9 demonstrates fully preemptive process scheduling. When the system starts, it creates and runs the initial process P0, which has the lowest priority 0. P0 creates a new process P1 with priority=1. Since P1 has a higher priority than P0, it immediately preempts P0, which demonstrates direct preemption without any delay. When P1 runs in task1(), it first waits for a timer event by P(s1=0), which is Ved up by a timer periodically (every 4 seconds). While P1 waits on the semaphore, P0 resumes running. When the timer interrupt handler Vs up P1, it tries to preempt P0 with P1. Since task switch is not allowed inside an interrupt handler, the preemption is deferred, which demonstrates that preemption may be delayed by interrupt processing. As soon as interrupt processing ends, P1 will preempt P0 to become running again. To illustrate process preemption due to blocking, P1 first locks the mutex mp. While holding the mutex lock, P1 creates a process P2 with a higher priority=2, which immediately preempts P1. We assume that P2 does not need the mutex. It creates a process P3 with a higher priority=3, which immediately preempts P2. In the task3() code, P3 tries to lock the same mutex mp, which is still held by P1. Thus, P3 gets blocked on the mutex, which switches to run P2. When P2 finishes, it calls kexit() to terminate, causing P1 to resume running. When P1 unlocks the mutex, it unblocks P3, which has a higher priority than P1, so it immediately preempts P1. After P3 terminates, P1 resumes running again and the cycle repeats. Figure 5.14 shows the sample outputs of running the program C5.9. It is noted that, in a strict priority system, the current running process should always be the one with the highest priority. However, in the sample system C5.9, when the process P3, which has the highest priority, tries to lock the mutex that is already held by P1, which has a lower priority, it becomes blocked on the mutex and the system switches to run the next runnable process.
In the sample system, we assumed that the process P2 does not need the mutex, so it becomes the running process when P3 gets blocked on the mutex. In this case, the system is running P2, which does not have the highest priority. This violates the strict priority principle, resulting in what is known as a priority inversion (Lampson and Redell 1980), in which a low priority process may block a higher priority process. If P2 keeps on running or switches to another process of the same or lower priority, process P3 would be blocked for an unknown amount of time, resulting in an unbounded priority inversion. Whereas simple priority inversion may be considered natural whenever processes are allowed to compete for exclusive control of resources, unbounded priority inversion could be detrimental to systems with critical timing requirements. The sample program C5.9 actually implements a scheme called priority inheritance, which prevents unbounded priority inversion. We shall discuss priority inversion in more detail later in Chap. 10 on real-time systems.
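The core of priority inheritance can be sketched as follows. The blocked-queue handling and the actual context switching are elided, and the names pi_lock/pi_unlock are hypothetical; only the priority bookkeeping is shown:

```c
typedef struct proc { int pid, priority; } PROC;

typedef struct mutex {
    int locked;
    PROC *owner;             /* current holder of the mutex */
    int owner_old_priority;  /* priority to restore on unlock */
} MUTEX;

void pi_lock(MUTEX *m, PROC *caller)
{
    if (m->locked && caller->priority > m->owner->priority) {
        /* the holder inherits the blocked caller's higher priority, so it
           cannot be preempted indefinitely by medium-priority processes */
        m->owner->priority = caller->priority;
        return;              /* caller would now block on m */
    }
    m->locked = 1;
    m->owner = caller;
    m->owner_old_priority = caller->priority;
}

void pi_unlock(MUTEX *m)
{
    m->owner->priority = m->owner_old_priority;  /* restore original priority */
    m->locked = 0;
    m->owner = 0;
}
```

In the C5.9 scenario, P1 (priority 1) would be raised to P3's priority 3 while holding mp, so P2 could no longer run ahead of it, bounding the inversion.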

Fig. 5.14 Demonstration of process preemption

5.15 Summary

This chapter covers process management. It introduces the process concept and the principle and technique of multitasking by context switching. It shows how to create processes dynamically and discusses the principles of process scheduling. It covers process synchronization and the various kinds of process synchronization mechanisms. It shows how to use process synchronization to implement event-driven embedded systems. It discusses the various kinds of process communication schemes, which include shared memory, pipes and message passing. It shows how to integrate these concepts and techniques to implement a uniprocessor (UP) kernel that supports process management with both non-preemptive and preemptive process scheduling. The UP kernel will be the foundation for developing complete operating systems in later chapters.

List of Sample Programs

1. In the example program C5.2, the tswitch function saves all the CPU registers in the process kstack and restores all the saved registers of the next process when it resumes. Since tswitch() is called as a function, it is clearly unnecessary to save/restore R0. Assume that the tswitch function is implemented as

(1). Show how to initialize the kstack of a new process for it to start executing the body() function.
(2). Assume that the body() function is written as
int body(int dummy, int pid, int ppid){ }
where the parameters pid and ppid are the process id and the parent process id of the new process. Show how to modify the kfork() function to accomplish this.

2. Rewrite the UART driver in Chap. 3 by using sleep/wakeup to synchronize processes and the interrupt handler.
3. In the example program C5.3, all the processes are created with the same priority (so that they take turns to run).

(1). What would happen if the processes are created with different priorities?
(2). Implement a change_priority(int new_priority) function, which changes the running task's priority to new_priority. Switch process if the current running process no longer has the highest priority.

4. With dynamic processes, a process may terminate when it has completed its task. Implement a kexit() function for tasks to terminate.
5. In all the example programs, each PROC structure has a statically allocated 4 KB kstack.

(1). Implement a simple memory manager to allocate/deallocate memory dynamically. When the system starts, reserve a piece of memory, e.g. a 1MB area beginning at 4MB, as a free memory area. The function

char *malloc(int size)

allocates a piece of free memory of size bytes. When a memory area is no longer needed, it is released back to the free memory area by

void mfree(char *address, int size)

Design a data structure to represent the current available free memory. Then implement the malloc() and mfree() functions.

(2). Modify the kstack ﬁeld of the PROC structure as an integer pointer int *kstack;

and modify the kfork() function as

int kfork(int func, int priority, int stack_size)

which dynamically allocates a memory area of stack_size (in 1KB units) for the new process.

(3). When a process terminates, its stack area must be freed. How to implement this?

6. It is well known that an interrupt handler must never go to sleep, become blocked or wait. Explain why.
7. Misuse of semaphores may result in deadlocks. Survey the literature to find out how to deal with deadlocks by deadlock prevention, deadlock avoidance, and deadlock detection and recovery.
8. The pipe program C5.6 is similar to named pipes (FIFOs) in Linux. Read the Linux man page on fifo to learn how to use named pipes for inter-process communication.
9. Modify the example program C5.8 by adding another set of cooperative producer-consumer processes to the system. Let producers get lines from the KBD, process the chars and pipe them to the consumers, which output the chars to the second UART terminal.
10. Modify the sample program C5.9 to handle nested IRQ interrupts in SYS mode but still allow task preemption at the end of nested interrupt processing.
11. Assume that all processes have the same priority. Modify the sample program C5.9 to support process scheduling by time slice.

6.1 Process Address Spaces

After power-on or reset, the ARM processor starts to execute the reset handler code in Supervisor (SVC) mode. The reset handler first copies the vector table to address 0, initializes the stacks of the various privileged modes for interrupt and exception processing, and enables IRQ interrupts. Then it executes the system control program, which creates and starts up processes or tasks. In the static process model, all the tasks run in SVC mode in the same address space of the system kernel. The main disadvantage of this scheme is the lack of memory protection. While executing in the same address space, tasks share the same global data objects and may interfere with one another. An ill-designed or misbehaving task may corrupt the shared address space, causing other tasks to fail. For better system security and reliability, each task should run in a private address space, which is isolated and protected from other tasks. In the ARM architecture, tasks may run in the unprivileged User mode. It is very easy to switch the ARM CPU from a privileged mode to User mode. However, once in User mode, the only way to enter a privileged mode is by one of the following means.

Exceptions: when an exception occurs, the CPU enters a corresponding privileged mode to handle the exception.
Interrupts: an interrupt causes the CPU to enter either FIQ or IRQ mode.
SWI: the SWI instruction causes the CPU to enter the Supervisor or SVC mode.

In the ARM architecture, System mode is a separate privileged mode, which shares the same set of CPU registers with User mode, but it is not the same system or kernel mode found in most other processors. To avoid confusion, we shall refer to the ARM SVC mode as the Kernel mode. SWI can be used to implement system calls, which allow a User mode process to enter Kernel mode, execute kernel functions and return to User mode with the desired results.
In order to separate and protect the memory regions of individual tasks, it is necessary to enable the memory management hardware, which provides each task with a separate virtual address space. In this chapter, we shall cover the ARM Memory Management Unit (MMU) and demonstrate virtual address mapping and memory protection by example programs.
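The system-call mechanism mentioned above amounts to a dispatch table indexed by a call number. The following sketch shows such a table in C; in a real kernel the SWI/SVC exception handler extracts the call number (e.g. from the SWI instruction's immediate field or a register) and then performs this lookup. The numbers and functions here are toy examples, not the book's actual system calls:

```c
typedef int (*syscall_fn)(int a, int b);

/* Hypothetical kernel functions reachable via SWI. */
static int sys_getpid(int a, int b) { (void)a; (void)b; return 42; }
static int sys_add(int a, int b)    { return a + b; }

static syscall_fn syscall_table[] = { sys_getpid, sys_add };
#define NSYSCALL (int)(sizeof(syscall_table)/sizeof(syscall_table[0]))

/* Entered (conceptually) in SVC mode from the SWI exception vector. */
int svc_handler(int num, int a, int b)
{
    if (num < 0 || num >= NSYSCALL)
        return -1;                       /* invalid system call number */
    return syscall_table[num](a, b);     /* run kernel function, return result */
}
```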

6.2 Memory Management Unit (MMU) in ARM

The ARM Memory Management Unit (MMU) (ARM926EJ-S 2008) performs two primary functions: first, it translates virtual addresses into physical addresses; second, it controls memory access by checking permissions. The MMU hardware which performs these functions consists of a Translation Lookaside Buffer (TLB), access control logic and translation table walking logic. The ARM MMU supports memory accesses based on either sections or pages. Memory management by sections is a one-level paging scheme. The level-1 page table contains section descriptors, each of which specifies a 1 MB block of memory. Memory management by paging is a two-level paging scheme. The level-1 page table contains page table descriptors, each of which describes a level-2 page table. The level-2 page table contains page descriptors, each of which specifies a page frame in memory and access control bits. The ARM paging scheme supports two different page sizes. Small pages consist of 4 KB blocks of memory and large pages consist of 64 KB blocks of memory. Each page comprises 4 sub-pages. Access control can be extended to 1 KB sub-pages within small pages and to 16 KB sub-pages within large pages. The ARM MMU also supports the concept of domains. A domain is a memory area that can be defined with individual access rights. The Domain Access Control Register (DACR) specifies the access rights for up to 16 different

domains, 0–15. The accessibility of each domain is specified by a 2-bit permission: 00 for no access; 01 for client mode, which checks the Access Permission (AP) bits of the domain or page table entries; and 11 for manager mode, which does not check the AP bits in the domain. The TLB contains 64 translation entries in a cache. During most memory accesses, the TLB provides the translation information to the access control logic. If the TLB contains a translated entry for the virtual address, the access control logic determines whether access is permitted. If access is permitted, the MMU outputs the appropriate physical address corresponding to the virtual address. If access is not permitted, the MMU signals the CPU to abort. If the TLB does not contain a translated entry for the virtual address, the translation table walk hardware is invoked to retrieve the translation information from a translation table in physical memory. Once retrieved, the translation information is placed into the TLB, possibly overwriting an existing entry. The entry to be overwritten is chosen by cycling sequentially through the TLB locations. When the MMU is turned off, e.g. during reset, there is no address translation; in this case every virtual address is a physical address. Address translation takes effect only when the MMU is enabled.
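The one-level section translation can be modeled in software to make the table walk concrete. Each of the 4096 level-1 entries maps a 1 MB virtual section: the top 12 bits of the virtual address index the table, and the descriptor supplies the physical section base. This is a sketch of the address arithmetic only, not of the hardware's access checks:

```c
#include <stdint.h>

#define SECTION_SHIFT 20                  /* 1 MB sections */

uint32_t ttb[4096];                       /* level-1 translation table model */

/* Build an identity mapping: VA section i -> PA section i.
   0x02 marks the entry as a section descriptor (type bits [1:0] = 10). */
void build_table(void)
{
    for (uint32_t i = 0; i < 4096; i++)
        ttb[i] = (i << SECTION_SHIFT) | 0x02;
}

uint32_t translate(uint32_t va)
{
    uint32_t desc = ttb[va >> SECTION_SHIFT];        /* level-1 table walk */
    return (desc & 0xFFF00000) | (va & 0x000FFFFF);  /* section base | offset */
}
```

Remapping a single entry, e.g. pointing the entry for VA 0x80000000 at PA 0x00100000, then makes translate() return addresses in the new physical section while all other sections stay identity-mapped.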

6.3 MMU Registers

The ARM processor treats the MMU as a coprocessor. The MMU contains several 32-bit registers which control the operation of the MMU. Figure 6.1 shows the format of MMU registers. MMU registers can be accessed by using the MRC and MCR instructions. The following is a brief description of the ARM MMU registers c0 to c10.

Register c0 is for access to the ID Register, Cache Type Register, and TCM Status Registers. Reading from this register returns the device ID, the cache type, or the TCM status, depending on the value of Opcode_2 used.

Register c1 is the Control Register, which specifies the configuration of the MMU. In particular, setting the M bit (bit 0) enables the MMU and clearing the M bit disables the MMU. The V bit of c1 specifies whether the vector table is remapped during reset. The default vector table location is 0x00. It may be remapped to 0xFFFF0000 during reset.

Register c2 is the Translation Table Base Register (TTBR). It holds the physical address of the first-level translation table, which must be on a 16 KB boundary in main memory. Reading from c2 returns the pointer to the currently active first-level translation table. Writing to register c2 updates the pointer to the first-level translation table.

Register c3 is the Domain Access Control Register. It consists of 16 two-bit fields, each of which defines the access permissions for one of the sixteen Domains (D15-D0).

Register c4 is currently not used.

Fig. 6.1 ARM MMU Registers


Register c5 is the Fault Status Register (FSR). It indicates the domain and type of access being attempted when an abort occurred. Bits 7:4 specify which of the sixteen domains (D15-D0) was being accessed, and bits 3:0 indicate the type of access being attempted. A write to this register flushes the TLB.

Register c6 accesses the Fault Address Register (FAR). It holds the virtual address of the access when a fault occurred. A write to this register causes the data written to be treated as an address and, if it is found in the TLB, the entry is marked as invalid. This operation is known as a TLB purge. The Fault Status Register and Fault Address Register are only updated for data faults, not for prefetch faults.

Register c7 controls the caches and the write buffer.

Register c8 is the TLB Operations Register. It is used mainly to invalidate TLB entries. The TLB is divided into two parts: a set-associative part and a fully-associative part. The fully-associative part, also referred to as the lockdown part of the TLB, is used to store entries to be locked down. Entries held in the lockdown part of the TLB are preserved during an invalidate TLB operation. Entries can be removed from the lockdown TLB using an invalidate TLB single entry operation. The invalidate TLB operations invalidate all the unpreserved entries in the TLB. The invalidate TLB single entry operations invalidate any TLB entry corresponding to the virtual address.

Register c9 accesses the Cache Lockdown and TCM Region Registers on ARM boards equipped with TCM.

Register c10 is the TLB Lockdown Register. It controls the lockdown region in the TLB.

6.4 Accessing MMU Registers

The registers of CP15 can be accessed by MRC and MCR instructions in a privileged mode. The instruction format is shown in Fig. 6.2.

6.4.1 Enabling and Disabling the MMU

The MMU is enabled by setting the M bit, bit 0, of the CP15 Control Register c1. On reset, this bit is cleared to 0, disabling the MMU.

6.4.1.1 Enable MMU

Before enabling the MMU, the system must do the following:

1. Program all relevant CP15 registers. This includes setting up suitable translation tables in memory.
2. Disable and invalidate the Instruction Cache. The instruction cache can be enabled when enabling the MMU.

Fig. 6.2 MCR and MRC instruction format


To enable the MMU proceed as follows:

1. Program the Translation Table Base and Domain Access Control Registers.
2. Program first-level and second-level descriptor page tables as needed.
3. Enable the MMU by setting bit 0 in the CP15 Control Register c1.

6.4.1.2 Disable MMU

To disable the MMU proceed as follows:

(1). Clear bit 2 in the CP15 Control Register c1. The Data Cache must be disabled prior to, or at the same time as, the MMU being disabled, by clearing bit 2 of the Control Register.

If the MMU is enabled, then disabled, and subsequently re-enabled, the contents of the TLBs are preserved. If the TLB entries are now invalid, they must be invalidated before the MMU is re-enabled.

(2). Clear bit 0 in the CP15 Control Register c1.

When the MMU is disabled, memory accesses are treated as follows:

• All data accesses are treated as Noncacheable. The value of the C bit, bit 2, of the CP15 Control Register c1 should be zero.
• All instruction accesses are treated as Cacheable if the I bit (bit 12) of the CP15 Control Register c1 is set to 1, and Noncacheable if the I bit is set to 0.
• All explicit accesses are Strongly Ordered. The value of the W bit, bit 3, of the CP15 Control Register c1 is ignored.
• No memory access permission checks are performed, and no aborts are generated by the MMU.
• The physical address for every access is equal to its virtual address. This is known as flat address mapping.
• The FCSE PID Should Be Zero when the MMU is disabled. This is the reset value of the FCSE PID. If the MMU is to be disabled, the FCSE PID must be cleared.
• All CP15 MMU and cache operations work as normal when the MMU is disabled.
• Instruction and data prefetch operations work as normal. However, the Data Cache cannot be enabled when the MMU is disabled, so a data prefetch operation has no effect. Instruction prefetch operations have no effect if the Instruction Cache is disabled. No memory access permission checks are performed and the address is flat mapped.
• Accesses to the TCMs work as normal if the TCMs are enabled.

6.4.2 Domain Access Control

Memory access is controlled primarily by domains. There are 16 domains, each defined by 2 bits in the Domain Access Control Register. Each domain supports two kinds of users:

Clients: Clients use a domain.

Managers: Managers control the behavior of the domain.

The domains are defined in the Domain Access Control Register. In Fig. 6.1, row 3 illustrates how the 32 bits of the register are allocated to define sixteen 2-bit domains. Table 6.1 shows the meanings of the domain access bits.

Table 6.1 Access Bits in Domain Access Control Register


Table 6.2 FSR Status Field Encoding

6.4.3 Translation Table Base Register

Register c2 is the Translation Table Base Register (TTBR), for the base address of the first-level translation table. Reading from c2 returns the pointer to the currently active first-level translation table in bits [31:14] and an Unpredictable value in bits [13:0]. Writing to register c2 updates the pointer to the first-level translation table from the value in bits [31:14] of the written value. Bits [13:0] Should Be Zero. The TTBR can be accessed by the following instructions.

MRC p15, 0, <Rd>, c2, c0, 0 ; read TTBR

MCR p15, 0, <Rd>, c2, c0, 0 ; write TTBR

The CRm and Opcode_2 ﬁelds are SBZ (Should-Be-Zero) when writing to c2.

6.4.4 Domain Access Control Register

Register c3 is the Domain Access Control Register, consisting of 16 two-bit fields. Each two-bit field defines the access permissions for one of the 16 domains, D15-D0. Reading from c3 returns the value of the Domain Access Control Register. Writing to c3 writes the value of the Domain Access Control Register. The 2-bit domain access control values are defined as:

Value  Meaning    Description
00     No access  Any access generates a domain fault
01     Client     Accesses are checked against the access permission bits in the section or page descriptor
10     Reserved   Currently behaves like the no access mode
11     Manager    Accesses are not checked against the access permission bits, so a permission fault cannot be generated
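The packing of the sixteen 2-bit fields can be illustrated with a small C sketch; the helper names are our own, not ARM's:

```c
#include <stdint.h>

/* Sketch of how the 16 two-bit fields of the Domain Access Control
 * Register are packed; function and enum names are illustrative. */
enum { DOM_NONE = 0, DOM_CLIENT = 1, DOM_MANAGER = 3 };

/* Set the 2-bit access field for a given domain (0-15). */
uint32_t dacr_set(uint32_t dacr, int domain, uint32_t access)
{
    dacr &= ~(3u << (2 * domain));        /* clear the old field */
    return dacr | (access << (2 * domain));
}

/* Read back the 2-bit access field for a given domain. */
uint32_t dacr_get(uint32_t dacr, int domain)
{
    return (dacr >> (2 * domain)) & 3u;
}
```

For instance, granting client access to domain 0 only yields the value 0x1, which is what the example programs later in this chapter write into c3.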

The Domain Access Control Register can be accessed by the following instructions:

MRC p15, 0, <Rd>, c3, c0, 0 ; read DACR

MCR p15, 0, <Rd>, c3, c0, 0 ; write DACR

6.4.5 Fault Status Registers

Register c5 accesses the Fault Status Registers (FSRs). The FSRs contain the source of the last instruction or data fault. The instruction-side FSR is intended for debug purposes only. The FSR is updated for alignment faults, and for external aborts that occur while the MMU is disabled. The FSR accessed is determined by the value of the Opcode_2 field:

The following describes the bit ﬁelds in the FSR.

Bits    Description
[31:9]  UNP/SBZP
[8]     Always reads as zero; writes ignored
[7:4]   Specify the domain (D15-D0) being accessed when a data fault occurred
[3:0]   Type of fault generated. Table 6.2 shows the encodings of the status field in the FSR, and whether the Domain field contains valid information
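Decoding the domain and status fields of an FSR value is a matter of simple shifts and masks, as in this illustrative C fragment (helper names are our own):

```c
#include <stdint.h>

/* Decode the domain and status fields of a Fault Status Register
 * value, following the bit layout above. */
static inline int fsr_domain(uint32_t fsr) { return (fsr >> 4) & 0xF; }
static inline int fsr_status(uint32_t fsr) { return fsr & 0xF; }
```

A data abort handler can use exactly this decomposition to report which domain was being accessed and which fault type occurred, as the example programs do later in this chapter.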

6.4.6 Fault Address Register

Register c6 is the Fault Address Register (FAR). It contains the Modified Virtual Address of the access being attempted when a Data Abort occurred. The FAR is only updated for Data Aborts, not for Prefetch Aborts. The FAR is updated for alignment faults, and for external aborts that occur while the MMU is disabled. The FAR can be accessed by using the following instructions.

MRC p15, 0, < Rd > , c6, c0, 0 ; read FAR

MCR p15, 0, < Rd > , c6, c0, 0 ; write FAR

Writing c6 sets the FAR to the value of the data written. This is useful for a debugger to restore the value of the FAR to a previous state. The CRm and Opcode_2 fields are SBZ (Should Be Zero) when reading or writing CP15 c6.

6.5 Virtual Address Translations

The MMU translates virtual addresses generated by the CPU into physical addresses to access external memory, and also derives and checks the access permission. Translation information, which consists of both the address translation data and the access permission data, resides in a translation table located in physical memory. The MMU provides the logic needed to traverse the translation table, obtain the translated address, and check the access permission. The translation process consists of the following steps.

6.5.1 Translation Table Base

The Translation Table Base (TTB) Register points to the base of a translation table in physical memory which contains Section and/or Page descriptors.

6.5.2 Translation Table

The translation table is the level-one page table. It contains 4096 4-byte entries and it must be located on a 16 KB boundary in physical memory. Each entry is a descriptor, which specifies either a level-2 page table base or a section base. Figure 6.3 shows the format of the level-one page table entries.

6.5.3 Level-One Descriptor

A level-one descriptor is either a Page Table Descriptor or a Section Descriptor, and its format varies accordingly. Figure 6.3 shows the format of level-one descriptors. The descriptor type is specified by the two least significant bits.

6.5.3.1 Page Table Descriptor

6.5.3.2 Section Descriptor

A section descriptor (third row in Fig. 6.3) has a 12-bit base address, a 2-bit AP field, a 4-bit domain field, the C and B bits and a type identifier (b10). The bit fields are defined as follows.

Bits 31:20: base address of a 1 MB section in memory.

Bits 19:12: always 0.

Bits 11:10 (AP): access permissions of this section. Their interpretation depends on the S and R bits (bits 8-9 of the MMU Control Register c1). The most commonly used AP and S/R settings are as follows.

Bits 8:5: specify one of the sixteen possible domains (in the Domain Access Control Register) that form the primary access controls.

Bit 4: should be 1.

Bits 3:2 (C and B) control the cache and write buffer related functions as follows:

C (Cacheable): data at this address will be placed in the cache (if the cache is enabled).

B (Bufferable): data at this address will be written through the write buffer (if the write buffer is enabled).
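Putting the bit fields together, a section descriptor can be assembled as in the following C sketch (the helper name and argument order are our own):

```c
#include <stdint.h>

/* Build a level-one section descriptor from the fields above.
 * A sketch; the helper name and argument order are illustrative. */
uint32_t mk_section_desc(uint32_t base,  /* 1 MB-aligned section base    */
                         uint32_t ap,    /* bits 11:10 access permission */
                         uint32_t dom,   /* bits 8:5  domain number      */
                         uint32_t cb)    /* bits 3:2  cache/buffer bits  */
{
    return (base & 0xFFF00000)  /* bits 31:20: section base address */
         | (ap  << 10)
         | (dom << 5)
         | (1u  << 4)           /* bit 4: should be 1 */
         | (cb  << 2)
         | 0x2;                 /* bits 1:0 = 10: section type */
}
```

With AP = 01, domain 0 and CB = 00 this produces the attribute value 0x412 that the example programs later in this chapter OR into each section entry.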

Fig. 6.3 Level-one Descriptors


Fig. 6.4 Translation of Section References

6.6 Translation of Section References

In the ARM architecture, the simplest kind of paging scheme is by 1 MB sections, which uses only a level-one page table, so we discuss memory management by sections first. When the MMU translates a virtual address to a physical address, it consults the page tables. The translation process is commonly referred to as a page table walk. When using sections, the translation consists of the following steps, which are depicted in Fig. 6.4.

(1). A virtual address (VA) comprises a 12-bit Table Index and a 20-bit Section Index, which is the offset within the section. The MMU uses the 12-bit Table Index to access a section descriptor in the translation table pointed at by the TTBR.
(2). The section descriptor contains a 12-bit base address, which points to a 1 MB section in memory, a 2-bit AP field and a 4-bit domain number. First, the MMU checks the domain access permissions in the Domain Access Control Register. Then it checks the AP bits for accessibility to the section.
(3). If the permission checking passes, it uses the 12-bit section base address and the 20-bit Section Index to generate the physical address as

(32-bit)PA = ((12-bit)SectionBaseAddress << 20) + (20-bit)SectionIndex
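The steps above can be modeled as a purely software table walk in C (no permission checks; the table is an ordinary array standing in for physical memory):

```c
#include <stdint.h>

/* Software model of the section translation walk above (permission
 * checking omitted); ttb points to a 4096-entry level-one table. */
uint32_t section_translate(const uint32_t *ttb, uint32_t va)
{
    uint32_t index = va >> 20;            /* 12-bit table index */
    uint32_t desc  = ttb[index];          /* section descriptor */
    /* bits 31:20 of the descriptor + 20-bit section index (offset) */
    return (desc & 0xFFF00000) | (va & 0x000FFFFF);
}
```

With an identity-mapped entry such as ttb[1] = 0x00100412, any VA in the second megabyte translates to itself, which is exactly the behavior of the C6.1 example later in this chapter.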

6.7.2 Level-2 Page Descriptors

The format of Level-2 Page Table Descriptors is shown in Fig. 6.5.

In a Level-2 page descriptor, the two least significant bits indicate the page size and validity. The other bits are interpreted as follows.

Fig. 6.5 Page table entry (Level Two descriptor)

Bits 31:16 (large pages) or bits 31:12 (small pages) contain the physical address of the page frame in memory. Large page size is 64 KB and small page size is 4 KB.

Bits 11:4 specify the access permissions (ap3-ap0) of the four sub-pages. This allows for finer access control within a page, but it is rarely used in practice.

Bit 3 C (Cacheable): indicates that data at this address will be placed in the IDC (if the cache is enabled).

Bit 2 B (Bufferable): indicates that data at this address will be written through the write buffer (if the write buffer is enabled).

6.7.3 Translation of Small Page References

Page translation involves one additional step beyond that of a section translation: the Level-1 descriptor is a Page Table descriptor, which points to the Level-2 page table containing Level-2 page descriptors. Each Level-2 page descriptor points to a page frame in physical memory. Translation of small page references consists of the following steps, which are depicted in Fig. 6.6.

(1). A virtual address VA comprises a 12-bit Level-1 Table Index, an 8-bit Level-2 Table Index and a 12-bit Page Index, which is the byte offset within the page.
(2). Use the 12-bit Level-1 Table Index to access a Level-1 descriptor in the translation table pointed at by the Translation Table Base Register (TTBR).
(3). Check the domain access permission in the Level-1 descriptor as follows: 00 = abort, 01 = check AP in the level-2 page table, 11 = do not check AP of the page table.
(4). The leading 22 bits of the Level-1 descriptor specify the (physical) address of a Level-2 page table containing 256 page entries. Use the 8-bit Level-2 Table Index to access a Level-2 page descriptor in the Level-2 page table.

(32-bit)PA = ((20-bit)PageFrameAddress << 12) + (12-bit)PageIndex
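The two-level walk and the PA formula above can likewise be modeled in C. In this host-side sketch a level-1 descriptor holds a real pointer to its 1 KB-aligned level-2 table, where the hardware would hold a physical address; no permission checks are modeled:

```c
#include <stdint.h>

/* Software model of the two-level small-page walk above. pte_t is
 * uintptr_t so that, in this simulation, a level-1 descriptor can
 * carry a host pointer to its 1 KB-aligned level-2 table. */
typedef uintptr_t pte_t;

pte_t page_translate(const pte_t *ttb, uintptr_t va)
{
    pte_t l1 = ttb[(va >> 20) & 0xFFF];          /* level-1 descriptor */
    /* mask off the low 10 type/domain bits to recover the table base */
    const pte_t *pt = (const pte_t *)(l1 & ~(pte_t)0x3FF);
    pte_t l2 = pt[(va >> 12) & 0xFF];            /* level-2 descriptor */
    return (l2 & ~(pte_t)0xFFF) | (va & 0xFFF);  /* frame | page index */
}
```

The final line is exactly the PA formula above: the 20-bit page frame address shifted into bits 31:12, plus the 12-bit page index.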

6.7.4 Translation of Large Page References

Translation of large page references is similar to that of translating small pages, except for the following differences.

(1). For large pages, a VA = [12-bit L1 Index | 8-bit L2 Index | 16-bit Page Index].
(2). Since the upper four bits of the Page Index and the low-order four bits of the Level-2 Table Index overlap, each page table entry for a large page must be duplicated 16 times in consecutive memory locations in the Level-2 page table. This is a rather peculiar property of the ARM paging tables for large pages. Since large pages are rarely used in practice, we shall only consider small pages of 4 KB page size.
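The 16-fold duplication can be expressed as a small C helper; the function name is our own:

```c
#include <stdint.h>

/* For a 64 KB large page the top 4 bits of the 16-bit page index
 * overlap the low 4 bits of the level-2 table index, so the same
 * descriptor must be written to 16 consecutive level-2 slots. */
void install_large_page(uint32_t *pt, uint32_t va, uint32_t desc)
{
    uint32_t first = (va >> 12) & 0xF0;  /* round L2 index down to 16 */
    for (int i = 0; i < 16; i++)
        pt[first + i] = desc;
}
```

Whichever of the 16 aliased level-2 indices the walk happens to use, it finds the same descriptor, which is why the duplication is required.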

6.8 Memory Management Example Programs

This section presents several programming examples which illustrate how to configure the ARM MMU for memory management.

6.8.1 One-Level Paging Using 1 MB Sections

In the first example program, denoted by C6.1, we shall use 1 MB sections to map the VA space to the PA space. The program consists of the following components: a ts.s file in assembly code and a t.c file in C, which are (cross) compile-linked to a binary executable t.bin. When running on the emulated Versatilepb board under QEMU, it will be loaded to 0x10000 and runs from there. The program supports the following I/O devices: an LCD for display, a keyboard for inputs, a UART for serial port I/O and also a timer. Since the objective here is to demonstrate memory management, we shall focus on how to set up the MMU for the virtual address space. The ARM Versatilepb board supports 256 MB RAM and a 2 MB I/O space beginning at 256 MB. In this program, we shall use 1 MB sections to create an identity mapping of the low 258 MB virtual address space to the low 258 MB physical address space. The following lists the code of the C6.1 program.

Explanations of the ts.s file: Since the program is compile-linked without using virtual addresses, the program code can be executed directly during reset when the MMU is off. Therefore, we may call functions in the C code when the program starts up. Upon entry to reset_handler, it first sets the SVC mode stack pointer. Then it calls functions in C to initialize the LCD for display and copy the vector table to address 0. Then it sets up the page table and enables the MMU for VA to PA address translation. The steps are marked as (m1) to (m4), which are explained in more detail below.

(m1): It calls mkptable() in C to set up the level-1 page table using 1 MB sections. The emulated Versatilepb board under QEMU supports 256 MB RAM and a 2 MB I/O space at 256 MB. The level-1 page table is set up to create an identity mapping of the low 258 MB VA to PA, which includes the 256 MB RAM and the 2 MB I/O space. The attributes of each section descriptor are set to 0x412, for AP = 01 (client), domain = 0000, CB = 00 (D cache and W buffer disabled) and type = 10 for 1 MB sections.
(m2): It sets the Translation Table Base Register (TTBR) to point at the page table.
(m3): It sets the access bits of domain 0 to 01 (client) to ensure that the domain can only be accessed in privileged mode. Alternatively, we may also set the domain access bits to 11 (manager mode) to allow access in any mode without domain permission checking.
(m4): It enables the MMU for address translation.
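Step (m1) can be sketched in C roughly as follows. This is not the book's exact mkptable() listing; for host-side testing the table location is passed in as a parameter rather than hard-coded:

```c
#include <stdint.h>

/* Sketch of the mkptable() step (m1): identity-map the low 258 MB
 * with 1 MB section descriptors, attribute 0x412 as described above
 * (AP = 01, domain 0, CB = 00, section type). */
void mkptable(uint32_t *ptable)
{
    for (uint32_t i = 0; i < 4096; i++)
        ptable[i] = 0;                  /* invalidate all entries */
    for (uint32_t i = 0; i < 258; i++)  /* 256 MB RAM + 2 MB I/O  */
        ptable[i] = (i << 20) | 0x412;  /* VA == PA identity map  */
}
```

Entry i maps the i-th megabyte of VA onto the i-th megabyte of PA; entries 258 and above remain invalid, so any access outside the mapped range raises a data abort.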

After these steps, every virtual address (VA) is mapped to a physical address (PA). In this case, both addresses are the same due to the identity mapping. The remaining parts of the ts.s code do the following. The program runs in SVC mode but it may enter IRQ mode to handle IRQ interrupts. It may also enter Data Abort mode to handle data abort exceptions. So, it initializes stack pointers for the various modes. Then it calls main() in SVC mode with IRQ interrupts enabled.

The t.c file contains the main function of the program. It first initializes the device drivers and IRQ interrupt handlers. Then it demonstrates memory protection by trying to access invalid virtual addresses, which generate data_abort exceptions. In the data abort handler, data_chandler(), it reads the MMU's Fault Status Register c5 and Fault Address Register c6 to display the reason for the exception (domain invalid) as well as the VA that caused the exception. It is noted that when a data abort exception occurs, PC-8 points to the instruction that caused the exception. In the data abort handler, if we adjusted the link register by -8, it would return to the same bad instruction again, resulting in an infinite loop. For this reason, we adjust the return PC by -4 to allow the execution to continue.

(4). Linker and mk script files: t.ld is a standard linker script. It defines the program's entry point and allocates memory areas as privileged mode stacks.
(5). Compile-link command: This is a sh script used to (cross) compile-link the .s and .c files. The starting virtual address of the program is 0x10000.

6.8.2 Two-Level Paging Using 4 KB Pages

The second MMU example program, C6.2, uses 2-level paging. It consists of the following components.

1. ts.s file: This is identical to the ts.s file of Program C6.1. During startup, it calls mkptable() in C to set up the two-level page tables. Then it sets the TTB and domain access permission bits and enables the MMU. Then it calls main() in SVC mode.
2. t.c file: This is the same as the t.c file of Program C6.1, except for the modified mkptable() function. Instead of building a level-1 page table using 1 MB sections, it builds a level-1 page table and its associated level-2 page tables for 2-level paging. For the sake of brevity, we only show the modified mkptable() function.

Figure 6.8 shows the sample outputs of running the C6.2 program, which demonstrates two-level paging. When the program starts, it first builds the two-level page tables in 5 steps. Then it tests memory protection by trying to access some VA locations. As the figure shows, attempts to access VA = 0x00200000 (2 MB) and VA = 0x02000000 (16 MB) do not cause any data_abort exception because both are within the 258 MB VA space of the kernel. However, for VA = 0xA0000000, it generates a data_abort exception because the VA is outside of the 258 MB VA space of the kernel.

6.8.3 One-Level Paging with High VA Space

The third MMU program, C6.3, uses 1 MB sections to map the virtual address space to 0x80000000 (2 GB). The program will be loaded to the physical address 0x10000 by QEMU. It is compile-linked with the starting virtual address 0x80010000. Since the program is compiled with virtual addresses, we cannot call any function in the program's C code before setting up the page table and enabling the MMU to map VA to PA. For this reason, the initial page table must be built in assembly code during reset while the MMU is off. When the program starts, we first set up an initial page table to ID-map the lowest 1 MB VA to PA. This is because the vector table is located at the physical address 0 and the entry points of the exception handlers are located within 4 KB from the vector table. In addition to ID-mapping the low 1 MB, we also fill the page table entries 2048-2305 to map the virtual address space VA = (0x80000000, 0x80000000 + 258 MB) to the low 258 MB PA. Next, we enable the MMU to start VA to PA address translation. Then we call main() in C using its VA at 0x80000000 + main. Since the entire program resides in the lowest 1 MB of physical memory, we may also call main() using its PA. The following lists the assembly code.

As before, we focus on the code that sets up the MMU, which are labeled as (m1) to (m4).

(m1): Upon entry, it first sets up a level-1 page table at 0x4000 (16 KB) using 1 MB sections. Entry 0 of the page table, ptable[0], is used to ID-map the lowest 1 MB of VA to PA, which is required by the vector table. Then it fills the page table entries ptable[2048] to ptable[2048+257] with section descriptors, which map VA = (0x80000000, 0x80000000 + 258 MB) to the low 258 MB PA. The attributes field of each section descriptor is set to AP = 01 (client), domain = 0000, CB = 00 (D cache and W buffer off) and type = 10 (section).
(m2): After setting up the level-1 page table, it sets the TTBR to the page table at 0x4000.
(m3): It sets the access bits of domain 0 to 01 for client mode.
(m4): Then it enables the MMU for VA to PA translation.
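Step (m1) for the high mapping can be sketched in C as follows. This is a host-testable sketch with the table passed in as a parameter, not the book's assembly code:

```c
#include <stdint.h>

/* Sketch of step (m1) for C6.3: entry 0 identity-maps the lowest
 * 1 MB (for the vector table); entries 2048-2305 map the 258 MB at
 * VA 0x80000000 onto the low 258 MB of PA. Attribute 0x412 as above
 * (AP = 01, domain 0, CB = 00, section type). */
void mkptable_high(uint32_t *ptable)
{
    for (int i = 0; i < 4096; i++)
        ptable[i] = 0;                 /* invalidate all entries  */
    ptable[0] = 0x412;                 /* ID-map the lowest 1 MB  */
    for (int i = 0; i < 258; i++)      /* map 2 GB+ onto low PA   */
        ptable[2048 + i] = ((uint32_t)i << 20) | 0x412;
}
```

Index 2048 corresponds to VA 0x80000000 (0x80000000 >> 20 = 2048), so entry 2048+i maps the i-th megabyte above 2 GB onto the i-th megabyte of physical memory.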

With the MMU enabled, it can now call functions in C, which are compiled with virtual addresses starting at 0x80010000. It sets up the stacks for the various modes, copies the vector table to address 0 and calls main() in SVC mode.

Explanations of the t.c code: In main(), we display the VAs of some functions and variables to show that they are in the virtual address range above 2 GB (0x80000000). Then we verify the memory protection mechanism of the MMU by trying to access some invalid VAs, which would generate data abort exceptions. In the data_abort_handler(), we read and display the MMU's fault_status and fault_addr registers to show the reason as well as the invalid VA that caused the exception.

3. Starting Virtual Address: In order to use virtual addresses starting from 0x80000000, the compile-link commands are modified as

arm-none-eabi-as -mcpu=arm926ej-s ts.s -o ts.o

4. Other Modifications: With a starting VA = 0x80000000, the base addresses of all the I/O devices must be changed to virtual addresses. These are done by a VA(x) macro

#define VA(x) (0x80000000 + (u32)x)

which adds 0x80000000 to their base addresses in the memory map.
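For example, assuming the Versatilepb memory map, UART0 at PA 0x101F1000 would be accessed through VA(0x101F1000):

```c
#include <stdint.h>
typedef uint32_t u32;

/* The VA(x) macro from the text: relocate a device's physical base
 * address into the kernel's high VA space by adding 0x80000000. */
#define VA(x) (0x80000000 + (u32)(x))

/* Versatilepb UART0 sits at PA 0x101F1000; with the high mapping
 * the driver must use the corresponding VA instead. */
#define UART0_BASE VA(0x101F1000)
```

The same macro wraps the bases of the LCD, timer and interrupt controller, so the driver code itself is unchanged apart from the base-address definitions.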

5. Sample Outputs of Program C6.3: Figure 6.9 shows the sample outputs of running the C6.3 program. As the figure shows, the VA range is above 0x80000000, and any attempt to access an invalid VA generates a data_abort exception.

Fig. 6.9 Demonstration of One-level Paging with High VA Space


6.8.4 Two-Level Paging with High VA Space

The sample program C6.4 uses 2-level paging with the virtual address space beginning at 0x80000000 (2 GB). Since the program will be loaded at the physical address 0x10000 and runs from there, it is compile-linked with the starting VA = 0x80010000. Similar to Program C6.3, we must set up the page tables and enable the MMU in assembly code when the program starts. Since building page tables in assembly is very tedious, we shall do it in two separate steps. In the first step, we set up an initial one-level page table using 1 MB sections to map the VA = (2 GB, 2 GB + 258 MB) range exactly the same as in Program C6.3. After enabling the MMU for address translation, we call a function in C to build a new level-1 page table (pgdir) at 32 KB and its associated level-2 page tables at 5 MB using 4 KB small pages. Then we switch the TTB to the new level-1 page table and flush the TLB, thereby switching the MMU from one-level paging to two-level paging. The following shows the code of example program C6.4.

(1). ts.s file: The ts.s file is the same as that of Program C6.3, except for the added switchPgdir function, which is listed below.

(2). t.c file: The t.c file is identical to that of Program C6.3, except for the added mkPtable() function, which creates a new level-1 page table (pgdir) at 32 KB and its associated level-2 page tables at 5 MB. Then it switches the TTB to the new pgdir to let the MMU use 2-level paging. The choices of the new pgdir at 32 KB and the level-2 page tables at 5 MB are quite arbitrary. They can be built anywhere in physical memory.
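The two parts of mkPtable() can be sketched as separate C helpers. The 0x11 level-1 attribute (domain 0, bit 4 set, coarse page-table type) matches the pgdir line quoted in the Problems section; the 0x552 small-page attribute (AP = 01 for all four sub-pages, CB = 00, small-page type) is our assumption, not necessarily the book's exact value:

```c
#include <stdint.h>

/* Sketch of the two stages of mkPtable() in C6.4; table locations
 * are passed in for host-side testing. */

/* Fill the high half of the new pgdir: one level-2 table per MB,
 * laid out contiguously at PA 5 MB, 1 KB apart. */
void fill_level1(uint32_t *pgdir)
{
    for (int i = 0; i < 258; i++)
        pgdir[i + 2048] = (uint32_t)(0x500000 + i * 1024) | 0x11;
}

/* Fill one level-2 table with 256 small-page descriptors covering
 * megabyte 'mb' of physical memory (0x552 attribute is assumed). */
void fill_level2(uint32_t *pgtable, int mb)
{
    for (int j = 0; j < 256; j++)
        pgtable[j] = (uint32_t)((mb << 20) + (j << 12)) | 0x552;
}
```

After both stages, switchPgdir points the TTBR at the new pgdir and flushes the TLB so the stale one-level translations are discarded.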

6.9 Summary

This chapter covers the ARM Memory Management Unit (MMU) and virtual address space mappings. It covers the ARM MMU in detail and shows how to configure the MMU for virtual address mapping using both one-level and two-level paging. In addition, it also shows the distinction between low VA space and high VA space mappings. Rather than only discussing the principles, it demonstrates the various kinds of virtual address mappings by complete working example programs.

List of Sample Programs

C6.1: One-level paging using 1 MB sections with VA mapped low

C6.2: Two-level paging using 4 KB pages with VA mapped low

C6.3: One-level paging using 1 MB sections with VA mapped high

C6.4: Two-level paging using 4 KB pages with VA mapped high

Problems

1. In the example C6.1, the level-1 page table is built by the mkptable() function in C.
(1). Why is it possible to build the page table in C?
(2). Implement the mkptable() function in assembly code.
2. The example program C6.2 implements 2-level paging using 4 KB small pages. Modify it to implement 2-level paging using 64 KB large pages.
3. In the example program C6.3, which maps the VA space to 2 GB, the page table is built in assembly code, rather than in C, when the system starts.
(1). Why is it necessary to build the page table in assembly code?
(2).
4. In the example program C6.4, which uses 2-level paging to map VA to 2 GB, the page tables are built in two stages, all in C. Alternatively, the page tables can also be built in a single step, all in assembly code.
(1). Try to build the page tables in assembly code in one step. Compare the amount of programming effort needed in both approaches.
(2). The mkPtable() function of C6.4 contains the lines of code

pgdir = (int *)VA(0x8000); // pgdir at 32 KB

which sets the level-1 page table at 32 KB of physical memory, and

pgdir[i + 2048] = (int)(0x500000 + i*1024) | 0x11;

which fills the level-1 page descriptors with page frames beginning at 5 MB of physical memory. While the first line of code uses a VA, the second line of code uses a PA. Why the difference?

7.1 User Mode Processes

In Chap. 5, we developed a simple uniprocessor kernel for process management. The simple kernel supports dynamic process creation, process synchronization and process communication. It can be used as a model for many simple embedded systems. A simple embedded system comprises a fixed number of processes, all of which execute in the same address space of the kernel. The system can be implemented as event-driven, with processes as execution entities. Events can be interrupts from hardware devices, process cooperation through semaphores or messages from other processes. The disadvantage of this kind of system is the lack of memory protection. An ill-designed or malfunctioning process may corrupt the shared address space, causing other processes to fail. For both reliability and security reasons, each process should run in a private virtual address space that is isolated and protected from other processes. In order to support processes with virtual address spaces, it is necessary to use the memory management hardware to provide both virtual address mapping and memory protection. In Chap. 6, we discussed the ARM Memory Management Unit (MMU) (ARM MMU 2008) in detail and showed how to configure the MMU for virtual address mappings. In this chapter, we shall extend the simple kernel to support user mode processes. In the extended kernel, each process may execute in two different modes, kernel mode and user mode. While in kernel mode, all processes execute in the same address space of the kernel, which is non-preemptive. While in user mode, each process executes in a private virtual address space and is preemptable. User mode processes may enter the kernel through exceptions, interrupts and system calls. A system call is a mechanism which allows user mode processes to enter kernel mode to execute kernel functions. After executing a system call function in the kernel, the process returns to user mode (except for exit, which never returns) with the desired results. For simplicity, we shall ignore exceptions first and focus on developing a kernel to support user mode processes and system calls.

7.2 Virtual Address Space Mapping

When an embedded system boots up, the system kernel is usually loaded to the low end of physical memory, e.g. to the physical address 0 or 16 KB, as in the case of the ARM Versatilepb VM under QEMU. When the kernel starts, it first configures the Memory Management Unit (MMU) to enable virtual address translation. Each process may run in two different modes, Kernel mode and User mode, each with a distinct virtual address space. With 32-bit addressing, the total VA space range is 4 GB. We may divide the 4 GB VA space evenly into two equal halves and assign each mode a VA space range of 2 GB. There are two ways to create the Kernel and User mode VA spaces. In the Kernel Mapped Low (KML) scheme, the Kernel mode VA space is mapped to low virtual addresses and the User mode VA space is mapped to high virtual addresses. In this case, the Kernel VA to PA mapping is usually one-to-one or identity mapping, so that every VA is the same as the PA. The User mode VA space is mapped to the high virtual address range of 0x80000000 (2 GB) and above. In the Kernel Mapped High (KMH) scheme, the VA address mapping is reversed. In this case, the Kernel mode VA space is mapped to high virtual addresses and the User mode VA space is mapped to low virtual addresses. From a memory protection point of view, there is no difference between the two mapping schemes. However, from a programming point of view, there may be some significant differences. For example, in the KML scheme, the kernel can be compile-linked with real addresses. When the kernel starts, it can execute in real address mode directly, without configuring the MMU for address translation first. In contrast, in the KMH scheme, the kernel must be compile-linked with virtual addresses. When the kernel starts, it cannot execute any code that uses virtual addresses directly. In this case, it must configure the MMU to use virtual addresses first. The ARM architecture does not support the notion of floating vector tables, which would allow the vector table to be remapped to any physical memory. On some ARM machines, the vector table can only be remapped to 0xFFFF0000 during booting. Without vector remapping, the vector table must be located at the physical address 0. In the vector table, the Branch or LDR instructions have an address range limit of 4 KB. This implies that both the vector table and the exception handler entry points must all reside in the lowest 4 KB of physical memory. For these reasons, we shall mainly use the KML scheme because it is more natural and simpler. However, we shall also show how to use the KMH scheme and demonstrate its differences from the KML scheme by sample systems.

7.3 User Mode Process

From now on, we shall assume that a process may execute in two different modes: Kernel mode (SVC mode in ARM) and User mode. For the sake of brevity, we shall simply refer to them as Kmode and Umode, respectively. Each mode has its own VA space. When configuring the MMU for virtual address mapping, we shall use the KML scheme, so that the VA space of Kmode is from 0 to the amount of physical memory, and the VA space of Umode is from 0x80000000 (2 GB) to 2 GB plus the size of the Umode image.

7.3.1 User Mode Image

First, we show how to develop User mode process images. A user mode program consists of an assembly file, us.s, and a set of C files, which are shown and explained below.

Explanation of the us.s file: us.s is the entry point of all Umode programs. As will be shown shortly, prior to entering us.s in User mode, the kernel has already set up the execution environment of the program, including a Umode stack. So, upon entry it simply calls main(). If main() returns, it calls _exit(), which issues a syscall(99, 0, 0, 0) to terminate. Umode processes may enter kernel to execute kernel functions via system calls, whose form is

int r = syscall(int a, int b, int c, int d)

When issuing a system call, the first parameter a is the system call number, b, c, d are parameters to the kernel function, and r is the return value. In ARM based systems, a system call, or syscall for short, is implemented by the SWI instruction, which causes the CPU to enter the privileged Supervisor (SVC) mode. Therefore, processes in kernel run in SVC mode. The function get_cpsr() returns the current status register of the CPU. It is used to verify that the process is indeed executing in User mode (mode = 0x10).

Explanation of the ucode.c file: ucode.c contains system call interface functions. When a user mode program runs, it first displays some startup information, such as the CPU mode and starting virtual address. Then, it displays a menu and asks for a user command. For demonstration purposes, each user command issues a system call to the kernel. Each syscall is assigned a number for identification, which corresponds to a function in kernel. The syscall numbering is entirely up to the system designer's choice. Since User mode programs run in Umode address spaces, they cannot access the I/O space in kernel directly. Therefore, basic I/O in Umode, such as ugetc() and uputc(), are also system calls. Since all user mode programs rely on system calls, the same ucode.c file can be shared by all user mode programs. In a real system, the system call interface is usually pre-compiled as part of the link library, which is used by the linker to develop all user mode programs.

(3). u1.c ﬁle: This is the main body of the Umode program. It may be used as a template to develop other Umode programs.

Explanation of the u1.c file: u1.c is the main body of a Umode program. After displaying the CPU mode to verify that it is indeed executing in Umode, it issues system calls to get its pid and the ppid of its parent process. Then it executes an infinite loop. First, it shows the process ID and the starting virtual address of the Umode image. Then it displays a menu. To begin with, the menu only includes four commands: getpid, getppid, ps, and chname. As we continue to expand the kernel, we shall add more user commands later. Each user command invokes an interface function in ucode.c, which issues a system call to execute a syscall function in kernel. For example, the ps command causes the process to execute kps() in kernel, which prints the status information of all PROCs. Each process is initialized with a name string in the PROC.name field. The chname command changes the name string of the current running process. After changing the name, the user may use the ps command to verify the results.

(4). mku script: The mku sh script is used to generate the binary image u1.o

Explanation of the mku script file: The mku script generates a binary executable image file. First, it (cross) compile-links us.s and u1.c into an ELF file with the starting virtual address 0x80000000 (2 GB). Then it uses objcopy to convert the ELF file into a raw binary image file. Before developing a loader to load program images, we shall use the binary image as a raw data section in the kernel image. This is done in the linker script t.ld file.

7.4 System Kernel Supporting User Mode Processes

The system kernel consists of the following components: interrupt handlers, device drivers, I/O and queue manipulation functions, and process management functions. Most of the kernel components, e.g. interrupt handlers, device drivers and basic process management functions, are already covered in previous chapters. In the following, we shall focus on the new features of the kernel. For the sake of clarity, in addition to the section titles we shall also use sequence numbers (in parentheses) to show the kernel code segments.

char name[64]; // name field

int kstack[SSIZE]; // Kmode stack
} PROC;

Each process is represented by a PROC structure. The new fields in the PROC structure in support of Umode operations are

usp, upc, ucpsr: When a process enters kernel via syscall, it saves the Umode sp, lr and cpsr in the PROC structure for return to Umode later.
pgdir: Each process has a level-1 page table (pgdir) pointed to by PROC.pgdir. The pgdir and its associated page tables define the virtual address spaces of the process in both Kernel and User modes.
ppid and parent PROC pointer: parent process pid and pointer to the parent PROC.
exitCode: for the process to terminate with an exitCode value.
name: process name string, used to demonstrate system calls.

(2) ts.s file: The kernel's assembly code consists of five parts, which are highlighted in the code listing shown below.

7.4.2 Reset Handler

The reset_handler consists of three steps.

7.4.2.1 Exception and IRQ Stacks

Step 1: Set up stacks: Assume NPROC=9 PROCs; each PROC has a Kmode stack in the PROC structure. The system starts in SVC mode. Reset_handler initializes the SVC mode stack pointer to the high end of proc[0], thus making proc[0].kstack the initial stack. It also sets spsr to User mode, making the CPU ready to return to User mode when it exits SVC mode. Then it initializes the stack pointers of the other privileged modes, e.g. IRQ, data_abort, undef_abort, etc. Each privileged mode (except FIQ mode, which is not used) has a separate 4 KB stack area (defined in the linker script t.ld) for interrupt and exception processing.

7.4.2.2 Copy Vector Table

Step 2: Copy_vector_table: During reset, the MMU is off and the vector table remap enable bit (the V bit in the MMU control register c1) is 0, meaning that the vector table is not remapped to 0xFFFF0000. At this moment, every address is a physical address. Reset_handler copies the vector table to the physical address 0, as required by the ARM CPU's vector hardware.

7.4.2.3 Create Kernel Mode Page Table

Step 3: Create Kmode Page Table: After initializing the stack pointers of the various privileged modes, reset_handler calls mkPtable() to set up the Kernel mode page table. To begin with, we shall use simple one-level paging with 1 MB sections to create an identity mapping of VA to PA. Assuming 256 MB physical memory plus 2 MB I/O space (of the ARM Versatilepb VM) at 256 MB, the mkPtable() function in C is

The attributes of the page table entries are set to 0x41E for AP = 01, domain = 0000, CB = 11 and type = 10 (Section). Alternatively, the CB bits can be set to 00 to disable instruction and data caching and write buffering. The entire Kmode space is treated as domain 0 with permission bits = 01 for R|W in privileged modes but no access in User mode. Then it enables the MMU for VA to PA address translation. After this, every virtual address VA is mapped to a physical address PA by the MMU hardware. In this case, the VA and PA addresses are the same due to the identity mapping. Alternative virtual address mapping schemes will be discussed later. Then it calls main() in C to continue the kernel initialization.

7.4.2.4 Process Context Switching Function

Part 2: Process Context Switching: tswitch() is for switching process in kernel. When a process gives up the CPU, it calls tswitch(), in which it saves the CPU registers in the process kstack, saves the stack pointer into PROC.ksp and calls scheduler() in C. In scheduler(), the process enters the readyQueue by priority if it is still READY to run. Then it picks the highest priority process from the readyQueue as the next running process. If the next running process is different from the current process, it calls switchPgdir() to switch the page table to that of the next running process. SwitchPgdir() also flushes the TLB, invalidates the instruction and data caches and flushes the write buffer to prevent the CPU from using TLB entries belonging to an old process context. In order to speed up the address translation process, the ARM MMU supports many advanced options, such as locking down instruction and data cache entries, invalidating selected TLB and cache entries, etc. In order to keep the system simple, we shall not use these advanced MMU features.

7.4.2.5 System Call Entry and Exit

Part 3: System Call entry and exit: User mode processes use syscall(a, b, c, d) to execute system call functions in kernel. syscall() issues a SWI to enter SVC mode, which is routed to the SVC handler via the SWI vector.

7.4.2.6 SVC Handler

SVC handler: svc_entry is the system call entry point. System call parameters a, b, c, d are passed in registers r0–r3. Upon entry, the process first saves all the CPU registers in the process Kmode stack (PROC.kstack). In addition, it also saves the Umode sp, lr and cpsr into the PROC's usp, upc and ucpsr fields, respectively. In order to access Umode registers, it temporarily switches the CPU to System mode, which shares the same set of registers with User mode. Then, it replaces the saved lr in kstack with the Umode upc. This is because the saved lr points to the SWI instruction in Umode, not the upc at the time of the system call. Then, it enables IRQ interrupts and calls svc_handler() in C, which actually handles the system call. When the process exits kernel, it executes goUmode() to return to Umode. In the goUmode code, it first restores the Umode sp and cpsr from the saved usp and ucpsr in the PROC. Then it returns to Umode by

ldmfd sp!, {r0-r12, pc}^


The reader may wonder why it is necessary to save and restore the Umode sp and cpsr during system calls. The problem is as follows. When a process enters kernel, it may not return to User mode immediately. For example, the process may become suspended in kernel and switch to another process. When the new process returns from kernel to User mode, the CPU's usp and spsr would be those of the suspended process, not those of the current process. It is noted that in most stack-oriented architectures, saving and restoring the User mode stack pointer and status register is automatic during system calls. For example, in the Intel x86 CPU (Intel 1990, 1992), the INT instruction is similar to the ARM SWI instruction; both cause the process to enter Kernel mode. The major difference is that when the Intel x86 CPU executes the INT instruction, it automatically stacks the User mode [uss, usp], uflags, [ucs, upc], which are equivalent to the User mode SP, CPSR, LR of the ARM CPU. When the Intel x86 CPU exits Kernel mode by IRET (which is similar to the ^ operation involving PC in ARM), it restores all the saved Umode registers from the Kernel mode stack. In contrast, the ARM processor does not stack any Umode registers automatically when it enters a privileged mode. The system programmer must do the save and restore operations manually.

7.4.2.7 Exception Handlers

(4) Exception Handlers: For the time being, we only handle data_abort exceptions, which are used to demonstrate the memory protection capability of the MMU. All other exception handlers are while(1) loops. The following shows the algorithm of the data_abort exception handler.

/*** exceptions.c file: only show data_abort_handler ***/

7.4.3 Kernel Code

(5) Kernel.c file: This file defines the kernel data structures and implements kernel functions. When the system starts, reset_handler calls main(), which calls kernel_init() to initialize the kernel. First, it initializes the free PROC list and the readyQueue. Then it creates the initial process P0, which runs only in Kmode with the lowest priority 0. Then it sets up page tables for the PROCs. When setting up the process page tables, we shall assume that the kernel's VA space is mapped low, from 0 to the amount of available physical memory. The User mode VA space is mapped high, from 0x80000000 (2 GB) to 2 GB + Umode image size. Each PROC has a unique pid (1 to NPROC-1) and a level-1 page table pointer pgdir. The process page tables are constructed in the physical memory area at 6 MB. Each page table requires 4096 * 4 = 16 KB of space. The 1 MB area from 6 to 7 MB has enough space for 64 PROC page tables. Each process (except P0) has a page table at 6 MB + (pid − 1) * 16 KB. In each page table, the low 2048 entries define the process Kernel mode address space, which are identical for all processes since they share the same address space in Kmode. The high 2048 entries of the page table define the process Umode address space, which are filled in only when the process is created.

In the following, we shall assume that each process (except P0) has a 1 MB Umode image at 8 MB + (pid − 1) * 1 MB, e.g. P1 at 8 MB, P2 at 9 MB, etc. This assumption is not critical. If desired, the reader may assume different Umode image sizes. With a 1 MB Umode image size, each page table only needs one entry for the User mode VA space, i.e. entry 2048 points to the 1 MB Umode area of the process. Recall that we have designated the Kernel mode memory area as domain 0. We shall assign all User mode memory areas to domain 1. Accordingly, we set the Umode page entry attributes to 0xC3E (AP = 11, domain = 0001, CB = 11 and type = 10). The access permission (AP) bits of domain 0 are set to 01 to allow access from Kmode but not from Umode. However, either the AP bits of the Umode page table descriptor or the access bits of domain 1 (in the domain access control register) must be set to 11 to allow access from Umode. Since the access bits of domain 1 are set to 01 in switchPgdir(), the AP bits of the Umode page descriptors must be set to 11. The following lists the kernel.c file code.

The function svc_handler() is essentially a system call router. System call parameters (a, b, c, d) are passed in registers r0–r3, which are the same in all CPU modes. Based on the system call number a, the call is routed to a corresponding kernel function. The kernel.c file implements all the system call functions in kernel. At this moment, our purpose is to demonstrate the mechanism and control flow of system calls. Exactly what the system call functions do is unimportant. So we only implement four very simple system calls: getpid, getppid, ps and chname. Each function returns a value r, which is loaded into the saved r0 in kstack as the return value to User mode.

(7) t.c file: This is the main body of the system kernel. Since most of the system components, such as interrupts, device drivers, etc. are already explained in previous chapters, we shall only focus on the new features, which are highlighted in boldface lines. Before entering main(), the kernel mode page table is already set up in ts.s by the mkPtable() function and the MMU is enabled for address translation. Because of the identity mapping of virtual addresses to physical addresses in kernel mode, no changes are needed in the kernel code. When main() starts, it copies the u1 program image to 8 MB. This is because we assume that the User mode image of the process P1 is at the physical address 8 MB, but it runs in the virtual address space of 0x80000000 to 0x80100000 (2 GB to 2 GB + 1 MB) in User mode. Then it calls kfork() to create the process P1 and switches process to run P1.

7.4.3.1 Create Process with User Mode Image

(8) fork.c ﬁle: This ﬁle implements the kfork() function, which creates a child process with a User mode image u1 in the User mode area of the new process. In addition, it also ensures that the new process can execute its Umode image in User mode when it runs.

Explanation of kfork(): kfork() creates a new process with a User mode image and enters it into the readyQueue. When the new process begins to run, it first resumes in kernel. Then it returns to User mode to execute the Umode image. The current kfork() function is the same as in Chap. 5 except for the User mode image part. Since this part is crucial, we shall explain it in more detail. In order for a process to execute its Umode image, we may ask the question: how could the process have wound up in the readyQueue? The sequence of events must be as follows.

(1). It did a system call from Umode by SWI #0, which causes it to enter kernel to execute the SVC handler (in ts.s), in which it uses STMFD sp!, {r0–r12, lr} to save the Umode registers into the (empty) kstack, which becomes

|ur0|ur1| ... |ur12|uLR|
-14 -13        -2   -1

where the prefix u denotes Umode registers and −i means SSIZE−i. It also saves the Umode sp and cpsr into PROC.usp and PROC.ucpsr. Then, it called tswitch() to give up the CPU, in which it again uses STMFD sp!, {r0–r12, lr} to save the Kmode registers into kstack. This adds one more frame to kstack, which becomes

|kr0|kr1| ... |kr12|kLR|ur0|ur1| ... |ur12|uLR|
-28 -27        -16  -15 -14 -13        -2   -1

where the preﬁx k denotes Kmode registers. In the PROC kstack,

kLR = where the process called tswitch(), and that's where it shall resume to;
uLR = where the process did the system call, and that's where it shall return to when it goes back to Umode.

Since the process never really ran before, all the other "saved" CPU registers do not matter, so they can all be set to 0. Accordingly, we initialize the new process kstack as follows.

1. Clear all "saved" registers in kstack to 0

for (i=1; i<29; i++){ p->kstack[SSIZE-i] = 0; }

2. Set saved ksp to kstack[SSIZE-28]

p->ksp = &(p->kstack[SSIZE-28]);

3. Set kLR = goUmode, so that p will resume to goUmode (in ts.s)

p->kstack[SSIZE-15] = (int)goUmode;

4. Set uLR to VA(0), so that p will execute from VA=0 in Umode

p->kstack[SSIZE-1] = VA(0); // beginning of Umode image

5. Set new process usp to point at ustack TOP and ucpsr to Umode

p->usp = (int *)VA(UIMAGE_SIZE); // high end of Umode image

p->ucpsr = (int *)0x10; // Umode status register

7.4.3.2 Execution of User Mode Image

With this setup, when the new process begins to run, it ﬁrst resumes to goUmode (in ts.s), in which it sets Umode sp=PROC.usp, cpsr=PROC.ucpsr. Then it executes

ldmfd sp!, {r0-r12, pc}^

which causes it to execute from uLR = VA(0) in Umode, i.e. from the beginning of the Umode image with the stack pointer pointing at the high end of the Umode image. Upon entry to us.s, it calls main() in C. To verify that the process is indeed executing in Umode, we get the CPU's cpsr register and show the current mode, which should be 0x10. To test memory protection by the MMU, we try to access VAs outside of the process 1 MB VA range, e.g. 0x80200000, as well as a VA in kernel space, e.g. 0x4000. In either case, the MMU should detect the error and generate a data abort exception. In the data_abort handler, we read the MMU's fault_status and fault_address registers to show the cause of the exception as well as the VA address that caused the exception. When the data_abort handler finishes, we let it return to PC-4, i.e. skip over the bad instruction that caused the data_abort exception, allowing the process to continue. In a real system, when a Umode process commits a memory access exception, it is a very serious matter, which usually causes the process to terminate. As for how to deal with exceptions caused by Umode processes in general, the reader may consult Chap. 9 of Wang (2015) on signal processing.

In the linker script, u1.o is used as a raw binary data section in the kernel image. For each raw data section, the linker exports its symbolic addresses, such as

_binary_u1_start, _binary_u1_end, _binary_u1_size

which can be used to access the raw data section in the loaded kernel image.

7.4.5 Demonstration of Kernel with User Mode Process

Figure 7.1 shows the output screen of running the C7.1 program. When a process begins to execute the Umode image, it first gets the CPU's cpsr to verify that it is indeed executing in User mode (mode = 0x10). Then it tries to access some VAs outside of its VA space. As the figure shows, each invalid VA generates a data abort exception. After testing the MMU for memory protection, it issues syscalls to get its pid and ppid. Then, it shows a menu and asks for a command to execute. Each command issues a syscall, which causes the process to enter kernel to execute the corresponding syscall function in kernel. Then it returns to Umode and prompts for a command to execute again. The reader may run the program and enter commands to test system calls in the sample system.

7.5 Embedded System with User Mode Processes

Based on the example program C7.1, we propose two different models for embedded systems to support multiple User modeprocesses.

7.5.1 Processes in the Same Domain

Instead of a single Umode image, we may create many Umode images, denoted by u1, u2, …, un. Modify the linker script to include all the Umode images as separate raw data sections in the kernel image. When the system starts, create n processes,

Fig. 7.1 Demonstration of user mode process and system calls

P1, P2, …, Pn, each executing a corresponding image in User mode. Modify kfork() to kfork(int i), which creates process Pi and loads the image ui to the memory area of process Pi. On ARM based systems, use the simplest memory management scheme by allocating each process a 1 MB Umode image area by process PID, e.g. P1 at 8 MB, P2 at 9 MB, etc. Some of the processes can be periodic while others can be event-driven. All the processes run in the same virtual address space of [2 GB, 2 GB + 1 MB], but each has a separate physical memory area, which is isolated from other processes and protected by the MMU hardware. We demonstrate such a system by an example.

7.5.2 Demonstration of Processes in the Same Domain

In this example system, we create 4 Umode images, denoted by u1 to u4. All Umode images are compile-linked with the same starting virtual address 0x80000000. They execute the same ubody(int i) function. Each process calls ubody(pid) with a unique process ID number for identification. When setting up the process page tables, the kernel space is assigned the domain number 0. All Umode spaces are assigned the domain number 1. When a process begins execution in Umode, it allows the user to test memory protection by trying to access invalid virtual addresses, which would generate memory protection faults. In the data abort exception handler, it displays the MMU's fault_status and fault_addr registers to show the exception cause as well as the faulting virtual address. Then each process executes an infinite loop, in which it prompts for a command and executes the command. Each command invokes a system call interface, which issues a system call, causing the process to execute the system call function in kernel. To demonstrate additional capabilities of the system, we add the following commands:

switch : enter kernel to switch process;

If desired, we may also use sleep/wakeup to implement event-driven processes, as well as for process cooperation. To support and test the added User mode commands, simply add them to the command and syscall interfaces.

In the kernel t.c file, after system initialization it creates 4 processes, each with a different Umode image. In kfork(pid), it uses the new process pid to load the corresponding image into the process memory area. The loading addresses are P1 at 8 MB, P2 at 9 MB, P3 at 10 MB and P4 at 11 MB. Process page tables are set up in the same way as in Program C7.1. Each process has a page table in the 6 MB area. In the process page tables, entry 2048 (VA = 0x80000000) points to the process Umode image area in physical memory.

Figure 7.2 shows the outputs of running the C7.2 program. It shows that the switch command switches the running process from P1 to P2. The reader may run the system and enter other User mode commands to test the system.

7.5.3 Processes with Individual Domains

In the first system model of C7.2, each process has its own page table. Switching process requires switching page tables, which in turn requires flushing the TLB and the I and D caches. In the next system model, the system supports n < 16 user mode processes with only one page table. In the page table, the first 2048 entries define the kernel mode virtual address space, which is identity mapped to the available physical memory (258 MB). As before, the kernel space is designated as domain 0. Assume that all user mode images are 1 MB in size. Entry 2048 + pid of the page table maps to the 1 MB physical memory area of the process with that pid. The pgdir entry attributes are set to 0xC1E | (pid ≪ 5), so that each user image area is in a unique domain numbered by its pid (1 to n). When switching process to Pi, instead of switching page tables, it calls

set_domain( (1 << 2*pid) | 0x01);

which sets the access bits of domain 0 and domain pid to b01 and clears the access bits of all other domains to 0, making them inaccessible. This way, each process runs only in its own virtual address space, which is protected from other processes. Naturally, processes in kernel mode can still access all the memory because they run in the privileged SVC mode. The limitation of this model is that the system can only support 15 user mode processes. Another drawback is that each user mode image must be compile-linked with a different starting virtual address that matches its page table index. The sample system C7.3 implements processes with individual domains.

7.5.4 Demonstration of Processes with Individual Domains

Figure 7.3 shows the outputs of running the C7.3 program. As the figure shows, all processes share the same page table at 0x4000, but each process has a different entry in the page table. The figure also shows that each process can only access its

Fig. 7.2 Outputs of sample system C7.2

own VA space. Any attempt to access a VA outside of its VA space will generate a data_abort exception due to an invalid domain access.

7.6 RAM Disk

In the previous programming examples, User mode images are included as raw data sections in the kernel image. When the kernel image boots up, it relies on the symbolic addresses generated by the linker to load (copy) the various User mode images to their memory locations. This scheme works well if the number of User mode images is small. It can be very tedious when the number of User mode images becomes large, which also increases the kernel image size. If we intend to run a large number of User mode processes with different images, a better way to manage the user mode images is needed. In this section, we shall show how to use a ramdisk file system to manage user mode images. First, we create a virtual ramdisk and format it as a file system. Then we generate User mode images as executable ELF files in the ramdisk file system. When booting up the system, we also load the ramdisk image to make it accessible to the kernel. When the kernel starts, we move the ramdisk image to a known memory area, e.g. at 4 MB, and use it as a RAMdisk in memory. There are two ways to make the ramdisk image accessible to the kernel.

(1). Include the ramdisk image as a raw data section: Convert the ramdisk image to binary and include it as a raw data section in the kernel image, similar to the individual Umode images before.

(2). As an initial ramdisk image: Run QEMU with the –initrd ramdisk option, as in

qemu-system-arm –M versatilepb –m 256M –kernel t.bin –initrd ramdisk


Fig. 7.3 Outputs of sample system C7.3

QEMU will load the kernel image to 0x10000 (64 KB) and the initial ramdisk image to 0x4000000 (64 MB). Although the QEMU documents state that the initial ramdisk image will be loaded to 0x800000 (8 MB), it actually depends on the memory size of the virtual machine. As a general rule, the VM's memory size should be a power of 2. The ramdisk image loading address is the memory size divided by 2, with an upper limit of 128 MB. The following lists some of the commonly used VM memory sizes and the loading addresses of the initial ramdisk image.

VM memory size (MB)    Ramdisk loading address (MB)
16                     8
32                     16
64                     32
128                    64
512                    128 (upper limit)

A simple way to find out the loading address of the ramdisk image is to dump a string, e.g. "ramdisk begin", to the beginning of the ramdisk image. When the kernel starts, scan each 1 MB memory area to detect the string. Once the loading address is known, we can move the ramdisk image to a memory location and use it as a RAMdisk. In order to access Umode images in the RAMdisk file system, we add a RAMdisk driver to read-write RAMdisk blocks. Then we develop an ELF file loader to load the image files into process memory areas. There are several popular file systems used in embedded systems. Most early embedded systems used the Microsoft FAT file system. Since we use Linux as the development platform, we shall use a Linux compatible file system in order to avoid any unnecessary file conversions. For this reason, we shall use the EXT2 file system, which is totally Linux compatible. The reader may consult (EXT2 2001; Card et al. 1995; Cao et al. 2007) for EXT2 file system specifications. In the following, we show how to create an EXT2 file system image and use it as a ramdisk.

7.6.1 Creating RAM Disk Image

(1). Under Linux, run the following commands (or as a sh script). For small EXT2 ﬁle systems, it is better to use 1 KB ﬁle block size.

Alternatively, we may also rely on QEMU to load the ramdisk image by the –initrd option directly, as in

qemu-system-arm –M versatilepb –m 128M –kernel t.bin –initrd ramdisk

In that case, we may either move the ramdisk image from its loading address (64 MB) to 4 MB or use the ramdisk loading address directly. The following shows the RAMdisk block I/O functions, which are essentially memory copy functions.

7.6.2 Process Image File Loader

An image loader consists of two parts. The first part is to locate the image file and check whether it is executable. The second part is to actually load the image file's executable contents into memory. We explain each part in more detail. Part 1 of the image loader: In an EXT2 file system, each file is represented by a unique INODE data structure, which contains all the information of the file. Each INODE has an inode number (ino), which is its position (counting from 1) in the INODE table. To find a file amounts to finding its INODE. The algorithm is as follows.

(1). Read in the superblock (block 1). Check the magic number (0xEF53) to verify it is indeed an EXT2 FS.

(2). Read in the group descriptor block (block 2) to access the group 0 descriptor. From the group descriptor's bg_inode_table entry, find the begin block number of the INODE table; call it InodesBeginBlock.

(3). Read in InodesBeginBlock to get the INODE of the root directory /, which is the second inode (ino = 2) in the INODE table.

(4). Tokenize the pathname into component strings and let the number of components be n. For example, if pathname = /a/b/c, the component strings are "a", "b", "c", with n = 3. Denote the components by name[0], name[1], …, name[n − 1].

(5). Starting from the root INODE in (3), search for name[0] in its data block(s). Each data block of a DIR INODE contains dir_entry structures of the form

[ino rlen nlen NAME] [ino rlen nlen NAME] ……

where NAME is a sequence of nlen chars (without a terminating NULL). For each data block, read the block into memory and use a dir_entry *dp to point at the loaded data block. Then use nlen to extract NAME as a string and compare it with name[0]. If they do not match, step to the next dir_entry by

dp = (dir_entry *)((char *)dp + dp->rlen);

and continue the search. If name[0] exists, we can find its dir_entry and hence its inode number.

(6). Use the inode number (ino) to compute the disk block containing the INODE and its offset in that block by the Mailman's algorithm (Wang 2015).

blk    = (ino - 1) / (BLKSIZE / sizeof(INODE)) + InodesBeginBlock;
offset = (ino - 1) % (BLKSIZE / sizeof(INODE));

Then read in the INODE of /a, from which we can determine whether it is a DIR. If /a is not a DIR, there can't be /a/b, so the search fails. If /a is a DIR and there are more components to search, continue with the next component name[1]. The problem now becomes: search for name[1] in the INODE of /a, which is exactly the same as that of Step (5).

(7). Since Steps 5–6 will be repeated n times, it's better to write a search function

u32 search(INODE *inodePtr, char *name)

{
  // search for name in the data blocks of this INODE
  // if found, return its ino; else return 0
}

Then all we have to do is to call search() n times, as sketched below.

Assume: n, name[0], …., name[n-1] are globals
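The search loop of Steps 5–7 can be sketched as follows. To keep the sketch self-contained, search() is mocked by a small table of (parent ino, name, ino) triples standing in for real directory blocks, and iget() stands in for reading an INODE by number; only the loop structure matches the kernel code.

```c
#include <string.h>

/* mock directory contents: /a has ino 3, /a/b ino 4, /a/b/c ino 5 */
struct { unsigned parent; char name[32]; unsigned ino; } dirTable[] = {
    { 2, "a", 3 }, { 3, "b", 4 }, { 4, "c", 5 },
};

typedef struct inode { unsigned ino; } INODE;

INODE *iget(unsigned ino)               /* stand-in: "read in" an INODE */
{
    static INODE in;
    in.ino = ino;
    return &in;
}

unsigned search(INODE *inodePtr, char *name)  /* mock of the text's search() */
{
    for (int i = 0; i < 3; i++)
        if (dirTable[i].parent == inodePtr->ino &&
            strcmp(dirTable[i].name, name) == 0)
            return dirTable[i].ino;
    return 0;                           /* not found */
}

int n;                                  /* number of components (global) */
char name[8][32];                       /* component strings (globals)   */

unsigned resolve(void)                  /* the Step 5-7 search loop */
{
    unsigned ino = 2;                   /* start from the root INODE */
    INODE *ip = iget(ino);
    for (int i = 0; i < n; i++) {
        ino = search(ip, name[i]);
        if (ino == 0)
            return 0;                   /* component not found: fail */
        ip = iget(ino);                 /* use ino to read the next INODE */
    }
    return ino;                         /* ino is now the pathname's inode */
}
```

For pathname = /a/b/c, resolve() returns the ino of /a/b/c; any missing component makes it return 0.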

If the search loop ends successfully, ip must point at the INODE of the pathname. Then we can check its file type and file header (if necessary) to ensure that it is executable. Part 2 of the image loader: From the file's INODE, we know its size and file blocks. The second part of the loader loads the file's executable contents into memory. This step depends on the executable file type.

(1). A flat binary executable file is a single piece of binary code, which is loaded in its entirety for direct execution. This can be done by converting all ELF files to binary files first. In this case, loading file contents is the same as loading file blocks.
(2). ELF executable file format: an ELF executable file begins with an ELF header, followed by one or more program section headers, which are defined as ELF header and ELF program section header structures.

The reader may consult (ELF 1995) for the ELF ﬁle format. For help information, the reader may also use the (Linux)readelf command to view ELF ﬁle contents. For example,

readelf -eSt file.elf

displays the headers (e), section headers (S) and section details (t) of an ELF file. For ELF executable files, the loader must load the various sections of the ELF file to their specified virtual addresses. In addition, each loaded section should be marked with the appropriate R|W|X attributes for protection. For example, code section pages should be marked RO (read-only), data section pages should be marked RW, etc. For generality, our image loader can load either binary or ELF executable files. The loader's algorithm is as follows. The reader may consult the ELF loader code in the loadelf.c file for implementation details.
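The core of the ELF case can be sketched as below. This is an illustration, not the book's loadelf.c: it assumes the whole ELF file is already in memory, uses the standard Elf32 header layouts (as in <elf.h>), and copies each PT_LOAD segment to its load address inside the process image, zeroing the BSS part where p_memsz exceeds p_filesz.

```c
#include <stdint.h>
#include <string.h>

/* Minimal ELF32 structures (standard layout, as in <elf.h>) */
typedef struct {
    uint8_t  e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version, e_entry, e_phoff, e_shoff, e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum;
    uint16_t e_shentsize, e_shnum, e_shstrndx;
} Elf32_Ehdr;

typedef struct {
    uint32_t p_type, p_offset, p_vaddr, p_paddr;
    uint32_t p_filesz, p_memsz, p_flags, p_align;
} Elf32_Phdr;

#define PT_LOAD 1

/* Copy every PT_LOAD segment of the in-memory ELF file (elf) to its
 * virtual address inside the process image (base). Returns the entry
 * point VA, or -1 if the file does not have the ELF magic number.   */
int load_elf(char *elf, char *base)
{
    Elf32_Ehdr *eh = (Elf32_Ehdr *)elf;
    if (memcmp(eh->e_ident, "\177ELF", 4) != 0)
        return -1;                              /* bad magic number */
    Elf32_Phdr *ph = (Elf32_Phdr *)(elf + eh->e_phoff);
    for (int i = 0; i < eh->e_phnum; i++, ph++) {
        if (ph->p_type != PT_LOAD)
            continue;                           /* skip non-loadable parts */
        memcpy(base + ph->p_vaddr, elf + ph->p_offset, ph->p_filesz);
        memset(base + ph->p_vaddr + ph->p_filesz, 0,
               ph->p_memsz - ph->p_filesz);     /* clear the BSS part */
    }
    return eh->e_entry;
}
```

In the real loader, reading the file's blocks (via the INODE from Part 1) replaces the in-memory `elf` pointer, and the page-protection attributes come from p_flags.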

7.7 Process Management

7.7.1 Process Creation

Most embedded systems are designed for specific application environments. A typical embedded system comprises a fixed number of processes. Each process is an independent execution unit, which does not have any relation or interactions with other processes. For such systems, the current kfork() function, which creates a process to execute a specific function, is

adequate. In order to support dynamic user mode processes, we shall extend the kernel to impose a parent-child relation between processes. When the kernel starts, it runs the initial process P0, which is handcrafted or created by brute force. Thereafter, every other process is created by

int newpid = kfork(char *ﬁlename, int priority);

which creates a new process with a specified priority to execute the Umode image file filename. When creating a new process, the creator is the parent and the newly created process is the child. In the PROC structure, the field ppid records the parent process pid, and the parent pointer points to the parent PROC. Thus, the processes form a family tree with P0 as the root.

7.7.2 Process Termination

In a multitasking system with dynamic processes, a process may terminate or die, which is a common term for process termination. A process may terminate in two possible ways:
Normal termination: the process has completed its task, which may not be needed again for a long time. To conserve system resources, such as PROC structures and memory, the process calls kexit(int exitValue) in kernel to terminate itself, which is the case we are discussing here.
Abnormal termination: the process terminates due to an exception, which renders the process unable to continue. In this case it calls kexit(value) with a unique value that identifies the exception.
In either case, when a process terminates, it eventually calls kexit() in kernel. The general algorithm of kexit() is as follows.
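The kexit() algorithm can be sketched as follows. The PROC fields follow the text; the sketch covers only the first two steps (give away children, record the exit status and turn into a ZOMBIE), so the wakeup of the parent and the final tswitch() appear as comments, and the PROC table here is a simulation, not the kernel's.

```c
#define NPROC  9
#define FREE   0
#define READY  1
#define ZOMBIE 2

typedef struct proc {
    int pid, ppid, status, exitCode;
} PROC;

PROC proc[NPROC];
PROC *running;                   /* the current (dying) process */

/* Sketch of kexit(): Steps 1-2 of the algorithm in the text */
void kexit(int exitValue)
{
    /* 1. give away children, dead or alive, to P1 */
    for (int i = 0; i < NPROC; i++)
        if (proc[i].status != FREE && proc[i].ppid == running->pid)
            proc[i].ppid = 1;
    /* 2. record exitValue and become a ZOMBIE, but keep the PROC */
    running->exitCode = exitValue;
    running->status = ZOMBIE;
    /* 3. real kernel: kwakeup(parent), and P1 if it received orphans */
    /* 4. real kernel: tswitch() for the last time - never returns    */
}
```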

So far, our system model does not yet support a file system. Due to the simple memory allocation scheme, each process runs in a dedicated 1 MB memory area, so deallocation of the user mode image is also trivial. When a process terminates, its user mode memory area is left unused until a process with the same pid is created again. So we begin by discussing Step 2 of kexit(). Since each process is an independent execution entity, it may terminate at any time. If a process with children terminates first, all the children of the process would have no parent anymore, i.e. they become orphans. The question is then: what to do with such orphans? In human society, they would be sent to grandma's house. But what if grandma has already died? Following this reasoning, it immediately becomes clear that there must be a process which should not terminate while other processes still exist. Otherwise, the parent-child process relation would soon break down. In all Unix-like systems, the process P1, which is also known as the INIT process, is chosen to play this role. When a process dies, it sends all the orphaned children, dead or alive, to P1, i.e. they become P1's children. Following suit, we shall also designate P1 as such a process. Thus, P1 should not die while other processes still exist. The remaining problem is how to implement Step 2 efficiently. In order for a dying process to dispose of its children, the process must be able to determine whether it has any child and, if so, find all the children quickly. If the number of processes is small, both questions can be answered effectively by searching all the PROC structures. For example, to determine whether a process has any child, simply search the PROCs for any one that is not FREE and whose ppid matches the process pid. If the number of processes is large, e.g. on the order of hundreds, this simple search scheme would be too slow. For this reason, most large OS kernels keep track of process relations by maintaining a process family tree.

7.7.3 Process Family Tree

Typically, the process family tree is implemented as a binary tree by a pair of child and sibling pointers in each PROC, as in

struct proc *child, *sibling, *parent;

where child points to the first child of a process and sibling points to a list of other children of the same parent. For convenience, each PROC also has a parent pointer pointing to its parent. With a process tree, it is much easier to find the children of a process. First, follow the child pointer to the first child PROC. Then follow the sibling pointers to traverse the sibling PROCs. To send all children to P1, simply detach the children list and append it to the children list of P1 (also changing their ppid and parent pointers). Because of the small number of PROCs in the sample systems of this book, we do not implement the process tree. This is left as a programming exercise. In either case, it should be fairly easy to implement Step 2 of kexit(). Each PROC has a 2-byte exitCode field, which records the process exit status. In Linux, the high byte of exitCode is the exitValue and the low byte is the exception number that caused the process to terminate. Since a process can only die once, only one of the bytes has meaning. After recording exitValue in PROC.exitCode, the process changes its status to ZOMBIE but does not free the PROC. Then the process calls kwakeup() to wake up its parent, and also P1 if it has sent any orphans to P1. The final act of a dying process is to call tswitch() for the last time. After these steps, the process is essentially dead, but it still has a dead body in the form of a ZOMBIE PROC, which will be buried (set FREE) by the parent process through the wait operation.

7.7.4 Wait for Child Process Termination

At any time, a process may call the kernel function

int pid = kwait(int *status)

to wait for a ZOMBIE child. If successful, the returned pid is the ZOMBIE child's pid and status contains the exitCode of the ZOMBIE child. In addition, kwait() also releases the ZOMBIE PROC back to the freeList, allowing it to be reused for another process. The algorithm of kwait() is
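A sketch of kwait() following the description in the text is shown below. The PROC table here is a simulation; in the real kernel the "has children but no ZOMBIE yet" case calls ksleep(running) and retries the while loop, which the sketch marks by returning −2 instead of sleeping.

```c
#define NPROC  9
#define FREE   0
#define READY  1
#define ZOMBIE 2

typedef struct proc {
    int pid, ppid, status, exitCode;
} PROC;

PROC proc[NPROC];
PROC *running;

/* Sketch of kwait(): -1 if no child at all; otherwise reap a ZOMBIE
 * child if one exists; -2 stands for "would sleep and retry".       */
int kwait(int *status)
{
    int hasChild = 0;
    while (1) {
        for (int i = 0; i < NPROC; i++) {
            PROC *p = &proc[i];
            if (p->status != FREE && p->ppid == running->pid) {
                hasChild = 1;
                if (p->status == ZOMBIE) {
                    *status = p->exitCode;   /* collect exit status      */
                    p->status = FREE;        /* release PROC to freeList */
                    return p->pid;           /* return dead child's pid  */
                }
            }
        }
        if (!hasChild)
            return -1;                       /* error: no child          */
        return -2;   /* real kernel: ksleep(running), then loop again    */
    }
}
```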

In the kwait algorithm, the process returns −1 for error if it has no child. Otherwise, it searches for a (any) ZOMBIE child. If it finds a ZOMBIE child, it collects the ZOMBIE child's pid and exitCode, releases the ZOMBIE PROC to the freeList and returns the ZOMBIE child's pid. Otherwise, it goes to sleep on its own PROC address, waiting for a child to terminate. Correspondingly, when a process terminates, it must issue

kwakeup(parent); // parent is a pointer to parent PROC

to wake up the parent. When the parent process wakes up in kwait(), it will find a dead child when it executes the while loop again. Note that each kwait() call handles only one ZOMBIE child, if any. If a process has many children, it may have to call kwait() multiple times to dispose of all the dead children. Alternatively, a process may terminate first without waiting for any dead child. When a process dies, all of its children become children of P1. As we shall see later, in a real system P1 executes an infinite loop, in which it repeatedly waits for dead children, including adopted orphans. Instead of sleep/wakeup, we may also use semaphores to implement the kwait()/kexit() functions. Variations on the kwait() operation include waitpid and waitid of Linux (Linux Man Page 2016), which allow a process to wait for a specific child by pid with many options.

7.7.5 Fork-Exec in Unix/Linux

When a process executes, it may need to save information generated by the execution. A good example is a log, which records important events that occurred during execution. The log can be used to trace process executions for debugging in case something goes wrong. A process may also need inputs to control its paths of execution. Saving and retrieving information require the support of a file system. An operating system kernel usually provides basic file system support to allow processes to do file operations. In such systems, the execution environment of each process includes both its execution context and its ability to access files. The current kfork() mechanism can only create processes to execute different images, but it does not provide any means for file operations. In order to support the latter, we need an alternative way to create and run processes. In Unix/Linux, the system call

int pid = fork();

creates a child process with a Umode image identical to that of the parent. In addition, it also passes all opened file descriptors to the child, allowing the child to inherit the same file operation environment as the parent. If successful, fork() returns the child process pid. Otherwise, it returns −1. When the child process runs, it returns to its own Umode image, and the returned pid is 0. This allows us to write User mode programs as

int pid = fork(); // fork a child process
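A typical code segment built around the returned pid looks like the following. This is ordinary Unix fork() usage, shown with standard POSIX calls; the exit value 42 is just for illustration.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Typical fork() usage: the returned pid tells parent and child apart */
int fork_demo(void)
{
    int pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0) {             /* child: fork() returned 0          */
        /* a real child would usually exec() a new image here        */
        exit(42);               /* terminate with an exit value      */
    }
    /* parent: pid is the child's pid; wait for it to terminate      */
    int status = 0;
    waitpid(pid, &status, 0);
    return WEXITSTATUS(status); /* the child's exit value            */
}
```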

The code segment uses the returned pid to differentiate between the parent and child processes. Upon return from fork(), the child process usually uses the system call

int r = exec(char *ﬁlename, char *para-list);

to change its execution image to a different file, passing the parameters in para-list to the new image when execution starts. If successful, exec() merely replaces the original Umode image with a new image. It is still the same process but with a different Umode image. This allows a process to execute different programs. Fork and exec may be called the bread and butter of Unix/Linux because almost every operation depends on fork-exec. For example, when a user enters a command line of the form

cmdLine = "cmd arg1 arg2 …. argn"

the sh process forks a child and waits for the child to terminate. The child process uses exec to change its image to the cmd file, passing arg1 to argn as parameters to the new image. When the child process terminates, it wakes up the parent sh, which prompts for another command, etc. Note that fork-exec creates a process to execute a new image in two steps. The main advantages of the fork-exec paradigm are twofold. First, fork creates a child with an identical image. This eliminates

the need for passing information across different address spaces between the parent and the child. Second, before executing the new image, the child can examine the command line parameters to alter its execution environment to suit its own needs. For example, the child process may redirect its standard input (stdin) and output (stdout) to different files.

7.7.6 Implementation of Fork

The implementation of fork-exec is rather simple. The algorithm of fork() is

1. get a PROC for the child and initialize it, e.g. ppid = parent pid, priority = 1, etc.;
2. copy the parent's Umode image to the child, so that their Umode images are identical;
3. copy (part of) the parent's kstack to the child's kstack; ensure that the child returns to the same virtual address as the parent but in its own Umode image;
4. copy the parent's usp and spsr to the child;
5. mark the child PROC READY and enter it into the readyQueue;
6. return the child pid;

We explain the fork() code in more detail. When the parent executes fork() in kernel, it has saved its Umode registers in kstack by stmfd sp!, {r0-r12, lr}, and replaced the saved LR with the proper return address to Umode. Therefore, its kstack bottom contains

entry:  -1   -2   -3  ...  -14
saved:  LR  r12  r11  ...   r0

which are copied to the bottom of the child's kstack. These 14 entries will be used by the child to return to Umode when it executes ldmfd sp!, {r0-r12, pc}^ in goUmode. The copied LR at entry −1 allows the child to return to the same VA as the parent, i.e. to the same pid = fork() syscall. In order for the child to return pid = 0, the saved r0 at entry −14 must be set to 0. In order for the child to resume in kernel, we append a RESUME stack frame to the child's kstack for it to resume when it is scheduled to run. The added stack frame must be consistent with the RESUME part of tswitch(). The added kstack frame is shown below, and the child's saved ksp points to entry −28.

Since the child resumes running in kernel, all the "saved" Kmode registers do not matter, except the resume klr at entry −15, which is set to goUmode. When the child runs, it uses the RESUME kstack frame to execute goUmode directly. Then it executes goUmode with the copied syscall stack frame, causing it to return to the same VA as the parent but in its own memory area with a 0 return value.

7.7.7 Implementation of Exec

Exec allows a process to replace its Umode image with a different executable file. We assume that the parameter to exec(char *cmdline) is a command line of the form

cmdline = "cmd arg1 arg2 … argn";

where cmd is the file name of an executable program. If cmd starts with a /, it is an absolute pathname. Otherwise, it is a file in the default /bin directory. If exec succeeds, the process returns to Umode to execute the new file, with the individual token strings as command line parameters. Corresponding to the command line, the cmd program can be written as

main(int argc, char *argv[]){ }

where argc = n + 1 and argv is a NULL-terminated array of string pointers, each pointing to a token string of the form

argv = [ * | * | * | …… | * | NULL ]
        "cmd" "arg1" "arg2" …… "argn"

The following shows the algorithm and implementation of the exec operation.

When execution begins from us.s in Umode, r0 contains p->usp, which points to the original cmdline in the Umode stack. Instead of calling main() directly, it calls a C startup function main0(char *cmdline), which parses the cmdline into argc and argv and then calls main(argc, argv). Therefore, we may write every Umode program in the following standard form as usual.

#include "ucode.c"
main(int argc, char *argv[ ]){ …………… }

The following lists the code segments of main0(), which plays the same role as the standard C startup ﬁle crt0.c.
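The essential job of main0() is to tokenize the cmdline in place into argc/argv before calling main(). The helper below is an illustration of that parsing step, not the book's actual listing; the real main0() also ends by invoking exit() when main() returns.

```c
#include <string.h>

#define MAXARGS 32

/* Tokenize "cmd arg1 ... argn" in place into argv[], returning argc.
 * argv[argc] is set to 0 so the array is NULL-terminated.           */
int parse_cmdline(char *cmdline, char *argv[])
{
    int argc = 0;
    char *s = strtok(cmdline, " ");
    while (s && argc < MAXARGS - 1) {
        argv[argc++] = s;
        s = strtok(0, " ");
    }
    argv[argc] = 0;
    return argc;
}

/* main0() would then be, in outline:
 *   int main0(char *cmdline)
 *   {
 *       char *argv[MAXARGS];
 *       int argc = parse_cmdline(cmdline, argv);
 *       return main(argc, argv);   // exit(0) via VA(4) on return
 *   }
 */
```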

7.7.8 Demonstration of Fork-Exec

We demonstrate a system that supports dynamic processes with fork, exec, wait and exit functions by the sample system C7.4. Since all the other system components have already been covered and explained before, we only show the t.c file containing the main() function. In order to demonstrate exec, we need a different Umode image file. The u2.c file is identical to u1.c, except that it displays in German (just for fun).

Figure 7.4a shows the outputs of running the fork and switch commands. As the figure shows, P1 runs at the physical address PA = 0x800000 (8 MB). When it forks a child P2 at PA = 0x900000 (9 MB), it copies both the Umode image and the kstack to the child. Then it returns to Umode with the child pid = 2. The switch command causes P1 to enter kernel to switch process to P2, which returns to Umode with pid = 0, indicating that it is the forked child. The reader may test the exit and wait operations as follows.

(1) While P2 runs, enter the command wait. P2 will issue a wait syscall to execute kwait() in kernel. Since P2 does not have any child, it returns −1 for the no-child error.
(2) While P2 runs, enter the exit command and enter an exit value, e.g. 1234. P2 will issue an exit(1234) syscall to terminate in kernel. In kexit(), it records the exitValue in its PROC.exitCode, becomes a ZOMBIE and tries to wake up its parent. Since P1 is not yet in the wait state, the wakeup call of P2 has no effect. After becoming a ZOMBIE, P2 is no longer runnable, so it switches process, causing P1 to resume running. While P1 runs, enter the wait command. P1 will enter kernel to execute kwait(). Since P2 has already terminated, P1 will find the ZOMBIE child P2 without going to sleep. It frees the ZOMBIE PROC and returns the dead child's pid = 2, as well as its exit value.
(3) Alternatively, the parent P1 may wait first and the child P2 exit later. In that case, P1 will go to sleep in kwait() until it is woken up by P2 when the latter (or any child) terminates. Thus, the order of parent-wait and child-exit does not matter.

Figure 7.4b shows the outputs of executing the exec command with command line parameters. As the figure shows, for the command line

u2 one two three

the process changes its Umode image to the u2 file. When execution of the new image starts, the command line parameters are passed in as argv[ ] strings with argc = 4.

7.7.9 Simple sh for Command Execution

With fork-exec, we can standardize the execution of user commands by a simple sh. First, we precompile main0.c as crt0.o and put it into the link library as the C startup code of all Umode programs. Then we write Umode programs in C as

Then compile all Umode programs as binary executables in the /bin directory and run sh when the system starts. This can be improved further by changing P1's Umode image to an init.c file. These would give the system similar capability to Unix/Linux in terms of process management and command execution.

7.7.10 vfork

In all Unix-like systems, the standard way of creating processes to run different programs is the fork-exec paradigm. The main drawback of the paradigm is that it must copy the parent process image, which is time-consuming. In most Unix-like systems, the usual behaviors of parent and child processes are as follows.

if (fork()) // parent fork() a child process

After creating a child, the parent waits for the child to terminate. When the child runs, it changes its Umode image to a new file. In this case, copying the image in fork() would be a waste, since the child process abandons the copied image immediately. For this reason, most Unix-like systems support a vfork operation, which creates a child process without

copying the parent's image. Instead, the child process is created to share the same image with the parent. When the child does exec to change image, it only detaches itself from the shared image without destroying it. If every child process behaves this way, the scheme works fine. But what if users do not obey this rule and allow the child to modify the shared image? That would alter the shared image, causing problems for both processes. To prevent this, the system must rely on memory protection. In systems with memory protection hardware, the shared image can be marked as read-only, so that processes sharing the same image can only execute but not modify it. If either process tries to modify the shared image, the image must be split into separate images. In ARM based systems, we can also implement vfork by the following algorithm.

/********************* Algorithm of vfork ***********************/

1. create a child process ready to run in Kmode, return −1 if fails;

2. copy a section of the parent's ustack, from parent.usp all the way back to where it called pid = vfork(), e.g. the bottom 1024 entries; set child usp = parent usp - 1024;
3. let child pgdir = parent pgdir, so that they share the same page table;
4. mark the child as vforked; return the child pid;

For simplicity, in the vfork algorithm we do not mark the shared page table entries READ-ONLY. Corresponding to vfork, the exec function must be modified to account for possibly shared images. The following shows the modified exec algorithm.

/******************** Modiﬁed exec algorithm *********************/

1. fetch cmdline from (possibly shared) Umode image;

2. if the caller is vforked: switch to the caller's own page table and switchPgdir;
3. load the file to the Umode image;
4. copy cmdline to the ustack top and set usp;
5. modify the syscall stack frame to return to VA = 0 in Umode;
6. turn off the vforked flag; return usp;

In the modified exec algorithm, all the steps are the same as before, except Step 2, which switches to the caller's own page table, detaching it from the parent's image. The following lists the code segments of the kvfork() and (modified) kexec() functions, which support vfork.

7.7.11 Demonstration of vfork

The sample system C7.5 implements vfork. To demonstrate vfork, we add a vfork command to the Umode programs. The vfork command calls the uvfork() function, which issues a syscall to execute kvfork() in kernel.

In uvfork(), the process issues a syscall to create a child by vfork(). Then it waits for the vforked child to terminate. The vforked child issues an exec syscall to change image to a different program. When the child exits, it wakes up the parent,

Fig. 7.5 Demonstration of vfork

which would never know that the child was executing in its Umode image before. Figure 7.5 shows the outputs of running the sample system C7.5.

7.8 Threads

In the process model, each process is an independent execution unit in a unique Umode address space. In general, the Umode address spaces of processes are all distinct and separate. The mechanism of vfork() allows a process to create a child process which temporarily shares the same address space with the parent, but they eventually diverge. The same technique can be used to create separate execution entities in the same address space of a process. Such execution entities in the same address space of a process are called light-weight processes, which are more commonly known as threads (Silberschatz et al. 2009). The reader may consult (Posix 1C 1995; Buttlar et al. 1996; Pthreads 2015) for more information on threads and threads programming. Threads have many advantages over processes. For a detailed analysis of the advantages of threads and their applications, the reader may consult Chap. 5 of Wang (2015). In this section, we shall demonstrate the technique of extending vfork() to create threads.

7.8.1 Thread Creation

(1) Thread PROC structures:

As an independent execution entity, each thread needs a PROC structure. Since threads of the same process execute in the same address space, they share many things in common, such as the pgdir, opened file descriptors, etc. It suffices to maintain

only one copy of such shared information for all the threads in the same process. Rather than drastically altering the PROC structure, we shall add a few fields to the PROC structure and use it for both processes and threads. The modified PROC structure is

During system initialization, we put the first NPROC PROCs into the freeList as before, but put the remaining NTHREAD PROCs into a separate tfreeList. When creating a process by fork() or vfork(), we allocate a free PROC from the freeList and the type is PROCESS. When creating a thread, we allocate a PROC from the tfreeList and the type is THREAD. The advantage of this design is that it keeps the needed modifications to a minimum. For instance, the system falls back to the pure process model if NTHREAD = 0. With threads, each process may be regarded as a container of threads. When a process is created, it is created as the main thread of the process. With only the main thread, there is virtually no difference between a process and a thread. However, the main thread may create other threads in the same process. For simplicity, we shall assume that only the main thread may create other threads, and the total number of threads in a process is limited to TMAX.

(2) Thread creation: The system call

int thread(void *fn, int *ustack, int *ptr);

creates a thread in the address space of a process to execute a function, fn(ptr), using the ustack area as its execution stack. The algorithm of thread() is similar to that of vfork(). Instead of temporarily sharing the Umode stack with the parent, each thread has a dedicated Umode stack, and the function is executed with a specified parameter (which can be a pointer to a complex data structure). The following shows the code segment of kthread() in kernel.
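Based on the description above and the kvfork() pattern, kthread() may be sketched in pseudo-code as follows; this is an outline under the stated assumptions, not the book's actual listing, and field names such as tcount are taken from the surrounding text.

```
kthread(int fn, int *ustack, int *ptr)    // pseudo-code sketch
{
  if (caller is not a main thread || caller's tcount == TMAX)
      return -1;
  p = get a PROC from tfreeList;  p->type = THREAD;
  p->pgdir = running->pgdir;              // share the caller's address space
  running->tcount++;
  // build a syscall stack frame in p's kstack so that, when scheduled,
  // p resumes to goUmode and "returns" to Umode at fn:
  //   saved return VA = fn;  saved r0 = ptr;
  //   saved usp = high end of ustack;
  //   saved Umode lr = VA(4);            // exit(0) syscall on fn() return
  enter p into readyQueue;
  return p->pid;
}
```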

When a thread starts to run, it first resumes to goUmode. Then it follows the syscall stack frame to return to Umode to execute fn(ptr) as if it were invoked by a function call. When control enters fn(), it uses

stmfd sp!, {fp, lr}

to save the return link register in Umode stack. When the function ﬁnishes, it returns by

ldmfd sp!, {fp, pc}

In order for the fn() function to return gracefully when it finishes, the initial Umode lr register must contain a proper return address. In the scheduler() function, when the next running PROC is a thread, we load the Umode lr register with the value VA(4). At virtual address 4 of every Umode image is an exit(0) syscall (in ts.s), which allows the thread to terminate normally. Each thread may run statically, i.e. in an infinite loop, and never terminate. If needed, it uses either sleep or semaphores to suspend itself. A dynamic thread terminates when it has completed its designated task. As usual, the parent (main) thread may use the wait system call to dispose of terminated child threads. For each terminated child, it decrements its tcount value by 1. We demonstrate a system which supports threads by an example.

7.8.2 Demonstration of Threads

The sample system C7.6 implements support for threads. In the ucode.c file, we add a thread command, and thread-related syscalls to the kernel. The thread command calls uthread(), which issues the thread syscall to create N ≤ 4 threads. All the threads execute the same function fn(ptr), but each has its own stack and a different ptr parameter. Then the process waits for all the threads to finish. Each thread prints its pid, the parameter value and the physical address of the process image. When all the threads have terminated, the main process continues.

Fig. 7.6 Demonstration of threads


Figure 7.6 shows the outputs of running the C7.6 program. As the figure shows, all the threads execute in the same address space (PA = 0x800000) of the parent process P1. It also shows that the parent process P1 waits for all the child threads to terminate by the usual wait-for-child-termination operation.

7.8.3 Threads Synchronization

Whenever multiple execution entities share the same address space, they may access and modify shared (global) data objects. Without proper synchronization, they may corrupt the shared data objects, causing problems for all the execution entities. While it is very easy to create many processes or threads, synchronizing their executions to ensure the integrity of shared data objects is no easy task. The standard tools for thread synchronization in Pthreads are mutexes and condition variables (Pthreads 2015). We shall cover thread synchronization in Chap. 9 when we discuss multiprocessor systems.

7.9 Embedded System with Two-Level Paging

This section shows how to configure the ARM MMU for two-level paged virtual memory to support User mode processes. First, we show the system memory map, which specifies the planned usage of the system memory space.

When the system starts, we first set up one-level paging using 1 MB sections as before. While in this simple paging environment, we initialize the kernel data structures and create and run the initial process P0 in Kmode. In kernel_init(), we set up a new level-1 page table (pgdir), denoted by ktable, at 32 KB, and its associated level-2 page tables (pgtables) at 5 MB, to create an identity mapping of VA to PA. Then we create 64 pgdirs at 6 MB for other processes, each with a pgdir at 6 MB + (pid − 1) * 16 KB. Since the Kmode address spaces of all the processes are identical, their pgdirs are simply copied from ktable. Then we change pgdir to the ktable at 32 KB, which switches the MMU to 2-level paging. The following shows the makePageTable() function code in C.
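The identity-mapping setup can be sketched as below. The pgdir location (32 KB) and the level-2 tables at 5 MB follow the text; the entry count (258 MB, covering 256 MB RAM plus 2 MB of I/O space) and the descriptor attribute bits are assumptions: bits[1:0] = 01 marks a level-1 coarse page table descriptor, bits[1:0] = 10 a level-2 small page, and 0xFFE sets AP/C/B bits for full access; the real kernel's attribute bits may differ. The level-2 tables are passed in as a flat array so the sketch can run in simulation.

```c
#include <stdint.h>

/* Sketch of makePageTable(): identity-map the low 258 MB in 4 KB
 * small pages. pgdir is the level-1 table; pgtables holds the
 * 258 level-2 tables (256 entries, 1 KB each) back to back;
 * pgtablePA is their physical base address (5 MB in the text).   */
void makePageTable(uint32_t *pgdir, uint32_t *pgtables, uint32_t pgtablePA)
{
    for (int i = 0; i < 258; i++) {             /* one entry per MB      */
        uint32_t pt = pgtablePA + i * 1024;     /* this MB's level-2 table */
        pgdir[i] = pt | 0x01;                   /* coarse table descriptor */
        for (int j = 0; j < 256; j++)           /* 256 4 KB pages per MB */
            pgtables[i * 256 + j] =
                ((uint32_t)i << 20) | ((uint32_t)j << 12) | 0xFFE;
    }
}
```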

In the level-1 page table (pgdir) of each process, the high 2048 entries are initially 0s. These entries define the Umode VA space of each process, which will be filled in when the process is created in kfork() or fork(). This part depends on the memory allocation scheme for process Umode images, which may be either static or dynamic.

7.9.1 Static 2-Level Paging

In static paging, each process is allocated a fixed-size memory area of PSIZE for its Umode image. For simplicity, we may assume that PSIZE is a multiple of 1 MB. We allocate each process's Umode image at 8 MB + (pid − 1) * PSIZE. For example, if PSIZE = 4 MB, then P1 is at 8 MB, P2 is at 12 MB, etc. Then we set up the process page tables to access the Umode image as pages. The following shows the code segments in kfork(), fork() and kexec().

7.9.2 Demonstration of Static 2-Level Paging

Figure 7.7 shows the outputs of the sample system C7.7, which uses static 2-level paging.

Fig. 7.7 Demonstration of static 2-level paging

7.9.3 Dynamic 2-Level Paging

In dynamic paging, the Umode memory area of each process consists of page frames that are allocated dynamically. In order to support dynamic paging, the system must manage available memory in the form of free page frames. This can be done by either a bitmap or a link list. If the number of page frames is large, it is more efficient to manage them by a link list. When the system starts, we construct a free page link list, pfreeList, which threads all the free page frames in a linked list. When a page frame is needed, we allocate one from pfreeList. When a page frame is no longer needed, we release it back to pfreeList for reuse. The following code segments show the free page list management functions.

int *free_page_list(int *startva, int *endva) // build pfreeList

7.9.3.1 Modifications to Kernel for Dynamic Paging

When using dynamic paging, the Umode image of each process is no longer a single piece of contiguous memory. Instead, it consists of dynamically allocated page frames, which may not be contiguous. We must modify the kernel code that manages process images to accommodate these changes. The following shows the modified kernel code segments to suit the dynamic paging scheme.

(1). fork1(): fork1() is the base code of both kfork() and fork(). It creates a new process p and sets up its Umode image. Since the level-1 page table (pgdir) must be a single piece of 16 KB-aligned memory, we cannot build it by allocating page frames, because the page frames may not be contiguous or aligned at a 16 KB boundary. So we still build the proc pgdirs at 6 MB as before. We only need to modify fork1() to construct the Umode image by allocating page frames.

(2). kfork(): kfork() creates a new process with initial command-line parameters passed in the Umode stack at the high end of the virtual address space. Since the caller's pgdir is different from that of the new process, we cannot use VA(PSIZE) to access the (high end of the) Umode stack of the new process. Instead, we must use the last allocated page frame to access its Umode stack.

(3). fork(): When copying the image, we must copy the parent's page frames to those of the child process, as shown by the following code segment.

(4). kexec(): We assume that kexec() reuses the same Umode image area of a process. In this case, no changes are needed, except for a vforked process, which must create its own pgdir entries, allocate page frames and switch to its own pgdir.

(5). loader: the image loader must be modified to load the image file into the page frames of the process. The modified (pseudo) loader code is

7.10 KMH Memory Mapping

The sample systems C7.1 to C7.8 use the Kernel Mapped Low (KML) memory mapping scheme, in which the kernel mode space is mapped to low virtual addresses and the user mode space is mapped to high virtual addresses. The mapping scheme can be reversed. In the Kernel Mapped High (KMH) scheme, the kernel mode VA space is mapped high, e.g. to 2 GB (0x80000000), and the user mode VA space is mapped low. In this section, we shall demonstrate the KMH memory mapping scheme, compare it with the KML scheme and discuss the differences between the two mapping schemes.

7.10.1 KMH Using One-Level Static Paging

(1). High VA in kernel image: Assume that the kernel VA space is mapped to 2 GB. In order to let the kernel use high virtual addresses, the link command must be modified to generate high virtual addresses. For the kernel image, the modified link command is

arm-none-eabi-ld -T t.ld ts.o t.o mtx.o -Ttext=0x80010000 -o t.elf

The kernel image is loaded at the physical address 0x10000 but its VA is 0x80010000. For user mode images, the link command is changed to

arm-none-eabi-ld -T u.ld us.o u1.o -Ttext=0x100000 -o u1.elf

Note that the starting virtual address of user mode images is not 0 but 0x100000 (1 MB). We shall explain and justify this later.

Fig. 7.8 Demonstration of dynamic 2-level paging


(2). VA to PA conversion: The kernel code uses VA but page table entries and I/O device base addresses must use PA. For convenience, we deﬁne the macros

#define VA(x) ((x) + 0x80000000)

#define PA(x) ((x) - 0x80000000)

for conversions between VA and PA.

(3). VA for I/O Base Addresses: The base addresses of I/O devices use PA, which must be remapped to VA. For the Versatilepb board, I/O devices are located in the 2 MB I/O space beginning at PA 256 MB. The I/O device base addresses must be mapped to VA by the VA(x) macro.

(4). Initial Page Table: Since the kernel code is compile-linked with VA, we cannot execute the kernel's C code before configuring the MMU for address translation. The initial page table must be constructed in assembly code when the system starts.

(5). The ts.s file: The reset_handler sets up the initial page table and enables the MMU for address translation. Then it sets up the privileged mode stacks, copies the vector table to 0 and calls main() in C. For the sake of brevity, we shall only show the code segments that are relevant to memory mapping. The assembly code shown below sets up an initial one-level page table at PA = 0x4000 (16 KB) using 1 MB sections. In the one-level page table, the 0th entry points to the lowest 1 MB of physical memory, which creates an identity mapping of the lowest 1 MB of the memory space. This is because the vector table and the exception handler entry addresses are all within the lowest 4 KB of physical memory. In the page table, entries 2048 to 2048 + 257 map VA = [2 GB, 2 GB + 258 MB) to the low 258 MB of physical memory, which includes the 2 MB I/O space at 256 MB. The kernel mode space is assigned domain 0, with the access bits set to AP = 01 (client mode) to prevent User mode processes from accessing the kernel's VA space. Then it sets the translation table base (TTB) register and enables the MMU for address translation. After these steps, the system runs with VA in the range of (2 GB to 2 GB + 258 MB).

(6). Kernel.c file: We only show the memory mapping part of kernel_init(). It creates 64 page directories (level-1 page tables) at 6 MB for the 64 PROCs. Each proc's pgdir is at 6 MB + pid * 16 KB. Since the kernel mode VA spaces of all the processes are the same, their kernel mode pgdir entries are all identical. For simplicity, we assume that every process, except P0, has a Umode image of size 1 MB, which is statically allocated at the physical address (pid + 7) MB, i.e. P1 at 8 MB, P2 at 9 MB, etc. In each process pgdir, entry 1 defines the process Umode image. Corresponding to this, every Umode image is compile-linked with the starting VA = 0x100000 (1 MB). During a task switch, if the current process differs from the next process, it calls switchPgdir() to switch pgdir to that of the next process.

(7). Use VA in Kernel Functions: All kernel functions, such as kfork, fork, the image file loader and kexec, must use VA.

(8). Demonstration of KMH Memory Mapping: Fig. 7.9 shows the output of running the program C7.9, which demonstrates the KMH address mapping using 1 MB sections.

The reader may run the system C7.9 and fork other processes. It should show that each process runs in a different PA area but all at the same VA = 0x100000. Variations of the one-level paging scheme are left as programming projects in the Problems section.

7.10.2 KMH Using Two-Level Static Paging

In this section, we shall demonstrate the KMH memory mapping scheme using two-level paging. This is accomplished in three steps.

Step 1: When the system starts, we first set up an initial one-level page table and enable the MMU exactly as in program C7.9. While in this simple paging environment, we can execute the kernel code using high virtual addresses.

Step 2: In kernel_init(), we build a two-level paging pgdir at 32 KB (0x8000). The 0th entry of the level-1 pgdir is for the ID map of the lowest 1 MB of memory. We build its level-2 page table at 48 KB (0xC000) to create an ID mapping of the lowest 1 MB of memory. Assume 256 MB RAM plus the 2 MB I/O space at 256 MB. We build 258 level-2 page tables at 5 MB. Then we build 64 level-1 pgdirs at 6 MB. Each proc has a pgdir at 6 MB + pid * 16 KB. The level-2 page tables of each pgdir are the same. Then we switch pgdir to 6 MB to use two-level paging.

Fig. 7.9 Demonstration of KMH using one-level static paging

Step 3: Assume that each process Umode image is of size USIZE MB, statically allocated at 7 MB + pid * 1 MB. When creating a new process in fork1(), we build its Umode level-1 pgdir entries and level-2 page tables for the image area at 7 MB + pid * 1 MB.

Figure 7.10 shows the outputs of running the sample program C7.10, which demonstrates the KMH mapping scheme using two-level static paging.

7.10.3 KMH Using Two-Level Dynamic Paging

It is fairly easy to extend the two-level KMH static paging to dynamic paging. This is left as an exercise in the Problems section.

7.11 Embedded System Supporting File Systems

So far, we have used a RAMdisk as the file system. Each process image is loaded as an executable file from the RAMdisk. During execution, processes may save information by writing to the RAMdisk. Since the RAMdisk is held in volatile memory, all the RAMdisk contents vanish when the system power is turned off. In this section, we shall use an SD card (SDC) as a persistent storage device to support file systems. Like a hard disk, an SDC can be divided into partitions. The following shows how to create a flat disk image with only one partition. The resulting disk image can be used as a virtual SDC in most virtual machines that support SD cards.

7.11.1 Create SD Card Image

(1). Create a flat disk image ﬁle of 4096 1 KB blocks.

dd if=/dev/zero of=disk.img bs=1024 count=4096

(2). Divide disk.img into partitions. The simplest way is to create a single partition.

fdisk -H 16 -S 512 disk.img # enter n for a new partition, then accept the defaults with Enter

(3). Create a loop device for disk.img.

losetup -o 1048576 --sizelimit 4193792 /dev/loop1 disk.img

(4). Format the partition as an EXT2 ﬁle system.

mke2fs -b 1024 /dev/loop1 3072

(5). Mount the loop device.

mount /dev/loop1 /mnt

(6). Populate it with ﬁles, then umount it.

Fig. 7.10 Demonstration of KMH using 2-level static paging

7.11 Embedded System Supporting File Systems 247

mkdir /mnt/boot; umount /mnt

On a disk image, sectors count from 0. Step (2) creates a partition with first sector = 2048 (the default value used by fdisk) and last sector = (4096 * 2 − 1) = 8191. In Step (3), the start offset and size limit are both in bytes = sector * sector_size (512). In Step (4), the file system size is 4096 − 1024 = 3072 (1 KB) blocks. Since the file system size is less than 4096 blocks, it requires only one block group, which simplifies both the file system traversal algorithm and the management of inodes and disk blocks.

7.11.2 Format SD Card Partitions as File Systems

In the following sections, we shall use an SDC disk image with multiple partitions. The disk image is created as follows.

(2). The MBR file is an MBR image containing a partition table created by fdisk. Rather than using fdisk to partition the new disk image manually again, we simply dump the MBR file to the MBR sector of the disk image. The resulting disk image has 4 partitions.

Partition Start_sector End_sector Size (1 KB blocks)

The partition type does not matter, but it should be set to 0x90 to avoid confusion with other operating systems, such as Linux, which uses partition types 0x82–0x83.

7.11.3 Create Loop Devices for SD Card Partitions

(3). The mkdisk script creates loop devices for the partitions, formats each partition as an EXT2 file system and populates it with some directories. After creating the loop devices, each partition can be accessed by mounting its corresponding loop device, as in

mount /dev/loopN MOUNT_POINT # N=1 to 4

(4). Modify the compile-link script by copying User mode images to the /bin directory of the SDC partitions.

7.12 Embedded System with SDC File System

In this section, we shall show an embedded system, denoted by C7.11, which supports dynamic processes with an SDC as the mass storage device. All User mode images are ELF executables in the /bin directory of an (EXT2) file system on the SDC. The system uses 2-level dynamic paging. The kernel VA space is from 0 to 258 MB. The User mode VA space is from 2 GB to 2 GB + Umode image size. To simplify the discussion, we shall assume that every Umode image size is a multiple of 1 MB, e.g. 4 MB. When creating a process, its level-2 page tables and page frames are allocated dynamically. When a process terminates, its page tables and page frames are released for reuse. The system consists of the following components.

(1). ts.s file: The ts.s file is the same as that of Program C7.8.

(2). Kernel.c file: The kernel.c file is also the same as that of Program C7.8.

7.12.1 SD Card Driver Using Semaphore

(3). sdc.c file: This file contains the SDC driver. The SDC driver is interrupt-driven but it also supports polling, for the following reason. When the system starts, only the initial process P0 is running. After initializing the SDC driver, it calls mbr() to display the SDC partitions. Since there is no other process yet, P0 cannot go to sleep or become blocked. So it uses polling to read SDC blocks. Similarly, it also uses polling to load the Umode image of P1 from the SDC. After creating P1 and switching to run P1, processes and the SDC driver use semaphores for synchronization. In a real system, the CPU is much faster than I/O devices. After issuing an I/O operation to a device, a process usually has plenty of time to suspend itself to wait for the device interrupt. In this case, we may use sleep/wakeup to synchronize the process and the interrupt handler. However, an emulated virtual machine may not obey this timing order. It is observed that on the emulated ARM Versatilepb under QEMU, after a process issues an I/O operation to the SDC, the SDC interrupt handler always finishes first, before the process suspends itself. This makes the sleep/wakeup mechanism unsuitable for synchronization between the process and the SDC interrupt handler. For this reason, the driver uses semaphores for synchronization. In the SDC driver, we define a semaphore s with the initial value 0. After issuing an I/O operation, the process uses P(s) to block itself, waiting for the SDC interrupt. When the SDC interrupt handler completes the data transfer, it uses V(s) to unblock the process. Since the order of P and V on semaphores does not matter, using semaphores prevents any race conditions between processes and the interrupt handler. The following lists the SDC driver code.

7.12.3 Demonstration of SDC File System

Figure 7.11 shows the sample outputs of running the sample system C7.11.

7.13 Boot Kernel Image from SDC

Usually, an ARM based system has an onboard booter implemented in firmware. When such an ARM system starts, the onboard firmware booter first loads a stage-2 booter, e.g. Das U-Boot (UBOOT 2016), from a (FAT) partition of a flash memory or an SDC and executes it. The stage-2 booter then boots up a real operating system, such as Linux, from a different partition. The emulated ARM Versatilepb virtual machine is an exception. When the emulated Versatilepb VM starts, QEMU loads a specified kernel image to 0x10000 and transfers control to the loaded kernel image directly, bypassing the usual booting phase of most other real or virtual machines. In fact, when the Versatilepb VM starts, QEMU simply loads a specified image file and executes the loaded image. It does not know, nor care about, whether the image is an OS kernel or just a piece of executable code. The loaded image could be a booter, which can be used to boot up a real OS kernel from a storage device. In this section, we shall develop a booter for the emulated Versatilepb VM to boot up a system kernel from SDC partitions. In this scheme, each partition of the SDC is an (EXT2) file system. The system kernel image is a file in the /boot directory of an SDC partition. When the system starts, QEMU loads the booter to 0x10000 and executes it first. The booter may ask for a partition to boot, or it may simply boot from a default partition. Then it loads a system kernel image from the /boot directory of the SDC partition and transfers control to the kernel image, causing the OS kernel to start up. The advantages of this scheme are two-fold. First, the system kernel can be loaded to any memory location and run from there, making it no longer confined to 0x10000 as dictated by QEMU. This makes the system more compatible with other real or virtual machines that require a booting phase. Second, the booter can collect information from the user and pass it as booting parameters to the kernel.
If desired, the booter may also set up an appropriate execution environment prior to transferring control to the kernel, which simplifies the kernel's startup code. For instance, if the system kernel is compiled with virtual addresses, the booter can set up the MMU first to allow the kernel to start up by using virtual addresses directly. The following shows the organization of such a system. It consists of a booter, which is loaded by QEMU to 0x10000 when the emulated Versatilepb VM starts. The booter then boots up a system kernel from an SDC partition and starts up the kernel. First, we show the components of the booter program.

Fig. 7.11 Sample outputs of system with SDC

7.13.1 SDC Booter Program

(1). booter's ts.s file: This is the entry point of the booter program. It initializes a UART for serial port I/O during booting. In order to keep the booter code simple, it does not use interrupts, so the UART driver uses polling for serial port I/O. The booter loads a kernel image from an SDC partition to 1 MB. Then it jumps there to start up the kernel.

printf("Welcome to ARM EXT2 Booter\n");

(3). Booter's sd.c ﬁle: This ﬁle implements the SDC driver of the booter. It provides a

getblk(int blk, char *address)

function, which loads a (1 KB) block from the SDC into the specified memory address. In order to keep the booter simple, the booter's SDC driver uses polling for block I/O.

(4). The boot.c file: This file implements the SDC booter. For the emulated ARM Versatilepb VM, the booter is a separate image. It is loaded to 0x10000 by QEMU and starts to execute from there. It then boots up a kernel image in the /boot directory of an SDC partition. The function mbr() displays the partition table of the SDC and prompts for a partition number to boot. It writes the partition and the start sector number to 0x200000 (2 MB) for the kernel to get. Then it calls boot(), which locates the kernel image file in the /boot directory and loads the kernel image to 0x100000 (1 MB).

(5). Kernel and User Mode Images: Kernel and User mode images are generated by the following sh script files, which create the image files and copy them to the SDC partitions. Note that the kernel's starting VA is at 0x100000 (1 MB) and the starting VA of Umode images is at 0x80000000 (2 GB).

7.13.2 Demonstration of Booting Kernel from SDC

The sample system C7.12 demonstrates booting an OS kernel from SDC.

Figure 7.12 shows the UART screen of the booter. It displays the SDC partition table and asks for a partition to boot from. It locates the kernel image file, /boot/kernel, in the SDC partition and loads the kernel image to 0x100000 (1 MB). Then it sends the CPU to execute the loaded kernel code. When the kernel starts, it uses 2-level static paging. Since the kernel is compile-linked with real addresses, it can execute all the code directly when it starts up. In this case, there is no need for the booter to build any page table for the kernel. The page tables will be built by the kernel itself when it starts up. Figure 7.13 shows the startup screen of the kernel. In kernel_init(), it initializes the kernel data structures, builds the 2-level page tables for the processes and switches pgdir to use 2-level static paging.

Fig. 7.12 Demonstration of SDC booter


Fig. 7.13 Demonstration of booting OS Kernel from SDC

7.13.3 Booting Kernel from SDC with Dynamic Paging

The sample system C7.13 demonstrates booting a kernel that uses 2-level dynamic paging. The booter part is the same as before. Instead of static paging, the kernel uses 2-level dynamic paging. Figure 7.14 shows the startup screen of the kernel. As the figure shows, all the page table entries of P1 are dynamically allocated page frames.

Fig. 7.14 Booting OS Kernel from SDC with dynamic paging


7.13.4 Two-Stage Booting

In many ARM based systems, booting up an OS kernel consists of two stages. When an ARM system starts, the system's onboard boot-loader in firmware loads a booter from a storage device, such as a flash memory or an SD card, and transfers control to the loaded booter. The booter then loads an operating system (OS) kernel from a bootable device and transfers control to the OS kernel, causing the OS kernel to start up. In this case, the system's onboard boot-loader is a stage-1 booter, which loads and executes a stage-2 booter, which is designed to boot up a specific OS kernel image. The stage-2 booter can be installed on an SDC for booting up different kernel images. Some ARM boards require that the stage-2 booter be installed in a DOS partition, but the bootable kernel image may reside in a different partition. The principle and technique of installing a booter on an SDC are the same as those of installing a booter on a regular hard disk or USB drive. The reader may consult Chap. 3 on Booting Operating Systems of Wang (2015) for more details. In this section, we demonstrate a two-stage booter for the emulated ARM Versatilepb VM. First, we show the code segments of the stage-1 booter.

(3). t.c ﬁle of stage-1 booter.

sdc_init(); // initialize SDC driver

boot1(); // load stage-2 booter to 2 MB
}

7.13.4.2 Stage-2 Booter

The stage-2 booter is the same as the booter in Sect. 7.13.1. It is installed in the front part of the SDC, from which it will be loaded for execution by the stage-1 booter. The stage-2 booter size is about 8 KB. On an SDC with partitions, partition 1 begins at sector 2048, so blocks 1 to 1023 are free space on the SDC. The stage-2 booter is installed in blocks 1 to 8 of the SDC by the following dd command.

dd if=booter2.bin of=../sdc bs=1024 count=8 seek=1 conv=notrunc

7.13.5 Demonstration of Two-Stage Booting

The sample system C7.14 demonstrates two-stage booting. The system is run from the stage-1 booter directory by

Figure 7.15 shows the screen of the 2-stage booters.

Figure 7.16 shows the kernel startup screen after booting up by the 2-stage booters.

Fig. 7.15 Screen of two-stage booters


Fig. 7.16 Demonstration of 2-stage booters

7.14 Summary

This chapter covers process management, which allows us to create and run processes dynamically in embedded systems. In order to keep the systems simple, it only shows the basic process management functions, which include process creation, process termination, process synchronization and waiting for child process termination. Throughout the chapter, it shows how to use memory management to provide each process with a private User mode virtual address space that is isolated from other processes and protected by the MMU hardware. The memory management schemes use both one-level sections and two-level static and dynamic paging. In addition, it also discussed the advanced concepts and techniques of vfork and threads. Lastly, it showed how to use SD cards for storing both kernel and user mode image files in an SDC file system and how to boot up a system kernel from SDC partitions. With this background, we are ready to show the design and implementation of general purpose operating systems for embedded systems.

1. In the example program C7.1, both tswitch() and svc_handler() use

stmfd sp!, {r0-r12, lr}

to save all CPU registers, which may be unnecessary since the code generated by most ARM C compilers preserves registers r4–r12 across function calls. Assume that both tswitch() and svc_handler() use

stmfd sp!, {r0-r3, lr}

to save CPU registers. Rewrite tswitch(), svc_handler() and kfork() of the system. Verify that the modified system works.

2. Modify the sample system C7.1 to use a 4 MB Umode image size.

3. In the example program C7.1, it builds the level-1 page tables of all (64) PROCs statically in the memory area at 6 MB. Modify it to build the process level-1 page table dynamically, i.e. only when a process is created.

4. In the example program C7.1, each process runs in a different Umode area, but the Umode stack pointer of every process is initialized as follows. Explain why and how this works.

#define UIMAGE_SIZE 0x100000

p->usp = (int *)VA(UIMAGE_SIZE);

5. In the example program C7.1, it assumes that the VA spaces of both Kmode and Umode are 2 GB. Modify it for a 1 GB Kmode VA space and a 3 GB Umode VA space.

6. For the sample system C7.4, implement a process family tree and use it in kexit() and kwait().

7. For the sample system C7.4, modify the kexit() function to implement the following policy. (1). A terminating process must dispose of its ZOMBIE children, if any, first. (2). A process cannot terminate until all its children processes have terminated. Discuss the advantages and disadvantages of these schemes.

8. In all the example programs, each PROC structure has a statically allocated 4 KB kstack. (1). Implement a simple memory manager to allocate/deallocate memory dynamically. When the system starts, reserve a 1 MB area, e.g. beginning at 4 MB, as a free memory area. The function

char *malloc(int size)

allocates a piece of free memory of size in 1 KB units. When a memory area is no longer needed, it is released back to the free memory area by

void mfree(char *address, int size)

Design a data structure to represent the currently available free memory. Then implement the malloc() and mfree() functions. (2). Modify the kstack field of the PROC structure to be an integer pointer

int *kstack;

and modify the kfork() function as

int kfork(int func, int priority, int stack_size)

which dynamically allocates a memory area of stack_size for the new task. (3). When a task terminates, its stack area must be (eventually) freed. How can this be implemented? If you think you may simply release the stack area in kexit(), think carefully again.

9. Modify the example program C7.5 to support large Umode image sizes, e.g. 4 MB.

10. In the example program C7.5, assume that the Umode image size is not a multiple of 1 MB, e.g. 1.5 MB. Show how to set up the process page tables to suit the new image size.

11. Modify the example program C7.10 to use 2-level static paging.

12. Modify the example program C7.10 to map the kernel VA space to [2 GB, 2 GB + 512 MB].

13. Modify the example program C7.11 to use two-level dynamic paging.

8.1 General Purpose Operating Systems

A General Purpose Operating System (GPOS) is a complete OS that supports process management, memory management, I/O devices, file systems and a user interface. In a GPOS, processes are created dynamically to perform user commands. For security, each process runs in a private address space that is isolated from other processes and protected by the memory management hardware. When a process completes a specific task, it terminates and releases all its resources to the system for reuse. A GPOS should support a variety of I/O devices, such as keyboard and display for the user interface, and mass storage devices. A GPOS must support a file system for saving and retrieving both executable programs and application data. It should also provide a user interface for users to access and use the system conveniently.

8.2 Embedded General Purpose Operating Systems

In the early days, embedded systems were relatively simple. An embedded system usually consisted of a microcontroller, which was used to monitor a few sensors and generate signals to control a few actuators, such as to turn on LEDs or activate relays to control external devices. For this reason, the control programs of early embedded systems were also very simple. They were written in the form of either a super-loop or an event-driven program structure. However, as computing power and the demand for multi-functional systems increase, embedded systems have undergone a tremendous leap in both applications and complexity. In order to cope with the ever increasing demands for extra functionality and the resulting system complexity, traditional approaches to embedded OS design are no longer adequate. Modern embedded systems need more powerful software. Currently, many mobile devices are in fact high-powered computing machines capable of running full-fledged operating systems. A good example is smart phones, which use the ARM core with gigabytes of internal memory and multi-gigabyte micro SD cards for storage, and run adapted versions of Linux, such as Android (Android 2016). The current trend in embedded OS design is clearly moving in the direction of developing multi-functional operating systems suitable for the mobile environment. In this chapter, we shall discuss the design and implementation of general purpose operating systems for embedded systems.

8.3 Porting Existing GPOS to Embedded Systems

Instead of designing and implementing a GPOS for embedded systems from scratch, a popular approach to an embedded GPOS is to port an existing OS to embedded systems. Examples of this approach include porting Linux, FreeBSD, NetBSD and Windows to embedded systems. Among these, porting Linux to embedded systems is an especially common practice. For example, Android (2016) is an OS based on the Linux kernel. It is designed primarily for touch screen mobile devices, such as smart phones and tablets. The ARM based Raspberry PI single board computer runs an adapted version of Debian Linux, called Raspbian (Raspberry PI-2 2016). Similarly, there are also widely publicized works which port FreeBSD (2016) and NetBSD (Sevy 2016) to ARM based systems. When porting a GPOS to embedded systems, there are two kinds of porting. The first kind can be classified as procedural oriented porting. In this case, the GPOS kernel is already adapted to the intended platform, such as ARM based

systems. The porting work is concerned primarily with how to configure the header files (.h files) and directories in the source code tree of the original GPOS, so that it will compile-link to a new kernel for the target machine architecture. In fact, most reported work on porting Linux to ARM based systems falls into this category. The second kind of porting is to adapt a GPOS designed for a specific architecture, e.g. the Intel x86, to a different architecture, such as the ARM. In this case, the porting work usually requires redesign and, in many cases, completely different implementations of the key components in the original OS kernel to suit the new architecture. Obviously, the second kind of porting is much harder and more challenging than procedural oriented porting, since it requires a detailed knowledge of the architectural differences, as well as a complete understanding of operating system internals. In this book, we shall not consider procedural oriented porting. Instead, we shall show how to develop an embedded GPOS for the ARM architecture from ground zero.

8.4 Develop an Embedded GPOS for ARM

PMTX (Wang 2015) is a small Unix-like GPOS originally designed for Intel x86 based PCs. It runs on uniprocessor PCs in 32-bit protected mode using dynamic paging. It supports process management, memory management, device drivers, a Linux-compatible EXT2 file system and a command-line based user interface. Most ARM processors have only a single core. In this chapter, we shall focus on how to adapt PMTX to single-CPU ARM based systems. Multicore CPUs and multiprocessor systems will be covered later in Chap. 9. For ease of reference, we shall denote the resulting system as EOS, for Embedded Operating System.

8.5 Organization of EOS

8.5.1 Hardware Platform

EOS should be able to run on any ARM based system that supports suitable I/O devices. Since most readers may not have a real ARM based hardware system, we shall use the emulated ARM Versatilepb VM (ARM Versatilepb 2016) under QEMU as the platform for implementation and testing. The emulated Versatilepb VM supports the following I/O devices.

For simplicity, the virtual SDC has only one partition, which begins at the (fdisk default) sector 2048. After creating the virtual SDC, we set up a loop device for the SDC partition and format it as an EXT2 file system with 4 KB block size and one block group. The single block group on the SDC image simplifies both the file system traversal and the inode and disk block management algorithms. Then we mount the loop device and populate it with DIRs and files, making it ready for use. The resulting file system size is 128 MB, which should be big enough for most applications. For larger file systems, the SDC can be created with multiple block groups, or multiple partitions. The following diagram shows the SDC contents.

|– boot : bootable kernel images

On the SDC, the MBR sector (0) contains the partition table and the beginning part of a booter. The remaining part of the booter is installed in sectors 2 to booter_size, assuming that the booter size is no more than 2046 sectors or 1023 KB (the actual booter size is less than 10 KB). The booter is designed to boot up a kernel image from an EXT2 file system in an SDC partition. When the EOS kernel boots up, it mounts the SDC partition as the root file system and runs on the SDC partition.

(2). LCD: the LCD is the primary display device. The LCD and the keyboard play the role of the system console.
(3). Keyboard: this is the keyboard device of the Versatilepb VM. It is the input device for both the console and UART serial terminals.
(4). UARTs: these are the (4) UARTs of the Versatilepb VM. They are used as serial terminals for users to log in. Although it is highly unlikely that an embedded system will have multiple users, our purpose is to show that the EOS system is capable of supporting multiple users at the same time.
(5). Timer: the Versatilepb VM has four timers. EOS uses timer0 to provide a time base for process scheduling, timer service functions, as well as general timing events, such as maintaining Time-of-Day (TOD) in the form of a wall clock.

EOS is implemented mostly in C, with less than 2% of assembly code. The total line count of the EOS kernel is approximately 14,000.

8.5.4 Capabilities of EOS

The EOS kernel consists of process management, memory management, device drivers and a complete file system. It supports dynamic process creation and termination. It allows processes to change execution images to execute different programs. Each process runs in a private virtual address space in User mode. Memory management is by two-level dynamic paging. Process scheduling is by both time-slice and dynamic process priority. It supports a complete EXT2 file system that is fully Linux compatible. It uses block device I/O buffering between the file system and the SDC driver to improve efficiency and performance. It supports multiple user logins from the console and serial terminals. The user interface sh supports execution of simple commands with I/O redirection, as well as multiple commands connected by pipes. It unifies exception handling with signal processing, and it allows users to install signal catchers to handle exceptions in User mode.

8.5.5 Startup Sequence of EOS

The startup sequence of EOS is as follows. First, we list the logical order of the startup sequence. Then we explain each step in detail.

(1). Booting the EOS kernel

(2). Execute reset_handler to initialize the system
(3). Configure vectored interrupts and device drivers
(4). kernel_init: initialize kernel data structures, create and run the initial process P0
(5). Construct pgdir and pgtables for processes to use two-level dynamic paging
(6). Initialize the file system and mount the root file system
(7). Create the INIT process P1; switch process to run P1
(8). P1 forks login processes on the console and serial terminals to allow user logins.
(9). When a user logs in, the login process executes the command interpreter sh.
(10). The user enters commands for sh to execute.
(11). When a user logs out, the INIT process forks another login process on the terminal.

(1). SDC Booting: An ARM based hardware system usually has an onboard boot-loader implemented in firmware. When an ARM based system starts, the onboard boot-loader loads and executes a stage-1 booter from either a flash device or, in many cases, a FAT partition on an SDC. The stage-1 booter loads a kernel image and transfers control to the kernel image. For EOS on the ARM Versatilepb VM, the booting sequence is similar. First, we develop a stage-1 booter as a standalone program. Then we design a stage-2 booter to boot up the EOS kernel image from an EXT2 partition. On the SDC, partition 1 begins at sector 2048. The first 2046 sectors are free, i.e. not used by the file system. The stage-2 booter size is less than 10 KB. It is installed in sectors 2 to 20 of the SDC. When the ARM Versatilepb VM starts, QEMU loads the stage-1 booter to 0x10000 (64 KB) and executes it first. The stage-1 booter loads the stage-2 booter from the SDC to 2 MB and transfers control to it. The stage-2 booter loads the EOS kernel image file (/boot/kernel) to 1 MB and jumps to 1 MB to execute the kernel's startup code. During booting, both the stage-1 and stage-2 booters use a UART port for the user interface and a simple SDC driver to load SDC blocks. In order to keep the booters simple, both the UART and SDC drivers use polling for I/O. The reader may consult the source code in the booter1 and booter2 directories for details, which also shows how to install the stage-2 booter to the SDC.
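As a concrete illustration of the MBR layout just described, the following sketch shows how a booter might read the partition table to locate the start sector of partition 1. This is not the book's booter code; the offsets follow the standard MBR format, and the function name is an assumption.

```c
#include <stdint.h>

/* The partition table lives at offset 446 of the 512-byte MBR; each of the
   four 16-byte entries holds its 32-bit LBA start sector at entry offset 8. */
uint32_t part_start(const uint8_t *mbr, int n)   /* n = 1..4 */
{
    const uint8_t *e = mbr + 446 + (n - 1) * 16;
    /* assemble the little-endian value byte by byte to avoid alignment issues */
    return (uint32_t)e[8]        | (uint32_t)e[9]  << 8 |
           (uint32_t)e[10] << 16 | (uint32_t)e[11] << 24;
}
```

For the SDC described above, part_start(mbr, 1) would return 2048, the first sector of the EXT2 partition.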

8.5.6 Process Management in EOS

In the EOS kernel, each process or thread is represented by a PROC structure which consists of three parts.

In the PROC structure, the next field is used to link the PROCs in various link lists or queues. The ksp field is the saved Kernel mode stack pointer of the process. When a process gives up the CPU, it saves the CPU registers in kstack and saves the stack pointer in ksp. When a process regains the CPU, it resumes running from the stack frame pointed to by ksp. The fields usp, upc and ucpsr are for saving the Umode sp, pc and cpsr during syscall and IRQ interrupt processing. This is because the ARM processor does not stack the Umode sp and cpsr automatically during SWI (system call) and IRQ (interrupt) exceptions. Since both system calls and interrupts may trigger a process switch, we must save the process Umode context manually. In addition to the CPU registers, which are saved in the SVC or IRQ stack, we also save the Umode sp and cpsr in the PROC structure. The fields pid, ppid, priority and status are obvious. In most large OSes, each process is assigned a unique pid from a range of pid numbers. In EOS, we simply use the PROC index as the process pid, which simplifies the kernel code and also makes it easier for discussion. When a process terminates, it must wake up the parent process if the latter is waiting for the child to terminate. In the PROC structure, the parent pointer points to the parent PROC. This allows the dying process to find its parent quickly. The event field is the event value when a process goes to sleep. The exitValue field is the exit status of a process. If a process terminates normally by an exit(value) syscall, the low byte of exitValue is the exit value. If it terminates abnormally by a signal, the high byte is the signal number. This allows the parent process to extract the exit status of a ZOMBIE child to determine whether it terminated normally or abnormally. The time field is the maximal time-slice of a process, and cpu is its CPU usage time. The time-slice determines how long a process can run, and the CPU usage time is used to compute the process scheduling priority.
The pause field is for a process to sleep for a number of seconds. In EOS, process and thread PROCs are identical. The type field identifies whether a PROC is a PROCESS or a THREAD. EOS is a uniprocessor (UP) system, in which only one process may run in Kernel mode at a time. For process synchronization, it uses sleep/wakeup in process management and in the implementation of pipes, but it uses semaphores in device drivers and the file system. When a process becomes blocked on a semaphore, the sem field points to the semaphore. This allows the kernel to unblock a process from a semaphore queue, if necessary. For example, when a process waits for input from a serial port, it is blocked in the serial port driver's input semaphore queue. A kill signal or an interrupt key should let the process continue. The sem pointer simplifies the unblocking operation. Each PROC has a res pointer pointing to a resource structure.
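The fields described above can be collected into a sketch of the PROC structure. This is a reconstruction for reference only; the exact layout and names are in the book's type.h, and SSIZE here is an assumption.

```c
#define SSIZE 1024         /* kstack size in 32-bit words, assumed */

/* Sketch of the EOS PROC structure, per the field descriptions above. */
typedef struct proc {
    struct proc *next;      /* link in a free list, readyQueue or sleepList */
    int  *ksp;              /* saved Kmode stack pointer */
    int  *usp, *upc;        /* saved Umode sp and pc during SWI/IRQ */
    int   ucpsr;            /* saved Umode cpsr */
    int   pid, ppid;        /* PROC index is used as the pid */
    int   status, priority;
    struct proc *parent;    /* for quick wakeup of the parent in kexit() */
    int   event;            /* event value when sleeping */
    int   exitValue;        /* low byte: exit value; high byte: signal number */
    int   time, cpu;        /* time-slice and CPU usage time */
    int   pause;            /* seconds remaining in a timed sleep */
    int   type;             /* PROCESS or THREAD */
    struct semaphore *sem;  /* semaphore this PROC is blocked on, if any */
    struct pres *res;       /* pointer to the shared resource structure */
    int  *kstack;           /* Kmode stack: a dynamically allocated page */
} PROC;
```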

The PRES structure contains process specific information. It includes the process uid, gid, level-1 page table (pgdir) and image size, current working directory, terminal special file name, executing program name, signals and signal handlers, message queue and opened file descriptors, etc. In EOS, both PROC and PRES structures are statically allocated. If desired, they may be constructed dynamically. Processes and threads are independent execution units. Each process executes in a unique address space. All threads in a process execute in the same address space of the process. During system initialization, each PROCESS PROC is assigned a unique PRES structure pointed to by the res pointer. A process is also the main thread of the process. When creating a new thread, its proc pointer points to the process PROC and its res pointer points to the same PRES structure of the process. Thus, all threads in a process share the same resources, such as opened file descriptors, signals and messages, etc. Some OS kernels allow individual threads to open files, which are private to the threads. In that case, each

PROC structure must have its own file descriptor array, and similarly for signals and messages, etc. In the PROC structure, kstack is a pointer to the process/thread Kernel mode stack. In EOS, PROCs are managed as follows. Free process and thread PROCs are maintained in separate free lists for allocation and deallocation. In EOS, which is a UP system, there is only one readyQueue for process scheduling. The Kernel mode stack of the initial process P0 is statically allocated at 8 KB (0x2000). The Kernel mode stack of every other PROC is dynamically allocated a (4 KB) page frame only when needed. When a process terminates, it becomes a ZOMBIE but retains its PROC structure, pgdir and kstack, which are eventually deallocated by the parent process in kwait().

8.5.7 Assembly Code of EOS

The ts.s File: ts.s is the only kernel file in ARM assembly code. It consists of several logically separate parts. For ease of discussion and reference, we shall identify them as ts.s.1 to ts.s.5. In the following, we shall list the ts.s code and explain the functions of the various parts.

msr cpsr, #0xDB

ts.s.1 is the reset_handler, which begins execution in SVC mode with interrupts off and the MMU disabled. First, it initializes proc[0]'s kstack pointer to 8 KB (0x2000) and sets the SVC mode stack pointer to the high end of proc[0].kstack. This makes proc[0]'s kstack the initial execution stack. Then it initializes the stack pointers of the other privileged modes for exception processing. In order to run processes later in User mode, it sets the SPSR to User mode. Then it continues to execute the second part of the assembly code. In a real ARM based system, the FIQ interrupt is usually reserved for urgent events, such as power failure, which can be used to trigger the OS kernel to save system information to a non-volatile storage device for later recovery. Since most emulated ARM VMs do not have such a provision, EOS uses only IRQ interrupts, not the FIQ interrupt.

ts.s.2: the second part of the assembly code performs three functions. First, it copies the vector table to address 0. Then it constructs an initial one-level page table to create an identity mapping of the low 258 MB of VA to PA, which covers 256 MB RAM plus 2 MB of I/O space beginning at 256 MB. The EOS kernel uses the KML memory mapping scheme, in which the kernel space is mapped to low VA addresses. The initial page table is built at 0x4000 (16 KB) by the mkPtable() function (in the t.c file). It will be the page table of the initial process P0, which runs only in Kernel mode. After setting up the initial page table, it configures and enables the MMU for VA to PA address translation. Then it calls main() to continue kernel initialization in C.
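The identity mapping can be sketched in C as follows. The loop bounds come from the text (258 one-MB sections); the section attribute value 0x412 is an assumption for illustration, not necessarily the value used in the book's t.c.

```c
#include <stdint.h>

/* Sketch of mkPtable(): build a one-level page table of 1 MB sections that
   identity-maps the low 258 MB (256 MB RAM + 2 MB I/O). In the real system
   the table is built at 0x4000 (16 KB); here the caller supplies it. */
void mkPtable(uint32_t *ptable)
{
    for (int i = 0; i < 4096; i++)
        ptable[i] = 0;                         /* invalidate all 4096 entries */
    for (int i = 0; i < 258; i++)              /* VA section i -> PA section i */
        ptable[i] = (uint32_t)i << 20 | 0x412; /* section entry; attrs assumed */
}
```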

// now in SYS mode; restore Umode usp

ts.s.3: The third part of the assembly code contains the entry points of the SWI (SVC) and IRQ exception handlers. Both the SVC and IRQ handlers are quite unique due to the different operating modes of the ARM processor architecture, so we shall explain them in more detail. System Call Entry: svc_entry is the entry point of the SWI exception handler, which is used for system calls to the EOS kernel. Upon entry, it first saves the process (Umode) context in the process Kmode (SVC mode) stack. System call parameters (a, b, c, d) are passed in registers r0–r3, which should not be altered, so the code uses only registers r4–r10. First, it lets r6 point to the process PROC structure. Then it saves the current spsr, which is the Umode cpsr, into PROC.ucpsr. Then it changes to SYS mode to access the Umode registers. It saves the Umode sp and pc into PROC.usp and PROC.upc, respectively. Thus, during a system call, the process Umode context is saved as follows.

Umode registers [r0-r12, r14] saved in PROC.kstack

Umode [sp, pc, cpsr] saved in PROC.[usp, upc, ucpsr]

In addition, it also saves the Kmode sp in PROC.ksp, which is used to copy the parent's kstack to the child during fork(). Then it enables IRQ interrupts and calls svc_chandler() to process the system call. Each syscall (except kexit) returns a value, which replaces the saved r0 in kstack as the return value back to Umode. System Call Exit: goUmode is the syscall exit code. It lets the current running process, which may or may not be the original process that did the syscall, return to Umode. First, it disables IRQ interrupts to ensure that the entire goUmode code is executed in a critical section. Then it lets the current running process check and handle any outstanding signals. Signal handling in the ARM architecture is also quite unique, which will be explained later. If the process survives the signal, it calls reschedule() to re-schedule processes, which may switch process if sw_flag is set, meaning that there are processes in the readyQueue with higher priority. Then the current running process restores [usp, upc, cpsr] from its PROC structure and returns to Umode by

ldmfd sp!, {r0-r12, pc}^

Upon return to Umode, r0 contains the return value of the syscall. IRQ Entry: irq_handler is the entry point of IRQ interrupts. Unlike syscalls, which can only originate from Umode, IRQ interrupts may occur in either Umode or Kmode. EOS is a uniprocessor OS. The EOS kernel is non-preemptive, which means it does not switch process while in Kernel mode. However, it may switch process if the interrupted process was running in Umode. This is necessary in order to support process scheduling by time-slice and dynamic priority. Task switching in ARM IRQ mode poses a unique problem, which we shall elaborate shortly. Upon entry to irq_handler, it first saves the context of the interrupted process in the IRQ mode stack. Then it checks whether the interrupt occurred in Umode. If so, it also saves the Umode [usp, upc, cpsr] into the PROC structure. Then it calls irq_chandler() to process the interrupt. The timer interrupt handler may set the switch-process flag, sw_flag, if the time-slice of the current running process has expired. Similarly, a device interrupt handler may also set sw_flag if it wakes up or unblocks processes with higher priority. At the end of interrupt processing, if the interrupt occurred in Kmode or sw_flag is off, there should be no task switch, so the process returns normally to the original point of interruption. However, if the interrupt occurred in Umode and sw_flag is set, the kernel switches process to run the process with the highest priority.
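The C side of system call handling, svc_chandler(), can be sketched as a dispatch table indexed by the syscall number. The assumption that parameter a carries the syscall number, as well as the stub functions and table, are ours for illustration; the book's actual dispatcher is in its kernel source.

```c
/* Two example kernel functions (stubs for illustration only). */
int sys_fork(int a, int b, int c)   { return a + b + c; }
int sys_getpid(int a, int b, int c) { (void)a; (void)b; (void)c; return 42; }

typedef int (*syscall_fn)(int, int, int);

static syscall_fn syscall_table[] = { sys_fork, sys_getpid /*, ... */ };
#define NSYSCALL (int)(sizeof(syscall_table) / sizeof(syscall_table[0]))

/* Called from svc_entry with the four parameters from r0-r3. The return
   value replaces the saved r0 in kstack, becoming the Umode return value. */
int svc_chandler(int a, int b, int c, int d)
{
    (void)d;
    if (a < 0 || a >= NSYSCALL)
        return -1;                   /* invalid syscall number */
    return syscall_table[a](b, c, d);
}
```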

8.5.7.5 IRQ and Process Preemption

Unlike syscalls, which always use the process kstack in SVC mode, task switch in IRQ mode is complicated by the fact that, while interrupt processing uses the IRQ stack, task switch must be done in SVC mode, which uses the process kstack. In this case, we must perform the following operations manually.

(1). Transfer the INTERRUPT stack frame from the IRQ stack to the process (SVC) kstack.
(2). Flatten out the IRQ stack to prevent it from overflowing.
(3). Set the SVC mode stack pointer to the INTERRUPT stack frame in the process kstack.
(4). Switch to SVC mode and call tswitch() to give up the CPU, which pushes a RESUME stack frame onto the process kstack pointed to by the saved PROC.ksp.
(5). When the process regains the CPU, it resumes in SVC mode by the RESUME stack frame and returns to where it called tswitch() earlier.
(6). Restore the Umode [usp, cpsr] from the saved [usp, ucpsr] in the PROC structure.
(7). Return to Umode by the INTERRUPT stack frame in kstack.

Task switch in IRQ mode is implemented in the code segment irq_tswitch(), which can best be explained by the following diagrams.

ts.s.4: The fourth part of the assembly code implements task switching. It consists of three functions: tswitch() is for task switch in Kmode, irq_tswitch() is for task switch in IRQ mode and switchPgdir() is for switching the process pgdir during task switch. Since all these functions have already been explained, we shall not repeat them here.

The last part of the assembly code implements various utility functions, such as lock/unlock, int_off/int_on, and getting the CPU status register, etc. Note the difference between lock/unlock and int_off/int_on. Whereas lock/unlock disable/enable IRQ interrupts unconditionally, int_off disables IRQ interrupts but returns the original CPSR, which is restored in int_on.

These are necessary in device interrupt handlers, which run with interrupts disabled but may issue V operations on semaphores to unblock processes.
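The distinction can be made concrete with a small C model of the two routines. The real versions in ts.s read and write the CPSR in assembly; the simulated cpsr variable here only serves to show the calling pattern.

```c
static int cpsr = 0;          /* bit 7 (0x80) = IRQ disabled, as on ARM */

int int_off(void)             /* disable IRQ, return the ORIGINAL cpsr */
{
    int old = cpsr;
    cpsr |= 0x80;
    return old;
}

void int_on(int old_cpsr)     /* restore the saved cpsr, whatever it was */
{
    cpsr = old_cpsr;
}

/* Typical pattern in a device interrupt handler that issues V on a
   semaphore: interrupts may already be off, so the original state is
   saved and restored rather than unconditionally enabled by unlock(). */
void handler_body(void)
{
    int sr = int_off();
    /* ... V(&driver_semaphore); ... */
    int_on(sr);
}
```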

8.5.8 Kernel Files of EOS

Part 2. t.c ﬁle: The t.c ﬁle contains the main() function, which is called from reset_handler when the system starts.

P1 forks login processes on the console and serial terminals for users to log in. Then it waits for any ZOMBIE children, which include the login processes as well as any orphans, e.g. in multi-stage pipes. When the login processes start up, the system is ready for use.
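The INIT behavior just described can be sketched as a fork-wait-respawn loop. This is our reconstruction, not the book's init.c: spawn_login stands in for forking a login process on a terminal (in EOS it would exec /bin/login), and the rounds limit exists only so the sketch terminates.

```c
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in spawner: forks a child that exits at once. In EOS, the child
   would exec the login program on the given terminal instead. */
static pid_t demo_spawn(int term) {
    (void)term;
    pid_t p = fork();
    if (p == 0) _exit(0);
    return p;
}

/* INIT main loop: one login per terminal (nterm <= 8 assumed); reap any
   ZOMBIE child, and respawn a login whenever a login process dies. */
int init_loop(int nterm, int rounds, pid_t (*spawn_login)(int term))
{
    pid_t on[8];
    int respawned = 0;
    for (int t = 0; t < nterm; t++)
        on[t] = spawn_login(t);
    while (respawned < rounds) {
        int status;
        pid_t pid = wait(&status);       /* wait for any ZOMBIE child */
        if (pid < 0) break;
        for (int t = 0; t < nterm; t++)
            if (on[t] == pid) {          /* a login died: respawn it */
                on[t] = spawn_login(t);
                respawned++;
            }
        /* children not in on[] are adopted orphans: simply reaped */
    }
    return respawned;
}
```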

8.5.8.2 Kernel Initialization

kernel_init() function: The kernel_init() function consists of the following steps.

(1). Initialize kernel data structures. These include the free PROC lists, a readyQueue for process scheduling and a FIFO sleepList containing SLEEP processes.
(2). Create and run the initial process P0, which runs in Kmode with the lowest priority 0. P0 is also the idle process, which runs if there are no other runnable processes, i.e. when all other processes are sleeping or blocked. When P0 resumes, it executes a busy waiting loop until the readyQueue is non-empty. Then it switches process to run a ready process with the highest priority. Instead of a busy waiting loop, P0 may put the CPU in a power-saving WFI state with interrupts enabled. After processing an interrupt, it tries to run a ready process again, etc.
(3). Construct a Kmode pgdir at 32 KB and 258 level-2 page tables at 5 MB. Construct level-1 pgdirs for the (64) processes in the area of 6 MB and their associated level-2 page tables at 7 MB. Details of the pgdirs and page tables will be explained in the next section on memory management.
(4). Switch pgdir to the new level-1 pgdir at 32 KB to use 2-level paging.
(5). Construct a pfreeList containing the free page frames from 8 MB to 256 MB, and implement the palloc()/pdealloc() functions to support dynamic paging.
(6). Initialize pipes and message buffers in the kernel.
(7). Return to main(), which calls fs_init() to initialize the file system and mount the root file system. Then it creates and runs the INIT process P1.

The remaining functions in t.c include scheduler(), schedule() and reschedule(), which are parts of the process scheduler in the EOS kernel. In the scheduler() function, the first few lines of code apply only to the initial process P0. When the

system starts up, P0 executes mount_root() to mount the root file system. It uses an I/O buffer to read the SDC, which causes P0 to block on the I/O buffer until the read operation completes. Since there is no other process yet, P0 cannot switch process when it becomes blocked. So it busy-waits until the SDC interrupt handler executes V to unblock it. Alternatively, we may modify the SDC driver to use polling during system startup and switch to interrupt-driven mode after P0 has created P1. The disadvantage is that this would make the SDC driver less efficient, since it would have to check a flag on every read operation.

8.5.9 Process Management Functions

8.5.9.1 fork-exec
EOS supports dynamic process creation by fork, which creates a child process with an identical Umode image as the parent. It allows processes to change images by exec. In addition, it also supports threads within the same process. These are implemented in the following files. fork.c file: this file contains fork1(), kfork(), fork() and vfork(). fork1() is the common code of all the other fork functions. It creates a new proc with a pgdir and page tables. kfork() is used only by P0 to create the INIT proc P1. It loads the Umode image file (/bin/init) of P1 and initializes P1's kstack to make it ready to run in Umode. fork() creates a child process with an identical Umode image as the parent. vfork() is the same as fork() but without copying the image. exec.c file: this file contains kexec(), which allows a process to change its Umode image to a different executable file and pass command-line parameters to the new image. threads.c file: this file implements threads in a process and thread synchronization by mutexes.

8.5.9.2 exit-wait
The EOS kernel uses sleep/wakeup for process synchronization in process management and also in pipes. Process management is implemented in the wait.c file, which contains the following functions.

ksleep(): a process goes to sleep on an event. Sleeping PROCs are maintained in a FIFO sleepList for waking up in order.
kwakeup(): wake up all PROCs that are sleeping on an event.
kexit(): process termination in the kernel.
kwait(): wait for a ZOMBIE child process; return its pid and exit status.
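The first two functions can be sketched as follows. This is a single-threaded model with assumed names: the real ksleep() also calls tswitch() to give up the CPU, and the real kwakeup() moves the woken PROCs into the readyQueue.

```c
typedef struct proc {
    struct proc *next;
    int pid, status, event;
} PROC;

enum { READY, SLEEP };

PROC *sleepList = 0;        /* FIFO list of sleeping PROCs */
PROC *running;              /* the current running process */

void ksleep(int event)      /* the caller sleeps on an event value */
{
    PROC **q = &sleepList;
    running->event = event;
    running->status = SLEEP;
    while (*q) q = &(*q)->next;    /* enter the sleepList at the tail (FIFO) */
    running->next = 0;
    *q = running;
    /* tswitch();  give up the CPU in the real kernel */
}

int kwakeup(int event)      /* wake up ALL PROCs sleeping on event */
{
    int n = 0;
    PROC **q = &sleepList;
    while (*q) {
        PROC *p = *q;
        if (p->event == event) {
            *q = p->next;          /* unlink from the sleepList */
            p->status = READY;     /* enqueue(&readyQueue, p) in the kernel */
            n++;
        } else
            q = &p->next;
    }
    return n;
}
```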

8.5.10 Pipes

The EOS kernel supports pipes between related processes. A pipe is a structure consisting of the following ﬁelds.

The syscall int r = pipe(int pd[]);

creates a pipe in the kernel and returns two file descriptors in pd[2], where pd[0] is for reading from the pipe and pd[1] is for writing to the pipe. The pipe's data buffer is a dynamically allocated 4 KB page, which will be released when the pipe is deallocated. After creating a pipe, the process typically forks a child process to share the pipe, i.e. both the parent and the child have the same pipe descriptors pd[0] and pd[1]. However, on the same pipe each process must be either a READER or a WRITER, but not both. So, one of the processes is chosen as the pipe WRITER and the other one as the pipe READER. The pipe WRITER must close its pd[0] and redirect its stdout (fd = 1) to pd[1], so that its stdout is connected to the write end of the pipe. The pipe READER must close its pd[1] and redirect its stdin (fd = 0) to pd[0], so that its stdin is connected to the read end of the pipe. After these steps, the two processes are connected by the pipe. READER and WRITER processes on the same pipe are synchronized by sleep/wakeup. Pipe read/write functions are implemented in the pipe.c file. Closing pipe

descriptor functions are implemented in the open_close.c file of the file system. A pipe is deallocated when all the file descriptors on the pipe are closed. For more information on the implementation of pipes, the reader may consult (Chap. 6.14, Wang 2015) or the pipe.c file for details.
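The WRITER/READER protocol described above can be demonstrated with the standard Unix calls, which the EOS syscall interface mirrors. The demo function below is ours, not the book's code; it makes the parent the WRITER and the child the READER, and saves/restores the parent's real stdout so the redirection is visible but reversible.

```c
#include <unistd.h>
#include <sys/wait.h>

/* Returns 0 if the READER child received the WRITER's data. */
int run_pipe_demo(void)
{
    int pd[2], status, saved;
    char buf[16];
    if (pipe(pd) < 0)
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                  /* child: the pipe READER */
        close(pd[1]);                /* a READER must close its write end */
        dup2(pd[0], 0);              /* redirect stdin to the read end */
        close(pd[0]);
        int n = read(0, buf, sizeof buf);
        _exit(n == 5 ? 0 : 1);       /* expect the 5 bytes written below */
    }
    /* parent: the pipe WRITER */
    close(pd[0]);                    /* a WRITER must close its read end */
    saved = dup(1);                  /* keep the real stdout for later */
    dup2(pd[1], 1);                  /* redirect stdout to the write end */
    close(pd[1]);
    write(1, "hello", 5);
    dup2(saved, 1);                  /* restore stdout; this also closes the
                                        pipe's write end, so the reader sees EOF */
    close(saved);
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```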

8.5.11 Message Passing

In addition to pipes, the EOS kernel supports inter-process communication by message passing. The message passing mechanism consists of the following components.

(1). A set of NPROC message buffers (MBUFs) in kernel space.

(2). Each process has a message queue in PROC.res.mqueue, which contains messages sent to but not yet received by the process. Messages in the message queue are ordered by priority.
(3). send(char *msg, int pid): send a message to a target process by pid.
(4). recv(char *msg): receive a message from the process's own message queue.

In EOS, message passing is synchronous. A sending process waits if there are no free message buffers. A receiving process waits if there are no messages in its message queue. Process synchronization in send/recv is by semaphores. The following lists the mes.c file.
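The mechanism can be sketched in C as follows. This is a single-threaded model with assumed names, not the book's mes.c: the semaphores appear as plain counters whose P/V points are marked in comments, and the message queue here is FIFO where EOS orders it by priority.

```c
#include <string.h>

#define NPROC  8
#define MSGLEN 64

typedef struct mbuf {            /* one kernel message buffer */
    struct mbuf *next;
    int sender;
    char text[MSGLEN];
} MBUF;

MBUF mbufs[NPROC], *freeMbufList;
MBUF *mqueue[NPROC];             /* stand-in for each PROC.res->mqueue */
int nmbuf;                       /* counting semaphore: free mbufs */
int nmsg[NPROC];                 /* counting semaphore: messages per proc */

void msg_init(void)
{
    freeMbufList = 0;
    for (int i = 0; i < NPROC; i++) {
        mbufs[i].next = freeMbufList;
        freeMbufList = &mbufs[i];
        mqueue[i] = 0;
        nmsg[i] = 0;
    }
    nmbuf = NPROC;
}

int ksend(int from, int pid, const char *msg)
{
    if (nmbuf <= 0) return -1;   /* P(&nmbuf): the sender would block here */
    nmbuf--;
    MBUF *m = freeMbufList;
    freeMbufList = m->next;
    m->sender = from;
    strncpy(m->text, msg, MSGLEN - 1);
    m->text[MSGLEN - 1] = 0;
    m->next = 0;
    MBUF **q = &mqueue[pid];     /* enter the target's message queue */
    while (*q) q = &(*q)->next;
    *q = m;
    nmsg[pid]++;                 /* V(&nmsg[pid]): wake a waiting receiver */
    return 0;
}

int krecv(int pid, char *msg)
{
    if (nmsg[pid] <= 0) return -1; /* P(&nmsg[pid]): the receiver would block */
    nmsg[pid]--;
    MBUF *m = mqueue[pid];
    mqueue[pid] = m->next;
    strcpy(msg, m->text);
    int sender = m->sender;
    m->next = freeMbufList;      /* release the mbuf */
    freeMbufList = m;
    nmbuf++;                     /* V(&nmbuf): wake a waiting sender */
    return sender;
}
```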

8.5.12 Demonstration of Message Passing

In the USER directory, the programs send.c and recv.c are used to demonstrate the message passing capability of EOS. The reader may test send/recv messages as follows.

(1). Log in to the console. Enter the command line recv &. The sh process forks a child to run the recv command but does not wait for the recv process to terminate, so that the user may continue to enter commands. Since there are no messages yet, the recv process will be blocked on its message queue in the kernel.
(2). Run the send command. Enter the receiving process's pid and a text string, which will be sent to the recv process, allowing it to continue. Alternatively, the reader may also log in from a different terminal to run the send command.

The EOS kernel code and data structures occupy the lowest 2 MB of physical memory. The memory area from 2 to 8 MB is used by the EOS kernel for the LCD display buffer, I/O buffers, level-1 and level-2 page tables of processes, etc. The memory area from 8 to 256 MB is free. Free page frames from 8 to 256 MB are maintained in a pfreeList for dynamic allocation/deallocation of page frames.

8.6.2 Virtual Address Spaces

EOS uses the KML virtual address space mapping scheme, in which the kernel space is mapped to low Virtual Addresses (VA) and the User mode space is mapped to high VA. When the system starts, the Memory Management Unit (MMU) is off, so that every address is a real or physical address. Since the EOS kernel is compile-linked with real addresses, it can execute the kernel's C code directly. First, it sets up an initial one-level page table at 16 KB to create an identity mapping of VA to PA and enables the MMU for VA to PA translation.

8.6.3 Kernel Mode Pgdir and Page Tables

In reset_handler, after initializing the stack pointers of the various privileged modes for exception processing, it constructs a new pgdir at 32 KB and the associated level-2 page tables at 5 MB. The low 258 entries of the new pgdir point to their level-2 page tables at 5 MB + i*1 KB (0 <= i < 258). Each page table contains 256 entries, each pointing to a 4 KB page frame in memory. All other entries of the pgdir are 0's. In the new pgdir, entries 2048-4095 are for the User mode VA space. Since the high 2048 entries are all 0's, the pgdir is good only for the 258 MB kernel VA space. It will be the pgdir of the

initial process P0, which runs only in Kmode. It is also the prototype of all other pgdirs, since their Kmode entries are all identical. Then it switches to the new pgdir to use 2-level paging in Kmode.

8.6.4 Process User Mode Page Tables

Each process has a pgdir at 6 MB + pid*16 KB. The low 258 entries of all pgdirs are identical, since their Kmode VA spaces are the same. The number of pgdir entries for Umode VA depends on the Umode image size, which in turn depends on the executable image file size. For simplicity, we set the Umode image size, USIZE, to 4 MB, which is big enough for all the Umode programs used for testing and demonstration. The Umode pgdir and page tables of a process are set up only when the process is created. When creating a new process in fork1(), we compute the number of Umode page tables needed as npgdir = USIZE/1 MB. The Umode pgdir entries point to npgdir dynamically allocated page frames. Each page table uses only the low 1 KB space of its (4 KB) page frame. The attributes of the Umode pgdir entries are set to 0x31 for domain 1. In the Domain Access Control register, the access bits of both domains 0 and 1 are set to b01 for client mode, which checks the Access Permission (AP) bits of the page table entries. Each Umode page table contains pointers to 256 dynamically allocated page frames. The attributes of the page table entries are set to 0xFFE, i.e. AP = 11 for all the (1 KB) subpages within each page, to allow R|W access in User mode.
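The Umode mapping built in fork1() can be sketched as follows. The attribute values 0x31 and 0xFFE and the slot base 2048 come from the text; the function name is assumed, and the caller supplies the page-frame addresses, standing in for palloc().

```c
#include <stdint.h>

#define USIZE_MB 4       /* USIZE = 4 MB, as in the text */

/* pgdir slots 2048 and up map the Umode VA space (2 GB and up); each
   level-2 table occupies the low 1 KB (256 entries) of a 4 KB frame. */
void map_umode(uint32_t *pgdir, uint32_t (*ptable)[1024], const uint32_t *frames)
{
    int npgdir = USIZE_MB;                    /* one pgdir entry per MB */
    for (int i = 0; i < npgdir; i++) {
        /* pgdir entry: level-2 table base | 0x31 (coarse table, domain 1) */
        pgdir[2048 + i] = (uint32_t)(uintptr_t)ptable[i] | 0x31;
        for (int j = 0; j < 256; j++)         /* 256 (4 KB) pages per MB */
            ptable[i][j] = frames[i * 256 + j] | 0xFFE;  /* AP=11: Umode R|W */
    }
}
```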

8.6.5 Switch Pgdir During Process Switch

During process switch, we switch pgdir from that of the current process to that of the next process and flush the TLB and the I and D caches. This is implemented by the switchPgdir() function in ts.s.

8.6.6 Dynamic Paging

In the mem.c file, the functions free_page_list(), palloc() and pdealloc() implement dynamic paging. When the system starts, we build a pfreeList, which threads all the free page frames from 8 to 256 MB in a linked list. During system operation, palloc() allocates a free page frame from pfreeList, and pdealloc() releases a page frame back to pfreeList for reuse. The following shows the mem.c file.
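The scheme can be sketched as follows. Free frames are threaded into a list by storing the address of the next free frame in the first word of each free frame. Addresses are uintptr_t here only so the sketch also runs on a 64-bit host; the kernel itself uses 32-bit addresses, and the exact names follow the text.

```c
#include <stdint.h>

uintptr_t pfreeList;                     /* address of the first free frame */

void free_page_list(uintptr_t start, uintptr_t end)  /* e.g. 8 MB to 256 MB */
{
    pfreeList = start;
    for (uintptr_t a = start; a < end; a += 4096)    /* 4 KB page frames */
        *(uintptr_t *)a = (a + 4096 < end) ? a + 4096 : 0;
}

uintptr_t palloc(void)                   /* allocate one page frame */
{
    uintptr_t a = pfreeList;
    if (a)
        pfreeList = *(uintptr_t *)a;     /* unlink the head frame */
    return a;                            /* 0 means: out of page frames */
}

void pdealloc(uintptr_t a)               /* release a frame for reuse */
{
    *(uintptr_t *)a = pfreeList;
    pfreeList = a;
}
```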

8.7 Exception and Signal Processing

During system operation, the ARM processor recognizes six types of exceptions: FIQ, IRQ, SWI, data_abort, prefetch_abort and undefined exceptions. Among these, FIQ and IRQ are for interrupts and SWI is for system calls. So the only true exceptions are data_abort, prefetch_abort and undefined exceptions, which occur under the following circumstances. A data_abort event occurs when the memory controller or MMU indicates that an invalid memory address has been accessed, for example, an attempt to access an invalid VA. A prefetch_abort event occurs when an attempt to load an instruction results in a memory fault. For example, if 0x1000 is outside of the VA range, then BL 0x1000 would cause a prefetch abort at the next instruction address 0x1004. An undefined (instruction) event occurs when a fetched and decoded instruction is not in the ARM instruction set and none of the coprocessors claims the instruction. In all Unix-like systems, exceptions are converted to signals, which are handled as follows.

8.7.1 Signal Processing in Unix/Linux

(1). Signals in the Process PROC: Each PROC has a 32-bit vector, which records the signals sent to a process. In the bit vector, each bit (except bit 0) represents a signal number. A signal n is present if bit n of the bit vector is 1. In addition, it also has a MASK bit vector for masking out the corresponding signals. A set of syscalls, such as sigmask, sigsetmask, siggetmask, sigblock, etc. can be used to set, clear and examine the MASK bit vector. A pending signal becomes effective only if it is not masked out. This allows a process to defer processing masked-out signals, similar to the CPU masking out certain interrupts.
(2). Signal Handlers: Each process PROC has a signal handler array, int sig[32]. Each entry of the sig[32] array specifies how to handle a corresponding signal, where 0 means DEFault, 1 means IGNore, and any other nonzero value means by a preinstalled signal catcher (handler) function in Umode.
(3). Trap Errors and Signals: When a process encounters an exception, it traps to the exception handler in the OS kernel. The trap handler converts the exception cause to a signal number and delivers the signal to the current running process. If the exception occurs in Kernel mode, which must be due to a hardware error or, most likely, bugs in the kernel code, there is nothing the process can do. So it simply prints a PANIC error message and stops. Hopefully the problem can be traced and fixed in the next kernel release. If the exception occurs in User mode, the process handles the signal by the signal handler function in its sig[] array. For most signals, the default action of a process is to terminate, with an optional memory dump for debugging. A process may replace the default action with IGNore (1) or a signal catcher, allowing it to either ignore the signal or handle it in User mode.
(4). Change Signal Handlers: A process may use the syscall

int r = signal(int signal_number, void *handler);

to change the handler function of a selected signal number, except SIGKILL(9) and SIGSTOP(19). Signal 9 is reserved as the last resort to kill a run-away process, and signal 19 allows a process to stop a child process during debugging. The installed handler, if not 0 or 1, must be the entry address of a function in User space of the form

void catcher(int signal_number){ ...... }
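The bit-vector and sig[32] mechanics of points (1), (2) and (5) can be sketched in user-space C. This is a minimal simulation, not EOS code: the names PROC, ksignal and psig_action are illustrative, and only the pending/mask/handler bookkeeping is modeled.

```c
#include <assert.h>

/* Simulated per-PROC signal fields, as described in points (1)-(2). */
typedef struct {
    int signal;    /* 32-bit pending-signal vector; bit 0 unused  */
    int mask;      /* MASK bit vector; 1 = signal is masked out   */
    int sig[32];   /* 0 = DEFault, 1 = IGNore, else catcher addr  */
} PROC;

/* Record signal n as pending in the target PROC: set bit n. */
void ksignal(PROC *p, int n) { p->signal |= (1 << n); }

/* Return the action for signal n if pending and not masked:
   -1 = nothing to do, 0 = default, 1 = ignore, other = catcher. */
int psig_action(PROC *p, int n) {
    int bit = 1 << n;
    if (!(p->signal & bit) || (p->mask & bit))
        return -1;               /* absent, or deferred by mask */
    p->signal &= ~bit;           /* clear the pending bit first */
    return p->sig[n];
}
```

Note how a masked signal stays pending: psig_action leaves the bit set, so the signal becomes effective as soon as the mask bit is cleared.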

(5). Signal Processing: A process checks and handles signals whenever it is in Kmode. For each outstanding signal number n, the process first clears the signal. It takes the default action if sig[n] = 0, which normally causes the process to terminate. It ignores the signal if sig[n] = 1. If the process has a pre-installed catcher function for the signal, it fetches the catcher's address and resets the installed catcher to DEFault (0). Then it manipulates the return path in such a way that it returns to execute the catcher function in Umode, passing the signal number as a parameter. When the catcher function finishes, it returns to the original point of interruption, i.e. to the place from where it last entered Kmode. Thus, the process takes a detour to execute the catcher function first, then resumes normal execution.

(6). Reset User-Installed Signal Catchers: User-installed catcher functions are intended to deal with trap errors in user program code. Since the catcher function is also executed in Umode, it may commit the same kind of trap error again. If so, the process would end up in an infinite loop, jumping between Umode and Kmode forever. To prevent this, the process typically resets the handler to DEFault (0) before executing the catcher function. This implies that a user-installed catcher function is valid for only one occurrence of the signal. To catch another occurrence of the same signal, the Umode program must install the catcher again. However, the treatment of user-installed signal catchers is not uniform, as it varies across different versions of Unix. For instance, in BSD the signal handler is not reset, but the same signal is blocked while executing the signal catcher. Interested readers may consult the Linux man pages of signal and sigaction for more details.

(7). Inter-Process Signals: In addition to handling exceptions, signals may also be used for inter-process communication. A process may use the syscall

int r = kill(pid, signal_number);

to send a signal to another process identified by pid, causing the latter to execute a pre-installed catcher function in Umode. A common usage of the kill operation is to request the targeted process to terminate, hence the (somewhat misleading) term kill. In general, only related processes, e.g. those with the same uid, may send signals to each other. However, a superuser process (uid = 0) may send signals to any process. The kill syscall may use an invalid pid to mean different ways of delivering the signal. For example, pid = 0 sends the signal to all processes in the same process group, pid = -1 to all processes with pid > 1, etc. The reader may consult the Linux man pages on signal/kill for more details.

(8). Signal and Wakeup/Unblock: kill only sends a signal to a target process. The signal does not take effect until the target process runs. When sending a signal to a target process, it may be necessary to wake up or unblock the target process if the latter is in a SLEEP or BLOCKed state. For example, when a process waits for terminal inputs, which may not come for a long time, it is considered interruptible, meaning that it can be woken up or unblocked by arriving signals. On the other hand, if a process is blocked for SDC I/O, which will complete very soon, it is non-interruptible and should not be unblocked by signals.

8.8 Signal Processing in EOS

8.8.1 Signals in PROC Resource

In EOS, each PROC has a pointer to a resource structure, which contains the following ﬁelds for signals and signal handling.

int signal;     // 31 signals; bit 0 is not used

For the sake of simplicity, EOS does not support signal masking. If desired, the reader may add signal masking to the EOS kernel.
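The signal-related part of the resource structure can be pictured as a small C struct. This is a sketch inferred from the text, not the actual EOS definition: only the signal fields named above are shown, and the placeholder comment stands in for the structure's other per-process resources.

```c
/* Sketch of the signal-related fields in an EOS per-PROC resource
   structure; real EOS may declare these differently. */
typedef struct resource {
    int signal;      /* 31 signals; bit 0 is not used         */
    int sig[32];     /* per-signal action: 0, 1, or catcher   */
    /* ... other per-process resources omitted ... */
} RESOURCE;
```

Keeping the signal state in the resource structure (reached via a pointer from the PROC) rather than in the PROC itself keeps the PROC small; only processes that actually own resources carry the full record.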

8.8.2 Signal Origins in EOS

(1). Hardware: EOS supports the Control-C key from terminals, which is converted to the interrupt signal SIGINT(2) delivered to all processes on the terminal, and the interval timer, which is converted to the alarm signal SIGALRM(14) delivered to the requesting process.

(2). Traps: EOS supports data_abort, prefetch_abort and undefined instruction exceptions.

(3). From Other Processes: EOS supports the kill(pid, signal) syscall, but it does not enforce permission checking. Therefore, a process may kill any process. If the target process is in the SLEEP state, kill() wakes it up. If the target process is BLOCKed for inputs in either the KBD or a UART driver, it is unblocked also.

8.8.3 Deliver Signal to Process

The kill syscall delivers a signal to a target process. The algorithm of the kill syscall is
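The kill algorithm described above (deliver the signal bit, then wake or unblock the target if needed) can be sketched as a user-space simulation. The names kkill, NPROC and the status values are assumptions for illustration; the real EOS kernel walks its own PROC table and wakes blocked processes through its sleep/block queues.

```c
#include <assert.h>

#define NPROC 9
enum { FREE, READY, SLEEP, BLOCK, RUNNING };   /* FREE = 0 */

/* Simplified PROC for the sketch. */
typedef struct { int pid, status, signal; } PROC;
PROC proc[NPROC];

/* Sketch of kill(pid, sig): find the target process, set its
   pending-signal bit, and wake/unblock a SLEEPing or BLOCKed
   target so the signal can take effect when it next runs. */
int kkill(int pid, int sig) {
    for (int i = 0; i < NPROC; i++) {
        PROC *p = &proc[i];
        if (p->status != FREE && p->pid == pid) {
            p->signal |= (1 << sig);     /* deliver: set bit sig */
            if (p->status == SLEEP || p->status == BLOCK)
                p->status = READY;       /* wakeup / unblock     */
            return 0;
        }
    }
    return -1;                           /* no such process      */
}
```

Note that delivery and handling are separate steps: kkill only records the signal and makes the target runnable; the target handles the signal itself the next time it executes in Kmode.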

8.8.5 Signal Handling in EOS Kernel

A CPU usually checks for pending interrupts at the end of executing an instruction. Likewise, it suffices to let a process check for pending signals at the end of Kmode execution, i.e. when it is about to return to Umode. However, if a process enters Kmode via a syscall, it should check and handle signals first. This is because if a process already has a pending signal, which may cause it to die, executing the syscall would be a waste of time. On the other hand, if a process enters Kmode due to an interrupt, it must handle the interrupt first. The algorithm of checking for pending signals is
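The per-signal handling loop can be sketched in user-space C. This is a simulation under stated assumptions, not the EOS listing: check_sigs and the dead flag are illustrative names, and the real kernel's step of fixing up the trap frame to run the catcher in Umode is only indicated by a comment.

```c
#include <assert.h>

typedef struct { int signal; int sig[32]; int dead; } PROC;

/* Sketch of signal handling at the end of Kmode execution: for each
   pending signal n, clear the bit first, then take the default
   action (terminate), ignore it, or select the user catcher.
   Returns the catcher "address" chosen, or 0 if none. */
int check_sigs(PROC *p) {
    int catcher = 0;
    for (int n = 1; n < 32; n++) {       /* bit 0 is not used */
        if (!(p->signal & (1 << n))) continue;
        p->signal &= ~(1 << n);          /* clear the signal first  */
        if (p->sig[n] == 0) {            /* DEFault: terminate      */
            p->dead = 1;
        } else if (p->sig[n] == 1) {     /* IGNore                  */
            continue;
        } else {                         /* user-installed catcher  */
            catcher = p->sig[n];
            p->sig[n] = 0;               /* one-shot: reset to 0    */
            /* real kernel: modify trap frame to run catcher(n) in Umode */
        }
    }
    return catcher;
}
```

Resetting sig[n] to 0 before dispatching the catcher implements the one-occurrence rule of point (6): a repeat of the same trap takes the default action instead of looping through the catcher forever.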

8.8.6 Dispatch Signal Catcher for Execution in User Mode

In the algorithm of psig(), only step (4) is interesting and challenging. Therefore, we shall explain it in more detail. The goal of step (4) is to let the process return to Umode to execute a catcher(int sig) function. When the catcher() function finishes, it should return to the point where the process last entered Kmode. The following diagrams show how to accomplish this. When a process traps to kernel from Umode, its privileged mode stack top contains a "trap stack frame" consisting of 14 entries, as shown in Fig. 8.1. In order for the process to return to execute catcher(int sig) with the signal number as a parameter, we modify the trap stack frame as follows.
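The essence of the trap-frame fix-up can be shown with a simplified stand-in for the 14-entry frame of Fig. 8.1. This is a sketch under assumptions: only the two fields the fix-up touches are modeled (the field names r0 and upc are illustrative), and the bookkeeping that lets the catcher's return reach the original point, e.g. via the Umode stack, is omitted.

```c
#include <assert.h>

/* Two fields of the saved trap stack frame, in the ARM convention
   that r0 carries the first argument and upc is the Umode return PC. */
typedef struct {
    unsigned r0;     /* Umode r0: becomes catcher's parameter  */
    unsigned upc;    /* Umode return PC at the stack frame top */
} TRAPFRAME;

/* Redirect the Umode return path into catcher(sig): remember the
   original return PC, load the signal number into r0, and point
   the return PC at the catcher's entry address. */
unsigned dispatch_catcher(TRAPFRAME *t, unsigned catcher, int sig) {
    unsigned old_upc = t->upc;   /* original interruption point  */
    t->r0  = (unsigned)sig;      /* parameter: signal number     */
    t->upc = catcher;            /* "return" lands in catcher()  */
    return old_upc;              /* must be restored so that the
                                    catcher returns to it later  */
}
```

When the process next returns to Umode it therefore executes the catcher first, taking exactly the detour described in point (5), before resuming at the saved interruption point.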