An Engineer's Options & Futures

In this blog, you can read the 'aha' moments in my rendezvous with computers, computing, and electronics, HOWTOs and whys that you could not get an answer from your textbook and then once in a while, a strange mixture of global economics.

Thursday, May 05, 2011

Bloom Filter is an extremely space-efficient probabilistic data structure. You can read more about it in its wikipedia page: http://en.wikipedia.org/wiki/Bloom_filter. Bloom filter requires a very little storage space for a single key. An enormous amount of data can be stored in a small Bloom filter, if we can afford the risk of a very small percentage of false positive. Venti network storage system introduced in Plan 9 distributed operating system uses Bloom filter for storage.

A few days ago, Raymond Hettinger wrote a simple Bloom filter in ActiveState recipe. This reminded of the fact that I wrote a Bloom filter long time ago with a different hash function. So I forked Hettinger's code and extended the Bloom filter to allow union and intersection operations between two Bloom filters. Such operations are very easy, and allowed if both the Bloom filters use the same hash function. It may be used in distributed systems, where two Bloom filters are independently constructed and merged eventually. I also externalized the hash function to be given as the input to the Bloom filter class. So the user can test it with different hash function, if they have a yielding function to generate hash values.

You can find the Python source code for the Bloom filter in my github or ActiveState recipe.

Bloom filter is a very good data structure for space-efficient storage, whereas it is not suited if you insist on having no false positive in searches. Also if you want to do other operations like sorting the dataset or deleting elements from the dataset, Bloom filter is not suited.

However with some slight modification, sacrificing some space, we can achieve a few more things. For instance, if you want to count the number of times a given key is searched, you can allocate more than a bit for every hash value and increment the positional value as a counter for every search. Least positional value (minus 1) for a given string tells how many times that string has been searched. But this operation would increase the storage requirement. My implementation does not include any such variation.

Tuesday, April 12, 2011

I've tried a few IDEs for standalone Python development. But end of the day I came back to Emacs. I love Emacs because it is lightweight, I can pretty much do whatever I can with any other IDE and much more if I write some Lisp script.

Here in this article, I'd just cover how to step through your standalone Python code and debug it using Emacs as the editor/IDE. First of all you have to install the following Emacs extension modules and Python debugging tools.

Install Pymacs from this github site and thank François Pinard of Montreal for programming it

Check if you have PDB, just by typing it on command window (you'd mostly have)

Now the step-by-step debugging goes like this

With your Python program on buffer, type M-x pdb. It would ask if you want to run PDB like, pdb a.out. Replace a.out with your Python module name. In the screenshot, it is adder.py (doing nothing useful).

This opens a new PDB windows (Screenshot)

Go to the lines where you need breakpoint, type C-x (you can notice the screenshot for the statement about it). If you are using Ubuntu, you get a very good look-n-feel, like that screenshot.

Type c on the PDB buffer to run upto the breakpoint

From the breakpoint, n for next line or s to explore into functions on that line. In Ubuntu, you'd have a GUI to assist you.

To watch a variable, type p var, as in the screenshot.

Any time during debugging, w prints out the stack and u and d lets you go up and down the stack.

For most of the standalone Python script, this would improve your productivity a lot. You can do more complex debug operations, but this is a very good start.

Saturday, April 09, 2011

With all your web apps and client side scripts, cross-browser testing is vital. Great were the roaring 90s when Internet Explorer was the only browser. But now the end user uses any browser, mostly either Chrome, Firefox, Internet Explorer, or Safari. Despite several standardizations, there is no guarantee that scripts in one browser would run in the other. And as an end-user, the message "this page works only in Firefox" is the biggest turn off for me.

Here is an easy way of testing how your web UI looks in different browsers without really installing the browsers in your computer. Through Browserling. It's an open source node.js based StackVM implementation by Peteris Krumins, and you can follow the growth of this project and all his node.js tools, here.

Sunday, February 13, 2011

So far, we have covered the process creation and termination in Linux. In this post and the next one, we will go through the process scheduling. The task scheduler or process scheduler or simply scheduler is the part of the kernel that makes the decision on which process can be executed at any given time, keeping in mind best utilization of the available resources. The scheduler in Linux is preemptive – i.e. the scheduler wouldn't mind partially stopping an executing task and run a different task which is better suited. The preempted task is eventually scheduled to continue from the point where it stopped.

Scheduling policy is the strategy followed by the multitasking OS to maximize efficiency. In Linux or for that matter in most of the OS, the processes are separated into two kinds: IO-bound and processor-bound. An IO-bound process is one that depends on constant interrupt from the IO devices – like a word processor that depends on keyboard entry. A processor-bound task is the one in which the instruction sequence is executed consecutively on the processor until it is either preempted or finished. Linux tries to provide more importance to the IO-bound processes because of the user-interactive nature of it. This is done without compromising much on the background tasks that requires processor time. Linux scheduler is also a dynamic priority-based scheduler. All processes start with a base priority and (inversely) based on how often the processes use up their allocated timeslices without being interrupted by any IO device.

The Linux scheduler is defined in kernel/sched.c. The scheduling algorithm is called Completely Fair Scheduler (CFS), developed by Hungarian hacker, Ingo Molnar. Completely Fair Scheduler is a O(log n) algorithm that depends on a red-black tree which implements a timeline of future tasks ready for execution. Before CFS, Linux used O(1) scheduler. In that, a bitmap based priority queue data structure used of the priority based scheduling class is given by:

Here MAX_RT_PRIO is the number of priority levels which is equal to 140. For user-space, the number of priority levels is given by MAX_USER_RT_PRIO = 100. The main scheduling data structure is run-queue defined as struct rq. This structure actually belongs to the previous O(1) scheduling algorithm, except for a few changes. One important change is the introduction of the reference to struct cfs_rq and the leaves of the cfs_rq given by struct list_head leaf_cfs_rq_list.

The struct cfs_rq consists of struct load_weight is a simple structure with two unsigned longs used in load balancing.

nr_running variable gives the number of processes currently running. tasks_timeline is implemented as a red-black tree as already mentioned, with a reference to its root. The leftmost node of the red-black tree is also tracked. This points to the most eligible task that can for running (see enque method declared in sched_class).

CFS also eliminates the need for the priority queue by introducing two new structures into struct task_struct. They are: struct sched_entity and struct sched_class. The important part of struct sched_entity are:

Here rb_node is the node of the red-black tree. The struct sched_class contains pointers to different functions that are required to implement the CFS scheduler. It consists of enqueue and dequeue functions to add and remove the struct task_struct to the rq, by changing the status runnable of the task. check_preempt_curr function that ensures fairness (F of CFS) by checking whether the current task is pre-emptable or not. set_curr_task function to set the current task as running in the given CPU. The function prio_changed that gives information about change in priority in a dynamic priority driven scheduling. The average time for scheduling a task is given by:

return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;

Here NSEC_PER_MSEC stands for nanoseconds per microsecond. One observation that can be made is that CFS scheduling algorithm in Linux tries to stretch the accuracy to the scale of nanoseconds.

Since Linux allows preemptive scheduling, the CPU time is divided into several time slices. After every timeslice context switching takes place. Timeslice length is usually a variable in CFS. When to schedule a new task is determined in sched_clock_tick() function in kernel/sched_clock.c or by resched_task() call made through change of priority requested by the user (using nice() system call and much more) or by sched_yield() system call.

In the next post, we will see how spinlocks and rcutree are used in scheduling. We will also look at the different scheduling policies in Linux.

Wednesday, September 29, 2010

Finally I found some time to get back to continuing this effort of writing about Linux kernel. This chapter is about how the process or task gets terminated.

In Linux, a task is terminated by an exit() system call, made either explicitly by the task or implicitly when the main() function of the task ends. After the task is terminated, its parent has to be informed about the termination through SIGCHLD signal. Major chunk of exit() operation is performed in do_exit() function in kernel/exit.c.

The first major process in do_exit() function is to validate the credential for exit(). Validation happens in cred.c with this piece of code:

If the validation is successful, then exit_irq_thread() function, implemented in kernel/irq/manage.c is called to set the IRQTF_DIED flag that prevents any attempt to wake up the thread. Then PF_EXITING flag is set on the task_struct of the exiting task. Then the exiting task is protected from cleaning the area of pi futex. A futex is the same as mutex (fast user-space mutex = futex) that is used in the Linux kernel to implement locks. Out of that, pi futex or to be technically correct, PI enabled futex stand for Priority Inheritance enabled futex – a set of lightweight futex, about which we will discuss later.

Following this, the do_exit() invokes exit_mm() function that clears out the user memory space allocated to the process. exit_mm() first releases the user-space by calling mm_release() in kernel/fork.c. mm_release() first removes all the register state saved by the process and inform the parent sleeping on vfork(). The user-space thread id field is wiped off if the exit is normal. On the contrary, if the exit is due to some signal like segmentation fault or bus fault (checked using PF_SIGNALED flag of the task_struct, then it is not wiped so that this information can be written in core dump. This is followed in order by invocations of exit_sem, exit_files, and exit_fs that dequeue if the task has queued any semaphore, removes the locks on files and file space respectively. Then the task status is set to tsk->state = TASK_DEAD.

As it can be observed here, the task_struct instance of the exiting task is set to a state: TASK_DEAD and it is not entirely wiped off. That means that the parent can still gather information about the finished task. Removal of the task_struct altogether is managed by invoking release_task(). The release_task() has this following code:

RCU stands for Read-Copy-Update, a synchronization mechanism used in Linux. So whatever falls between rcu_read_lock() and rcu_read_unlock() is a read-side critical section – just wanted to show a real piece of the OS kernel code that establishes a critical section. release_task() also calls do_notify_parent() function which notifies the parent with SIGCHLD. This is followed by a final disposal of the task by a call to architecture specific release_thread() function. In x86, it involves freeing up vm86 irq for this task; whereas in ARM or PowerPC, this function does nothing (serious). And thus, a task reaches its demise.

So in the past three episodes, we discussed the rise and fall of a task or process. But what happens in between is more interesting. The next chapter would be on task scheduling.

Wednesday, September 01, 2010

In the last chapter, we looked at the basics of process or task in Linux kernel and with a brief overview of struct task_struct. Now we are going to discuss how the process or task gets created.In Linux, a new task can be created using fork() system call. fork() call creates an almost exact copy of the parent process. The differences are in pid (unique), and parent pointer. It creates an exact copy of task descriptor, and resource utilization limit (set to 0, initially). But there are a lot of parameters in the task descriptor that must not be copied to child. So the forking operation is implemented in the kernel using clone() system call, which in turn calls do_fork() system call in kernel/fork.c. do_fork() predominantly uses an unsigned long variable called clone_flag to determine what parameters need to be shared or copied. The first call made by do_fork() is copy_process(), which has the following code:

retval = security_task_create(clone_flags);

The first step of copy_process() is to call security_task_create() implemented in security/security.c. In this function, a structure called struct security_operations is used. This function takes clone_flags as input and determines if the current process has sufficient permission to create a child process. This is done by calling selinux_task_create() function in security/selinux/hooks.c, which also has a function current_has_perm() that takes clone_flag and check for several permissions using access vector cache - a component that provides caching of access decision computation in order to avoid doing it over and over again. If security_task_create() returns 0, copy_process() cannot create the new task.

The next step is to duplicate the struct task_struct by calling dup_task_struct(). Once the struct task_struct is duplicated, the values of the pointers to parent tasks that do not make sense in the child are set to appropriate values as explained in the subsequent paragraph. This is followed by a call to copy_creds(), implemented in kernel/creds.c. The copy_creds() copies the credential of parent to the child and at this point a new clone_flag parameter called CLONE_THREAD has to be introduced.

In Linux, there is no differentiation between threads and tasks (or processes). They are both created and destroyed in the same way, but handled a little differently. CLONE_THREAD flag says that the child has to be placed in the same group as the parent. The parent of the child process in this case is the same as the parent of the task that called clone() and not the task that called clone() itself. The process and session keyrings (handled later) are shared between all the threads in a process. This explains why struct task_struct has two pointers, namely real_parent and parent. In case of CLONE_THREAD flag set, real_parent points to the task that invoked the clone(), but beyond that this "real parent" does not have any control over the newly created task. The parent of the newly created task is marked as the task that created all these threads. getppid() system call brings that parent process and only the parent process gets SIGCHLD on termination of child, if it makes wait() system call. So the copy_creds() has to copy the credentials and keyrings of the common parent (not the real parent) for all these threads. So threads in Linux are processes that share parent and user space. Note that the thread we are talking about here are user threads that any program creates using pthread or zthread libraries. They are completely different from kernel threads created by Linux kernel for its internal operations.

After this, the function takes care of setting the CPU time utilized to zero and splitting the total CPU time from parent to give it to the child. Then sched_fork() function defined in kernel/sched.c is invoked. Process scheduling is a separate topic of discussion.

In this chapter, we looked at how a task is created in Linux kernel or rather what important operations are performed during task creation. In the next chapter, we will see task termination. In the meantime, you can put printk() statements in copy_process() function of kernel/fork.c and check out the kernel log and observe its working.

Sunday, August 22, 2010

The source code of Linux Kernel was something that fascinated me eight years ago and I am still addicted and I follow changes that happen to it. I am planning to write a series of post to explore the source code of the kernel. I hope that this would help budding developers and fans to understand and change the Linux Kernel and improve the operating system. But if you do so, please remember that you do it at your own risk. There is a general checkpoint in your evolution as a C programmer to check if you are in a position to change the kernel source and recompile it. Ask yourself: how many segmentation fault do I get, when I write a complex C code? how many times I got problems due to dangling pointer? If the numbers you get as answers to these questions are significantly high, it's a good idea to wait. Apart from that, I think that writing these articles would help me revisit many portions of the kernel and refresh my knowledge.

Here are some questions that I might ask myself with the answers. Who are you to write it? Have you submitted patches to kernel? No, yet to. Have you modified and recompiled the kernel? Yes, several times. Most of the changes are very specific to the kind of work I am doing. Submitting that as a patch might not help the community. What Linux do you use? I use a Debian 5.0. Have you been paid to modify the kernel? Yes, as part of my job or as a freelance. Do you like doing that? I love doing that.

The order in which I go through the kernel is in the same order in which Robert Love structured his book (because I personally like that book), with some variations here and there. You would see the exact copy of Linux Kernel code in many places. Copyrights of the code belong to its developers (check github) under GPL or one of its variants.

Chapter 1: Introduction - Tasks or Processes

In Linux terminology, task stands for process. Wikipedia defines a process as an instance of a computer program that is being executed. Lets call it task going forward, because many of the variables in kernel source are named so. In Linux, a task can have multiple concurrent threads, a task can spawn multiple children-tasks, and each task operates on virtual memory with most of the time no regards to the other tasks. A task is created by a clone() or fork() system calls and terminated by an exit() system call.

In Linux, the information about each task is preserved in a structure: struct task_struct in linux/sched.h. This is a huge structure because of the information it has to store. Some important variables in that structure are:

volatile long state; - defines the state of the task. Negative means the task is not runnable; zero means the task is runnable and can be scheduled; and positive means the task is stopped or suspended for some reason.

pid_t pid; - unique process id. pid_t is usually int.

struct mm_struct* mm; - user address space for the task

struct files_struct *files; - pointer to the files kept open by this task

struct thread_struct thread; - this is the CPU specific state of the current task. We will look at this later

struct held_lock held_locks[MAX_LOCK_DEPTH]; - array of locks held by the task. Typical value of MAX_LOCK_DEPTH is 48

struct task_struct *parent; - pointer to the parent of the current task (which would wait() for it)

struct list_head children; - list of children of current task

struct list_head sibling; - list of siblings of current task

Instances of struct task_struct is dynamically created by the kernel. Also kernel uses a macro called current to determine the struct task_struct of the current task. In a processor with less registers, like Pentium, the pointer to struct task_struct is stored at the top of the stack (if it grows up) and fetched by accessing stack pointer (ESP in case of x86). In a RISC processor like PowerPC or MicroBlaze with a lot of registers, the pointer to struct task_struct is stored in the register (r1 usually) for fast access.

The state of any task can be changed by the kernel using the macro: set_task_state(task, state); which is defined as follows:

Here set_mb is again architecture specific, just like current macro. If you are interested, you can find all the architecture specific macros in folder arch/##architecture##. Specifically, the implementation details of set_mb() can be found in asm/system.h and asm/barrier.h in every architecture.

Now let us write some simple code using what we learned. The following code traces back the ancestry of the current task and prints them in the kernel log:

The code listed above are computationally expensive, and should be used only for debug purposes. That's it for now. In the next chapter (expect it on 5th of September), we will see how a process can be created and how threads in Linux kernel are different from processes (are they different?).