Thursday, March 19, 2009

In case you are unfamiliar with ccache, it's a "compiler cache". Compiling is a primarily CPU-intensive task, so ccache caches compiled objects -- the next time we compile the same code, it reuses these objects, significantly speeding up compilation.

I usually need to recompile the Linux kernel several times a day, with different permutations of config settings. This almost always forces a 'make clean' or 'make mrproper', which deletes all compiled objects in the build tree, and then everything has to be rebuilt all over again. This takes an enormous amount of time. ccache to the rescue! I'm surprised I didn't use it earlier.

To use ccache to speed up Linux kernel compiles, all I had to do was 'yum install ccache' on my Fedora 10 system. This automatically created all the needed symlinks and PATH settings. After this, do 'make O=build bzImage; make O=build modules' as usual. The first time you won't notice any speed gain, as ccache is simply building up its cache. Subsequent rebuilds ('make clean; make') will be significantly faster. To monitor ccache stats, do 'ccache -s'.

If you see cache hits not increasing while you are compiling (and this is not the first compile), it might mean that ccache currently contains useless compiled objects. In such cases, if the cache size is already near the max cache size, it's better to clear the cache with 'ccache -C' (use 'ccache -z' to also zero the stats).

UPDATE: On Fedora 13 systems, the PATH is not automatically set after 'yum install ccache'. However, symlinks from ccache to the compiler names are still created, so you just need to add /usr/lib64/ccache to your PATH manually.

Wednesday, March 18, 2009

The Linux kernel includes a few SLAB allocator variants: SLAB, SLUB and SLOB. Of these, SLOB is meant especially for embedded devices -- it tries to be more space-efficient than the other variants.

Yesterday, I had a detailed look at the SLOB allocator for possible use in the compcache project and found it unacceptable for the purpose. I did this in response to feedback on the xvmalloc allocator, posted as part of the compcache patches for inclusion in the mainline Linux kernel: http://lkml.org/lkml/2009/3/17/116

The requirement of this project is allocation of a large number of randomly sized objects in the range [32, f * PAGE_SIZE], where the fraction 'f' is typically 3/4 and PAGE_SIZE is 4KB. SLOB falls short on several counts:

1) O(n) allocation: To find a block of a given size, SLOB _linearly_ scans the corresponding free list to find a page with _total_ free space >= the requested size. This free space might not be contiguous, so it then runs through the free blocks within each such candidate page until it finally finds one with a contiguous free area >= the requested size.

2) Fragmentation: When growing the SLOB cache, a page is added to one of 3 free lists (depending on the size we are currently allocating). After this, the page can never move to any other list -- even if its free space drops into the next range below, or vice versa. This has two problems:
- Potentially large wastage due to "page internal fragmentation": e.g. alloc(3096) is satisfied from a page in the 'large' free list. That page now has 1000 bytes free (assuming a 4K page) which will never be used.
- It can trigger unnecessary cache grows: even though we have such unfilled pages in the 'large' list, an allocation in the 'small' range can still grow the cache if the 'small' free list is empty.

3) It only allocates from "low memory". This is clearly not acceptable for compcache on 32-bit systems.

All GMail addicts know the power of tags! Surprisingly, today's systems have little or no tagging functionality for files. Fortunately, I found this freeware for Windows which lets you tag almost any file type. The interface is really good, with shell integration, and tag-based search is fast too. A real life saver!

Very old but still just as relevant... and very interesting too! Go directly to the "anti-tip" section of this article.

"The moral of the story is: don’t get tricky. C programmers often try to minimize the number of lines of C in their program without consideration for what the compiler will generate. When in doubt, write clear code and give the optimizer a chance to maximize performance. Look at the compiler output. Your code will be easier to debug and probably faster too."

One of my Fedora 10 systems used to freeze very frequently. After a lot of digging around I found it's because of "KWin Compositing", which provides OpenGL-driven special effects for the desktop. Unfortunately, Linux radeon drivers have always been poor, so it's better to disable these effects, especially if you have a Radeon video card.

in ~/.kde/share/config/kwinrc:

in the [Compositing] section, change Enabled=true to Enabled=false.

Reboot after this change. Now I never get any system freezes -- as expected from a solid Linux system :)

Friday, March 13, 2009

Recently I came across this paper published at OSDI '08. It's an extension to VMware's page sharing and shows some amazing, hard-to-believe results. VMware's page-sharing mechanism scans the memory of all VMs and maps pages with the same contents to a single page. This saves memory when multiple hosted VMs run the same OS. The technique discussed in this paper, however, also finds pages that are *nearly* the same: it keeps one base page and stores the other similar pages as deltas against it. Pages that are not similar to any other page are simply compressed. Their benchmarks show up to 45% more memory savings over ESX page sharing under some (specially crafted) workloads.

The idea is surely very interesting, but I have serious doubts whether it will really be that effective for real workloads. Below are my notes on this paper and some possible extensions to this work.

The following notes refer to the paper extensively, so keep it handy while reading (link to the paper at the beginning of this post).

Notes

Page sharing (for identical pages) by DE is by itself much more efficient than ESX's (at a scan rate of 10,000 pages/sec). See Figs. 9 and 11 in particular. Over a long run ESX eventually catches up, but its savings are much inferior *during* the benchmark.

For all 3 "real world" workloads (ident, MIXED-1, MIXED-2), the memory savings contributed by page sharing in DE within just the first few seconds are nearly 30-50%. This looks too good to be real.

They don't show data for any of the real workloads with only *one* of the memory-saving mechanisms active. DE compresses only the pages that are not "similar" to any other page -- what are the numbers if all pages are simply compressed? The paper only shows the contribution of each method (page sharing, delta encoding for similar pages, compression for unique pages) working together, not the effect of each method working alone.

They do show the effect of individual methods for artificial workloads (Figs. 5, 6, 7). Fig. 7(b) shows big savings with patching, but this is a case where pages are 95% "similar", and the authors do not say in what way they are similar. Are the changes confined to the end of the page, the beginning, or scattered at random offsets? For each of these cases the patching algorithm can produce significantly different patch sizes.

Comments

Nothing can guarantee good savings for an arbitrary workload. The authors' choice of "real world" mixed workloads looks good, though. A proper performance evaluation with these mixed workloads should be enough to show the strength of DE, but as noted above, this is not done properly in the paper.

DE uses xdelta for patching. A far superior algorithm is bsdiff, but it has much higher memory/CPU requirements. Work on a more efficient variant may be worthwhile.

Lots of performance numbers comparing different patching algorithms can be found in this thesis (pdf).

Conclusion

The paper does not conclusively prove that data de-duplication gives significant space savings over compression alone, even for its sample "real world" apps.

Future Work

Collect suspend images of various VMs with mixed workloads (similar to this paper) and check the effectiveness of these cases:

share identical pages only

delta-encode similar pages only (with sharing of identical pages)

compress individual pages only

All combined (showing individual contributions here).

Compare different patching algorithms in above cases.

Redundancy can be artificially created: a BWT (Burrows-Wheeler transform) followed by MTF (move-to-front) encoding is known to generate redundancy for various kinds of data. So, I expect more effective data de-duplication with the following modification to DE:

BWT+MTF creates more redundancy within each page and hence is expected to create more redundancy across pages. There is already a compression utility, rzip, that exploits this: among other things, it applies BWT+MTF and then compresses using a global dictionary. It's not useful as-is, but the idea can surely be reused.

Thursday, March 12, 2009

Memory compression has been my pet project for the last 2 years. It adds a compression layer to the swap path -- whatever the OS swaps out is compressed and stored in memory itself. This is a huge win over swapping to the slow hard disks typically used for swap. The biggest challenges: what to compress? how much to compress? how to manage the compressed blocks? how to handle incompressible data? ...and the list goes on.

This is all basically shifting load onto the processor to avoid slow disks. In the multi-core era this only gains relevance -- de/compression can be almost free if we use all those cores smartly :)