EFL memory consumption - MOO!

During development of EFL for 1.8, memory usage has increased along with new features. Here I went on a rampage with my new CoW.

Since September, EFL 1.8 has been under heavy development. The Samsung Israel team added Eo (a new object model that should help unify all EFL objects, and should have its own blog entry), and Profusion (now part of Intel) added Evas async rendering and Ephysics. Of course a bunch of other smaller changes went in, and overall the memory used in our tests grew quite a lot, from 5.4MB to 8MB. That rang a bell, and I started looking at what went wrong.

If you are already bored, the end result is that we are now back at 5.6MB and we should be able to gain another 300K to 400K before the release (something planned for April/May at this point). So now that you feel better, you can go back to scratching your under-arm hair, or spend some time reading the rest of this blog.

The first thing to do, when you are optimizing, is to compile a set of tests that will work for all revisions and give you numbers that you can trust throughout your development. I chose some elementary_test cases that look exactly the same in 1.7 and in our development branch (which is 1.8 in-the-making). From there I used Valgrind's massif tool and massif-visualizer. If you don't know about those, spend some time playing with them and learn how awesome they are!

And here were the winners (in terms of biggest memory footprints):

3MB of Eo objects

1.2MB of Evas_Object_Image

637KB of Evas_Object_Rectangle

465KB of Evas_Object_Text

345KB of Edje_Object

1.9MB of mempool

814KB of Edje Part

439KB of Eina_List for Eo type

350KB of Evas_Object callbacks

100KB of image draw command

463KB of Edje matching automate

370KB of image pixels data

The first thing that struck me was: what is this 439KB of memory for a stupid list of static strings? In fact, it was a shortcut taken during the development of Eo: instead of being part of the class information, the list was put in each object. Of course that was a bug, and it was quickly fixed after @tasn spanked @JackDanielZ.
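To make the cost of that shortcut concrete, here is a minimal sketch, not the actual Eo code. It assumes a hypothetical class with three type names and uses a hand-rolled linked list as a stand-in for Eina_List; every name in it is made up for illustration. It counts how many list nodes are needed when each object owns its list versus when the class owns it once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One node of a simple linked list of type names (stand-in for Eina_List). */
typedef struct Type_Node {
    const char *name;            /* points at a static string, never copied */
    struct Type_Node *next;
} Type_Node;

static size_t live_nodes = 0;    /* list nodes currently allocated */

static Type_Node *type_list_build(const char *const *names, size_t count)
{
    Type_Node *head = NULL;
    for (size_t i = count; i > 0; i--) {
        Type_Node *node = malloc(sizeof *node);
        assert(node != NULL);
        node->name = names[i - 1];
        node->next = head;
        head = node;
        live_nodes++;
    }
    return head;
}

static void type_list_free(Type_Node *head)
{
    while (head != NULL) {
        Type_Node *next = head->next;
        free(head);
        live_nodes--;
        head = next;
    }
}

/* Number of list nodes needed for `objects` objects of one class.
 * per_class == 0: every object builds a private list (the shortcut).
 * per_class != 0: the list is built once and shared through the class. */
static size_t type_nodes_needed(size_t objects, int per_class)
{
    /* Hypothetical type names, chosen only for the example. */
    static const char *const names[] = { "Base", "Evas_Object", "Rectangle" };
    const size_t count = sizeof names / sizeof names[0];
    size_t lists = per_class ? 1 : objects;
    size_t before = live_nodes;
    Type_Node **heads = calloc(lists ? lists : 1, sizeof *heads);
    assert(heads != NULL);

    for (size_t i = 0; i < lists; i++)
        heads[i] = type_list_build(names, count);
    size_t needed = live_nodes - before;

    for (size_t i = 0; i < lists; i++)
        type_list_free(heads[i]);
    free(heads);
    return needed;
}
```

With per-object lists the node count grows linearly with the number of objects, while the per-class variant stays constant no matter how many objects exist.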

So what went so wrong? Why are all those objects that big? The reason is quite simple. We added more and more flags and features. The objects then grew in total size, even if most of these added values are never changed or only changed in some rare cases. Of course @raster came up with the crazy idea of compressing objects on the fly in memory and decompressing them again... but instead of being crazy, what if we didn't duplicate memory for nothing in the first place? Thinking about it, most of those objects have exactly the same values, and even better, they never change them.

So after a quick round of grouping data in their structures when they are used together, I came up with the idea of doing a kind of "copy on write" infrastructure. The idea is that most of our accesses are reads and very few are writes, especially in the hot path. After a few rounds of design, Eina_Cow was born. It then started to roll into Evas. The result is that now we have:
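The copy-on-write idea can be sketched in a few lines of plain C. This is only an illustration under my own assumptions and deliberately does not use the Eina_Cow API: the Rare_State structure, its fields and all function names are hypothetical. Every object starts by pointing at one shared default block and only gets a private copy the first time somebody writes to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical rarely-changed state; most objects never touch it. */
typedef struct {
    double scale;
    int clip_w, clip_h;
    unsigned char r, g, b, a;
} Rare_State;

/* One shared, read-only default for every untouched object. */
static const Rare_State rare_default = { 1.0, 0, 0, 255, 255, 255, 255 };

typedef struct {
    int x, y, w, h;            /* hot data stays inline in the object */
    const Rare_State *rare;    /* shared default until the first write */
} Object;

static size_t private_blocks = 0;  /* private copies currently allocated */

static void object_init(Object *o)
{
    o->x = o->y = o->w = o->h = 0;
    o->rare = &rare_default;   /* no allocation for the common case */
}

/* Return a writable block, copying the shared default on first write. */
static Rare_State *object_rare_write(Object *o)
{
    if (o->rare == &rare_default) {
        Rare_State *copy = malloc(sizeof *copy);
        assert(copy != NULL);
        *copy = rare_default;
        o->rare = copy;
        private_blocks++;
    }
    return (Rare_State *)o->rare;  /* safe: this block came from malloc */
}

static void object_shutdown(Object *o)
{
    if (o->rare != &rare_default) {
        free((void *)o->rare);
        private_blocks--;
    }
    o->rare = &rare_default;
}

/* Create `total` objects, write to the first `modified` of them, and
 * report how many private copies that required. */
static size_t cow_copies_needed(size_t total, size_t modified)
{
    Object *objs = malloc((total ? total : 1) * sizeof *objs);
    assert(objs != NULL && modified <= total);
    for (size_t i = 0; i < total; i++)
        object_init(&objs[i]);
    for (size_t i = 0; i < modified; i++) {
        object_rare_write(&objs[i])->scale = 2.0;
        object_rare_write(&objs[i])->clip_w = 64; /* second write: no new copy */
    }
    size_t copies = private_blocks;
    for (size_t i = 0; i < total; i++)
        object_shutdown(&objs[i]);
    free(objs);
    return copies;
}
```

Readers go through the const pointer and never need a NULL check, because there is always a valid block to read from. Only objects that are actually written to pay for their own copy.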

400K of Evas_Object_Image

300K of Evas_Object_Rectangle

289K of Evas_Object_Text

228K of Edje_Object

And 479K of modified data. The next stage would be to run some memory comparison functions during idle time to merge modified data back together where duplicates within the modified section are found. I also didn't pay attention to Evas_Object_Text and Edje_Object. Their size went down just because Evas_Object sizes were reduced. That first improvement gave us back 1MB of memory.

What I learned when rolling in Eina_Cow is that the only source of bugs is code that uses the stack or memcpy()s parent structures to duplicate references without telling Eina_Cow about it. The rest of the changes are pretty straightforward and easy to do. The interesting part is that most of the code logic stays untouched, and there is no need to add tests for NULL all over the place.
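That class of bug is easy to reproduce with any reference-counted block. The sketch below is a generic illustration, not Eina_Cow internals: Shared, Parent and the helper functions are hypothetical names. It compares a raw structure copy with a copy that informs the reference count.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared block with a reference count. */
typedef struct {
    int refcount;
    int value;
} Shared;

/* Hypothetical parent structure that holds one reference. */
typedef struct {
    int id;
    Shared *state;
} Parent;

static void parent_init(Parent *p, int id, int value)
{
    p->state = malloc(sizeof *p->state);
    assert(p->state != NULL);
    p->state->refcount = 1;
    p->state->value = value;
    p->id = id;
}

/* Tracked duplication: the shared block learns about the new reference. */
static void parent_copy(Parent *dst, const Parent *src)
{
    *dst = *src;
    dst->state->refcount++;
}

/* Drop one reference; returns 1 if the shared block was freed. */
static int parent_release(Parent *p)
{
    int freed = 0;
    if (--p->state->refcount == 0) {
        free(p->state);
        freed = 1;
    }
    p->state = NULL;
    return freed;
}

/* Raw copy: two parents point at the block, but the count still says one. */
static int refs_after_raw_copy(void)
{
    Parent a, b;
    parent_init(&a, 1, 42);
    b = a;                         /* same effect as memcpy(&b, &a, sizeof b) */
    int refs = a.state->refcount;  /* the count never saw the second reference */
    (void)b;
    parent_release(&a);            /* frees the block; b.state now dangles */
    return refs;
}

/* Tracked copy: the count matches the number of parents. */
static int refs_after_tracked_copy(void)
{
    Parent a, b;
    parent_init(&a, 1, 42);
    parent_copy(&b, &a);
    int refs = a.state->refcount;
    parent_release(&a);            /* block survives, b still owns it */
    parent_release(&b);            /* last reference: block is freed */
    return refs;
}
```

In the raw-copy case the first release frees the block while the second structure still points at it, which is exactly the kind of stale reference that a stack copy or memcpy() creates behind the infrastructure's back.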

Logically the next place to get the Cow treatment was Edje Parts. In these, we use a cache of previous calculations to avoid a lot of extra work. That cache was pretty big. It includes values for Evas_Map or Ephysics even if most objects don't use them. That was about 400 bytes per part per Edje object. By using Eina_Cow we got that one out and now Edje Parts use 464K.

I realized that Edje was duplicating much more data than needed. First, a small bug was duplicating the program string match per object when it really should have been per class of object. That was small, but still 100K. The big one was Edje signal callbacks. Elementary always sets the same triplet of signal, source and function, which doesn't change, plus a data pointer that does change. We had lazily implemented one string match per object even though the match was always the same, as it only matched on signal and source. I decided to implement the full logic to avoid duplicating those matches by detecting when the callbacks were the same. This was a difficult task, as we do a lot of registering/unregistering of callbacks and have many optimized paths. But I managed to do it and saved another 463KB.
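The de-duplication can be pictured as interning the (signal, source, function) triplet. What follows is a simplified sketch and not the Edje implementation: it uses a plain linked list instead of Edje's matching automaton, assumes the signal and source strings outlive the match (static or otherwise kept alive), and all names and the example strings are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef void (*Signal_Cb)(void *data, const char *signal, const char *source);

/* One shared match entry per distinct (signal, source, func) triplet. */
typedef struct Match {
    const char *signal;   /* assumed to outlive the match */
    const char *source;
    Signal_Cb func;
    int refcount;
    struct Match *next;
} Match;

static Match *matches = NULL;
static size_t match_count = 0;   /* distinct matches currently alive */

/* Find an identical match or create it; either way take one reference. */
static Match *match_ref(const char *signal, const char *source, Signal_Cb func)
{
    for (Match *m = matches; m != NULL; m = m->next) {
        if (m->func == func && strcmp(m->signal, signal) == 0 &&
            strcmp(m->source, source) == 0) {
            m->refcount++;
            return m;
        }
    }
    Match *m = malloc(sizeof *m);
    assert(m != NULL);
    m->signal = signal;
    m->source = source;
    m->func = func;
    m->refcount = 1;
    m->next = matches;
    matches = m;
    match_count++;
    return m;
}

/* Drop one reference; unlink and free the match when nobody uses it. */
static void match_unref(Match *match)
{
    if (--match->refcount > 0)
        return;
    for (Match **link = &matches; *link != NULL; link = &(*link)->next) {
        if (*link == match) {
            *link = match->next;
            break;
        }
    }
    free(match);
    match_count--;
}

/* What each object keeps: a pointer to the shared match plus its own data. */
typedef struct {
    Match *match;
    void *data;
} Callback;

static void on_clicked(void *data, const char *signal, const char *source)
{
    (void)data; (void)signal; (void)source;
}

/* Register the same triplet on `objects` objects with different data and
 * report how many match entries were needed. */
static size_t matches_needed(size_t objects)
{
    Callback *cbs = malloc((objects ? objects : 1) * sizeof *cbs);
    assert(cbs != NULL);
    for (size_t i = 0; i < objects; i++) {
        cbs[i].match = match_ref("example,clicked", "example", on_clicked);
        cbs[i].data = &cbs[i];   /* per-object data is the only varying part */
    }
    size_t needed = match_count;
    for (size_t i = 0; i < objects; i++)
        match_unref(cbs[i].match);
    free(cbs);
    return needed;
}
```

The per-object cost shrinks to one pointer plus the data, and the reference count is what makes the frequent registering and unregistering safe in this simplified model.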

So our savings in summary are:

System                   Before    After
Evas_Object_Image        1200K     400K
Evas_Object_Rectangle     637K     300K
Evas_Object_Text          465K     289K
Edje_Object               345K     228K
Edje_Part                 814K     464K
String Matches            100K       0K
Eo type strings           439K       0K
Eina_Cow                    0K     479K
Total Saving                      1840K

At this point, we are almost back to where we were before, at 5.6MB, but with a lot of new cool features in. Looking at what is left, there is some hope of getting rid of a big chunk of the memory allocated for Evas_Object callbacks, and the image pixel data should also be shared with other processes thanks to Evas Cserve2 and Profusion's work (that would really be worth another post on its own). The idea for Evas_Object callbacks is also to de-duplicate them, as we register a lot of them together, always with the same values except for data. So we are going to have a way to register a static array of callbacks for an object, and that alone should reduce our memory usage for callbacks dramatically.
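As a rough idea of what registering a static array of callbacks could look like, here is a sketch under my own assumptions. The API for this did not exist when this was written, so Cb_Desc, Cb_Array and every function below are hypothetical. The table of (event, function) pairs lives once in read-only memory, and each object stores only a pointer to it plus its own data.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EV_MOUSE_DOWN, EV_MOUSE_UP, EV_RESIZE } Event;
typedef void (*Object_Cb)(void *data, Event event);

/* One row of the shared table: which function handles which event. */
typedef struct {
    Event event;
    Object_Cb func;
} Cb_Desc;

/* What each object stores: one pointer to the table plus its own data. */
typedef struct {
    const Cb_Desc *descs;
    size_t count;
    void *data;
} Cb_Array;

/* Example handler: bumps the int counter passed as data. */
static void count_event(void *data, Event event)
{
    (void)event;
    (*(int *)data)++;
}

/* Shared by every object of this kind; never duplicated. */
static const Cb_Desc widget_cbs[] = {
    { EV_MOUSE_DOWN, count_event },
    { EV_MOUSE_UP,   count_event },
    { EV_RESIZE,     count_event },
};

static void cb_array_set(Cb_Array *reg, const Cb_Desc *descs, size_t count,
                         void *data)
{
    reg->descs = descs;
    reg->count = count;
    reg->data = data;
}

/* Call every handler registered for `event`; returns how many ran. */
static size_t cb_array_dispatch(const Cb_Array *reg, Event event)
{
    size_t called = 0;
    for (size_t i = 0; i < reg->count; i++) {
        if (reg->descs[i].event == event) {
            reg->descs[i].func(reg->data, event);
            called++;
        }
    }
    return called;
}

/* Two objects share one table but keep separate data. */
static int cb_array_demo(void)
{
    int a_hits = 0, b_hits = 0;
    Cb_Array a, b;
    const size_t n = sizeof widget_cbs / sizeof widget_cbs[0];
    cb_array_set(&a, widget_cbs, n, &a_hits);
    cb_array_set(&b, widget_cbs, n, &b_hits);
    cb_array_dispatch(&a, EV_MOUSE_DOWN);
    cb_array_dispatch(&a, EV_MOUSE_UP);
    cb_array_dispatch(&b, EV_RESIZE);
    return a_hits * 10 + b_hits;   /* a got two events, b got one */
}
```

Compared with one heap-allocated node per callback per object, the per-object cost in this sketch is a single small structure regardless of how many callbacks the table contains.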

Now that you have read this far, you are probably wondering why we care so much about memory. Why do we think that it is not OK to add features and accept a bigger memory footprint in return? Why does it really matter when we have multiple GB of memory available? Why spend time on such useless optimization? No. Seriously. Why?

Well, the answer is simple: speed and power consumption. Most of our tasks are memory bound. Using less memory gives us more room for doing the actual rendering. The CPU is actually so insanely fast that one core is almost able to fill all memory bandwidth for most rendering operations. By using less memory, we hope to use less memory bandwidth for things that don't really matter and then have more bandwidth available for things that do. So before we start using S2TC in our software engine, using less data for everything else is clearly a good move.

As for power consumption, thanks to my work at Samsung, I now know that using memory is much more costly on battery than using the CPU cache. In fact, every level of CPU cache uses more power than the previous one, so the more you stay in L1, the better. Of course this directly affects performance as well, so measuring performance is a simple way of measuring potential power consumption. Of course L1 is too small to put everything inside, but you get the idea. Being smaller means less battery usage. Also, on a mobile phone, the bigger the main memory, the more battery it uses, even if you don't access it. So if your system uses less memory, you can ship it with less memory (yeah, the marketing department is going to hate us, they can't play the game of bigger numbers...) and have more battery life with the same kind of applications.