Introduce new primitive-specific MutableVertex subclasses

Details

Description

As discussed on the list, MutableVertex<LongWritable,DoubleWritable,FloatWritable,DoubleWritable> (for example) could be highly optimized in its memory footprint if the vertex and edge data were held in a form which minimized Java object usage.

This is a toy version of LongDoubleFloatDoubleVertex, a proof of concept that you can get "SimplePageRankVertex extends LongDoubleFloatDoubleVertex" to pass its current unit tests without subclassing Vertex (and only using primitives internally!)

Jake Mannix
added a comment - 09/Sep/11 07:27 This is a toy version of LongDoubleFloatDoubleVertex, a proof of concept that you can get "SimplePageRankVertex extends LongDoubleFloatDoubleVertex" to pass its current unit tests without subclassing Vertex (and only using primitives internally!)

Note: this patch introduces a dependency on mahout-collections, a fairly tiny jar which contains a ton of primitive-collections stuff, pretty performance optimized (and with nearly every primitive/primitive pairing you might want, as it's autogenerated from a template)

Jake, this is pretty cool that you got it to work. Any thoughts on what to do next? Perhaps add this to the examples directory? Maybe do some sort of memory test to see the gains of a primitive subclass?

Avery Ching
added a comment - 11/Sep/11 06:23 Jake, this is pretty cool that you got it to work. Any thoughts on what to do next? Perhaps add this to the examples directory? Maybe do some sort of memory test to see the gains of a primitive subclass?

Yeah, I was going to mention that this patch is out of date now, as the github mirror of apache trunk hadn't caught up with the other commit. It should have caught up by now, so I'll re-pull and regenerate.

Avery: I would like to do some memory testing before we move forward with this, yes, that's the next step, after I rebase.

Jake Mannix
added a comment - 11/Sep/11 22:51 Yeah, I was going to mention that this patch is out of date now, as the github mirror of apache trunk hadn't caught up with the other commit. It should have caught up by now, so I'll re-pull and regenerate.
Avery: I would like to do some memory testing before we move forward with this, yes, that's the next step, after I rebase.

I tried to benchmark the memory footprint on this and was surprised to find that it's actually larger than the existing implementation!

At Jake's suggestion, I also added a dummy "TinyVertex" that just does the simplest thing possible, by keeping an array of primitives and resizing it when needed. This gives us the lower bound on what our memory utilization could look like. The answer is, we could be 30x more efficient in terms of memory utilization, at the cost of some CPU.

Dmitriy V. Ryaboy
added a comment - 12/Sep/11 00:28 I tried to benchmark the memory footprint on this and was surprised to find that it's actually larger than the existing implementation!
At Jake's suggestion, I also added a dummy "TinyVertex" that just does the simplest thing possible, by keeping an array of primitives and resizing it when needed. This gives us the lower bound on what our memory utilization could look like. The answer is, we could be 30x more efficient in terms of memory utilization, at the cost of some CPU.
Code is here: https://gist.github.com/1210245

Thank you for providing an interesting set of results. I agree that this can/will be an issue when there are lots of outgoing edges (common in real graphs).

The last grouping in your results especially is troubling:

Tiny: 10000 123592
Object: 10000 4290616
Primitive: 10000 4431744

It certainly appears that the edge relationship (dest id and edge value) data structure makes a big difference. However, I don't understand why the object and primitive results are similar, nor why using a TreeMap or org.apache.mahout.math.map.OpenLongFloatHashMap is so much higher overhead than a pair of primitive arrays.

Avery Ching
added a comment - 12/Sep/11 03:19 Thank you for providing an interesting set of results. I agree that this can/will be an issue when there are lots of outgoing edges (common in real graphs).
The last grouping in your results especially is troubling:
Tiny: 10000 123592
Object: 10000 4290616
Primitive: 10000 4431744
It certainly appears that the edge relationship (dest id and edge value) data structure makes a big difference. However, I don't understand why the object and primitive results are similar, nor why using a TreeMap or org.apache.mahout.math.map.OpenLongFloatHashMap is so much higher overhead than a pair of primitive arrays.

These numbers are much more reasonable =). Getting rid of the Edge class for users (still useful for storing the requests) seems like an easy win for the Object based Vertex (around a 20% memory reduction for edges). I think that for many applications, using a pair of arrays might be reasonable and provide a big win for memory reduction (i.e. page rank).

Now I'm really curious what was eating all the other memory though =)...

Avery Ching
added a comment - 12/Sep/11 06:02 Thanks for isolating to the edge data only.
These numbers are much more reasonable =). Getting rid of the Edge class for users (still useful for storing the requests) seems like an easy win for the Object based Vertex (around a 20% memory reduction for edges). I think that for many applications, using a pair of arrays might be reasonable and provide a big win for memory reduction (i.e. page rank).
Now I'm really curious what was eating all the other memory though =)...

Yeah, I don't know what's eating up the space in my Primitive class, but I bet it has something to do with my "wrapper" SortedMap I return when you call getOutEdgeMap(). This object should be lazily instantiated, but as it has a "get**" name to it, I wonder if the ObjectSizeCalculator is considering the returned value to be part of the object's memory? If so, then it *should be what we see. Dmitriy, can you try re-running your original test, on the real LongDoubleFloatDoubleVertex, but with getOutEdgeMap() implemented as just "return null;", and similarly with getMsgList()? If you test doesn't use those methods. My guess is that the measured size will drop down to close to what your results for "Primitive Map" are.

Regarding long[] and float[], it's actually not a totally crazy implementation, for the "final state" of the in-memory representation: most graph algorithms leave the graph structure alone (ie don't add or delete edges during iteration), in which case compacting the Vertex down to a pair of parallel sorted arrays after loading from the VertexReader is not such a unreasonable situation. Of course, it then requires that random-access reads to the outbound edges do a log(numOutEdges) binary search, but not all graph algorithms require lots of random access to specific edges (as opposed to bulk access to all edges).

I could imagine two separate primitive implementations - one based on a map-like structure (a la OpenXxxYyyHashMap), which gives random access to the edges (but then not sorted access! Note that the current code in my primitive vertex throws UnsupportedOperationException for lots of SortedMap methods, which happen to be non exercised in the unit tests), and one based on parallel arrays, which allows for very fast sequential read-only access, but slower random read-only access, and very much slower mutation speed. I wrote implementations like this for the Mahout Vector interface, and named them almost exactly like that: RandomAccessSparseVector, and SequentialAccessSparseVector (although I should have called the latter something like ReadOnlySequentialAccessSparseVector, but it would have been a lie, as you can modify it, you just shouldn't).

These numbers now make a lot of sense - Objects Are Big, that's the moral of the story. Any place you need to hang onto gazillions of things in memory on the JVM, you should do your best to stuff them into primitive arrays.

Jake Mannix
added a comment - 12/Sep/11 06:11 Yeah, I don't know what's eating up the space in my Primitive class, but I bet it has something to do with my "wrapper" SortedMap I return when you call getOutEdgeMap(). This object should be lazily instantiated, but as it has a "get** " name to it, I wonder if the ObjectSizeCalculator is considering the returned value to be part of the object's memory? If so, then it *should be what we see. Dmitriy, can you try re-running your original test, on the real LongDoubleFloatDoubleVertex, but with getOutEdgeMap() implemented as just "return null;", and similarly with getMsgList()? If you test doesn't use those methods. My guess is that the measured size will drop down to close to what your results for "Primitive Map" are.
Regarding long[] and float[], it's actually not a totally crazy implementation, for the "final state" of the in-memory representation: most graph algorithms leave the graph structure alone (ie don't add or delete edges during iteration), in which case compacting the Vertex down to a pair of parallel sorted arrays after loading from the VertexReader is not such a unreasonable situation. Of course, it then requires that random-access reads to the outbound edges do a log(numOutEdges) binary search, but not all graph algorithms require lots of random access to specific edges (as opposed to bulk access to all edges).
I could imagine two separate primitive implementations - one based on a map-like structure (a la OpenXxxYyyHashMap), which gives random access to the edges (but then not sorted access! Note that the current code in my primitive vertex throws UnsupportedOperationException for lots of SortedMap methods, which happen to be non exercised in the unit tests), and one based on parallel arrays, which allows for very fast sequential read-only access, but slower random read-only access, and very much slower mutation speed. I wrote implementations like this for the Mahout Vector interface, and named them almost exactly like that: RandomAccessSparseVector, and SequentialAccessSparseVector (although I should have called the latter something like ReadOnlySequentialAccessSparseVector, but it would have been a lie, as you can modify it, you just shouldn't).
These numbers now make a lot of sense - Objects Are Big, that's the moral of the story. Any place you need to hang onto gazillions of things in memory on the JVM, you should do your best to stuff them into primitive arrays.

we could do away with being tied to this TreeMap (although for now, keep it around in Vertex.java, as there's not much else possible in the generic object case, most likely), in addition to allowing me to remove my insane "pretend" SortedMap wrapper class.

Jake Mannix
added a comment - 12/Sep/11 06:19 So Avery, the question I have for you is regarding the getOutEdgeMap() method - if we get rid of that, and instead maybe offer something like the other methods discussed on the list thread:
E getEdge(I targetVertexId);
ImmutableList<I> getSortedOutVertices();
boolean removeEdge(I targetVertexId);
we could do away with being tied to this TreeMap (although for now, keep it around in Vertex.java, as there's not much else possible in the generic object case, most likely), in addition to allowing me to remove my insane "pretend" SortedMap wrapper class.

Avery Ching
added a comment - 12/Sep/11 06:37 I certainly agree that providing interfaces that reduce memory consumption on the edges makes a lot of sense (thanks for the evidence).
For supporting the all the implementations (especially primitive arrays implementation), perhaps
Iterator<Edge<I, E>> getOutEdgeIterator();
(Similar to the Pregel API makes sense). This would be easy to support with primitive arrays. Additionally, your suggested interfaces also make sense (slight modifications).
E getEdgeValue(I targetVertexId);
boolean removeEdge(I targetVertexId);
Iterator<Edge<I, E>> getSortedOutEdgeIterator();
What do you/others think? We should get others to weigh in as well, especially since it's a user facing API.

Jake Mannix
added a comment - 12/Sep/11 06:50 I like the Iterator more than ImmutableList, yeah, that's great. I wonder if then just making BasicVertex implement Iterable<Edge<I,E>> would be called for: for(Edge<I,E> edge : vertex)
{ ... }
? Not sure if that syntactic sugar is worth it.

Severin Corsten
added a comment - 12/Sep/11 07:00 I think this is a good idea. The most common use of the OutEdgeMap is to iterate over the values. The Map doesn't bring any additional information.
And I like Jakes idea bud I don't know whether it brings any advantage and should be implemented in addition to a getter.

One more possible change: instead of "boolean removeEdge(I targetVertexId)", how about "E removeEdge(I targetVertexId)" - returning the value of the edge removed (or null if there wasn't an edge to the target). Might come in handy to return the edge, even if it's usually ignored, and it's a little more information for the caller "for free".

Jake Mannix
added a comment - 12/Sep/11 07:07 One more possible change: instead of "boolean removeEdge(I targetVertexId)", how about "E removeEdge(I targetVertexId)" - returning the value of the edge removed (or null if there wasn't an edge to the target). Might come in handy to return the edge, even if it's usually ignored, and it's a little more information for the caller "for free".

Avery Ching
added a comment - 12/Sep/11 07:08 In regards to Jake's comment on BasicVertex implement Iterable<Edge<I, E>>, I don't have a strong opinion. Maybe too many ways to do the same thing (iterate over the edges)?

re: Iterable - I was thinking we'd just pick on way to iterate over the edges, not multiple: we could either have the getSortedOutEdges(), or simply iterator(), as a consequence of implementing Iterable, but not both.

Jake Mannix
added a comment - 12/Sep/11 07:31 re: Iterable - I was thinking we'd just pick on way to iterate over the edges, not multiple: we could either have the getSortedOutEdges(), or simply iterator(), as a consequence of implementing Iterable, but not both.
The benefits of Iterable is the nice for-loop, and plays nice with google collections api: Iterables.any() / .concat() / .transform() / .partition() / etc.
The downside is the non-informative method name.

newly regenerated patch, should apply cleanly against trunk. Tests still pass (yay), but still not terribly clean code. I would argue that until we nail down the getOutpuEdgeMap() API, this patch should be held off on merging.

But let's play with it more, and I think I'll turn Dmitriy's TinyVertex into another option in here, as it's a common use-case, and even smaller still than the current primitive container class.

Jake Mannix
added a comment - 12/Sep/11 17:52 newly regenerated patch, should apply cleanly against trunk. Tests still pass (yay), but still not terribly clean code. I would argue that until we nail down the getOutpuEdgeMap() API, this patch should be held off on merging.
But let's play with it more, and I think I'll turn Dmitriy's TinyVertex into another option in here, as it's a common use-case, and even smaller still than the current primitive container class.

I got it passing tests, but ran into a few things we may want to consider:

testing for existence of a target vertex is no longer possible: getEdgeValue(targetVertexId) returns the value associated with the edge. Edges are allowed to have null values and still denote a connection between the source and target vertex, right? Maybe we should just have an Edge<I, E> getEdge(I targetVertexId) method instead?

Secondly, far less importantly, is we need to have getNumOutEdges(), because clients often want to know the out-degree of the vertex, and they used to call getDestEdgeMap().size().

Thirdly: for the same reason that getEdgeValue() can return superfluous nulls, removeEdge(), used as a boolean, can trick the caller into thinking there was no connection to the target (because removeEdge() returned null), but really it's because I was trying to be clever and return the value which could be null. Having removeEdge() return the actual Edge fixes this.

I'll open another ticket for this stuff, as patching this patch seems a bit silly.

Jake Mannix
added a comment - 13/Sep/11 05:22 Ok, so I went ahead and made a 'straw-man' refactoring branch (on GitHub: https://github.com/jakemannix/giraph/tree/vertex_map_refactor ), removing the getDestEdgeMap() method, and having BasicVertex implement Iterable, as well as the random-access read method getEdgeValue(targetVertexId).
I got it passing tests, but ran into a few things we may want to consider:
testing for existence of a target vertex is no longer possible: getEdgeValue(targetVertexId) returns the value associated with the edge. Edges are allowed to have null values and still denote a connection between the source and target vertex, right? Maybe we should just have an Edge<I, E> getEdge(I targetVertexId) method instead?
Secondly, far less importantly, is we need to have getNumOutEdges(), because clients often want to know the out-degree of the vertex, and they used to call getDestEdgeMap().size().
Thirdly: for the same reason that getEdgeValue() can return superfluous nulls, removeEdge(), used as a boolean, can trick the caller into thinking there was no connection to the target (because removeEdge() returned null), but really it's because I was trying to be clever and return the value which could be null. Having removeEdge() return the actual Edge fixes this.
I'll open another ticket for this stuff, as patching this patch seems a bit silly.

Jake, I agree with all 3 of your points. The only question I have is regarding implementing Iterable. Will the user expect to get back a sorted iterator or a non-sorted one? I'm guessing we'll lose some functionality here? I don't think we would want to add a "switch" of sorts to choose. The other way we had discussed way the methods

Avery Ching
added a comment - 13/Sep/11 05:38 Jake, I agree with all 3 of your points. The only question I have is regarding implementing Iterable. Will the user expect to get back a sorted iterator or a non-sorted one? I'm guessing we'll lose some functionality here? I don't think we would want to add a "switch" of sorts to choose. The other way we had discussed way the methods
Iterator<Edge<I, E>> getOutEdgeIterator();
Iterator<Edge<I, E>> getSortedOutEdgeIterator();
as opposed to Iterable.

I'm suggesting that iterator() be always sorted. SortedMap implements Iterable (by way of Collection), and the iterator it returns is always in the sorted order. We'd have BasicVertex do the same thing.

Jake Mannix
added a comment - 13/Sep/11 05:41 I'm suggesting that iterator() be always sorted. SortedMap implements Iterable (by way of Collection), and the iterator it returns is always in the sorted order. We'd have BasicVertex do the same thing.

Also, to contradict my 1st and 3rd points, Dmitriy pointed out (in an out-of-band chat) that if we don't want to expose Edge to the user, because a) don't want to store it in memory (as his test showed that even switching TreeMap<I, Edge<I,E>> to TreeMap<I, E> reduced memory usage by a fair amount), and b) don't want to have to instantiate tons of useless objects by lazily creating them, we could instead just keep the getEdgeValue() and removeEdge() as they were, but also add a boolean hasEdge(I targetVertexId) to test for connection.

Then you get everything you need without exposing the Edge class (which only gets used internally for its Writable nature):

Jake Mannix
added a comment - 13/Sep/11 05:47 Also, to contradict my 1st and 3rd points, Dmitriy pointed out (in an out-of-band chat) that if we don't want to expose Edge to the user, because a) don't want to store it in memory (as his test showed that even switching TreeMap<I, Edge<I,E>> to TreeMap<I, E> reduced memory usage by a fair amount), and b) don't want to have to instantiate tons of useless objects by lazily creating them, we could instead just keep the getEdgeValue() and removeEdge() as they were, but also add a boolean hasEdge(I targetVertexId) to test for connection.
Then you get everything you need without exposing the Edge class (which only gets used internally for its Writable nature):
if(vertex.hasEdge(targetVertexId))
{
E value = vertex.getEdgeValue(targetVertexId);
vertex.removeEdge(targetVertexId);
}
etc...

Would we guarantee that iterator() would always be sorted? For instance, what about something like TinyVertex? A pair of arrays would be expensive to keep sorted...

Or are you suggesting that whether iterator is sorted or not depends on the implementation?

I agree that implementing Iterable is pretty convenient for users. Perhaps have both (implement Iterable and the two other methods)? This would allow applications based on TinyVertex to be optimized when not requiring sorting.

Avery Ching
added a comment - 13/Sep/11 05:51 Would we guarantee that iterator() would always be sorted? For instance, what about something like TinyVertex? A pair of arrays would be expensive to keep sorted...
Or are you suggesting that whether iterator is sorted or not depends on the implementation?
I agree that implementing Iterable is pretty convenient for users. Perhaps have both (implement Iterable and the two other methods)? This would allow applications based on TinyVertex to be optimized when not requiring sorting.

boolean hasEdge(I targetVertexId) is fine as an alternative. For the iterators though, we are returning an iterator of type Edge<I, E>, so we will have to create those Edge<I, E> objects on the fly for some implementations. I suppose that hasEdge() adds a bit of work though to check before adding or removing as opposed to returning the full edge. I think I'd lean a slight bit in the Edge<I, E> methods for consistency and reduced code. But I can be out-voted. =)

Avery Ching
added a comment - 13/Sep/11 05:57 boolean hasEdge(I targetVertexId) is fine as an alternative. For the iterators though, we are returning an iterator of type Edge<I, E>, so we will have to create those Edge<I, E> objects on the fly for some implementations. I suppose that hasEdge() adds a bit of work though to check before adding or removing as opposed to returning the full edge. I think I'd lean a slight bit in the Edge<I, E> methods for consistency and reduced code. But I can be out-voted. =)

Technically you shouldn't have to use hasEdge when adding and removing if you don't care. removeEdge() can return null ambiguously (value was null, or no such edge existed), and if you care, you can use hasEdge(), but if you don't, you don't. addEdge() can be an upsert.

Dmitriy V. Ryaboy
added a comment - 13/Sep/11 06:05 Technically you shouldn't have to use hasEdge when adding and removing if you don't care. removeEdge() can return null ambiguously (value was null, or no such edge existed), and if you care, you can use hasEdge(), but if you don't, you don't. addEdge() can be an upsert.

The alternative to Iterable<Edge<I, E>> is Iterable<I>, returning only the target vertices, and you can call getEdgeValue(targetVertexId) on any of these if you need it. Again, many algorithms will simply do something like

Jake Mannix
added a comment - 13/Sep/11 06:06 The alternative to Iterable<Edge<I, E>> is Iterable<I>, returning only the target vertices, and you can call getEdgeValue(targetVertexId) on any of these if you need it. Again, many algorithms will simply do something like
for(I targetId : vertex)
{
sendMsg(targetId, someFunction(baseMsg, getEdgeValue(targetId));
}
which is maybe a little nicer looking (or at least not uglier) than:
for(Edge<I, E> edge : vertex)
{
sendMsg(edge.getVertexId(), someFunction(baseMsg, edge.getValue());
}
And then there are no Edge objects hanging around.
Alternatively, Edge could act just like a typical Writable, and the Iterator<Edge<I, E>> iterates over the same Edge object setting different values on it as next() is called.

As for sorting, I'd imagine that assuming it always returns a sorted iterator is fine, but yes, some implementations I can imagine might not want to do that. I'd lean against having multiple iterators until it was known that they were needed, and maybe just document the ones which return nonsorted ones so that things don't get messed up?

Vertex subclasses are where the "algorithms" are implemented, right? So a Vertex knows whether it has a sorted iterator or not... the only question would be: are there generic methods implemented in things like BspServiceWorker, or GraphMapper, which would be expected to need to do things to a sorted iterator? Currently there are no such places that I can see. Without such cases, we could easily leave Vertex implementations to decide whether they needed to return sorted iterators or not.

Jake Mannix
added a comment - 13/Sep/11 06:13 As for sorting, I'd imagine that assuming it always returns a sorted iterator is fine, but yes, some implementations I can imagine might not want to do that. I'd lean against having multiple iterators until it was known that they were needed, and maybe just document the ones which return nonsorted ones so that things don't get messed up?
Vertex subclasses are where the "algorithms" are implemented, right? So a Vertex knows whether it has a sorted iterator or not... the only question would be: are there generic methods implemented in things like BspServiceWorker, or GraphMapper, which would be expected to need to do things to a sorted iterator? Currently there are no such places that I can see. Without such cases, we could easily leave Vertex implementations to decide whether they needed to return sorted iterators or not.

I'd caution against the approach of using a MutatorIterator (that's my name for that pattern. Like it? ).
It's effective, but leads to extremely confusing bugs when people try to do things like take the first three edges, etc. Presenting a familiar interface but providing a tricky unintuitive implementation is not super friendly to developers; I don't think we want people to have to study the API to such an extent they have to know these details.

Dmitriy V. Ryaboy
added a comment - 13/Sep/11 06:14 I'd caution against the approach of using a MutatorIterator (that's my name for that pattern. Like it? ).
It's effective, but leads to extremely confusing bugs when people try to do things like take the first three edges, etc. Presenting a familiar interface but providing a tricky unintuitive implementation is not super friendly to developers; I don't think we want people to have to study the API to such an extent they have to know these details.

Jake Mannix
added a comment - 13/Sep/11 06:25 What do you mean by the "MutatorIterator" pattern? Not being clear about whether it's sorted or not? Or forcing iterator() to always be sorted? Or something else, about the X-Men series?

Dmitriy V. Ryaboy
added a comment - 13/Sep/11 06:33 This:
Alternatively, Edge could act just like a typical Writable, and the Iterator<Edge<I, E>> iterates over the same Edge object setting different values on it as next() is called.

Jake Mannix
added a comment - 13/Sep/11 06:44 ah, yes. That can be a nasty pit of snakes for new developers, no matter how commonly it's found in Hadoop-land.
So I'll put in my vote for Iterable<I>, with your offline suggestion of boolean hasSortedIterator() (defaulting in BasicVertex to "return true;", overrideable in subclasses).
And I'll put in a patch on GIRAPH-31 so my code'll be where my mouth is (and we can continue this discussion on a ticket with a shorter thread [so far] ).

Ok another patch coming soon for this, but good news: this is the output of the object size calculator now:
(key: Primitive is what Dmitriy put in that test code, LDFD is a trivial class which extends the new LongDoubleFloatDoubleVertex class, and shows exactly the same memory as this)

Jake Mannix
added a comment - 15/Sep/11 19:19 I don't know what it was, I just re-patched with current trunk, after the refactorings of the most recent few patches. Memory use dropped to what it should be!

Jake Mannix
added a comment - 17/Sep/11 00:00 Most likely even if unit tests are passing, running on a real cluster with primitive-specific subclasses will fail if there are vertex range balancing happening.