REST Layer

Description

This is a native REST layer for Cassandra implementing AbstractCassandraDaemon.

It uses JAX-RS fueled by Apache CXF.

Presently it supports the following operations as JSON over HTTP:

Create keyspace

Drop keyspace

Create column family

Drop column family

Insert row

Fetch row

Delete row

Insert column

Delete column

Fetch column

The patch creates a new project in contrib/rest. You can compile the project using "ant", which uses ivy to pull in dependencies. To get set up, you can also use the pom.xml file and m2eclipse to import it into Eclipse.

Once compiled, simply run "bin/rest_cassandra" and follow along in the README.txt.
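
As a rough illustration of what "JSON over HTTP" could look like from a client, here is a minimal Python sketch that builds (but does not send) requests for three of the operations listed above. The base URL, the keyspace/column-family/row/column path layout, and the choice of HTTP verbs are all assumptions made for this example; the actual routes are the ones described in the patch's README.txt.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: host, port, and path layout are hypothetical placeholders.
BASE_URL = "http://localhost:8080"


def build_request(method, *path_parts, body=None):
    """Build (without sending) a JSON-over-HTTP request for a hypothetical route."""
    url = BASE_URL + "/" + "/".join(quote(part, safe="") for part in path_parts)
    headers = {"Accept": "application/json"}
    data = None
    if body is not None:
        # Only requests that carry a payload declare a JSON content type.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


# Hypothetical mapping of three listed operations onto HTTP verbs.
insert_row = build_request("PUT", "demo_ks", "users", "jsmith",
                           body={"name": "John", "state": "PA"})
fetch_row = build_request("GET", "demo_ks", "users", "jsmith")
delete_column = build_request("DELETE", "demo_ks", "users", "jsmith", "state")
```

Passing any of these objects to `urllib.request.urlopen` would send it; that step is left out because it needs a running server.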

Activity

Eric Evans
added a comment - 19/Oct/11 20:35 Hi Brian,
I've only briefly glanced over the code, but what I see looks good. It's great that you've already taken care of things like build and docs.
However, I'm not sure where (or even if) this should live within Cassandra. contrib/ would be the right place I think, except that we're (and have been) in the process of trying to eliminate that.
And, the impetus for removing contrib/ was that it was a place of second-class citizenship, with mixed expectations that didn't reflect well on Cassandra, or the authors of the contributed code. In other words, it should either be fully supported, or maintaining it out of tree is probably in everyone's best interest.
Have you considered maintaining this as a separate project? It seems as though it would be pretty easy, logistically speaking.

Jonathan Ellis
added a comment - 19/Oct/11 20:44 My criterion for maintaining a REST API in-tree would be, I'd need to have a clear understanding of what problem the REST API solves that the CQL one does not. E.g. if I'm writing Python code why would I use the REST API instead of the Python dbapi driver?

Brian ONeill
added a comment - 19/Oct/11 21:08 Thanks for the comments guys.
The reason REST/JSON support is important is articulated well here:
http://nosql.mypopescu.com/post/411195754/nosql-protocols-are-important
That article boils down to three points: ease of use, standardization across interfaces, and decreased number of application dependencies.
Enterprises typically have technology ecosystems that have many services/capabilities (e.g. ours includes Neo4j and SOLR). That ecosystem includes third-party services as well as internal ones. It helps if you can standardize on interfaces across the entire ecosystem. Many people are standardizing on JSON/HTTP (as evidenced by support in CouchDB, MongoDB, Neo4j, SOLR, Elastic Search, etc.).
That standardization decreases the integration/adoption cost. Since many languages have native support for REST calls, typically an application can consume the capability without adding any additional application dependencies (e.g. drivers).
Also, JSON/HTTP is especially nice if the data is making it out to the web. Many javascript frameworks can natively consume the data. Even if the browser/javascript isn't hitting the database directly (via HTTP), the services layer in between often can just become a proxy of sorts (this has happened with us with SOLR).

Brian ONeill
added a comment - 19/Oct/11 21:17 I was originally going to maintain it as a separate project, before Gary Dusbabek chimed in and pointed out that it is inefficient to deploy a separate server (and use something like Hector) because you would end up marshaling the document twice with an added hop.
That is when I decided to write this as a patch for the tree, since it would need to stay in lock-step with the CassandraServer object (the REST calls would come into the same VM and make the calls in-thread/natively against the Java objects).
Ideally, this wouldn't remain in the contrib package. I think it would be better to get it integrated into the main tree (so as not to have mismatched expectations). From the user's perspective, I would expect to add an optional configuration parameter in the yaml file that would start the REST server on the specified port when the parameter is defined.

Jonathan Ellis
added a comment - 19/Oct/11 21:21 Saying "Everything is REST" is about as useful as saying "Everything is TCP." It's fine for a driver to be REST-based but that doesn't make, say, Cassandra and Neo4j interchangeable. So I don't see the advantage from a development point of view. Nor do I see "import cql" as a deal breaker over "import urllib2" (especially since the former gives you a much better experience.)
Javascript frameworks is a valid point, although I hope everyone agrees that the browser hitting the db directly is a colossally bad idea. I can see though that it would simplify the proxy if it just has to decide "accept/reject" on a REST query vs translate REST into another API.
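
To make the "accept/reject" idea concrete, here is a minimal Python sketch of the decision such a proxy might make before forwarding a request unchanged. The allow-list, the path layout, and the host name are hypothetical; nothing here comes from the patch itself.

```python
from urllib.parse import urlsplit

# Hypothetical policy: web clients may only read rows from one column family.
ALLOWED = {("GET", "/demo_ks/users/")}


def accept(method, url):
    """Return True if the request may be forwarded as-is to the REST layer."""
    path = urlsplit(url).path
    # Refuse parent-directory segments so a prefix match cannot be escaped.
    if ".." in path.split("/"):
        return False
    return any(method == allowed_method and path.startswith(prefix)
               for allowed_method, prefix in ALLOWED)
```

The proxy never has to understand the payload or translate it into another API; it only inspects the verb and the path.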

Brian ONeill
added a comment - 19/Oct/11 21:23 One last point... (then I'll shut up)
We are literally trying to decide between CouchDB and Cassandra by Monday. Because CouchDB has a REST interface, we had it integrated into our application in hours. But we love Cassandra, and the hadoop/pig integration potential. But for obvious reasons, I'd like to avoid deploying/supporting/maintaining both if I can.
If Cassandra had a REST layer, we'd have "one database to rule them all" and we'd be able to pick our interface mechanism depending on who/what needed to operate on and access the data. The web front-end could come in through REST, while the backend services use Pig/Hadoop to do the heavy lifting.

Jonathan Ellis
added a comment - 19/Oct/11 21:36 Are you developing in a niche environment that doesn't have a decent client already? Because I don't see "Because X has a REST interface, we can integrate it easily" as cause and effect at all, or rather, the same would be true for any decent client and not just a REST one.
I'm really trying to understand here if I'm missing something important or if your team just happens to really have a thing for REST.

Eric Evans
added a comment - 19/Oct/11 22:04 I have no doubt that a REST interface is useful to some, but the question is, does it belong integrated directly into Cassandra?
I don't have any hard numbers, but requests for a REST interface have been intermittent, and come from what I perceive to be a small minority. I've also mentioned elsewhere that I believe some of these people were asking for REST, but were really looking for simpler client abstractions (which CQL now provides).
What is certain is that most Cassandra users are very interested in performance, and make use of Cassandra's rich types, both of which are sacrificed for this REST interface.
Again, that doesn't invalidate your approach, but it suggests to me that it isn't broadly applicable.
Another factor when asking if this should be integrated is how feasible it would be to maintain this outside of Cassandra. I could be missing something, but it seems like it would be quite easy.

Brian ONeill
added a comment - 19/Oct/11 22:41 I'll admit. My team does have a thing for REST.
We process hundreds of data feeds, all with different schemas. We then cleanse, standardize, de-dupe, transform, match, and consolidate that information using systems and processes across languages and platforms. (which is why JSON is important to us)
http://www.healthmarketscience.com/insights/mdm-paradoxical-problem
Looking at our end-to-end processing, here are three other examples that may be more universal. (increasing in priority/value) The general theme of these is that there are many off-the-shelf tools out there that can integrate with JSON/XML over HTTP "out of the box". Not all of them are easily extended to accommodate different drivers/data sources.
TESTING:
We use SOAP UI to test. We use it to smoke test and validate environments. It can easily hit REST interfaces on all our systems (and third party tools) to verify everything is in a good state and has the proper data populated.
ETL TOOLS:
Many GUI-based ETL Tools (e.g. Talend) can extract and load via XML/JSON over HTTP. They grab/push the data over HTTP natively. If Cassandra had a native REST interface, we could tap into it directly. Without it, we'll be looking at some additional process / plugins to make this happen. (PITA) The users of these systems are not developers, they are data stewards that just want to be able to point to a url and get access to the data from their GUI.
ESB / Workflow Tools:
Similarly, many of the ESB tools (e.g. Sonic ESB, ServiceMix and Mule) can also grab data via XML/JSON over HTTP. The workflow/process can then operate on that data, (mapping elements to inputs of services) without having to write any code. Again, to integrate Cassandra into some of those service orchestrations / mashups, we'll need to develop a plugin if we can't get to it via REST.
(Any more compelling? =)

Brian ONeill
added a comment - 19/Oct/11 22:57 As for "Does it belong integrated?"...
I agree. As it stands, the elementary interface that is implemented thus far hides some of the rich types, but I would expect that we would continue to enhance the layer to provide access to those rich types and it would end up much like how SOLR exposes the underlying features and functions of Lucene via its REST interface.
As for demand, I think there is significant interest; enough to spawn up projects:
http://code.google.com/p/restish/
https://github.com/stinkymatt/Helena
http://www.onemanclapping.org/2010/09/restful-cassandra.html
https://github.com/gdusbabek/cassandra
I think these are good, but the development is fragmented and is not native. (the first two use Hector I believe)
By pulling the effort into the main tree, we can increase the size of the community and consolidate the effort behind it.

Jonathan Ellis
added a comment - 19/Oct/11 23:53 I also remember that Mozilla cited REST support as a reason for using Riak ( http://blog.mozilla.com/data/2010/05/18/riak-and-cassandra-and-hbase-oh-my/ ). Not sure how valid that is but it wouldn't be surprising if they were dealing with a lot of JSON data too.

Eric Evans
added a comment - 20/Oct/11 00:45 - edited
I agree. As it stands, the elementary interface that is implemented thus far hides some of the rich types, but I would expect that we would continue to enhance the layer to provide access to those rich types and it would end up much like how SOLR exposes the underlying features and functions of Lucene via its REST interface
How are you going to implement Cassandra's type system without implementing schema? Once you drag schema into the mix, how will you justify this approach when it's no longer possible to plug-and-play existing systems, or drive queries with curl?
As for demand, I think there is significant interest; enough to spawn up projects:
http://code.google.com/p/restish/
https://github.com/stinkymatt/Helena
http://www.onemanclapping.org/2010/09/restful-cassandra.html
https://github.com/gdusbabek/cassandra
I believe the project in that first link is from Courtney Robinson, and I believe that he now advocates CQL (he started work on a CQL driver, and stopped working on that). I'm not sure what the story is behind the second link, other than it doesn't appear to have generated much interest (based on forks and watchers).
The last two links are both from Gary Dusbabek (a member of the Cassandra PMC). This was a proof-of-concept that he only spent a few hours on, and one that he decided not to continue with. It might be worth asking him why.
Honestly, I think this does more to prove why a REST interface does not belong integrated in Cassandra.
It is not enough to simply have code. That code needs to be maintained, and it needs more than one person who cares enough to make sure that happens. It also needs to be documented, and supported on all the usual forums, again, something that will require a little more critical mass.
And, introducing choice can be a Good Thing, but it can also be a Bad Thing. We need to know that this is going to be useful enough, to a large enough audience, to offset the confusion it will almost certainly generate. Think of the people who are going to be compelled to ask which interface they should use, and who are going to have to weigh the pros and cons and then make a choice (and remember that this would bring us from 2, to 3 choices of interface). Think of the users who are going to make assumptions about semantics or performance characteristics based on one interface or the other, only to find it's not applicable to their choice.
There is a cost associated with this, and it's fair to ask the hard questions to ensure it's worth it.
You've also repeatedly alluded that not having this accepted as part of the project is something of a deal breaker. Why? Now that you have this code, doesn't it solve your particular needs?
I won't veto this if consensus is that it should be added, but I'm still not convinced that this will succeed where the other attempts have failed. Nor am I convinced that the only way to establish success is by committing it first. What would convince me is a moderately successful externally maintained project, with a modicum of users.

Brian ONeill
added a comment - 20/Oct/11 02:17
I can appreciate that perspective. When I first posted the thought of a REST layer to the dev list, Gary immediately responded with his thoughts and Jeremy asked I respond to the list when I had something. Based on their responses and Matt's link to his Helena project, I may have overestimated the demand/need. I'll reach out to Gary to get his input, but I don't mind letting this JIRA issue brew for a while to see if there is interest.
Acceptance is certainly not a deal breaker. Like you said, this code solves our needs and we'll continue to extend it. I can throw it out into an open source project to see if it sticks. Any preferred forum for that project?
(One final note that I posted today to the user list; we could potentially use this REST layer as an integration point for Elastic Search, much the way CouchDB integrated as a river, http://www.elasticsearch.org/tutorials/2010/08/01/couchb-integration.html . We may try to head that way depending on what is available in DataStax Enterprise. I'll let you guys know if that manifests itself.)

Jonathan Ellis
added a comment - 20/Oct/11 02:49 How are you going to implement Cassandra's type system without implementing schema?
If I were going to do it, I'd do it the way CQL does, which is to use the to/fromString methods to attempt to parse the json string as what the schema says it should be.
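
A minimal Python sketch of that idea: the server holds the schema and tries to parse each JSON value as the type the schema names, rejecting values that do not parse. The `SCHEMA` mapping and the `from_json_row` helper are hypothetical stand-ins for Cassandra's per-type fromString methods, not its actual API.

```python
import json
import uuid

# Hypothetical server-side schema: column name -> parser for that column's type.
SCHEMA = {
    "age": int,
    "session": uuid.UUID,
    "name": str,
}


def from_json_row(payload, schema=SCHEMA):
    """Parse a JSON object and coerce each value to the type the schema declares."""
    row = json.loads(payload)
    typed = {}
    for column, raw in row.items():
        # Columns missing from the schema are kept as plain strings.
        parser = schema.get(column, str)
        try:
            typed[column] = parser(str(raw))
        except ValueError as exc:
            raise ValueError(f"column {column!r}: cannot parse {raw!r}") from exc
    return typed


row = from_json_row(
    '{"age": "34", "session": "4c330d52-5f8e-4084-918c-8dd727014ab9", "name": "John"}'
)
```

The client sends and receives only strings and numbers; all knowledge of the column types stays on the server side.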

Eric Evans
added a comment - 20/Oct/11 03:25
If I were going to do it, I'd do it the way CQL does, which is to use the to/fromString methods to attempt to parse the json string as what the schema says it should be.
Right, my point was that once you've built out the client code to introspect Cassandra's schema and marshal to string and back again, the curl examples stop looking so straightforward, and "understanding http + json" isn't good enough anymore.

Eric Evans
added a comment - 20/Oct/11 03:37
Acceptance is certainly not a deal breaker. Like you said, this code solves our needs and we'll continue to extend it. I can throw it out into an open source project to see if it sticks. Any preferred forum for that project?
We recently moved drivers to external projects maintained on Apache Extras ( http://code.google.com/a/apache-extras.org/hosting ). We put them there mostly for branding purposes. Other than that, I'd stick with whatever works best and/or is easiest for you.
(One final note that I posted today to the user list; we could potentially use this REST layer as an integration point for Elastic Search, much the way CouchDB integrated as a river, http://www.elasticsearch.org/tutorials/2010/08/01/couchb-integration.html . We may try to head that way depending on what is available in DataStax Enterprise. I'll let you guys know if that manifests itself.)
Sure. Enabling a new and interesting use-case would also be a great validator.

Jonathan Ellis
added a comment - 20/Oct/11 04:13 my point was that once you've built out the client code to introspect Cassandra's schema and marshal to string and back again, the curl examples stop looking so straightforward
I don't follow – it seems to me that it's straightforward for the same reason that CQL is – the schema business stays on the server. So obviously if you're going to GET a UUID column that's going to be a bit messy with curl (or CQL) but strings, ints, and so forth obviously have clean JSON representations.

Jonathan Ellis
added a comment - 20/Oct/11 04:15 Oh, I get it: you're talking about the case where I say "GET object X" and there's no way to hand back the schema for X the way we added in CASSANDRA-2734 for CQL. Good point.

Jonathan Ellis
added a comment - 20/Oct/11 05:59 ... but thinking about it, I don't think that's really a big deal unless you're building a data browser. As an app author, you presumably already know that user['photo'] is a blob, and as a curl user, you don't care a great deal whether '4c330d52-5f8e-4084-918c-8dd727014ab9' is a time or a lexical uuid, or possibly just a string that looks like one.
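
For a client that does care, the distinction can still be recovered from the string itself. This is a small Python sketch using only the standard library; the JSON payload shape is an assumption for illustration.

```python
import json
import uuid

# Assumed response shape: the UUID arrives as an ordinary JSON string.
payload = '{"id": "4c330d52-5f8e-4084-918c-8dd727014ab9"}'
value = json.loads(payload)["id"]

# Parsing is optional; the version field tells time-based UUIDs (version 1)
# apart from the other variants.
parsed = uuid.UUID(value)
is_time_based = parsed.version == 1
print(value, parsed.version, is_time_based)
```

The example value from the comment is a version 4 (random) UUID, so `is_time_based` is `False` here.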

Brian ONeill
added a comment - 20/Oct/11 14:06 That's exactly our situation, Jonathan. We have users contributing scripts written in ruby, groovy, perl, etc. The users have different goals: analytics, data quality, etc. Their scripts slurp in the data they care about (over HTTP), then they operate on it as a Hash. They know what they expect to find in the element they are operating on. Right now, they can access other parts of the system (relationships stored in Neo4j, and fuzzy searches in SOLR), but they couldn't get the actual raw record out of Cassandra using the same model.

Brian ONeill
added a comment - 20/Oct/11 14:14 Hmmm....
I hadn't looked at Riak much until you sent that link out last night. Looking at what they provide over HTTP, it is pretty powerful, especially around the Map/Reduce capabilities:
http://wiki.basho.com/MapReduce.html
Maybe there is an opportunity here to deliver the same kinds of capabilities on top of Cassandra?
It would be really slick if we could sling a PIG script at a URL and have it result in new column families or updated data in Cassandra.
If we were looking to get rid of the contrib directory, maybe we carve off rest+pig+hadoop capabilities into its own project. That project would provide services on top of Cassandra exposed via REST. (again.. very similar to what SOLR provided for Lucene for a long time, until they merged)

Jonathan Ellis
added a comment - 20/Oct/11 14:20 CASSANDRA-1805 has been open for a while to get rid of contrib/. Apparently it's a real bitch to move the pig package for reasons I don't really understand.

Sylvain Lebresne
added a comment - 20/Oct/11 14:20 For what it's worth, I'm also on the side of "it would be better to have this as a separate project". I'm not saying REST is useless or anything, and I'm sure a good REST layer would be useful to some, but I don't think it makes sense to have a lot of different APIs in core Cassandra (having thrift and CQL is already too much).

Brian ONeill
added a comment - 20/Oct/11 15:34 Jonathan, I'll take a look at CASSANDRA-1805. Maybe we can kill two birds with one stone here.
Would you have any problem making the pig code part of trial project?

Eric Evans
added a comment - 20/Oct/11 15:38 - edited
... but thinking about it, I don't think that's really a big deal unless you're building a data browser. As an app author, you presumably already know that user['photo'] is a blob, and as a curl user, you don't care a great deal whether '4c330d52-5f8e-4084-918c-8dd727014ab9' is a time or a lexical uuid, or possibly just a string that looks like one.
A better way of stating it is that, without layering schema (and the associated client-side code), you're forced into a world of strings. For some things this makes no difference, for others it does. Brian's examples and use-cases are decidedly biased toward applications that use strings everywhere.
My point was that once you resort to schema in order to make it generally useful, the benefits cited are moot. If your application relies on the size of an integer, or needs access to the time component of a uuid, or uses composite or custom types, then "speaking http + json" won't be enough to integrate with other systems, and using curl will be a lot more involved. If you're going to layer schema on REST, you're much better off just using CQL.
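
The "world of strings" point can be sketched in a few lines of Python using only the standard library. The JSON payload and the `id` key are assumptions made for the example, not output from the REST layer; the UUID is the one quoted in this thread.

```python
import json
import uuid

# Hypothetical JSON payload as a REST client might receive it.
# The key name "id" is illustrative only.
payload = '{"id": "4c330d52-5f8e-4084-918c-8dd727014ab9"}'
row = json.loads(payload)

# JSON has no UUID type, so the value arrives as a plain string.
assert isinstance(row["id"], str)

# Only client-side knowledge ("this column holds a UUID") recovers the type.
parsed = uuid.UUID(row["id"])
print(parsed.version)  # 4: a random UUID, so it has no time component

# A time-based (version 1) UUID does carry a timestamp, but after a JSON
# round trip the client must still know to parse the string to reach it.
time_based = uuid.uuid1()
round_tripped = uuid.UUID(json.loads(json.dumps(str(time_based))))
print(round_tripped.version, round_tripped.time == time_based.time)
```

Nothing in the payload says which strings are UUIDs, integers of a given width, or blobs, so that knowledge has to live in the client, which is the schema layering being described.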

Jonathan Ellis
added a comment - 20/Oct/11 16:32 Would you have any problem making the pig code part of trial project?
I think we're pretty sold on the benefits of having the base Pig load/storefunc stuff in-tree. But if you meant you want to add a REST API for Pig, go for it.

Gary Dusbabek
added a comment - 20/Oct/11 17:24 I think that if RESTful Cassandra has broad interest, it should belong in the main tree. One of my reasons for giving it a try and writing a blog post was to gauge the level of interest. It turned out to be pretty low, so I didn't pursue it.
I admit: the simplicity and accessibility of a REST interface appeals to me, and I can see it being useful for some teams. Call me a closet REST fan. But at the same time, if this code were to go in the tree and not get a lot of love (kind of like what we did with the experimental Avro transport), it would be bad.
I'd like to see a REST interface start life in Apache-extras (or github or whatever). If it ends up being generally useful and maintained, I think it will eventually find its way into the main tree.
The idea of using a REST interface as a way to spur adoption is somewhat orthogonal to the idea of utility, IMO, but may be justified if that were an aim of the project.

Brian ONeill
added a comment - 21/Oct/11 17:33 OK – I set up a project on Apache Extras:
http://code.google.com/a/apache-extras.org/p/virgil/
I made the goals of the project threefold:
Provide a REST interface for the majority of Cassandra's functions
Make Pig/Hadoop on Cassandra accessible via REST
Provide a slick GUI that allows you to inspect data via a browser
I threw the GUI in there because it should be dead simple. Since I'm running a lightweight HTTP server (Jetty) already, it should be trivial to create a javascript GUI (built on the REST services). Many of us have used cassandra-gui, but this would eliminate the need to deploy that separately. I'm going to reach out to those guys to see if they want to work together.

Brian ONeill
added a comment - 22/Oct/11 14:17 I'm trying to drum up some support.
I threw this out there:
http://brianoneill.blogspot.com/2011/10/virgil-gui-and-rest-layer-for-cassandra.html
Also, pushed it to java.net...
http://weblogs.java.net/blog/boneill42/archive/2011/10/22/virgil-gui-and-rest-layer-cassandra
But you guys probably have a lot more pull. If you could push the message out through any of your media outlets, it'd be much appreciated.
cheers, brian

Jeremy Hanna
added a comment - 28/Jun/12 15:39 Brian: how much uptake have you seen of the REST-based interface and other capabilities of Virgil? It's been some time and it sounds like you've had some decent success with all of your efforts, and I wondered if there's a point to circle back and unify things so that Cassandra core could benefit. I'm just a contributor, but just trying to see if it's time to reconsider these things or if the separated out interface is fine as it is.

Brian ONeill
added a comment - 29/Jun/12 18:45 Agreed. I wouldn't say it's been overwhelming, but we've had decent uptake: a few hundred downloads since our last release. Now, since we've expanded the capabilities in Virgil, it's unclear whether the users are there for the REST interface, triggers, or indexing capabilities. The triggers capability seems to have the most uptake. We even have someone looking to build out an Elastic Search bridge using the trigger mechanism.
We've ensured Virgil can run embedded w/ Cassandra or remotely (using pure Thrift to support both runmodes). Thus, if we want to incorporate it into core, it should be an easy drop-in. We could simply add a config param in the yaml that would optionally start the REST server.
We're loving it internally. Even the QA teams have adopted it. They hit Virgil from Cucumber to automate testing (using REST from Ruby to make assertions about the state of the data). But... I'm biased. =)