NoSQL quorum compared to virtual sharding

After reading about a few NoSQL techniques, it looks to me like quorum loses out when compared with virtual sharding. Virtual sharding allows scalability and does not increase the number of reads/writes across the system. What's worse, I absolutely cannot find any benefit of quorum over sharding.

Question: could you act as an advocate of the quorum technique from the perspective of data consistency/performance/scalability and shed light on situations where it is better than sharding?

Below is my understanding of both techniques:

Quorum:

Suppose I have a booking system that demands strong data consistency. One NoSQL approach to achieving consistency is a quorum, meaning R + W > N, where R is the number of nodes read from, W is the number of nodes written to, and N is the total number of nodes holding the data.

As I understand it, if you use a quorum, then to write a row your database needs to perform the write operation W times, and to read something it needs to do R reads. Right?
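For intuition, here is a minimal Python sketch (the function name is my own, hypothetical) of why R + W > N guarantees that every read overlaps at least one replica that saw the latest write:

```python
def quorum_overlap(n: int, r: int, w: int) -> int:
    """Minimum number of replicas shared by any read set and write set.

    A read of R replicas and a write of W replicas out of N total
    must overlap in at least R + W - N replicas (inclusion-exclusion).
    """
    return r + w - n

# Classic majority quorum: N = 3, R = W = 2.
print(quorum_overlap(3, 2, 2))  # 1  -> every read sees the latest write
print(quorum_overlap(3, 1, 1))  # -1 -> no overlap guaranteed: stale reads possible
```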

Virtual sharding:

As I understand it, sharding is when there is something similar to a hash map that, by some criterion, tells you where incoming data should be stored and from where it should be read. Suppose you have N nodes. "Virtual" means that, to avoid scalability problems, the hash map is made bigger than N, say 10*N entries. That allows it to be easily reconfigured when new nodes are added.
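A minimal sketch of that idea (all names are hypothetical; the 10*N bucket count follows the example above):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]    # N physical nodes
VIRTUAL_BUCKETS = 10 * len(NODES)         # the "10*N" virtual slots

# The virtual shard map: bucket index -> owning node, dealt out round-robin.
bucket_owner = {b: NODES[b % len(NODES)] for b in range(VIRTUAL_BUCKETS)}

def node_for_key(key: str) -> str:
    """Hash the key into a fixed virtual bucket, then look up its owner."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % VIRTUAL_BUCKETS
    return bucket_owner[bucket]

# Adding a node only reassigns entries in the map; keys still hash to
# the same buckets, so only the handed-over buckets' data moves.
NODES.append("node-d")
for b in range(0, VIRTUAL_BUCKETS, 4):    # hand every 4th bucket to the newcomer
    bucket_owner[b] = "node-d"
```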

What is extremely good about it is that it doesn't demand any replication the way quorum does! Of course, for the sake of availability/failover you can add one master-slave backup for each node, but that won't increase the number of reads/writes in the system.

Answer:

The key distinction that needs to be made here is that "quorum" is a concept employed for eventual consistency among replicas in a partition, whereas "sharding" is a concept for data partitioning and does not imply replication.

In a system like Cassandra, replication is not a requirement. You could use Cassandra for data partitioning/sharding only, assigning tokens to your nodes to establish ownership of data in the ring. Cassandra uses a concept called consistent hashing to distribute data across the nodes in your cluster.
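A minimal, hypothetical sketch of consistent hashing (not Cassandra's actual implementation, which uses 128-bit Murmur3 tokens and vnodes):

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Hash a value onto the ring (Cassandra itself uses Murmur3)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Each node is assigned a token; sorting the tokens forms the ring.
ring = sorted((token(node), node) for node in ["node-a", "node-b", "node-c"])
tokens = [t for t, _ in ring]

def owner(partition_key: str) -> str:
    """A key belongs to the first node whose token follows the key's token."""
    i = bisect.bisect_right(tokens, token(partition_key)) % len(ring)
    return ring[i][1]

print(owner("booking:42"))
```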

Quorum is one of the available consistency levels when reading and writing data in Cassandra. When you write to Cassandra, all replicas receive and process the write request regardless of the consistency level used; however, Cassandra responds to the request as soon as enough replicas have successfully processed the write to meet the consistency level. For reads the process is somewhat different: only enough replicas to meet the consistency level are queried (in the normal case), with one returning the full data and the others returning a digest of it.
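For example, with the Python cassandra-driver you can set the consistency level per statement (the contact point, keyspace, and table below are hypothetical):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # hypothetical contact point
session = cluster.connect("bookings")   # hypothetical keyspace

# The write goes to all replicas, but the coordinator acknowledges it
# as soon as a quorum of them (e.g. 2 of 3) has succeeded.
write = SimpleStatement(
    "INSERT INTO reservations (id, room) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, "12B"))

# The read likewise waits only for a quorum of replicas to respond.
read = SimpleStatement(
    "SELECT room FROM reservations WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, (42,)).one()
```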

As you indicate, without multiple replicas, availability is a problem. But if you had a master-slave configuration for each shard in your example, you would effectively be writing the data twice anyway. Whether the database acknowledges the write as soon as the master processes it, or only once the write to the slave has completed as well, depends on the database solution and its configuration.

Cassandra excels at both partitioning/sharding and replication, and the same is true of other AP NoSQL solutions. Also, since Cassandra supports tunable consistency via consistency levels, you can find an ideal balance between availability and consistency for your application. By using a quorum consistency level you can survive the loss of replicas (e.g. with 3 replicas, you could survive the loss of 1 node in a partition) while your application continues to work.

The advantage of replication with quorum consistency (or any other consistency level, for that matter) in Cassandra over sharding plus a backup in another solution is failure handling. If the master of a shard/partition fails, that partition is unavailable until the backup becomes active, and that "active-passive switchover" is often not transparent (it really depends on the database solution). In an AP system like Cassandra, on a replica failure the system continues working without issue as long as the consistency level can still be met; no switchover is needed. Additionally, with a high enough replication factor you can tolerate the loss of multiple nodes in a partition (e.g. using QUORUM with an RF of 5 allows you to lose 2 replicas of a partition). Lastly, another advantage is that because there are many active replicas within a partition, they can all serve requests simultaneously, whereas in a master-slave setup only the master serves reads/writes. This can lead to much better performance at scale.
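The arithmetic behind that last fault-tolerance point, as a minimal sketch (hypothetical helpers, using quorum = majority of the replication factor):

```python
def quorum_size(rf: int) -> int:
    """A quorum is a strict majority of the replication factor."""
    return rf // 2 + 1

def tolerable_failures(rf: int) -> int:
    """Replicas that can be lost while a quorum of the rest still responds."""
    return rf - quorum_size(rf)

for rf in (3, 5):
    print(f"RF={rf}: quorum={quorum_size(rf)}, can lose {tolerable_failures(rf)}")
# RF=3: quorum=2, can lose 1
# RF=5: quorum=3, can lose 2
```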
