Description:
MySQL Fabric's HASH sharding functionality isn't implemented properly. Under the hood, a RANGE comparison is still used to calculate on which shard the data belongs. This produces unbalanced shards when the expected behavior is balanced shards (ie: number of rows per shard).
Consider the following sample application which creates 100 authors, each with 50 books.
https://gist.github.com/utdrmac/a9f9996a3c31ee4fb278
The author_no is the shard key created as follows:
mysqlfabric sharding create_definition HASH test_global
mysqlfabric sharding add_table 4 test.authors author_no
mysqlfabric sharding add_table 4 test.books author_no
mysqlfabric sharding add_shard 4 "testShard1,testShard2" --state=ENABLED
In a "normal" HASH-based sharding, each author_no would be MODULO'd to the number of shards in the cluster. For example, (456 % 2) = 0, author_no 456 lives on shard #0.
Thus, for 100 authors numbering 1 to 100, there should be 50 authors on testShard1 and 50 authors on testShard2. This is not the case.
In my testing using the code above, 30 authors are on testShard1 and 70 on testShard2 due to the way HASH sharding is implemented.
By packet-sniffing the traffic between 'mysqlfabric' and the backend store, a HASH of the group name is created using MD5 and this is stored as the "lower bound" for that shard. When it comes time to write data, or read data too, the key value is MD5'd as well and string-compared against the HASH of the group mentioned earlier. This produces a RANGE'd sharding result, not a true HASH.
How to repeat:
Execute the sample application above against a working MySQL Fabric setup of 1 global group and 2 individual groups. Run SELECT's on the tables on each group to discover imbalanced row counts.
Suggested fix:
Implement proper HASH sharding using the MODULO technique described above. When you partition a table in MySQL, this is how partitions are chosen, using modulo. There should be no difference in the MySQL universe on how "hash" partitioning/sharding is implemented.