Too many intermediate partition files

Details

Description

Currently, the number of partitions is determined by the volume size and the number of distinct keys. This can cause unnecessary overhead. We need to improve the partition number determiner to also consider the number of cluster nodes.
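One possible shape of such a determiner, as a minimal sketch (the class, method, and parameter names here are illustrative assumptions, not Tajo's actual API): derive the count from volume and distinct keys as before, then cap it by the cluster's parallelism so small clusters do not produce many tiny partition files.

```java
public class PartitionNumDeterminer {
    // Hypothetical sketch: cap the volume/key-based partition count
    // by the total number of worker slots in the cluster.
    public static int determinePartitionNum(long volumeBytes,
                                            long volumePerPartition,
                                            long distinctKeys,
                                            int clusterSlots) {
        // Partitions needed so each one holds roughly volumePerPartition bytes.
        long byVolume = (volumeBytes + volumePerPartition - 1) / volumePerPartition;
        // No point in having more partitions than distinct keys.
        long byKeys = Math.min(byVolume, distinctKeys);
        // Cap by the cluster's parallelism to avoid tiny intermediate files.
        return (int) Math.max(1, Math.min(byKeys, clusterSlots));
    }

    public static void main(String[] args) {
        // 10 GB of data, 256 MB per partition -> 40 partitions by volume,
        // but only 16 slots are available, so the cap applies.
        System.out.println(determinePartitionNum(10L << 30, 256L << 20, 1_000_000, 16));
        // prints 16
    }
}
```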

Jihoon Son
added a comment - 03/Dec/13 11:14

When the size of the intermediate data is sufficiently large, the number of tasks appears to be the number of worker slots.
In my opinion, since the number of tasks is fixed regardless of the size of the intermediate data, the task failure overhead will increase as the intermediate data grows.
How about limiting the maximum task size?
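Jihoon's suggestion could look roughly like this (hypothetical names and parameters, not Tajo's actual code): grow the task count once tasks would exceed a maximum size, so a failure only re-runs a bounded amount of work.

```java
public class TaskNumByMaxSize {
    // Hypothetical sketch: instead of always using the worker-slot count,
    // increase the task count when tasks would otherwise exceed a maximum
    // size, bounding the work lost to any single task failure.
    public static int taskNum(long intermediateBytes, int workerSlots, long maxTaskBytes) {
        // Tasks needed so no task handles more than maxTaskBytes.
        long bySize = (intermediateBytes + maxTaskBytes - 1) / maxTaskBytes;
        // Never use fewer tasks than available slots.
        return (int) Math.max(workerSlots, bySize);
    }

    public static void main(String[] args) {
        // 100 GB with a 512 MB cap needs 200 tasks even on a 16-slot cluster.
        System.out.println(taskNum(100L << 30, 16, 512L << 20));
        // prints 200
    }
}
```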

Jihoon Son
added a comment - 04/Dec/13 01:48

I'm sorry that I misunderstood this issue.
The main purpose of this issue is to get the proper number of partitions.
Because each task processes one partition, I made the suggestion above.
However, the task size should be handled elsewhere, at the point where each task is created.
So, I agree with this implementation.
I'll review the remaining part of the patch.

Jinho Kim
added a comment - 04/Dec/13 02:22

Jihoon,
You're right. If the cluster size is small, the intermediate data size increases.
I will compress the intermediate data.
Thank you for the review.
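Compressing the intermediate partition data might look roughly like the following sketch, using the JDK's Deflater (this is illustrative only; Tajo's actual shuffle write path and codec configuration are not shown here):

```java
import java.util.Arrays;
import java.util.zip.Deflater;

public class IntermediateCompression {
    // Illustrative sketch: compress partition data before it is written,
    // so the intermediate files shrink when the cluster (and thus the
    // number of tasks) is small and each partition is large.
    public static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        // Worst case, deflate slightly expands the input; pad the buffer.
        byte[] buf = new byte[raw.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("key1\tvalue1\n");
        }
        byte[] rows = sb.toString().getBytes();
        byte[] packed = compress(rows);
        // Repetitive row data compresses well.
        System.out.println(packed.length < rows.length); // prints true
    }
}
```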

Jihoon Son
added a comment - 04/Dec/13 02:47

In my opinion, reducing the size of the intermediate data should also be handled as a separate problem from deciding the task size.
Anyway, may I review your patch tonight?
If this issue is urgent, it would be better for someone else to review it.

Hyunsik Choi
added a comment - 04/Dec/13 15:18

To me, this is a good workaround for this problem. Here are my comments on your patch.

It would be better to rename tajo.worker.start.cleanup to tajo.worker.tmpdir.cleanup-at-startup, because the config is for tajo.worker.tmpdir. That looks more consistent.

The code below should be inserted at the end of WorkerManagerService::cleanup(). In addition, cleanup's return type needs to be BoolProto.

done.run(TajoWorker.TRUE_PROTO);

The async RPC internally keeps a callback sequence id in a concurrent map until the call returns, so done.run must be called exactly once.
For the same reason, line 184 in QueryMaster should be changed to

tajoWorkerProtocolService.cleanup(null, queryId.getProto(), NullCallback.get());
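The "call done exactly once" rule can be illustrated with a generic one-shot guard (an illustrative sketch only; this is not Tajo's RPC callback API, and the class name is made up):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class OneShotCallback<T> {
    // Illustrative sketch: an async RPC layer that maps a sequence id to a
    // pending callback leaks the map entry if the callback never fires, and
    // can misbehave if it fires twice. Guard against double invocation.
    private final AtomicBoolean called = new AtomicBoolean(false);
    private final Consumer<T> delegate;

    public OneShotCallback(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    public void run(T value) {
        // Only the first invocation reaches the delegate.
        if (called.compareAndSet(false, true)) {
            delegate.accept(value);
        }
    }

    public static void main(String[] args) {
        int[] count = {0};
        OneShotCallback<Boolean> done = new OneShotCallback<>(v -> count[0]++);
        done.run(true);
        done.run(true); // ignored: the guard has already fired
        System.out.println(count[0]); // prints 1
    }
}
```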

One thing is missing: this patch should also handle the symmetric repartition join. The approach is very similar to your group-by work.

For that, please take a look at the code below. It chooses the smaller table and gets the proper number of partitions. After this patch is applied, the maximum number of partitions is limited to the number of worker slots, so you need to choose the larger table as the base table for calculating the number of tasks.
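A sketch of that base-table choice (hypothetical names; Tajo's actual planner code differs): with the partition count capped by worker slots, calculating from the smaller relation would under-partition the larger one, so the larger relation should drive the calculation.

```java
public class JoinBaseTable {
    // Hypothetical sketch: for a symmetric repartition join, pick the larger
    // relation as the base for the partition-count calculation, since the
    // count is capped by the worker-slot count after this patch.
    public static long baseVolume(long leftBytes, long rightBytes) {
        return Math.max(leftBytes, rightBytes);
    }

    public static int partitionNum(long leftBytes, long rightBytes,
                                   long volumePerPartition, int workerSlots) {
        long byVolume = (baseVolume(leftBytes, rightBytes) + volumePerPartition - 1)
                / volumePerPartition;
        return (int) Math.max(1, Math.min(byVolume, workerSlots));
    }

    public static void main(String[] args) {
        // An 8 GB table joined with a 1 GB table, 256 MB per partition,
        // 64 slots: the 8 GB side drives the count -> 32 partitions.
        System.out.println(partitionNum(8L << 30, 1L << 30, 256L << 20, 64));
        // prints 32
    }
}
```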