Reading from HDFS by increasing split size

I'm trying to read a 60 GB HDFS file using Spark's textFile("hdfs_file_path", minPartitions). How can I control the number of tasks by increasing the split size? With the default split size of 250 MB, several tasks are created, but I would like a specific number of tasks to be created while reading from HDFS itself, rather than by calling repartition() afterwards.

Re: Reading from HDFS by increasing split size

I have not tested this, but you should be able to pass any MapReduce-style setting through to the underlying Hadoop configuration. Essentially, you can control split behaviour the same way you would in a MapReduce program, since Spark uses the same input formats.
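
For example, something along these lines should do it (again untested; the 1 GB figure is purely illustrative, tune it for your data):

// Assuming sc is an existing SparkContext (e.g. in spark-shell).
// Neither the old nor the new Hadoop FileInputFormat will produce a split
// smaller than the configured minimum, so raising the minimum above the
// HDFS block size yields fewer, larger splits and therefore fewer tasks.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.minsize",
  (1024L * 1024 * 1024).toString)  // 1 GB per split, so roughly 60 tasks for 60 GB

val rdd = sc.textFile("hdfs_file_path")  // placeholder path from the original post
println(s"partitions: ${rdd.getNumPartitions}")

Note that the minPartitions argument to textFile() only hints at creating more splits; it cannot merge blocks into fewer splits, which is why you need the Hadoop configuration (or coalesce()) to reduce the task count.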

> On 10. Oct 2017, at 09:14, Kanagha Kumar <[hidden email]> wrote:
>
> Hi,
>
> I'm trying to read a 60 GB HDFS file using Spark's textFile("hdfs_file_path", minPartitions).
>
> How can I control the number of tasks by increasing the split size? With the default split size of 250 MB, several tasks are created, but I would like a specific number of tasks to be created while reading from HDFS itself, rather than by calling repartition() afterwards.
>
> Any suggestions are helpful!
>
> Thanks
>
