hive-user mailing list archives

Thanks for your reply!
If I understand you correctly, ORC by its nature cannot be split below the
default chunk size, because it is not composed of independent lines the way
CSV is.
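
A rough Python sketch of this reading of the split limits. It assumes one task
per split, a 64 MB lower bound on an ORC split (the figure from the quoted
reply), and made-up file sizes and line counts. The function names are
hypothetical and are not Spark or Hive APIs. The 16 MB case uses the smaller
stripe size suggested in the quoted message; whether a smaller stripe really
lowers the split limit is an assumption here.

```python
import math

MB = 1024 * 1024

def orc_max_tasks(file_bytes, chunk_bytes=64 * MB):
    """Upper bound on tasks if a reader cannot split below chunk_bytes."""
    return max(1, math.ceil(file_bytes / chunk_bytes))

def text_max_tasks(line_count):
    """Upper bound on tasks if a reader can split down to one line per task."""
    return max(1, line_count)

# Hypothetical 1 GB input with 10 million lines (illustrative numbers only).
print(orc_max_tasks(1024 * MB))           # 16 tasks at most with 64 MB chunks
print(orc_max_tasks(1024 * MB, 16 * MB))  # 64 tasks at most with 16 MB chunks
print(text_max_tasks(10_000_000))         # 10000000 tasks at most for text
```
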
> Until you run out of capacity, a distributed system *has* to show sub-linear
> scaling - and will show flat scaling up to a particular point because of
> Amdahl's law.
This sentence is a bit confusing to me. Does it mean that the time to read the
CSV file on Spark increases linearly with the data size because the job
already uses the full cluster, that is, it has run out of capacity?
And does the time to read the ORC format stay flat because it has not yet
exceeded the capacity?
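
A minimal Python sketch of how the flat region could arise, assuming tasks run
in waves of at most one task per executor and every task takes the same time.
The 59 tasks, 193 tasks and 200 executors come from the quoted reply; the
10 second task time, the 800 task case and the function name are made up for
illustration.

```python
import math

def wall_time(tasks, slots, task_seconds, fixed_overhead_seconds=0.0):
    """Estimated job time when tasks run in waves of at most `slots` at once."""
    waves = math.ceil(tasks / slots)
    return waves * (task_seconds + fixed_overhead_seconds)

slots = 200  # "like 200 executors" in the quoted reply

print(wall_time(59, slots, 10.0))   # 10.0, one wave
print(wall_time(193, slots, 10.0))  # 10.0, still one wave, so scaling looks flat
print(wall_time(800, slots, 10.0))  # 40.0, four waves once capacity is exceeded
```

Under these assumptions the time stays flat while the task count is below the
number of slots, and grows in steps once extra waves are needed.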
But the CSV file I am loading is not that large, as far as I can tell.
Could you correct me if I am wrong?
Thanks in advance.
Best,
Phil
On Wed, Feb 10, 2016 at 10:51 PM, Mich Talebzadeh <mich@peridale.co.uk>
wrote:
> Hi,
>
> Your point on
>
> *"ORC readers are more efficient than reading text, but ORC readers cannot
> split beyond a 64MB chunk, while text readers can split down to 1 line per
> task."*
>
> I thought you could choose a stripe size smaller than the default 64MB, for
> example 16MB by setting 'orc.stripe.size'='16777216'.
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> -----Original Message-----
> From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of
> Gopal Vijayaraghavan
> Sent: 10 February 2016 21:43
> To: user@hive.apache.org
> Subject: Re: reading ORC format on Spark-SQL
>
> > The reason why I am asking this kind of question is that the time to read
> > a csv file on Spark increases linearly as the data size increases a bit,
> > but the time to read ORC format on Spark-SQL stays the same as the data
> > size increases in <figure 2>.
> ...
> > Is the cause (just a property of reading ORC format) or (creating
> > the table for input and loading the input into the table) or both?
>
> ORC readers are more efficient than reading text, but ORC readers cannot
> split beyond a 64MB chunk, while text readers can split down to 1 line per
> task.
>
> So, it's possible the CSV readers are producing many many more divisions
> and running the query using the full cluster always - splitting
> indiscriminately is not always faster as each task has some fixed overhead
> unrelated to the data size (like plan deserialization in Kryo).
>
> For ORC - 59 tasks can run in the same time as 193 tasks, as long as
> there's capacity to run 193 in a single pass (like 200 executors).
>
> Until you run out of capacity, a distributed system *has* to show
> sub-linear scaling - and will show flat scaling up to a particular point
> because of Amdahl's law.
>
> Cheers,
> Gopal
>
--
==========================================================
*Hae Joon Lee*
Now, in Germany,
M.S. Candidate, Interested in Distributed System, Iterative Processing
Dept. of Computer Science, Informatik in German, TUB
Technical University of Berlin
In Korea,
M.S. Candidate, Computer Architecture Laboratory
Dept. of Computer Science, KAIST
Rm# 4414 CS Dept. KAIST
373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
==========================================================