Author: Irina Knyazeva, ODS Slack nickname: iknyazeva

Tutorial

"HANDLING DIFFERENT DATASETS WITH DASK AND TRYING A LITTLE DASK ML"

Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into main memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.

Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs. (See the Dask Array documentation.)
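To make the idea of a blocked algorithm concrete, here is a minimal sketch in plain Python. It is not how Dask is implemented; the helper names (`chunked`, `partial_sum_and_count`, `blocked_mean`) and the chunk size are assumptions chosen for illustration. Each block is reduced to a partial result, and the partial results are then combined into the global mean.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, chunk_size):
    """Yield consecutive slices of `values` with at most `chunk_size` items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def partial_sum_and_count(chunk):
    """Reduce one block to the two numbers needed for a global mean."""
    return sum(chunk), len(chunk)


def blocked_mean(values, chunk_size=4):
    """Compute the mean block by block, then combine the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_and_count, chunked(values, chunk_size)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(10))      # stands in for a large array
print(blocked_mean(data))   # same value as sum(data) / len(data)
```

Only one block has to be held by a worker at a time, which is why the same pattern can be applied to data that does not fit in memory as a whole.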

Dask has three main structures: Dask arrays (based on NumPy arrays), Dask DataFrames (based on pandas DataFrames), and Dask bags (for unstructured data such as text).

t_start = time.time()
y.mean().compute()
t_end = time.time()
print('Compute the same with dask \n')
print('Elapsed time for compute mean of dask array (ms):', round((t_end - t_start) * 1000))

Compute the same with dask
Elapsed time for compute mean of dask array (ms): 21

In practice this example is of little use: if the NumPy array is already in memory, partitioning it generally only increases the computation time. But if you need to process data from HDF5, NetCDF, or a large collection of NumPy files on disk, Dask can be extremely useful.

Dask can also be useful for small data thanks to delayed computation, which makes it easy to parallelize work. Let's look at an example with our previous NumPy array.

In [69]:

def f(z):
    return np.sqrt(z + 4)
def g(y):
    return y - 3
def h(x):
    return x**2
time_start = time.time()
x = np.random.randn(50 * N)
y = h(x)
z = g(x)
w = f(z + y)
time_end = time.time()
print('Elapsed time for compute complex functions with numpy array (ms):', round((time_end - time_start) * 1000))

Elapsed time for compute complex functions with numpy array (ms): 426

In [10]:

y = delayed(h)(x)
z = delayed(g)(x)
w = delayed(f)(z + y)
print('After we get dask delayed object', w)
time_start = time.time()
w.compute()
time_end = time.time()
print('Elapsed time for compute complex functions with numpy array with dask delayed (ms):', round((time_end - time_start) * 1000))

After we get dask delayed object Delayed('f-10fe1849-e5f7-4f12-97df-e728a4123d43')
Elapsed time for compute complex functions with numpy array with dask delayed (ms): 98
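To see what "delayed" means, here is a toy sketch of lazy evaluation in plain Python. The `Lazy` class and the `lazy` wrapper are hypothetical names invented for this illustration, not part of Dask, and the `add` function replaces the `z + y` operator used above. Calling a wrapped function only records the call; the work happens when `compute()` walks the recorded dependencies.

```python
class Lazy:
    """Toy stand-in for a delayed call: stores the function and its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Evaluate dependencies first, then apply the stored function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


def lazy(func):
    """Wrap `func` so that calling it records the call instead of running it."""
    def wrapper(*args):
        return Lazy(func, *args)
    return wrapper


def h(x):
    return x ** 2


def g(y):
    return y - 3


def add(a, b):
    return a + b


x = 5
y = lazy(h)(x)            # nothing is computed yet
z = lazy(g)(x)
w = lazy(add)(y, z)       # depends on two independent branches
print(type(w).__name__)   # Lazy
print(w.compute())        # 27
```

This toy version evaluates the branches one after another. The point of building the graph first is that a scheduler can see that `y` and `z` do not depend on each other and is free to run them in parallel.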

It is easy to see why the computational graph reduced the computation time. Let's do the same with the second way of introducing delayed functions.

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These Pandas objects may live on disk or on other machines.
See the documentation: http://docs.dask.org/en/latest/dataframe.html

m1 = memory_footprint()
dask_df = dd.read_csv(PATH)
m2 = memory_footprint()
print('Dask does not allocate memory after creation:', m2 - m1)

Dask does not allocate memory after creation: -5.16015625

In [16]:

print('But we could see data as in pandas dataframe:')
dask_df.head()

But we could see data as in pandas dataframe:

Out[16]:

| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |

In [17]:

# building delayed computation
print('We can do many operations the same way as in pandas, but without loading all data in memory \n ')
sex_distr = dask_df.loc[dask_df['Games'].str.contains('1996')].groupby('Sex')['Age'].min()

We can do many operations the same way as in pandas, but without loading all data in memory
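Because the data is split into row-wise partitions, a filter followed by a group-wise minimum can be computed on each partition separately and then merged. The sketch below shows that idea in plain Python. It is an illustration of the principle, not Dask's implementation; the function names and the sample rows are made up, and only the column names (`Sex`, `Age`, `Games`) come from the dataset above.

```python
def partition_min_age(rows, year):
    """Filter one partition by Games year and return the minimum age per sex."""
    result = {}
    for row in rows:
        # Skip missing ages, as pandas does for NaN values in min().
        if year in row["Games"] and row["Age"] is not None:
            sex = row["Sex"]
            result[sex] = min(result.get(sex, row["Age"]), row["Age"])
    return result


def combine(partials):
    """Merge per-partition minima into a single result."""
    merged = {}
    for partial in partials:
        for sex, age in partial.items():
            merged[sex] = min(merged.get(sex, age), age)
    return merged


# Hypothetical rows, split into two partitions for illustration only.
partitions = [
    [
        {"Sex": "M", "Age": 24.0, "Games": "1996 Summer"},
        {"Sex": "F", "Age": 19.0, "Games": "1996 Summer"},
        {"Sex": "M", "Age": 31.0, "Games": "1992 Summer"},
    ],
    [
        {"Sex": "M", "Age": 17.0, "Games": "1996 Summer"},
        {"Sex": "F", "Age": None, "Games": "1996 Summer"},
        {"Sex": "F", "Age": 22.0, "Games": "1996 Summer"},
    ],
]

partials = [partition_min_age(p, "1996") for p in partitions]
print(combine(partials))  # {'M': 17.0, 'F': 19.0}
```

The minimum works well here because the minimum of per-partition minima equals the global minimum; other aggregations such as the mean need extra partial results (a sum and a count) before they can be combined.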

In [18]:

print('Here we did selection and aggregation exactly the same way as in pandas \n')
print('But no computation has happened yet, we only created a dask structure \n')
sex_distr

Here we did selection and aggregation exactly the same way as in pandas
But no computation has happened yet, we only created a dask structure

Dask Bag implements operations like map, filter, fold, and groupby on collections of Python objects. It does this in parallel with a small memory footprint using Python iterators. It is similar to a parallel version of PyToolz or a Pythonic version of the PySpark RDD. (See the Dask Bag documentation.)

Dask bags are often used to parallelize simple computations on unstructured or semi-structured data like text data, log files, JSON records, or user defined Python objects.
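The map, filter, and fold operations mentioned above can be sketched with plain Python iterators. This is a single-threaded illustration of the style of computation, not the Dask Bag API; the records are invented, and the `claps` field is a hypothetical one added for the example.

```python
import json
from functools import reduce

# Hypothetical JSON-lines records; the "claps" field is made up for this sketch.
lines = [
    '{"domain": "medium.com", "title": "First post", "claps": 10}',
    '{"domain": "example.org", "title": "Elsewhere", "claps": 3}',
    '{"domain": "medium.com", "title": "Second post", "claps": 5}',
]

# map: parse each line lazily; nothing is parsed until the iterator is consumed.
records = map(json.loads, lines)

# filter: keep only the records from one domain.
medium_only = filter(lambda r: r["domain"] == "medium.com", records)

# fold: reduce the remaining records to a single number.
total_claps = reduce(lambda acc, r: acc + r["claps"], medium_only, 0)

print(total_claps)  # 15
```

Since `map` and `filter` return iterators, only one record is materialized at a time, which is the property that keeps the memory footprint small.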

Let's look at one example
('{"_id": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "_timestamp": 1520035195.282891, "_spider": "medium", "url": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "domain": "medium.com", "published": {"$date": "2012-08-13T22:54:53.510Z"}, "title": "Medium Terms of Service \\u2013 Medium Policy \\u2013 Medium", "content": "<div><header class=\\"container u-maxWidth740\\"><div class=\\"uiScale uiScale-ui--regular uiScale-caption--regular postMetaHeader u-paddingBottom10 row\\"><div class=\\"col u-size12of12 js-postMetaLockup\\"><div class=\\"uiScale uiScale-ui--regular uiScale-caption--regular postMetaLockup postMetaLockup--authorWithBio u-flexCenter js-postMetaLockup\\"><div class=\\"u-flex0\\"><a class=\\"link u-baseColor--link avatar\\" href=\\"https://medium.com/@Medium?source=post_header_lockup\\" data-action=\\"show-user-card\\" data-action-source=\\"post_header_lockup\\" data-action-value=\\"504c7870fdb6\\" data-action-type=\\"hover\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\"><div class=\\"u-relative u-inlineBlock u-flex0\\"><img src=\\"https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png\\" class=\\"avatar-image avatar-image--small\\" alt=\\"Go to the profile of Medium\\"><div class=\\"avatar-halo u-absolute u-textColorGreenNormal svgIcon\\" style=\\"width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px\\"><svg viewbox=\\"0 0 114 114\\" xmlns=\\"http://www.w3.org/2000/svg\\"><path d=\\"M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z\\"></path><path d=\\"M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 
35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z\\"></path></svg></div></div></a></div><div class=\\"u-flex1 u-paddingLeft15 u-overflowHidden\\"><div class=\\"u-lineHeightTightest\\"><a class=\\"ds-link ds-link--styleSubtle ui-captionStrong u-inlineBlock link link--darken link--darker\\" href=\\"https://medium.com/@Medium?source=post_header_lockup\\" data-action=\\"show-user-card\\" data-action-source=\\"post_header_lockup\\" data-action-value=\\"504c7870fdb6\\" data-action-type=\\"hover\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\">Medium</a><span class=\\"followState js-followState\\" data-user-id=\\"504c7870fdb6\\"></span></div><div class=\\"ui-caption ui-xs-clamp2 postMetaInline\\">Everyone\\u2019s stories and ideas</div><div class=\\"ui-caption postMetaInline js-testPostMetaInlineSupplemental\\"><time datetime=\\"2012-08-13T22:54:53.510Z\\">Aug 13, 2012</time><span class=\\"middotDivider u-fontSize12\\"></span><span class=\\"readingTime\\" title=\\"5 min read\\"></span></div></div></div></div></div></header><div class=\\"postArticle-content js-postField js-notesSource js-trackedPost\\" data-post-id=\\"9db0094a1e0f\\" data-source=\\"post_page\\" data-collection-id=\\"675ebe56ac25\\" data-tracking-context=\\"postPage\\"><section name=\\"bb8c\\" class=\\"section section--body section--first section--last\\"><div class=\\"section-divider\\"><hr class=\\"section-divider\\"></div><div class=\\"section-content\\"><div class=\\"section-inner sectionLayout--insetColumn\\"><h1 name=\\"title\\" id=\\"title\\" class=\\"graf graf--h2 graf--leading graf--title\\">Medium Terms of\\u00a0Service</h1><p name=\\"571b\\" id=\\"571b\\" class=\\"graf graf--p graf-after--h2\\"><strong class=\\"markup--strong markup--p-strong\\">Effective: March 7, 2016</strong></p><p name=\\"c90b\\" id=\\"c90b\\" class=\\"graf graf--p graf-after--p\\">These Terms of Service (\\u201cTerms\\u201d) are a contract between you 
and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d).</p><p name=\\"238b\\" id=\\"238b\\" class=\\"graf graf--p graf-after--p\\">By using Medium, you agree to these Terms. If you don\\u2019t agree to any of the Terms, you can\\u2019t use Medium.</p><p name=\\"7769\\" id=\\"7769\\" class=\\"graf graf--p graf-after--p\\">We can change these Terms at any time. We keep a <a href=\\"https://github.com/Medium/medium-policy\\" data-href=\\"https://github.com/Medium/medium-policy\\" class=\\"markup--anchor markup--p-anchor\\" rel=\\"nofollow noopener\\" target=\\"_blank\\">historical</a> record of all changes to our Terms on GitHub. If a change is material, we\\u2019ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don\\u2019t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.</p><h4 name=\\"8c81\\" id=\\"8c81\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Content rights &amp; responsibilities</strong></h4><p name=\\"ac74\\" id=\\"ac74\\" class=\\"graf graf--p graf-after--h4\\">You own the rights to the content you create and post on Medium.</p><p name=\\"651b\\" id=\\"651b\\" class=\\"graf graf--p graf-after--p\\">By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, including anything reasonably related to publishing it (like storing, displaying, reformatting, and distributing it). In consideration for Medium granting you access to and use of the Services, you agree that Medium may enable advertising on the Services, including in connection with the display of your content or other information. We may also use your content to promote Medium, including its products and content. 
We will never sell your content to third parties without your explicit permission.</p><p name=\\"2584\\" id=\\"2584\\" class=\\"graf graf--p graf-after--p\\">You\\u2019re responsible for the content you post. This means you assume all risks related to it, including someone else\\u2019s reliance on its accuracy, or claims relating to intellectual property or other legal rights.</p><p name=\\"c207\\" id=\\"c207\\" class=\\"graf graf--p graf-after--p\\">You\\u2019re welcome to post content on Medium that you\\u2019ve published elsewhere, as long as you have the rights you need to do so. By posting content to Medium, you represent that doing so doesn\\u2019t conflict with any other agreement you\\u2019ve made.</p><p name=\\"0372\\" id=\\"0372\\" class=\\"graf graf--p graf-after--p\\">By posting content you didn\\u2019t create to Medium, you are representing that you have the right to do so. For example, you are posting a work that\\u2019s in the public domain, used under license (including a free license, such as <a href=\\"https://creativecommons.org/licenses/\\" data-href=\\"https://creativecommons.org/licenses/\\" class=\\"markup--anchor markup--p-anchor\\" rel=\\"nofollow noopener\\" target=\\"_blank\\">Creative Commons</a>), or a fair use.</p><p name=\\"0472\\" id=\\"0472\\" class=\\"graf graf--p graf-after--p\\">We can remove any content you post for any reason.</p><p name=\\"db2b\\" id=\\"db2b\\" class=\\"graf graf--p graf-after--p\\">You can delete any of your posts, or your account, anytime. Processing the deletion may take a little time, but we\\u2019ll do it as quickly as possible. 
We may keep backup copies of your deleted post or account on our servers for up to 14 days after you delete it.</p><h4 name=\\"baf1\\" id=\\"baf1\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Our content and\\u00a0services</strong></h4><p name=\\"adc7\\" id=\\"adc7\\" class=\\"graf graf--p graf-after--h4\\">We reserve all rights in Medium\\u2019s look and feel. Some parts of Medium are licensed under third-party open source licenses. We also make some of our own code available under open source licenses. As for other parts of Medium, you may not copy or adapt any portion of our code or visual design elements (including logos) without express written permission from Medium unless otherwise permitted by law.</p><p name=\\"20e4\\" id=\\"20e4\\" class=\\"graf graf--p graf-after--p\\">You may not do, or try to do, the following: (1) access or tamper with non-public areas of the Services, our computer systems, or the systems of our technical providers; (2) access or search the Services by any means other than the currently available, published interfaces (e.g., APIs) that we provide; (3) forge any TCP/IP packet header or any part of the header information in any email or posting, or in any way use the Services to send altered, deceptive, or false source-identifying information; or (4) interfere with, or disrupt, the access of any user, host, or network, including sending a virus, overloading, flooding, spamming, mail-bombing the Services, or by scripting the creation of content or accounts in such a manner as to interfere with or create an undue burden on the Services.</p><p name=\\"f5dd\\" id=\\"f5dd\\" class=\\"graf graf--p graf-after--p\\">Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.</p><p name=\\"71a8\\" id=\\"71a8\\" class=\\"graf graf--p graf-after--p\\">We may change, terminate, or restrict access to any aspect of the 
service, at any time, without notice.</p><h4 name=\\"12f1\\" id=\\"12f1\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">No children</strong></h4><p name=\\"2ce7\\" id=\\"2ce7\\" class=\\"graf graf--p graf-after--h4\\">Medium is only for people 13 years old and over. By using Medium, you affirm that you are over 13. If we learn someone under 13 is using Medium, we\\u2019ll terminate their account.</p><h4 name=\\"531c\\" id=\\"531c\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Security</strong></h4><p name=\\"3155\\" id=\\"3155\\" class=\\"graf graf--p graf-after--h4\\">If you find a security vulnerability on Medium, tell us. We have a <a href=\\"https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2\\" data-href=\\"https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">bug bounty disclosure program</a>.</p><h4 name=\\"05cc\\" id=\\"05cc\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Incorporated rules and\\u00a0policies</strong></h4><p name=\\"5207\\" id=\\"5207\\" class=\\"graf graf--p graf-after--h4\\">By using the Services, you agree to let Medium collect and use information as detailed in our <a href=\\"https://medium.com/p/f03bf92035c9\\" data-href=\\"https://medium.com/p/f03bf92035c9\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Privacy Policy</a>. 
If you\\u2019re outside the United States, you consent to letting Medium transfer, store, and process your information (including your personal information and content) in and out of the United States.</p><p name=\\"6230\\" id=\\"6230\\" class=\\"graf graf--p graf-after--p\\">To enable a functioning community, we have <a href=\\"https://medium.com/policy/medium-rules-30e5502c4eb4\\" data-href=\\"https://medium.com/policy/medium-rules-30e5502c4eb4\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Rules</a>. To ensure usernames are distributed and used fairly, we have a <a href=\\"https://medium.com/@Medium/medium-username-policy-7054a77fb04f\\" data-href=\\"https://medium.com/@Medium/medium-username-policy-7054a77fb04f\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Username Policy</a>. Under our <a href=\\"https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695\\" data-href=\\"https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">DMCA Policy</a>, we\\u2019ll remove material after receiving a valid takedown notice. Under our <a href=\\"https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7\\" data-href=\\"https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Trademark Policy</a>, we\\u2019ll investigate any use of another\\u2019s trademark and respond appropriately.</p><p name=\\"21ad\\" id=\\"21ad\\" class=\\"graf graf--p graf-after--p\\">By using Medium, you agree to follow these Rules and Policies. 
If you don\\u2019t, we may remove content, or suspend or delete your account.</p><h4 name=\\"a2a2\\" id=\\"a2a2\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Miscellaneous</strong></h4><p name=\\"b7da\\" id=\\"b7da\\" class=\\"graf graf--p graf-after--h4\\"><em class=\\"markup--em markup--p-em\\">Disclaimer of warranty.</em> Medium provides the Services to you as is. You use them at your own risk and discretion. That means they don\\u2019t come with any warranty. None express, none implied. No implied warranty of merchantability, fitness for a particular purpose, availability, security, title or non-infringement.</p><p name=\\"7073\\" id=\\"7073\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Limitation of Liability</em>. Medium won\\u2019t be liable to you for any damages that arise from your using the Services. This includes if the Services are hacked or unavailable. This includes all types of damages (indirect, incidental, consequential, special or exemplary). And it includes all kinds of legal claims, such as breach of contract, breach of warranty, tort, or any other loss.</p><p name=\\"3d70\\" id=\\"3d70\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">No waiver.</em> If Medium doesn\\u2019t exercise a particular right under these Terms, that doesn\\u2019t waive it.</p><p name=\\"ab04\\" id=\\"ab04\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Severability</em>. 
If any provision of these terms is found invalid by a court of competent jurisdiction, you agree that the court should try to give effect to the parties\\u2019 intentions as reflected in the provision and that other provisions of the Terms will remain in full effect.</p><p name=\\"bde8\\" id=\\"bde8\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Choice of law and jurisdiction.</em> These Terms are governed by California law, without reference to its conflict of laws provisions. You agree that any suit arising from the Services must take place in a court located in San Francisco, California.</p><p name=\\"bbb3\\" id=\\"bbb3\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Entire agreement.</em> These Terms (including any document incorporated by reference into them) are the whole agreement between Medium and you concerning the Services.</p><p name=\\"dbf1\\" id=\\"dbf1\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Government use.</em> If you\\u2019re \\u200busing \\u200bMedium for the U.S. Government, <a href=\\"https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7\\" data-href=\\"https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">this Amendment</a> to \\u200bMedium\\u2019s Terms of Service \\u200bapplies to you\\u200b.</p><p name=\\"3318\\" id=\\"3318\\" class=\\"graf graf--p graf-after--p graf--trailing\\">Questions? 
Let us know at <a href=\\"mailto:%20legal@medium.com\\" data-href=\\"mailto:%20legal@medium.com\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">legal@medium.com</a>.</p></div></div></section></div><footer class=\\"u-paddingTop10\\"><div class=\\"container u-maxWidth740\\"><div class=\\"row\\"><div class=\\"col u-size12of12\\"></div></div><div class=\\"row\\"><div class=\\"col u-size12of12 js-postTags\\"><div class=\\"u-paddingBottom10\\"><ul class=\\"tags tags--postTags tags--borderless\\"><li><a class=\\"link u-baseColor--link\\" href=\\"https://medium.com/tag/terms-and-conditions?source=post\\" data-action-source=\\"post\\">Terms And Conditions</a></li><li><a class=\\"link u-baseColor--link\\" href=\\"https://medium.com/tag/terms?source=post\\" data-action-source=\\"post\\">Terms</a></li><li><a class=\\"link u-baseColor--link\\" href=\\"https://medium.com/tag/medium?source=post\\" data-action-source=\\"post\\">Medium</a></li></ul></div></div></div><section class=\\"uiScale uiScale-ui--small uiScale-caption--regular u-borderTopLightest u-marginTop10 u-paddingTop20\\"><div class=\\"ui-h3 u-textColorDarker u-fontSize22\\">One clap, two clap, three clap, forty?</div><p class=\\"ui-body u-marginBottom20 u-textColorDark u-fontSize16\\">By clapping more or less, you can signal to us which stories really stand out.</p></section><div class=\\"postActions js-postActionsFooter\\"><div class=\\"u-flexCenter\\"><div class=\\"u-flex1\\"><div class=\\"multirecommend js-actionMultirecommend u-flexCenter u-width60\\" data-post-id=\\"9db0094a1e0f\\" data-is-icon-29px=\\"true\\" data-is-circle=\\"true\\" data-has-recommend-list=\\"true\\" data-source=\\"post_actions_footer-----9db0094a1e0f---------------------clap_footer\\"><div class=\\"u-relative u-foreground\\"><div class=\\"clapUndo u-width60 u-round u-height32 u-absolute u-borderBox u-paddingRight5 u-transition--transform200Spring u-background--brandSageLighter js-clapUndo\\" style=\\"top: 14px; padding: 
2px;\\"></div></div><span class=\\"u-textAlignCenter u-relative u-background js-actionMultirecommendCount u-marginLeft10\\"></span></div></div><div class=\\"buttonSet u-flex0\\"></div></div></div></div><div class=\\"u-maxWidth740 u-paddingTop20 u-marginTop20 u-borderTopLightest container u-paddingBottom20 u-xs-paddingBottom10 js-postAttributionFooterContainer\\"><div class=\\"row js-postFooterInfo\\"><div class=\\"col u-size6of12 u-xs-size12of12\\"><li class=\\"uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardUser\\"><div class=\\"u-marginLeft20 u-floatRight\\"><span class=\\"followState js-followState\\" data-user-id=\\"504c7870fdb6\\"></span></div><div class=\\"u-tableCell\\"><a class=\\"link u-baseColor--link avatar\\" href=\\"https://medium.com/@Medium?source=footer_card\\" title=\\"Go to the profile of Medium\\" aria-label=\\"Go to the profile of Medium\\" data-action-source=\\"footer_card\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\"><div class=\\"u-relative u-inlineBlock u-flex0\\"><img src=\\"https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png\\" class=\\"avatar-image avatar-image--small\\" alt=\\"Go to the profile of Medium\\"><div class=\\"avatar-halo u-absolute u-textColorGreenNormal svgIcon\\" style=\\"width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px\\"><svg viewbox=\\"0 0 114 114\\" xmlns=\\"http://www.w3.org/2000/svg\\"><path d=\\"M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z\\"></path><path d=\\"M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 
C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z\\"></path></svg></div></div></a></div><div class=\\"u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15\\"><h3 class=\\"ui-h3 u-fontSize18 u-lineHeightTighter\\"><a class=\\"link link--primary u-accentColor--hoverTextNormal\\" href=\\"https://medium.com/@Medium\\" property=\\"cc:attributionName\\" title=\\"Go to the profile of Medium\\" aria-label=\\"Go to the profile of Medium\\" rel=\\"author cc:attributionUrl\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\">Medium</a></h3><div class=\\"ui-caption u-textColorGreenNormal u-fontSize13 u-tintSpectrum u-accentColor--textNormal u-marginBottom7\\">Medium member since Aug 2017</div><p class=\\"ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4\\">Everyone\\u2019s stories and ideas</p></div></li></div><div class=\\"col u-size6of12 u-xs-size12of12 u-xs-marginTop30\\"><li class=\\"uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardCollection\\"><div class=\\"u-marginLeft20 u-floatRight\\"></div><div class=\\"u-tableCell \\"><a class=\\"link u-baseColor--link avatar avatar--roundedRectangle\\" href=\\"https://medium.com/policy?source=footer_card\\" title=\\"Go to Medium Policy\\" aria-label=\\"Go to Medium Policy\\" data-action-source=\\"footer_card\\"><img src=\\"https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png\\" class=\\"avatar-image u-size60x60\\" alt=\\"Medium Policy\\"></a></div><div class=\\"u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15\\"><h3 class=\\"ui-h3 u-fontSize18 u-lineHeightTighter u-marginBottom4\\"><a class=\\"link link--primary u-accentColor--hoverTextNormal\\" href=\\"https://medium.com/policy?source=footer_card\\" rel=\\"collection\\" data-action-source=\\"footer_card\\">Medium Policy</a></h3><p class=\\"ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4\\">The Fine Print</p><div 
class=\\"buttonSet\\"></div></div></li></div></div></div><div class=\\"js-postFooterPlacements\\"></div><div class=\\"u-padding0 u-clearfix u-backgroundGrayLightest u-print-hide supplementalPostContent js-responsesWrapper\\"></div><div class=\\"supplementalPostContent js-heroPromo\\"></div></footer></div>", "author": {"name": null, "url": "https://medium.com/@Medium", "twitter": "@Medium"}, "image_url": null, "tags": [], "link_tags": {"canonical": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "publisher": "https://plus.google.com/103654360130207659246", "author": "https://medium.com/@Medium", "search": "/osd.xml", "alternate": "android-app://com.medium.reader/https/medium.com/p/9db0094a1e0f", "stylesheet": "https://cdn-static-1.medium.com/_/fp/css/main-branding-base.Ch8g7KPCoGXbtKfJaVXo_w.css", "icon": "https://cdn-static-1.medium.com/_/fp/icons/favicon-rebrand-medium.3Y6xpZ-0FSdWDnPM3hSBIA.ico", "apple-touch-icon": "https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png", "mask-icon": "https://cdn-static-1.medium.com/_/fp/icons/monogram-mask.KPLCSFEZviQN0jQ7veN2RQ.svg"}, "meta_tags": {"viewport": "width=device-width, initial-scale=1", "title": "Medium Terms of Service \\u2013 Medium Policy \\u2013 Medium", "referrer": "unsafe-url", "description": "These Terms of Service (\\u201cTerms\\u201d) are a contract between you and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d). By using\\u2026", "theme-color": "#000000", "og:title": "Medium Terms of Service \\u2013 Medium Policy \\u2013 Medium", "og:url": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "fb:app_id": "542599432471018", "og:description": "These Terms of Service (\\u201cTerms\\u201d) are a contract between you and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d). 
By using\\u2026", "twitter:description": "These Terms of Service (\\u201cTerms\\u201d) are a contract between you and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d). By using\\u2026", "author": "Medium", "og:type": "article", "twitter:card": "summary", "article:publisher": "https://www.facebook.com/medium", "article:author": "https://medium.com/@Medium", "robots": "index, follow", "article:published_time": "2012-08-13T22:54:53.510Z", "twitter:creator": "@Medium", "twitter:site": "@Medium", "og:site_name": "Medium", "twitter:label1": "Reading time", "twitter:data1": "5 min read", "twitter:app:name:iphone": "Medium", "twitter:app:id:iphone": "828256236", "twitter:app:url:iphone": "medium://p/9db0094a1e0f", "al:ios:app_name": "Medium", "al:ios:app_store_id": "828256236", "al:android:package": "com.medium.reader", "al:android:app_name": "Medium", "al:ios:url": "medium://p/9db0094a1e0f", "al:android:url": "medium://p/9db0094a1e0f", "al:web:url": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f"}}\n',)
CPU times: user 16.9 ms, sys: 26.1 ms, total: 43 ms
Wall time: 42.7 ms

In [31]:

print('We can parse the data with the json library and get dict-like objects \n')
dict_items = items.map(json.loads)
print(type(dict_items))

We can parse the data with the json library and get dict-like objects
<class 'dask.bag.core.Bag'>

In [32]:

dict_items.take(1)

Out[32]:

({'_id': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
'_timestamp': 1520035195.282891,
'_spider': 'medium',
'url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
'domain': 'medium.com',
'published': {'$date': '2012-08-13T22:54:53.510Z'},
'title': 'Medium Terms of Service – Medium Policy – Medium',
'content': '<div><header class="container u-maxWidth740"><div class="uiScale uiScale-ui--regular uiScale-caption--regular postMetaHeader u-paddingBottom10 row"><div class="col u-size12of12 js-postMetaLockup"><div class="uiScale uiScale-ui--regular uiScale-caption--regular postMetaLockup postMetaLockup--authorWithBio u-flexCenter js-postMetaLockup"><div class="u-flex0"><a class="link u-baseColor--link avatar" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id="504c7870fdb6" dir="auto"><div class="u-relative u-inlineBlock u-flex0"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image avatar-image--small" alt="Go to the profile of Medium"><div class="avatar-halo u-absolute u-textColorGreenNormal svgIcon" style="width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px"><svg viewbox="0 0 114 114" xmlns="http://www.w3.org/2000/svg"><path d="M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z"></path><path d="M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z"></path></svg></div></div></a></div><div class="u-flex1 u-paddingLeft15 u-overflowHidden"><div class="u-lineHeightTightest"><a class="ds-link ds-link--styleSubtle ui-captionStrong u-inlineBlock link link--darken link--darker" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" 
data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id="504c7870fdb6" dir="auto">Medium</a><span class="followState js-followState" data-user-id="504c7870fdb6"></span></div><div class="ui-caption ui-xs-clamp2 postMetaInline">Everyone’s stories and ideas</div><div class="ui-caption postMetaInline js-testPostMetaInlineSupplemental"><time datetime="2012-08-13T22:54:53.510Z">Aug 13, 2012</time><span class="middotDivider u-fontSize12"></span><span class="readingTime" title="5 min read"></span></div></div></div></div></div></header><div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="9db0094a1e0f" data-source="post_page" data-collection-id="675ebe56ac25" data-tracking-context="postPage"><section name="bb8c" class="section section--body section--first section--last"><div class="section-divider"><hr class="section-divider"></div><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><h1 name="title" id="title" class="graf graf--h2 graf--leading graf--title">Medium Terms of\xa0Service</h1><p name="571b" id="571b" class="graf graf--p graf-after--h2"><strong class="markup--strong markup--p-strong">Effective: March 7, 2016</strong></p><p name="c90b" id="c90b" class="graf graf--p graf-after--p">These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”).</p><p name="238b" id="238b" class="graf graf--p graf-after--p">By using Medium, you agree to these Terms. If you don’t agree to any of the Terms, you can’t use Medium.</p><p name="7769" id="7769" class="graf graf--p graf-after--p">We can change these Terms at any time. 
We keep a <a href="https://github.com/Medium/medium-policy" data-href="https://github.com/Medium/medium-policy" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">historical</a> record of all changes to our Terms on GitHub. If a change is material, we’ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don’t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.</p><h4 name="8c81" id="8c81" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Content rights &amp; responsibilities</strong></h4><p name="ac74" id="ac74" class="graf graf--p graf-after--h4">You own the rights to the content you create and post on Medium.</p><p name="651b" id="651b" class="graf graf--p graf-after--p">By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, including anything reasonably related to publishing it (like storing, displaying, reformatting, and distributing it). In consideration for Medium granting you access to and use of the Services, you agree that Medium may enable advertising on the Services, including in connection with the display of your content or other information. We may also use your content to promote Medium, including its products and content. We will never sell your content to third parties without your explicit permission.</p><p name="2584" id="2584" class="graf graf--p graf-after--p">You’re responsible for the content you post. This means you assume all risks related to it, including someone else’s reliance on its accuracy, or claims relating to intellectual property or other legal rights.</p><p name="c207" id="c207" class="graf graf--p graf-after--p">You’re welcome to post content on Medium that you’ve published elsewhere, as long as you have the rights you need to do so. 
By posting content to Medium, you represent that doing so doesn’t conflict with any other agreement you’ve made.</p><p name="0372" id="0372" class="graf graf--p graf-after--p">By posting content you didn’t create to Medium, you are representing that you have the right to do so. For example, you are posting a work that’s in the public domain, used under license (including a free license, such as <a href="https://creativecommons.org/licenses/" data-href="https://creativecommons.org/licenses/" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">Creative Commons</a>), or a fair use.</p><p name="0472" id="0472" class="graf graf--p graf-after--p">We can remove any content you post for any reason.</p><p name="db2b" id="db2b" class="graf graf--p graf-after--p">You can delete any of your posts, or your account, anytime. Processing the deletion may take a little time, but we’ll do it as quickly as possible. We may keep backup copies of your deleted post or account on our servers for up to 14 days after you delete it.</p><h4 name="baf1" id="baf1" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Our content and\xa0services</strong></h4><p name="adc7" id="adc7" class="graf graf--p graf-after--h4">We reserve all rights in Medium’s look and feel. Some parts of Medium are licensed under third-party open source licenses. We also make some of our own code available under open source licenses. 
As for other parts of Medium, you may not copy or adapt any portion of our code or visual design elements (including logos) without express written permission from Medium unless otherwise permitted by law.</p><p name="20e4" id="20e4" class="graf graf--p graf-after--p">You may not do, or try to do, the following: (1) access or tamper with non-public areas of the Services, our computer systems, or the systems of our technical providers; (2) access or search the Services by any means other than the currently available, published interfaces (e.g., APIs) that we provide; (3) forge any TCP/IP packet header or any part of the header information in any email or posting, or in any way use the Services to send altered, deceptive, or false source-identifying information; or (4) interfere with, or disrupt, the access of any user, host, or network, including sending a virus, overloading, flooding, spamming, mail-bombing the Services, or by scripting the creation of content or accounts in such a manner as to interfere with or create an undue burden on the Services.</p><p name="f5dd" id="f5dd" class="graf graf--p graf-after--p">Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.</p><p name="71a8" id="71a8" class="graf graf--p graf-after--p">We may change, terminate, or restrict access to any aspect of the service, at any time, without notice.</p><h4 name="12f1" id="12f1" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">No children</strong></h4><p name="2ce7" id="2ce7" class="graf graf--p graf-after--h4">Medium is only for people 13 years old and over. By using Medium, you affirm that you are over 13. 
If we learn someone under 13 is using Medium, we’ll terminate their account.</p><h4 name="531c" id="531c" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Security</strong></h4><p name="3155" id="3155" class="graf graf--p graf-after--h4">If you find a security vulnerability on Medium, tell us. We have a <a href="https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2" data-href="https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2" class="markup--anchor markup--p-anchor" target="_blank">bug bounty disclosure program</a>.</p><h4 name="05cc" id="05cc" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Incorporated rules and\xa0policies</strong></h4><p name="5207" id="5207" class="graf graf--p graf-after--h4">By using the Services, you agree to let Medium collect and use information as detailed in our <a href="https://medium.com/p/f03bf92035c9" data-href="https://medium.com/p/f03bf92035c9" class="markup--anchor markup--p-anchor" target="_blank">Privacy Policy</a>. If you’re outside the United States, you consent to letting Medium transfer, store, and process your information (including your personal information and content) in and out of the United States.</p><p name="6230" id="6230" class="graf graf--p graf-after--p">To enable a functioning community, we have <a href="https://medium.com/policy/medium-rules-30e5502c4eb4" data-href="https://medium.com/policy/medium-rules-30e5502c4eb4" class="markup--anchor markup--p-anchor" target="_blank">Rules</a>. To ensure usernames are distributed and used fairly, we have a <a href="https://medium.com/@Medium/medium-username-policy-7054a77fb04f" data-href="https://medium.com/@Medium/medium-username-policy-7054a77fb04f" class="markup--anchor markup--p-anchor" target="_blank">Username Policy</a>. 
Under our <a href="https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695" data-href="https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695" class="markup--anchor markup--p-anchor" target="_blank">DMCA Policy</a>, we’ll remove material after receiving a valid takedown notice. Under our <a href="https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7" data-href="https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7" class="markup--anchor markup--p-anchor" target="_blank">Trademark Policy</a>, we’ll investigate any use of another’s trademark and respond appropriately.</p><p name="21ad" id="21ad" class="graf graf--p graf-after--p">By using Medium, you agree to follow these Rules and Policies. If you don’t, we may remove content, or suspend or delete your account.</p><h4 name="a2a2" id="a2a2" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Miscellaneous</strong></h4><p name="b7da" id="b7da" class="graf graf--p graf-after--h4"><em class="markup--em markup--p-em">Disclaimer of warranty.</em> Medium provides the Services to you as is. You use them at your own risk and discretion. That means they don’t come with any warranty. None express, none implied. No implied warranty of merchantability, fitness for a particular purpose, availability, security, title or non-infringement.</p><p name="7073" id="7073" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Limitation of Liability</em>. Medium won’t be liable to you for any damages that arise from your using the Services. This includes if the Services are hacked or unavailable. This includes all types of damages (indirect, incidental, consequential, special or exemplary). 
And it includes all kinds of legal claims, such as breach of contract, breach of warranty, tort, or any other loss.</p><p name="3d70" id="3d70" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">No waiver.</em> If Medium doesn’t exercise a particular right under these Terms, that doesn’t waive it.</p><p name="ab04" id="ab04" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Severability</em>. If any provision of these terms is found invalid by a court of competent jurisdiction, you agree that the court should try to give effect to the parties’ intentions as reflected in the provision and that other provisions of the Terms will remain in full effect.</p><p name="bde8" id="bde8" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Choice of law and jurisdiction.</em> These Terms are governed by California law, without reference to its conflict of laws provisions. You agree that any suit arising from the Services must take place in a court located in San Francisco, California.</p><p name="bbb3" id="bbb3" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Entire agreement.</em> These Terms (including any document incorporated by reference into them) are the whole agreement between Medium and you concerning the Services.</p><p name="dbf1" id="dbf1" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Government use.</em> If you’re \u200busing \u200bMedium for the U.S. Government, <a href="https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7" data-href="https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7" class="markup--anchor markup--p-anchor" target="_blank">this Amendment</a> to \u200bMedium’s Terms of Service \u200bapplies to you\u200b.</p><p name="3318" id="3318" class="graf graf--p graf-after--p graf--trailing">Questions? 
Let us know at <a href="mailto:%20legal@medium.com" data-href="mailto:%20legal@medium.com" class="markup--anchor markup--p-anchor" target="_blank">legal@medium.com</a>.</p></div></div></section></div><footer class="u-paddingTop10"><div class="container u-maxWidth740"><div class="row"><div class="col u-size12of12"></div></div><div class="row"><div class="col u-size12of12 js-postTags"><div class="u-paddingBottom10"><ul class="tags tags--postTags tags--borderless"><li><a class="link u-baseColor--link" href="https://medium.com/tag/terms-and-conditions?source=post" data-action-source="post">Terms And Conditions</a></li><li><a class="link u-baseColor--link" href="https://medium.com/tag/terms?source=post" data-action-source="post">Terms</a></li><li><a class="link u-baseColor--link" href="https://medium.com/tag/medium?source=post" data-action-source="post">Medium</a></li></ul></div></div></div><section class="uiScale uiScale-ui--small uiScale-caption--regular u-borderTopLightest u-marginTop10 u-paddingTop20"><div class="ui-h3 u-textColorDarker u-fontSize22">One clap, two clap, three clap, forty?</div><p class="ui-body u-marginBottom20 u-textColorDark u-fontSize16">By clapping more or less, you can signal to us which stories really stand out.</p></section><div class="postActions js-postActionsFooter"><div class="u-flexCenter"><div class="u-flex1"><div class="multirecommend js-actionMultirecommend u-flexCenter u-width60" data-post-id="9db0094a1e0f" data-is-icon-29px="true" data-is-circle="true" data-has-recommend-list="true" data-source="post_actions_footer-----9db0094a1e0f---------------------clap_footer"><div class="u-relative u-foreground"><div class="clapUndo u-width60 u-round u-height32 u-absolute u-borderBox u-paddingRight5 u-transition--transform200Spring u-background--brandSageLighter js-clapUndo" style="top: 14px; padding: 2px;"></div></div><span class="u-textAlignCenter u-relative u-background js-actionMultirecommendCount u-marginLeft10"></span></div></div><div 
class="buttonSet u-flex0"></div></div></div></div><div class="u-maxWidth740 u-paddingTop20 u-marginTop20 u-borderTopLightest container u-paddingBottom20 u-xs-paddingBottom10 js-postAttributionFooterContainer"><div class="row js-postFooterInfo"><div class="col u-size6of12 u-xs-size12of12"><li class="uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardUser"><div class="u-marginLeft20 u-floatRight"><span class="followState js-followState" data-user-id="504c7870fdb6"></span></div><div class="u-tableCell"><a class="link u-baseColor--link avatar" href="https://medium.com/@Medium?source=footer_card" title="Go to the profile of Medium" aria-label="Go to the profile of Medium" data-action-source="footer_card" data-user-id="504c7870fdb6" dir="auto"><div class="u-relative u-inlineBlock u-flex0"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image avatar-image--small" alt="Go to the profile of Medium"><div class="avatar-halo u-absolute u-textColorGreenNormal svgIcon" style="width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px"><svg viewbox="0 0 114 114" xmlns="http://www.w3.org/2000/svg"><path d="M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z"></path><path d="M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z"></path></svg></div></div></a></div><div class="u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15"><h3 class="ui-h3 u-fontSize18 u-lineHeightTighter"><a class="link 
link--primary u-accentColor--hoverTextNormal" href="https://medium.com/@Medium" property="cc:attributionName" title="Go to the profile of Medium" aria-label="Go to the profile of Medium" rel="author cc:attributionUrl" data-user-id="504c7870fdb6" dir="auto">Medium</a></h3><div class="ui-caption u-textColorGreenNormal u-fontSize13 u-tintSpectrum u-accentColor--textNormal u-marginBottom7">Medium member since Aug 2017</div><p class="ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4">Everyone’s stories and ideas</p></div></li></div><div class="col u-size6of12 u-xs-size12of12 u-xs-marginTop30"><li class="uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardCollection"><div class="u-marginLeft20 u-floatRight"></div><div class="u-tableCell "><a class="link u-baseColor--link avatar avatar--roundedRectangle" href="https://medium.com/policy?source=footer_card" title="Go to Medium Policy" aria-label="Go to Medium Policy" data-action-source="footer_card"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image u-size60x60" alt="Medium Policy"></a></div><div class="u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15"><h3 class="ui-h3 u-fontSize18 u-lineHeightTighter u-marginBottom4"><a class="link link--primary u-accentColor--hoverTextNormal" href="https://medium.com/policy?source=footer_card" rel="collection" data-action-source="footer_card">Medium Policy</a></h3><p class="ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4">The Fine Print</p><div class="buttonSet"></div></div></li></div></div></div><div class="js-postFooterPlacements"></div><div class="u-padding0 u-clearfix u-backgroundGrayLightest u-print-hide supplementalPostContent js-responsesWrapper"></div><div class="supplementalPostContent js-heroPromo"></div></footer></div>',
'author': {'name': None,
'url': 'https://medium.com/@Medium',
'twitter': '@Medium'},
'image_url': None,
'tags': [],
'link_tags': {'canonical': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
'publisher': 'https://plus.google.com/103654360130207659246',
'author': 'https://medium.com/@Medium',
'search': '/osd.xml',
'alternate': 'android-app://com.medium.reader/https/medium.com/p/9db0094a1e0f',
'stylesheet': 'https://cdn-static-1.medium.com/_/fp/css/main-branding-base.Ch8g7KPCoGXbtKfJaVXo_w.css',
'icon': 'https://cdn-static-1.medium.com/_/fp/icons/favicon-rebrand-medium.3Y6xpZ-0FSdWDnPM3hSBIA.ico',
'apple-touch-icon': 'https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png',
'mask-icon': 'https://cdn-static-1.medium.com/_/fp/icons/monogram-mask.KPLCSFEZviQN0jQ7veN2RQ.svg'},
'meta_tags': {'viewport': 'width=device-width, initial-scale=1',
'title': 'Medium Terms of Service – Medium Policy – Medium',
'referrer': 'unsafe-url',
'description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…',
'theme-color': '#000000',
'og:title': 'Medium Terms of Service – Medium Policy – Medium',
'og:url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
'fb:app_id': '542599432471018',
'og:description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…',
'twitter:description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…',
'author': 'Medium',
'og:type': 'article',
'twitter:card': 'summary',
'article:publisher': 'https://www.facebook.com/medium',
'article:author': 'https://medium.com/@Medium',
'robots': 'index, follow',
'article:published_time': '2012-08-13T22:54:53.510Z',
'twitter:creator': '@Medium',
'twitter:site': '@Medium',
'og:site_name': 'Medium',
'twitter:label1': 'Reading time',
'twitter:data1': '5 min read',
'twitter:app:name:iphone': 'Medium',
'twitter:app:id:iphone': '828256236',
'twitter:app:url:iphone': 'medium://p/9db0094a1e0f',
'al:ios:app_name': 'Medium',
'al:ios:app_store_id': '828256236',
'al:android:package': 'com.medium.reader',
'al:android:app_name': 'Medium',
'al:ios:url': 'medium://p/9db0094a1e0f',
'al:android:url': 'medium://p/9db0094a1e0f',
'al:web:url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f'}},)

In [33]:

print('We can take any key from all records \n')
title_bag = dict_items.pluck('title')
print('With take method we received tuple of objects \n')
print(title_bag.take(3))

We can take any key from all records
With take method we received tuple of objects
('Medium Terms of Service – Medium Policy – Medium', 'Amendment to Medium Terms of Service Applicable to U.S. Government Users', '走入山與海之間：閩東大刀會和兩岸走私 – Yun-Chen Chien（簡韻真） – Medium')

We can write any function for processing the data and apply it to every record with the map function.

{'viewport': 'width=device-width, initial-scale=1',
'title': 'Amendment to Medium Terms of Service Applicable to U.S. Government Users',
'referrer': 'origin',
'description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of Medium Services by the Government. The reason for this Amendment…',
'theme-color': '#000000',
'og:title': 'Amendment to Medium Terms of Service Applicable to U.S. Government Users',
'og:url': 'https://medium.com/policy/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7',
'fb:app_id': '542599432471018',
'og:description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of…',
'twitter:description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of…',
'author': 'Medium',
'og:type': 'article',
'twitter:card': 'summary',
'article:publisher': 'https://www.facebook.com/medium',
'article:author': 'https://medium.com/@Medium',
'robots': 'noindex, follow',
'article:published_time': '2015-08-03T07:44:50.331Z',
'twitter:creator': '@Medium',
'twitter:site': '@Medium',
'og:site_name': 'Medium',
'twitter:label1': 'Reading time',
'twitter:data1': '7 min read',
'twitter:app:name:iphone': 'Medium',
'twitter:app:id:iphone': '828256236',
'twitter:app:url:iphone': 'medium://p/fccb00db67d7',
'al:ios:app_name': 'Medium',
'al:ios:app_store_id': '828256236',
'al:android:package': 'com.medium.reader',
'al:android:app_name': 'Medium',
'al:ios:url': 'medium://p/fccb00db67d7',
'al:android:url': 'medium://p/fccb00db67d7',
'al:web:url': 'https://medium.com/policy/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7'}
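
The map step described above can be sketched on a toy bag. The records, the `word_count` helper and the field values below are made up for illustration and are not the Medium data used in this tutorial; the synchronous scheduler is chosen only to keep the sketch free of worker processes.

```python
import dask.bag as db

# Hypothetical records standing in for the parsed JSON items
records = [
    {"title": "First post", "content": "dask bags hold python objects"},
    {"title": "Second post", "content": "map applies a function to each record"},
    {"title": "Third post", "content": "pluck selects one key"},
]


def word_count(record):
    """Return the title together with the number of words in the content."""
    return {"title": record["title"], "n_words": len(record["content"].split())}


bag = db.from_sequence(records, npartitions=2)

# Nothing is computed yet: map and pluck only extend the task graph
counts_bag = bag.map(word_count)
n_words_bag = counts_bag.pluck("n_words")

# compute() runs the graph and returns a plain Python list
n_words = n_words_bag.compute(scheduler="synchronous")
print(n_words)
```

Because the function is ordinary Python, it can be tested on a single dict first and only then mapped over the whole bag.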

%%time
print('Transform published column to datetime as we did with pandas, it will be slightly slower than in pandas \n')
df['published'] = pd.to_datetime(df.published, format='%Y-%m-%dT%H:%M:%S.%fZ')

Transform published column to datetime as we did with pandas, it will be slightly slower than in pandas
CPU times: user 277 ms, sys: 2.14 ms, total: 279 ms
Wall time: 277 ms

print('We can apply function with mixed transformation to dask dataframe written for pandas df without changes \n')

def additional_time_features_df(df, to_cat_cols=['Author', 'domain', 'month', 'year', 'day_of_week']):
    df['month'] = df['published'].apply(lambda ts: ts.month)
    df['year'] = df['published'].apply(lambda ts: ts.year)
    hour = df['published'].apply(lambda ts: ts.hour)
    df['hour'] = hour
    df['morning'] = ((hour >= 7) & (hour <= 11)).astype('float64')
    df['day'] = ((hour >= 12) & (hour <= 18)).astype('int')
    df['evening'] = ((hour >= 19) & (hour <= 23)).astype('int')
    df['night'] = ((hour >= 0) & (hour <= 6)).astype('int')
    df['sin_hour'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['cos_hour'] = np.cos(2 * np.pi * df['hour'] / 24)
    df = df.drop(["hour"], axis=1)
    day_of_week = df['published'].dt.dayofweek.astype('int')
    df['day_of_week'] = day_of_week
    df['weekend'] = (day_of_week >= 5).astype('int')
    # turn to categorical
    df[to_cat_cols] = df[to_cat_cols].astype('category')
    return df

We can apply function with mixed transformation to dask dataframe written for pandas df without changes
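
The `sin_hour` and `cos_hour` columns place the hour on a circle, so that 23:00 and 00:00 end up close together instead of 23 units apart. Here is a small sketch of that idea in plain pandas with made-up timestamps. It uses the vectorized `.dt.hour` accessor instead of `apply` with a lambda, which is my substitution here (the function above already uses `.dt.dayofweek` in the same way); `circle_distance` is a helper invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps, chosen to sit around midnight plus one at noon
df = pd.DataFrame({"published": pd.to_datetime([
    "2017-01-01 23:00:00",
    "2017-01-02 00:00:00",
    "2017-01-02 01:00:00",
    "2017-01-02 12:00:00",
])})

# The .dt accessor is vectorized, so no per-row lambda is needed
hour = df["published"].dt.hour
df["sin_hour"] = np.sin(2 * np.pi * hour / 24)
df["cos_hour"] = np.cos(2 * np.pi * hour / 24)


def circle_distance(i, j):
    """Euclidean distance between rows i and j in (sin_hour, cos_hour) space."""
    a = df.loc[i, ["sin_hour", "cos_hour"]].to_numpy(dtype=float)
    b = df.loc[j, ["sin_hour", "cos_hour"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))


d_23_to_00 = circle_distance(0, 1)  # one hour apart across midnight
d_00_to_01 = circle_distance(1, 2)  # one hour apart after midnight
d_00_to_12 = circle_distance(1, 3)  # opposite points on the circle
print(d_23_to_00, d_00_to_01, d_00_to_12)
```

The two one-hour gaps come out equal, and midnight versus noon is the largest possible distance, which is what the raw 0 to 23 hour value cannot express.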

Dask ML provides scalable machine learning algorithms in Python that are compatible with scikit-learn. Let us first look at how scikit-learn handles the computation and then at how Dask performs the same operations differently. See the dask-ml tutorials: Examples from dask ml

The biggest model from our course was a random forest on text data, from the week with the Random Forest assignment. Below I reproduce part of that assignment with a reduced nrows and a smaller max_features in CountVectorizer; you can also check it with the original parameters.

from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Split on 3 folds
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# In Pipeline we will modify the text and train logistic regression
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=500, ngram_range=(1, 3))),
    ('clf', LogisticRegression(random_state=17))])

As a counterpart to GridSearchCV in sklearn, Dask provides a library called Dask-searchCV (it is now included in Dask ML). It merges steps that are shared between parameter combinations, so there are fewer repeated computations. Below is the installation step for Dask-searchCV; it needs to be installed separately.

In [61]:

# pip3 install dask-searchcv
import dask_searchcv as dcv

We can use pipelines in the dask grid search. According to the documentation, dask pays off most on pipelines with many operations that can be parallelized, especially ones that include a feature union, but when I tried this I got an error. In any case, a time-consuming step such as CountVectorizer could not be parallelized here, so below the dask grid search is applied to the classifier only (see the documentation).

lr = LogisticRegression()
parameters = {'C': (0.1, 1, 10, 100)}
t_start = time.time()
grid_search = dcv.GridSearchCV(lr, parameters, scoring='roc_auc', cv=skf)
grid_search.fit(Xvect, y_text)
t_end = time.time()
print(f'Elapsed time for grid_search (without time spent on vectorization) {round((t_end - t_start))} (s):')

Elapsed time for grid_search (without time spent on vectorization) 0 (s):

In [64]:

grid_search.best_score_

Out[64]:

0.7020017187686919
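
For comparison, here is a sketch of searching over a whole pipeline, vectorizer included, in plain scikit-learn. This is not the dask version and not the assignment data: the corpus, labels and parameter grid are made up for illustration. As far as I understand the Dask ML documentation, `dask_ml.model_selection.GridSearchCV` is meant to accept the same estimator and parameter grid, but that is not exercised here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: label 1 for texts about dask, 0 for texts about cooking
texts = [
    "dask array chunks", "dask bag map", "dask dataframe partitions",
    "dask delayed graph", "dask scheduler threads", "dask persist memory",
    "boil the pasta", "chop the onions", "bake the bread",
    "fry the eggs", "stir the soup", "roast the chicken",
]
labels = [1] * 6 + [0] * 6

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", LogisticRegression(random_state=17)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=skf)
grid_search.fit(texts, labels)
print(grid_search.best_params_, grid_search.best_score_)
```

With a grid like this the vectorizer is refit for every value of `C` unless the search implementation shares that work, which is the kind of repetition the merged-steps idea is aimed at.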

I tried to see how well dask would do with a random forest using the original parameters, but the run sometimes fails with "OSError: [Errno 24] Too many open files", and I couldn't fix it. Sometimes it works fine, and on small data it works in most cases, but if you re-run this notebook several times there is a good chance of hitting this error. So I believe dask-ml is very useful, but for now I don't know how to use it properly.

There are a number of models rewritten for dask that accept dask objects (huge arrays) and fit on them. You can read more in the dask documentation. Below is an example with KMeans, but there are also dask versions of linear models and preprocessing functions. The API is very similar to scikit-learn's, so it should be easy to use.

In [66]:

from dask_ml import datasets
from dask_ml.cluster import KMeans

In [67]:

X, y = datasets.make_blobs(n_samples=10000000, chunks=1000000, random_state=0, centers=3)
# persist() triggers computation of the chunks and keeps them in memory;
# the result is still a dask array, not a NumPy array
X = X.persist()
X
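
To make the difference between `persist()` and `compute()` concrete, here is a minimal sketch on a small dask array. The shapes and chunk sizes are arbitrary and only `dask.array` is used, not `dask_ml`.

```python
import dask.array as da

# A small random array split into 4 chunks (sizes are arbitrary for the sketch)
x = da.random.random((1000, 10), chunks=(250, 10))
y = (x + 1).sum(axis=0)

# persist() runs the graph but keeps the result as a dask collection
y_persisted = y.persist()
print(type(y_persisted).__name__, y_persisted.shape)

# compute() returns a concrete NumPy array instead
y_value = y_persisted.compute()
print(type(y_value).__name__, y_value.shape)
```

Persisting is useful when the same intermediate result feeds several later steps, such as repeated passes of an iterative model over the same array.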

Actually, I read an article about dask a couple of days ago and decided that writing a tutorial would be a good way to get acquainted with the library. So please don't judge too strictly if I misunderstood something :))
