Modin is an early stage DataFrame library that wraps pandas and transparently
distributes the data and computation, accelerating your pandas workflows with one line
of code change. The user does not need to know how many cores their system has, nor do
they need to specify how to distribute the data. In fact, users can continue using their
previous pandas notebooks while experiencing a considerable speedup from Modin, even on
a single machine. Only a modification of the import statement is needed, as we
demonstrate below. Once you’ve changed your import statement, you’re ready to use Modin
just like you would pandas, since the API is identical to pandas.

# import pandas as pdimportmodin.pandasaspd

Currently, we have part of the pandas API implemented and are working toward full
functional parity with pandas.

In order to use the most up-to-date version of Modin, please follow the instructions on
the installation page

Once you import the library, you should see something similar to the following output:

>>> import modin.pandas as pd
Waiting for redis server at 127.0.0.1:14618 to respond...
Waiting for redis server at 127.0.0.1:31410 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.
======================================================================
View the web UI at http://localhost:8889/notebooks/ray_ui36796.ipynb?token=ac25867d62c4ae87941bc5a0ecd5f517dbf80bd8e9b04218
======================================================================

Once you have executed importmodin.pandasaspd, you’re ready to begin
running your pandas pipeline as you were before.

In some cases, it may be useful to customize your Ray environment. Below, we have listed
a few ways you can solve common problems in data management with Modin by customizing
your Ray environment. It is possible to use any of Ray’s initialization parameters,
which are all found in Ray’s documentation.

importrayray.init()importmodin.pandasaspd

Modin will automatically connect to the Ray instance that is already running. This way,
you can customize your Ray environment for use in Modin!

If you are working with very large files or would like to exceed your memory, you may
change the primary location of the DataFrame. If you would like to exceed memory, you
can use your disk as backup for the memory. This API is experimental in the context of
Modin. Please let us know what you think!

Instead of limiting the size of your DataFrame to the amount of memory you have, you can
back your memory with disk:

importraynum_bytes=2**40# Make sure you have disk space to do this!ray.init(plasma_directory="/tmp",object_store_memory=num_bytes)importmodin.pandasaspd

Setting plasma_directory="/tmp" uses your disk for storing the DataFrame and setting
object_store_memory sets the maximum size of the plasma store.

Note: This will impact performance for most operations. This should be used when you are
trying to use very large datasets.

Warning: Make sure you have enough space in your disk for however many bytes yourequest for your DataFrame

By default, Modin will use all of the resources available on your machine. It is
possible, however, to limit the amount of resources Modin uses to free resources for
another task or user. Here is how you would limit the number of CPUs Modin used:

importrayray.init(num_cpus=4)importmodin.pandasaspd

Specifying num_cpus limits the number of processors that Modin uses. You may also
specify more processors than you have available on your machine, however this will not
improve the performance (and might end up hurting the performance of the system).