A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas
dataframes, split along the index. These Pandas dataframes may live on disk
for larger-than-memory computing on a single machine, or on many different
machines in a cluster. One Dask dataframe operation triggers many operations
on the constituent Pandas dataframes.
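The idea above can be sketched in a few lines of plain Pandas. This is a toy illustration of the model, not Dask's implementation: one DataFrame is split along the index into partitions, one logical operation fans out into one operation per partition, and a cheap aggregation combines the partial results.

```python
import pandas as pd

# Toy sketch (not Dask itself): split one Pandas DataFrame into
# partitions along the index, run the same operation on each partition,
# then combine the partial results.
df = pd.DataFrame({"x": range(10)}, index=range(10))

# Split along the index into three "partitions" (smaller Pandas frames).
partitions = [df.iloc[0:4], df.iloc[4:8], df.iloc[8:10]]

# One logical operation (a sum) triggers one operation per partition...
partial_sums = [part["x"].sum() for part in partitions]

# ...followed by a cheap aggregation of the partial results.
total = sum(partial_sums)
assert total == df["x"].sum()  # same answer as the single-frame computation
```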

A Dask dataframe coordinates many Pandas DataFrames/Series arranged along the
index. It is partitioned row-wise, grouping rows by index value for
efficiency; the constituent Pandas objects may live on disk or on other
machines.
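Grouping rows by index value pays off because the boundaries between partitions (which Dask records as "divisions") let an index lookup be routed to a single partition instead of scanning all of them. A minimal sketch of that routing, using only the standard library (the `divisions` values and `partition_for` helper here are illustrative, not Dask's API):

```python
import bisect

# divisions[i] is the smallest index value in partition i;
# divisions[-1] is the largest index value overall.
divisions = [0, 4, 8, 10]   # three partitions: [0, 4), [4, 8), [8, 10]

def partition_for(index_value):
    """Return which partition holds index_value, without scanning them all."""
    # bisect_right finds the first boundary greater than index_value.
    return bisect.bisect_right(divisions, index_value) - 1

assert partition_for(0) == 0
assert partition_for(5) == 1
assert partition_for(9) == 2
```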

Because the dask.dataframe application programming interface (API) is a
subset of the Pandas API, it should be familiar to Pandas users. There are
some slight alterations due to the parallel nature of Dask:
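To see why the API feels familiar, here is a hypothetical sketch of the pattern: a wrapper class that exposes a small subset of the Pandas method names and forwards each call to its underlying partitions. The `TinyParallelFrame` class is invented for illustration and is not how Dask is implemented.

```python
import pandas as pd

class TinyParallelFrame:
    """Hypothetical sketch: expose a subset of the Pandas API by
    forwarding each call to the underlying Pandas partitions."""

    def __init__(self, partitions):
        self.partitions = partitions

    # Same method names as Pandas, so existing Pandas code reads the same.
    def head(self, n=5):
        return self.partitions[0].head(n)             # touches one partition

    def sum(self):
        return sum(p.sum() for p in self.partitions)  # map, then reduce

df = pd.DataFrame({"x": [1, 2, 3, 4]})
tiny = TinyParallelFrame([df.iloc[:2], df.iloc[2:]])
assert tiny.sum()["x"] == df.sum()["x"]  # familiar spelling, partitioned execution
```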

By default, dask.dataframe uses the multi-threaded scheduler.
This exposes some parallelism when Pandas or the underlying NumPy operations
release the global interpreter lock (GIL). Generally, Pandas is more
GIL-bound than NumPy, so multi-core speed-ups are not as pronounced for
dask.dataframe as they are for dask.array. This is changing, and
the Pandas development team is actively working on releasing the GIL.
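The effect of the multi-threaded scheduler can be sketched with a standard-library thread pool running the per-partition work. This is an illustration of the idea, not Dask's scheduler: the threads only run truly in parallel during the stretches where the underlying operation releases the GIL; otherwise they take turns.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Sketch of the multi-threaded scheduler's idea: run per-partition work
# in a thread pool.  Parallel speed-up only appears while the underlying
# Pandas/NumPy operation holds no GIL; otherwise threads alternate.
partitions = [pd.DataFrame({"x": range(i, i + 100)}) for i in (0, 100, 200)]

def partition_sum(part):
    # A numeric reduction; NumPy can release the GIL for work like this.
    return part["x"].sum()

with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(partition_sum, partitions))

assert sum(partial) == sum(range(300))
```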

When dealing with text data, you may see speed-ups by switching to the newer
distributed scheduler, either on a cluster or on a single machine.