.. _goals:
=======================
Goals and Motivations
=======================
The pandas codebase is now over 8 years old, having grown to over 200,000
lines of code from roughly 10,000 LOC in the original 0.1 open source release
in January 2010.
At a high level, the "pandas 2.0" effort is based on a number of observations:
* The pandas 0.x series of releases has consisted of huge amounts of
  iterative improvement to the library, along with some major new features,
  bug fixes, and improved documentation. There have also been a series of
  deprecations, API changes, and other evolutions of pandas's API to account
  for suboptimal design choices (for example: the ``.ix`` operator) made in
  the early days of the project (2010 to 2012).
* The unification of Series and DataFrame internals to be based on a common
``NDFrame`` base class and "block manager" data structure (originally created
by me in 2011, and heroically driven forward to its modern form by Jeff
Reback), while introducing many benefits to pandas, has come to be viewed as
a long-term source of technical debt and code complexity.
* pandas's ability to support an increasingly broad set of use cases has been
significantly constrained (as will be examined in detail in these documents)
by its tight coupling to NumPy and therefore subject to various limitations
in NumPy.
* Making significant functional additions to pandas, particularly new data
  types (filling gaps in NumPy), has grown increasingly complex, with very
  obvious accumulations of technical debt.
* pandas is being used increasingly for very large datasets on machines with
many cores and large amounts of RAM (100s of gigabytes to terabytes). It
would be nice to be able to better utilize these larger, beefier systems
within a single Python process.
* pandas is being used increasingly as a computational building block of
  larger systems such as Dask or Apache Spark. We should consider reducing
  the overhead for making data accessible to pandas (i.e. via memory-mapping
  or other low-overhead memory sharing).
* Rough edges in pandas's implementation (e.g. its handling of missing data
across data types) are being exposed to users.
These documents are largely concerned with pandas's internal design, which is
mostly invisible to average users. Advanced users of pandas are generally
familiar with some of these internal details, particularly around performance
and memory use, and so the degree to which users are impacted will vary quite
a lot.
Goals
=====
Some high-level goals of the pandas 2.0 plan include the following:
* Fixing long-standing limitations or inconsistencies in missing data: null
values in integer and boolean data, and a more consistent notion of null /
NA.
* Improved performance and utilization of multicore systems
* Better user control / visibility of memory usage (which can be opaque and
  difficult to control)
* Clearer semantics around non-NumPy data types, and permitting new pandas-only
data types to be added
* Exposing a "libpandas" C/C++ API to other Python library developers: the
internals of Series and DataFrame are only weakly accessible in other
developers' native code. This has been a limitation for scikit-learn and
other projects requiring C or Cython-level access to pandas object data.
* Removal of deprecated functionality
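To make the first goal concrete, here is a minimal sketch (using plain NumPy,
whose behavior underlies the current limitation) of why integer and boolean
data cannot currently hold null values without a dtype change:

```python
import numpy as np

# NumPy integer dtypes have no representation for a missing value; NaN is a
# floating-point concept, so introducing it silently promotes the array.
ints = np.array([1, 2, 3], dtype=np.int64)
assert ints.dtype == np.int64

with_na = np.array([1, np.nan, 3])  # NaN forces promotion to float64
assert with_na.dtype == np.float64

# Boolean data is affected the same way: True/False become 1.0/0.0.
bools_with_na = np.array([True, np.nan, False])
assert bools_with_na.dtype == np.float64
```

Native missing-data support would keep the integer or boolean physical
storage and track nullness separately (for example, via a validity bitmap),
removing the need for this implicit cast.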
Non-goals / FAQ
===============
As this will be quite a nuanced discussion, especially for those not
intimately familiar with pandas's implementation details, I wanted to briefly
address some commonly asked questions:
1. **Will this work make it harder to use pandas with NumPy, scikit-learn,
statsmodels, SciPy, or other libraries that depend on NumPy
interoperability?**
* We are not planning on it. Data that is representable without memory
copying or conversion in NumPy arrays will continue to be 100%
interoperable.
* Data containing missing (NA) values may require explicit conversion where
it is not currently required. For example: integer or boolean type arrays
with missing data. I trust this will be seen as a positive development.
* If anything, more performant and more precise data semantics in pandas will
generally make production code using a downstream library like scikit-learn
more dependable and future-proof.
2. **By decoupling from NumPy, it sounds like you are reimplementing NumPy or
adding a new data type system**
* Simply put: no. But it's more complicated than that because of the
numerous interpretations of "type system".
* pandas already contains a large amount (10s of KLOCs) of custom
  computational code that implements functionality not present in NumPy.
* pandas already features its own (what I will describe as a) "logical type
system", including things like custom data types (such as that of
``pandas.Categorical``), pandas-specific missing data representation, and
implicit type casting (e.g. integer to float on introduction of missing
data). Unfortunately, these logical data types are somewhat weakly
expressed, and the mix of NumPy dtype objects and custom pandas types is
problematic for many internal (implementation) and external (user API)
reasons. I will examine in detail the difference between **physical
types** (i.e. NumPy's dtypes) and **logical types** (i.e. what pandas
currently has, implicitly).
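The physical-versus-logical distinction can be sketched in a few lines. This
is a hypothetical illustration (the names ``categories``, ``codes``, and
``values`` are invented for this example), not pandas's actual
implementation: a categorical logical type is backed by a compact physical
integer array plus a table of labels.

```python
import numpy as np

# Hypothetical sketch of a categorical "logical type": the physical storage
# is a small integer codes array, while the logical values are the labels.
categories = np.array(["low", "mid", "high"])  # the logical value domain
codes = np.array([0, 2, 1, 0], dtype=np.int8)  # physical type: int8
values = categories[codes]                     # the values a user observes

assert codes.dtype == np.int8
assert values.tolist() == ["low", "high", "mid", "low"]
```

Here the physical type (int8) and the logical type (a category drawn from
three labels) are deliberately different things; a clearer internal
separation of the two is one of the aims discussed throughout these
documents.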
3. **Shouldn't you try to accomplish your goals by contributing work to NumPy
instead of investing major work in pandas's internals?**
* In my opinion, this is a "false dichotomy"; i.e. these things are not
mutually exclusive.
* Yes, we should define, scope, and if possible help implement improvements
to NumPy that make sense. As NumPy serves a significantly larger and more
diverse set of users, major changes to the NumPy C codebase must be
approached more conservatively.
* It is unclear that pandas's body of domain-specific data handling and
computational code is entirely "in scope" for NumPy. Some technical
details, such as our categorical or datetime data semantics, "group by"
functionality, relational algebra (joins), etc., may be ideal for pandas
but not necessarily ideal for a general user of NumPy. My opinion is that
functionality from NumPy we wish to use in pandas should "pass through" to
the user unmodified, but we must retain the flexibility to work "outside
the box" (implement things not found in NumPy) without adding technical
debt or user API complexity.
4. **API changes / breaks are thought to be bad; don't you have a
responsibility to maintain backwards compatibility for users that heavily
depend on pandas?**
* It's true that breaking or changing APIs is disruptive to users, and so
  any such change should be approached with extreme caution.
* The goal of the pandas 2.0 initiative is to only make "good" API breaks
that yield a net benefit that can be easily demonstrated. As an example:
adding native missing data support to integer and boolean data (without
casting to another physical storage type) may break user code that has
knowledge of the "rough edge" (the behavior that we are fixing). As these
changes will mostly affect advanced pandas users, I expect they will be
welcomed.
* Any major API change or break will be documented and justified to assist
with code migration.
* As soon as we are able, we will post binary development artifacts for the
pandas 2.0 development branch to get early feedback from heavy pandas
users to understand the impact of changes and how we can better help the
existing user base.
* Some users will find that a certain piece of code has been working "by
  accident" (i.e. relying upon undocumented behavior). Unfortunately, this
  kind of breakage is already a routine occurrence.
Summary
=======
Overall, the goal of the pandas 2.0 project is to yield a faster, more cleanly
architected, and more future-proof library that is a drop-in replacement for
90-95% of pandas user code. There will be API / code breakages, but the intent
of any code breakage will almost always be to fix something that has been
"wrong" or inconsistent. Many advanced users will have worked around some of
these rough edges, and so their workarounds may either need to be removed or
changed to accommodate the new (and, hopefully it can be agreed in each case,
better) semantics.