pandas offers support for columns containing strings (ASCII or Unicode) on a
somewhat ad hoc basis.

Strings are stored in NumPy arrays of PyObject* / numpy.object_
dtype. This has several problems:

Computations (e.g. groupby operations) typically fall back to a code path
for generic Python objects. For example, comparisons and hashing go through
the PyObject_* C API functions. In addition to harming multithreading
due to GIL contention (you must hold the GIL to call these functions),
these can also be significantly slower than algorithms that operate on
const char*, which can take advantage of hardware optimizations.

String arrays often contain many copies of, or references to, the same
PyString, so some algorithms perform redundant computation. Some
parts of pandas, like pandas.read_csv, make an effort to deduplicate
strings to save memory and accelerate computations (e.g. if you compute
x == y and x and y are references to the same PyObject*, Python
can skip comparing their internal data).
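The identity fast path is observable from pure Python (a small illustration, not pandas internals):

```python
import numpy as np

# One array holds three references to a single string object; the other
# holds three distinct (but equal) string objects built at runtime.
shared = np.array(["pandas"] * 3, dtype=object)
distinct = np.array(["".join(["pan", "das"]) for _ in range(3)], dtype=object)

assert shared[0] is shared[1]        # same PyObject*: comparison can short-circuit
assert distinct[0] == distinct[1]    # equal contents...
assert distinct[0] is not distinct[1]  # ...but separate objects, so == must scan the data
```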

Note that this is somewhat mitigated by using pandas.Categorical, but
this is not the default storage mechanism. More on this below.
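For illustration, the Categorical mitigation can be measured with the public API:

```python
import pandas as pd

# A highly repetitive object-dtype string column vs. its category equivalent.
s = pd.Series(["low", "medium", "high"] * 10_000, dtype=object)
c = s.astype("category")

obj_bytes = s.memory_usage(deep=True)
cat_bytes = c.memory_usage(deep=True)

# Category storage is small integer codes plus three category strings,
# so it is far more compact for repetitive data.
assert cat_bytes < obj_bytes
```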

Using PyString objects and PyObject* NumPy storage adds non-trivial
overhead to each value (52 bytes in Python 3, slightly less in Python 2;
see this exposition for a deeper dive).
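Those per-value numbers can be spot-checked with sys.getsizeof (exact figures vary across CPython versions):

```python
import sys

# Per-string overhead in CPython: even an empty str carries a sizable
# header; an ASCII str then costs roughly one extra byte per character.
empty = sys.getsizeof("")
dated = sys.getsizeof("2017-01-01")
print(empty, dated)
```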

The data is already categorical: a cast to category dtype can be performed
very cheaply and without duplicating the underlying string memory buffer.

Computations like groupby on dictionary-encoded strings will be as
performant as those on Categorical currently are.
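For reference, this is the Categorical behavior being matched: groupby on a category column works off the integer codes rather than hashing Python string objects.

```python
import pandas as pd

df = pd.DataFrame({
    "key": pd.Categorical(["a", "b", "a", "c", "b", "a"]),
    "val": [1, 2, 3, 4, 5, 6],
})

# The group keys are already dictionary-encoded as int codes, so the
# groupby machinery never needs to hash string objects.
out = df.groupby("key", observed=True)["val"].sum()
print(out.to_dict())  # {'a': 10, 'b': 7, 'c': 4}
```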

Some drawbacks:

This memory layout is best used as an immutable representation. Mutating
slots becomes more complex: whether a single-value assignment or a put /
array assignment, mutation would likely require constructing a new data buffer
(either by realloc or some other copying mechanism). Without a compaction
/ “garbage collection” step on this buffer it would be possible to have “dead”
memory inside it (for example, if you did arr[:] = 'a-new-string-value',
all the existing values would be orphaned).
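A toy sketch of the problem (hypothetical code, not a proposed pandas implementation): an append-only buffer with per-slot spans, where assignment appends new bytes and strands the old ones.

```python
class StringBuffer:
    """Append-only string storage; each slot holds a (start, end) span into `data`."""

    def __init__(self, values):
        self.data = bytearray()
        self.spans = []
        for v in values:
            raw = v.encode("utf-8")
            start = len(self.data)
            self.data += raw
            self.spans.append((start, len(self.data)))

    def __getitem__(self, i):
        start, end = self.spans[i]
        return self.data[start:end].decode("utf-8")

    def __setitem__(self, i, value):
        # Mutation appends new bytes and repoints the slot; the old bytes
        # stay behind as unreachable "dead" memory until a compaction pass.
        raw = value.encode("utf-8")
        start = len(self.data)
        self.data += raw
        self.spans[i] = (start, len(self.data))


buf = StringBuffer(["foo", "bar"])
buf[0] = "a-new-string-value"
assert buf[0] == "a-new-string-value"
assert b"foo" in buf.data  # the old value's bytes are still in the buffer (dead)
```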

Some systems have addressed this issue by storing all string data in a
“global string hash table”. This is something we could explore, but it
would add quite a bit of complexity to implement and may not be worthwhile
at this time.

Indexing into this data structure to obtain a single Python object will
likely need to call PyUnicode_FromStringAndSize to construct a string
(Python 3, therefore Unicode). This requires a memory allocation, whereas the
current representation only has to do a Py_INCREF.
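The current object-dtype behavior is easy to verify:

```python
import numpy as np

s = "some-string-value"
arr = np.array([s], dtype=object)

# Indexing an object array hands back the stored pointer (a Py_INCREF),
# not a freshly constructed copy of the character data.
assert arr[0] is s
```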

Many of pandas’s existing algorithms that assume Python objects would need to
be specialized to take advantage of this new memory layout. This is both a pro
and a con, as it will most likely yield significantly better performance.

One trade-off is that creating the temporary Python strings is potentially
costly. This could be mitigated for the Python str methods (which could use an
optimized array-oriented code path under the hood), but for arbitrary user
functions you would still have to pay that cost.
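To illustrate the distinction with today's public API (the array-oriented fast path for str methods is speculative here):

```python
import pandas as pd

s = pd.Series(["foo", "Bar", "baz"], dtype=object)

# Vectorized .str accessor: the kind of operation that an array-oriented
# code path could run without materializing temporary Python strings.
upper_fast = s.str.upper()

# Arbitrary user function: each element must be boxed as a Python str
# before the function sees it, so the temporary-object cost is unavoidable.
upper_slow = s.map(lambda x: x.upper())

assert upper_fast.equals(upper_slow)
```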