When working with data in Python, you will be using data structures from core Python as well as from several libraries, especially Numpy and Pandas. This notebook provides a brief overview of some useful data structures.

The main numpy data structure is the ndarray, which stands for "n-dimensional array". First, since numpy is a library, we need to import it.

In [17]:

import numpy as np

An ndarray is a homogeneous rectangular data structure. Being "homogeneous" means all data values in the container must have the same type. Numpy supports many data types. In the following cell, we create a 1-dimensional ndarray from a literal list, storing the values as 8-byte floating-point (double precision) numbers.

In [18]:

x = np.asarray([4, 1, 5, 4, 7, 3, 0], dtype=np.float64)

Since the data are all integers, we could have used an integer data type instead:

In [19]:

x = np.asarray([4, 1, 5, 4, 7, 3, 0], dtype=np.int64)

We can even store them as single byte values (since none of the values exceeds 255):

In [20]:

x = np.asarray([4, 1, 5, 4, 7, 3, 0], dtype=np.uint8)
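To see what the choice of data type means for storage, the following sketch builds the same values with each of the three dtypes used above and records the bytes per element (`itemsize`) and the total bytes (`nbytes`). The names `values` and `sizes` are ours, introduced only for illustration.

```python
import numpy as np

values = [4, 1, 5, 4, 7, 3, 0]

# Map each dtype name to (bytes per element, total bytes in the array).
sizes = {}
for dtype in (np.float64, np.int64, np.uint8):
    arr = np.asarray(values, dtype=dtype)
    sizes[arr.dtype.name] = (arr.itemsize, arr.nbytes)

print(sizes)
```

With seven values, the two 8-byte types occupy 56 bytes each, while the single-byte type occupies 7.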

We can index and slice an ndarray just like we index and slice a Python list:

In [21]:

print(x[2])
print(x[3:5])

5
[4 7]

In addition, ndarrays support two types of indexing that core Python lists do not. We can index with a Boolean array:

In [22]:

ii = np.asarray([False, False, True, False, True, False, False])
print(x[ii])

[5 7]
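In practice, a Boolean index is usually produced by a comparison rather than typed out by hand. The sketch below is one such case, using the same values as above; the names `mask` and `selected` are ours.

```python
import numpy as np

x = np.asarray([4, 1, 5, 4, 7, 3, 0], dtype=np.uint8)

# The comparison is applied elementwise and yields a Boolean array
# with the same shape as x.
mask = x > 3

# Indexing with the mask keeps only the positions where it is True.
selected = x[mask]

print(mask)
print(selected)
```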

We can also index using a list of positions:

In [23]:

ix = np.asarray([0, 3, 3, 5])
print(x[ix])

[4 4 4 3]

We can do elementwise arithmetic using numpy arrays as long as they are conformable (or can be broadcast to be conformable, but that is a more advanced topic). Note that numerical types are "upcast".
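As a small illustration of elementwise arithmetic and upcasting, the sketch below combines an integer array with a floating-point array of the same shape. The arrays `a` and `b` are made-up values chosen for this example.

```python
import numpy as np

a = np.asarray([1, 2, 3], dtype=np.int64)
b = np.asarray([0.5, 1.5, 2.5], dtype=np.float64)

# Both operations act element by element on arrays of equal shape.
total = a + b
product = a * b

# The integer operand is upcast, so the results are float64.
print(total, total.dtype)
print(product, product.dtype)
```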

Pandas is a library that provides data structures for manipulating heterogeneous data sets. As with numpy, we begin by importing it:

In [28]:

import pandas as pd

Below we create a Pandas Series (a one-dimensional homogeneous data structure). All Pandas data structures are "indexed", meaning that each axis is labeled with keys. The keys do not need to be sorted; Pandas maintains a lookup table for the index, so element access using a label is quite efficient.

In [29]:

x = pd.Series([3, 1, 7, 99, 0], index=["a", "b", "c", "d", "e"])
print(x)

a 3
b 1
c 7
d 99
e 0
dtype: int64

Pandas Series allow label-based indexing. Here are three equivalent approaches:

In [30]:

print(x["b"])

1

In [31]:

print(x.loc["b"])

1

In [32]:

print(x.b)

1

Pandas Series objects also allow position-based indexing:

In [33]:

print(x.iloc[1])

1

In [34]:

print(x.iloc[-2])

99

Next we create another Pandas Series:

In [35]:

y = pd.Series([5, 13, 7], index=["b", "e", "f"])
print(y)

b 5
e 13
f 7
dtype: int64

Two Series can be added (and subtracted, etc.). The values are aligned by index label, not by position, so if a label is missing in either summand, the result for that label is NaN.

In [36]:

print(x+y)

a NaN
b 6
c NaN
d NaN
e 13
f NaN
dtype: float64

Here is another way to do the same thing:

In [37]:

x.add(y)

Out[37]:

a NaN
b 6
c NaN
d NaN
e 13
f NaN
dtype: float64
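The NaN entries come from label alignment: before adding, each Series is effectively extended to the union of the two indexes, and labels it does not have are filled with NaN. The sketch below spells this out with `reindex`. It is meant as an illustration of the behavior, not a description of how Pandas implements it internally; the names `labels` and `total` are ours.

```python
import pandas as pd

x = pd.Series([3, 1, 7, 99, 0], index=["a", "b", "c", "d", "e"])
y = pd.Series([5, 13, 7], index=["b", "e", "f"])

# Union of the labels that appear in either Series.
labels = x.index.union(y.index)

# reindex inserts NaN wherever a Series lacks a label, and any sum
# involving NaN is NaN.
total = x.reindex(labels) + y.reindex(labels)

print(total)
```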

A "fill value" is used in place of any missing value:

In [38]:

x.add(y, fill_value=0)

Out[38]:

a 3
b 6
c 7
d 99
e 13
f 7
dtype: float64

Here is how we do position-based slicing for a Pandas Series (note that the end point of the range is not included in the slice):
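The cell for this step is not shown here, so the following is a minimal sketch of position-based slicing with `iloc`, using the Series `x` defined earlier. For contrast, it also shows a label-based slice with `loc`, which includes both end points. The names `by_position` and `by_label` are ours.

```python
import pandas as pd

x = pd.Series([3, 1, 7, 99, 0], index=["a", "b", "c", "d", "e"])

# Positions 1 and 2 are selected; position 3 (the end point) is excluded.
by_position = x.iloc[1:3]
print(by_position)

# A label-based slice includes both end points, "b" and "d".
by_label = x.loc["b":"d"]
print(by_label)
```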

Help on built-in function trace:

trace(...)
    a.trace(offset=0, axis1=0, axis2=1, dtype=None, out=None)

    Return the sum along diagonals of the array.

    Refer to `numpy.trace` for full documentation.

    See Also
    --------
    numpy.trace : equivalent function

When you slice a numpy array using basic slicing, you get a view that shares memory with the array you sliced from, not a copy. This means that if you change the slice, you also change the corresponding values in the parent array. To illustrate this, let's first create an array and take a slice from it.

In [64]:

x = np.array([[1, 4], [3, 2], [5, 6]])
y = x[1, :]
print(x)
print("\n")
print(y)

[[1 4]
 [3 2]
 [5 6]]


[3 2]

Now we change a value in the slice, and check the state of the parent array (x):

In [65]:

y[1] = 88
print(y)
print("\n")
print(x)

[ 3 88]


[[ 1  4]
 [ 3 88]
 [ 5  6]]

If you want to avoid this behavior, create a copy.

In [66]:

y = x[1, :].copy()

If you aren't sure whether you are getting a reference, use the id function:
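The cell for this step is not shown here, so the following is only a sketch of how such a check might look. Note that `id` identifies the Python object, so it detects a plain alias created by assignment, but a slice is a distinct object even though it shares data with its parent. For that case we use `np.shares_memory`, which is our addition and is not mentioned in the text above. The names `alias`, `view` and `independent` are ours.

```python
import numpy as np

x = np.array([[1, 4], [3, 2], [5, 6]])

alias = x                    # plain assignment: the very same object
view = x[1, :]               # basic slice: new object, shared data
independent = x[1, :].copy() # copy: new object, separate data

# id compares object identity only.
print(id(alias) == id(x))
print(id(view) == id(x))

# np.shares_memory checks whether two arrays overlap in memory.
print(np.shares_memory(x, view))
print(np.shares_memory(x, independent))
```

Here the `id` comparison is true only for `alias`, while `np.shares_memory` reports that `view` still shares data with `x` and `independent` does not.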