Using Python for Streaming Data with Iterators

Next, the tutorial uses the Pandas read_csv() function and chunksize argument to read by chunk of the same dataset, World Bank World Development Indicator.

Another way to read data too large to store in memory in chunks is to read the file in as DataFrames of a certain length (using the chunksize).

First and foremost, import the pandas’ package, then use the .read_csv() function which creates an interable reader object. Then, it can use next() method on it to print the value. Refer below for the sample code.

The output of the above code which processes the data by chunk size of 10.

The next 10 records will be from index 10 to 19.

Next, the tutorials require to create another DataFrame composed of only the rows from a specific country. Then, zip together two of the columns from the new DataFrame, ‘Total Population’ and ‘Urban population (% of total)’. Finally, create a list of tuples from the zip object, where each tuple is composed of a value from each of the two columns mentioned.

It requires to use list comprehension to create a new DataFrame. The values in this new DataFrame column is ‘Total Urban Population’, therefore, the product of the first and second element in each tuple.

Furthermore, because the 2nd element is a percentage, it needs to divide the entire result by 100, or alternatively, multiply it by 0.01.

Then, using the Matplotlib’s package, plot the scatter plot with new column ‘Total Urban Population’ and ‘Year’. It is quite a lot of stuff combined together up to this point. See the code and result of the plot shown below.

I realized that the ‘Year’ is not in integer format. I printed out the value of the column ‘Year’ and it looked perfectly fine. Do you know what is wrong with my code above?

This time, you will aggregate the results over all the DataFrame chunks in the dataset. This basically means you will be processing the entire dataset now. This is neat because you’re going to be able to process the entire large dataset by just working on smaller pieces of it!

Below sample code consists of some of DataCamp’s code, so some variable names have changed to use their variable names. Here, it requires to append the DataFrame chunk into a variable called ‘data’ and plot the scatter plot.

Both of the df_pop_ceb look the same, so how come the scatter plots show differently. I still cannot figure out why my scatter plot is showing ‘Year’ with decimal point.

Lastly, wrap up the tutorial by creating an user defined function and taking two parameters, the filename and the country code to do all the above. The source code and another scatter plot is shown as below: