When you start using third-party libraries, pay attention to the license and/or copyright information that comes with them. Generally, if the library/code has no license, all rights are reserved by the author (do not use the code). If it’s [GPL](http://www.gnu.org/licenses/gpl.html), your application/script must also be licensed under the GPL when you distribute it. Technically, any license that is GPL-compatible is fine to use too: GPL Compatible Licenses or GPL compatibility.

If it’s MIT or 2-clause BSD, you can do nearly whatever you want (no need to use the same license) as long as you keep the original copyright and license notice. 3-clause BSD adds one condition: you may not use the original author’s name to endorse or promote your work without permission.

For the Curious

Code being up on GitHub does _not_ mean that it is free to use. If you want to use a library that has no license, ask the developer whether they plan to add a LICENSE file or a license notice in the headers of the files.

If you want to open source your code (yay, go you!), include your desired license either as a separate file or within the preamble/beginning of your code. Licensing your code is simply copying & pasting the required language of a license of your choice into your codebase.

CAUTION! Double check your employment agreement. Sometimes, especially if you are in any tech-related role, your employment contract contains statements that stipulate which code, and when, actually belongs to the employer. It may be only code that is written on their equipment and/or during work hours. Or it may be any code written during the time of employment. The stipulations can even change across states and countries within a single employer.

In order to read a CSV file (such as one exported from Excel), we have to import the csv module from Python’s standard library.

    import csv

MY_FILE defines a global; notice how it’s all caps, a convention for variables we won’t be changing. Included in this repo is a sample file, and its path is assigned to this variable.

    MY_FILE = "../data/sample_sfpd_incident_all.csv"

The Parse Function

In defining the function, we know that we want to give it the CSV file, as well as the delimiter that the CSV file uses to separate each element/column.

    def parse(raw_file, delimiter):

We also know that we want to return a JSON-like object. For our purposes, that is just a collection of dictionaries; a JSON object maps keys to values, much like Python’s dictionary.

    def parse(raw_file, delimiter):

        return parsed_data
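To make "JSON-like" a little more concrete, here is a small sketch of the shape we are aiming for: a list of dictionaries. The keys and values below are made up for illustration and are not taken from the sample file; the standard library json module is used only to show that this structure converts to JSON and back cleanly.

```python
import json

# Hypothetical rows; the keys stand in for CSV column headers.
example_parsed_data = [
    {"name": "apple", "color": "red"},
    {"name": "plum", "color": "purple"},
]

# A list of dictionaries maps directly onto a JSON array of objects.
as_json = json.dumps(example_parsed_data)
round_tripped = json.loads(as_json)

print(as_json)
print(round_tripped == example_parsed_data)
```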

Let’s be good coders and write a documentation-string (doc-string) for future folks that may read our code. Notice the triple-quotes:

    def parse(raw_file, delimiter):
        """Parses a raw CSV file to a JSON-like object."""

        return parsed_data

For the Curious

If you are interested in understanding how docstrings work, Python’s PEP (Python Enhancement Proposal) documents spell out how to craft docstrings: PEP 8 and PEP 257. This also gives you a peek at what is considered “Pythonic”.
The difference between """docstrings""" and # comments has to do with who the reader will be. Within a Python shell, if you call help on a particular function or class, it will return the """docstring""" that the developer has written.
There are also documentation programs that look specifically for """docstrings""" to help the developer automatically produce documentation separate from the code. Within docstrings, it’s helpful to say imperatively what the function/method or class is supposed to do. Examples of how the documented code should work can also be written in the docstrings (and, subsequently, tested). # comments, on the other hand, are for those reading through the code; they simply say what a specific piece/line of code is meant to do. Inline # comments are always appreciated by those reading through your code. Many developers also leave # TODO or # FIXME statements to comb through later.
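As a rough illustration of the points above, the sketch below puts a docstring, an example inside the docstring, and an inline comment side by side. The add_tax function is hypothetical and exists only for this example; the standard library doctest module is one of the tools that can find and check examples written in docstrings.

```python
import doctest


def add_tax(price, rate):
    """Return the price with tax applied at the given rate.

    >>> add_tax(100, 0.5)
    150.0
    """
    # Inline comment: explains this one line to someone reading the code.
    return price * (1 + rate)


# The docstring is stored on the function; help(add_tax) displays it.
print(add_tax.__doc__)

# doctest finds the ">>>" example in the docstring and checks its output.
tests = doctest.DocTestFinder().find(add_tax, "add_tax", globs={"add_tax": add_tax})
runner = doctest.DocTestRunner(verbose=False)
for test in tests:
    runner.run(test)

print("failed:", runner.failures, "attempted:", runner.tries)
```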

What we have now is a pretty good skeleton: we know what parameters the function will take (raw_file and delimiter), what it is supposed to do (our """doc-string"""), and what it will return, parsed_data. Notice how the names of the parameters and the return value are descriptive in themselves.

Let’s sketch out, with comments, how we want this function to take a raw file and give us the format that we want. First, let’s open the file, then read the file, then build the parsed_data element.

Thankfully, Python has a lot of built-in functions that we can use to do all the steps that we’ve outlined with our comments. The first one we’ll use is open; we pass it raw_file, which comes from the parameters we defined for the parse function:

    opened_file = open(raw_file)

So we’ve told Python to open the file; now we have to read it. We use the csv module that we imported earlier:

    csv_data = csv.reader(opened_file, delimiter=delimiter)

Here, csv.reader is a function of the csv module. We gave it two arguments: opened_file and delimiter. It’s easy to get confused when parameters and variables share names. In delimiter=delimiter, the first delimiter is referring to the name of the parameter that csv.reader needs; the second delimiter refers to the argument that our parse function takes in.
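The name-sharing can be seen in isolation with a small sketch. Everything here is illustrative: read_rows is a hypothetical helper, and io.StringIO stands in for a real opened file so the example runs without any file on disk.

```python
import csv
import io

# Hypothetical in-memory "file" that uses a semicolon between columns.
fake_file = io.StringIO("name;color\napple;red\n")


def read_rows(opened_file, delimiter):
    # Left of "=": the keyword name that csv.reader expects.
    # Right of "=": our own function's parameter, which holds ";" here.
    return list(csv.reader(opened_file, delimiter=delimiter))


rows = read_rows(fake_file, ";")
print(rows)  # [['name', 'color'], ['apple', 'red']]
```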

Just to quickly put these two lines in our parse function:

    def parse(raw_file, delimiter):
        """Parses a raw CSV file to a JSON-like object"""

        # Open CSV file
        opened_file = open(raw_file)

        # Read the CSV data
        csv_data = csv.reader(opened_file, delimiter=delimiter)

        # Build a data structure to return parsed_data

        # Close the CSV file

        return parsed_data

For the Curious

The csv_data object, in Python terms, is now an iterator. In very simple terms, this means we can get each element in csv_data one at a time.

Alright, building the data structure might seem tricky. The best way to start is to assign an empty Python list to our parsed_data variable so we can add every row of data that we parse.

    parsed_data = []

Good — we have a good data structure to add to. Now let’s first address our column headers that came with the CSV file. They will be the first row, and we’ll assign them to the variable fields:

    fields = next(csv_data)

For the Curious

We were able to call next on csv_data because it is an iterator. We call next just once, since the headers are in the first row of our CSV file and nowhere else.
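A short sketch of that iterator behaviour, assuming Python 3 and using io.StringIO with made-up contents in place of the sample file: one call to next pulls off the header row, and whatever is left is only handed out once.

```python
import csv
import io

# Made-up CSV contents standing in for a real file.
fake_file = io.StringIO("name,color\napple,red\nplum,purple\n")
csv_data = csv.reader(fake_file, delimiter=",")

# One call to next() takes the first row (the headers) off the iterator.
fields = next(csv_data)

# Looping (or list()) picks up where next() left off.
remaining = list(csv_data)

# The iterator is now exhausted, so there is nothing left to hand out.
leftover = list(csv_data)

print(fields)     # ['name', 'color']
print(remaining)  # [['apple', 'red'], ['plum', 'purple']]
print(leftover)   # []
```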

Let’s loop over each row now that we have the headers properly taken care of. With each loop, we will add a dictionary that maps a field (those column headers) to the value in the CSV cell.

    for row in csv_data:
        parsed_data.append(dict(zip(fields, row)))

Here, we iterated over each row in the csv_data item. With each loop, we appended a dictionary (dict()) to our list, parsed_data. We use Python’s built-in zip() function to zip together header → value pairs to make a dictionary for every row.
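To see what zip() and dict() each contribute, here is the same step on a single hand-written row. The header names and values are invented for the example.

```python
# Invented header names and one invented row of values.
fields = ["name", "color"]
row = ["apple", "red"]

# zip() pairs items up by position: first with first, second with second.
pairs = list(zip(fields, row))

# dict() turns those (key, value) pairs into a dictionary.
record = dict(pairs)

print(pairs)   # [('name', 'apple'), ('color', 'red')]
print(record)  # {'name': 'apple', 'color': 'red'}
```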

Now let’s put the function together:

    def parse(raw_file, delimiter):
        """Parses a raw CSV file to a JSON-like object"""

        # Open CSV file
        opened_file = open(raw_file)

        # Read the CSV data
        csv_data = csv.reader(opened_file, delimiter=delimiter)

        # Setup an empty list
        parsed_data = []

        # Skip over the first line of the file for the headers
        fields = next(csv_data)

        # Iterate over each row of the csv file, zip together field -> value
        for row in csv_data:
            parsed_data.append(dict(zip(fields, row)))

        # Close the CSV file
        opened_file.close()

        return parsed_data
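As an aside, a common alternative to calling close() yourself is a with block, which closes the file automatically. The sketch below is one possible variant, not the tutorial's version: it assumes Python 3, parse_with_context is a hypothetical name, and the throwaway file it parses contains made-up data. Passing newline="" to open is what the csv module documentation recommends for files handed to csv.reader.

```python
import csv
import os
import tempfile


def parse_with_context(raw_file, delimiter):
    """Parses a raw CSV file to a JSON-like object (sketch using `with`)."""
    parsed_data = []

    # The with block closes the file for us, even if an error occurs inside.
    with open(raw_file, newline="") as opened_file:
        csv_data = csv.reader(opened_file, delimiter=delimiter)
        fields = next(csv_data)
        for row in csv_data:
            parsed_data.append(dict(zip(fields, row)))

    return parsed_data


# Try it on a small throwaway file with made-up contents.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("name,color\napple,red\n")

sample = parse_with_context(tmp.name, ",")
os.remove(tmp.name)

print(sample)  # [{'name': 'apple', 'color': 'red'}]
```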

Using the new Parse function

Let’s define a main() function to act as the starting point for our script,
and use our new parse() function:

    def main():
        # Call our parse function and give it the needed parameters
        new_data = parse(MY_FILE, ",")

        # Let's see what the data looks like!
        print(new_data)

We called our function parse() and gave it the MY_FILE global variable that we defined at the beginning, as well as the delimiter ",".

We assign the result of calling the function to the variable new_data, since parse() returns a parsed_data object. Last, we print new_data to see our list of dictionaries!

One final bit: when you run a Python file from the command line, Python will execute all of the code found in it. Since the condition in the following bit is True,

    if __name__ == "__main__":
        main()

it will call the main() function. By doing the __name__ == "__main__" check, you can have that code execute only when you run the module as a program (via the command line) and not when someone just wants to import the parse() function into another Python file. This is referred to as “boilerplate code”: code that doesn’t really do anything itself and yet is necessary.

Putting it to action

So you’ve written the parse function and your parse.py file looks like mine. Now what? Let’s run it and parse some d*mn files!

Be sure to activate the virtualenv that you created earlier in setup. Your terminal prompt should look something like this:

    (DataVizProj)$

Within the new-coder/dataviz/ directory, let’s make a directory for the Python files you are writing with the bash command mkdir [Directory_Name]; the next step assumes the directory is named MySourceFiles.

Go ahead and save your copy of parse.py into MySourceFiles (through “Save As” within your text editor). You should see the file in the directory if you return to your terminal and type ls.

To run the Python code, you have to tell the terminal to execute the parse.py file with python:

    (DataVizProj)$ python parse.py

If you got a traceback, or an error message, compare your parse.py file with new-coder/dataviz/tutorial_source/parse.py. Perhaps there is a typo, or your virtualenv isn’t set up properly.

The output from (DataVizProj)$ python parse.py should look like a bunch of dictionaries in one list. The data itself doesn’t have to match exactly, but the structure of {“key”: “value”} should look familiar.

You see this output because, in the main() function, you explicitly print new_data, which feeds to the output of the terminal. You could, for instance, not print the new_data variable and just pass it to another function. Coincidentally, that’s what Part II and Part III are about!

Explore further

Play around with parse.py within the Python interpreter itself. Make sure you’re in your MySourceFiles directory, then start the Python interpreter from there:

Those numbers from calling the id function reflect where the variable is saved in the computer’s memory. Since they are the same number, Python has set up a reference from copy_my_file to the same location that parse.MY_FILE was saved. No need to allocate new space in memory for what is essentially the same variable with a different name.
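The same idea can be sketched without importing the module. In this self-contained version, MY_FILE is defined locally as a stand-in for parse.MY_FILE; the exact numbers that id returns differ from run to run, so the sketch compares them instead of printing them.

```python
# Stand-in for parse.MY_FILE, defined locally so the example runs on its own.
MY_FILE = "../data/sample_sfpd_incident_all.csv"

# Assignment does not copy the string; it adds a second name for the same object.
copy_my_file = MY_FILE

same_id = id(copy_my_file) == id(MY_FILE)
same_object = copy_my_file is MY_FILE

print(same_id, same_object)  # True True
```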

Here we checked out the type of data that gets returned to us from the parse function, as well as ways to simply check out the contents of the parsed data.
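A minimal sketch of that kind of inspection, assuming Python 3 and using a tiny made-up CSV held in memory rather than the sample file (so the field names here are invented):

```python
import csv
import io

# Build a small list of dictionaries the same way parse() does,
# from made-up CSV contents held in memory.
fake_file = io.StringIO("name,color\napple,red\nplum,purple\n")
csv_data = csv.reader(fake_file, delimiter=",")
fields = next(csv_data)
new_data = [dict(zip(fields, row)) for row in csv_data]

# What kind of object did we get back, and what is inside it?
print(type(new_data))               # the outer container: a list
print(type(new_data[0]))            # each item: a dictionary
print(len(new_data))                # number of parsed rows
print(sorted(new_data[0].keys()))   # the column headers
```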

You can continue to play around; try >>> help(parse.parse) to see our docstring, or see what happens if you feed the parse function a different file, delimiter, or just a different variable. Challenge yourself to see if you can create a new file to save the parsed data, rather than just a variable. The example in the Python docs may help.
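If you get stuck on the challenge, here is one possible direction, offered as a sketch rather than the intended answer. It assumes you want JSON output and uses the standard library json module; the parsed_data list and the parsed_data.json file name are made up, and a temporary directory is used so nothing is left behind.

```python
import json
import os
import tempfile

# Made-up stand-in for the list of dictionaries that parse() returns.
parsed_data = [{"name": "apple", "color": "red"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "parsed_data.json")

    # Write the list of dictionaries out as JSON text.
    with open(output_path, "w") as output_file:
        json.dump(parsed_data, output_file, indent=2)

    # Read it back to confirm that what was saved matches what we had.
    with open(output_path) as saved_file:
        reloaded = json.load(saved_file)

print(reloaded == parsed_data)  # True
```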