January 12, 2018

Flask is a popular micro web application framework for Python. Unlike a full-stack framework such as Django, Flask keeps its footprint to a minimum, providing only the basic functionality you need instead of picking out the entire stack for you the way Django does. That is the very reason we call it a micro framework. Building on Flask's extensible core, you can create any type of application by picking just the components you want. Several big-name companies, such as LinkedIn and Pinterest, use Flask for their products.

In this tutorial we will get started with using Flask and create a simple web application with it.

Prerequisites

To follow along with this series, you should have some knowledge of the Python language. I'm using Python 3.6 for these tutorials, and if you would like to follow along without any issues, I suggest you use the same version. With earlier versions there might be a couple of differences in syntax, but the ideas and concepts remain the same.

You will also need to install Flask. You can do that with pip.

pip install -U flask

This will install Flask if you don't already have it, or upgrade it to the latest version if you have an older one installed.

With those two things, you are good to go.

Getting Started

Just like with anything else, you start by importing what you need.

from flask import Flask

This makes Flask available for you to use. After this, you create an app object by calling the Flask constructor like this:

app = Flask("hello")

This creates our app object. The name hello I've passed to the constructor can be anything, but the usual convention is to pass __name__. Also, app is just a variable, so you can name it anything you want.

Next, you have to define the routes. Routes configure your server to perform different actions. Say you type a website's URL into your browser and you are taken to its home page; if you then go to <website>/info, you are taken to the info page. This mapping of the /info URL to the info page is what we call a route. For the home page, the route is simply /.

Let's say we want our server's homepage to display Hello World. You can configure that with a method like this:

@app.route('/')
def index():
    return "Hello World"

With @app.route('/'), we are defining a route on our server. Whenever somebody opens that route, which for us is the homepage, the index() method attached to it by the decorator is called. And when index() is called, it returns Hello World, just as we expect it to.

And there is one final call to start and run our server:

app.run(debug=True)

And that's it. This runs the app we have created when you run the Python file. The debug=True option is useful while developing and testing applications, so we'll keep it on for now.
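Putting the pieces above together, here is a complete minimal script. The /info route and its return text are my own addition to illustrate the routing idea from earlier; they are not part of the original example.

```python
from flask import Flask

# The name passed to the constructor can be anything; __name__ is the convention.
app = Flask(__name__)

@app.route('/')
def index():
    return "Hello World"

# An extra route, to illustrate the <website>/info idea.
@app.route('/info')
def info():
    return "This is the info page"

# Uncomment to start the development server:
# app.run(debug=True)
```

Saving this as a file and running it (with the last line uncommented) starts the development server.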

Just run your Python script and you should see output on the console showing that the development server is up and running.

If you are interested in contributing to any open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only code, but also usage documentation, as well as by identifying any bugs in its functionality.

January 10, 2018

JSON has become a ubiquitous data exchange format; pretty much every service has a JSON API. And since it is so popular, most programming languages have built-in JSON parsers. Of course, Python is no exception. In this article, I'll show you how you can parse JSON with Python's json library.

JSON parsing in Python is quite straightforward and easy, unlike in some languages where it is unnecessarily cumbersome. Like everything else in Python, you start by importing the library you want.

import json

In this article, I am going to use the following JSON I got from json.org

And now we parse this string into a dictionary object with the help of the json library's loads() method.

json_dict = json.loads(json_string)

And you're done. The JSON is parsed and stored in json_dict, which is a Python dictionary object. If you want to verify that, you can call type() on it:

print(type(json_dict))

And it will show that it is <class 'dict'>.

Getting back, we have the entire JSON object as a dictionary in json_dict, and you can drill down into the dictionary with its keys. At the top level, we have just one key, menu. We can get that by indexing the dictionary with the key.

menu = json_dict['menu']

And of course menu is a dictionary too with the keys id, value, and popup. We can access them and print them as well.

print(menu['id']) ## => 'file'
print(menu['value']) ## => 'File'

And then finally we've got popup, which is another dictionary, with the key menuitem, whose value is a list. We can verify this by checking the types of these objects.
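Here is a quick self-contained sketch of those type checks, with the relevant part of the sample rebuilt inline:

```python
import json

json_string = ('{"menu": {"id": "file", "value": "File", '
               '"popup": {"menuitem": [{"value": "New", "onclick": "CreateNewDoc()"}]}}}')
menu = json.loads(json_string)['menu']

popup = menu['popup']          # a nested dictionary
menuitem = popup['menuitem']   # a list of dictionaries

print(type(popup))     # <class 'dict'>
print(type(menuitem))  # <class 'list'>
```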

And of course, each element of menuitem is a dictionary as well, so you can go further in and access its keys and values.

For example, if you want to access New from the above output, you can do this:

print(menuitem[0]['value']) ## => New

And so on and so forth to get any value in the JSON.

And not only that, the json library can also handle JSON responses from web services. One cool thing here is that web server responses are byte strings, which would normally mean that to use them in your program you'd have to convert them to regular strings with the decode() method. But with json you don't have to do that: you can feed in the byte string directly and it will give you a parsed object. That's pretty cool!
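A small sketch of that (the byte string here stands in for a web server's response body; json.loads() has accepted bytes directly since Python 3.6):

```python
import json

# A byte string, as you would get from a web server response body.
byte_response = b'{"menu": {"id": "file", "value": "File"}}'

# No .decode() needed; json.loads() handles bytes directly.
parsed = json.loads(byte_response)
print(parsed['menu']['value'])  # File
```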

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, Go ahead and subscribe on medium and my blog as well.


In the last tutorial, we saw how to create parametrized Datasets. Once you create Datasets and perform some operations on them, you will want to save the results back to storage. That is what we'll do in this article: saving Datasets to storage.

The first thing we'll do as always is to create the spark-session variable.

We want to store this Dataset back on disk. We can do that with the write() method on the Dataset, much like the read() we used on the Spark session.

newDs.write().csv("processed-data");

The processed-data in the above command is not the name of the output CSV file but of an output directory. When you write a Dataset to a file, Spark creates a directory with that name and stores the data in the format you asked for, CSV in this case, along with some check files and status flags.

There are two more hidden CRC files that I'm not showing here. The part-00000-31hxxxxxxxxx.csv is the actual data file which has the data from the new dataset.

You can also create JSON output by running

newDs.write().json("processed-data")

And that creates another folder with the JSON file and the _SUCCESS file inside it.

You can also save this data to an external Database if you want to. You'll use the jdbc() method along with the connection string and the table name. And Spark will write it to the DB.

Apart from the CSV and JSON formats, there is one more popular data format in the Data Science and Big Data world: Parquet. Parquet is a highly optimized format, well suited for column-wise operations, and widely used across the Big Data ecosystem as a data serialization format. In Spark, Parquet is the default file storage format. Of course, one main difference between Parquet and formats like CSV and JSON is that Parquet is not meant to be read by humans; it can only be read by a Parquet reader. A sample file looks something like this:

Utter gibberish. But Spark can read and understand it. In fact, as Parquet is designed for speed and throughput, reading and writing it can be 10-100 times faster than an ordinary data format like CSV or JSON, depending on the type of data.

You save a Dataset to Parquet as follows:

newDs.write().parquet("processed");

And this will save the dataset as a parquet file along with the _SUCCESS status file.


January 06, 2018

Interviews are a great place to learn about your strengths and weaknesses, which makes them a great way to improve yourself. In one of my interviews, I was asked to remove duplicate elements from an array. So, given the array a below, I have to produce b.

Here b has the same order of elements as a, but per the problem statement, it is not necessary to preserve that order.

I was flustered for a bit after getting the question. It took me a while to get to a proper solution, but not before my first solution was rejected for using a HashMap, which apparently I was not supposed to use. I attribute this mainly to the fact that I was told to write Java code on a piece of paper and not in an IDE. Anyway, I came home after that and decided to try it out and find what others have done online. That is what this article is about.

Since that particular interview was in Java, it is only fair that I use Java for the solution here, although I really wanted to do it in Python. Maybe some other time.

Approaches for solving the problem:

Approach #1

The most naive approach is to look through the entire array and compare each element with every other element to see if there's a duplicate. Of course, this is impractical, as its time complexity is O(n^2). So let's skip this one and go to the next.

Approach #2

Another approach is to use a HashMap to keep track of elements. This is what I tried initially, but it was rejected because I used a HashMap when I wasn't supposed to. The pseudocode would be:

map = new Map               // create an empty map
new_array = []
for number in numbers_array:
    if not map.contains(number):
        map += number
        new_array += number
print(new_array)

Of course, since I wrote my implementation of this in Java, I had to make a few modifications, as you have to define the size of a Java array before you can add elements to it. So I added a count variable to count the unique elements, and then created the new array after the iteration with that size. This requires two passes, but it is still O(n), which is fine. But alas, I couldn't use this.
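The article's actual implementation is in Java, but the same idea can be sketched in Python, with a set playing the HashMap's role:

```python
def remove_duplicates(numbers):
    """Return a new list with duplicates removed, preserving input order."""
    seen = set()   # plays the role of the HashMap in the pseudocode
    unique = []
    for number in numbers:
        if number not in seen:
            seen.add(number)
            unique.append(number)
    return unique

print(remove_duplicates([1, 3, 3, 2, 1, 4]))  # [1, 3, 2, 4]
```

Since Python lists grow dynamically, the second counting pass that Java arrays force on you is not needed here.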

And so, then comes my final approach.

Approach #3

The third solution is to sort the array first and then remove duplicates from the sorted array. We can do this because the problem doesn't require us to maintain the input order; otherwise, we wouldn't have been able to sort the array.

Sorting is easy enough. We just use the built-in sort method, which will sort the array in place.

Arrays.sort(numbers);

Then comes the major part: removing the duplicates from the sorted array. We accomplish that using two pointers, i and j, on our array. i walks through the entire array, while j is a slow-moving pointer that advances only on a condition.

The j index is basically playing catch-up with i. When there is a duplicate element, i moves ahead while j stays back at the first duplicate, and then with numbers[j] = numbers[i] we assign the next unique value to the j location. After this, our original array has unique elements up to index j, but beyond that we have leftover elements. To take care of those, we can create a new array from the first j + 1 elements of numbers.
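Here is a sketch of that two-pointer idea in Python (the interview solution itself was in Java; this just illustrates the algorithm):

```python
def dedupe_sorted(numbers):
    """Sort a list, then remove duplicates with two pointers."""
    if not numbers:
        return []
    numbers = sorted(numbers)          # sort first, as described above
    j = 0                              # slow pointer: last unique position
    for i in range(1, len(numbers)):   # fast pointer walks the whole list
        if numbers[i] != numbers[j]:
            j += 1
            numbers[j] = numbers[i]    # pull the next unique value back to j
    return numbers[:j + 1]             # keep only the unique prefix

print(dedupe_sorted([4, 2, 2, 1, 4, 3]))  # [1, 2, 3, 4]
```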

The sort can be assumed to take O(n log n), and the iteration after it is O(n). Put together, the overall complexity is O(n log n), dominated by the sort. Of course, depending on the specific kind of array, the sort might take less time, but O(n log n) is what you get in general.

The full code is present as a gist:

Let me know if you have any more questions that need answers. That is all for this article.


January 04, 2018

Whenever I want to upload images with my articles, I first make sure they are the right size, and then I check the file sizes; if they are too big, I have to compress them. For this compression, I use TinyPNG. They compress your images to a much smaller size while keeping them looking the same. I've tried some other services as well, but TinyPNG is definitely the best, as their compression ratio is quite impressive.

In this article, I'll show you how I'm planning to automate the image compression process using TinyPNG's developer API. And of course, we are going to use Python.

Setting up

First of all, you need to have a developer key to connect to TinyPNG and use their services. So, go to Developer's API and enter your name and email.

Once you've registered, you'll get a mail from TinyPNG with a link, and once you click on that, you'll land on your developer page, which has your API key and your usage information. Do keep in mind that with the free account you can only compress 500 images per month. For someone like me, that's a number I won't really be reaching in a month anytime soon. But if you do, you should probably check out their paid plans.

PS: That's not my real key :D

Get started

Once you have the developer key, you can start compressing images using their service. The full documentation for Python is here.

You start by installing Tinify, which is TinyPNG's library for compression.

pip install --upgrade tinify

Then we can start using tinify in code by importing it and setting the API key from your developer page.

The file was originally 29 KB, and after compression it is 25.3 KB, which is fairly good compression for such a small file. If the original file were bigger, you would see an even tighter compression.

And since this is the free version, there's a limit on the number of requests we can make. We can keep track of that with the built-in variable compression_count. You can print it after every request to make sure you don't go over the limit.


January 02, 2018

In the last two tutorials, we covered what Apache Spark is and got ourselves familiar with Datasets in Apache Spark, the primary data abstraction in Spark. In this tutorial, we will see how to read a data file as a Dataset of parametrized bean objects using Encoders.

This tutorial is going to be short, but it is very important, as you will find yourself doing this frequently. In the last article, you saw how to read a CSV or JSON file as a Dataset. You might have noticed that we were using Dataset&lt;Row&gt; for everything. If you're not familiar with generics in Java, Dataset&lt;Row&gt; can be thought of as a Dataset consisting of Row objects. The Row object is a Spark SQL class and is the default when creating a Dataset.

Although the Row class has some useful methods, being a generic container that has to suit all types, it is not ideal for everything. Since a Dataset usually stores data that corresponds to a bean class, it is better to create a Dataset of that bean class instead of Row. With that, you have access to all the usual getters and setters of the bean class. That's what we'll do in this article: create a Dataset of POJOs instead of Row objects.

I'm using the same fake-people.csv file that I used in the last article that looks like this:

As you can see, the only difference in creating the Dataset is .as(fakePeopleEncoder), and that gets us Dataset&lt;FakePeople&gt; instead of Dataset&lt;Row&gt;. With that, we now have access to all the getters and setters of the FakePeople class, which we wouldn't otherwise have with a Row object. We'll explore how this is useful in a future tutorial.


December 31, 2017

In the article My semi automated workflow for blogging, I outlined what my blogging process is like and how I've started to automate it. Of course, at the time of that article, the process was still in its early stages and I hadn't automated everything I do. And that's where this article comes in: this is my second attempt at automating my entire blogging workflow.

Just to give you some context, here are the things that I do when I'm blogging.

Open a markdown file in Vim with the title of the article as the name along with some template text

Open a browser with the html of the newly created markdown file

Convert markdown to html with pandoc several times during the writing process

Once the article is done and the HTML is produced, edit the HTML to make some changes based on whether I'm publishing on Medium or on Blogger

Read the tags/labels and other attributes from the file and publish the post as a draft on Medium or Blogger.

Once it looks good, Schedule or Publish it (This is a manual process. There's no denying it.)

Finally tweet about the post with the link to the article

I have the individual pieces of this process ready. I have already written about them in the following articles.

Now that the individual pieces are ready, it might seem that everything is done. But, as it turns out (unsurprisingly), the integration is of course a big deal and took a lot more effort than I was expecting. I am documenting that in this article, along with the complete flow.

It starts with the script blog-it, which opens Vim for me, opens Chrome, and sets up a process that continuously converts the markdown to HTML.

That script calls blog.py, which is what opens Vim with the default text template. I would like to put the complete gist here, but it is just too long, so instead I'm showing the meat of the script.

This ends one flow. Next comes publishing. I have broken this down because publishing is a manual process for me unless I can complete the entire article in one sitting, which is never going to happen. So, once I'm done writing, I'll start the publishing.

I'll run publish.py, which, depending on the comments in the HTML, publishes it to either Blogger or Medium. Again, I'm only showing a part of it. The full gist is available here.

Actually, this publishing step sends the post to the site as a draft rather than actually publishing it. This is a step I don't know how to automate, as I have to manually check how the article looks in the preview. Maybe I should try doing this with Selenium or something like that.

Once I've verified that the post looks good, I will publish it, take the URL of the published article, and call tweeter.py (gist here), which then opens a Vim file with default text for the title and the URL already filled in, along with some hashtags. I'll complete the tweet, and once I close the file, it gets published on Twitter.

And that completes the process. Obviously, there are still a couple of manual steps. Although I can't eliminate all of them, I might be able to minimize them. But so far it looks pretty good, especially given the little effort I've put into this in just one week. Of course, I'll keep tuning it as needed to make it even better, and maybe I'll publish one final article on that.
