December 31, 2017

In the article My semi automated workflow for blogging, I outlined what my blogging process is like and how I've started to automate it. Of course, at the time of that article, the process was still in its early stages and I hadn't automated everything I do. That's where this article comes in. This is the second attempt at automating my entire blogging workflow.

Just to give you some context, here are the things that I do when I'm blogging.

Open a markdown file in Vim with the title of the article as the name along with some template text

Open a browser with the html of the newly created markdown file

Convert markdown to html with pandoc several times during the writing process

Once the article is done and the html is produced, edit the html to make some changes based on whether I'm publishing on Medium or on Blogger

Read the tags/labels and other attributes from the file and publish the post as a draft on Medium or Blogger

Once it looks good, schedule or publish it (this is a manual process, there's no denying it)

Finally tweet about the post with the link to the article

I have the individual pieces of this process ready. I have already written about them in the following articles.

Now, since the individual pieces are ready, it might seem that everything is done. But, as it turns out (unsurprisingly), the integration is of course a big deal and took a lot more effort than I was expecting. I am documenting that in this article, along with the complete flow.

It starts with the script blog-it, which opens Vim for me, opens Chrome and also sets up a process that continuously converts markdown to html.

That script calls blog.py, which is what opens Vim with the default text template. I would like to put the complete gist here, but it is just too long, so instead I'm showing the meat of the script.
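In essence, it does something like this (a simplified sketch; the template text and file naming here are illustrative, not the exact ones from the gist):

import subprocess
import sys

# Illustrative template; the real one has more boilerplate
TEMPLATE = "<!-- medium: no -->\n<!-- blogger: yes -->\n<!-- labels: -->\n\n# {title}\n"

def blog(title):
    # Derive the markdown file name from the article title
    filename = title.lower().replace(" ", "-") + ".md"
    with open(filename, "w") as md_file:
        md_file.write(TEMPLATE.format(title=title))
    # Open the new file in Vim; this blocks until Vim is closed
    subprocess.call(["vim", filename])

if __name__ == "__main__":
    blog(" ".join(sys.argv[1:]))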

This ends one flow. Next comes publishing. I have broken this down because publishing is a manual process for me unless I can complete the entire article in one sitting, which is never going to happen. So, once I'm done writing, I'll start the publishing.

I'll run publish.py, which, depending on the comments in the html, publishes it to either Blogger or Medium. Again, I'm only showing a part of it. The full gist is available here.
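The core of it is just reading a marker comment from the generated html and dispatching on it. A minimal sketch (the comment markers and function names here are illustrative):

import re

def publish(html_file):
    with open(html_file) as f:
        html = f.read()
    # Decide the target site based on a marker comment in the html
    if re.search(r"<!--\s*medium:\s*yes\s*-->", html):
        publish_to_medium(html)
    elif re.search(r"<!--\s*blogger:\s*yes\s*-->", html):
        publish_to_blogger(html)

def publish_to_medium(html):
    pass  # upload a draft through the Medium API

def publish_to_blogger(html):
    pass  # upload a draft through the Blogger API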

Actually, this publishing step sends the post to the site as a draft instead of actually publishing it. This is a step that I don't know how to automate, as I have to manually take a look at how the article looks in preview. Maybe I should try doing this with Selenium or something like that.

Once I've verified that the post looks good, I will publish it, take the URL of the published article and call tweeter.py (gist here), which then opens a Vim file with some default text for the title and URL already filled in, along with some hashtags. I'll complete the tweet and once I close it, it gets published on Twitter.
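Conceptually, tweeter.py boils down to something like this (a sketch; the tweepy authentication setup is left out and the default text is illustrative):

import subprocess
import tweepy

def tweet_about(title, url, auth):
    draft_file = "tweet.txt"
    with open(draft_file, "w") as f:
        f.write("{0} {1} #blogging".format(title, url))
    # Let me edit the tweet by hand before it goes out
    subprocess.call(["vim", draft_file])
    with open(draft_file) as f:
        tweepy.API(auth).update_status(f.read().strip())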

And that completes the process. Obviously, there are still a couple of manual steps. Although I can't eliminate all of them, I might be able to minimize them. But so far it looks pretty good, especially given the little effort I've put into this over just one week. Of course, I'll keep tuning it as needed to make it even better, and maybe I'll publish one final article on that.

If you are interested in contributing to any open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and by identifying any bugs in its functionality.

December 29, 2017

In my article My semi automated workflow for blogging, I talked about my blogging workflow. There were two main things (actually one thing) in that flow that were not automated: automatically uploading to Blogger and automatically uploading to Medium. I have talked about the first one here. This article is about uploading posts to Medium automatically.

Developer documentation for Medium is a breath of fresh air after the mess that is Google's APIs. Of course, Google's APIs are complex because they cover so many different services, but they could've done a better job of organizing all that stuff. Anyway, let's see how you can use the Medium API.

Setting Up

We don't really need any specific dependencies for what we're doing in this article. You can do everything with urllib, which is already part of the Python standard library. I'll be using requests as well to make things a bit simpler, but you can achieve the same without it.

Getting the access token

To authenticate yourself with Medium, you need to get an access token that you’ll pass along to every request. There are two ways to get that token.

Browser-based authentication

Self-issued access tokens

Which one you should go with depends on what kind of application you're trying to build. As you can probably guess based on the title, we'll be covering the second method in this article. The first method needs an authentication server set up that can accept a callback from Medium. Since I don't have that set up at the moment, I'm going with the second option.

The self-issued access tokens method is quite easy to work with, as you directly take the access token without having the user authenticate via the browser.

To get the access token, go to Profile Settings and scroll down till you see the Integration tokens section.

There, enter a description of what you're going to use this token for and click on Get integration token. Copy the generated token, which looks something like 181d415f34379af07b2c11d144dfbe35d, and save it somewhere to be used in your program.

Using Access token to access Medium

Once you have the access token, you’ll use that token as your password and send it along with every request to get the required data.

Let's get started then. As I've said, we'll be using the requests library for URL connections. We'll also be using the json library for parsing the responses. So, let's import them.
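Along with the imports, here is the first call, to https://api.medium.com/v1/me, which verifies the token (the token value is a placeholder; the headers are the ones Medium's docs ask for):

import json
import requests

ACCESS_TOKEN = "<your-integration-token>"
HEADERS = {
    "Authorization": "Bearer " + ACCESS_TOKEN,
    "Content-Type": "application/json",
    "Accept": "application/json",
}

# Fetch our own user details to verify the token
me_response = requests.get("https://api.medium.com/v1/me", headers=HEADERS)
json_me_response = json.loads(me_response.text)
print(json_me_response)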

If that request succeeds and we get our user details back, then we know that the access token we have is valid.

From there, I extract the user_id from the JSON response, with

user_id = json_me_response['data']['id']

Get User’s Publications

From the above request, we've validated that the access token is correct and we've also got the user_id. Using that, we can get access to the publications of a user. For that, we have to make a GET request to https://api.medium.com/v1/users/{{userId}}/publications, which returns the list of publications by that user.
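In code, that's just another GET with the same headers:

# List the publications of the user we identified above
pubs_url = "https://api.medium.com/v1/users/{0}/publications".format(user_id)
publications = json.loads(requests.get(pubs_url, headers=HEADERS).text)
print(publications)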

Now, one weird thing about Medium's API is that there is no GET for posts. From the API you can get a list of all the publications, but you can't fetch a user's posts; you can only publish a new post. Although it is odd for that to be missing, it is not something I'm looking for anyway, as I am only interested in publishing an article. But if you need that, you should probably check to see if there are any hacky ways of achieving the same (at your own risk).

Create a New Post

To create a new post, we have to make a POST request to https://api.medium.com/v1/users/{{authorId}}/posts. The authorId here would be the same as the userId of the user whose access token you have.

I'm using the requests library for this, as it makes sending a POST request easy. Of course, first you need to create a payload to be uploaded. The payload should look something like the following, as described here:
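The title, tags and file name below are placeholders of mine:

payload = {
    "title": "A Dummy Post",
    "contentFormat": "markdown",              # "html" or "markdown"
    "content": open("dummy-post.md").read(),  # read straight from the file
    "tags": ["python", "automation"],
    "publishStatus": "draft",                 # upload as a draft, don't go live
}

posts_url = "https://api.medium.com/v1/users/{0}/posts".format(user_id)
post_response = requests.post(posts_url, headers=HEADERS, json=payload)
print(post_response.status_code, post_response.text)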

As you can see, for contentFormat I've set markdown, and for content I read it straight from the file. I didn't want to publish this, as it is just a dummy post, so I've set the publishStatus to draft. And sure enough, it works as expected and I can see this draft added to my account.

Do note that the title in the payload object won't actually be the title of the article. If you want a title, you add it in the content itself as a heading tag (an <h1>, for example).

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe to this blog and my blog on Medium as well.

If you are interested in contributing to any open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and by identifying any bugs in its functionality.

In my previous post, I talked about Apache Spark. We also built an application for counting the number of words in a file, which is the hello world equivalent of the big data world.

It has been over 18 months since that article and Spark has changed quite a lot in that time. A new major release, Spark 2.0, came out, and the latest version is now 2.2.1. And with a new version come new APIs and improvements. In fact, the first thing you'll probably notice is that you don't need to create SparkContext or JavaSparkContext objects anymore. The various contexts and configurations have been put together into a new class, SparkSession. You can still access the SparkContext or the SQLContext from the SparkSession object itself. So, you'll be starting your programs with this now:
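A minimal version of that setup (the app name and master setting are placeholders):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("DatasetsDemo")
        .master("local[*]")  // run locally, using all available cores
        .getOrCreate();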

And you can use this spark variable the way you’d use other context variables.

Another change in Spark 2.0 is the heavy emphasis on the Dataset API, and for a good reason. Datasets are more performant and memory efficient than RDDs. RDDs (Resilient Distributed Datasets) have been pushed to second place now. You can still use RDDs if you want, but Datasets are the preferred API. In fact, Datasets have some nice convenience methods that make them useful even for unstructured data like text. Let's generate some cool lipsum from Malevole. It looks something like this:

Ulysses, Ulysses - Soaring through all the galaxies. In search of Earth, flying in to the night. Ulysses, Ulysses - Fighting evil
and tyranny, with all his power, and with all of his might. Ulysses - no-one else can do the things you do. Ulysses - like a bolt of
thunder from the blue. Ulysses - always fighting all the evil forces bringing peace and justice to all....

Now, you might try to use an RDD to read this, but let’s see what we can do with Datasets.
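With the Dataset API, reading it is a one-liner (the file name is assumed):

import org.apache.spark.sql.Dataset;

// Each line of the file becomes one String row in the Dataset
Dataset<String> lipsumDs = spark.read().textFile("lipsum.txt");
lipsumDs.show();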

Here we are reading the text file using the spark object we created earlier, and that gives us a Dataset<String> lipsumDs. The show() method prints the Dataset in a tabular format: each line of the text file is now a row in the Dataset.

There is now a rich set of functions available on Datasets that wasn't there for RDDs. You can filter the rows for certain words, do a count on the table, perform groupBy operations, etc., all like you would on a database table. For a full list of the available operations on Dataset, read this: Dataset: Spark Documentation.

I hope that's enough talk about unstructured data analysis. Let's get to the main focus of this article, which is using Datasets for structured data, more specifically CSV and JSON. For this tutorial, I am using data created with Mockaroo, an online data generator, with which I've generated 1000 CSV records to play with.
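Reading that CSV into a Dataset looks like this (the file name is an assumption; the header option tells Spark that the first line holds the column names):

import org.apache.spark.sql.Row;

Dataset<Row> csvDs = spark.read()
        .option("header", "true")
        .csv("people.csv");
csvDs.show();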

Next up is JSON. Note: Spark can read JSON only when it is formatted with one object per line. Otherwise, you will see _corrupt_record when you print your Dataset. That's your cue to make sure the JSON is formatted as per Spark's needs.

And you read JSON very similarly to the way you read CSV. Since JSON has no headers, we don't need the header option.
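For example (file name assumed again):

Dataset<Row> jsonDs = spark.read().json("people.json");
jsonDs.show();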

You can see the order of the columns is jumbled. This is because JSON data doesn't usually keep any specified order, so when you read JSON data into a Dataset, the column order might not be the same as in the file. Of course, if you want to display the columns in a particular order, you can always do a select operation.
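For instance, with the columns from my generated data (the names are illustrative):

jsonDs.select("id", "first_name", "last_name", "email").show();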

And that would print it in the right order. This is exactly like the SELECT query in SQL, if you’re familiar with it.

Now that we have seen how to create Datasets, let's look at some of the operations we can perform on them.

Operations on Datasets

Datasets are built on top of Dataframes. So, if you're already familiar with Dataframes from the Spark 1.x releases, you already know a ton about Datasets. Some of the operations you can perform on a Dataset are as follows:
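Here are a few of them on the CSV Dataset from before (the column names are illustrative):

import static org.apache.spark.sql.functions.col;

// Filter rows on a condition, like a WHERE clause in SQL
csvDs.filter(col("first_name").startsWith("A")).show();

// Count the number of rows in the Dataset
long numRows = csvDs.count();

// Drop duplicate rows
csvDs.distinct().show();

// Print the schema that Spark inferred for the columns
csvDs.printSchema();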

Those are some of the functions that you can use with Datasets. There are still several database-table-style operations on Datasets, like groupBy, aggregations, joins, etc. We'll look at them in the next article on Spark, as this one already has a lot of information and I don't want to overload you.

So, that is all for this article. If you're someone who has never tried Datasets or Dataframes, I hope this article gave a good introduction to the topic to keep you interested in learning more.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe to this blog and my blog on Medium as well.

If you are interested in contributing to any open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and by identifying any bugs in its functionality.

December 25, 2017

Programmers love to automate things and I'm no exception. I always like to automate my common tasks. Whether it is checking stock prices or checking when the next episode of my favorite show is coming, I have automated scripts for that. Today I am going to add one more thing to that list: automated tweeting. I tweet quite frequently and I would love to have a way of automating that as well. And that's exactly what we're going to do today. We are tweeting using Python.

We'll use a Python library called tweepy for this. Tweepy is a simple, easy-to-use library for accessing the Twitter API.

Accessing the Twitter API programmatically is not just an accessibility feature; it can be of enormous value too. Mining Twitterverse data is one of the key steps in sentiment analysis. Twitter chat bots have also become quite popular nowadays, with hundreds of thousands of bot accounts. This article, although it only barely scratches the surface, will hopefully help you build towards that.

Setting Up

First things first: install tweepy by running pip install tweepy. The latest version at the time of writing this article is 3.5.0.

Then we need to have our Twitter API credentials. Go to Twitter Apps. If you don't have any apps registered already, go ahead and click the Create New App button.

To register your app, you have to provide the following three things:

Name of your application

Description

Your website URL

There is one more option, which is the callback URL. You can ignore that for now. Then, after reading the Twitter developer agreement (wink wink), click on the Create your Twitter application button to create a new app.

Once the app is created, you should see it in your Twitter apps page. Click on it and go to the Keys and Access Tokens tab.

There you will see four pieces of information. First, you have your app's API keys: the consumer key and consumer secret. Then you have your access token and access token secret.

We'll need all of them to access the Twitter API, so have them ready. I have copied all of them and exported them as environment variables. You could do the same, or, if you'd like, you can read them from a file as well.

Let's get started

First you have to import tweepy and os (only if you are accessing environment variables).

import tweepy
import os

Then I'll populate the access variables by reading them from the environment variables.
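A sketch of that setup, assuming the variable names I used when exporting them (yours may differ), followed by sending an actual tweet:

# The environment variable names here are my own; use whatever you exported
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Authenticate with Twitter and build an API handle
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Send a tweet; the returned Status object is printed in full
status = api.update_status("Hello Twitter! Tweeting from tweepy.")
print(status)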

When you run the previous commands, you'll see a lot of output printed on the terminal. This is a Status object with a lot of useful data, like the number of followers you have, your profile picture URL, your location, etc., pretty much everything you get from your Twitter page. We can make use of this information if we are building something more comprehensive.

Apart from sending regular tweets, you can also reply to existing tweets. To reply to a tweet, you'd first need its tweet id, which you can get from the tweet's URL.
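Something like this (the tweet id below is made up; pull a real one from a tweet's URL):

# Twitter treats this as a reply only if the author's handle is mentioned
tweet_id = 943412934702345678
api.update_status("@some_user Nice tweet!", in_reply_to_status_id=tweet_id)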

That's how easy it is to access the Twitter API with tweepy. We now know how to tweet and how to reply to a tweet. Building on this knowledge, in a later tutorial I can show you how to create your own Twitter chat bot and also do Twitter streaming analysis.

The full code for everything covered in this article is available as a gist at

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe to this blog and my blog on Medium as well.

If you are interested in contributing to any open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and by identifying any bugs in its functionality.

December 23, 2017

I love to have a lot of tabs open at the same time in Vim. Being the deputy Scrum master (yeah, it is a thing) of our dev team, I have to keep track of a lot of things going on in the team. I maintain a repository of all the links to product documentation, stories, tasks, etc. I also need to keep track of the discussions that happen in various team meetings. On top of this, as a backend engineer, I have my own stories and tasks to manage as well. All of this means that I have a ton of tabs and splits open at any given time in Vim. Something like the following:

Now, the problem comes when I have to shut down and restart my computer. All the tabs I have kept open for several days will be gone, and I have to open them all up again and put them in the order I want. Oh, the pain! There has to be a better way!

Luckily for us, Vim always has a better way of doing things. There is an inbuilt feature just for this.

It is called Vim sessions, and with it you can get back all your tabs with just one command. How nice!

How to create a new session?

To create a Vim session, run the command :mksession episodes.session. Here, episodes is the name of the session I want to create.

In short: :mks <session-name>.session. And that's it. Your session is now saved. It saves information about which tabs are currently open, which splits are open and even which buffers are open, all into that session file.

Note: The .session suffix is not needed, but it is the preferred way as it lets you easily identify session files.

Once this is done, you can go ahead and close your tabs, as all of that information is stored in the session file.

How to open an existing session?

Now that we have the session saved, the next time you want to open all of those tabs, all you have to do is tell Vim to run that session. You do that by running the command :so <session-file-path> (:so is short for :source).

And boom! All of your windows and tabs are back with just one command. You don't have to keep multiple tmux or screen sessions running anymore; Vim can do it all.
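As a bonus, you can also load a session directly while starting Vim from the shell with vim -S episodes.session, which sources the session file for you on startup.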

That is all you need to know about sessions in Vim to make yourself productive. You can always check the Vim help with :help session-file to find out more.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe to this blog and my blog on Medium as well.

If you are interested in contributing to any open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and by identifying any bugs in its functionality.