An interactive SQL learning experience

We run a micro-courses business, Afterskills, to help professionals ramp up their skills on specific and technical topics.

One of our products, MBASQL, is designed to help MBA’s, Business Analysts, and other professionals in Marketing, Online Fashion, and E-Commerce ramp up their skills with Data.

Today, we launched that course online through Udemy.com

Course Description

Our course is designed to equip MBAs & professionals with the skills required to extract data from relational databases. Most importantly, students will learn how to use data to bring insight into their decision making process through the lens of commercial problems faced by leading e-commerce businesses.

To assist in your learning, you will gain access to our very own:

Simulated web store having the same functionality as all e-commerce businesses.

Rich database of customer & product information that you’ll be able to browse.

SQL platform to take what you’ve learned & turn that into your very own SQL code.

If you own a business and sell online, you will have had to – at some point or another – name a whole bunch of images in a specific way to stay organised. If you sell on behalf of brands, naming your images is even more important.

Problem: Lets say you had your images sorted in a bunch of folders. Each folder had the name of the brand with the brand’s images in the folder. To make it more complicated, a brand could have a sub-brand, and therefore a subfolder with that sub-brand’s images.

This looks like this in practice:

I was surprised at how difficult this was to accomplish. After scouring stack overflow for solutions that do something close, I came up with the solution presented here.

We will be using Python’s os library. First lets create a function that takes 2 parameters:

The top-most folder that contains all the files and subfolders.

A number indicating how many folders deep do we want to go

This look like this:

def renameFiles(path, depth=99):

def renameFiles(path, depth=99):

We will be calling this function recursively (over and over) until we are as deep as we want to go. For every folder deep we go, we want to reduce the depth by one. If we hit 0, then we know to go back up the folder tree. Lets continue building our code:

def renameFiles(path, depth=99):
if depth <0: return

def renameFiles(path, depth=99):
if depth < 0: return

Now for every file we hit, we want to test for a few things:

That the path to that file is not a shortcut or symbolic link

That the path to the file is a real folder and exists

If the file we look at is a real folder, we go one level deep

To do this we add the following code:

def renameFiles(path, depth=99):
# Once we hit depth, returnif depth <0: return# Make sure that a path was supplied and it is not a symbolic linkifos.path.isdir(path)andnotos.path.islink(path):
# We will use a counter to append to the end of the file name
ind =1# Loop through each file in the start directory and create a fullpathforfileinos.listdir(path):
fullpath = path + os.path.sep + file# Again we don't want to follow symbolic linksifnotos.path.islink(fullpath):
# If it is a directory, recursively call this function # giving that path and reducing the depth.ifos.path.isdir(fullpath):
renameFiles(fullpath, depth - 1)

def renameFiles(path, depth=99):
# Once we hit depth, return
if depth < 0: return
# Make sure that a path was supplied and it is not a symbolic link
if os.path.isdir(path) and not os.path.islink(path):
# We will use a counter to append to the end of the file name
ind = 1
# Loop through each file in the start directory and create a fullpath
for file in os.listdir(path):
fullpath = path + os.path.sep + file
# Again we don't want to follow symbolic links
if not os.path.islink(fullpath):
# If it is a directory, recursively call this function
# giving that path and reducing the depth.
if os.path.isdir(fullpath):
renameFiles(fullpath, depth - 1)

We are now ready to process a file that we find in a folder. We want to first build a new name for the file then test for a few things before renaming the file.

That we are only changing image files

We are not overwriting an existing file

That we are in the correct folder

Once we are sure that these tests pass, we can rename the file and complete the function. Your function is then complete:

importos'''
Function: renameFiles
Parameters:
@path: The path to the folder you want to traverse
@depth: How deep you want to traverse the folder. Defaults to 99 levels.
'''def renameFiles(path, depth=99):
# Once we hit depth, returnif depth <0: return# Make sure that a path was supplied and it is not a symbolic linkifos.path.isdir(path)andnotos.path.islink(path):
# We will use a counter to append to the end of the file name
ind =1# Loop through each file in the start directory and create a fullpathforfileinos.listdir(path):
fullpath = path + os.path.sep + file# Again we don't want to follow symbolic linksifnotos.path.islink(fullpath):
# If it is a directory, recursively call this function # giving that path and reducing the depth.ifos.path.isdir(fullpath):
renameFiles(fullpath, depth - 1)else:
# Find the extension (if available) and rebuild file name # using the directory, new base filename, index and the old extension.
extension =os.path.splitext(fullpath)[1]# We are only interested in changing names of images. # So if there is a non-image file in there, we want to ignore itif extension in('.jpg','.jpeg','.png','.gif'):
# We want to make sure that we change the directory we are in# If you do not do this, you will not get to the subdirectory namesos.chdir(path)# Lets get the full path of the files in question
dir_path =os.path.basename(os.path.dirname(os.path.realpath(file)))# We then define the name of the new path in order# The full path, then a dash, then 2 digits, then the extension
newpath =os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# If a file exists in the new path we defined, we probably do not want# to over write it. So we will redefine the new path until we get a unique namewhileos.path.exists(newpath):
ind +=1
newpath =os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# We rename the file and increment the sequence by one. os.rename(fullpath, newpath)
ind +=1return

import os
'''
Function: renameFiles
Parameters:
@path: The path to the folder you want to traverse
@depth: How deep you want to traverse the folder. Defaults to 99 levels.
'''
def renameFiles(path, depth=99):
# Once we hit depth, return
if depth < 0: return
# Make sure that a path was supplied and it is not a symbolic link
if os.path.isdir(path) and not os.path.islink(path):
# We will use a counter to append to the end of the file name
ind = 1
# Loop through each file in the start directory and create a fullpath
for file in os.listdir(path):
fullpath = path + os.path.sep + file
# Again we don't want to follow symbolic links
if not os.path.islink(fullpath):
# If it is a directory, recursively call this function
# giving that path and reducing the depth.
if os.path.isdir(fullpath):
renameFiles(fullpath, depth - 1)
else:
# Find the extension (if available) and rebuild file name
# using the directory, new base filename, index and the old extension.
extension = os.path.splitext(fullpath)[1]
# We are only interested in changing names of images.
# So if there is a non-image file in there, we want to ignore it
if extension in ('.jpg','.jpeg','.png','.gif'):
# We want to make sure that we change the directory we are in
# If you do not do this, you will not get to the subdirectory names
os.chdir(path)
# Lets get the full path of the files in question
dir_path = os.path.basename(os.path.dirname(os.path.realpath(file)))
# We then define the name of the new path in order
# The full path, then a dash, then 2 digits, then the extension
newpath = os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# If a file exists in the new path we defined, we probably do not want
# to over write it. So we will redefine the new path until we get a unique name
while os.path.exists(newpath):
ind +=1
newpath = os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# We rename the file and increment the sequence by one.
os.rename(fullpath, newpath)
ind += 1
return

The last part is to add a line outside the function that executes it in the directory where the python file lives:

renameFiles(os.getcwd())

renameFiles(os.getcwd())

With this your code is complete. You can copy the code at the bottom of this post.

How do you use this?

Once you have the file saved. Drop it in the folder of interest. In the image below, we have the file rename.py in the top most folder.

Next, you fire up terminal and type in “cd” followed by the folder where you saved your python script. Then type in “python” followed by the name of the python file you saved. In our case it is rename.py:

The results look like this:

The entire code

importos'''
Function: renameFiles
Parameters:
@path: The path to the folder you want to traverse
@depth: How deep you want to traverse the folder. Defaults to 99 levels.
'''def renameFiles(path, depth=99):
# Once we hit depth, returnif depth <0: return# Make sure that a path was supplied and it is not a symbolic linkifos.path.isdir(path)andnotos.path.islink(path):
# We will use a counter to append to the end of the file name
ind =1# Loop through each file in the start directory and create a fullpathforfileinos.listdir(path):
fullpath = path + os.path.sep + file# Again we don't want to follow symbolic linksifnotos.path.islink(fullpath):
# If it is a directory, recursively call this function # giving that path and reducing the depth.ifos.path.isdir(fullpath):
renameFiles(fullpath, depth - 1)else:
# Find the extension (if available) and rebuild file name # using the directory, new base filename, index and the old extension.
extension =os.path.splitext(fullpath)[1]# We are only interested in changing names of images. # So if there is a non-image file in there, we want to ignore itif extension in('.jpg','.jpeg','.png','.gif'):
# We want to make sure that we change the directory we are in# If you do not do this, you will not get to the subdirectory namesos.chdir(path)# Lets get the full path of the files in question
dir_path =os.path.basename(os.path.dirname(os.path.realpath(file)))# We then define the name of the new path in order# The full path, then a dash, then 2 digits, then the extension
newpath =os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# If a file exists in the new path we defined, we probably do not want# to over write it. So we will redefine the new path until we get a unique namewhileos.path.exists(newpath):
ind +=1
newpath =os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# We rename the file and increment the sequence by one. os.rename(fullpath, newpath)
ind +=1return
renameFiles(os.getcwd())

import os
'''
Function: renameFiles
Parameters:
@path: The path to the folder you want to traverse
@depth: How deep you want to traverse the folder. Defaults to 99 levels.
'''
def renameFiles(path, depth=99):
# Once we hit depth, return
if depth < 0: return
# Make sure that a path was supplied and it is not a symbolic link
if os.path.isdir(path) and not os.path.islink(path):
# We will use a counter to append to the end of the file name
ind = 1
# Loop through each file in the start directory and create a fullpath
for file in os.listdir(path):
fullpath = path + os.path.sep + file
# Again we don't want to follow symbolic links
if not os.path.islink(fullpath):
# If it is a directory, recursively call this function
# giving that path and reducing the depth.
if os.path.isdir(fullpath):
renameFiles(fullpath, depth - 1)
else:
# Find the extension (if available) and rebuild file name
# using the directory, new base filename, index and the old extension.
extension = os.path.splitext(fullpath)[1]
# We are only interested in changing names of images.
# So if there is a non-image file in there, we want to ignore it
if extension in ('.jpg','.jpeg','.png','.gif'):
# We want to make sure that we change the directory we are in
# If you do not do this, you will not get to the subdirectory names
os.chdir(path)
# Lets get the full path of the files in question
dir_path = os.path.basename(os.path.dirname(os.path.realpath(file)))
# We then define the name of the new path in order
# The full path, then a dash, then 2 digits, then the extension
newpath = os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# If a file exists in the new path we defined, we probably do not want
# to over write it. So we will redefine the new path until we get a unique name
while os.path.exists(newpath):
ind +=1
newpath = os.path.dirname(fullpath) + os.path.sep + dir_path \
+ ' - '+"{0:0=2d}".format(ind) + extension
# We rename the file and increment the sequence by one.
os.rename(fullpath, newpath)
ind += 1
return
renameFiles(os.getcwd())

To start, I have to say that it is really heartwarming to get feedback from readers, so thank you for engagement. This post is a response to a request made collaborative filtering with R.

The approach used in the post required the use of loops on several occassions.
Loops in R are infamous for being slow. In fact, it is probably best to avoid them all together.
One way to avoid loops in R, is not to use R (mind: #blow). We can use Python, that is flexible and performs better for this particular scenario than R.
For the record, I am still learning Python. This is the first script I write in Python.

Refresher: The Last.FM dataset

The data set contains information about users, gender, age, and which artists they have listened to on Last.FM.
In our case we only use Germany’s data and transform the data into a frequency matrix.

Before we calculate our similarities we need a place to store them. We create a variable called data_ibs which is a Pandas Data Frame (… think of this as an excel table … but it’s vegan with super powers …)

Now we can start to look at filling in similarities. We will use Cosin Similarities.We needed to create a function in R to achieve this the way we wanted to. In Python, the Scipy library has a function that allows us to do this without customization.
In essense the cosine similarity takes the sum product of the first and second column, then dives that by the product of the square root of the sum of squares of each column.

This is a fancy way of saying “loop through each column, and apply a function to it and the next column”.

# Lets fill in those empty spaces with cosine similarities# Loop through the columnsfor i inrange(0,len(data_ibs.columns)) :
# Loop through the columns for each columnfor j inrange(0,len(data_ibs.columns)) :
# Fill in placeholder with cosine similarities
data_ibs.ix[i,j]=1-cosine(data_germany.ix[:,i],data_germany.ix[:,j])

# Lets fill in those empty spaces with cosine similarities
# Loop through the columns
for i in range(0,len(data_ibs.columns)) :
# Loop through the columns for each column
for j in range(0,len(data_ibs.columns)) :
# Fill in placeholder with cosine similarities
data_ibs.ix[i,j] = 1-cosine(data_germany.ix[:,i],data_germany.ix[:,j])

With our similarity matrix filled out we can look for each items “neighbour” by looping through ‘data_ibs’, sorting each column in descending order, and grabbing the name of each of the top 10 songs.

User Based collaborative Filtering

The process for creating a User Based recommendation system is as follows:

Have an Item Based similarity matrix at your disposal (we do…wohoo!)

Check which items the user has consumed

For each item the user has consumed, get the top X neighbours

Get the consumption record of the user for each neighbour.

Calculate a similarity score using some formula

Recommend the items with the highest score

Lets begin.

We first need a formula. We use the sum of the product 2 vectors (lists, if you will) containing purchase history and item similarity figures. We then divide that figure by the sum of the similarities in the respective vector.
The function looks like this:

The rest is a matter of applying this function to the data frames in the right way.
We start by creating a variable to hold our similarity data.
This is basically the same as our original data but with nothing filled in except the headers.

# Create a place holder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.ix[:,:1]= data.ix[:,:1]

# Create a place holder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.ix[:,:1] = data.ix[:,:1]

We now loop through the rows and columns filling in empty spaces with similarity scores.

Note that we score items that the user has already consumed as 0, because there is no point recommending it again.