Dimensionality reduction is an important technique for overcoming the curse of dimensionality in data science and machine learning. As the number of predictors (or dimensions or features) in a dataset increases, it becomes computationally more expensive (i.e., increased storage space, longer computation time) and exponentially more difficult to produce accurate predictions in classification or regression models. Moreover, it is hard to visualize data points in more than 3 dimensions.

How do we reduce the number of features in dimensionality reduction?

We first determine how much each feature contributes to the representation of the data, and then remove the features that contribute little.

This article will review two linear techniques for dimensionality reduction, principal component analysis (PCA) and linear discriminant analysis (LDA), in scikit-learn using the wine dataset (available HERE).

Let’s say our dataset has d independent (feature) variables, i.e., d dimensions. It is not feasible to visualize our data if it contains more than 2 or 3 dimensions. Applying PCA to our dataset will allow us to visualize the result, since PCA will reduce the number of dimensions and extract the k dimensions (k < d) that explain the most variance in the dataset. In other words, PCA reduces the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace, where k < d.

PCA is an unsupervised linear transformation technique: it finds the principal components using only the independent variables, so the dependent variable is not considered in the model.

Unlike PCA, LDA extracts the k new independent variables that maximize the separation between the classes of the dependent variable. LDA is a supervised linear transformation technique since the dependent variable (or the class label) is considered in the model.

TAKEAWAY: PCA finds the component axes that maximize the variance in the dataset, while LDA finds the component axes that maximize the separation between classes.

Let’s take a look at the wine dataset from the UCI machine learning repository. This dataset records the quantities of 13 ingredients found in each of three types of wine grown in the same region of Italy. Let’s see how we can use PCA and LDA to reduce the dimensions of the dataset.

If you want to follow along, you can download my jupyter notebook HERE for materials throughout this tutorial. Click HERE to look at instructions on how to install jupyter notebook.

Data Preprocessing: Steps to do before performing PCA or LDA

Now, why don’t you click on our wine dataset (click on wine.data)? What do you see?

You will notice that the csv file has no specified header names. The first line of the csv is the first data point, not the header.

So, to tell pandas that the first line contains data rather than headers, let’s pass in the parameter header and assign it None (header=None). For more information on the other parameters of pandas.read_csv, click HERE.
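A minimal sketch of this step, assuming the wine.data file has been downloaded locally (the UCI URL can also be passed straight to read_csv):

import pandas as pd

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv("wine.data", header=None)
df.head()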

By default, it will print out the first 5 rows.

Since our dataset has no header, let’s label each column with its corresponding name.

Scroll down to the Relevant Information Section. The first column is the class label (class 1, 2, 3 representing three different wine types). The remaining columns are the 13 attributes shown in this Relevant Information Section.
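Continuing the sketch above, we can assign the names from that section; the spellings below follow the UCI documentation and can be abbreviated to taste:

df.columns = [
    "Class label", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline",
]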

Now, our columns are labeled.

The first column is the column of interest since it holds the ‘Class label’ (the dependent variable, y). We can use iloc to select it by index position; since Python is zero-indexed, the first column is index 0.

The remaining columns contain the 13 ingredients present in the wine, which are our features (X).

After defining our X (independent variables) and y (dependent variable), we can split our dataset for training and testing purposes.
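A sketch of these two steps, continuing from the frame above; the 80/20 split and the random_state are assumptions (a test size of 0.2 on 178 rows gives the 36 test cases used later):

from sklearn.model_selection import train_test_split

X = df.iloc[:, 1:].values   # the 13 ingredient columns (features)
y = df.iloc[:, 0].values    # column 0 holds the class label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)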

Since the 13 variables are measured on different scales, we have to normalize (or standardize) all of them using StandardScaler from sklearn.
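A minimal sketch of the standardization step; note that the scaler is fit on the training set only and then applied to the test set:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean/std from the training set
X_test = sc.transform(X_test)        # apply the same scaling to the test set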

After standardizing all the features, we are ready to perform either PCA or LDA.

Let’s perform PCA first.

Principal Component Analysis (PCA)

We can now pass in the number of principal components; let’s choose 2 components.

We can take a look at the explained variance ratio.
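A sketch of applying PCA with 2 components to the standardized features and inspecting the explained variance ratio:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(pca.explained_variance_ratio_)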

The first component accounts for 36.9% of the variance while the second component accounts for 19.3%.

Now, let’s fit a logistic regression to the training set.

The transformed test set contains the 2 principal components that were extracted; we use it to predict the test set results.

Let’s evaluate the performance of the model by constructing the confusion matrix. We should get a good result since we extracted the 2 principal components that explain the most variance (about 56% combined). In other words, these components were the directions of maximum variance in our dataset.
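A sketch of fitting the classifier on the PCA-transformed training set and building the confusion matrix; the random_state is an assumption:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_pca, y_train)       # train on the 2 principal components

y_pred = classifier.predict(X_test_pca)    # predict the test set results
cm = confusion_matrix(y_test, y_pred)
print(cm)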

How should we interpret the result above? Let’s just convert the result above into the following table for an easier interpretation.

n=36              Predicted class 1   Predicted class 2   Predicted class 3
Actual class 1           14                   0                   0
Actual class 2            1                  15                   0
Actual class 3            0                   0                  16

It is a very good result; the diagonal of the table represents the correct predictions. There were 14 correct predictions of class label 1, 15 correct predictions of class label 2, and 16 correct predictions of class label 3. There was one incorrect prediction, where the real outcome was class label 2 but it was predicted to be class label 1.

Accuracy is very good; 35 correct predictions divided by the total number of 36 cases should give us about 97.2% accuracy.

We can visualize our training result as below.

In the visualization above, the yellow, orchid, and light-salmon colors mark the predicted regions of class 1, 2, and 3 respectively. The little circles are the real observations in our wine dataset (turquoise = class 1, blue = class 2, deep-pink = class 3).

What we are more interested in is the visualization of our test result. This should be consistent with the evaluation we just obtained from the confusion matrix earlier.

Let’s assess our diagram of the test result. We see one blue circle (a real observation of class label 2) that falls in the yellow region, which is the predicted region for class label 1. This is the same incorrect prediction we observed in the confusion matrix above. Overall, using PCA, we are able to obtain ~97% accuracy.

Let’s take a look at how we can apply LDA and whether we can improve the accuracy.

Linear Discriminant Analysis (LDA)

We apply LDA before fitting the classification model to the training set.

One thing to note: because LDA is a supervised technique, we need to include the dependent variable y_train when fitting the model, as in the sketch below.
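A sketch of the LDA step, applied to the standardized 13-feature matrices from the preprocessing section (not the PCA outputs); note that y_train is passed to fit_transform:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)  # supervised: class labels are used
X_test_lda = lda.transform(X_test)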

The rest of what we do in this section should now be more familiar.

n=36              Predicted class 1   Predicted class 2   Predicted class 3
Actual class 1           14                   0                   0
Actual class 2            0                  15                   0
Actual class 3            0                   0                  16

Were there any incorrect predictions?

NO! HOORAY!

Again, the diagonal of the table represents the correct predictions. There were 14 correct predictions of class label 1, 15 correct predictions of class label 2, and 16 correct predictions of class label 3.

We obtained 100% accuracy with LDA (36 correct predictions divided by the total number of 36 cases).

Let’s visualize our result.

Again, the visualization of our test result should be consistent with the evaluation we just obtained from the confusion matrix earlier.

The little circles, which are the real observations in our wine dataset (turquoise = class 1, blue = class 2, deep-pink = class 3), are all located in their correctly predicted regions (yellow = class 1, orchid = class 2, light-salmon = class 3).

In summary, both PCA and LDA performed very well. We reduced from 13 dimensions to 2 and we still got great results for predicting class labels for the 3 different types of wine.

It’s important to note that both PCA and LDA are linear dimensionality reduction techniques. For non-linear transformations, we will need to use kernel PCA. Stay tuned for the next article.

What is a Keywords Dictionary?

A Keywords Dictionary is a set of words put together based on a common theme. Consider this example: you are managing a bank and you want to improve your customer service, so you give your customers a feedback form. To make sense of what your customers complain or talk about, you need a list of keywords related to the business, and this is where a Keywords Dictionary comes in.

Since Keywords Dictionaries are specific to a business purpose, they might not be readily available unless someone has already built one and made it public. Hence, it is usually better to build your own custom Keywords Dictionary.

How to build your own Keywords Dictionary?

First of all, we need to define a Data Source – usually a website with a list of Keywords that we are interested in. The remaining steps are:

1. Extract the website content from the given URL
2. Scrape the desired content (Keywords) from the website content
3. Clean the scraped data if required and store it locally for future use

Now, you may be wondering about some new jargon introduced in one of the points – “scrape”. It comes from the process called “Web Scraping” which is “a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis” as defined by Wikipedia.

Getting started with Web Scraping:

Python has two great packages for web scraping.
1. BeautifulSoup
2. Scrapy.
Both these packages are great for different reasons – BeautifulSoup has an elegant and consistent API that makes it very simple for a beginner to get started with Web Scraping. Scrapy handles complex scraping tasks, like extracting content after logging in or submitting a form, that are not straightforward with BeautifulSoup. It also comes with its own set of complexities, which are handy in complex scraping pipelines for someone who is already familiar with Web Scraping. Hence, we will use BeautifulSoup in this post to scrape data from the web. While BeautifulSoup can do the job of parsing the HTML and making sense of the web content, we need to “get” the website in the first place, and we will use the “requests” package for that.

How to install requests & BeautifulSoup:

Requests can be installed using pip.

pip install requests – if you are using Python 2

pip3 install requests – if you are using Python 3

BeautifulSoup can also be installed using pip (or pip3 if you are using Python 3.x).

pip install beautifulsoup4

pip3 install beautifulsoup4

Loading both the libraries:

In [24]:

from bs4 import BeautifulSoup

import requests

Data Source:

We will use Moneycontrol.com’s Glossary page to build our Finance Keywords Dictionary. Note that this post is just for educational purposes and make sure you don’t violate the Terms of Service of the websites from which you are trying to scrape.
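The notebook cell that defines the url variable is not reproduced in this post; a hypothetical placeholder would look like the line below (substitute the actual address of the glossary page):

url = "https://www.moneycontrol.com/glossary/"  # placeholder - replace with the real glossary page URL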

As we have defined the url, now let us extract the content of the url.

In [27]:

content = requests.get(url) #sends a GET http request to collect the content

We can check if the request was successful by checking the response status.

In [29]:

content.status_code #200 means successful

Out[29]:

200

In [30]:

content_text = content.text #extracting the response content as text

Now the content is ready as text, and we can use BeautifulSoup to make a “soup” – essentially, parsing the HTML.

In [33]:

soup = BeautifulSoup(content_text, "html.parser")

As we saw in the above screenshot, what we are interested in within the extracted content is the HTML tag “a”. But there are many links on the website, which could also include junk like social media links and other irrelevant links. A closer look at the screenshot also reveals that our desired URLs share a common pattern: “/glossary/”. Hence, we will extract the content with two conditions:

only “a” tag

“a” tag with “href” containing the string “glossary” in it

To extract all the “a” tag links, we will use the function “find_all()”, and to find the string “glossary” in “href”, we will use a regex for pattern matching via the Python package “re”.

In [42]:

import re

alllinks = soup.find_all("a", href=re.compile("glossary"))

Now, we are ready to extract the Keywords, which are simply the text values of each of the links we extracted and stored in “alllinks”. We will use a “for” loop to iterate through each element of “alllinks”, extract its “text” value, and store it in a list, as in the sketch below.
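A minimal sketch of that loop, continuing from the alllinks list built above:

keywords = []
for link in alllinks:
    text = link.get_text().strip()  # the visible text of each <a> tag is the keyword
    if text:
        keywords.append(text)

print(len(keywords))   # the length of our Keywords Dictionary
print(keywords[:10])   # a peek at the first few keywords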

That’s it! We have successfully built a Finance Keywords Dictionary of length 3165. Please note that some of the keywords might need a little cleaning up and business domain knowledge for further refinement before being used in your Machine Learning model. This post can easily be replicated for your own needs with a simple change of the source URL and a few other tweaks. The code is also available as a Jupyter notebook on GitHub.

Python functions are a lot more flexible than you might think. They are much more than code generation specifications for a compiler. They are full-blown objects stored in memory, so they can be freely passed around a program and called indirectly. They also support operations that have little to do with calls at all, like attribute storage and annotation.

“First Class” objects: Indirect function calls

Python functions fundamentally follow the Python object model. This means we can pass functions to other functions, embed them in other data structures, and return them from functions, as if they were basic data types like strings and numbers. Function objects also happen to support a special operation: they can be called by listing arguments in parentheses after a function expression. This is usually called a first-class object model. It is ubiquitous in Python and a necessary part of functional programming.

For example:

Let us try to understand the following piece of code. A def statement defines a variable name as if it had appeared on the left of an ‘=’ sign. After def runs, the function name is simply a reference to an object; it can be reassigned to other names freely and called through any reference. In the code below, ‘p’ references the function printmessage. We can call the object through either name by adding ().

>>> def printmessage(msg):
...     print(msg)
...
>>> printmessage('Hello World: Direct Call')
Hello World: Direct Call
>>> p = printmessage
>>> p('Hello World: Indirect Call')
Hello World: Indirect Call
>>>

We can even place the function objects into data structures, as though they were integers or strings. The following code snippet embeds the function twice in a list of tuples, as a sort of actions table.

>>> def printmessage(msg):
...     print(msg)
...
>>> schedule = [(printmessage, 'World'), (printmessage, 'Welcome')]
>>> for (func, arg) in schedule:
...     func(arg)
...
World
Welcome
>>>

Functions can also be created and returned for use elsewhere – the closure created this way also retains state from the enclosing scope:

>>> def makelabel(labelname):
...     def printlabel(message):
...         print(labelname + ' : ' + message)
...     return printlabel
...
>>> l = makelabel('Byte')
>>> l('Academy')
Byte : Academy
>>>

Python’s universal first-class object model and lack of type declarations make it an incredibly flexible programming language.

Function Introspection

Because they are objects, we can also process functions with normal object tools. In fact, functions are more flexible than you might expect. We can inspect function objects as shown below; these introspection tools even allow us to explore implementation details – functions have attached code objects, for example, which provide details on aspects such as a function's local variables and arguments.
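For example, a small sketch of this kind of introspection on a toy function (the attribute names are standard Python, but the exact dir() output will vary between versions):

def greet(name, greeting="Hello"):
    message = greeting + ", " + name
    return message

print(greet.__name__)              # 'greet' - the function's name
print(greet.__defaults__)          # ('Hello',) - default argument values
print(greet.__code__.co_varnames)  # ('name', 'greeting', 'message') - arguments and locals
print(greet.__code__.co_argcount)  # 2 - number of positional arguments
print(dir(greet)[:5])              # a sample of the attributes every function object carries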

Function Attributes

Function objects are not limited to system-defined attributes; it is also possible to attach arbitrary user-defined attributes, as shown below.
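A small sketch of attaching user-defined attributes to a function object; the attribute names here are arbitrary:

def counter():
    counter.calls += 1              # read/update an attribute attached to the function itself
    return counter.calls

counter.calls = 0                   # attach an arbitrary user-defined attribute
counter.owner = "Byte Academy"      # any name works; it lives in counter.__dict__

print(counter(), counter(), counter())  # 1 2 3
print(counter.__dict__)                 # {'calls': 3, 'owner': 'Byte Academy'}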

Python Decorators

One of the applications of Python function objects is Python decorators. What is a Python decorator? Decoration is a way to specify management or augmentation code for functions and classes.

A decorator in python is any callable python object that is used to modify a function or a class.

Python decorators come in two related flavors.

Function decorators: a reference to a function is passed to the decorator, and the decorator returns a modified function. The modified function contains references and calls to the original function.

Class decorators: a reference to a class is passed to the decorator, and the decorator returns a modified class. The modified class contains references and calls to the original class.

Function Decorators Example:

def color_text(name):
    return "Coloring this {0}".format(name)

def ptext_decorate(func):
    def ptag_name(name):
        return "<p>{0}</p>".format(func(name))
    return ptag_name

mytext = ptext_decorate(color_text)
print(mytext('Byte Academy'))

Output:

<p>Coloring this Byte Academy</p>

Python Decorator Syntax – Syntactic Sugar

Python makes creating and using decorators a bit cleaner and nicer for the programmer through syntactic sugar. In the above example, to decorate the text we don’t have to write the assignment explicitly:

mytext = ptext_decorate(color_text)

There is a neat shortcut for this: the name of the decorator, prefixed with an @ symbol, is placed directly above the function definition as follows:

def ptext_decorate(func):
    def ptag_name(name):
        return "<p>{0}</p>".format(func(name))
    return ptag_name

@ptext_decorate
def color_text(name):
    return "Coloring this {0}".format(name)

print(color_text('Byte Academy'))

Class Decorators Example: By Decorating Methods

In Python, methods are functions that expect their first parameter to be a reference to the current object. We can build decorators for methods the same way, taking self into consideration in the ptag_name function.

def ptext_decorate(func):
    def ptag_name(self):
        return "<p>{0}</p>".format(func(self))
    return ptag_name

class Person(object):
    def __init__(self):
        self.name = 'Dennis'
        self.family = 'Gutridge'

    @ptext_decorate
    def getfullname(self):
        return self.name + self.family

person1 = Person()
print(person1.getfullname())

Applications of Decorators

In general, decorators are ideal for extending the behavior of functions that we don’t want to modify. As structuring tools, decorators naturally foster encapsulation of code, which reduces redundancy and makes future changes easier. The reference page at https://wiki.python.org/moin/PythonDecoratorLibrary collects decorator code pieces in a central repository.

In machine learning, we are often in the realm of “function approximation”. That is, we have a certain ground truth (y) and associated variables (X), and our aim is to identify a function of our variables that does a good job of approximating the ground truth. This exercise in function approximation is also known as “supervised learning”.

“Unsupervised learning”, on the other hand, is a slightly different problem to tackle. Here, our data does not contain a ground truth; all we have is our variables. Let’s elaborate on how this situation differs from supervised learning.

Since we do not have a ground truth, our task here is not to predict or approximate any outcome. Consequently, there is no loss/cost function providing feedback on how close or far our function’s output is from the ground truth. Isn’t this perplexing? If there is no feedback on the “goodness” of our output, how do we know whether our output is desirable or complete hogwash?

In this tutorial, we will look at what unsupervised learning actually is, and comprehensively understand and execute a common unsupervised learning task: clustering.

Clustering

In the absence of a teacher/ground truth, what can we do with just our variables? Let’s take an example with the Online Retail Dataset, which contains all the transactions occurring between 2010 and 2011 for a UK-based online retailer. Let’s take a peek at the data using pandas.

Online Retail Dataset

This dataset has 8 columns and more than 500,000 rows.

Now, imagine you work for this online retailer. What can be done with this data? Well, one activity could be to try to identify customer types: how many different customer types do I have? This is a common task for e-commerce companies known as customer segmentation. Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in ways relevant to marketing, such as age, gender, interests and spending habits. So, the purpose here is to identify customers that are similar to each other, place them in one group, and then seek other such groups or segments.

In machine learning, this task of identifying similarity is known as clustering. The most popular clustering technique is “K-Means”.

K-Means Algorithm

K-Means is an iterative clustering algorithm that seeks to find homogeneous or similar subgroups in our data. The first thing we need, then, is to explicitly define similarity/dissimilarity.

Similarity amongst our observations, in the simplest terms, can be stated via the Euclidean distance between data points. Let’s illustrate this with the example below, a plot of individuals by their height and weight: clearly, the green data points (individuals) are more similar to each other than to the red data points.

Simple Example of Similarity: Heights and Weights

So, if two data points are similar, we will consider them part of one cluster. Ideally, we would want data points in one cluster to be as close to each other as possible. We can then state the goal of clustering formally: to minimize the distance between observations within each cluster, across all clusters. Let’s express this via a function.

The inner part of the function, W(C_k) = (1/|C_k|) Σ_{i,i' ∈ C_k} Σ_{j=1}^{p} (x_ij − x_i'j)², reads: “the within-cluster variation is the sum of squared Euclidean distances between observations in the kth cluster (scaled by the cluster size).” The outer sum, Σ_{k=1}^{K} W(C_k), simply adds this within-cluster variation across clusters (1 through K, hence the name K-means!).

Our goal, then, is to minimize Σ W(C_k). OK, but how? The iterative algorithm to accomplish this minimization is:

1. Randomly assign a number between 1 and K (we will get to how to choose K later) to each row in your data. This is the initial cluster assignment.

2. For each cluster, compute the cluster centroid. The centroid is simply the vector of feature means for the observations in that cluster. The size of this vector depends on the number of features (p) in the dataset; for the Online Retail data, this would be 8.

3. Revisit the cluster assignment and re-assign each row to the cluster whose centroid is closest to it (in Euclidean distance across the p features).

4. Iterate (repeat steps 2-3) until the cluster assignment stops changing, or changes by less than some tolerable level (more on this later).

The class KMeans below is a sketch implementation of the above steps. Reading through the comments will help you understand them.
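This is a minimal NumPy version, written as an illustration of the steps rather than a production implementation (for simplicity it does not handle edge cases such as a cluster becoming empty):

import numpy as np

class KMeans:
    def __init__(self, k=3, max_iter=100, random_state=42):
        self.k = k
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.RandomState(self.random_state)
        # Step 1: randomly assign each row to one of the K clusters.
        labels = rng.randint(self.k, size=X.shape[0])
        for _ in range(self.max_iter):
            # Step 2: compute each cluster centroid (the vector of feature means).
            centroids = np.array([X[labels == j].mean(axis=0) for j in range(self.k)])
            # Step 3: re-assign each row to the cluster whose centroid is closest (Euclidean distance).
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = distances.argmin(axis=1)
            # Step 4: stop once the cluster assignment no longer changes.
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        self.labels_ = labels
        self.centroids_ = centroids
        return self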

There are a few things left to explain before we apply the algorithm to our data. First, how do we choose how many clusters we want? This question is at the heart of the clustering problem: we do not know whether there are 4 types of customers (clusters) in our data, or 7, since this is an unsupervised problem. So, part of our task is to identify the most suitable number of clusters K for segmenting our data. However, there is no “correct” answer here, since there is no ground truth; indeed, choosing the value of K is often a business decision. In the Online Retail dataset, if we divide our customers into only 2 segments and send out marketing material to each customer as per their cluster assignment, our marketing pitch may be too general and customers may not return to our e-commerce website. On the other hand, if we divide our customers into 100 segments, we may only have a handful of customers per segment, and it will be a nightmare to send out 100 variations of marketing material. So, while the choice of K is ultimately a business decision, we do have techniques to guide it.

Choosing the “right” K

Elbow Method

The elbow method allows us to choose a value of K via visual aid. We try breaking up our data into different numbers of clusters K and plot each value of K against the corresponding total within-cluster variation Σ W(C_k). An example is below.

We choose the value of K at the point where the decrease in Σ W(C_k) begins to level off. In the example below, the optimal K appears to be 2, since the drop in Σ W(C_k) between K = 1 and K = 2 is much larger than the drop between K = 2 and K = 3. In other words, we visually look for the “elbow” of the curve.
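A sketch of producing such an elbow plot with scikit-learn; the synthetic make_blobs data below is a placeholder standing in for a preprocessed feature matrix from the Online Retail data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # placeholder data

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(km.inertia_)  # inertia_ is the total within-cluster sum of squares

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Total within-cluster variation")
plt.title("Elbow method")
plt.show()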

Silhouette Method/Analysis

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette coefficient displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like the number of clusters. This measure has a range of [-1, 1].

Silhouette coefficients (as these values are called) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.

The silhouette coefficient is calculated using the mean within-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of.
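A sketch of computing the average silhouette coefficient for several values of K with scikit-learn, again on placeholder make_blobs data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # placeholder data

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)  # mean of (b - a) / max(a, b) over all samples
    print("K =", k, "average silhouette coefficient =", round(score, 3))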

Clustering is challenging in that, at the outset, one is not exactly sure whether the output is going to be useful. If one were to create 3 clusters or 8 clusters with a dataset, how do we know which one is the correct choice? Let us say that, using the Online Retail Dataset, one concluded that there are 6 customer types or clusters. Based on this, the marketing department of the company sent out email advertisements to customers as per their cluster assignment. The clustering would be useful if, say, a customer interested in deals on electronic products actually received an email with those products; he or she would hopefully click on the advertisement and purchase an item.

Over time, one could then evaluate the clustering based on the overall response of customers to the email advertisements. Many clicks or purchases would reflect an appropriate clustering. If not, clearly the clustering needs to be adjusted.

This is exactly the challenge with clustering using K-means or any other method. While guides like the Elbow or Silhouette method exist, we can never be exactly sure of the validity of our clustering. Even so, K-means is a powerful and quick algorithm that if used wisely in conjunction with domain knowledge, can produce great results.

By Jithin J and Karthik Ravindra, Byte Academy

Analyzing time series data needs special attention. Here, we would like to explore working with time series data and identify the effect of autocorrelation, to come up with a more practical approach to working with linear regression models. When using data to estimate some value, say equity prices, autocorrelation is a common feature. It is defined as the situation in which the error terms of the linear regression model are correlated. So, if one error term is positive (or negative), and this fact causes the next error term to also be positive (or negative), we say that the model suffers from autocorrelation. It is a very serious problem, as it violates the common assumption that the error term is stochastic and non-deterministic. Maintaining a stochastic error term is important for preserving the integrity of a linear regression; otherwise it risks inducing bias in the model’s estimations.

Let’s take an example of some financial data during a stock market crash. The crash on day one increases the likelihood of observing a downward trend for the next few days, perhaps even weeks. If the model suffers from autocorrelation and is used for extrapolation, the model will estimate a similar stock market crash in the future as well. Therefore, we must first be able to identify the presence of this trend.

To prepare this article, we decided to pick a financial data set. After some quick research, we decided to work with the Shiller PE ratio and estimate the movement of the S&P monthly closing price. The data was taken from: http://www.multpl.com/shiller-pe/table?f=m.

Domain Knowledge

The Shiller P/E is a valuation measure usually applied to the US S&P 500 equity market. It is defined as price divided by the average of ten years of earnings (moving average), adjusted for inflation. As such, it is principally used to assess likely future returns from equities over timescales of 10 to 20 years, with higher than average values implying lower than average long-term annual average returns.

Web scraping

We start by scraping the Shiller P/E ratio and S&P closing prices from http://www.multpl.com/shiller-pe/table?f=m. If you are interested in the web scraping itself, the Python code is here: https://github.com/jithinjkumar.

Once our data has been extracted, we store it in pandas DataFrames. We create a DataFrame with a time series index column, and the S&P closing price and Shiller ratio as our columns.

Once the data is stored, we need to clean and prepare it for analysis.

Data Preparation and Data Cleaning using Pandas library: Creating a Time Series

So we have the Shiller ratio data and the S&P closing price in two different data frames; now let’s perform a lookup to get the Shiller PE ratio for each month into the closing-price data frame, as sketched below.
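A minimal sketch of that lookup with pandas.merge; the frames and column names here (SandPDate, shDate, sp_close, sh_Ratio) are illustrative placeholders standing in for the scraped data:

import pandas as pd

closing = pd.DataFrame({"SandPDate": ["2018-01-01", "2018-02-01"], "sp_close": [100.0, 101.5]})
shiller = pd.DataFrame({"shDate": ["2018-01-01", "2018-02-01"], "sh_Ratio": [30.1, 29.8]})

# Look up the Shiller PE ratio for each month in the closing-price frame.
merged = closing.merge(shiller, left_on="SandPDate", right_on="shDate", how="left")
print(merged.head())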

We have 1769 entries and 4 columns. SandPDate and shDate are both date columns, so we can easily drop one of them, and we also need to check for null values.

sh_Ratio has 120 null values; we can safely drop these rows from our dataset, as they account for less than 6% of the total.

Now we create a time series, for which the S&P date column needs to be formatted correctly so that we can assign the correct data type to each column.
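A sketch of that formatting step, continuing from the illustrative merged frame above: parse the date column, set it as the index, and make sure the numeric columns have numeric dtypes:

merged["SandPDate"] = pd.to_datetime(merged["SandPDate"])
ts = merged.drop(columns=["shDate"]).set_index("SandPDate").sort_index()
ts["sp_close"] = ts["sp_close"].astype(float)
print(ts.dtypes)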

Now our Dataframe is in a time series format and ready for further analysis.

Stay tuned for the next post in this series, in which we will discuss Time Series Analysis.

Any trading symbols displayed are for illustrative purposes only and are not intended to portray recommendations. Originally posted on the IBKR Quant Blog here.

Blockchain technology is commonly misconstrued as being applicable only for cryptocurrencies. As a secure, immutable ledger technology, blockchain can be applied to many use cases in several industries. The very nature of blockchain makes it a secure application for use in transfer of value or as storehouses of critical information. Because of its flexibility, the finance industry is already looking at other use cases for blockchain, one of them being online payments. The transparency and simplicity of blockchain makes it easy for anyone with an internet connection to transfer and receive funds without having to pay exorbitant amounts in banking fees.

Many industries are looking to incorporate blockchain into their IT strategies. Joe Lubin, co-founder of Ethereum, explains why: “There won’t ever be a single powerful entity that controls the system or controls gatekeeping into the system the way blockchain does”. Let’s take a look at how some of these industries can leverage blockchain technology –

Real Estate

Blockchain based smart contracts can make the buying, selling and renting of property more transparent by directly connecting owners, buyers, landlords and tenants. Rentberry, Atlant and Beetoken use blockchain to remove third party dealers, fraudulent deposits and fake reviews from the equation.

Healthcare

Patient records are critical and confidential information that needs to be stored in a highly secure environment. With the help of blockchain, universal health records can be established by aggregating and placing a person’s health history onto a blockchain ledger, which can only be accessed by an authorised healthcare professional. Doc.AI is one of the frontrunners in the healthcare sector to secure patient data through blockchain technology to generate medical insights.

Finance & Insurance

The financial sector is exploring the different use cases of blockchain, and top banks like Barclays, JP Morgan Chase and Goldman Sachs are investigating its potential applications. One of the use cases is how ICOs are being used to crowdfund early-stage cryptocurrency projects by releasing their own tokens in exchange for Bitcoin.

Retail

Blockchain will help transform the supply chain into a more efficient process with the use of smart contracts. As transparency of raw material and fund sourcing can be recorded and locked and only used when a certain number of orders are reached, it will eliminate overestimation of demand and waste of resources. Provenance uses blockchain to offer a platform to retailers that makes the retailer’s products and supply chains more transparent and traceable.

Transport

Ride sharing apps can be enhanced with the help of blockchain technology, where drivers and riders pair with each other through a universally accepted, blockchain-authenticated ID. Drivers can set their own rates, work on their own time and remove the middleman entirely. Drivers and riders can pair up according to their compatibility based on location, distance travelled and fee charged, thereby creating a more satisfying experience for travellers. La’Zooz is using blockchain distributed ledger technology to make the ride sharing concept even more decentralized. Rather than dealing with fixed prices for rides, the blockchain technology used by La’Zooz allows the community members to decide how to reward the riders and drivers using sophisticated protocols.

Pharma & Chemical

The immutable ledger system of blockchain can help build a secure and transparent supply chain ledger that can combat the $75 billion counterfeit medication market by tracking production and distribution. Recently, IBM partnered up with a Chinese supply chain management firm, Easysight Supply Chain Management, to introduce the Yijian Blockchain Technology Application System, a new platform based on the hyperledger fabric open source blockchain framework. This system plans to cover several pharma retailers, hospitals and banks to track the movement of drugs through supply chain and encrypt trading records to legitimize the authenticity of transactions.

Utilities & Energy

Use of blockchain and cryptocurrencies can help the government and the public reduce disputed energy transaction through secure online payments. Without the requirement of 3rd party validation the energy transactions can be stored in a transparent ledger and the payments processed quickly. The use of blockchain in the energy sector is already set in motion. In early 2016, in an experiment with a US based company in Brooklyn, New York, the owner of a solar panel sold energy to his neighbor through Ethereum smart contract. This happened on the Brooklyn microgrid managed by Lo3 Energy.

Education

Deploying a blockchain solution in online education can create a unified ledger of coursework and certifications. This will help make accredited MOOC certifications available to developing nations, validate the hours put in by students, and mitigate fraudulent claims of completed courses and unearned certifications. edChain is one company using blockchain in this area.

Agriculture

As blockchain technology grows more sophisticated in supply chain management, it can also be used to battle foodborne outbreaks in the agriculture industry. Since every production, packaging and inspection step for food items is tracked in an immutable ledger, an outbreak related to a particular food item can be immediately traced. IBM is partnering with Walmart, Nestle, Dole, Tyson Foods and Kroger to use blockchain technology to track food throughout the complex, global supply chain in order to identify contaminated foods within minutes. This process would otherwise take days, so blockchain can significantly reduce foodborne disease.

With so many industries adopting blockchain for various uses, it is a great time for software developers to grab the opportunity to learn blockchain. With an academic certification in blockchain technology, and a background in any of these industries, developers can make a real difference in how efficiently blockchain technology is applied by industry.

De Beers, the world’s biggest diamond producer, has tracked its first diamonds all the way from the mine to the jewelry producer using blockchain, the technology behind Bitcoin. It is even planning to launch a platform for other diamond retailers later this year which uses blockchain to track a diamond through the entire value chain. Known as Tracr, the program gives each stone a unique ID that stores diamond characteristics such as weight, color and clarity.

De Beers has been in collaboration with five other major diamond manufacturers, Diacore, Diarough, KGK Group, Rosy Blue NV, and Venus Jewel for Tracr’s development. Improving transparency within the diamond industry is key for the joint initiative. Amit Bhansali, Managing Director of Rosy Blue NV, states, “Initiatives that use blockchain can drive this process even further as their implementation requires collaboration and trust creation among all industry stakeholders.”

Diamonds aren’t the only luxury industry harnessing blockchain’s power; companies in areas such as handbags and fine art are also utilizing this technology. They want to ensure authenticity and verify that, indeed, the bag you’re holding really is a Hermes Birkin, or that the painting is indeed a Warhol.

On a larger scale, the advent of blockchain may solve the problem of counterfeit goods in global trade with luxury goods being amongst areas hit the hardest. According to the Organization for Economic Cooperation and Development, imports of counterfeit goods are worth nearly half a trillion dollars annually, making up 2.5% of global imports, and it’s American, Italian, and French brands that are hit the hardest.

With blockchain as the standard, consumers will less likely be fooled by knockoffs, and they will be more certain that their money is being well spent. Two companies facilitating this consumer certainty include VeChain and Luxochain. Both are using blockchain in order to ensure the authenticity of goods and help track items even after purchase to ensure that they aren’t lost or stolen.

Blockchain’s secure nature is key to its application in diamonds and luxury goods. From a technical perspective, each entry, or block, of the blockchain is connected to the next and contains specific information about the product, including its current owner and a timestamp. The data in one block cannot be manipulated without changing all subsequent blocks. Because a blockchain is an open, distributed ledger, there is no centralized database. Hence, the data cannot be tampered with – it is secure.

Take the example of a designer handbag again. At production, the handbag is tagged with a unique serial code and registered on a luxury goods blockchain app, such as VeChain or Luxochain. At each point through the supply chain, it is scanned. New blocks are added to the chain representing the bag, updating the ownership and timestamping it each time. When it reaches stores, a consumer can walk in, scan the bag to ensure that it is truly genuine, and follow through with the purchase if he or she so desires. At this point, the transaction is placed on the blockchain, and the bag’s new official owner is the customer. This blockchain solution is a simple process, and one that adds security and peace of mind.

Data Science is the coveted new career around the block but not many can define the exact role of a data scientist. Being a relatively new field of work with people signing up for the role from different backgrounds, data science as a discipline requires a very broad skill set. Data mining, data analysis, machine learning, business analysis, data visualization, A/B testing are some of the skills a data scientist should have.

Machine learning is a large discipline in itself, with companies like Facebook relying on machine learning algorithms to sift through user behaviour patterns on a daily basis. Machine learning also involves a lot of data analysis, A/B testing and data visualization. More often than not, machine learning and data science are used as interchangeable terms, but they shouldn’t be.

If we were to explain data science and machine learning through a venn diagram, machine learning would be a subset of data science. To understand the differences in a simpler way, it would be better to start with what is data science and machine learning. Once we are thorough with the basic differences, we can delve deeper into understanding the overlap and the distinction between these two fields.

What is data science?

Data science is behind deriving actionable insights from raw data. It is used to derive insights from the chaos of big data through predictive modelling, data analytics and machine learning. Data science is behind pattern recognition, structuring big data and, finally, advising top management on the critical outcomes that are possible. It is decision science.

Data science is multidisciplinary. Apart from having technical knowledge in statistics, data mining, machine learning, databases, data processes, visualizations, pattern recognition and AI, a data scientist also needs to have domain knowledge, expertise in business strategy, inquisitiveness and good communication and presentation skills.

What is machine learning?

Machine learning, when explained in simple terms, means the use of software programs with the application of artificial intelligence to learn to detect patterns in data by itself without being specifically programmed. It begins with observations in data patterns and mapping them to earlier run programs. The aim is to allow computers to run programs without explicit human intervention.

We inadvertently use machine learning in our daily lives without realising it. Effective web search is a prime example of machine learning and now it is being used in self driving cars and speech recognition.

Data Science vs Machine Learning

As explained earlier, machine learning is but a subset of data science. Machine learning is an analysis that may be used in data science, but it is not a precondition for data science, unlike statistics. While machine learning is mostly used for pattern recognition, data science is used to find answers to questions. For example, if the supply managers at, say, Amazon wanted to find out whether they needed to source more blue jackets than red jackets this winter, they would ask a data scientist.

The main difference between data science and machine learning is this – data science is used for predictive and prescriptive analysis usually to answer critical business questions. Machine learning algorithms are used for predictions – eg. predicting the future trends of an event and for pattern recognition. Data science is a bigger field of study than machine learning. These two terms are not interchangeable.

Data science / applied analytics is certainly in, and it is here to stay and thrive. Be a data scientist, what Harvard Business Review calls the “Sexiest Job of the 21st Century”. Take the leap: get into an intensive data science bootcamp and work on live projects. The sooner the better, to make the most of this wave!

Blockchain Developers are in High Demand: Are you Ready for the Next Big Leap?

Blockchain has emerged as the next big thing in highly paid IT careers, with demand increasing almost 2000% according to Upwork (a global freelancing platform). From a part-time passion project to the hottest skill in demand, blockchain technology has come a long way. The blockchain distributed ledger system is proving effective at securely storing records of transactions, and industries from finance to medicine to agriculture are moving quickly to integrate the technology into their services and systems.

To do so, they need blockchain developers, who have rocketed in demand. Since mid-2016, interest in blockchain has expanded beyond a few hobbyists and enthusiasts to being THE sought-after technology for larger companies in industries such as banking, healthcare, cryptocurrency exchange, education and video collaboration. At the beginning of this year, Upwork reported that blockchain development is the second most demanded skill in the labor market, comparing the growth of blockchain to that of ‘cloud’ in the mid-2000s.

As they search for blockchain developers, hiring managers are realizing that there is a huge skill gap in the market, particularly as the technology has not matured like Java or cloud. The general consensus is that there are 14 job openings for every blockchain developer at any given time. In the US, blockchain developers can command salaries anywhere between $150,000 and $250,000, charging as much as $200 per hour for freelancing (Upwork). Consistent with these numbers, Toptal, the on-demand tech talent marketplace, says that the demand for blockchain developers on their platform has grown by 700%.

Given the immense opportunity, you may find yourself eager to go into blockchain development. If you are a web developer with extensive knowledge of C, C++ and Python, you have an advantage: most employers look for proficiency in C, C++, Java or Python. Blockchain developers tend to be hired to design, implement and support a distributed ledger blockchain-based network. Experience in blockchain technologies like Hyperledger and Solidity will give you a better chance of commanding an even higher salary. A strong understanding of cryptography, hash functions, encryption and block integration will be a huge plus. Full stack experience is also preferred by employers looking for blockchain candidates.

Why is Python the preferred language among developers?

Click on any blockchain tutorial on the net and the chances are you will find one where the blockchain code is written in Python. Python is the preferred language among developers for its simplicity and minimalism. Extensive support libraries for string operations, operating system interfaces, user friendly data structures and its clean object oriented design make it a favorite among developers as it increases productivity and speed of development. Most developers prefer to pair C++ and Python while developing a blockchain code.

Opting to learn Python full stack and blockchain development can open you to a world of new possibilities.

Web developers, rejoice! If you’ve been looking for a way to make a foray into the world of Machine Learning and Deep Learning, your learning curve has gotten that much more gentle with the introduction of the TensorFlow library in JavaScript.

Introduction to Javascript

JavaScript is a high-level, interpreted programming language. It was initially created to make web pages “alive” and is one of the three core technologies of the World Wide Web, along with HTML and CSS.

Starting off as a language implemented only client-side in web browsers, it is now embedded in many other types of host software and provides functionality that has moved far beyond its original use of making web pages more interactive.

While there are many machine learning libraries already available in the JavaScript ecosystem, the introduction of TensorFlow in JS marks a huge step forward for implementing machine learning algorithms directly in the browser.

Introduction to TensorFlow

TensorFlow is an open source software framework that helps implement machine learning algorithms in a variety of environments. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organization, it has found wide usage and is used by companies such as Airbnb, Nvidia, Uber, Kakao to name a few.

Now, with the launch of TensorFlow.js, it has become possible to launch, train, and execute machine intelligence within the browser using just JavaScript and a high-level layers API.

Why use Machine Learning in Javascript?

Conventionally, machine learning is often associated with languages such as R and Python, so why bother implementing it in a language such as JavaScript?

For starters, implementing these algorithms in web browsers allows us to try out machine learning on the client side, which means that the platform can be used for rapid prototyping, interactive explanations, and visualizations. From a user’s perspective, this means that a person does not need to install any libraries or drivers to run machine intelligence.

Even the performance of the algorithms is accelerated thanks to its integration with WebGL, which means that the code is accelerated whenever a GPU is available. This means that the user does not require a CUDA or cuDNN installation, and can even work on most mobile GPUs(!). However, the performance will not be comparable to that of CUDA/cuDNN.

What can you do with TensorFlow.js?

You can convert your pre-existing TensorFlow and Keras models to run them in the browser for inference.

You can use transfer learning to re-train an existing model by providing it data collected in the software. You can use this method, called Image Retraining, to create new image classifications using the same model as before.

You can create models directly in the browser.

How can I install TensorFlow.js?

You can either start using TensorFlow.js via script tags or install it from NPM. A minimal, self-contained example using script tags looks like the snippet below; the model definition and the toy training data are illustrative.

<!-- Load TensorFlow.js; this CDN path is the standard published bundle. -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>

<!-- Place your code in the script tag below. You can also use an external .js file -->
<script>
// Notice there is no 'import' statement. 'tf' is available on the index page
// because of the script tag above.

// Define and compile a minimal one-layer model, and make some toy training data.
const model = tf.sequential();
model.add(tf.layers.dense({units: 1, inputShape: [1]}));
model.compile({loss: 'meanSquaredError', optimizer: 'sgd'});
const xs = tf.tensor2d([1, 2, 3, 4], [4, 1]);
const ys = tf.tensor2d([1, 3, 5, 7], [4, 1]);

// Train the model using the data.
model.fit(xs, ys).then(() => {
  // Use the model to do inference on a data point the model hasn't seen before.
  // Open the browser devtools to see the output.
  model.predict(tf.tensor2d([5], [1, 1])).print();
});
</script>

Our Thoughts

This is a great opportunity for developers familiar with Javascript to make a foray into the world of data science and machine learning. Making a leap into a different domain is a difficult task, more so for a field that requires as much technical knowledge as machine learning. However, the familiarity of the language can make things easier and that’s what we’re hoping for with the introduction of TensorFlow.js.

To get a better understanding of learning data science from scratch, check out our course details right here.