~ Broaden your Horizon

Distilled News

Classifying time series data? Is that really possible? What could potentially be the use of doing that? These are just some of the questions you must have had when you read the title of this article. And it’s only fair – I had the exact same thoughts when I first came across this concept! The time series data most of us are exposed to deals primarily with generating forecasts. Whether that’s predicting the demand or sales of a product, the count of passengers in an airline or the closing price of a particular stock, we are used to leveraging tried and tested time series techniques for forecasting requirements.

Garbage in, garbage out. This mantra – which has been around since at least the 1950s – is paramount in a world where data has become more and more central to computational methods. At Opendoor, we feel the pain of poor data quality acutely: it can easily throw our home valuation models’ predictions off hundreds of basis points. For many of our models, one of the most important characteristics of a home is the subdivision. Roughly speaking, a subdivision consists of a group of homes that were developed generally around the same time, by the same builder, with similar materials and similar architectural styles. Subdivisions thus provide meaningful clusterings of homes in the United States housing market.

Measuring the similarity between texts is a common task in many applications. It is useful in classic NLP fields like search, as well as in such far from NLP areas as medicine and genetics. There are many different approaches of how to compare two texts (strings of characters). Each has its own advantages and disadvantages and is good only for a range of specific use cases. To help you better understand the differences between the approaches we have prepared the following infographic.

Testing in R seems simple. Start by using usethis::test_name(‘name’) and off you go by coding your tests in testthat with functions like expect_equal. You can find a lot of tutorials online, there is even a whole book on ‘Testing R Code’. Sadly, this is not the way I can go. As I mentioned a few times, I work in a strongly regulated environment. Inside such environments your tests are not only checked by coders, but also by people who cannot code. Some of your tests will even be written by people who cannot code. Something like specflow or cucumber would really help them to write such tests. But those do not exist in R. Additionally, those people cannot read command line test reports. You can train them to do it, but we decided it is easier to provide us and them with a pretty environment for testing, called RTest.

Yann LeCun described it as ‘the most interesting idea in the last 10 years in Machine Learning’. Of course, such a compliment coming from such a prominent researcher in the deep learning area is always a great advertisement for the subject we are talking about! And, indeed, Generative Adversarial Networks (GANs for short) have had a huge success since they were introduced in 2014 by Ian J. Goodfellow and co-authors in the article Generative Adversarial Nets. So what are Generative Adversarial Networks ? What makes them so ‘interesting’ ? In this post, we will see that adversarial training is an enlightening idea, beautiful by its simplicity, that represents a real conceptual progress for Machine Learning and more especially for generative models (in the same way as backpropagation is a simple but really smart trick that made the ground idea of neural networks became so popular and efficient). Before going into the details, let’s give a quick overview of what GANs are made for. Generative Adversarial Networks belong to the set of generative models. It means that they are able to produce / to generate (we’ll see how) new content. To illustrate this notion of ‘generative models’, we can take a look at some well known examples of results obtained with GANs.

Let’s just agree, we are all bad at calculus when it comes to large neural networks. It is impractical to calculate gradients of such large composite functions by explicitly solving mathematical equations especially because these curves exist in a large number of dimensions and are impossible to fathom. To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it -Geoffrey Hinton This is where PyTorch’s autograd comes in. It abstracts the complicated mathematics and helps us ‘magically’ calculate gradients of high dimensional curves with only a few lines of code. This post attempts to describe the magic of autograd.

During holidays I wanted to ramp up my reinforcement learning skills. Knowing absolutely nothing about the field, I did a course where I was exposed to Q-learning and its ‘deep’ equivalent (Deep-Q Learning). That’s where I got exposed to OpenAI’s Gym where they have several environments for the agent to play in and learn from.

Data science requires an insightful understanding of data. As more and more data accumulates, it becomes harder to answer the following question: How do I spatially represent my data in an accurate and meaningful way? I claim that a super useful step in answering this question is understanding what a manifold is. Here’s the good news: It’s very likely you already understand what a manifold is. Manifolds are visual by nature, so everyday examples are abundant. In this article I will:
1. Explain what a manifold is and give a conceptual definition.
2. Visualize examples of manifolds in various contexts.
3. Show how manifolds are used in data science.

Language models and transfer learning have become one of the cornerstones of NLP in recent years. Phenomenal results were achieved by first building a model of words or even characters, and then using that model to solve other tasks such as sentiment analysis, question answering and others. While most of the models were built for a single language or several languages separately, a new paper?-?Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond?-?presents a different approach. The paper uses a single sentence encoder that supports over 90 languages. Its language model is trained on a dataset that contains sentences from all these languages. However, it can be utilized for a given task by only training the target model (e.g. classifier) on a single language, a technique named Zero-Shot. This Universal language model achieves strong results across most languages in tasks such as Natural Language Inference (classifying relationship between two sentences), that are state-of-the-art for Zero-Shot models. This technique is faster to train and has the potential to support hundreds of languages with limited training resources.

Regularization is just like the cat shown above when some of the weights want to be ‘big’ in magnitude, we penalize them. And today I wanted to see what kind of changes does regularization term brings on the gradient. Below is the list of different regularization terms that we are going to compare. (? is the weight of each layer.).

My colleague at work recommended me several wonderful Machine Learning libraries and some of them were new to me. Therefore, I decided to try them out, one by one. Today is TPOT’s turn. The data set was about predicting engineers of Daimler’s Mercedes-Benz cars speed of testing system, the purpose is to reduce the time that cars spend on testing, with over three hundreds of features. Frankly, I have little or no domain expertise in automobile industry. Regardless, I will try to make the best predictions I can, using TPOT, a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

In an earlier post, I discussed a model agnostic feature selection technique called forward feature selection which basically extracted the most important features required for the optimal value of chosen KPI. It had one caveat though?-?large time complexity. In order to circumvent that issue feature importance can directly be obtained from the model being trained. In this post, I will consider 2 classification and 1 regression algorithms to explain model-based feature importance in detail.

Processing of high dimensional data can be very challenging. By ‘high’ it is meant thousands of dimensions, try to imagine(even though you can’t) a 70k dimensional space. Algorithms which rely on Euclidean distance as the measure of distance between 2 points start breaking down. This is referred to as the curse of dimensionality. Models such as K Nearest Neighbors and Linear Regression can easily overfit to high dimensional data and thus require careful hyperparameter tuning. Thus dimensionality reduction can be quite advantageous for any predictive model. However one cannot just throw away features randomly, after all, it is data which is the new oil. Dimensionality reduction techniques have been developed which not only facilitate extraction of discriminating features for data modeling but also help in visualizing high dimensional data in 2D, 3D or nD(if you can visualize it) space by transforming high dimensional data into low dimensional embeddings while preserving some fraction of originally available information. In case of PCA, this information is contained in the variance of extracted features whereas TSNE(T distributed stochastic neighborhood embedding) tries to preserve neighborhood information for as many points as it can, based on perplexity of the model. TSNE is state-of-the-art technique presently available.

If you are familiar with matrix and vectors then it would not take much time for you to understand what SVD is, however, if you are not familiar with matrix, I would suggest you to first get the grasp of that. SVD is the method in which we represent data in form of matrix and then we reduce the number of columns it has in order to represent same information. Why wouldn’t data be lost? One might ask. The answer for that question is the essence of SVD and we are going to see how it works.

Quantum computers have the potential to be exponentially faster than traditional computers which will revolutionise the way we currently solve a number of applications. You can find out how in this detailed article but we are still years away from general purpose Quantum Computers. However, for certain applications Bayesian Optimization can help stabilise quantum circuits and this article will summarise how OPTaaS did so in this paper which has been submitted to Science. The team behind the paper was composed of researchers from the University of Maryland, UCL, Cambridge Quantum Computing, Mind Foundry, Central Connecticut State University and IonQ.

Compliance is not a sexy topic, and so here’s the nightmare scenario: You are the CFO in a Fortune 500 company, and many internal reports reach your desk on a regular basis. Let’s say your annual internal audit happens in January and an important report lands on your desk in February. If the key information in that report is not dealt with by your office until the next annual audit, then your job may be at risk. The difficulty here is that non-financial risks such as reputation risk, data breach, privacy violations, or even planning errors can lead to losses down the road. These risks are hard to detect because they reach your desk as text, not numbers. Often the are just too many long reports, and key insights are missed. They don’t show up in the books until the crisis is upon you, be it in the form of a fine, a lawsuit, or lost client accounts. It’s hard to justify in a board meeting why you had the report on your desk (or in your inbox) but never surfaced the issue. In a small company, this just isn’t a problem. There are not enough reports to motivate an AI-based solution. It’s only when you face a tidal wave of reports, that suddenly, the executives can’t handle the volume and variety of risks effectively. These reports are typically curated for decades in a Document Management System (DMS), but don’t reach some analytics tool to generate insights, and don’t get assessed by humans either. Moreover, reading every single report in detail will still miss the big picture. You can’t see the non-financial performance of business units on a regular basis, leading to a best case scenario of a human-generated annual snapshot, or no picture at all, rather than a live view of the situation in the company, at a detail-oriented level, for whatever keywords you make up on the fly.

A rtificial Intelligence. The term has become ubiquitous in this digital age. Highly complex, opaque yet celebrated. We are grateful for the ease it brings through recommendation engines on retail, music and streaming platforms. Nevertheless, overwhelming choices have led us to trade in our purchasing behaviour and transaction data for an algorithm that simply reinforces our own tastes of the past.

Conversational applications are becoming part of our everyday’ s lives using channels such as digital assistants or bots. Despite the unquestionable progress made in natural language processing(NLP) in the last few years, most AI-driven conversation applications today rarely resemble human dialogs and focus on basic question-answering or command-action patterns. A good percentage of human’s conversations deviate from that pattern of digital assistants and focus on expressing opinions about a specific subject. Last year, IBM Research unveiled Project Debater as one of the first artificial intelligence(AI) agents that can debate complex topics at a human level. Yesterday, IBM announced the first application of Project Debater: Speech by Crowd which crowdsources opinions about a specific arguments and formulates complex points of views about both sides of a debate. Speech by Crowd is the materialization of an impressive list of AI Research on different NLP areas.