
Distilled News

VisualDL is a deep learning visualization tool that can help design deep learning jobs. It includes features such as scalar, parameter distribution, model structure and image visualization. It is currently being developed at a high pace, and new features will be added continuously. At present, most DNN frameworks use Python as their primary language, and VisualDL supports Python natively. Users can get plentiful visualization results by simply adding a few lines of Python code to their model before training. Besides the Python SDK, VisualDL is written in C++ at the low level and also provides a C++ SDK that can be integrated into other platforms.

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

MatchZoo is a toolkit for text matching. It was developed with a focus on facilitating the design, comparison and sharing of deep text matching models. There are a number of deep matching methods, such as DRMM, MatchPyramid, MV-LSTM, aNMM, DUET, ARC-I, ARC-II, DSSM, and CDSSM, designed with a unified interface (collection of papers: awesome-neural-models-for-semantic-match). Potential tasks related to MatchZoo include document retrieval, question answering, conversational response ranking, paraphrase identification, etc. We are always happy to receive any code contributions, suggestions, comments from all our MatchZoo users.

Auto-Keras is an open source software library for automated machine learning (AutoML). It is developed by DATA Lab at Texas A&M University and community contributors. The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. Auto-Keras provides functions to automatically search for architecture and hyperparameters of deep learning models.

This repository contains the source code for TensorFlow Privacy, a Python library that includes implementations of TensorFlow optimizers for training machine learning models with differential privacy. The library comes with tutorials and analysis tools for computing the privacy guarantees provided. The TensorFlow Privacy library is under continual development, always welcoming contributions. In particular, we always welcome help towards resolving the issues currently open.

If you had to pick one deep learning technique for computer vision from the plethora of options out there, which one would you go for? For a lot of folks, including myself, convolutional neural network is the default answer. But what is a convolutional neural network and why has it suddenly become so popular? Well, that’s what we’ll find out in this article! CNNs have become the go-to method for solving any image data challenge. Their use is being extended to video analytics as well but we’ll keep the scope to image processing for now. Any data that has spatial relationships is ripe for applying CNN – let’s just keep that in mind for now.
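To make the core operation concrete, here is a minimal sketch of the sliding-window computation a convolutional layer performs, assuming no padding and a stride of 1. The function name conv2d, the toy image and the hand-picked kernel are illustrative assumptions and do not come from the article; a real CNN learns its kernels from data.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical hand-picked kernel that responds to left-to-right intensity increases.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
print(feature_map)
```

The feature map is nonzero only where the window straddles the edge, which is the sense in which a convolution exploits spatial relationships in the data.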

Big Data is currently a hot topic of research and development across several business areas mainly due to recent innovations in information and communication technologies. One of the main challenges of Big Data relates to how one should efficiently handle massive volumes of complex data. Due to the notorious complexity of the data that can be collected from multiple sources, usually motivated by increasing data volumes gathered at high velocity, efficient processing mechanisms are needed for data analysis purposes. Motivated by the rapid growth in technology, development of tools, and frameworks for Big Data, there is much discussion about Big Data querying tools and, specifically, those that are most appropriate for specific analytical needs. This paper describes and evaluates the following popular Big Data processing tools: Drill, HAWQ, Hive, Impala, Presto, and Spark. An experimental evaluation using the Transaction Processing Performance Council (TPC-H) benchmark is presented and discussed, highlighting the performance of each tool, according to different workloads and query types.

A huge amount of data is generated every day on various platforms such as Wikipedia, technical and non-technical blogs, social media, online news articles, etc. Around five million articles are present in Wikipedia alone, and thousands of new articles are added to it every day. Because of the huge amount of data gathered every day, users are bombarded with a large volume of information that is difficult for a human being to assimilate. Effective techniques are therefore required to help users assimilate the data and make it available for their use.

In application programming, we have debugging and error checking statements like print, assert, try-catch, etc. But when it comes to deep neural networks, debugging becomes a bit tricky. Fortunately, Convolutional Neural Networks (ConvNets or CNNs) have inputs (images) which are visually interpretable by humans, so we have various techniques for understanding how they work, what they learn and why they work in a given manner. For other deep neural network architectures, visualizations are even more difficult. Nonetheless, visualizing convnets gives us good intuition about the world of neural networks. In this post, we will go deeper into convnets and understand how image classifiers work and what actually happens when we feed a 224x224x3 image into a convnet.

This post is intended to be a best practices guide to individual contributor data science careers. This section picks up from my summary on launching and scaling data science teams. I am targeting a data scientist who has developed a standard toolkit, via any source outside of industry. IMO, there are exhaustive amounts of resources to scale ICs from zero to Kaggle master/interview ready. This post is the continuation of those resources, for data scientists asking this question: ‘I’ve learned the skills. I’ve made it through interviews. It’s my first day at ______. Now what?’

This post is intended to be a best practices guide to managing a data science team. This section picks up from my summary on Launching and Scaling Data Science Teams, and contains some overlap with the Data Science IC playbook. I am targeting a data scientist who has had success as an individual contributor, and is currently struggling with classic managerial questions: how do I scale, hire, evaluate, communicate? Within data science, this is an even tougher problem given the massive differences in the backgrounds of qualified candidates. This post is for people asking the question: ‘I’ve had success as a data science IC and gained responsibilities. I’ve been tasked with growing and leading a team at ________. Now what?’

This post is intended to be a best practices guide to interacting with a data science team. This section picks up from my summary on Launching and Scaling Data Science Teams, and contains some overlap with the Data Science IC and Data Science Manager playbooks. I am targeting a business stakeholder (PM, ops director, business analyst, CMO) who depends heavily on their data science team, but is having trouble communicating with individuals and integrating with the team’s workflow. This post is for people asking the question: ‘I know I need better results out of collaborations with my data science team at ___________, but I don’t know how to get them, or what I can do to improve.’

This is the final output of a Fall semester independent project on data science leadership and organization within companies. I am in the second year of my MBA at Harvard Business School, and Kris Ferreira in the Technology and Operations (TOM) unit is my advisor. I interviewed 29 people (DS individual contributors, DS managers, non-technical business leaders) across 13 companies. I used those interviews to develop a best practices playbook for launching and scaling data science teams and careers: one for data scientists, one for data science managers, one for business stakeholders.

Time series analysis is one of the fastest growing areas across disciplines such as machine learning or probabilistic programming. The emergence of markets such as the Internet of Things (IoT) or social networks has increased the relevance of time series infrastructure to power analysis over real-time data. The importance of time series analysis has influenced the release of open source stacks such as Graphite or Prometheus. However, many of the top internet companies have regularly outgrown those stacks and pursued the path of building their own time series infrastructure. Uber is one of the companies that have contributed the most to the time series data infrastructure space. Earlier this year, the transportation giant decided to open source the stack that has been powering their time series analysis for years: M3.

M3, a metrics platform, and M3DB, a distributed time series database, were developed at Uber out of necessity. After using what was available as open source and finding we were unable to use them at our scale due to issues with their reliability, cost and operationally intensive nature, we built our own metrics platform piece by piece. We used our experience to help us build a native distributed time series database, a highly dynamic and performant aggregation service, query engine and other supporting infrastructure.

Blackbox algorithms can be loosely defined as algorithms whose output is not easily interpretable or is non-interpretable altogether, meaning you get an output from an input but you don’t understand why. Quite apart from GDPR’s reinforcement of the ‘right to explanation’, avoiding blackbox algorithms is a matter of transparency and accountability, two values that any company should strive for.

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods [1-7] and representing the only possible consistent and locally accurate additive feature attribution method based on expectations (see the SHAP NIPS paper for details).
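The game-theoretic idea behind such attributions can be sketched with exact Shapley values on a toy problem. This is only an illustration of the underlying definition, not the SHAP library's API: the function shapley_values, the feature names and the toy value function are all hypothetical, and brute-force enumeration of orderings is feasible only for a handful of features.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


def toy_value(coalition):
    """Hypothetical 'model output' when only the given features are known."""
    v = 0.0
    if "age" in coalition:
        v += 10.0
    if "income" in coalition:
        v += 20.0
    if "age" in coalition and "income" in coalition:
        v += 6.0  # interaction effect, split evenly by the Shapley rule
    return v


phi = shapley_values(["age", "income", "zip"], toy_value)
print(phi)
```

The attributions add up to the difference between the value of the full feature set and the value of the empty set, which is the additivity property the paragraph above refers to as local accuracy.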

A time series is a series of data that is indexed in time order. The time order can be expressed in days, weeks, months or years. The most common way to visualize time series data is to use a simple line chart, where the horizontal axis plots the increments of time and the vertical axis plots the variable that is being measured. The visualization can be achieved using geom_line() in ggplot2 or simply using the plot() function in base R. In this tutorial, I will introduce a new tool to visualize time series data called the Time-Series Calendar Heatmap. We will look at how Time-Series Calendar Heatmaps can be drawn using ggplot2. We will also explore the calendarHeat() function written by Paul Bleicher (released as open source under the GPL license), which provides an easy way to create the visualization.
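The data preparation behind a calendar heatmap can be sketched independently of the plotting library: each daily value is assigned to a cell identified by its week and its day of the week. The sketch below uses Python rather than R, and the function name calendar_grid and the sample values are assumptions made for illustration; it is not how calendarHeat() is implemented.

```python
from datetime import date, timedelta


def calendar_grid(start, values):
    """Map consecutive daily values to (ISO year, ISO week, ISO weekday) cells."""
    cells = {}
    for offset, v in enumerate(values):
        day = start + timedelta(days=offset)
        iso_year, iso_week, iso_weekday = day.isocalendar()
        # Weeks become heatmap columns, weekdays (1 = Monday) become rows.
        cells[(iso_year, iso_week, iso_weekday)] = v
    return cells


start = date(2019, 1, 1)   # a Tuesday
values = list(range(14))   # hypothetical daily measurements
grid = calendar_grid(start, values)
for key in sorted(grid):
    print(key, grid[key])
```

A plotting layer then only has to color each cell by its value, with weeks along one axis and weekdays along the other.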

About 7 years ago, while still in high school, working as a web developer and studying psychology as a hobby, I stumbled upon an article about artificial neural networks. It was exciting. Right after I had finished reading it, I started looking for a theory of intelligence that can explain what I had already learned about human intelligence and somehow connect it with artificial counterparts. I’ve looked into psychology, neuroscience, cybernetics, cognitive science, computer science, biology, chemistry, physics, theology, sociology and many other fields. Thousands of articles and papers, hundreds of books and dozens of courses later, I still haven’t found the answer that would satisfy me. So this article was born.

Data preprocessing is an integral step in ML, as the quality of the data and the useful information that can be derived from it directly affect the ability of our model to learn. So it is extremely important that we preprocess our data before feeding it into our model.
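Two common preprocessing steps, filling in missing values and rescaling a feature, can be sketched in a few lines. The helper names impute_mean and standardize and the sample column are hypothetical; mean imputation and standardization are only two of many possible choices and are not prescribed by the post.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]


def standardize(column):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


raw = [2.0, None, 4.0, 6.0]    # hypothetical feature with one missing value
clean = impute_mean(raw)
scaled = standardize(clean)
print(clean)
print(scaled)
```

In practice the mean and standard deviation would be computed on the training split only and then reused for validation and test data, so that no information leaks from the held-out sets.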

Inverse reinforcement learning (IRL) is the field of learning an agent’s objectives, values, or rewards by observing its behavior. For example, we might observe the behavior of a human in some specific task and learn which states of the environment the human is trying to achieve and what the concrete goals might be. IRL is a paradigm relying on Markov Decision Processes (MDPs), where the goal of the apprentice agent is to find a reward function from the expert demonstrations that could explain the expert behavior.

In a recent post I introduced three existing approaches to explain individual predictions of any machine learning model. In this post I will focus on one of them: Local Interpretable Model-agnostic Explanations (LIME), a method presented by Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin in 2016.
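The central idea of LIME, fitting a simple model that is weighted toward the neighborhood of one instance, can be sketched for a single feature. This is a simplified illustration and not the lime package's API: the black_box model, the function local_linear_explanation and its parameters are hypothetical, and the perturbations lie on a fixed grid so that the result is deterministic, whereas LIME samples them randomly.

```python
import math


def black_box(x):
    """Hypothetical opaque model: a nonlinear function of one feature."""
    return x * x


def local_linear_explanation(model, x0, width=0.5, n=21, span=1.0):
    """Fit a proximity-weighted linear surrogate to the model around x0."""
    # Perturbed copies of the instance, evenly spaced in [x0 - span, x0 + span].
    xs = [x0 - span + 2 * span * k / (n - 1) for k in range(n)]
    ys = [model(x) for x in xs]
    # Exponential kernel: perturbations close to x0 get more weight.
    ws = [math.exp(-((x - x0) ** 2) / (width ** 2)) for x in xs]
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    # Closed-form weighted least squares for a single feature.
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = local_linear_explanation(black_box, x0=3.0)
print(slope, intercept)
```

The slope of the surrogate is the local explanation: around x0 = 3 it approximates how strongly the black-box output reacts to the feature, even though the model is nonlinear globally.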

While computational modelling gets more complex and more accurate, its calculation costs have been increasing alike. However, working on big data environments usually involves several steps of massive unfiltered data transmission. In this paper, we continue our work on the PArADISE framework, which enables privacy aware distributed computation of big data scenarios, and present a study on how linear algebra operations can be calculated over parallel relational database systems using SQL. We investigate the ways to improve the computation performance of algebra operations over relational databases and show how using database techniques impacts the computation performance like the use of indexes, choice of schema, query formulation and others. We study the dense and sparse problems of linear algebra over relational databases and show that especially sparse problems can be efficiently computed using SQL. Furthermore, we present a simple but universal technique to improve intraoperator parallelism for linear algebra operations in order to support the parallel computation of big data.
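As a rough illustration of linear algebra expressed in SQL, a sparse matrix product can be written as a join followed by a grouped sum. The schema below (tables A and B in coordinate form with columns i, j, val) is an assumption made for this sketch and is not taken from the PArADISE paper; SQLite is used only because it ships with Python, not because the authors used it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Coordinate (i, j, val) representation: only nonzero entries are stored.
conn.executescript("""
    CREATE TABLE A (i INTEGER, j INTEGER, val REAL, PRIMARY KEY (i, j));
    CREATE TABLE B (i INTEGER, j INTEGER, val REAL, PRIMARY KEY (i, j));
""")
# A = [[1, 0], [0, 2]] and B = [[0, 3], [4, 0]]
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [(0, 0, 1.0), (1, 1, 2.0)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [(0, 1, 3.0), (1, 0, 4.0)])

# C[i][j] = sum over k of A[i][k] * B[k][j]; the join pairs matching k values.
product = conn.execute("""
    SELECT A.i, B.j, SUM(A.val * B.val) AS val
    FROM A JOIN B ON A.j = B.i
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()
print(product)
conn.close()
```

Because zero entries never appear in the tables, the join touches only nonzero pairs, which suggests why sparse problems in particular can map well onto a relational engine.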

Speech processing is a very popular area of machine learning. There is a significant demand for transforming human speech into text and text into speech. This is especially important for the development of self-service in places such as shops, transport, and hotels. Machines increasingly replace human labor, and these machines should be able to communicate with us using our language. That’s why speech recognition is a promising and significant area of artificial intelligence and machine learning.

The objective of this research is to propose a probabilistic modeling approach that can quantify manufacturing process uncertainties, integrate machining models with experimental data to infer the performance metrics, and finally incorporate historical data into current and future analysis in a sequential manner. The output of this research demonstrates the applicability of Bayesian methodology to the area of product design and manufacturing processes. In this regard, the probabilistic approach can reduce the cost of expensive and hazardous experiments by incorporating past and current information into the future analysis.

Music psychological research has either focused on individual differences of music listening behavior or investigated situational influences. The present study addresses the question of how much of people’s listening behavior in daily life is due to individual differences and how much is attributable to situational effects. We aimed to identify the most important factors of both levels (i.e., person-related and situational) driving people’s music selection behavior. Five hundred eighty-seven participants reported three self-selected typical music listening situations. For each situation, they answered questions on situational characteristics, functions of music listening, and characteristics of the music selected in the specific situation (e.g., fast – slow, simple – complex). Participants also reported on several person-related variables (e.g., musical taste, Big Five personality dimensions). Due to the large number of variables measured, we implemented a statistical learning method, percentile-Lasso, for variable selection, which prevents overfitting and optimizes models for the prediction of unseen data. Most of the variance in music selection behavior was attributable to differences between situations, while individual differences accounted for much less variance. Situation-specific functions of music listening most consistently explained which kind of music people selected, followed by the degree of attention paid to the music. Individual differences in musical taste most consistently accounted for person-related differences in music selection behavior, whereas the influence of Big Five personality was very weak. These results show a detailed pattern of factors influencing the selection of music with specific characteristics. They clearly emphasize the importance of situational effects on music listening behavior and suggest shifts in widely-used experimental designs in laboratory-based research on music listening behavior.