We describe BEEF, a computational framework that explains, in easily understood plain English, the evidence for and against a forecast made by a binary classifier, irrespective of the underlying classification engine. We will also provide a brief demonstration of BEEF on a diverse set of classification tasks using a diverse set of classifiers.

We present a novel approach to predict the confidentiality/sensitivity level of an organization’s documents based on their contents. Identifying sensitive information is critical to reduce information risk. We use Natural Language Processing and Machine Learning and show that we can accurately predict the confidentiality level of a document for 93% of the documents in our first use case.

As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. In this talk we walk through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.

In this session we cover how we selected and prioritized the components of a solution to leverage Spark, and the issues we faced while developing and testing the transformed solution with Spark code. The solution is an end-to-end multi-tenant enterprise application comprising several components: data transformations, quality checks, analytics, ML functions, and visualization.

We will guide attendees through the entire geospatial machine learning workflow. Attendees will be exposed to a variety of open source tools used to process, model, and visualize geospatial data. This workshop will focus on concepts unique to handling geospatial data such as spatial autocorrelation, lagged spatial features, and spatial partitioning.
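
To make two of those concepts concrete: spatial autocorrelation is commonly measured with Moran's I, and a lagged spatial feature is just a weighted average over each site's neighbors. Below is a minimal pure-Python sketch; the sites, values, and weight matrix are invented for illustration, and the workshop's own tooling is not assumed:

```python
def morans_i(values, weights):
    """Moran's I: global spatial autocorrelation of values under a weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

def spatial_lag(values, weights):
    """Row-standardized spatial lag: the mean value of each site's neighbors."""
    lag = []
    for row in weights:
        total = sum(row)
        lag.append(sum(w * v for w, v in zip(row, values)) / total if total else 0.0)
    return lag

# Four sites on a line; each site neighbors only its adjacent sites.
vals = [1.0, 2.0, 3.0, 4.0]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(morans_i(vals, W))    # positive: similar values sit next to each other
print(spatial_lag(vals, W))
```

The positive Moran's I reflects that neighboring sites carry similar values; the lag column is exactly the kind of derived feature a spatial model can train on.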

Productionalizing a ML model doesn't need to be an exercise of learning a complex workflow system. Instead, by decorating your functions with the provenance library you can quickly setup a pipeline with serialization and provenance (lineage) tracking. Using the same system you can also share models and features to facilitate team collaboration in a research setting. Learn how in this talk!
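
To illustrate the idea of lineage tracking via decorators, here is a hypothetical, homemade sketch in plain Python. It is not the provenance library's actual API, just the pattern such a library automates (serialization, caching, and artifact stores are omitted):

```python
import functools
import hashlib
import pickle

LINEAGE = {}  # artifact id -> (producing function, inputs)

def tracked(fn):
    """Record which function and which inputs produced each return value."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        key = hashlib.sha256(pickle.dumps(result)).hexdigest()[:12]
        LINEAGE[key] = (fn.__name__, repr((args, kwargs)))
        return result
    return wrapper

@tracked
def clean(data):
    return [x for x in data if x is not None]

@tracked
def featurize(data):
    return [x * 2 for x in data]

features = featurize(clean([1, None, 2]))
print(features)  # [2, 4] -- and LINEAGE now records how each artifact was made
```

With every pipeline step decorated, any artifact can be traced back to the function and arguments that produced it, which is the collaboration story the talk describes.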

In aggregate, billions of individuals click, read, watch, and share hundreds of millions of pieces of unique content every day. A widely-used content/audience analytics platform has a unique dataset in this area. In this talk we will ask and answer this question: can studying Internet traffic today help to predict web-wide attention -- and thus real-world events -- tomorrow?

A typical Security Operation Center (SOC) employs security analysts who monitor security logs from heterogeneous devices. By analyzing that data, the analysts identify whether there is a security threat and how to respond to it. Visualizing this large-scale data in a succinct, human-digestible form can reduce their cognitive load and enable them to operate more efficiently.

Text brings challenges to NLP, information extraction, and supervised or unsupervised learning that are nonexistent with structured data. Here, we present four scenarios and the unique approaches developed to deal with them.

Employing machine learning in practice is half search, half expertise, and half blind luck. In this talk we will explore how to make the luck half less blind by using visual pipelines to steer model selection from raw input to operational prediction. We will look specifically at extending transformer pipelines with visualizers for sentiment analysis and topic modeling text corpora.

Data scientists often hit roadblocks when "productionizing" their machine learning models. This talk is about making that "last mile" of analysis easier by leveraging a popular workflow tool called Airflow. We'll walk through how Nest uses it to build and deploy machine learning models for fraud detection and then discuss more generally the unique benefits it provides to Pythonic data scientists.

We are firmly in the world of "big data," where more data is almost always considered better. But sometimes our instincts as programmers or data scientists run afoul of laws that decree that too much data, or data from the wrong source, is illegal. This talk is an exploration of three legal situations - ownership, provenance, and privacy - where the law restricts which data we can use.

Medicare payments, UPC code descriptions, fertility rate and fires. All of it is data, some of which is erroneous and some of which is anomalous. Seeking Exotics introduces the audience to the world of outliers and anomaly detection through the use of metrics, visualizations and open source machine learning tools.
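
To give a flavor of the metrics involved, here is a minimal, library-free outlier detector based on the median absolute deviation; the payment figures below are invented for illustration:

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

payments = [120, 115, 130, 118, 125, 122, 9800]
print(mad_outliers(payments))  # [9800] -- the exotic payment stands out
```

Unlike a mean-and-standard-deviation rule, the median-based score is not dragged toward the outlier by the outlier itself, which matters when the anomalies are as extreme as the ones in datasets like these.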

Given the diverse talents and skill sets of today’s Data Scientists, it is time for an analytic platform where you do not have to choose a single approach. To be viable in the open ecosystem of today’s economy and analytics, methods have to be open and integrated. You should not have to choose between analytics languages like Python, R, or SAS. Find out how you can literally have it all.

Real estate transactions are geographically and temporally sparse. There is often both a listing and a selling agent. Pricing models typically rely on physical parameters; there has been little work done in assessing the contribution of the realtor. A realtor 'farm' may be discoverable by cluster identification, and analyzed for negotiation strength in listing and sales prices.

Cloud service providers are vying for your attention in the Machine Learning space. Both Amazon and Microsoft are feverishly working on creating compelling solutions for developers to build intelligent solutions upon. Is one better than the other? What are each one’s strengths and weaknesses?

We present a case study on how our data team models and optimizes the behavioral patterns of consumers seeking legal services. By linking together the regional offline & online consumer data into a common model, we are able to optimize digital advertising investment, and take advantage of time-of-day and location-based opportunities, in arguably the most competitive Google Adwords vertical.

In this talk, you'll learn the weirdest edge cases in the United States addressing system: the hierarchy between city and state, the sublime beauty behind the ZIP code, and the constituent parts of an address.

We'll cover how to compare addresses in a way that gives you an F-score you'll be proud of.
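
As one illustrative framing (an assumption for this sketch, not necessarily the talk's method): treat each address as a bag of normalized tokens and score the overlap with precision, recall, and their harmonic mean, the F-score.

```python
def tokens(address):
    """Lowercase, strip punctuation, split into a set of tokens."""
    cleaned = "".join(c if c.isalnum() else " " for c in address.lower())
    return set(cleaned.split())

def f_score(predicted, reference):
    """Token-level F1 between two address strings."""
    p, r = tokens(predicted), tokens(reference)
    if not p or not r:
        return 0.0
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)

print(f_score("123 Main St., Springfield", "123 Main Street, Springfield"))  # 0.75
```

The 0.75 here is exactly the kind of penalty (for "St" vs "Street") that smarter normalization of abbreviations would recover, which is where the real work lies.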

We will describe the python package pomegranate, which implements flexible probabilistic modeling. We will highlight several supported models including mixtures, hidden Markov models, and Bayesian networks. At each step we will show how the supported flexibility allows for complex models to be easily constructed. We will also demonstrate the parallel and out-of-core APIs.

Inspiration for this talk comes from Net Promoter Survey data and the drive to create a better Net Promoter System for a specific business context. This talk will focus on the Net Promoter Scores and examine a few ways to enhance the insights gained by building a context-rich Net Promoter System.

Data has evolved beyond reporting to underpinning technology that guides policy decisions and transforms service delivery to citizens. While the power of data is no longer questioned, we continue to struggle with the lack of capacity to use data to support missions. A collaborative effort to build open-source data infrastructure is imperative to support the responsible use of data.

After anomalous network traffic has been identified, there can still be an abundance of results for an analyst to process. This talk is for Data Science and Network Security professionals who want to increase the signal-to-noise ratio through feature extraction and post-processing of output.
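
As a hypothetical flavor of such post-processing (the field names, scores, and threshold below are all invented for illustration), one might collapse raw alerts per source and surface only the sources whose aggregate score clears a bar:

```python
from collections import defaultdict

def prioritize(alerts, min_score=5):
    """Aggregate per-source alert scores and keep only high-scoring sources."""
    totals = defaultdict(int)
    for alert in alerts:
        totals[alert["src"]] += alert["score"]
    # Highest aggregate score first; low-scoring sources are filtered out.
    return sorted((s for s, t in totals.items() if t >= min_score),
                  key=lambda s: -totals[s])

alerts = [
    {"src": "10.0.0.5", "score": 3},
    {"src": "10.0.0.5", "score": 4},
    {"src": "10.0.0.9", "score": 1},
]
print(prioritize(alerts))  # ['10.0.0.5'] -- only one source clears the threshold
```

The point of the talk is choosing the features and thresholds so that the surviving list is short enough for a human analyst to act on.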

The entire data science process can be organized into multiple steps/phases, and it is helpful to establish a standardized workflow for team members to collaborate effectively and generate valuable results. In this presentation, we will provide a detailed walk-through of seven phases of the data science process.

After the model-building process, we often have a black box that can be used for prediction.

The usefulness of the model can still be questionable unless one understands the true behavior of the algorithm. As machine learning models are increasingly adopted to solve real-world problems, one needs to look beyond just wins and losses and seek more detailed information about the model’s behavior.
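
One common way to get that more detailed information is permutation importance: shuffle a single feature and measure how much the model's accuracy drops. Here is a minimal sketch; the toy model and data are invented for illustration:

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, seed=0):
    """Accuracy drop when one feature column is randomly shuffled."""
    rng = random.Random(seed)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    X_perm = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, column)]
    return accuracy(model, X, y) - accuracy(model, X_perm, y)

# Toy black box: predicts 1 when feature 0 is positive, and ignores feature 1.
model = lambda row: int(row[0] > 0)
X = [[1, 9], [-1, 9], [2, 9], [-2, 9], [3, 9], [-3, 9]]
y = [1, 0, 1, 0, 1, 0]
print(permutation_importance(model, X, y, feature=0))  # shuffling the decisive feature hurts
print(permutation_importance(model, X, y, feature=1))  # 0.0: the model ignores feature 1
```

The gap between the two numbers is exactly the kind of behavioral evidence, beyond aggregate wins and losses, that the abstract calls for.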

There currently exists no comprehensive software complexity methodology that takes into consideration the different dimensions of software applications, allowing applications to grow unnecessarily complex as they mature. To assess the complexity of software, we strive to develop a model that treats these different dimensions as foundational features.

We shall study linearly separable and inseparable datasets, and then apply various clustering algorithms to them. This is a hands-on workshop where attendees will use our online learning platform, refactored.ai, to execute code on their laptops.

We aim to differentiate between two handwriting samples by modeling and learning between-writer variation and within-writer variation. To achieve this, we use a dataset of handwriting samples that is truly representative of the US population. We trained a Convolutional Neural Network (CNN) and tuned it by experimenting with several architectures and techniques.

With the increase in computing power, machine learning has become a method of choice for solving many real-world problems where analytical approximations were previously preferred for their speed. This creates great demand for machine learning software that is easy to use. We present an approach for constructing such methods in a systematic way, and demonstrate several tutorials that explain the essential building blocks and parameters of machine learning methods.

Sargur Srihari, University at Buffalo, The State University of New York
Audience level: Novice
Topic area: Modeling

This talk will discuss the current generation of AI methods, and how they differ from previous generation methods. In particular I will discuss algorithms which are based on discriminative/generative models and computational architectures consisting of a hierarchy of concepts known as deep learning.

Interpreting deep learning and machine learning models is not just another regulatory burden to be overcome. Practitioners, researchers, and consumers that use these technologies in their work and their day-to-day lives have the right to trust and understand AI. This talk is an overview of techniques for interpreting deep learning and machine learning models and telling stories from their results.

My talk is about how to set up and support national and international networks that work on a specific issue. I'll pick specific examples from my work in the anti-trafficking space and how I worked nationally and internationally. I'll choose specific tools I built and the problems they solved within the anti-trafficking space.

There is no single machine learning model that is best for all applications. In the process of building a malware classifier, Endgame used a bakeoff process in order to choose the model best suited for us. We will describe this process, how the results could be improved with further research, and the challenge of using machine learning for malware classification in general.

Data Version Control (DVC) is a new open source tool designed to help data scientists keep track of their ML processes and file dependencies in the simple form of git-like commands. This presentation walks you through an iterative process of building a machine learning model with DVC.

The ENCASE project aims to leverage the latest advances in web security and privacy to design and implement a browser-based architecture for the protection of minors from malicious actors in online social networks, by exploiting sentiment and affective analysis along with graph mining.

Current bioinformatics tools do not capitalize on the great advancements made in Machine Learning (ML) that could enable them to generate more, and more rapid, breakthroughs. Big Data technologies can facilitate the complete integration of heterogeneous sets of experimental data to identify key metabolic pathways and drug targets to enable precision medicine.

Penn Signals is an award-winning (https://goo.gl/MHqwVv) microservices software platform for processing real-time clinical data from a variety of systems. This talk demonstrates how the data science team at Penn Medicine has combined open source technologies that allow data scientists and researchers to create and use predictive applications to support improvements in health care.