Data Science Milan (http://datasciencemilan.org)
Data Science Milan is the community of data scientists and a major hub for professionals, academia and companies working in data science and based in the Milan area.
Continual/Lifelong Learning with Deep Architectures
http://datasciencemilan.org/2019/02/03/continual-lifelong-learning-with-deep-architectures/
Sun, 03 Feb 2019 16:40:36 +0000

“Towards Artificial Intelligence”

“Continual/Lifelong Learning with Deep Architectures”, by Vincenzo Lomonaco, PhD student, Author, Teaching Assistant

On 28th January 2019 at Mikamai, Data Science Milan organized a workshop on continual learning. Deep learning can solve multiple tasks all at once, but what happens if you introduce a new task?

In this talk Vincenzo Lomonaco explained the concepts behind continual/lifelong learning. Artificial intelligence requires the ability to learn tasks sequentially, but neural networks struggle with this: they suffer from “catastrophic forgetting”, a phenomenon in which a network trained sequentially on multiple tasks loses the knowledge achieved on previous ones, because the weights that matter for the current task get overwritten while learning the next one. The goal of continual learning is to overcome catastrophic forgetting, so that the architecture can smoothly update its prediction model across several tasks and data distributions.

There are several strategies to address this problem; the talk covered three: the Naïve strategy, the Rehearsal strategy and the Elastic Weight Consolidation strategy.

Vincenzo demonstrated these strategies in a hands-on workshop on the MNIST dataset, using Google Colaboratory and PyTorch. See the GitHub repository.

After training an initial model on the dataset to good accuracy (94%), he permuted the dataset and tried to use the same model to solve the new task, obtaining poor results.

The Naïve strategy simply continues the back-propagation process from one task to the next, fine-tuning the network on the new task alone. The Rehearsal strategy instead shuffles the data of the current task together with samples kept from previous tasks before training. The last method relies on regularization, updating weights so as to preserve the knowledge from previous tasks and avoid catastrophic forgetting: the Elastic Weight Consolidation (EWC) strategy estimates the importance of each weight via the Fisher information and adds a new regularization loss penalizing changes to the weights that mattered for previous tasks. Watch the video.
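As a rough sketch of the EWC idea (not the workshop’s notebook; the function and variable names here are invented for illustration), the penalty added to the new task’s loss looks like this:

```python
# Sketch of the Elastic Weight Consolidation (EWC) penalty:
# total loss = new-task loss + (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2
# where F_i is the Fisher information of weight i estimated on the old task
# and theta_star_i is the weight value after training on the old task.

def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic penalty keeping important weights close to their old values."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher)
    )

# Weights with high Fisher information are pulled back harder:
old_weights = [1.0, 2.0]
new_weights = [1.5, 2.5]
fisher = [10.0, 0.1]          # first weight mattered a lot for the old task
print(ewc_penalty(new_weights, old_weights, fisher, lam=1.0))  # ≈ 1.2625
```

During training on the new task this term is simply added to the ordinary loss before back-propagation.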

3D Point Cloud Analysis using Deep Learning
http://datasciencemilan.org/2018/10/28/3d-point-cloud-analysis-using-deep-learning/
Sun, 28 Oct 2018 17:14:34 +0000

“Deep Learning no limits”

On 17th October 2018 at Buildo, Data Science Milan organized an event about 3D image processing. Deep learning on 2D images has achieved good results on classification tasks thanks to Convolutional Neural Networks and the availability of data; now, 3D data is growing fast as well.

The talk presented several technologies used to manage 3D point clouds. But what exactly is a point cloud?

A point cloud is a set of points in a three-dimensional coordinate system. It is a very accurate digital record of an object or space, stored as a large number of points covering the surfaces of the scanned object.

The challenges of working with point clouds can be divided into neural-network challenges (the data is an unstructured grid ill-suited to CNN filters, the model must be invariant to permutations of the points, and the number of points changes depending on the sensor used) and data challenges (scanned models bring missing data, sensors introduce noise, and rotations of the same object produce different point clouds).

The Octree-based Convolutional Neural Network for 3D shape analysis (O-CNN) is built upon the octree representation of 3D shapes. It takes as input the averaged normal vectors sampled from the 3D model and performs 3D CNN operations on the octants occupied by the surface of the shape. O-CNN supports numerous CNN architectures and works for 3D shapes in different representations. See the GitHub repository.

The architectural core of PointNet is a single symmetric function: max pooling. The network learns a set of optimization functions/criteria that select informative points of the point cloud and encode the reason for their selection. The final fully connected layers of the network aggregate these learnt optimal values into a global descriptor for the entire shape (shape classification), or they are used to predict per-point labels (shape segmentation). Check out the code.
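The permutation invariance bought by the max-pooling symmetric function can be illustrated with a toy sketch (plain Python with invented feature values, not the actual PointNet code):

```python
import random

# Sketch of PointNet's core idea: a symmetric function (element-wise max)
# over per-point features makes the global descriptor invariant to the
# order of the points in the cloud.

def global_descriptor(point_features):
    """Element-wise max over per-point feature vectors."""
    return [max(dims) for dims in zip(*point_features)]

features = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]  # 3 points, 2 features each
shuffled = features[:]
random.shuffle(shuffled)

# Shuffling the points leaves the descriptor unchanged:
assert global_descriptor(features) == global_descriptor(shuffled)
print(global_descriptor(features))  # → [0.7, 0.9]
```

In the real network each point is first lifted to a high-dimensional feature vector by shared MLPs before the max is taken.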

SPLATNet is an architecture that processes point clouds without any pre-processing: it takes point clouds as input and computes hierarchical and spatial features with lattice filters, which also allows easy mapping of 2D information onto 3D shapes and vice versa. See the code.

Deep Time-to-Failure: predicting failures, churns and customer lifetime with RNN
http://datasciencemilan.org/2018/09/30/deep-time-to-failure-predicting-failures-churns-and-customer-lifetime-with-rnn/
Sun, 30 Sep 2018 06:11:24 +0000

“Survival analysis in evolution”

On 20th September 2018 at Spirit De Milan, Data Science Milan organized an event as part of IBM #Party Cloud: Deep Time to Failure. Machinery and customers are assets for companies, and both are subject to failure: breakdowns for machinery and churn for customers.

Predicting failures requires a survival study, and in the first part of his talk Gianmario explained the traditional methods for survival analysis.

Survival analysis is used to analyse data in which the time until an event is of interest. The response is often referred to as the failure time, survival time or event time.

The survival function S(t) gives the probability that a subject will survive past time t and has the following properties:

-Monotonically decreasing;

-Right-continuous;

-The probability of surviving past time 0 is 1; as time goes to infinity, the survival curve goes to 0.

In theory, the survival function is smooth. In practice, we observe events on a discrete time scale (days, weeks, etc.).

The survival model can be described by the hazard function h(t), which is the instantaneous rate at which events occur given no previous events, or by the cumulative hazard function H(t), which describes the accumulated risk up to time t.

Given any one of the functions S(t), H(t) and h(t), it is possible to derive the other two, and from them the time-to-failure, namely the remaining working time of a device or other product.
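For reference, the standard relations between the three functions can be written as:

```latex
h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t), \qquad
H(t) = \int_0^t h(u)\,du = -\log S(t), \qquad
S(t) = e^{-H(t)}
```

where f(t) is the probability density of the event time.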

With incomplete raw data (truncated or censored), naive empirical estimators will not produce good results. In this scenario two techniques are available: the Kaplan-Meier product-limit estimator, which generates a survival distribution function, and the Nelson-Aalen estimator, which generates a cumulative hazard rate function.
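A from-scratch sketch of the Kaplan-Meier product-limit estimator follows (libraries such as lifelines provide production implementations; the data here is invented):

```python
# Each observation is (time, event_observed); event_observed=False means
# the subject was right-censored at that time (it left the study alive).

def kaplan_meier(observations):
    """Return [(event_time, survival_probability)] at each observed event time."""
    obs = sorted(observations)
    at_risk = len(obs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        deaths = removed = 0
        while i < len(obs) and obs[i][0] == t:   # group ties at the same time
            deaths += obs[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk     # product-limit update
            curve.append((t, survival))
        at_risk -= removed                       # censored subjects leave the risk set
    return curve

data = [(2, True), (3, True), (4, False), (5, True), (6, False)]
print(kaplan_meier(data))   # survival steps down to 0.8, 0.6, 0.3
```

Censored subjects contribute to the risk set up to their censoring time but never trigger a drop in the curve, which is exactly how the estimator handles incomplete data.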

The survival distribution can also be estimated by making parametric assumptions: for this task the Weibull distribution was used, as it applies to many real-world use cases.

Kaplan-Meier and Nelson-Aalen are examples of univariate analysis, useful when the predictor variable is categorical.

An alternative method is Cox proportional hazards regression, which works both for quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model can simultaneously assess the effect of several risk factors on survival time. The idea behind the Cox model is to separate the estimation of the heterogeneity parameters on one hand from the baseline hazard function on the other. When the proportional hazards assumption is not satisfied, it is possible to turn to Aalen’s additive model, where coefficients can be parametric, semiparametric or nonparametric.

In the second part of the talk Gianmario went deeper into the wtte-rnn application (Weibull Time-To-Event RNN).

In time-to-failure modelling the Weibull distribution gives a failure rate proportional to a power of time. It is flexible and described by two parameters, α and β: the first is the scale parameter of the distribution and the second is the shape parameter.

-β<1 indicates that the failure rate decreases over time;

-β=1 indicates that the failure rate is constant over time and the distribution reduces to an exponential;

-β>1 indicates that the failure rate increases with time;

-β=2 gives the Rayleigh distribution;

-3.5<β<4 approximates a Gaussian distribution.

The task is to estimate α and β with recurrent neural networks.
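A minimal sketch of the objective such a network would maximize (not the actual wtte-rnn code; names and numbers are illustrative) is the censored Weibull log-likelihood:

```python
import math

# For an observed failure at time t the contribution is log f(t);
# for a right-censored observation it is log S(t).
# alpha = Weibull scale parameter, beta = shape parameter.

def weibull_loglik(t, observed, alpha, beta):
    z = (t / alpha) ** beta
    if observed:                       # log f(t) = log h(t) + log S(t)
        return math.log(beta / alpha) + (beta - 1) * math.log(t / alpha) - z
    return -z                          # log S(t) = -(t/alpha)^beta

# With beta = 1 this reduces to the exponential special case:
print(weibull_loglik(2.0, True, alpha=2.0, beta=1.0))   # ≈ -1.693
print(weibull_loglik(2.0, False, alpha=2.0, beta=1.0))  # → -1.0
```

A network predicting (α, β) per sequence would be trained to maximize the sum of these terms over the dataset, which is how censored observations still contribute signal.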

Recurrent neural networks are a kind of neural network where outputs from previous time steps are taken as inputs for the current time step: feeding the output of one step back into the next creates a cycle.

RNNs are fit and make predictions over many time steps.

Considering multiple time steps of input (X(t), X(t+1), …), internal state (u(t), u(t+1), …) and output (y(t), y(t+1), …), the cycle can be unfolded: the outputs y(t) and states u(t) from the previous time step are passed into the network as inputs for processing the next time step, and the network does not change between the unfolded time steps. The same weights are used at every time step; only the outputs and the internal states differ.
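The weight sharing across unfolded steps can be sketched with a toy recurrent cell (plain Python with made-up weights, not a real framework cell):

```python
import math

# Sketch of an unrolled recurrent cell: the SAME weights (w_x, w_u) are
# applied at every time step; only the input and the internal state change.

def rnn_unroll(xs, w_x, w_u, u0=0.0):
    u, states = u0, []
    for x in xs:
        u = math.tanh(w_x * x + w_u * u)   # new state from input + old state
        states.append(u)
    return states

states = rnn_unroll([1.0, 0.5, -0.2], w_x=0.8, w_u=0.3)
print(len(states))  # one internal state per time step → 3
```

A real RNN replaces the scalars with weight matrices, but the unrolling and the weight sharing are exactly this.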

Gianmario then showed how wtte-rnn works on a practical application: a NASA dataset of jet engines.

From Kaggle to Enterprise Machine Learning
http://datasciencemilan.org/2018/07/18/from-kaggle-to-enterprise-machine-learning/
Wed, 18 Jul 2018 03:51:15 +0000

“Kaggle interacts with business process”

On 11th July 2018 at Cerved, Data Science Milan organized an event about Kaggle. This platform is well known in the data science community: you can find datasets, learn data science through exercises, compete with other data scientists and more; if you win, you can gain either money or a job!

“Kaggle – State of the art ML”, by Alberto Danese, Cerved

What is Kaggle?

It is the biggest predictive-modelling competition platform in the world, founded in 2010 and acquired by Google in 2017.

On this platform, enterprises come to look for predictive solutions to some of their problems, and data scientists from all over the world compete to propose the best-performing algorithms.

The platform allows companies to recruit the best data scientists, and researchers get to trial new technologies: Keras and XGBoost were tested on Kaggle before their success, and the same is happening with LightGBM.

How does it work?

Companies make real datasets available on the platform, with anonymized features, split into a train set and a test set. The first comes with the outcome and the second without it, because it is used to evaluate the predictive models: 20%-30% of the test set feeds the public leaderboard and the rest the private leaderboard.

You can build your predictive models in R, Python or Julia and submit the solution as a CSV file; kernels are also available to run your code and share it with everyone.

The job of a Kaggler is focused mostly on the machine-learning activity, whereas a data scientist works in a wider process that embeds machine learning but runs from the definition of the problem through the identification of data and algorithms, the engineering of the solution with pipelines and deployment, all the way to storytelling.

Is it worth it?

In Alberto Danese’s opinion Kaggle is worth trying because it is a very good data science platform where you can learn machine learning, try solutions, understand what works and what doesn’t, look at the available code and so on; the only weakness is that it requires time, because you are competing with Kagglers from all over the world.

The second talk presented a business case of machine learning applied in the real world: credit scoring at the Cerved rating agency.

Credit scoring is a statistical model that combines several financial characteristics into a single score to evaluate the default risk of an enterprise and assess a customer’s creditworthiness.

It works in a regulated framework, Basel II/III, an internationally agreed set of measures developed by the Basel Committee on Banking Supervision regarding the capital requirements of banks: banks must set aside proportional shares of capital based on the risk assumed, as evaluated by a rating tool.

Pillar I defines three approaches to evaluating credit risk: standard, foundation and advanced.

In the first approach banks do not develop any internal model and use ratings from external agencies for the minimal capital requirements; in the advanced approach, instead, banks develop an internal model to evaluate the expected loss (EL).

EL = PD x EAD x LGD

The Expected Loss is the amount expected to be lost on a credit risk exposure within a one-year timeframe.

PD: Probability of Default provides a likelihood assessment that a counterparty will be unable to pay back its debt obligations within a specified timeframe.

EAD: Exposure At Default is the expected outstanding amount following a default by a counterparty, taking account of any credit risk mitigation: drawn balances, any undrawn amounts of commitments, and contingent exposures.

LGD: Loss Given Default is the estimated loss on an exposure following a default by the counterparty: the share of an asset that is lost when a borrower defaults. The recovery rate, defined as (1-LGD), is the share of the asset that is recovered.
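The formula above is straightforward to turn into code; the figures below are invented purely for illustration:

```python
# EL = PD x EAD x LGD from the talk, as code.

def expected_loss(pd, ead, lgd):
    """One-year expected loss on a single credit exposure."""
    return pd * ead * lgd

# 2% default probability on a 1,000,000 exposure with 45% loss given default:
print(expected_loss(pd=0.02, ead=1_000_000, lgd=0.45))  # ≈ 9000
```

Each factor is estimated by its own model, and the product gives the capital figure the formula summarizes.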

The advanced model also requires the Unexpected Loss, calculated through formulas provided by the supervisory regulator.

The output of the model is a master scale of classes linked with a probability of default score.

As in Kaggle competitions, the goal is to use a machine learning model to calculate the probability of default, with accuracy/AUC as the evaluation metric. But while in a Kaggle competition you only need to optimize accuracy, in the real world you must respect rules defined by the regulator: a calibrated PD within an appropriate range, robustness of the model, the same parameters used to evaluate all counterparties, a transparent model, good data quality and understandability.

In a regulated sector, unsupervised machine learning can be used to decide how many models to build for each target market, via cluster analysis, correlation analysis and component analysis.

Feature selection is used to evaluate variables and customize the model, and supervised machine learning can be used as a benchmark to improve the model.

Both traditional and modern approaches can be used to define the PD master scale and its calibration, also drawing on econometric approaches.

Natural language processing is a branch of artificial intelligence that bridges humans and computers; it can be broadly defined as the automatic manipulation of natural language, like speech and text, by software. There are many ways to represent words in NLP, because you cannot feed text data directly to machine learning algorithms.

The first step is to transform raw text into numerical features by vectorizing words; there are several techniques:

-Bag of words: a way of extracting features from text as input for machine learning. It represents text by the occurrence of words from a vocabulary built over a corpus, encoded as a binary vector. It is called “bag of words” because it does not care about the order or structure of the words in the corpus.

-Hashing trick: a hash function can be used to map data of arbitrary size to a fixed-size vector of numbers. The hashing trick, or feature hashing, consists in applying a hash function to the features and using the hash values directly: the same input always yields the same output. A binary score or a count can be used to score each word. Hashing is a one-way process, which can sometimes be a problem because you cannot go back from the output to the input space, and collisions between mapped features can happen.

-TF-IDF: another approach is to rescale the frequency of words by how often they appear across all documents; this is called Term Frequency – Inverse Document Frequency. Term frequency scores how frequent a word is in the current document; inverse document frequency scores how rare the word is across the corpus. With TF-IDF, terms are weighted so that words carrying useful information are highlighted.
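The TF-IDF weighting just described can be sketched from scratch (a simplified version, without the smoothing that libraries such as scikit-learn apply by default):

```python
import math
from collections import Counter

# Minimal TF-IDF: tf = word count / document length,
# idf = log(number of documents / number of documents containing the word).

def tf_idf(docs):
    """Return one {word: tf-idf score} dict per document."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc.split()))
    scored = []
    for doc in docs:
        words = doc.split()
        tf = Counter(words)
        scored.append({w: (c / len(words)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scored

docs = ["the cat sat", "the dog sat", "the cat ran away"]
scores = tf_idf(docs)
# "the" appears in every document, so its idf — and hence its score — is zero:
print(scores[0]["the"])  # → 0.0
```

Words shared by every document are weighted down to nothing, while rare, discriminative words like "dog" get positive scores.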

The second level is word embeddings, whose goal is to generate vectors encoding semantics: individual words are represented by vectors in a predefined vector space. Here too there are several techniques:

-Word2vec: a neural network that tries to maximize the probability of seeing a word in a context window, with similarity measured as the cosine between two vectors. This is achieved by two learning models: the continuous bag-of-words (CBOW) model, which tries to predict a word from its context, and the continuous skip-gram model, which tries to predict the context from a word.

-GloVe: an extension of the word2vec idea that constructs an explicit word-context (word co-occurrence) matrix using statistics across the whole text corpus instead of predicting words, with gains in computation; it also uses cosine similarity.

-FastText: a library for learning word embeddings and text classification created by Facebook’s AI Research lab (FAIR). It reaches the same accuracy as the previous models but with better performance, a relationship that can be summed up as:

FastText : Word Embeddings = XGBoost : Random Forest
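Both word2vec and GloVe compare words through cosine similarity; here is a toy sketch with made-up three-dimensional vectors (real embeddings come from training and have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors: semantically close words get nearby directions.
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}
assert cosine(vec["king"], vec["queen"]) > cosine(vec["king"], vec["apple"])
```

Because cosine similarity depends only on direction, not magnitude, it is the standard way to ask "which words are used like this one?" in an embedding space.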

The last level is sentence embeddings, where the goal is to represent more than single words with vectors; in this case too, several models are available:

-Doc2vec: it works the same way as word2vec, but over a network of paragraphs and words, so a document can be thought of as another, document-unique word. Distributed Memory (DM) is similar to the CBOW model, and the Distributed Bag of Words (DBOW) is similar to skip-gram in word2vec.

-CNNs: Convolutional Neural Networks were born for computer vision and have more recently been applied to problems in natural language processing. They are basically composed of several layers of convolutions with nonlinear activation functions applied to the results. Convolutions are applied over the input layer to compute the output, resulting in local connections where each region of the input is connected to a neuron of the output. Each layer applies different filters and combines their results. The process starts by stacking word vectors into a matrix; filters scan the words, max pooling highlights the most important ones, and an LSTM layer can be added to keep the word order.

-LSTMs: Long Short-Term Memory networks are fancy recurrent neural networks with some additional features, among which a memory cell for every time step. One application of RNNs is Google search, which links a query with an item. The LSTM layer creates a new output from the input words, giving relevance to word order, and the next filter layers give relevance to the most important local features.
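A far simpler baseline than any of the models above, averaging the word vectors of a sentence, illustrates what a sentence embedding is (the tiny vectors below are invented; this is not doc2vec):

```python
# Naive sentence embedding: the mean of the word vectors it contains.

def sentence_embedding(sentence, vectors):
    """Average the embeddings of the known words in the sentence."""
    words = [w for w in sentence.split() if w in vectors]
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

vectors = {"good": [1.0, 0.0], "movie": [0.5, 0.5], "bad": [-1.0, 0.0]}
print(sentence_embedding("good movie", vectors))  # → [0.75, 0.25]
```

Models like doc2vec, CNNs and LSTMs exist precisely because this averaging baseline throws away word order, which they instead try to exploit.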

After the presentation, a demo was shown using a dataset from Kaggle and GloVe vectors, with the code available in a GitHub repository.

Operations Research and Optimization: Improving Decisions from Data
http://datasciencemilan.org/2018/06/10/operations-research-and-optimization-improving-decisions-from-data/
Sun, 10 Jun 2018 05:28:07 +0000

“Moving towards prescriptive analytics”

“Operations Research & Optimization: A New Dimension to Data Science”, by Andrea Taverna, Università

On 24th May 2018 at Mikamai, Data Science Milan organized another interesting meetup, this time about operations research: the application of scientific methods, techniques and tools in search of optimum solutions to problems.

Data science (DS) and operations research (OR) can be seen as complementary: the first is more focused on data, on how to extract information and knowledge from it to make decisions, while the second evaluates decisions and models them as part of the process, with the goal of finding optimal solutions. In this sense, operations research can be considered a new dimension of data science.

Interest in operations research has a flat trend, while data science and machine learning have been growing in recent years; but looking at the analytics maturity model in studies from PwC, Gartner and SAS, the field is moving towards prescriptive analytics, and operations research is positioned exactly in this fourth dimension. OR can contribute to data science in several ways:

-in machine learning there are sub-problems which are optimization problems;

-replacing some heuristic methods with exact methods;

-solving prescriptive analytical questions.

One application of prescriptive analytics concerns Mobile Edge Computing (MEC) networks: given an existing MEC with virtualization facilities of limited capacity and a set of mobile Access Points (APs) whose data traffic demand changes over time, the aim is to find plans for assigning AP traffic to MEC facilities that satisfy each AP’s demand without exceeding the MEC facilities’ capacity.

In the data-driven architecture there are two fundamental components: pre-processing, used to map the problem, and optimization, used to solve it by mathematical programming.

The last section introduced Pyomo, a Python module that allows users to formulate optimization problems directly in Python.

The first application was the knapsack problem: given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible; it can be seen as a profit-maximization problem.
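The talk formulated this with Pyomo, which needs an external solver to run; as a self-contained illustration of the same 0/1 knapsack problem, here is the classic dynamic-programming solution in plain Python (item data invented):

```python
# 0/1 knapsack by dynamic programming: best[c] holds the maximum value
# achievable with capacity c using the items processed so far.

def knapsack(items, capacity):
    """items: list of (weight, value); returns the maximum total value."""
    best = [0] * (capacity + 1)
    for weight, value in items:
        # iterate capacities downwards so each item is used at most once
        for cap in range(capacity, weight - 1, -1):
            best[cap] = max(best[cap], best[cap - weight] + value)
    return best[capacity]

items = [(2, 3), (3, 4), (4, 5), (5, 8)]  # (weight, value)
print(knapsack(items, capacity=5))  # → 8 (take the single weight-5 item)
```

A mathematical-programming formulation like Pyomo's generalizes beyond what DP handles well, but for this textbook variant the two approaches give the same optimum.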

The second application was a flight assignment problem: given a set of flights and the crew of an airline company, the goal is to create a weekly plan minimizing the overall cost.

The last example was a drone surveillance problem: given a number of camera-equipped drones and an area to be controlled, the goal is to optimize the number of drones needed to cover the whole area.

TensorFlow Dev Summit 2018 viewing party
http://datasciencemilan.org/2018/05/31/tensorflow-dev-summit-2018-viewing-party/
Thu, 31 May 2018 20:28:43 +0000

“Problems that were impossible are now possible”

On 22nd May 2018 at Fintech District, Data Science Milan, in collaboration with Google and BCG Italy, gathered its community to view together some of the main talks of the TensorFlow Dev Summit 2018, held last March in Mountain View (CA). TensorFlow is an open-source library for machine learning that started as a library for deep learning and neural networks; it is now a machine learning platform collecting many algorithms, with the goal of making them easier to use.

“Keynote”

TensorFlow represents a revolution in the field of machine learning and helps build artificial intelligence applications; problems that were impossible to solve before are now being solved with this technology.

TensorFlow has added value to many different areas such as astronomy (discovering a new planet), healthcare (helping to assess a person’s risk of cardiovascular disease from scans of the human eye), aviation (predicting the trajectory of a flight) and many other applications.

TensorFlow is at the forefront of machine learning, making all of this possible; it is a powerful, scalable platform that can solve challenging problems, and its popularity has grown in the last two years with several innovations, such as TensorFlow Hub, a library that helps developers share and reuse models.

TensorFlow.js is an open-source library you can use to define, train, and run machine learning models entirely in the browser, using JavaScript and a high-level layers API.

One example is TensorFlow Playground, an in-browser visualization of a small neural network that shows in real time all the internals of the network it is training.

The browser has become a development environment where you can share what you build with anyone via a simple link; people who open your app don’t have to install any drivers, and the app can access sensors like the microphone, camera and accelerometer, making it highly interactive.

In the livestream, Nikhil Thorat and Daniel Smilkov trained a model to control a Pac-Man game using computer vision and a webcam, entirely in the browser.

TensorFlow Lite is a lightweight library and set of tools for doing machine learning on embedded and small platforms, with a different architecture from TensorFlow’s: an interpreter runs on-device, there is a set of optimized kernels, and there are interfaces you can use to take advantage of hardware acceleration when it is available.

It is cross-platform: it supports Android and iOS, and also has support for Raspberry Pi and most other devices running Linux.

In the workflow you take a trained TensorFlow model, convert it to the TensorFlow Lite format using a converter, and then update your apps to invoke the interpreter using the Java or C++ APIs.

TensorFlow has enabled Coca-Cola North America to turn its loyalty marketing programs into a mobile-web platform. The pipeline starts with recognizing a pin code on a bottle cap via an OCR (Optical Character Recognition) system, applying CNNs (Convolutional Neural Networks) built with TensorFlow to train on and predict strings from images containing small character sets with lots of variance. An active-learning system, with a user-interface feedback loop, allows the model to gradually improve by returning corrected predictions to the training pipeline.

Reinforcement learning (RL) surged in popularity when DeepMind, which wasn’t owned by Google yet, published a paper in Nature (https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) coupling reinforcement learning with deep learning. The algorithm could equal or exceed human performance on some Atari games: it used only raw pixels as input and from those could devise a strategy, with no previous knowledge of the game rules.

Luca introduced the concepts of Agent, State, Action, Environment and Reward, which are all foundational to the theory of RL. He then explained the concepts of Markov Decision Process, policy, value function and Q-value function, and how quickly computing optimal policies becomes unfeasible, hence the need for function approximation.
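The tabular case, before function approximation becomes necessary, can be sketched with Q-learning on a toy two-state chain (the environment and all hyperparameters below are invented for illustration):

```python
import random

# Toy tabular Q-learning: states 0 and 1 on a chain, state 2 is a rewarding
# terminal state. Action 1 moves right (towards the goal), action 0 moves left.

def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in (0, 1)}        # Q-values per (state, action)
    for _ in range(episodes):
        s = 0
        while s != 2:
            # epsilon-greedy action selection
            a = rng.randrange(2) if rng.random() < epsilon else q[s].index(max(q[s]))
            s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == 2 else 0.0
            target = r if s2 == 2 else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])   # temporal-difference update
            s = s2
    return q

q = q_learning()
assert q[0][1] > q[0][0]   # the agent learns that moving right is better
```

With realistically large state spaces (e.g. raw Atari pixels) this table explodes, which is exactly why DeepMind replaced it with a deep network as a function approximator.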

For a detailed introduction on the topic one can look at the following references:

Daniele Cortinovis, physicist by training and data scientist at Orobix, then gave a great overview of the process of training an agent on some classic examples, like the cart-pole problem, Atari Breakout and Atari Pong, using PyTorch and OpenAI Gym.

General Data Protection Regulation (GDPR): a data science perspective
http://datasciencemilan.org/2018/03/26/general-data-protection-regulation-gdpr-data-science-perspective/
Mon, 26 Mar 2018 17:59:50 +0000

“Anyone has the right to protect their personal data”

Reviewed by Fabio Concina

On 25th May 2018, the General Data Protection Regulation (GDPR) will become fully enforceable in the European Union. This new regulation succeeds the Data Protection Directive, a two-decade-old directive that had grown outdated given the growth of information available online. The technology landscape was very different twenty years ago, when the Directive was first adopted; today, with the widespread use of social media, apps and the internet generally, personal data is being shared and transferred across borders more than ever before, and many felt the Directive was due for a review.

On 15th March 2018 at Buildo, Data Science Milan organized an event about the GDPR.

The GDPR applies to all processing of personal data by individuals or organizations (data controllers or data processors) carrying out their activities in the European Union. A major novelty is the accountability principle for personal data protection: information must be processed lawfully, fairly and in a transparent manner. Relevant aspects of the accountability principle include data protection by design and by default, clear roles and responsibilities, assessment of the risks and the adoption of measures to mitigate them. There is also a new role, the Data Protection Officer (DPO), who is at the heart of implementing the accountability principle and is responsible for data protection. The role is mandatory only for public administrations, for large-scale monitoring activities and for the treatment of sensitive personal data. The DPO is the point of contact between the company and the supervisory authority.

A violation of personal data (a “data breach”) is any event that puts at risk the personal data held by the data controller: a breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorised disclosure of, or access to, personal data transmitted, stored or otherwise processed. When a breach occurs, the data controller must notify the supervisory authority within 72 hours, and any delay in notification must be justified. If the risk is particularly high, the individuals concerned must be notified as well, and keeping a register of violations is mandatory. Administrative fines fall into two brackets: up to 10 million euro or 2% of total worldwide annual turnover, and up to 20 million euro or 4% of total worldwide annual turnover.

Iubenda is a service for generating privacy and cookie policies quickly; it covers not only websites but also mobile apps and Facebook apps.

Andrea then continued on the GDPR, explaining that companies must prepare internal documentation following specific guidelines, indicating the security measures adopted, how and which data are processed, the assignment of responsibilities, and the response times in case of customer requests. For example, companies must have an internal document with instructions for handling a data breach, and another with instructions for when customers want to exercise their rights. Central to this documentation is the corporate register of processing activities, a sort of corporate privacy statement: a written document, also in electronic format, containing information about the processing activities performed by the data controller. The register is mandatory for companies or organizations with more than 250 employees. It includes the categories of data processed and the interested parties involved, the purposes, the security measures, where the data are stored and for how long, and so on. A data protection impact assessment is a document to draw up when starting a new project, describing which data are involved and why, with an assessment of the risks and the measures to mitigate them. On the web and cookies side, a privacy statement must state the owner's details, the purposes, the third parties involved and the legal basis; it is also important to store offline forms and documents signed by customers, since disputes may arise over requests to receive promotions.

"Turn GDPR into an added value for your business" by Andy Petrella, Kensu

Andy introduced a perspective on data science catalogs and data science governance tools, and on how the GDPR can add value to the enterprise.

Data science is an umbrella over all activities on data; a data pipeline connects those activities from input to output, transforming data under several assumptions and technologies: an end-to-end processing line to solve one problem, to take one decision. Data science governance checks that data activity meets precise standards and involves monitoring production data activity: how accurate the model is, what the patterns are. The process involves technologies, users (who is responsible), sources, data and processing.

Many tools consume data and the number of processing activities keeps growing, so all this information is connected in a "data flow". This makes it possible to build a map, as a graph, of tools and processes: which data concern transactions, which concern customers, and so on. With this map one can carry out governance activities such as impact analysis, dependency analysis, pipeline optimization, and data/model recommendation.

The accountability principle of the GDPR requires adequate technical solutions and internal audits of processing activities. With data science governance you can monitor those activities (e.g. machine learning performance), maintaining a process registry with all the data involved and the tasks pursued, and so produce transparent reports across the whole chain of processing.
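As a toy illustration of how such a data-flow map enables impact and dependency analysis, the sketch below models a pipeline as a directed graph and walks it downstream. All node names and the graph itself are hypothetical, not Kensu's actual model.

```python
from collections import deque

# Hypothetical data-flow map: each node is a dataset, tool or process;
# an edge "A -> B" means B is produced from A.
flow = {
    "transactions_db": ["cleaning_job"],
    "customers_db":    ["cleaning_job"],
    "cleaning_job":    ["features_table"],
    "features_table":  ["churn_model", "fraud_model"],
    "churn_model":     ["weekly_report"],
}

def impact(node):
    """Impact analysis: everything downstream of `node` (BFS over the graph)."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in flow.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For instance, `impact("customers_db")` returns every model and report that a change to the customers source would affect — exactly the question a GDPR audit of processing activities needs answered.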

Banksealer: a decision support system for online banking fraud analysis
http://datasciencemilan.org/2018/02/19/banksealer-decision-support-system-online-banking-fraud-analysis/
Mon, 19 Feb 2018 22:00:00 +0000

"From black-box to white-box"

On 12th February 2018 at Buildo, Data Science Milan and Buildo opened the 2018 season of data science events with talks about online banking fraud detection.

Financial fraud is a broad term with several potential meanings; it can be defined as the intentional use of illegal methods to obtain financial gain. There are many different types of financial fraud, from credit card fraud to automobile insurance fraud, and advances in modern technologies such as the internet and mobile have led to an increase in financial fraud.

“What is Banksealer” by Daniele Gallingani, Buildo

Daniele presented Banksealer, an online banking fraud and anomaly detection framework used by analysts as a decision support system. It started in 2016 from research by Politecnico di Milano, sponsored by Secure Network and by Buildo.

It can be described as a decision support system for IT security teams: by aggregating historical transaction data it summarizes each customer's interaction with the e-banking system and, using advanced statistical and machine learning techniques, flags whether, and how, a transaction is atypical.

Typical frauds in a banking scenario range from phishing and credential database compromise to more advanced techniques.

The tool produces a real-time ranking: a high score blocks the transaction, while a low score lets it proceed — a device integrated with the bank's infrastructure.

Banksealer can be described as explainable machine learning with graphs: a dashboard visualizes much useful information for the analyst and, in another window, a top list of anomalous transactions.

“Banksealer Algorithms and Architecture” by Claudio Caletti, Buildo

In the second speech Claudio Caletti talked about software architecture and algorithms implemented in Banksealer.

Banksealer is a system that moves transactions through different states; the main entities in the tool are indeed transactions: bank transfers, payments, prepaid card transactions, phone recharges and so on.

The input transactions come from the bank, are processed by machine learning algorithms, and are then labelled in a transaction-scoring step before being forwarded to the output consumed by external systems.

The process can be split into three blocks: a block exposed to external systems, which imports transactions from banks in raw format and exports the scored transactions; a data block made of a relational database and Elasticsearch; and the Banksealer core, made of the machine learning models and the user interface (front-end and back-end).

All services in Banksealer are written in Scala, because it is a type-safe language. The components are split into generic components, the core of the system shared by all banks, and specific components, ad-hoc drivers built for each bank.

Banksealer's approach is based on three main algorithms; the first one, the local profile, is the most important.

The local profile works on single customers: it models each user's individual spending pattern in order to evaluate the anomaly of each new transaction. During training, transactions are aggregated by customer and each feature's distribution is approximated by a histogram.

The anomaly score of each new transaction is computed with the HBOS (Histogram-Based Outlier Score) method, which computes the log-likelihood of a transaction according to the learned marginal distributions. The HBOS score is a weighted sum over the normalized frequency histograms of the features; the weighting coefficients are tuned by the analyst and, in the upgraded version, computed by a genetic algorithm.

HBOS assumes feature independence, which makes it much faster than multivariate approaches at the cost of some precision; in fact it performs poorly on local outlier problems.
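A minimal sketch of this histogram-based scoring — not Banksealer's actual code; the bin count, the max-height normalization and the uniform default weights are simplifying assumptions:

```python
import numpy as np

def fit_histograms(transactions, bins=10):
    """Training: approximate each feature's marginal distribution
    with a frequency histogram (one per column)."""
    hists = []
    for j in range(transactions.shape[1]):
        counts, edges = np.histogram(transactions[:, j], bins=bins)
        heights = counts / counts.max()  # normalize so the tallest bin is 1
        hists.append((heights, edges))
    return hists

def hbos_score(x, hists, weights=None, eps=1e-6):
    """Scoring: weighted sum over features of log(1 / bin height),
    i.e. the negative log-likelihood under the marginal histograms."""
    if weights is None:
        weights = np.ones(len(hists))
    score = 0.0
    for j, (heights, edges) in enumerate(hists):
        # locate the bin of x[j]; values outside the range clip to edge bins
        idx = np.clip(np.searchsorted(edges, x[j], side="right") - 1,
                      0, len(heights) - 1)
        score += weights[j] * np.log(1.0 / max(heights[idx], eps))
    return score
```

Fitted on one customer's past transactions, a new transaction landing in dense bins scores near zero, while one in sparse bins — an atypical amount, hour or recipient country — scores high.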

The second algorithm is the global profile, which is useful for new users: it defines "classes" of spending patterns and mitigates the undertraining problem. Each user is represented by six components: total number of transactions, average transaction amount, total amount, average time span between subsequent transactions, number of transactions executed from overseas countries, and number of transactions to overseas recipients.

Customer profiles are clustered with DBSCAN (Density-Based Spatial Clustering of Applications with Noise) using the Mahalanobis distance. For each global profile, a CBLOF (Cluster-Based Local Outlier Factor) anomaly score is computed, which tells the analyst how uncommon a spending pattern is with respect to the closest customers. It measures how much the user's profile deviates from the dense cluster of "normal" users: small clusters are considered outliers with respect to large ones.
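A rough sketch of this clustering step using scikit-learn — the synthetic profiles, the DBSCAN parameters and the simplified one-centroid CBLOF variant are illustrative assumptions, not the actual Banksealer implementation:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical profile matrix: one row per customer, six columns for the
# six components listed above (counts, amounts, time spans, ...).
rng = np.random.default_rng(1)
profiles = np.vstack([
    rng.normal(0, 1, size=(200, 6)),  # the bulk of "normal" spending patterns
    rng.normal(8, 1, size=(8, 6)),    # a small, uncommon group
])

# DBSCAN with the Mahalanobis distance, as described in the talk.
VI = np.linalg.inv(np.cov(profiles.T))  # inverse covariance matrix
labels = DBSCAN(eps=3.0, min_samples=5, metric="mahalanobis",
                metric_params={"VI": VI}).fit_predict(profiles)

# Simplified CBLOF-style score: distance from the centroid of the largest
# cluster, so members of small clusters and noise points score high.
sizes = {c: int(np.sum(labels == c)) for c in set(labels) if c != -1}
centroid = profiles[labels == max(sizes, key=sizes.get)].mean(axis=0)

def cblof_score(x):
    return float(np.linalg.norm(x - centroid))
```

The Mahalanobis metric accounts for the very different scales of the six components (a transaction count versus a total amount), which a plain Euclidean distance would not.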

The third algorithm, the temporal profile, targets frauds that exploit many transactions within a time window, by comparing the current spending profile against each customer's history. During training, the mean and standard deviation of aggregated features (total amount, total and maximum daily number of transactions) are computed for each customer. At runtime, the cumulative value of each feature is computed per user and compared against the previously computed metrics.
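The runtime check can be sketched as a standard-score comparison against the customer's historical statistics; the feature used and the 3-sigma threshold below are illustrative assumptions:

```python
import numpy as np

def fit_temporal_profile(daily_totals):
    """Training: mean and standard deviation of a customer's past
    per-day aggregate (e.g. the total daily amount)."""
    return float(np.mean(daily_totals)), float(np.std(daily_totals))

def is_temporal_anomaly(cumulative_value, mean, std, threshold=3.0):
    """Runtime: flag the current window if its cumulative value sits more
    than `threshold` standard deviations above the customer's history."""
    z = (cumulative_value - mean) / max(std, 1e-9)  # guard against std == 0
    return z > threshold
```

A burst of many small transfers that is unremarkable transaction-by-transaction would still push the day's cumulative total far beyond the customer's historical mean and be flagged here.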

All these algorithms are merged into a single output ranking score.
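The talk did not detail the merging strategy; one conceivable scheme is a simple weighted fusion of the three profile scores, with entirely hypothetical weights:

```python
def final_score(local, global_, temporal, w=(0.5, 0.3, 0.2)):
    """Illustrative linear fusion of the local, global and temporal
    profile scores into the single ranking score shown to the analyst."""
    return w[0] * local + w[1] * global_ + w[2] * temporal
```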

Banksealer can be considered a white-box, unlike other similar (black-box) tools, because the analyst understands what is going on; it is not completely automated, it is easy to deploy, and it achieves a good false positive ratio.

The tool is mainly built around the HBOS algorithm, based on histograms that are easy for the analyst to understand; the ranking score also helps the analyst manage the number of reported transactions in the presence of false positives.