
Slava Razbash has worked in data science roles in multinational enterprises, startups and even a university. He has a solid track record that includes working in CBA’s big data team and helping start Sportsbet’s data science and personalisation capability. Slava is the Founder of the Enterprise Data Science Architecture Conference.

Reserve your place today at https://edsaconf.io because you must keep your skills current.

The problem that we are solving: Organisations have several places where different pieces of data are stored. We need information from each source to paint a complete picture that adds business value.

Imagine that we have the following data sources in our organisation:

The company data warehouse, with most of the information required by the finance and risk teams. However, it’s missing the fields that we need to build a cross-sell model.

The database with historical customer transactions – essential to your data science project. It’s owned by marketing and your team can’t have access.

The finance team’s special database. The company can’t calculate EBIT without it.

The company CRM.

The company CRM for B2B customers.

The web analytics data store. The web analytics team are kind enough to provide your team with monthly extracts.

The company data lake. It stores outputs from your team’s machine learning models. Some source system data has been loaded as well.

If we could have access to all of these pieces of information then we could build the best machine learning models, report the deepest insights and place our company firmly in first place. But how? Our data science team can’t get access to most of those databases, and copying them into the data lake is an ongoing two-year project. Data virtualisation could help.

The data virtualisation software will connect to and query our data sources. Our data users will connect to and query the data virtualisation software as if it were any other database. They will be able to query and join all of the data across all of the data sources.
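To make the idea concrete, here is a minimal sketch in Python. It is not a real data virtualisation product: it uses SQLite's ATTACH DATABASE to put two independent in-memory databases behind one connection, so a single query can join across them. The schema names (crm, sales), the tables and the sample rows are all assumptions made up for illustration.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Return one connection that exposes two separate databases."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent in-memory database. They stand in
    # for two source systems (hypothetical names: crm and sales).
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS sales")
    conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
    conn.execute(
        "CREATE TABLE sales.transactions (customer_id INTEGER, amount REAL)"
    )
    # Made-up sample data.
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Alice"), (2, "Bob")],
    )
    conn.executemany(
        "INSERT INTO sales.transactions VALUES (?, ?)",
        [(1, 20.0), (1, 5.0), (2, 12.5)],
    )
    return conn


def spend_per_customer(conn: sqlite3.Connection) -> list:
    """Join across both 'source systems' with one ordinary SQL query."""
    query = """
        SELECT c.name, SUM(t.amount) AS total_spend
        FROM crm.customers AS c
        JOIN sales.transactions AS t ON t.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()
```

The user only ever sees one connection and writes one query. In a real deployment the virtualisation layer would push parts of that query down to each source system rather than hold the data itself.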

Users will only need to apply for access to one system and their credentials only need to be removed from one system if they leave the company. The data virtualisation software may also be able to mask certain sensitive fields for certain types of users. For example, we can hide customer names from teams who don’t need to know them.
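Field masking can be pictured as a small policy applied to every row before it reaches the user. The sketch below is an illustration only: the role names, the field names and the deny-by-default rule are assumptions, not the behaviour of any particular product.

```python
from typing import Dict, FrozenSet

# Hypothetical policy: the fields each role may NOT see in clear text.
MASKED_FIELDS: Dict[str, FrozenSet[str]] = {
    "data_scientist": frozenset({"name", "email"}),
    "relationship_manager": frozenset(),
}


def mask_row(row: dict, role: str) -> dict:
    """Return a copy of the row with restricted fields replaced by '***'."""
    # Roles that are not in the policy get every field masked.
    hidden = MASKED_FIELDS.get(role, frozenset(row))
    return {key: ("***" if key in hidden else value) for key, value in row.items()}
```

For example, mask_row({"name": "Alice", "email": "a@example.com", "spend": 25.0}, "data_scientist") keeps the spend figure but hides the name and email.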

Data virtualisation is one piece of the picture. What you do with the data makes the difference between best practice and wasting money. As a specialist, you will have seen countless examples of adding value: increasing profit, saving lives, managing risk, automating manual labour. On the other hand, if you are a non-specialist, check out our example for non-specialists.

Productionise Properly – come see how it’s done. The Enterprise Data Science Architecture Conference focuses on how to properly productionise data science solutions at scale. We have confirmed speakers from ANZ Bank, Coles Group, SEEK, ENGIE, Latitude Financial, Microsoft, AWS and Growing Data. The combination of presentations is intended to paint a complete picture of what it takes to productionise a profitable data science solution. As an industry, we are figuring out how to best build end-to-end machine learning solutions. As the field matures, knowledge of best practices in end-to-end machine learning pipelines will become an essential skill. I invite you to view our list of confirmed speakers and talks at https://edsaconf.io because this is the right place to meet the right people and up-skill.

Meet the right people and up-skill. The conference will be on the 27th March at the Melbourne Marriott Hotel. It is fully catered, with coffee, lunch, morning and afternoon tea, and evening drinks and canapes. I invite you to reserve your place at https://edsaconf.io because this is the best place to learn the emerging best practices.


This article will show an example cloud solution for an end-to-end data science architecture. It’s based on Azure – there are other great vendors out there as well.

Data Science Architecture – An Azure Example

Let’s start to understand this by focusing on the “Data Lake Storage” component. We want to have all of our data in one place so that we can join data from various data sources. In this scenario, we have chosen to load everything into a Data Lake. We use Azure Data Factory to orchestrate the data loads from the Various Data Sources.

For the data lake, we will be using ELT rather than ETL. This means that we will load the data as-is and then transform it as required for each use case. There are several disadvantages to transforming the data when we load it, as ETL does. Some data science teams, the end users of ETL’ed data, have been hitting their heads on these disadvantages for years.

– The data becomes structured for a specific purpose, for example finance and risk reporting. Information that’s relevant for marketing and customer analytics can be lost. In the best case, the data has to be restructured again for the new purpose.

– The transformation step may have bugs, which leave the long-term data permanently mangled. The accounting department may have audited the aggregate profit and loss numbers, but it would only have checked that the data was fit for its own purpose. The bugs come out when you try to use the data for a new purpose.

– The transformation step requires a development team to write the transformation code. The organisation may not have the budget to write code for every field, so some fields may be left out.
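The ELT pattern can be sketched in a few lines of Python. This is an illustration only: a plain list stands in for the data lake, and the record fields (customer_id, amount, month, channel) and the two downstream views are made-up assumptions. The point is that the load step keeps every field, so a second team can derive its own view later without a reload.

```python
from collections import defaultdict

# Made-up source records. Note the "channel" field, which a
# finance-only ETL job might have dropped at load time.
RAW_EVENTS = [
    {"customer_id": 1, "amount": 20.0, "month": "2019-01", "channel": "email"},
    {"customer_id": 2, "amount": 12.5, "month": "2019-01", "channel": "web"},
    {"customer_id": 1, "amount": 5.0, "month": "2019-02", "channel": "web"},
]


def load(raw_store: list, events: list) -> None:
    """The 'L' step: append records unchanged, keeping every field."""
    raw_store.extend(dict(event) for event in events)


def revenue_by_month(raw_store: list) -> dict:
    """A transform written later for finance reporting."""
    totals = defaultdict(float)
    for event in raw_store:
        totals[event["month"]] += event["amount"]
    return dict(totals)


def purchases_by_channel(raw_store: list) -> dict:
    """A transform written later for marketing, using a field finance ignores."""
    counts = defaultdict(int)
    for event in raw_store:
        counts[event["channel"]] += 1
    return dict(counts)
```

Both transforms read the same untouched records. If one of them turns out to have a bug, we fix the code and rerun it; the stored data was never altered.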

We will be building and deploying our machine learning models within the Azure Machine Learning Services framework. Given that most of the underlying model training libraries are open source, you may be wondering “why?”. Azure Machine Learning Services gives us a nice workflow and an MLOps pipeline, and makes it easier to deploy our model somewhere customer facing. We also have the option of using Azure’s AutoML. AutoML could be a good first iteration for a supervised learning use case. In some organisations, your team will consist of only software engineers. In this case, AutoML might be your best option.

The code is written on the Data Scientist’s Local Machine, and the model can be trained either on the Local Machine or on cloud infrastructure such as the Training VM in the diagram above.

Our business has some kind of customer facing Application. The Application requests and receives real-time decisions from our machine learning models. Our machine learning models can be deployed to a Kubernetes Cluster. The Kubernetes Cluster can scale the number of containers as required by the demand on our system.
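As a rough picture of what runs inside each container, here is a minimal scoring sketch in Python. It does not use the Azure Machine Learning SDK; the toy logistic model, its coefficients, the feature names and the JSON request and response shapes are all assumptions for illustration.

```python
import json
import math

# Hypothetical coefficients for a toy logistic model.
WEIGHTS = {"recent_purchases": 0.8, "days_since_last_visit": -0.05}
BIAS = -1.0


def score(features: dict) -> float:
    """Return the model's probability that the customer accepts an offer."""
    # Missing features are treated as 0.0 in this sketch.
    z = BIAS + sum(
        weight * float(features.get(name, 0.0)) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(body: str, threshold: float = 0.5) -> str:
    """Turn a JSON request from the Application into a JSON decision."""
    features = json.loads(body)
    probability = score(features)
    return json.dumps(
        {"probability": round(probability, 4), "make_offer": probability >= threshold}
    )
```

Because handle_request keeps no state between calls, the cluster can run as many copies of it as demand requires.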

Some machine learning models may be used to predict who should get an email offer or who should be contacted by a relationship manager. These predictions are loaded into the data lake and also into our CRM.

Our Business Users require certain specific reports, which need to be refreshed regularly. For example, the accounting department will need to know profit and loss. The marketing teams will need to know how campaigns are performing. We can serve these reports as Dashboards. We will probably have some kind of Dashboard Server to serve and control access to the Dashboards from one central point.

The Dashboards will need specific, aggregated data. The data will need to be accessed quickly. For this purpose, we store certain materialised views in the Azure SQL Data Warehouse.
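The idea of a pre-computed aggregate can be sketched with SQLite from Python. SQLite has no materialised views, so a table built from a query stands in for one here, and the table names, columns and sample rows are made-up assumptions rather than anything from Azure SQL Data Warehouse.

```python
import sqlite3


def refresh_campaign_summary(conn: sqlite3.Connection) -> None:
    """Rebuild the pre-aggregated table that the dashboard reads."""
    # Stand-in for a materialised view: drop and recreate from the detail rows.
    conn.execute("DROP TABLE IF EXISTS campaign_summary")
    conn.execute(
        """
        CREATE TABLE campaign_summary AS
        SELECT campaign, COUNT(*) AS responses, SUM(revenue) AS revenue
        FROM campaign_events
        GROUP BY campaign
        """
    )


def demo() -> list:
    """Load made-up detail rows, refresh the summary and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE campaign_events (campaign TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO campaign_events VALUES (?, ?)",
        [("spring", 10.0), ("spring", 15.0), ("winter", 7.5)],
    )
    refresh_campaign_summary(conn)
    # The dashboard query touches only the small summary table.
    return conn.execute(
        "SELECT campaign, responses, revenue FROM campaign_summary ORDER BY campaign"
    ).fetchall()
```

The dashboard reads a handful of summary rows instead of scanning every event, which is what keeps it fast. The trade-off is that the summary is only as fresh as its last refresh.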

Why should we bother with all of this infrastructure? Our next article will present a SWOT analysis of data science projects in large organisations.
