The Power BI Analysis Services Connector can be used to connect from the Power BI service to your on-premises tabular Analysis Services models.

In this blog post I will look into the security and authentication possibilities the SSAS Connector offers. Dynamic row-level security based on the Power BI user name against an on-premises SSAS Tabular model is possible; I will show how you can set this up and how it works in the background.

If you want to know how to install and configure the SSAS Connector, follow this excellent support guide.

The SSAS Connector has some similarities with the BI Semantic Model Connection that’s available for SharePoint, which can connect either with Kerberos or by using the EffectiveUserName property. The SSAS Connector matches the latter approach:
- A connection to the tabular model is made with stored credentials that require administrator permissions on the SSAS server.
- The EffectiveUserName property is sent to SSAS to impersonate the current user. Only SSAS administrators have permission to connect using EffectiveUserName.
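As an illustration, a client connection string that passes the property could look like this (server and database names are hypothetical):

```
Data Source=SSASSRV01;Initial Catalog=SalesModel;EffectiveUserName=MACAW\jorgk
```

Only an SSAS administrator can open such a connection; SSAS then evaluates all security as if MACAW\jorgk were the connected user.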

To investigate exactly how the connector works I have created a tabular model with a very simple underlying database model. The Sales table will be secured based on dynamic row-level security on the content of the EffectiveUserName column in the Department table.

The tabular model contains the dataset shown in the Excel pivot below. In an on-premises situation my EffectiveUserName would be MACAW\jorgk, but because I log in to Power BI with jorgk[at]macaw.nl I have created entries for both. Let’s see which one will be used later on.

I have created an SSAS security role with read permissions for my Active Directory account:

Row-level security has been set on the EffectiveUserName column of the Department table:
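The row filter on the Department table is a one-line DAX expression; a typical dynamic row-level security filter for this setup (assuming the table and column names used in this demo model) is:

```dax
=Department[EffectiveUserName] = USERNAME()
```

USERNAME() returns the identity of the connecting user, which in this scenario is the account passed in as EffectiveUserName.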

The SSAS Connector was installed on the server by my colleague Dave, who has administrator permissions on the SSAS server (inherited from the local Administrators group). For the purpose of the demo we made sure my domain account was neither in the local Administrators group nor in the SSAS server administrators group. Dave’s credentials are the stored credentials used to connect to the tabular model from the SSAS Connector, which passes the EffectiveUserName property just like a BISM connection can, as I explained before.

I then logged in to Power BI and created a report, and the security works: my report is filtered and shows only my data:

While I created my report, Dave ran a Profiler trace to see what happens in the background. As we expected, my connection came in under Dave’s account (blue) but with my account as the EffectiveUserName (red):

It’s interesting to see that my EffectiveUserName is JorgK[at]macaw.nl, yet the Power BI report I created shows the data of MACAW\jorgk! Why didn’t it show the data of the row that exactly matches my EffectiveUserName, JorgK[at]macaw.nl? That’s because SSAS does not authorize the user based on the textual value of the EffectiveUserName property. Instead, it calls Active Directory to check whether the Power BI user name (based on the mail address) exists in AD; if it does, the AD account is returned and the user is authorized with it.

To wrap up this entire process I’ve created the following diagram:

From the Power BI tenant I connect to the SSAS Connector using my Power BI Account.

The SSAS Connector connects to the tabular model using the stored credentials, in this case those of my colleague Dave, who has administrator permissions on the model and therefore permission to pass on my Power BI account as the EffectiveUserName.

The EffectiveUserName is checked in Active Directory. This only works if DirSync has been set up; otherwise my Power BI account, which comes from Windows Azure Active Directory, cannot be mapped to the on-premises Active Directory.

After Active Directory authorizes me, my on premises AD Account is sent to the tabular model.

Row-level security is applied as configured in the SSAS Role.

The next thing to do was, of course, to share the Power BI report in a dashboard with another colleague. As a test I shared it with my colleague Martijn (blue), who did not have any permissions on the tabular model at all. Unfortunately it seems the credentials of the user who shares the dashboard are stored in that dashboard/report and are used to make the connection to the underlying SSAS data source. Martijn’s report was identical to mine, and SQL Profiler showed that the connection was indeed again made with Dave’s admin account, with my account passed in the EffectiveUserName (red).

Conclusion

It’s great to see we can finally connect to on-premises models from Power BI using row-level security, which was not possible with the Data Management Gateway. This makes a hybrid cloud/on-premises architecture a feasible option to implement. Unfortunately we are not there yet: it isn’t possible to impersonate a user after sharing a dashboard, which is clearly not the desired functionality. Let’s hope Microsoft fixes this in the coming months! Until then, be sure to advise your customers about this sharing security issue.

What is a Data Lake? Pentaho CTO James Dixon is credited with coining the term "Data Lake". As he describes it in his blog entry, "If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

These days, demands for BI data stores are changing. BI data consumers not only require cleansed and nicely modeled data, updated on a daily basis, but also raw, uncleansed and unmodeled data which is available near real-time. With new and much more powerful tooling like Power BI, users can shape and cleanse data in a way that fits their personal needs without the help of the IT department. This calls for a different approach when it comes to offering data to these users.

BI data consumers also demand a very short time-to-market for new data: they don’t want to wait months until data is made available by a BI team; they want it today. The raw, uncleansed form of data in a Data Lake can be loaded very quickly because it’s suitable for generated data-loading technologies and replication, which makes this short time-to-market possible. Once users have discovered the data and acquired insights they want to share with the entire organization in a conformed way, the data can be brought to traditional Data Warehouses and cubes in a predictable manner.

Furthermore, there is a rise in the presence of unstructured and/or semi-structured data and in the need to have “big data” available for ad hoc analyses. To store and analyze these forms of data, new technologies and data structures are required.

With a Data Lake in place, a lot of data streams from the sources into the “lake” without knowing up front whether it is eligible for answering business questions. The data can’t be modeled yet, because it’s not clear how it will be used later on. Data consumers get the possibility to discover data and find answers before the questions are even defined. This differs fundamentally from the concept of a Data Warehouse, in which data is delivered through predefined structures, based on relevant business cases and questions.

Technology

From a technology view, a Data Lake is a repository which offers storage for large quantities and varieties of unstructured, semi-structured and structured data derived from all possible sources. It can be formed by multiple underlying databases which store these differently structured forms of data in both SQL and NoSQL technologies.

For the semi-structured/unstructured side of data which is used for big data analytics, Data Lakes based on Hadoop and other NoSQL technologies are common. For the semi-structured/structured data, SQL technologies are the way to go.

In this blog post I will describe the semi-structured/structured, relational appearance of the Data Lake in the form of a SQL Server database: The Relational Data Lake.

Extract Load (Transform)

Data in a Data Lake is in raw form. Transformations are not performed during loading, and relationships and constraints between tables are not created; this matches the default behavior of transactional replication and keeps the loading process as lean and fast as possible. Because of the lack of transformations, movement of the data follows the Extract-Load-(Transform) (EL(T)) pattern instead of the traditional E-T-L. This pattern makes loading data into the Data Lake easier, faster and much more suitable for replication technologies or generated SSIS processes, for example with BIML. This creates a very attractive time-to-market for data added to the Data Lake. Latency should be as low as possible; preferably data is loaded in near real-time: data should stream into the lake continuously.

Transformations take place after the data is loaded into the Data Lake, where applicable. Cosmetic transformations, such as translating technical object and column names into meaningful descriptions end users understand, or other lightweight transformations, can be performed in new structures (like SQL views) created inside the Data Lake.

Unlike Data Marts and Data Warehouses, which are optimized for data analysis by storing only the required attributes and sometimes dropping data below the required level of aggregation, a Data Lake always retains all attributes and (if possible) all records. This way it is future-proof for solutions that will require this data at a later moment in time, or for users who will discover the data.

Accessing data

Data is made accessible through structures which can be accessed either directly or indirectly through exposure as OData feeds. These structures are secured and are the only objects end users or other processes have access to. The feeds can be accessed with whatever tool or technology is best suited to the task at any moment in time, for example Power BI tooling like Excel PowerPivot/Power Query.

We normally create SQL views in which security rules and the required transformations are applied.
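A sketch of such a view (schema, table and column names are hypothetical) could look like this:

```sql
-- Access view inside the Data Lake: translates technical names and applies a security rule.
CREATE VIEW access.SalesOrders
AS
SELECT s.ORDNO  AS [Order Number],   -- technical column names translated to
       s.ORDDT  AS [Order Date],     -- descriptions end users understand
       s.TOTAMT AS [Order Amount]
FROM   raw.SALESORD AS s
JOIN   access.UserDepartment AS u    -- security rule: users see only the
  ON   u.DepartmentCode = s.DEPTCD   -- departments they are mapped to
 AND   u.LoginName = SUSER_SNAME();
```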

The Data Lake also acts as a hub for other repositories and solutions like Data Warehouses and Operational Cubes.

Master Data

The success of the Data Lake depends on good master data. When end users discover new raw data in the Data Lake, they need to be able to combine it with high-quality master data to get proper insights. Therefore a master data hub is a must-have when a Data Lake is created. This hub can simply be a database with master data structures in it; master data management on this data is preferable but not required. The master data hub should be a standalone solution, independent of the other BI solutions, as master data isn’t part of those solutions but is only used as a data source. It should be sourced independently too, preferably using master data tooling or tools like SSIS. Just like data from the Data Lake, master data should only be accessed through structures which can also be exposed as OData feeds.

Besides being combined with data from the Data Lake, the master data can be used as a source for other BI solutions like Data Warehouses, where the master data structures are often used as dimensions. To prevent unnecessary duplicate loading of master data that already exists in the master data hub, it can be a good choice to leave the master data out of the Data Warehouse dimensions: only the business keys are stored, and they are used to retrieve the data from the master data hub when required. This way the Data Warehouse remains slim and fast to load, and master data is stored in a single centralized data store.
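As a sketch (all object names hypothetical), such a slim dimension stores only the keys, and a view picks up the descriptive attributes from the hub when they are needed:

```sql
-- The Data Warehouse dimension holds only surrogate and business keys.
CREATE TABLE dwh.DimCustomer (
    CustomerKey INT IDENTITY PRIMARY KEY, -- surrogate key used in fact tables
    CustomerBK  NVARCHAR(20) NOT NULL     -- business key into the master data hub
);

-- Descriptive attributes are retrieved from the master data hub when required.
CREATE VIEW dwh.DimCustomerFull
AS
SELECT d.CustomerKey, d.CustomerBK, m.CustomerName, m.Segment
FROM   dwh.DimCustomer AS d
JOIN   mdh.Customer    AS m
  ON   m.CustomerBK = d.CustomerBK;
```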

Architecture

The entire Data Lake architecture, with all the described components, fits in the model below. From bottom to top, the highlights are:

Extract/Load data from the sources to the Data Lake, preferably in near real-time.

The Data Lake can consist of multiple SQL (and NoSQL) databases.

Transformations and authorizations are handled in views.

The Data Lake acts as hub for other BI solutions like Data Warehouses and Cubes.

The master data hub is in the center of the model and in the center of the entire architecture. It’s loaded as a standalone solution and isn’t part of any of the other BI solutions.

Traditional BI will continue to exist and continue to be just as important as it has always been. It will be sourced from the Data Warehouses and cubes (and master data hub).

The Discovery Platform with its new Power BI tooling is the place where “various users of the lake can come to examine, dive in, or take samples.” These samples can be combined with the data from the master data hub.

Data Lake Challenges

Setting up a Data Lake comes with many challenges, especially regarding data governance. For example, it’s easy to create just any view in the Data Lake and lose control over who gets access to which data. From a business perspective it can be very difficult to deliver the master data structures that are so important for the success of the Data Lake. And from a user perspective, users can draw wrong conclusions from insights based on raw data; therefore the Data Warehouse should still be offered as a clean, trusted data structure for decision makers and as a data source for conformed reports and dashboards.

Summary

The Data Lake can be a very valuable data store that complements the traditional Data Warehouses and cubes, which will stay as important as they are now for many years to come. But considering the increased amount and variety of data, the more powerful self-service ETL and data modeling tooling that is appearing, and the shortened time-to-market required for near real-time data from source to user, the Data Lake offers a future-proof data store and hub that enables answering yet undefined questions and gives users personal data discovery and shaping possibilities.

Thanks go to my Macaw colleague Martijn Muilwijk for brainstorming on this subject and reviewing this blog post.

SQL Server Analysis Services (SSAS) Multidimensional (MD) is a great product, and in my opinion it’s still the only real option to go for when building complex enterprise BI solutions. It’s still very fast when implemented correctly and it’s mature and therefore very stable.

The only real downside is cell security, which, in my opinion, is useless. It makes performance drop dramatically because security is evaluated cell by cell. I have seen reports that run in a few seconds without cell security take 20 minutes with cell security implemented! Try to explain that to your customer… you obviously can’t.

SSAS offers us the “measures dimension”, which gives you the possibility to secure measures like dimension members. Great, this is just what we need, but implementing it and creating MDX calculations based on the secured measures will raise errors for SSAS roles that do not have access to those measures. This is caused by the fact that the MDX script is executed after security has been applied. So if a user who doesn’t have access to a measure (set by dimension security) connects to the cube while the MDX script contains a reference to this secured measure, an error is raised.

For a few years I implemented a workaround that uses the measures dimension. To prevent errors in the MDX script as described above, I added IsError() checks around all calculation parts that could raise an error. For example, a simple calculation like Quantity * Price, where the Price measure could be secured, looks like this:
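A sketch of such a calculation, with the IsError() check in both the calculation and the NON_EMPTY_BEHAVIOR (measure names as used in this example):

```mdx
CREATE MEMBER CURRENTCUBE.[Measures].[Sales Amount] AS
    IIf (
        IsError ( [Measures].[Price] )   -- secured measure: IsError catches the "does not exist" error
        , NULL                           -- no permission: return NULL instead of raising an error
        , [Measures].[Quantity] * [Measures].[Price]
    )
, NON_EMPTY_BEHAVIOR =
    { IIf ( IsError ( [Measures].[Price] ), NULL, { [Measures].[Price] } ) }
, VISIBLE = 1;
```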

This calculation does not raise an error but returns NULL if a user doesn’t have permission on the Price measure. Quite straightforward; the only trick here is the IsError() check in the NON_EMPTY_BEHAVIOR. Of course, you have to do this in the script view of your SSAS calculations tab. In the form view it will look a bit strange, but it doesn’t cause any errors:

Using this approach you can implement measure security with dimension security while still being able to create MDX calculations with NON_EMPTY_BEHAVIOR based on these secured measures. This made the report I talked about before run in 5 seconds instead of 20 minutes. I’ve used this approach for quite some years now, and it has always been sufficient.

At the moment I’m working on a very big enterprise cube that contains almost 100 dimensions, 25 measure groups and millions of records. I implemented measure security like this, and after a while I noticed some really heavy queries (reports) took quite some time, for example 30 seconds or 1 minute. Reason enough to rethink this approach. When I was at the SQL Server Days in Belgium last year, I discussed this approach with Chris Webb. He understood what I was doing here, and we agreed to email about it later on. Chris emailed me a blog post of his from some time ago: http://cwebbbi.wordpress.com/2008/01/22/dimension-security-tips-tricks-and-problems/. In this post Chris explains how to create secured calculations while using dimension security, by using named sets and scope statements:

I did not know about this approach, so I tried it out. To be able to get good test results I used a very heavy query that hopefully nobody would ever try: a calculation over all sales data (millions of records) against all customers, all products and all time. Unfortunately, the results were not very pleasing yet:

1. Calculation with IIF/IsError checks in both the calculation part and the non empty part: ran in 50 seconds.

2. Calculation with named set/scope approach: ran longer than 5 minutes, after that I stopped the query.

So Chris and I emailed again, and I wondered whether the lack of NON_EMPTY_BEHAVIOR (NEB) could be the cause of the slow performance of the named set/scope approach. Chris said that since SSAS 2008 the usage of NEB was no longer necessary, something I had heard before. I always had the idea that adding NEB did make a difference in some cases, so I never stopped using it. So I merged Chris’s approach and mine, adding the NEB using an IsError() function, and I replaced the StrToMember check with a direct reference to the measure in the named sets because in my opinion that part was not necessary:

The result was almost unbelievable: the query now took not 5 minutes, not 50 seconds, but only 2 seconds! So adding NEB can still make a huge difference sometimes! After finding this out, I contacted Chris again and he was also very surprised by the result.

I’ve created an MDX calculation template with some explanation added as comments. The big performance gain is achieved because the IsError() check doesn’t need to be executed every time the calculation is evaluated: it is executed once, when the static named set is created, and afterwards that value is reused. I think the same applies to the IsError() check in the NEB; it seems it’s only executed once and then reused. Anyway, the result is that your MDX calculations are just as fast with security applied as without, something that was not possible for a long time, for me and I guess everybody else too. Just use the template below as a standard for creating your calculations and benefit from this solution:

--Calculation Template:

----------------------------------------

----------1: CHECKED CALCULATION: The secured "end product" used in other calculations----------

CREATE MEMBER CURRENTCUBE.[Measures].[Quantity_CHECKED] AS
    NULL
, VISIBLE = 0 ;

----------2: NAMED SET: Static set that performs the actual rights check (executed once, at initialization)----------

CREATE SET [Quantity_CHECK] AS
    IIf (
        IsError ( [Measures].[Quantity] )
        , { }
        , { [Measures].[Quantity_CHECKED] }
    ) ;

----------3: SCOPE: Assignment of either the measure or "nothing" to the CHECKED calculation (1)----------

SCOPE ( [Quantity_CHECK] ) ;
    This = [Measures].[Quantity] ;
END SCOPE ;

----------4: Second secured calculation created here----------

CREATE MEMBER CURRENTCUBE.[Measures].[Price_CHECKED] AS
    NULL
, VISIBLE = 0 ;

CREATE SET [Price_CHECK] AS
    IIf (
        IsError ( [Measures].[Price] )
        , { }
        , { [Measures].[Price_CHECKED] }
    ) ;

SCOPE ( [Price_CHECK] ) ;
    This = [Measures].[Price] ;
END SCOPE ;

----------5: Calculation based on the secured calculations.
----------Non_Empty_Behavior set with IIf/IsError on the measure. Using calculations inside the NEB is not possible----------

CREATE MEMBER CURRENTCUBE.[Measures].[Sales Amount] AS
    [Measures].[Quantity_CHECKED] * [Measures].[Price_CHECKED]
, NON_EMPTY_BEHAVIOR =
    {
        IIf (
            IsError ( [Measures].[Price] )
            , NULL
            , { [Measures].[Price] }
        )
    }
, VISIBLE = 1 ;

-------------------------------------------

Your calculations tab in SSAS will look like:

While investigating the inner workings of the first part of the script, I added some comments to make it easy for myself to remember. I guess it’s also useful for anyone who wants to know what’s really happening here:

CREATE MEMBER CURRENTCUBE.[Measures].[Quantity_CHECKED] --> Calculation is initially NULL and will be filled based on user rights later on.

AS NULL,

VISIBLE = 0; --> Make it invisible, as this is a purely technical calculation that should not be used by end users. End users can use the measure this calculation is based on, if they have permission (set by dimension security on the Measures dimension).

CREATE SET [Quantity_CHECK] AS --> The named set performs the actual check for user rights; as this is a static named set, the check is executed only once, at initialization, and the result is reused.

IIf(IsError([Measures].[Quantity]) --> The IsError on the measure that needs to be checked raises an error if a particular user has no rights on it, because in that case the measure simply does not exist.

, {} --> If an error was raised by the IsError function, set the value of this set to nothing: {}

, {[Measures].[Quantity_CHECKED]}); --> If no error was raised, the user has rights to access the measure; in that case set the value of the set to the CHECKED calculation created in the previous step.

SCOPE([Quantity_CHECK]); --> Scope over the CHECK set, which is either {} (nothing) or the CHECKED calculation (based on user rights).

This = [Measures].[Quantity]; --> Assign the measure to This. If the user has rights, This is the CHECKED calculation and the measure is passed through to it; otherwise This is {}/nothing, and the measure’s value goes nowhere instead of to the CHECKED calculation.

END SCOPE; --> So this scope either passes the measure to the CHECKED calculation or to nothing (the empty set {}).

Last week Microsoft released CTP3 of SQL Server Integration Services (SSIS), code name Denali. In this blog post I will look into the key new features and some of the minor improvements in SSIS Denali.

1. Development

Shared Connection Managers

Connection Managers can now be shared at SSIS project level. You create them in the Solution Explorer, in the Connection Managers folder:

Once created, they will appear automatically in all your SSIS packages. The names are in bold, so you can distinguish them from your package-level connection managers:

You can also create Shared Cache Connection Managers for your cached lookups. This will be very useful for lookups that are performed multiple times, for example when you look up dimension tables multiple times from your fact table ETL. You can just convert the local cache connection manager by right clicking it:

What happened to the Shared Data Sources we knew from prior SSIS versions? They are no more. Shared Data Sources existed only at design time, not at runtime, so they were not really useful anyway. Shared Connection Managers do live at runtime and offer even more possibilities, as I will show later on, so they replace the old Shared Data Sources.

Data Flow - Column mappings

SSIS has always mapped columns from sources to transformations or destinations with the help of lineage ids. Every column had a unique metadata id that was known by all components in the data flow. If something changed in the source, the lineage ids would break, raising error messages like: “The external metadata column collection is out of synchronization with the data source columns.” To fix this error you would re-map all broken lineage ids with the “Restore Invalid Column References Editor”. In Denali, lineage ids are no longer used. Mappings are done on column names, which is great: you can now auto-map on column names and even copy/paste pieces of another data flow and connect them by mapping the corresponding column names.

Data Flow - Flexible order of authoring

This improvement lets you edit data flow components even when they don’t have an input attached. Theoretically you can build your data flow backwards: start with the destination and track back to the source.

Data Flow - Groupings

You can now group data flow components. Select the components you wish to group, right-click and select Group:

The result is some sort of a data flow sequence container:

By clicking the arrow it will collapse:

Data flow groups are 100% eye candy; you can’t set any properties on them.

Data Flow - Data Quality Services Cleansing transformation

With this transformation you can apply data quality rules in the data flow. This is done using a knowledge base, which can be created by yourself or downloaded from the Windows Azure Marketplace. For example, you could apply a rule that checks whether a given postal code column is valid for a particular town column in your record.

Data Flow - Data Tap

In Denali we have the possibility to attach a “tap” to a data flow path (arrow). This tap captures all data coming through and dumps it to CSV files.
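In the released version of the SSIS catalog this is done with the catalog.add_data_tap stored procedure; a sketch (package and path names are hypothetical), assuming an execution has been created but not yet started:

```sql
-- Attach a tap to a data flow path; rows passing through are dumped to a CSV file.
EXEC catalog.add_data_tap
     @execution_id            = @execution_id,                  -- id of the created execution
     @task_package_path       = N'\Package\Data Flow Task',     -- hypothetical task path
     @dataflow_path_id_string = N'Paths[OLE DB Source Output]', -- hypothetical path id string
     @data_filename           = N'TapOutput.csv';
```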

2. Configuration

SSIS Denali is fundamentally different from its predecessors when it comes to the concept and usage of configurations. SSIS package configurations are obsolete (they will still be available if you really want to use them ;-)); parameters and environments are the new kids in town.

Parameters

SSIS parameters look a lot like SSIS variables, but there are some differences. There are two types: 1. Package Parameters: Think of package parameters as C# parameters, which are passed as input to a C# function (= your package). You can set them when executing (calling) a package, and their lifecycle is limited to the SSIS package itself.

The difference with SSIS variables? You can set the parameters while executing a package with SSMS or an Execute Package Task.

Define your Package Parameters at the Package Parameters tab:

2. Project Parameters: Identical to package parameters except for their scope: these parameters exist at project level and can be referenced throughout all SSIS packages in your project, whereas package parameters live only inside a single SSIS package.

You can define Project Parameters in the solution explorer within your SSIS project:

Both Package and Project Parameters can be referenced from your SSIS packages, you recognize them by their prefix, $Package or $Project:

Setting parameters in an Execute Package Task is achieved with the new Parameter bindings tab:

It’s also possible to parameterize SSIS tasks on the Control Flow by right-clicking them and choosing Parameterize:

Loads of capabilities here! I now realize that I have created a workaround for Package Parameters with my SSIS Package design pattern for loading a data warehouse where I (mis)used Package Configurations as Package Parameters. Creating a new package design pattern for Denali definitely goes on my TODO list!

Environments

Environments are collections of SSIS package settings that can be defined on the SSIS server. At runtime, the environment overrides these settings in the SSIS packages. You can create multiple environments, and when you run your SSIS packages you can select which environment to use. It’s also possible to let multiple SSIS projects run under one environment, so there is flexibility all around.

To make you understand the principle of Environments right away I have created a simple example that you will commonly use: Create two Environments, one with development settings and one with production settings.

I have deployed a Test SSIS project to SSMS which contains one SSIS package with one Shared Connection Manager. Notice the new Integration Services folder structure:

Next right click Environments and choose Create Environment:

Type Development as the name for the environment and click OK:

Now double click the Development Environment:

Click on the Variables tab and create an environment variable that will overwrite the Shared Connection Manager. Type in the connection string for the development server as the value and click OK:

Next create another Environment with the name Production and also create an Environment Variable with the name SCM_EnvironmentVar. The only difference between these two variables should be the value of the variable; it contains the production server connection string instead of the development value. You now have two Environments with one Environment Variable for the Shared Connection Manager each:

We now need to reference the Development and Production Environments in the Test project so they can be used. Right click the Test project and choose Configure:

Go to the References page and click Add...

Add both Environments:

Both environments are now referenced from the Test SSIS project and are available. Let’s configure the Shared Connection Manager so it uses the value specified in one of the environments. Click on the Parameters page, then the Connection Managers tab, and then the … next to the ConnectionString property of the Shared Connection Manager:

The name of our Environment Variable is shown in the Value box of the ConnectionString property:

We have now successfully attached the environment variable to override the ConnectionString property with a value from either the Development or the Production environment. Executing the package under one of both environments is very easy. To do so, right-click the package and choose Run:

The Run Package dialog box appears and you instantly see a message that reminds you to assign a value to the ConnectionString property of connection manager SharedConnectionManager:

At the bottom of the dialog box you can select the environment your package should use. Select the .\Development environment and click OK:

In the overview report you can see the package ran successfully under the Development environment:

If you change the package to use the Production Environment and you run the package again, you’ll get the following execution information:

Pretty straightforward and easy to use. The big difference? All the configuration work has been done in SQL Server Management Studio! Exactly the place where it should be done. In the current SSIS versions most of this work was done within Visual Studio and the configuration was done by editing tables. So unfriendly! Big improvement, great work MS! In this example I have set a property of a Shared Connection Manager but of course it is possible to set Package or Project Parameters with Environment Variables.
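Running a package under a specific environment can also be scripted against the catalog with T-SQL; a sketch (folder, project and reference ids are hypothetical):

```sql
-- Create an execution that uses a referenced environment, then start it.
DECLARE @execution_id BIGINT;

EXEC catalog.create_execution
     @folder_name  = N'TestFolder',        -- hypothetical catalog folder
     @project_name = N'Test',
     @package_name = N'Package.dtsx',
     @reference_id = 1,                    -- id of the Development environment reference
     @execution_id = @execution_id OUTPUT;

EXEC catalog.start_execution @execution_id;
```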

3. Deployment

We have had the option to deploy SSIS packages from Visual Studio for a long time thanks to BIDS Helper. Microsoft finally added this functionality out of the box in Denali. They even did more: you can now also build your projects from within Visual Studio!

There are two ways to deploy (push) your projects to the new SSIS server, directly from Visual Studio or by using an Integration Services Project Deployment File. They both use the new Integration Services Deployment Wizard. Deployments directly from Visual Studio can be done by right clicking your project and then choose Deploy. Now, the Deployment Wizard will pop straight up:

The other way is building your project first. This was already possible in SSIS versions prior to Denali, but it did not serve much purpose there.

In Denali this is different. Building your project will generate an .ispac file which contains your entire project and this can be used by the Deployment Wizard. Again a very nice new feature.

Handing over a new release of an SSIS solution to a customer’s IT department for deployment can now be done in a very nice, professional manner. Ispac files can also be opened by Visual Studio to import your SSIS project. Building your project can be done by right clicking your project and choosing Build:

The output window displays the following:

When looking at the bin\Development folder we see the new Test.ispac file:

You can now select the project that you wish to deploy. You can either use the .ispac Project Deployment file or deploy an existing project located in any Integration Services catalog. The latter option is useful when, for example, you wish to deploy a TEST SSIS solution that is already located in a catalog to an ACCEPTANCE environment in another catalog.
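The same deployment can also be scripted: the catalog exposes a deploy_project procedure that accepts the .ispac content as a binary stream. A hedged sketch (paths and names below are assumptions):

```sql
-- Illustrative sketch: deploying an .ispac file from T-SQL instead of the wizard.
DECLARE @project_binary varbinary(max) =
    (SELECT BulkColumn
     FROM OPENROWSET(BULK N'C:\Projects\Test\bin\Development\Test.ispac', SINGLE_BLOB) AS ispac);

DECLARE @operation_id bigint;
EXEC [SSISDB].[catalog].[deploy_project]
     @folder_name    = N'MyFolder',
     @project_name   = N'Test',
     @project_stream = @project_binary,
     @operation_id   = @operation_id OUTPUT;
```

This makes it possible to hand an IT department a single .ispac plus a deployment script instead of a click-through manual.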

The project is loaded and validated:

Next, you need to select the destination, which has to be an SSIS Server:

Review your selections and click on Deploy to start the actual deployment:

The results show a successful deployment. Notice that the protection level has been changed. What happened here? The wizard clears the protection level we know from Visual Studio to prevent “cannot decrypt password” errors.

Instead of using protection levels, passwords are stored in clear text inside the package. The entire package (with the passwords in it) is then stored encrypted in the SSIS Server tables you are deploying to.

The project has now been deployed to the server:

When you right click the Test project and choose Versions you are able to see the current version of your project:

If you deploy a newer version later on, the existing version(s) stay on the server and you can easily roll back to a previous version if you’ve made a bad deployment.

4. Management

The new SSIS Server is the central storage and administration point of your SSIS solutions. It is no longer a standalone service that you need to manage; basically it is a database with stored procedures.

You now have an Integration Services node available in SSMS when you connect to the database engine:

Under the Integration Services node you will find your SSISDB catalog, which holds all your SSIS solutions with their packages, environments, etc. These objects are physically stored in a SQL Server database with the same name as the catalog:

This database also contains all the stored procedures containing all the programming code for the SSIS Server:

With the help of these stored procedures you can manage your SSIS Server: for example your parameter values, connection managers, and property overrides using environments. Besides the configuration functionality I discussed earlier, you can also implement security to control access to the catalog, both on folder level and package level. When you right click a folder or package in the catalog and choose Properties, you’ll get the following window where you can manage security access:
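The security settings in that window can be scripted as well. A hedged sketch using the catalog's grant_permission procedure (the project name, user name, and the type/permission codes below are illustrative; look the ids up in the catalog views first):

```sql
-- Illustrative sketch: grant READ on a project to an SSISDB database principal.
DECLARE @project_id bigint =
    (SELECT project_id FROM [SSISDB].[catalog].[projects] WHERE name = N'Test');

DECLARE @principal_id int =
    (SELECT principal_id FROM [SSISDB].[sys].[database_principals]
     WHERE name = N'MyDomain\SomeUser');

EXEC [SSISDB].[catalog].[grant_permission]
     @object_type     = 2,             -- 2 = project
     @object_id       = @project_id,
     @principal_id    = @principal_id,
     @permission_type = 1;             -- 1 = READ
```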

Finally you get an out-of-the-box reporting dashboard, built on the logging functionality that SSIS offers. On this dashboard you’ll get information about the execution history of your package and its sub-packages, view which parameters were used, view specific messages and get a performance report over time. All the information is logged automatically whenever a package runs on the server. The information is very detailed; you can even get the row counts between SSIS data flow transformations!

When running a package you are able to select a logging level on the Advanced tab. You can choose from:
- None: logging turned off for performance reasons
- Basic: error and warning logging
- Performance: detailed trace information
- Verbose: diagnostics and fault debugging
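The same logging level can be chosen when starting an execution from T-SQL. A hedged sketch (folder, project and package names are assumptions):

```sql
-- Illustrative sketch: start a package execution with Basic logging.
DECLARE @execution_id bigint;

EXEC [SSISDB].[catalog].[create_execution]
     @folder_name     = N'MyFolder',
     @project_name    = N'Test',
     @package_name    = N'Test.dtsx',
     @use32bitruntime = 0,
     @execution_id    = @execution_id OUTPUT;

-- 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose
EXEC [SSISDB].[catalog].[set_execution_parameter_value]
     @execution_id,
     @object_type     = 50,            -- 50 = execution-level parameter
     @parameter_name  = N'LOGGING_LEVEL',
     @parameter_value = 1;

EXEC [SSISDB].[catalog].[start_execution] @execution_id;
```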

When I run my Test package I’ll get the following execution dashboard. There are some hyperlinks that navigate to more detailed reports.

So no need for creating your own logging framework anymore, it’s all out of the box!

We recently had a few customers where a connection to Oracle on a 64-bit machine was necessary. A quick search on the internet showed that this could be a big problem; I found all kinds of blog and forum posts of developers complaining about it. A lot of developers will recognize the following error message: Test connection failed because of an error in initializing provider. Oracle client and networking components were not found. These components are supplied by Oracle Corporation and are part of the Oracle Version 7.3.3 or later client software installation. Provider is unable to function until these components are installed.

After a lot of searching, trying and debugging I think I found the right way to do it!

Problems

Because BIDS is a 32-bit application, on both 32-bit and 64-bit machines, it cannot see the 64-bit driver for Oracle. Because of this, connecting to Oracle from BIDS on a 64-bit machine will never work when you only install the 64-bit Oracle client.

Another problem is the "Microsoft Provider for Oracle": this driver only exists in a 32-bit version, and Microsoft has no plans to create a 64-bit one in the near future.

The last problem I know of is in the Oracle client itself: it seems that a connection will never work with the Instant Client, so always use the full client. There are also a lot of problems with the 10G client; one of them is the fact that this driver can't handle the "(x86)" in the path of SQL Server. So using the 10G client is not an option!

Solution

Download the Oracle 11G full client.

Install the 32-bit AND the 64-bit version of the 11G full client (Installation Type: Administrator) and reboot the server afterwards. The 32-bit version is needed for development from BIDS, which is 32-bit; the 64-bit version is needed for production with the SQLAgent, which is 64-bit.

Configure the Oracle clients (both 32 and 64 bits) by editing the files tnsnames.ora and sqlnet.ora. Try to do this with an Oracle DBA or, even better, let him/her do this.
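For reference, a minimal tnsnames.ora entry looks like the sketch below. The alias, host, port and service name are placeholders that your DBA should supply, and the same entry must exist in both the 32-bit and the 64-bit client home:

```
MYORADB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = oraserver.mydomain.local)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = MYORADB)
    )
  )
```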

Use the "Oracle provider for OLE DB" from SSIS, don't use the "Microsoft Provider for Oracle" because a 64 bit version of it does not exist.

Schedule your packages with the SQLAgent.

Background information

Visual Studio (BI Dev Studio) is a 32-bit application.

SQL Server Management Studio is a 32-bit application.

dtexecui.exe is a 32-bit application.

dtexec.exe has both 32-bit and 64-bit versions.

There are x64 and x86 versions of the Oracle provider available.

SQLAgent is a 64-bit process.

My advice to BI consultants is to get an Oracle DBA or professional for the installation and configuration of the 2 full clients (32 and 64 bit). Tell the DBA to download the biggest client available, this way you are sure that they pick the right one ;-)

Testing if the clients have been installed and configured correctly can be done with the Windows ODBC Data Source Administrator: Start… Programs… Administrative Tools… Data Sources (ODBC)

1. Open REGEDIT (Start… Run… REGEDIT) on the server and search for the following entry (for the 32-bit driver): HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC\MTxOCI. Make sure the following values are entered:

2. Next, search for the following entry (for the 64-bit driver): HKEY_LOCAL_MACHINE\Software\Wow6432Node\Microsoft\MSDTC\MTxOCI. Make sure the same values as above are entered.

In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool: “Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards. In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!”

Well I have tried it out and I must say it worked well. I was able to let replication services do work in a fraction of the time it would cost me to do the same in SSIS. What I did was the following:

Configure snapshot replication for some Adventure Works tables; this was quite simple and straightforward.

Create an SSIS package that executes the snapshot replication on demand and waits for its completion. This is something that you can’t do with out-of-the-box functionality. While configuring the snapshot replication, two SQL Agent Jobs are created: one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are asynchronous, which means that if you execute them they immediately report back whether the job started successfully or not; they do not wait for completion and report the result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues.

Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it.

Configure snapshot replication

The first step is to create a publication on the database you want to replicate. Connect to SQL Server Management Studio, right-click Replication and choose New Publication…

The New Publication Wizard appears, click Next

Choose your “source” database and click Next

Choose Snapshot publication and click Next

You can now select tables and other objects that you want to publish

Expand Tables and select the tables that are needed in your ETL process

In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example.

It’s possible to filter out rows and/or columns. In this example I did not apply any filters.

Schedule the Snapshot Agent to run at a desired time. By doing this a SQL Agent Job is created, which we need to execute from an SSIS package later on.

Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.

In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!

On the next screen choose to create the publication at the end of the wizard

Give the publication a name (SnapshotTest) and complete the wizard

The publication is created and the articles (tables in this case) are added

Now that the publication has been created successfully, it’s time to create a new subscription for it.

Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions

The New Subscription Wizard appears

Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)

You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor.

Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose New database

Give the database the desired name, set the desired options and click OK

You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.

You now need to set the security settings for the Distribution Agent. Click on the …. button

Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here

Click Next

Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on.

Initialize the subscription at first synchronization

Select the first box to create the subscription when finishing this wizard

Complete the wizard by clicking Finish

The subscription will be created

In SSMS you see a new database has been created: the subscriber. There are no tables or other objects available in the database yet, because the replication jobs have not run yet.

Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:

Rename this job to “CreateSnapshot”

Now search for the job that distributes the snapshot:

Rename this job to “DistributeSnapshot”

Create an SSIS package that executes the snapshot replication

We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started, the package needs to wait until it’s finished before the package execution finishes. The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back whether the job started successfully or not; it does not report if the job actually completed with success or failure. This is because these jobs are asynchronous.

The SSIS package I’ve created does the following:

It runs the CreateSnapshot job

It checks every 5 seconds if the job is completed with a for loop

When the CreateSnapshot job is completed it starts the DistributeSnapshot job

And again it waits until the snapshot is delivered before the package will finish successfully
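The poll-and-wait logic of the steps above can be sketched in T-SQL. This is a hedged illustration of the idea, not the exact contents of the package:

```sql
-- Illustrative sketch: start the CreateSnapshot job and wait for it to finish.
EXEC msdb.dbo.sp_start_job @job_name = N'CreateSnapshot';

DECLARE @running int = 1;
WHILE @running > 0
BEGIN
    WAITFOR DELAY '00:00:05';  -- check every 5 seconds
    SELECT @running = COUNT(*)
    FROM msdb.dbo.sysjobactivity ja
    INNER JOIN msdb.dbo.sysjobs j ON j.job_id = ja.job_id
    WHERE j.name = N'CreateSnapshot'
      AND ja.start_execution_date IS NOT NULL
      AND ja.stop_execution_date IS NULL;
END;
-- Repeat the same pattern for the DistributeSnapshot job.
```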

Quite simple and the package is ready to use as standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data:

In this example I only replicated 5 tables; I could create an SSIS package that does the same in approximately the same amount of time. But if I replicated all the 70+ AdventureWorks tables, I would save a lot of time and boring work! With Replication Services you also benefit from the fact that schema changes are applied automatically, which means your entire extract phase won’t break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so the performance will be quite good.

A disadvantage of using snapshot replication as an extraction tool is the limitation on source systems: you can only choose SQL Server or Oracle databases to act as a publisher.

So if you plan to build an extract phase for your ETL process that involves a lot of tables, think about Replication Services; it could save you a lot of time, and thanks to the Extract SSIS package I’ve created you can fit it perfectly into your usual SSIS ETL process.

In this blog post I will explain the principles of SQL Server Replication Services without too much detail and I will take a look on the BI capabilities that Replication Services could offer in my opinion.

SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformation; they do not offer capabilities to transform data or apply business rules, like ETL tools do. The only “transformations” Replication Services offers are filtering records or columns out of your data set. You can achieve this by selecting the desired columns of a table and/or by using WHERE statements like this: SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() - 60

There are three types of replication:

Transactional Replication

This type replicates data on a transactional level. The Log Reader Agent reads directly from the transaction log of the source database (Publisher) and clones the transactions to the Distribution Database (Distributor); this database acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the cloned transactions that are stored in the Distribution Database to the Subscriber. The Distribution Agent can either run at scheduled intervals or continuously, which offers near real-time replication of data!

So for example when a user executes an UPDATE statement on one or multiple records in the publisher database, this transaction (not the data itself) is copied to the distribution database and is then also executed on the subscriber. When the Distribution Agent is set to run continuously this process runs all the time and transactions on the publisher are replicated in small batches (near real-time), when it runs on scheduled intervals it executes larger batches of transactions, but the idea is the same.

Snapshot Replication

This type of replication makes an initial copy of the database objects that need to be replicated; this includes the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber. Transactional replication needs an initial snapshot of the replicated publisher tables/objects to run its cloned transactions on and maintain consistency.

The Snapshot Agent copies the schemas of the tables that will be replicated to files that will be stored in the Snapshot Folder which is a normal folder on the file system. When all the schemas are ready, the data itself will be copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (BCP) files. Next, the Distribution Agent moves the snapshot to the Subscriber, if necessary it applies schema changes first and copies the data itself afterwards. The application of schema changes to the Subscriber is a nice feature, when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s).

Merge Replication

Merge replication is typically used in server-to-client environments, for example when subscribers need to receive data, make changes offline, and later synchronize changes with the Publisher and other Subscribers, like with mobile devices that need to synchronize once in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further.

Replication Services in a BI environment

Transactional Replication can be very useful in BI environments. In my opinion you never want users to run custom (SSRS) reports or PowerPivot solutions directly on your production database; it can slow down the system and can cause deadlocks, which can cause errors. Transactional Replication can offer a read-only, near real-time database for reporting purposes with minimal overhead on the source system.

Snapshot Replication can also be useful in BI environments: if you don’t need a near real-time copy of the database, you can choose this form of replication. Next to being an alternative for Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards. In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that serves as the source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be and if it really works as well for this purpose as I expect, but I want to try this out soon!

Update: I got a question regarding the supported Replication Services features in the different versions of SQL Server (Standard, Enterprise, etc.). There is a nice table on MSDN that shows this!

When watching the SQL PASS session “What’s Coming Next in SSIS?” of Steve Swartz, the Group Program Manager for the SSIS team, an interesting question came up:

Why is SSIS thought of to be BI, when we use it so frequently for other sorts of data problems?

The answer of Steve was that he breaks the world of data work into three parts:

Process of inputs

BI

Enterprise Information Management: all the work you have to do when you have a lot of data to make it useful and clean and get it to the right place. This covers master data management, data quality work, data integration and lineage analysis to keep track of where the data came from. All of these are part of Enterprise Information Management.

Next, Steve said that Microsoft is developing SSIS as part of a large push into all of these areas in the next release of SQL Server. So SSIS will be, next to a BI tool, part of Enterprise Information Management in the next release of SQL Server.

I'm interested in the different ways people use SSIS, I've basically used it for ETL, data migrations and processing inputs. In which ways did you use SSIS?

At the PASS Summit that is happening in Seattle at the moment Microsoft announced the “BI Semantic Model” (BISM).

It looks like BISM is something like the UDM that we now know from SSAS. While the UDM was the bridge from relational data to multidimensional data, BISM is the bridge from relational data to the column-based Vertipaq engine. Some compare BISM to Business Objects universes.

The next version of SSAS will be able to either run in the old “UDM” mode or in “BISM” mode, a combination is not possible. Of course this will have some radical consequences, because there are a few major differences between the two modes:

The switch from multidimensional cubes to the in-memory Vertipaq engine

The switch from MDX to DAX

So multidimensional cubes and MDX will be deprecated? No, not really: SSAS as we know it now will remain a product in the future and will remain supported. But it looks like Microsoft will concentrate on BISM, mainly because multidimensional cubes and MDX are very difficult to learn. Microsoft wants to make BI more approachable and less difficult, just like with Self Service BI. I would say that it’s really time to start learning PowerPivot and DAX right now, if you have not already started. If Microsoft focuses on the new BISM/Vertipaq technology, that will be the future if you ask me.

Chris Webb wrote an interesting article about BISM and it looks like he is not very enthusiastic about the strategy Microsoft takes here because this could be the end of SSAS cubes within a few years: “while it’s not true to say that Analysis Services cubes as we know them today and MDX are dead, they have a terminal illness. I’d give them two, maybe three more releases before they’re properly dead, based on the roadmap that was announced yesterday.”

What’s also very interesting is the comprehensive comment on this article from Amir Netz. He explains BISM and UDM will live together in Analysis Services in the future and MOLAP is here to stay: “Make no mistake about it – MOLAP is still the bread and butter basis of SSAS, now and for a very long time. MDX is mature, functional and will stay with us forever.”

Read the article from Chris Webb here and make sure you don’t miss the comment from Amir!

With SQL Azure Reporting Services you can use SSRS as a service on the Azure platform, with all the benefits of Azure and most of the features and capabilities of the on-premises version. It’s also possible to embed your reports in your Windows or Azure applications.

Benefits of the Azure platform for Azure Reporting Services are:

Highly available, the cloud services platform has built-in high availability and fault tolerance

Scalable, the cloud services platform automatically scales up and down

Secure, your reports and SQL Azure databases are in a safe place in the cloud

Cost effective, you don’t have to set up servers and you don’t have to invest in managing servers

Use the same tools you use today to develop your solutions. Just develop your reports in BIDS or Report Builder and deploy to Azure

Disadvantages are:

SQL Azure databases are the only supported data sources in the first version, more data sources are expected to come

No developer extensibility in the first version, so no custom data sources, assemblies, report items or authentication

No subscriptions or scheduled delivery

No Windows Authentication, only SQL Azure username/password is supported in the first version, similar to SQL Azure database. When SQL Azure database gets Windows Authentication, Azure Reporting will follow

Despite the disadvantages of the first version I think SQL Azure Reporting Services offers great capabilities and can be extremely useful for a lot of organizations. I’m really curious about the CTP, which will be available before the end of this year. You can sign up for the SQL Azure Reporting CTP here

The exam contains 2 sections(basically separate exams), each with 25 questions: - A part which covers exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance - A part which covers exam 70-452: PRO: Designing a Business Intelligence Infrastructure Using Microsoft SQL Server 2008

You need to pass both sections with a score of at least 700. If you fail one section, you fail the entire exam.

How did I study

I searched the internet and the conclusion was that there is no preparation material available for the 70-452 exam. Fortunately there was a self-paced training kit for the 70-448 exam, which also covers this exam. So I bought the book, scanned it for subjects that needed attention and fortunately that was enough for me to pass the exam.

For the entire list of preparation materials for the 70-448 and 70-452 exams follow the links below:

The latest releases of SQL Server contained (almost) no new SSIS features. With the release of SSIS 2008, the ability to use C# scripts, the improved data flow and the cached lookup were the most thrilling new features. The release of SQL 2008 R2 only gave us the ability to use a bulk insert mode for the ADO.NET destination, which was a bit disappointing.

Fortunately Matt Mason from the SSIS team announced that the next version of SQL Server (SQL 11) will contain quite some exciting new functionality for SSIS!

- Undo/Redo support. Finally, this should have been added a long time ago ;-)

- Improved backpressure for data flow transformations with multiple inputs (for example a Merge Join). When one of the inputs gets too much data compared to the other, the component that receives the data can tell the data flow that it needs more data on the other input

- The Toolbox window will automatically locate and show newly installed custom tasks

Quite often one or more sources for a data warehouse consist of flat files. Most of the times these files are delivered as a zip file with a date in the file name, for example FinanceDataExport_20100528.zip

Currently I work on a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped in a directory on a daily basis. Sometimes there are multiple zip files in the directory; this can happen because the ETL failed or because somebody put a new zip file in the directory manually. Because the ETL isn’t incremental, only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.

Usage is quite simple, just copy/paste the code in your script task and create two SSIS variables:

SourceFolder (type String): The folder that contains the (zip) files

DateInFilename (type Boolean): A flag, set it to True if your filename ends with the date YYYYMMDD, set it to False if the creation date of the files should be used

Public Sub Main()
    'Requires "Imports System.IO" at the top of the script.
    'Use this piece of code to loop through a set of files in a directory
    'and delete all files except for the most recent one, based on a date in the filename.
    'File name example:
    'DataExport_20100413.zip
    Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
    Dim mostRecentFile As String = ""
    Dim currentFileDate As Integer
    Dim mostRecentFileDate As Integer
    Dim currentFileCreationDate As Date
    Dim mostRecentFileCreationDate As Date
    Dim dateInFilename As Boolean = CBool(Dts.Variables("DateInFilename").Value) 'If your filename ends with the date YYYYMMDD, set SSIS variable DateInFilename to True. If not, set it to False.

    If dateInFilename Then
        'Check which file is the most recent based on the date in the filename
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
            currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get the date from the current filename (based on a file that ends with: YYYYMMDD.zip)
            If currentFileDate > mostRecentFileDate Then
                mostRecentFileDate = currentFileDate
                mostRecentFile = fi.Name
            End If
        Next
    Else 'Date is not in the filename, use the creation date
        'Check which file is the most recent based on the creation date
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
            currentFileCreationDate = fi.CreationTime 'Get the creation date of the current file
            If currentFileCreationDate > mostRecentFileCreationDate Then
                mostRecentFileCreationDate = currentFileCreationDate
                mostRecentFile = fi.Name
            End If
        Next
    End If

    'Delete all files except the most recent one
    For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
        If fi.Name <> mostRecentFile Then
            File.Delete(Path.Combine(rootDirectory.FullName, fi.Name))
        End If
    Next

    Dts.TaskResult = ScriptResults.Success
End Sub