To Federate or Not to Federate

With the release of SAP BusinessObjects 4.0, a light version of Data Federator (DF) has been embedded directly into the new Information Design Tool (IDT) Universe (.UNX). With this plugin you can create a virtual connection to multiple disparate data sources and combine tables from each source within a single Universe Foundation. For some organizations this will simplify their reporting environment by providing ad hoc users a single Universe capable of connecting to multiple databases or data sources. However, it is important to understand how this plugin works as it relates to the performance and scalability of solutions developed utilizing DF. This article will walk you through an example where this plugin is utilized to connect to a BW system utilizing a JCo relational connection and a federated Universe. Developers will also find that they have more traditional, Universe-style control in managing data output using JCo, because they are designing a Universe on InfoProviders (similar to database tables) with Data Federator SQL rather than on InfoCubes with MDX. At first glance, this sounds like a much-needed feature for Universes. The specific issues related to this scenario will not necessarily apply to other RDBMS data sources, but I will close with a general discussion of issues related to other non-BW data sources.

To begin, let me clarify that there is no mechanism in BOE 4.0 to create an IDT universe (.UNX) directly on a BEx OLAP connection via BICS (please see SAP Notes 1564682 and 1587810). There is, however, an option that allows developers to utilize a relational connection via the SAP Java Connector (JCo) that provides direct access to the BW Aggregation Engine. In short, it is possible to develop a Universe (.UNX) directly on BW InfoProviders. However, the JCo relational connection only provides virtual access to the InfoProviders within BW. These InfoProviders are exposed to the Data Federation engine through a facade as a set of logical tables. As a result, there are limitations on how efficiently Data Federator can access, join and aggregate these tables. Specifically, BW via DF cannot “fully” join two virtual InfoProvider tables at the source. Because of this limitation, the DF engine is then leveraged to virtually join the tables, or a portion of each table’s data, on the Adaptive Processing Server instance that hosts the DF plugin. While this works effectively for small or summarized data sets, larger data sets can create a significant bottleneck within the DF engine and the network as it attempts to manage the large results of multiple BW InfoProvider tables on the reporting server. It is also worth mentioning that the DF plugin can leverage semi-join operations when accessing data in BW. In short, DF’s semi-join will use information from the source BW system and DF to help reduce the data transferred between BW and DF. While this is not the same as having the RDBMS join the tables, it does provide performance enhancements. If you are not seeing a reduction in the data being transferred, you can check SAP Note 1756611 for options to help optimize the statistics and algorithms that BW presents to DF when leveraging the semi-join feature.
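To make the semi-join idea concrete, here is a minimal sketch of the concept. This is not DF's actual implementation — all data, table sizes and function names are hypothetical — but it shows why fetching the distinct join keys from the small side first, and pushing them as a filter to the other source, reduces the rows transferred:

```python
# Illustrative sketch of a semi-join: the federation engine filters the
# remote fetch by the join keys it already has, instead of pulling the
# whole remote table and joining locally. Hypothetical data, not DF code.

def fetch_local_orders():
    # Small, already-filtered result set from source A.
    return [
        {"customer_id": 1, "amount": 100},
        {"customer_id": 3, "amount": 250},
    ]

def fetch_remote_customers(key_filter=None):
    # Simulates source B; a real engine would push key_filter into the
    # generated SQL as a WHERE customer_id IN (...) clause.
    table = [{"customer_id": i, "name": f"Customer {i}"} for i in range(1, 1001)]
    if key_filter is None:
        return table  # naive plan: transfer all 1,000 rows
    return [r for r in table if r["customer_id"] in key_filter]

# Semi-join: collect the keys from the small side first...
keys = {row["customer_id"] for row in fetch_local_orders()}
# ...then fetch only matching rows from the other source.
customers = fetch_remote_customers(key_filter=keys)

joined = [
    {**o, **c}
    for o in fetch_local_orders()
    for c in customers
    if o["customer_id"] == c["customer_id"]
]
print(len(customers), "rows transferred instead of 1000")
```

The join result is identical either way; only the volume shipped across the network changes, which is exactly the saving the semi-join feature is after.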
There is also an interesting article on SCN by Scott Broadway that further lists tips for optimizing DF to reduce the data transferred from BW (Link). As this information relates to BusinessObjects 4.0 and federated Universes, the “Data Federator 3.0 BW Connector Guide” simply states the following:

“Data Federator appears in SAP BusinessObjects BI 4.0 only as a backend engine empowering Multi-Source Universe (MSU). In the 4.0 version it is not possible to author target tables as with DF 3.x. In 4.0 the DF BW connector is much the same as the 3.0 version. However, it is mainly aimed at supporting multi-source universe (e.g., BW and a relational data source) and not to allow mass data access on a single BW source (also there is no Data Federator Designer in BI 4.0).”

“In BI 4.0 the recommended access methods for Webi to BW for mass access data is BICS. Other clients such as Analysis OLAP or Analysis Office are also available to access BW with performance. A careful choice of the client based on business needs is recommended.”

Based on my firsthand experience, the above statement correlates with what I see when comparing Web Intelligence reports utilizing BEx via BICS to Web Intelligence reports utilizing an IDT multi-source universe. During a recent project we discovered that an IDT multi-source universe based on BW sales order data required almost 10 minutes to refresh. There were two 2,000,000+ row InfoProviders and several other InfoProviders involved in the query. In comparison, the same query utilizing BEx via BICS refreshed in less than 15 seconds. In tracing the IP traffic, we discovered that over 550MB of data was transferred from BW to the DF engine using JCo. We also noticed a huge spike in CPU utilization on the BOE Java process that hosted the DF engine. With the BICS option, by comparison, only 890KB of data was transferred and there was little or no CPU utilization on the processes that manage this option within BOE. While we could have looked at ways to optimize the DF option, I would still expect the BICS option to be faster and more scalable until more enhancements are made between BW and DF.

A similar problem can occur with COUNT() and AVERAGE() aggregations. The DF engine cannot push these functions down to the BW Aggregation Engine in the same way that Web Intelligence does with standard RDBMS queries; therefore it will read the contents (or a portion of the contents) of the data into the DF engine and then perform the calculation. In some cases it appears that DF can use the source statistics to help with these functions, but depending on the complexity of the query, it might be difficult for DF to choose the most optimal route. Unless the net result of all InfoProvider data sets is very small (< 100,000 rows), the DF engine will not be able to handle the queries quickly or efficiently. Please refer to page 7 of the “Data Federator 3.0 BW Connector Guide” for additional limitations.
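The cost difference between a pushed-down aggregation and one computed inside the federation engine can be sketched as follows. The data and function names are hypothetical stand-ins, not DF internals; the point is simply how many rows cross the wire in each plan:

```python
# Contrast between a pushed-down COUNT and a locally computed one.
# Hypothetical rows standing in for an InfoProvider's contents.

rows = [{"region": "EMEA", "sales": s} for s in range(1, 100001)]

def pushed_down_count():
    # Push-down: the source returns a single pre-aggregated row,
    # so only one row is transferred to the federation engine.
    transferred = [{"cnt": len(rows)}]  # computed at the source
    return transferred[0]["cnt"], len(transferred)

def local_count():
    # No push-down: every detail row is transferred, then counted
    # inside the federation engine.
    transferred = list(rows)  # all 100,000 rows cross the wire
    return len(transferred), len(transferred)

cnt_a, moved_a = pushed_down_count()
cnt_b, moved_b = local_count()
assert cnt_a == cnt_b  # same answer either way; only the cost differs
print(f"push-down moved {moved_a} row(s); local count moved {moved_b}")
```

Both plans return the same count; the second one simply pays for it in data transfer and engine memory, which is the bottleneck described above.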

There are several excellent statements within the guide. Of note, the guide states the following: “How a specific Web Intelligence request will perform with a DF BW connector is a case-by-case problem. It requires an analysis of the way the DF Query Server will execute the SQL, the number of functions pushed to the aggregation engine and the overall data volume that will have to be transferred between BW and the DF Query Server.” Based on my experience, this is a profound statement. As opposed to creating ad hoc-friendly universes, developers are typically required to create highly specialized universes that optimize the DF queries through a series of tightly controlled, hand-coded statements. In other cases they will have to constantly use the DF Administration Tool to optimize the ad hoc queries of users. As you can imagine, this creates a situation where the federated data source can only be accessed by developers with intimate knowledge of the underlying structure of the Universe. It also places a burden on DF administrators to constantly tweak the DF settings and statistics. It’s also worth noting that, in some cases, there are simply no ways to optimize the queries, due to the limitations between DF and BW and the nature of the queries that ad hoc users tend to create.

For the sake of argument, I want to add that Web Intelligence itself has options for joining data from one or more Universes. This option, joining BW and non-BW data, requires that developers leverage data synchronization within Web Intelligence to merge the data. However, this solution creates a bottleneck similar to the one described above (using DF to merge the data). The difference now is that the Web Intelligence report server will have to perform the aggregations and store the data sets within its engine. It’s also worth noting that in order to synchronize data from two Web Intelligence queries, the merged dimension values must be identical in both type and output. If there are any type or value differences, Web Intelligence will not be able to resolve them.
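The merged-dimension constraint is easy to demonstrate in miniature. The sketch below uses hypothetical data (Web Intelligence does not expose this logic as code): a zero-padded BW key stored as text never matches a numeric relational key until both sides are normalized to one type and format:

```python
# Sketch of the merged-dimension constraint: values only line up when
# both queries emit the dimension with an identical type and format.
# Hypothetical data; not Web Intelligence internals.

bw_rows  = [{"material": "000123", "qty": 10}]    # BW key: zero-padded text
sql_rows = [{"material": 123, "revenue": 50.0}]   # relational key: numeric

def merged_values(a, b, dim):
    # Values that would line up in a merged dimension.
    return {r[dim] for r in a} & {r[dim] for r in b}

# Raw values never match: "000123" (str) != 123 (int).
assert merged_values(bw_rows, sql_rows, "material") == set()

# Normalizing both sides to one type and format restores the merge.
for r in bw_rows:
    r["material"] = int(r["material"])
print(merged_values(bw_rows, sql_rows, "material"))  # keys now align
```

In practice this normalization has to happen in the Universe or the query definitions, since Web Intelligence itself will not reconcile the differing types for you.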

Now that we have discussed a specific example of where federating multiple sources (or in this case multiple InfoProviders) can be problematic, I want to discuss in general terms how a similar situation can occur for non-SAP BW sources. Some of the same discussion points from the “Data Federator 3.0 BW Connector Guide” are true for other data sources as well. However (in my opinion), Data Federator seems to work much better with non-SAP BW sources in comparison. When you federate sources into a single logical data view, you have to pay close attention to how many rows of data are returned from each source. If each disparate source returns 250,000 records, then the federation engine will have to execute the joins of the two result sets within its engine. Depending on the hardware configuration of the server managing the request, this process might yield slow performance. In order for this process to be optimal, fewer results must be returned from each source. This can be achieved in several ways. If the Universe and Web Intelligence report query produce DF SQL code that contains WHERE and GROUP BY clauses, Data Federator will try to “push down” and translate these clauses to each source database table. If the results of each “push down” operation contain a few thousand records (post filtering and aggregating), then the DF engine will be able to manage them much faster. With this in mind, the art of optimizing DF queries requires constant analysis and tweaking to make sure that each Web Intelligence query maximizes “push down” operations or other DF optimization features. Data Federator has several optimization options that you can review in the “Data Federation Administration Guide” and the “Data Federator Product Guide”. Based on my experience this can complicate the Universe design process and the ad hoc reporting experience, because not every DF function can be translated to the original source DB.
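The push-down mechanic described above can be sketched as follows. Table names, columns and the planner function are all hypothetical (this is not the DF query planner); the sketch shows how translating the WHERE and GROUP BY clauses into each source's SQL shrinks what the federation engine must join:

```python
# Sketch of filter/aggregation push-down: translate the federated
# query's WHERE and GROUP BY into each source's SQL so only small,
# pre-aggregated result sets reach the federation engine.
# Hypothetical tables and data; not the DF query planner.

def build_source_sql(table, year):
    # The clauses below are what a planner would "push down" per source.
    return (f"SELECT region, SUM(sales) AS sales "
            f"FROM {table} WHERE year = {year} GROUP BY region")

def run(sql, data, year):
    # Stand-in for a source database executing the pushed-down SQL.
    out = {}
    for r in data:
        if r["year"] == year:
            out[r["region"]] = out.get(r["region"], 0) + r["sales"]
    return [{"region": k, "sales": v} for k, v in sorted(out.items())]

source_a = [{"region": "EMEA", "year": 2012, "sales": 5}] * 250000
source_b = [{"region": "EMEA", "year": 2012, "sales": 7}] * 250000

a = run(build_source_sql("dw.orders", 2012), source_a, 2012)
b = run(build_source_sql("erp.invoices", 2012), source_b, 2012)

# The federation engine now joins a handful of aggregated rows
# instead of 500,000 detail rows.
merged = [{"region": x["region"], "sales": x["sales"] + y["sales"]}
          for x in a for y in b if x["region"] == y["region"]]
print(merged)
```

When a clause cannot be translated to a given source (the flat-file case below being the extreme), that source falls back to shipping detail rows, and the join cost lands back inside the federation engine.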
In most cases, if a single aggregation function cannot be translated to the source, then DF will request an abundant amount of data from the source table. As stated above, this is sometimes the case when utilizing DF to access multiple BW InfoProviders. If the report utilizes the aggregate function COUNT(), then DF has the potential to request most of the InfoProvider data and then count the rows within its engine. While this specific example might not be an issue for an Oracle or MS SQL source, you can anticipate that other functions or SQL statements will yield similar results. Take, for example, the utilization of flat files within DF. Because flat files have no mechanism for filtering or aggregating, DF will have to read their entire contents into its engine to federate the data with other sources. Flat files are the extreme scenario, but hopefully you can see how different sources or types of data can each present different challenges when optimizing a multi-source Universe.

If you are looking to provide a solution that integrates large volumes of data from different sources into a single Universe, there are (in my opinion) two optimal and scalable solutions available. The first is to use an ETL tool to read data from ECC / R3 and other data sources and then load it into a single RDBMS Data Mart or Data Warehouse. The second is to use BW tools and Data Services to load and model all required data into a single instance of BW. If you are utilizing BusinessObjects 4.0, you can bind directly to BW BEx queries and bypass the Universe entirely utilizing BICS. In either scenario you are physically moving and transforming the data from its sources and then integrating it into a single physical source. In short, the standard Enterprise Information Management (EIM) process will once again prove to be the most optimal and scalable solution.

In closing I want to state that the purpose of this article is not to discourage the use of a multi-source (IDT) Universe or Data Federator. The successes or failures of implementing solutions based on real-time federation are very situational. It depends on many factors such as the data source, the reporting requirements and the volume of data processed. Hopefully you will have great success implementing DF solutions within your organization, but you should be aware of how the services operate. Please feel free to post comments or suggestions.


27 comments

Thanks, very helpful article. Validated some concerns on amount of data (records) returned as well as with transformation logic slowing performance. Curious, would you say the same about the full Data Federator tool or is this specific to federation capabilities within IDT?

Based on my experiences and information from SAP, there are no major differences between the Data Federator 3.x engine and the Data Federator engine embedded in IDT multi-source universes as it relates to BW and JCo. However, the full version has many more tools and features.

Great post Jonathan. Question — I am attempting to use DF 4.0 against a DSO because of some BICS limitations that are not going to be solved (a long, different story). I notice that it actually performs great, unless I want to bring in metadata (such as texts or attributes); then it stupidly executes 2+ queries for each of the tables involved (text and master data, for example). I do not understand why it will not do these joins in BW!? A huge cost for such a simple referential join! Is this a misconfiguration or simply an architectural constraint in how we can use DF?

I’m basically just wanting some DSO data, similar to how you’d hit a flat table in Oracle, but with the ability to use the master data without the DF overhead of doing joins that should really be done in BW.

I think I figured out what I was doing wrong. Once I computed cardinalities on all related tables of the catalog of my DSO, DF started performing much better, I’m assuming this is some semi-join magic or DF Facade magic at work with its execution plan.

Assuming that you have reviewed all the optimization options linked within the article (semi-join, updated stats, etc.), this is about as good as it will get for large data sets. The text and master data require joins that will impact the DF push-down process. One thought (an ugly one) is to try to model all the needed data into a single BW InfoProvider or a physical table in the database. I’m not sure of the overall benefits or negatives of this work-around, but it might make an interesting test.

I’ll just need to know that computing cardinality in DFAT is part of the workflow before proceeding with the UNX modeling in IDT. That was my main disconnect. I may go the route of doing it all in an InfoSet at the App Server level since the data is flattened in my case, or in a Function Module wrapper, as the case I’m working on is very specific anyway.

Excellent article! I created a Data Federator connection where the source is an InfoCube, but the report had bad performance. After reading your article, I can understand the difference between a BICS connection and a Data Federator connection.

Hi Jonathan — Very good R&D. I have one issue in IDT: I created a single-source connection on top of DB2, but it is not showing some of the schemas, even though I am able to see those schemas through a multi-source connection on top of DB2. Do you know why they are not showing on the single-source connection?

Hi,
Some really useful information here, though it’s from 2012. Do you know whether the current DF release can do more with actually pushing down joins to the source, or is the mechanism still substantially the same?

Jonathan, actually, I can’t immediately narrow it down. I’m trying to compare what different DV products (Denodo, SAP DF, IBM, etc.) are able to do in this area. I’m sure the actual capabilities of the underlying database and the capabilities as known to DF are factors in what can be pushed to where. Can you point me at anything authoritative and publicly available that discusses this? Google is actually kinda letting me down today. Thanks, -A

If we’re talking about the capabilities embedded in the IDT universe and the BI 4.X platform, there have been no enhancements to my knowledge.

The standalone version of Data Federator, at last check, remains at version 3.X. SAP has embedded a feature within SAP HANA called Smart Data Access that performs a similar function to Data Federator, although it does not contain all of the nice tuning options inherent to Data Federator 3.X.

I am using IDT 4.1. I have two tables from two different databases and would like to join them, but the data types of the join columns differ: one is a string and the other is a number. Please advise how we can join these two. I tried using CAST() but I am getting a database error (WIS 10901).