I haven't got a lot of ETL experience but I haven't found the answer to my question either, although I guess it may be a no-brainer if you've worked with it. We're currently looking into creating a simple data warehouse (simple as in "copy most columns from most tables" and not OLAP-style) and it seems we're leaning towards SQL Server (2008) for a few reasons.

SSIS seems to be the tool for this kind of tasks when it comes to SQL Server, but I can't find anything about how it is affecting the source database cache, if at all, when loading data. Some of our installations are very sensitive performance-wise when it comes to having a usage-style-cache.

But if SSIS runs a "select *"-ish query and the cache is altered, then the performance for the users may degrade to unacceptable levels until it is rebuild from those queries again.

So my question is, does SSIS (or is there a way to avoid) affect the database cache when loading data from a SQL Server database?

Part of the problem is also that the source database could be both an Oracle or SQL Server database, so if there is a way to avoid the cache-affecting part for Oracle, that would be good input as well. (I guess the Attunity connector is the way to go?)

(Some additional info: We have considered plain files as well, but then export-import probably takes longer time than SSIS-transfer? I also guess change data capture is something we'll also look into, so if that is relevant to this question, feel free to include possible issues/benefits.)

2 Answers
2

First off, SSIS doesn't do anything special to avoid the buffer pool, or the plan cache.
Simple test (on a NON-production instance!):

Create a new SSIS package with a single connection manaager, and a single data flow containing one OLE DB Source, pointing to a table, similar to:

Clear the buffer pool, from SSMS: DBCC DROPCLEANBUFFERS

Verify that the cache has been cleared using the glorified dm_os_buffer_descriptors query at the top of this page: I get this:

Run the package

Re-run the query from step (2), and note that the data pages for the table (BOM_PIECE in my example) have been loaded into the cache:

Note that most SSIS components allow you to provide your own query, so if you have a way to avoid the buffer pool (I don't know that this is possible - I'd defer to someone who knows more about it), you could insert that into a query. So in the above example, instead of selecting Table or view in the OLE DB Source, you would select SQL command, or SQL command from variable if your command requires dynamic text.

Finally, I can imagine why you want to eliminate the cache load - but are you sure you want to do this? SQL Server is fairly good at managing memory, and what you're doing is swapping memory load for disk I/O load, which (depending on your use case) may have a negative impact on other users. This question has a discussion on SQL Server caching.

Thanks for the tip! I've tried to run it to see how much my query would affect the buffer, but unfortunately the largest "performance test database" we have, even though it contains a lot of data, it doesn't have the normal usage pattern now that it's inhouse instead of in production at a customer site, so I'm not sure I can draw any conclusions from that. Anyway, I guess we can use backups/transaction shipping/async replication as last resorts to get data from the production db, since they're both SQL servers in this case, so I'll leave it at that for now and start looking at the Oracle case.
–
IgorApr 17 '12 at 12:57

What do you mean "affect the database cache when loading data from a SQL Server database". SQL Server does not cache data, it caches execution plans. The fact that you are using SSIS wont affect your Server (other than the overhead of reading the data of course). Just use a propper transaction isolation level.

Thanks for the pointers! What i meant by "affect the database" is that (I'm not an expert on how databases do things "under the hood" so I could be wrong but) as I've understood the most recent data is held in memory if possible, and when doing a "select *"-ish query, a lot of uninteresting data from a user perspective would be put in memory instead of the recent "user data". But as I understand your answer an SSIS load can avoid that? Do I need a special kind of connection or would any do? (I mean an ODBC connection would probably change the cache?)
–
IgorApr 12 '12 at 13:58

About CDC, I meant that it would make the "select *"-ish queries only need be run the first time, and that subsequent reads would only be a few rows, so that the cache problem would not be relevant when only a few rows are read as opposed to the entire table.
–
IgorApr 12 '12 at 14:00

Ah, but perhaps you meant that the SSIS package is run on the source and not on the destination DB..maybe that is the normal way to do it? We couldn't do it since the source may be an oracle db in our case.
–
IgorApr 12 '12 at 14:11

@Diego: are you sure about SQL Server does not cache data, it caches execution plans? I believe there's a separate plan cache and buffer cache. From this MS page: The buffer manager manages the functions for reading data or index pages from the database disk files into the buffer cache. From this MS page: Most memory used by SQL Server is allocated to the Buffer Pool, [...] SQL Server steals a proportion of this memory for use in caching query plans
–
GeoffApr 12 '12 at 14:14

well, SQL Server maintains a physical pool of memory on the server that keep the data that has been read, but what I actually meant is that on a SSIS solution when you are dealing with a lot of data, you shouldn't really telly on it, right?
–
DiegoApr 12 '12 at 14:26