Linked servers: a case for global temporary tables

I'm not a SQL Server historian, so I don't know exactly when the global temporary table was first introduced into SQL Server. The feature may have been around since the very genesis of SQL Server. But I know for sure that it has been there since I started dealing with SQL Server in the early 90s.

The odd thing about the global temporary table is that although we understand how it behaves, it has not been easy to find a practical scenario where its use is just perfect. I'd like describe a scenario where I think it is perfect to use a global temporary table. But your opinion may vary, and I'd very much like to hear if you think otherwise or have a more fitting scenario.

Here is the case.

You have a complex query that involves joins between a local temporary table(s) and a remote permanent table(s). In addition, the local temporary table is not small, but nor is it very large. The remote permanent table, on the other hand, is very large. Moreover, the query resultset is relatively small. Finally, this query may be part of a larger stored procedure.

To simplify exposition, let's say the local server is ServerA and the remote server is ServerB. To guarantee performance stability, clearly it's better to process the query on the remote server and send back the resultset. One approach is to enclose the query in a string and send it through openquery to the other server for processing. The problem is that ServerB cannot see the local temporary table on ServerA, and this is where a global temporary table becomes useful. If we turn the local temporary table on ServerA into a global one, we can then pull the data from that global temporary table into a local temporary table on ServerA, perform the join on ServerA with all the data locally stored there, and dump the result into another global temporary table on ServerB. And finally from ServerA we can pull the data from the global temporary table that has the query result.

So why does this approach make sense? First, we are only sending a small amount of data across the servers (e.g. the amount of the data in the local temporary table(s) on ServerA). Second, we are essentially sending the logic from ServerA to ServerB and have the logic processed on ServerB where the large table(s) is. The more complex the query is, the better this approach is. And the larger the table on ServerB is, the better this approach is. It's even better if the query originally joins multiple local temporary tables on ServerA or multiple large tables on ServerB. The problem with joining multiple local temporary tables and large remote tables is that the SQL Server optimizer can get the execution plan terribly wrong, and the chance for getting the plan wrong is not even remote.

As a T-SQL programming pattern, the query processing on ServerB can be turned into a generic stored procedure on ServerB regardless of how the query may look like or how many tables may be involved. You only need to call the proc from ServerA, supply the join clause and the necessary local temporary table name(s) as the parameter, and receive the result from ServerB.

You may argue that the whole design at the higher level should be re-done so that we don't get into the situation where we have to join local temporary tables with large remote tables. That is a compelling argument, but that is quite a different story, and in many real world cases you may not have that choice.

Again, I'd very much like to hear your opinions on this approach, and if you have other scenarios where a global temporary table makes sense, let us know as well.

Comment Notification

Comments

Sorry Linchi, but I don't like it. Unless we are talking about a system where there is absolutely no concurrency... I would just make the data required by the remote system more accessible (avoiding the temp table altogether). If that means a filtered index on an existing table, or actually creating permanent tables for this process, both of these sound much less fragile than a global #temp table.

Open Management Studio, and two query windows connected to the same server. In one window, do this:

CREATE TABLE ##foo(ID INT);

INSERT ##foo SELECT 1;

In the second window, do this:

CREATE TABLE ##foo(ID INT);

That seems like a real concurrency killer to me. Or worse, a data pollution problem, if this procedure is called by two different users and the create table checks if the table exists. Yes, it exists, but it already has someone else's data in it. So change the second window to:

IF OBJECT_ID('tempdb..##foo') IS NULL

CREATE TABLE ##foo(ID INT);

INSERT ##foo SELECT 2;

SELECT * FROM ##foo;

There's a pretty good reason you're not going to find many "perfect" or even "remotely practical" places to use a global temporary table.

Sorry about the formatting, I really should have BEGIN / END after the IF (I indented it when I wrote it, but obviously the comment tool does not obey). Anyway the operative word is 'global' - unless you completely serialize access to the global temporary table, everyone can see it.

But that depends on how you name the global temporary tables and whether you follow a common protocol in using them. For instance, you can include the spid number and the server name in the name of the global temporary table so that different spids will not be accessing a global temp table belonging to another spid and joins from the same spid accessing a different remote server will not be sending back results to a wrong table. True, another app can choose to name its global temporary table exactly the name. In a controlled app environment, you can minimize that to an acceptable extent. I'm not making this statement out of just 'hope', but that's from actual real prod experience. In fact, I'm not just making this approach up. It seems to have worked well for some time now in a real environment where re-designing the app to get rid of the complex joins I mentioned in the body of this post was just not practical.

As you said, it's hard to find many (or even any) perfect or remotely practical scenarios for using global temp tables. And that's precisely why I'm curious about how others may see this approach.

Ok, well if you're creating spid-named global ##temp tables, you'll need to be using dynamic SQL, so then why not just create a permanent table? What does the global ##temp table buy you, except that you don't have to run the drop at the end?

What does a global temp table buy is exactly what a local temp table buys you. In addition to not having to drop them explicitly, it's for the same reason you don't convert your local temp tables to permanent tables, i.e. for the concurrency reason and for performance. For what it is worth, MS did put in some effort to make tempdb less a problem when you have to frequently create and drop tables. Try to create and drop tables at a high frequency in a regular user database. The performance benefit is because tempdb does much less logging than a regular user database.

Using permanent tables for scratch space is really not a good idea even not for performance reasons. Note that your proc may crap out in the middle for all sorts of reasons, and you can end up with lots of garbage that can lead to naming collision when you try to create another permanent table.

The benefit most people (while misguided) see from creating a #temp table as opposed to a permanent table, is that several different processes can create #foo without a name collision, so they don't *have* to append some uniqueifier. This isn't true with ##foo obviously, and to do it with global ##temp tables or permanent tables both require dynamic SQL.

So, if the method to create is equally cumbersome, you still haven't explained why a ##temp table is better than a permanent table. Why do you *want* to find a good case to use a ##temp table?

How is a global ##temp table more concurrent than a permanent table? What kind of concurrency are you talking about here? Also, why can't I create permanent tables in tempdb to take advantage of minimal logging? (I can, of course, though your example doesn't seem like it would be the type of throughput where that is really going to make a difference anyway. Correct me if I'm wrong.)

Anyway, I am not a huge fan of local #temp tables either, but I do use them occasionally. The reasons you've outlined are not going to convince me to go change all my code to use global ##temp tables instead of permanent tables. At least with permanent tables if something interrupts the process (failover, disconnection, client redirect in mirroring, etc.) I can start where I left off. If disruptions become an issue, it isn't all that hard to architect a background process that can clean up obsolete tables.

No, it's not that I want to find a good case for a ##tmp table. Rather, it happened that this seemed to be a good case for using ##tmp tables. Th real scenario is that we had a lot of procs that included complex joins with large archive tables and #tmp tables. And then, we wanted to move the archive tables to a separate archive server. So now, these joins became distributed joins if we just prefix the archive tables with the archive server name, and the optimizer could not consistently handle them.

I don't think using permanent tables would work, precisely because they are not cleaned up automatically when things happen. It's not sufficient to have a background process to clean them up because these stored procs are constantly running, and they would immediately fail with naming collision.

If we have construct a CREATE TABLE ##tmp string to dynamically create a ##tmp table with the right name, we do that and I don't see any issue with doing that dynamic SQL at all. What issue do you see there?

I don't see any issue with dynamic SQL. I think you missed by point - I was suggesting that it was the same as permanent tables (e.g. you don't get to avoid dynamic SQL just because you're using a ##temp table - a common misconception), not that it was worse because of dynamic SQL.

I still feel that I would rather explicitly drop my tables (even if it means that it requires a backup process for things that have gone wrong), for the advantage of being able to recover instead of having to re-create and populate. YMMV.

Your naming could still be tied to spid with permanent tables. And why would you have collisions? If you try to create a permanent table named spid<my spid> and you first find that it already exists, it is obviously no longer valid, so you can drop it first. The problems you are pointing out are already pretty easy to solve.

Wait a minute! Having that background process to clean up the permanent tables left behind when a proc craps out in the middle is just not going to be sufficient. And because of that, your critical process may fail. So the difference is really between a smooth operation and a failed app and pissed-off users. That seems to me to be sufficient enough justification for using a temp table (global or local). It's precisely the opposite of the advantage you are talking about. In other words, because of a permanent table, you may not be able to recover from a failure. And if you use tricks such as embedding a random string in the name of the permanent table in order to avoid naming collision and give you time for the cleanup, you then may leave a lot of permanent tables behind that are useless. The point I guess is that because of the durability of permanent tables you just have a much higher probability of naming collision. That leads us back to the very reason of having temp tables in the first place.

BTW, in the time you've invested arguing for this approach, you could have already architected something much better, e.g. make the data available to the other server through other means (mirroring, replication, who knows). Spending this much time worrying about the performance difference between temp and permanent tables (minimal, and certainly shouldn't make or break your app!) and how to avoid naming collisions (easy, as I've already described!) does not seem healthy to me.

If you embed the spid in a permanent table and create it in tempdb and check for its existence before creating it, you would avoid naming collision and take advantage of minimal logging. But then isn't that in itself a global temp table without ## prefix and without the benefit of the system garbage collecting it?

We have different requirements, Linchi; hence "it depends." I actually don't want the system garbage collecting it, so that I can choose whether my process has to start over from scratch. I'll give you a hint - I use a sequence as my identifier, not a spid. This makes it recoverable if I want it to be. I don't see the background collection as quite the burden you do, and I don't use tempdb because my testing really did not reveal any worthwhile performance differences, therefore my permanent table version is not just yours without the ## prefix.

> BTW, in the time you've invested arguing for this approach, you could have already architected something much better, e.g. make the data available to the other server through other means (mirroring, replication, who knows).

Not really. We wanted to avoid duplicating the data via mirroring, replication, etc. In fact, we were replicating prior to the change. But we'd like to point to a single copy of the archive data, which is multi-TB in size.