The Database isn’t Available from Half the Servers – it Must be DB2

I get it. The database server is a single point in many environments that have many points at most levels. Therefore any problem that affects more than one point must be the database.

As DBAs, sometimes we feel the database is under attack. We may therefore develop the knee-jerk reaction “It isn’t the database!” If a DBA is on top of things, and given reasonable power and resources, then it isn’t ALWAYS the database. I make a conscious effort to go into every outage call with the assumption by me that it IS the database, and it is my job to find out how. This approach has me digging a lot deeper than if I came in with the preconceived opinion that it wasn’t the database.

A client called me recently, reporting that only some of their servers were getting SQL300081 with return codes ,11004,. This is a DB2 on Windows environment. Many DBAs know that SQL30081 is a client-side code that basically means “I can’t talk to that database!” It can have such a vast array of causes from DB2 not being started on the database server, to a firewall between the servers, to an incorrect hosts file, and so much more. This technote is often extremely useful in troubleshooting. It has been around in one format or another since I started as a DBA.

Before they even gave me the error message, I was betting that’s what they’d say. But I didn’t say it out loud because I come across as enough of a know-it-all as it is. I dutifully linked through to the Windows message page, and got really nothing out of it. However, once the time frame of the fleeting issue that had already resolved itself by the time I joined the bridge call had been defined, I started parsing through the diag log for that time. I saw this error message:

I counted how many occurrences of this were seen in the half hour in which the problem occurred to the half hour before and the half hour after to make sure it wasn’t just something silly like a user mistyping a password that happened all the time. I found 251 occurrences in the half hour where the problem occurred, and ZERO in either the half hour before or the half hour after.

I’ve been finding the db2diag tool particularly useful, particularly on Windows lately. I used this command to count the occurrences in the problem half-hour:

After reporting this to everyone on a large call, they were able to identify that both the primary and the secondary domain controllers were rebooted at the same time. The USER1 user id is an Active Directory domain ID. All of the application servers that had problems connecting were in the same site as these domain controllers.

In this case, it wasn’t the database, but the DB2 DBA helped identify the root cause of the problem.

Ember is always curious and thrives on change. Working in IT provides a lot of that change, but after 17 years developing a top-level expertise on Db2 for mid-range servers and more than 7 years blogging about it, Ember is hungry for new challenges and looks to expand her skill set to the Data Engineering role for Data Science. With in-depth SQL and RDBMS knowledge, Ember shares both posts about her core skill set and her journey into Data Science.
Ember lives in Denver and work from home for XTIVIA, leading a team of Db2 DBAs.