reporting out of SWIS API

We have a reporting platform that we're trying to leverage the solarwinds API to query data from. It uses a homegrown tool that can use an API to ingest data. This works great for availability, cpu, disk, memory and response time statistics.

However, we're running into issues with interface related tables where the queries are timing out and the SWIS service itself starts to consume over 12GB of ram and then doesn't go back down until it is restarted, here is an example of one query:

So I had quite a discovery tonight... and it appears the performance issues we've been seeing is related to the order of the joins in the query... yes I know, the order of inner joins in SQL shouldn't matter, but apparently with SWQL, it does!

Running the below query will cause SWIS to immediately start consuming all available RAM on the server and will never return data:

The tricky part about getting help when a query is timing out is most people with an environment big enough to time out on aren't willing to throw a query in them if they already expect it to bog things down. One trick I've used when I'm testing risky queries is to point it at an APE or additional web server, that way if I make the Info service freak out I can bounce the services quickly and people will never know haha. Every variation of the queries you've given ran fine in my lab with 500 elements and completed in about a second so I don't have enough meat in there to stress test the queries in any meaningful way. To be honest, in SWQL you didn't actually need to do any joins by hand for your report, SWQL already has lots of the typical joins already in place and optimized, I'd be curious how the execution time of the below compares to what you've tried so far. In my lab this runs about 5x faster (comparing 1-2 seconds to 0.2-.4 seconds) but like I said my whole lab is a drop in the bucket compared to your set so you'll have more useful benchmarks than mine.

select n.Caption as DEVICE_NAME

, n.IP AS DEVICE_IP

, 'InPercentUtil' as PARAMETER_NAME

, n.interfaces.Caption AS OUTPUT_DATA3

, n.interfaces.TypeDescription AS OUTPUT_DATA2

, n.Interfaces.Traffic.DateTime AS CHECK_TIME

, n.Interfaces.Traffic.InPercentUtil AS OUTPUT_DATA

from orion.nodes n

where n.Interfaces.Traffic.DateTime > addhour(-1,GETUTCDATE())

To know where the built in joins are and what to call them just watch for the chain link icons in SWQL studio

Starting from nodes

then from interfaces to the traffic history

In most cases I try to build my queries from the table with the fewest entries to the ones with more. You can also apply where conditions as part of the join instead of doing it with the where at the end, for inner joins this should be equal but I do a lot of outer joins and it makes a HUGE difference there because it cuts down the data sets earlier in the process. Thinking about that actually prompted me to look at the SQL translated query you posted earlier and I noticed it had this bit of ugliness

Line 11 seems less than ideal and I can see why it is done, but I think it might be causing us to scan the interfaces table at least one extra time? Could hurt while reporting in larger environments. That's noteworthy because I the only times I can specifically remember having a hard time getting queries to complete is always when I pull in the interfacehistory, I had just figured that since it tends to be larger it was something I had to live with. Seeing this makes me re-evaluate that, bet I could convert all my reports from SWQL to a cleaner SQL execution and things would work faster, or maybe even keep the SWQL but not use the derived percentutils, just pull the real bps from interfacetraffic then manually specify the math to make it a percentage in my query as part of the select... Would have to play with it a few different ways and see how it turns out.

Maybe SWIS is counting on SQL to optimize that but it might be possible to preemptively improve the way the info service handles the interfacetraffic table. In practice, I have never written a report against traffic history where I wasn't also specifying a join or intrinsic link to Interfaces and now I'm thinking there is a possibly avoidable performance penalty with the way its being done.

thanks @mesverrum for your input. I hear what you're saying, but this new query with the join first on the interfacetraffic table works flawlessly everytime where the old one on the nodes table first would fail everytime. My reporting guy is also reporting the same. The interfacetraffic table really was the only one we were having trouble with. We only have about 300 network elements but some of them have tons of interfaces; over 21,000 interfaces spread across 3 pollers with standard polling intervals (9 minutes) So that comes out to about 5-6 data points per hour or just over 126,000 total per hour. I'm with you, I was using our APE to do these queries as I was able to quickly restart swis with minimal impact (if any that I could tell...) Our other poller is in Ireland so I didn't want to use that one!

I didn't know about those (inherited?) joins!?!? I've seen those lines in SWQL studio before but never really knew what they meant or how to use them! I'm definitely going to dig deeper into your query to see how it works - thanks again!

tdanner, I know this isn't official support... but any assistance you can offer would be greatly appreciated. We're really concerned here that the API is just up to the task - we've tried modifying, adjusting what have you and nothing is working. The bottle neck clearly resides with SWIS and we're going to have to revert to straight to SQL connection.

The latest error message from SWIS:

---------------------------

SWQL Studio

---------------------------

The communication object, System.ServiceModel.Security.SecuritySessionClientSettings`1+ClientSecurityDuplexSessionChannel[System.ServiceModel.Channels.IDuplexSessionChannel], cannot be used for communication because it is in the Faulted state.

The extra math you are seeing relates to two things SWIS is doing: providing the illusion that all timestamps in the database are in UTC when in fact not all of them are, and synthesizing in/out percent utilization values out of bps and bandwidth settings. In our experience this is not a major problem for SQL performance. And it certainly shouldn't make SWIS use all the memory and fail!

We need to find out what SWIS is actually doing that is chewing up memory. SWQL Studio has a feature that can help here. Append these two lines to your query:

WITH QUERYPLAN

WITH QUERYSTATS

The first line will cause SWIS to return a blob of XML that describes how SWIS is going to process the query. SWQL Studio will display this XML in a tab at the bottom of the window. On my system, it basically says "I'm going to run this SQL query and just stream the results back."

The second line will cause SWIS to return a table of statistics listing all the operations it performed in executing the query, including the time taken and number of rows processed. This will also be displayed in a tab on the bottom.

However, this only works if the query actually runs to completion. Your queries are failing, so you won't get this data. We will need to modify the query so that it actually completes. I would start by adding "TOP 10" to the SELECT clause. Assuming that completes, repeat with larger numbers.

Thanks tdanner​. I couldn't get those 2 options to work on my query because SWQL studio times out before I get results (I do have the timeout set to 10 minutes I believe). But, it does run when I filter the query down:

tdanner​, this is feeling hopeless now, almost to the point where we're considering abandoning the API... We sort of confirmed this on another call with a tech yesterday but when is the API going to be available for support with our maintenance contracts? I truly believe there is a problem with the SWIS service....

We've increased ram on our pollers to 32GB and have retried all queries on the interfacetraffic tables. None of them complete successfully. Whats different now from 16GB and 32GB of ram is at least this time the SWIS service releases the memory and I get a different error from SWQL studio after it failes. I still get an async_network_io wait from SQL as well.

Another thing I did was look at the SQL query that is being executed on the SQL side and it looks totally bizarre and you can see that its actually performing some math which I suspect is part of the problem. I think I get that the API exposes the inpercentutil field to you and then abstracts you from the math behind which is why I'm seeing it in the query but what is the point if it never works?

So I've been experimenting with comparing dates in order to get data from the last hour with a different method instead of doing the hourdiff. If I do a datetime > adddate('hour',-1,getdate()) the performance improvement is staggering. But this doesn't appear to return data from the last hour like I think it should in my head. I'm probably doing something wrong. maybe tdanner​ you could offer a suggestion?

This is my query thus far, which to be honest, I'm not sure really what its doing as it returns 30 rows per server:

thanks tdanner​, that did work however the query still times out despite performance improvements when querying one node with this method vs the datediff method.

If I let this cook, SQL will go into that async_network_io wait status and suspend the query, meanwhile the SWIS service on the poller continues to eat up RAM and never backs down. Could this be a bug? Is SWIS not capable of handling large queries when SQL clearly can?

So I was doing some troubleshooting in SQL Management studio and I can see the query come through as SolarWinds.InformationService.ServiceV3@domain-Orion. It runs and actually executes rather quickly, doesn't even register as an "recent expensive query" however it does immediately go into an async_network_io wait state and then the SQL server suspends the query. Based on what I read, that particular wait (async_network_io) occurs when SQL is sending the data back to the client faster than the client can ingest it. Not knowing anything about how the SWIS service works, are there technical limitations to SWIS that would cause a large query like this to completely fail? It seems as if our SQL server is not the issue, and is rather a problem with SWIS. Do we need more ram on the poller where we're doing the API call on? We only have 16GB, would 32GB be sufficient? will SWIS continue to use all available RAM and not scale back down? Whats strange is I can do TOP 127,000's all day long and SWIS can handle that, the moment I take it out, it crashes.

Another observation I made is a select * without a top x, will timeout in swql studio like I mentioned and I noticed the SWIS service will keep consuming RAM until there inst any left and will never go back down until it is restarted. Not sure if this is a SWQL studio thing or not because I do get the 2 minute timeout warning from swql studio.

Yeah, that's not specific to swql studio. If you wrote a query that pulls too much data and stick it in the web consoles or a report you can cause the info service to balloon like that. I've learned this through painful experience in client environments. It seems like the info service doesn't actually give up on the query, just the ui gives up on waiting for the response.

And interestingly enough, I was performing some tests earlier and the same query above returns about 126,000 records in about 10 seconds according to swql studio (21,060 interfaces * 6) where 6 is the amount of data points per interface per hour.

If I do that query with just a top 127,000 it completes everytime. If I take the top 127,000 out of there it constantly times out more often than not. Not sure what the deal is there

So in this case a top x is much more computationally efficient than a where by itself, the top command just pulls data until it's done at that max number of entries, the where logic is still evaluating all of the entire history of every interface to see if they match the condition (default settings would have data points going back a year on these). I'm boarding an airplane soon so I don't have a good solution off the top of my head for how to improve the efficiency of your query but hopefully that helps some.

Thank you mesverrum​. That being said, it sounds like implementing a top X wouldn't work to get results from all interfaces in the last hour because its only taking the top X from the table and then doing the hourdiff on those? Basically my question is, does the TOP X happen before the hourdiff or after?

Top x is after the where condition until it hits the cap, then it says it is done. But this thread on the same situation happening in plain sql seems to point toward your including a top statement could impact the execution plan in a way that makes it run more efficiently.

SolarWinds solutions are rooted in our deep connection to our user base in the THWACK® online community.
More than 150,000 members are here to solve problems, share technology and best practices, and directly
contribute to our product development process.
Learn more today by joining now.