I can run this same query in our Prod environment and it returns in less than a minute, even though it is running on approximately 200k more records than QA.

Total volume is under 1.1 M records in SalesOrders.Allocation_Deliveries and 129 k records in SalesOrders.Material_Attributes. These are small datasets.

I compared the Explain plans in both environments and there is a stark difference in the estimated spool volume in the first join step. The estimate in Production is on the money, while the estimate in QA is off by an order of magnitude. However, the data and tables/views are identical in both systems, we have collected stats in every conceivable manner, and the table demographics in both systems are identical.

Lastly, this query has always returned in under a minute in all environments, including QA, and it still does in Production. This slow behavior is recent, appearing within the last week or so. I discussed it with our DBA and we have had no changes to software or configuration. He is new and seems to know what he's doing, but he is still getting caught up with a new environment.

I am looking for some pointers on what to check next. I have compared the relevant table/view definitions across QA and Prod and they are identical. The table demographics in each system are also the same (I went through these with our DBA to make sure).

Any help is appreciated. Thanks in advance.
Pat

This is the Explain plan from QA. Note the very low estimate in Step 5 (144 rows). In Prod, the same Explain shows > 1 M rows, which is close to what I would expect.

Explain select a.material
, b.season
, b.theme
, b.collection
from SalesOrders_view.Allocation_Deliveries a
inner join SalesOrders_view.Material_Attributes_cur b
on a.material = b.material;
1) First, we lock SalesOrders.Allocation_Deliveries in view
SalesOrders_view.Allocation_Deliveries for access, and we lock
SalesOrders.Material_Attributes in view SalesOrders_view.Material_Attributes_cur for
access.
2) Next, we do an all-AMPs SUM step to aggregate from
SalesOrders.Material_Attributes in view SalesOrders_view.Material_Attributes_cur by way
of an all-rows scan with no residual conditions
, grouping by field1 ( SalesOrders.Material_Attributes.material
,SalesOrders.Material_Attributes.season ,SalesOrders.Material_Attributes.theme
,SalesOrders.Material_Attributes.theme ,SalesOrders.Material_Attributes.af_grdval
,SalesOrders.Material_Attributes.af_stcat
,SalesOrders.Material_Attributes.Material_Attributes_SRC_SYS_NM). Aggregate
Intermediate Results are computed locally, then placed in Spool 4.
The size of Spool 4 is estimated with high confidence to be
129,144 rows (41,713,512 bytes). The estimated time for this step
is 0.06 seconds.
3) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by
way of an all-rows scan into Spool 2 (all_amps), which is
redistributed by the hash code of (
SalesOrders.Material_Attributes.Field_9,
SalesOrders.Material_Attributes.Material_Attributes_SRC_SYS_NM,
SalesOrders.Material_Attributes.Field_7, SalesOrders.Material_Attributes.Field_6,
SalesOrders.Material_Attributes.theme, SalesOrders.Material_Attributes.theme,
SalesOrders.Material_Attributes.season, SalesOrders.Material_Attributes.material)
to all AMPs. Then we do a SORT to order Spool 2 by row hash
and the sort key in spool field1 eliminating duplicate rows.
The size of Spool 2 is estimated with low confidence to be
129,144 rows (23,504,208 bytes). The estimated time for this
step is 0.11 seconds.
2) We do an all-AMPs RETRIEVE step from SalesOrders.Material_Attributes in
view SalesOrders_view.Material_Attributes_cur by way of an all-rows scan
with no residual conditions locking for access into Spool 6
(all_amps), which is redistributed by the hash code of (
SalesOrders.Material_Attributes.material, SalesOrders.Material_Attributes.season,
SalesOrders.Material_Attributes.theme, SalesOrders.Material_Attributes.theme,
SalesOrders.Material_Attributes.Material_Attributes_SRC_SYS_NM,
SalesOrders.Material_Attributes.Material_Attributes_UPD_TS, (CASE WHEN (NOT
(SalesOrders.Material_Attributes.af_stcat IS NULL )) THEN
(SalesOrders.Material_Attributes.af_stcat) ELSE ('') END )(VARCHAR(16),
CHARACTER SET UNICODE, NOT CASESPECIFIC), (CASE WHEN (NOT
(SalesOrders.Material_Attributes.af_grdval IS NULL )) THEN
(SalesOrders.Material_Attributes.af_grdval) ELSE ('') END )(VARCHAR(8),
CHARACTER SET UNICODE, NOT CASESPECIFIC)) to all AMPs. Then
we do a SORT to order Spool 6 by row hash. The size of Spool
6 is estimated with high confidence to be 129,144 rows (
13,430,976 bytes). The estimated time for this step is 0.08
seconds.
4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 7 (all_amps), which is built locally
on the AMPs. Then we do a SORT to order Spool 7 by the hash code
of (SalesOrders.Material_Attributes.material, SalesOrders.Material_Attributes.season,
SalesOrders.Material_Attributes.theme, SalesOrders.Material_Attributes.theme,
SalesOrders.Material_Attributes.Field_6, SalesOrders.Material_Attributes.Field_7,
SalesOrders.Material_Attributes.Material_Attributes_SRC_SYS_NM,
SalesOrders.Material_Attributes.Field_9). The size of Spool 7 is estimated
with low confidence to be 129,144 rows (13,301,832 bytes). The
estimated time for this step is 0.05 seconds.
5) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of an
all-rows scan, which is joined to Spool 7 (Last Use) by way of an
all-rows scan. Spool 6 and Spool 7 are joined using an inclusion
merge join, with a join condition of ("(material = material) AND
((season = season) AND ((theme = theme) AND ((theme =
theme) AND (((( CASE WHEN (NOT (af_grdval IS NULL )) THEN
(af_grdval) ELSE ('') END ))= Field_6) AND (((( CASE WHEN (NOT
(AF_STCAT IS NULL )) THEN (AF_STCAT) ELSE ('') END ))= Field_7)
AND ((Material_Attributes_SRC_SYS_NM = Material_Attributes_SRC_SYS_NM) AND
(Material_Attributes_UPD_TS = Field_9 )))))))"). The result goes into Spool
8 (all_amps), which is duplicated on all AMPs. The size of Spool
8 is estimated with low confidence to be 144 rows (5,616 bytes).
The estimated time for this step is 0.04 seconds.
6) We do an all-AMPs JOIN step from Spool 8 (Last Use) by way of an
all-rows scan, which is joined to SalesOrders.Allocation_Deliveries in view
SalesOrders_view.Allocation_Deliveries by way of an all-rows scan with no
residual conditions. Spool 8 and SalesOrders.Allocation_Deliveries are
joined using a single partition hash_ join, with a join condition
of ("SalesOrders.Allocation_Deliveries.material = material"). The result goes
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 3,858
rows (146,604 bytes). The estimated time for this step is 0.44
seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.70 seconds.

Here is what the record distribution looks like, along with the SQL I used to generate the result set:

SELECT HASHAMP(HASHBUCKET(HASHROW(material))) AS "AMP#",
       COUNT(*)
FROM SalesOrders_view.Material_Attributes_cur
GROUP BY 1
ORDER BY 2 DESC;
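
To quantify the skew rather than eyeball the per-AMP counts, the same check can be condensed into a summary. This is just a sketch against the same view; comparing the lightest and heaviest AMPs against the average makes any skew obvious at a glance:

SELECT COUNT(*)    AS amp_count,
       MIN(row_ct) AS min_rows_per_amp,
       MAX(row_ct) AS max_rows_per_amp,
       AVG(row_ct) AS avg_rows_per_amp
FROM (
    SELECT HASHAMP(HASHBUCKET(HASHROW(material))) AS amp_no,
           COUNT(*) AS row_ct
    FROM SalesOrders_view.Material_Attributes_cur
    GROUP BY 1
) AS per_amp;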

If the EXPLAIN plan is different then something between the QA and Production environments is different. Can you share the EXPLAIN plans, view definitions, and table definitions (I don't care if you rename the columns)? What does the data distribution look like for the columns participating in the INNER JOIN? Keep in mind these columns have to be on the same AMP to join. If there is a redistribution and skew in the data, you have found your problem.
– Rob Paller, Jan 8 '13 at 19:44

Rob, I added the Explain Plan from QA. Had to change the names but they match
– apesa, Jan 8 '13 at 19:54

Also, I used WinMerge to diff the two Explain plans and the only differences are in the row counts and size estimates. They are all accurate when compared against the actual table records or what is expected in the join. The only wildly off value is in Step 5 (in QA only), where it shows 144 rows. That should be upwards of 1 M rows, as it is in Prod.
– apesa, Jan 8 '13 at 20:14

Are your PROD and QA systems running the same version and same size? What do these three queries tell you: select * from dbc.dbcinfo; select count(distinct nodeid) as Number_of_nodes from dbc.resusagescpu; select hashamp()+1 as Number_of_amps; Just a thought, Rob is a good source!
– BellevueBob, Jan 8 '13 at 22:15

This diagnostic (DIAGNOSTIC HELPSTATS), when run in conjunction with the EXPLAIN command, will produce a list of recommended statistics that may be beneficial to the optimizer in producing the lowest-cost query plan. This may yield no difference, or it may point to something that is different between the environments (data or otherwise).
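
For example, the diagnostic is enabled per session and the query is then explained as usual; the recommended COLLECT STATISTICS statements appear as comments at the end of the plan output:

DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT a.material, b.season, b.theme, b.collection
FROM SalesOrders_view.Allocation_Deliveries a
INNER JOIN SalesOrders_view.Material_Attributes_cur b
    ON a.material = b.material;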

Views and JOIN conditions

Based on your EXPLAIN plan, one or both of the views in the SalesOrders_View database appear to be using an EXISTS clause. This EXISTS clause relies on a COALESCE condition (or explicit CASE logic) to accommodate a comparison between a column in one table that is defined as NOT NULL and a column in another table that allows NULL values. This can affect the performance of that join.
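
You can see this in the QA plan above: the Spool 6 redistribution wraps af_stcat and af_grdval in CASE expressions that substitute '' for NULL, which is exactly what a COALESCE in the view compiles to. As a purely hypothetical sketch (your actual view text was not posted), the pattern looks something like:

-- Hypothetical illustration only.
-- Comparing a NOT NULL column to a nullable one via COALESCE
-- produces the CASE ... ELSE '' expressions seen in Spool 6
-- and can change the join plan.
SELECT b.material
FROM SalesOrders.Material_Attributes b
WHERE EXISTS (
    SELECT 1
    FROM SalesOrders.Material_Attributes m
    WHERE m.material = b.material
      AND COALESCE(m.af_stcat, '')  = COALESCE(b.af_stcat, '')
      AND COALESCE(m.af_grdval, '') = COALESCE(b.af_grdval, '')
);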

Data Distribution

Your distribution results appear to be from the PRODUCTION environment (based on the number of AMPs and the number of rows shown on the AMPs with the highest and lowest row counts). What does that look like for QA?

Edit - 2013-01-09 09:21

If the data was copied from Prod two months ago it may seem silly to ask, but were the statistics recollected afterward? Stale statistics on top of replaced data could lead to the variance in the query plan between the environments.

Are you collecting PARTITION statistics on your tables even if they are not PPI tables? This helps the optimizer with cardinality estimates.
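
For example, using the object names from your query (legacy syntax; adjust for your release):

-- Recollect on the join column after the data copy:
COLLECT STATISTICS ON SalesOrders.Material_Attributes COLUMN (material);
COLLECT STATISTICS ON SalesOrders.Allocation_Deliveries COLUMN (material);

-- PARTITION statistics help cardinality estimates even on non-PPI tables:
COLLECT STATISTICS ON SalesOrders.Material_Attributes COLUMN (PARTITION);
COLLECT STATISTICS ON SalesOrders.Allocation_Deliveries COLUMN (PARTITION);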

Are you the only workload running on the QA system?

Have you looked at the DBQL metrics to compare CPU and IO consumption for the query in each environment? Look at IO Skew, CPU Skew, and Unnecessary IO metrics as well.
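
A starting point for that comparison, assuming query logging is enabled (DBC.QryLogV on current releases, DBC.QryLog on older ones); the skew factor compares the busiest AMP against the per-AMP average:

SELECT StartTime,
       FirstRespTime,
       AMPCPUTime,
       TotalIOCount,
       NumOfActiveAMPs,
       MaxAMPCPUTime * NumOfActiveAMPs
           / NULLIFZERO(AMPCPUTime) AS cpu_skew_factor
FROM DBC.QryLogV
WHERE QueryText LIKE '%Allocation_Deliveries%'
ORDER BY StartTime DESC;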

Do you have delay throttles in place in the QA environment that may be delaying your workload? This will give you the perception that it is taking longer to run in the QA environment when in fact the actual CPU consumption and IO consumption are the same between QA and PROD.

Do you have access to Viewpoint?

If so, have you looked at your query using the My Queries and/or Query Spotlight portlets to observe its behavior?

Do you know which step in the query plan is the most expensive or time-consuming? Viewpoint Rewind with either portlet I mentioned, or step logging in DBQL, can show you this, as in the sketch below.
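
A sketch of the DBQL step approach, assuming step-level logging has been enabled (BEGIN QUERY LOGGING WITH STEPINFO ...) and using the column names exposed by DBC.QryLogStepsV:

SELECT s.StepLev1Num,
       s.StepName,
       (s.StepStopTime - s.StepStartTime) DAY(4) TO SECOND AS step_elapsed,
       s.EstRowCount,
       s.RowCount
FROM DBC.QryLogStepsV s
JOIN DBC.QryLogV q
  ON  q.ProcID  = s.ProcID
  AND q.QueryID = s.QueryID
WHERE q.QueryText LIKE '%Allocation_Deliveries%'
ORDER BY s.StepLev1Num;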

Are the DBS Control settings between the environments identical? Ask your DBA to look at this. There are settings in there that can affect the join plans that are used by the optimizer.

In the end, if the data, table structures, indexes, and statistics are the same on two systems whose hardware and TDBMS patch levels are identical, you should not get two different EXPLAIN plans. If that ends up being the case, I would suggest contacting the GSC and getting them involved.

Thanks Rob. The distribution results are in fact from QA and, when summed, equal the total row count of that table in QA. The number of AMPs and nodes is the same in both QA and Prod. I will post the view definition. Stay tuned for the view def and the result set from your SQL.
– apesa, Jan 9 '13 at 0:56

I ran the Explain with the Diag Helpstats in front and it came out the same as before, so I did not repost. I ran it in both systems; neither came out with any recommended stats.
– apesa, Jan 9 '13 at 1:00

Lastly, I went through the views. The first view is just a view on the base table with nothing else. The second view uses a MAX() on the load timestamp to produce a "Delta" dataset. I am using COALESCE on two fields that may have NULLs, but those fields are not used in the inner join. The inner join is simply using "Material", and that is never NULL. I verified that there are no NULLs in the data.
– apesa, Jan 9 '13 at 4:55
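
For context, a current-version ("_cur") view built with MAX() on a load timestamp, as described above, typically follows this shape. This is a hypothetical reconstruction using the column names from the EXPLAIN (the real view text was not posted); the grouped subquery is what produces the aggregate step (Spool 4) and the inclusion merge join in the QA plan:

-- Hypothetical reconstruction of the "latest row per key" pattern.
REPLACE VIEW SalesOrders_view.Material_Attributes_cur AS
SELECT b.material, b.season, b.theme, b.collection,
       b.af_grdval, b.af_stcat,
       b.Material_Attributes_SRC_SYS_NM, b.Material_Attributes_UPD_TS
FROM SalesOrders.Material_Attributes b
WHERE (b.material, b.season, b.theme, b.collection,
       COALESCE(b.af_grdval, ''), COALESCE(b.af_stcat, ''),
       b.Material_Attributes_SRC_SYS_NM, b.Material_Attributes_UPD_TS) IN (
    SELECT material, season, theme, collection,
           COALESCE(af_grdval, ''), COALESCE(af_stcat, ''),
           Material_Attributes_SRC_SYS_NM,
           MAX(Material_Attributes_UPD_TS)
    FROM SalesOrders.Material_Attributes
    GROUP BY 1, 2, 3, 4, 5, 6, 7
);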

The point of note is that between Prod and QA there are no noticeable differences in configuration or data. In both environments this same SQL ran in less than a minute; now, in QA only, it takes 12 minutes to run. The Explain plans do show a difference, yet no real reason for it. Another FYI: the QA data was copied down from Prod about two months ago, so in this case, with master data, it really is the same data. Lastly, being SAP data, I know it inside and out and understand what to expect in the join result. This is also Prod SQL that is tested and verified to produce correct results.
– apesa, Jan 9 '13 at 5:04

Here is an update. It appears to be data related, although we are unable to pinpoint the actual problem. The slow-running query went away once we deleted and recopied the data back into the table. I combed through the initial data set looking for bad chars, etc., and nothing stuck out. Also, the data that was copied back into the table is the same data, record for record. IDK, one of those challenges.
– apesa, Jan 15 '13 at 23:46