Ask Ben: Querying Groups With Conditional Elements

I hav a situation in my academic project, and was struggling to get a proper approach towards the solution, then i decided to ask 'kind-hearted' ppl for help ... while browsing net for forums wid posts with similar situation, i 'stumbled' upon you site. it's a nice and great job u r doin here ben. thanks a ton... do consider to direct me towards the solution for the following situation.

The following table reflects the One-to-Many relationship between TransID and TagID

i would need to know how exactly i should be aproaching the problem to find out solution for following situations:

STRICT MODE:------------

* TransID with TagID (1 AND 4 AND Not 2)

Result : i should get only TransID:3 as the result. (ALL Conditions shud return true, so in the interface the user would have selected Having 1,4 and not having 2).

Relaxed Mode:-------------

* TransID with TagID (1 OR 4 OR Not 2)

Result : i should get TransID:1,2,3 as the result, since each of the TransID satisfies one or other condition. (ANY Condition returning true the TransID qualifies to be returned, and in the interface the user would hav selected Having 1,4 and not Having 2)

Before we get into this, let's start out by building a SQL temp table into which I can insert the test data:

<!---

Create the SQL build scripts so that we can reuse it

within the different CFQuery tags.

--->

<cfsavecontent variable="strBuildSQL">

DECLARE @data TABLE (

trans_id INT,

tag_id INT

);

<!--- Populate data. --->

INSERT INTO @data

(

trans_id,

tag_id

)(

SELECT 1, 1 UNION ALL

SELECT 1, 2 UNION ALL

SELECT 2, 1 UNION ALL

SELECT 2, 2 UNION ALL

SELECT 2, 3 UNION ALL

SELECT 3, 3 UNION ALL

SELECT 3, 4

);

</cfsavecontent>

Now that we have that in place, our @data table should mimic what the user has described (although I have changed the column names to line up more appropriately with my own database naming conventions).

Ok, so let's take a look at the problem. For the Strict mode part, the first thing that jumps to mind is grouping. We want to check out the group of TransIDs and make sure that each group contains either TagID 1 or 4, but definitely not TagID 2. To examine groups in SQL Server, we can use the GROUP BY clause. Then, to examine the contents of each group, we can leverage the fact that the COUNT() aggregate can handle CASE statements. If you remember my exploration of the COUNT() / CASE scenario, you will remember that the COUNT() aggregate does not include NULL values in its tally. Therefore, if we can conditionally include values, we can use the COUNT() of each group to narrow down the selection.

Taking this understanding and applying it to the problem at hand, we want to make sure that each TransID group has at least one count of the TagID 1 or 4 and a zero count of the TagID 2:

<!--- Query for TransID values. --->

<cfquery name="qStrict" datasource="#REQUEST.DSN.Source#">

<!--- Build SQL table and populate. --->

#PreserveSingleQuotes( strBuildSQL )#

<!--- Get the appropriate trans IDs. --->

SELECT

d.trans_id

FROM

@data d

GROUP BY

d.trans_id

HAVING

<!---

Make sure that the group has either

tag ID 1 or 4.

--->

COUNT(

CASE

WHEN

d.tag_id IN ( 1,4 )

THEN

d.tag_id

ELSE

NULL

END

) > 0

AND

<!---

Make sure that the group does not also contain

the tag id, 2.

--->

COUNT(

CASE

WHEN

d.tag_id = 2

THEN

d.tag_id

ELSE

NULL

END

) = 0

;

</cfquery>

<!--- Output out results. --->

TransID: #ValueList( qStrict.trans_id, ', ' )#

Running this SQL, we get the following output:

TransID: 3

This works, but it looks really complicated. We can start to simplify the query if we look at the problem in a slightly different way. We can think of the 1 and 4 requirements as filter and then the NOT 2 as a secondary layer of filter. Let's run a query that includes just the first part of that filtering:

<!--- Query for TransID values. --->

<cfquery name="qStrict" datasource="#REQUEST.DSN.Source#">

<!--- Build SQL table and populate. --->

#PreserveSingleQuotes( strBuildSQL )#

<!--- Get the appropriate trans IDs. --->

SELECT

d.trans_id

FROM

@data d

WHERE

d.tag_id IN ( 1,4 )

GROUP BY

d.trans_id

;

</cfquery>

<!--- Output out results. --->

TransID: #ValueList( qStrict.trans_id, ', ' )#

This gives us the following output:

TransID: 1, 2, 3

Good, the first part works. Now, we want to apply the second filter, not having TagID 2. We can't just add another simple WHERE clause for != 2 since any TagID that is 1 or 4 is ALSO not equal to 2. What we need to do is apply the not 2 to each group of TransIDs. To do this, we are going to add a NOT EXISTS clause:

<!--- Query for TransID values. --->

<cfquery name="qStrict" datasource="#REQUEST.DSN.Source#">

<!--- Build SQL table and populate. --->

#PreserveSingleQuotes( strBuildSQL )#

<!--- Get the appropriate trans IDs. --->

SELECT

d.trans_id

FROM

@data d

WHERE

d.tag_id IN ( 1,4 )

AND

<!---

Make sure that this TransID value does not

exists in a group that ALSO contains a TagId, 2.

--->

NOT EXISTS

(

SELECT

1

FROM

@data d2

WHERE

d2.trans_id = d.trans_id

AND

d2.tag_id = 2

)

GROUP BY

d.trans_id

;

</cfquery>

<!--- Output out results. --->

TransID: #ValueList( qStrict.trans_id, ', ' )#

Running the above code, we get the following output:

TransID: 3

This version works as well. To me, this is somewhat easier to read than the leveraged COUNT() aggregate; however, I am not sure exactly which one is going to be faster. I am sure that it will depend somewhat on the size of the table and on the indexes that are in place.

Now that we have the Strict Matching out of the way, let's take a look at the Relaxed Matching mode. Relaxed matching is much easier as it doesn't really apply to the group level, but rather to the individual records being examined. As such, we can simply use a straightforward WHERE clause to limit each returned TransID and then group by the TransID to make sure we don't get duplicate values:

<!--- Query for TransID values. --->

<cfquery name="qRelaxed" datasource="#REQUEST.DSN.Source#">

<!--- Build SQL table and populate. --->

#PreserveSingleQuotes( strBuildSQL )#

<!--- Get the appropriate trans IDs. --->

SELECT

d.trans_id

FROM

@data d

WHERE

d.tag_id = 1

OR

d.tag_id = 4

OR

d.tag_id != 2

GROUP BY

d.trans_id

;

</cfquery>

<!--- Output out results. --->

TransID: #ValueList( qRelaxed.trans_id, ', ' )#

Running the above code, we get the following output:

TransID: 1, 2, 3

Works fine and as you can see, relaxed mode matching is much easier and faster to perform.

Reader Comments

i.e when TransID with TagID (1 AND 4 AND Not 2) means :TransIDs are expected to have both TagID 1 AND 4 (in the solution you coded, using TagID (1,4), will result in '1 OR 4')

so only when user is searching for TransID with TagID (1 AND 2 AND NOT 3) only the TransID: 1 satisfies the condition and should be returned ...

i am using a query like

SELECT TransID FROM Trans_Tags WHERE TransID IN (SELECT TransID in Trans_Tags WHERE TagID = 1) ANDWHERE TransID IN (SELECT TransID in Trans_Tags WHERE TagID = 2)ANDWHERE TransID NOT IN (SELECT TransID in Trans_Tags WHERE TagID = 3)

will this query run efficiently in a huge tables with say a million records ?! or could you suggest me any better approach ?!