P-STAT, Inc. Release V223.1
------------HELP ON turf------------------
TURF stands for Total Unduplicated Reach and Frequency.
It is most often used in market research applications.
Last updated: July 9, 2005.
For example, a file has the responses of 1000 cases
on 40 items. Each item is a TV program which the
respondent did or did not watch.
You would like to know which group of 8 programs was
reached by the largest number of respondents.
A respondent is "reached" by a given group of programs if
they watched at least one of the programs in the group.
TURF, to find the best group, evaluates each of the 76.9
million combinations of 8 items taken from a pool of 40,
and writes a file identifying the 100 best combinations.
This takes about a minute on a 2.4 GHz PC.
Weighting can be applied to the cases, or to the response
values, or to the items. The "being reached" criterion,
by default one positive response, can also be raised.
Adding options affects speed. The above one-minute run
takes 12 minutes if case weights are used, and would take
about 30 minutes if the other options are used.
****************************
* reach versus frequency *
****************************
Reach and frequency are different measurements in TURF.
The REACH score for a combination is the number of cases
that have at least N positive responses on the variables
in that combination. N is the reach threshold in use,
which has a default setting of one.
The FREQ score for a combination is, for the reached cases,
normally the count of positive responses on the
variables in the combination. With the response.weights option,
the FREQ score is instead the sum of those response values.
If a case indeed watched all eight programs in a group,
the frequency count for that group would be increased by 8,
but the reach count would only be increased by 1.
If the input were hours watched rather than just watched
or not, the response.weights option would sum the response
values and thereby measure the impact of the combination.
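The difference can be sketched in a few lines of Python. This is an illustration only, not P-STAT code; the function name reach_and_freq and the three viewers are invented for the example.

```python
def reach_and_freq(cases, group, threshold=1):
    """Return (reach, freq) for one group of item positions.

    cases: one list of responses per respondent.
    A case is reached when it has at least `threshold` positive
    responses in the group; freq counts the positive responses
    of the reached cases only.
    """
    reach = freq = 0
    for responses in cases:
        hits = sum(1 for i in group if responses[i] > 0)
        if hits >= threshold:
            reach += 1
            freq += hits
    return reach, freq

# Three hypothetical viewers of a group of eight programs.
cases = [
    [1] * 8,                     # watched all eight
    [1, 0, 0, 0, 0, 0, 0, 0],    # watched one
    [0] * 8,                     # watched none
]
print(reach_and_freq(cases, range(8)))   # (2, 9)
```

The viewer who watched all eight programs adds 8 to the frequency count but only 1 to the reach count.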
**********************************
* features of the TURF command *
**********************************
Allows up to 210 items (i.e., variables).
Allows combinations of items up to size 60.
Allows several combination sizes to be done in one run.
Allows tens of thousands of cases.
Allows weighting of cases.
Allows weighting of items.
Allows weighting of responses; this allows the intensity
of a response to be utilized.
Allows setting a reach threshold of more than one.
Allows forcing designated items into every combination.
Allows limits on how many of a set of items can be
placed in the combinations to be analyzed.
Writes a result file containing the best combinations.
The items within each combination are ordered
by their importance.
Writes a template file for use by the TURF.SCORES command.
Writes (in TURF.SCORES) the reach score for each case.
Takes 3.1 seconds for 1,000 cases on 40 items, 6 at a time
on a 2.4 GHz PC. This evaluates 3.838 million groups.
Runs so rapidly that billions of combinations can be done.
Takes 18.5 minutes for 1,000 cases on 100 items, 6 at a time
on a 2.4 GHz PC. This evaluates 1.192 billion groups.
Note: using all options would take about 9 hours.
Shows the percent of combinations already processed in a
progress window.
Writes a detailed report when the command finishes.
*******************************
* a simple dataset, used *
* in some examples that *
* show various TURF options *
*******************************
The following dataset has a case identifier (case.id), a case weight
(www), and the responses to five variables. The responses are zeros,
ones and two twos. This dataset is used to show various TURF options.
The twos are treated differently from ones only when the
RESPONSE.WEIGHTS option is in use.
-----------------file ddd--------------
case.id www v1 v2 v3 v4 v5
9001 1 1 1 0 0 0
9002 1 1 1 0 0 0
9003 1 1 1 0 0 0
9004 1 1 1 0 0 0
9005 1 1 1 0 0 0
9006 1 1 1 0 0 0
9007 1 0 1 0 0 0
9008 1 0 0 1 0 0
9009 1 0 0 1 0 0
9010 1 0 0 1 0 0
9011 1 0 0 1 0 0
9012 1 0 0 0 1 0
9013 1 0 0 0 1 0
9014 1 0 0 0 1 0
9015 3 0 0 0 0 2
9016 3 0 0 0 0 2
*******************************
* example 1: *
* a TURF run using defaults *
*******************************
turf ddd [drop case.id www], size 3, reach.results rrr $
list rrr $
This command uses PPL (P-STAT Programming Language) to drop the first
two variables, leaving five items for the turf analysis. Using [keep v1
to v5] would have done the same thing.
SIZE 3 tells the command to look at the items in groups of 3.
REACH.RESULTS RRR creates a file named RRR that identifies the best
groups. That file is then listed.
Ten groups will be tested: v1-v2-v3, v1-v2-v4, etc., through v3-v4-v5. We
are looking for the group that has at least one positive response for
the largest number of cases. In other words, which group of three items
REACHES the most cases ?
The best combination uses variables v2, v3, and v4, which reach 14 of
the 16 cases. The FREQ score for that combination is also 14.
V1 is not included because it adds nothing once v2 is selected. V1-v2-v3
has the highest number of responses (17), but does not reach as many
different cases as the group v2-v3-v4. The twos in cases 9015 and
9016 on v5 are treated as ones.
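The result of example 1 can be checked by brute force outside P-STAT. The following Python sketch is an illustration, not the TURF implementation; it hard-codes file ddd and scores all ten groups of three, treating any positive response as one hit. The names DDD, NAMES and score are invented here.

```python
from itertools import combinations

NAMES = ["v1", "v2", "v3", "v4", "v5"]
# File ddd from the text: responses on v1..v5 for the 16 cases.
DDD = (
    [[1, 1, 0, 0, 0]] * 6      # cases 9001-9006
    + [[0, 1, 0, 0, 0]]        # case 9007
    + [[0, 0, 1, 0, 0]] * 4    # cases 9008-9011
    + [[0, 0, 0, 1, 0]] * 3    # cases 9012-9014
    + [[0, 0, 0, 0, 2]] * 2    # cases 9015-9016
)

def score(group):
    """Unweighted (reach, freq) for a group of item positions."""
    reach = freq = 0
    for row in DDD:
        hits = sum(1 for i in group if row[i] > 0)
        if hits >= 1:
            reach += 1
            freq += hits
    return reach, freq

# Rank every size-3 group: descending on reach, then on freq.
results = sorted(
    ((score(g), [NAMES[i] for i in g]) for g in combinations(range(5), 3)),
    reverse=True,
)
(reach, freq), names = results[0]
print(reach, freq, "-".join(names))   # 14 14 v2-v3-v4
```
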
*************************************
* example 2: *
* a TURF run using case weighting *
*************************************
turf ddd [drop case.id ], size 3, reach.results rrr,
case.weights www $
list rrr $
Here, WWW is not dropped because it is needed in the command. We are
seeking the group of three variables that has the largest WEIGHTED
number of reached cases. Cases 9015 and 9016 are the only cases with a
caseweight other than one, so they are the ones affected by case
weights in this example.
The best group is v2-v3-v5, which reaches 17 weighted cases. V2 and V3
provide 11 cases (all of which have unit weights). Adding v4 would
provide 3 more, but adding v5 increases the weighted reach count by 6,
since each of those two cases has a caseweight of three on WWW. The
FREQ score for that combination is also 17.
*****************************************
* example 3: *
* a TURF run using response weighting *
* and a threshold of more than one *
*****************************************
turf ddd [drop case.id www], size 3, reach.results rrr,
response.weights,
reach.threshold 2 $
list rrr $
In examples 1 and 2, the responses to the items were treated in a zero
versus nonzero manner. Using RESPONSE.WEIGHTS causes the actual
response values to contribute to the reach scores. In addition, using
REACH.THRESHOLD 2 causes a case to be reached only when its reach score
for a given group is 2 or more.
The best group in this example is v1-v2-v5, which reaches 8 cases.
Cases 9001 through 9006 are reached by a response of 1 to both v1 and v2;
cases 9015 and 9016 are reached because of responses of 2 on v5. The FREQ
score for that combination is 16.
**************************************
* example 4: *
* a TURF run using item weighting *
* and a threshold of more than one *
**************************************
/* create a file containing a weight value of 2 for item v3 */
build work1, vars name:c weight;
v3 2
$
turf ddd [drop case.id www], size 3,
reach.results rrr,
item.weights work1,
reach.threshold 2 $
list rrr $
Normally each item has a weight of one; each makes the same contribution
to a reach score. It is possible, however, to make some items worth more
than others, possibly reflecting, for example, differences in costs of
the items.
In this example, item v3 is weighted. This is conveyed in file WORK1,
whose single record defines a weight of 2 for item v3. This causes responses
on v3 to be worth twice what they would otherwise be worth. As in example
3, a threshold of 2 is used.
The best group in this example is v1-v2-v3, which reaches 10 cases. Cases
9001 through 9006 achieve a reach score of 2 using items v1 and v2. Cases
9008 through 9011 have reach scores of 2 because of the item weight given
to v3. The FREQ score for that combination is 20.
*************************
* general identifiers *
*************************
TURF xxx, this supplies the input filename.
Except for an optional weight variable,
all variables are treated as analysis items.
The values on the analysis items should
be zeros or positive numbers.
A positive value signifies a "hit".
Cases with any missing or negative values on
the analysis items are ignored.
The SET.MISS.TO.ZERO identifier, described
below, sets missing analysis items to zeros.
When case weighting is being used, any cases
with a missing, negative or zero value on
the weight variable are also ignored.
SIZE 6, what size combinations to use. required.
SIZE 4 to 7,
SIZE 6 to 3,
SIZE 4 6 8,
One or more sizes can be done in one run.
They are done in the order given;
for SIZE 6 to 4, size 6 is done first.
The final report shows the result for each
size separately. The output files show the
best results from the first size, then the
second, and so forth.
Many (20 or more) sizes can be done in a run;
each size must be from 1 to 60, and there
should not be any repeated sizes in a run.
Note...some sizes cannot be run in a
reasonable amount of time. Consider 40 items.
Depending on number of cases and on options:
Size 4 takes 91,390 iterations. Seconds.
Size 6 takes 3.8 million iterations. Minutes.
Size 10 takes 847 million iterations. An hour.
Size 15 takes 40 billion iterations. A day.
Size 20 takes 137 billion iterations. A week.
This command produced the above numbers.
DO #j = 1, 20;
PUT #j (combinations( 40,#j));
ENDDO $
The F2 key can be used to cause a TURF command
to abandon the current size being processed.
It will produce the report and the output
files for the sizes already completed.
REACH.THRESHOLD 2, optional. can be fractional.
This permits the user to control
what constitutes a successful "reach".
The default is one; if a case has a positive
response on any of the items in a given
combination, that case is added to the reach
total for that set of items.
Using REACH.THRESHOLD 3, for example, means
a case needs a reach score of 3 or more
to have been reached on a given group.
Having several responses increases a case's
reach score; weighting of either items or
responses can also affect the reach score.
PROGRESS 5, optional. controls how often the progress
window or report line is updated.
The default is 1, which means every million
combinations. PROGRESS 0 turns it off.
SET.MISS.TO.ZERO, optional. If used, missing analysis
values in the input file are set to zeros.
If needed, this saves having to write
some PPL as the file is read.
*****************************************
* identifiers that control the makeup *
* of the combinations to be used *
*****************************************
USE list-of-vars min max,
optional. This provides a limitation on the
makeup of the combinations to be tried.
Of the variables whose names (or ranges)
follow USE, at least MIN of them and
at most MAX of them should be in every
combination that will be tried.
The MIN value can be zero.
Up to 30 such USE phrases can be given.
Combinations are used only if they pass the
constraints in every one of the USE phrases.
Each use of USE is followed by:
(1) The names of the variables in the group.
Ranges, like TOPPING.1 TO TOPPING.8,
can be used.
(2) The smallest number of those variables
that are required. Can be zero.
A combination must have AT LEAST that
many of the variables in the group.
(3) The largest number of those variables
that may be used.
A combination may have AT MOST that many
of the variables in the group.
All of the group could be used if the
supplied number is equal to or larger
than the size of the group. Therefore,
using 999 is a vivid way of saying there
is no upper limit for the group.
For example:
TURF xxx, size 8,
use aaa bbb to ddd 1 999,
use eee to ggg jjj to mmm 2 4,
use yyy zzz 0 1 $
In the above command, the only combinations
that will be evaluated are those that have
at least one variable from the first group, and
at least two but no more than four variables
from the second group, and
no more than one variable from the third group.
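The filtering that USE phrases perform can be pictured with a short Python sketch. It is an illustration of the rule as described above, not P-STAT syntax; the item names, the two constraints and the function passes_use are all invented for the example.

```python
from itertools import combinations

def passes_use(combo, use_phrases):
    """True when combo satisfies every (group, min, max) constraint."""
    chosen = set(combo)
    return all(
        low <= len(chosen & set(group)) <= high
        for group, low, high in use_phrases
    )

# Hypothetical pool of six items.
items = ["a1", "a2", "b1", "b2", "c1", "c2"]
use_phrases = [
    (["a1", "a2"], 1, 999),   # at least one of the a items, no upper limit
    (["c1", "c2"], 0, 1),     # at most one of the c items
]

all_combos = list(combinations(items, 3))
kept = [c for c in all_combos if passes_use(c, use_phrases)]
print(len(kept), "of", len(all_combos), "combinations pass")   # 14 of 20
```

A combination is evaluated only when it passes every constraint, so each added USE phrase can only shrink the set of combinations to be tried.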
FORCE vars, optional. names or ranges of items that
should be part of every combination.
Suppose there are 30 items and size is 6;
without force, 593,775 combinations are done,
because we take 30 items 6 at a time.
If 2 items are forced, only 20,475
combinations will be done because the run
reduces to 28 items taken 4 at a time.
If size is 6 and all 6 items are forced,
just that one pass will be done.
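The arithmetic behind these counts is a binomial coefficient on the items that remain free. A minimal Python check follows; the helper name combos_to_try is invented for the example.

```python
from math import comb

def combos_to_try(n_items, size, n_forced=0):
    """Combinations evaluated when n_forced items are fixed in every group."""
    return comb(n_items - n_forced, size - n_forced)

print(combos_to_try(30, 6))       # 593775
print(combos_to_try(30, 6, 2))    # 20475
print(combos_to_try(30, 6, 6))    # 1
```
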
*****************************
* identifiers for various *
* kinds of weighting *
*****************************
CASE.WEIGHTS varname, optional.
The named variable will be used as a
caseweight, and not as an analysis item.
ITEM.WEIGHTS filename,
the default is treat all of the items
the same, i.e., with weights of 1.
When ITEM.WEIGHTS is used, it should
be followed by the name of a p-stat system
file which itself has exactly 2 variables.
In each record, the first variable has the
name of an item being used for the TURF
analysis, the second is the weight to be used
for that item. The first variable is therefore
character, and the second is numeric.
The file is not required to have a record
for every item. In other words, some items
can be given changed weights; others can
be left as is ( i.e., still set to 1).
The file can have names and weights for
items not used in the current run; if so,
they are ignored.
RESPONSE.WEIGHTS, the default is to store the input data
as zeros or ones, with one meaning a yes.
This option leaves the input values intact;
each should be zero (no) or a positive
value (not necessarily an integer) to show
the INTENSITY of a yes.
****************************
* the REACH.RESULTS file *
****************************
REACH.RESULTS rrr 300,
optional output p-stat system file.
This file holds the combinations with the
best REACH values.
They are in descending order on REACH.
Within ties on reach, the combinations are
in descending order on FREQ.
The item names in a combination are ordered by
the reach contribution that each in turn adds.
The default is to write the 100 best
combinations for each size.
If an integer like 300 follows the file
name, that many are written for each size.
Each combination will take from 1 to 5 lines,
as determined by the REACH.DETAILS identifier,
described below. The default is two lines:
one for the item names, the second for the
cumulative reach for each successive item.
The names of the variables in the
REACH.RESULTS file itself are these.
Note, some (or all) of the initial 6 can be
dropped by using the OMIT identifier,
described below.
(1) SIZE: the combination size.
(2) RANK: the rank within size.
(3) REACH: the reach value for the
combination.
(4) PCT.REACHED: the percent of usable
cases reached by the combination.
The usable cases are the cases
with no invalid data. This includes
cases with no responses whatsoever.
(5) PCT.OF.MAX.REACH: the percent of active
cases reached by the combination.
An active case is a usable case that
has at least one positive response;
other cases cannot possibly be reached.
(6) FREQ: the freq value for the
combination.
(7+) ITEM.1, ITEM.2, ITEM.3, etc:
These variables contain the names of
the items that make up a combination.
The item names are ordered by their
contribution to the reach score.
I.e., the name appearing under ITEM.1
is the 'best' item in the combination.
If sizes 6 and 8 are both being done,
the file will have item.1 through
item.8. The results for size 6 will
have blanks for item.7 and item.8.
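The two percentages differ only in their denominator. As a worked check in Python, using the figures from the sample final report later in this help text (best reach 941, 1,000 usable cases, 973 of them with at least one positive response), and an invented helper name:

```python
def reach_percents(reach, usable_cases, active_cases):
    """Return (PCT.REACHED, PCT.OF.MAX.REACH) for one combination."""
    pct_reached = 100 * reach / usable_cases
    pct_of_max_reach = 100 * reach / active_cases
    return pct_reached, pct_of_max_reach

pct_reached, pct_of_max = reach_percents(941, 1000, 973)
print(f"{pct_reached:.1f} {pct_of_max:.1f}")   # 94.1 96.7
```
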
********************************
* the REACH.RESULTS file: *
* the items in a combination *
* are ordered by importance *
********************************
Suppose 4 items, AA, BB, CC and DD, comprise
a combination about to be written to the
reach.results file.
Before writing them, they are reordered so that
the leftmost item is the one with the highest
individual reach. The next item shown has,
when paired with the leftmost item,
the largest 2-item reach score, and so on.
The reordering is done in this manner.
Each of the four variables is used in a size 1
pass to see which has the best standalone
reach.
Suppose it was CC.
CC is placed in the ITEM.1 column.
Now, given CC is best, which item is next ?
It is the item that, when paired with CC,
provides the best increase in reach.
This is done by making three size 2 passes
over the data, using CC-AA, CC-BB and CC-DD.
Again, we take the best result.
Suppose it is CC-BB. BB is therefore placed
in the second position (in the ITEM.2 column).
Now we try CC-BB-AA and CC-BB-DD to see which
of AA and DD should be in the ITEM.3 column.
The remaining item in this stepwise procedure
goes into the ITEM.4 column.
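The stepwise reordering can be sketched in Python as follows. This is an illustration of the procedure described above, not the P-STAT code; the 11 cases are invented so that CC, then BB, then AA come out in that order, and ties (which the text does not discuss) simply go to the first candidate tried.

```python
def reach(cases, group):
    """Number of cases with a positive response on any item in group."""
    return sum(1 for row in cases if any(row[name] > 0 for name in group))

def order_by_contribution(cases, combo):
    """Repeatedly append the item that gives the best reach so far."""
    ordered, remaining = [], list(combo)
    while remaining:
        best = max(remaining, key=lambda name: reach(cases, ordered + [name]))
        ordered.append(best)
        remaining.remove(best)
    return ordered

# Hypothetical data: which of cases 1-11 each item reaches.
REACHED = {
    "AA": {1, 2, 9, 10},
    "BB": {1, 6, 7, 8},
    "CC": {1, 2, 3, 4, 5},
    "DD": {11},
}
cases = [
    {name: int(i in ids) for name, ids in REACHED.items()}
    for i in range(1, 12)
]
print(order_by_contribution(cases, ["AA", "BB", "CC", "DD"]))
# ['CC', 'BB', 'AA', 'DD']
```
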
*************************************
* the REACH.RESULTS file: *
* TURF can be flummoxed by small, *
* carefully constructed data sets *
*************************************
It should be noted that selecting the best
two items in a stepwise manner is not quite
the same as selecting the best two by trying
all possible pairs.
Suppose we have a file of 14 cases.
Again, there are 4 items: AA, BB, CC and DD.
We would like to find the 'best' two items.
AA reaches cases 1-10,
BB reaches cases 11-13,
CC reaches cases 1- 5 and 11-12,
DD reaches cases 6-10 and 13-14.
The stepwise approach selects AA and, having
AA in hand, adds BB to get its best two items.
They have a reach of 13.
A non-stepwise approach tries all combinations
of size 2 and would select CC and DD.
They have a reach of 14.
The TURF command uses a stepwise procedure
only in the REACH.RESULTS (and FREQ.RESULTS)
reordering; otherwise all runs are done
trying every possible combination of the size
being analyzed.
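The 14-case example above is small enough to verify directly. The Python sketch below is an illustration only; it picks the best pair both ways, and the names REACHED and reach are invented for the example.

```python
from itertools import combinations

# Cases reached by each item, as listed in the text.
REACHED = {
    "AA": set(range(1, 11)),              # cases 1-10
    "BB": {11, 12, 13},                   # cases 11-13
    "CC": set(range(1, 6)) | {11, 12},    # cases 1-5 and 11-12
    "DD": set(range(6, 11)) | {13, 14},   # cases 6-10 and 13-14
}

def reach(group):
    """Number of distinct cases reached by any item in the group."""
    return len(set().union(*(REACHED[name] for name in group)))

# Stepwise: the best single item, then the best partner for it.
first = max(REACHED, key=lambda name: reach([name]))
second = max((n for n in REACHED if n != first),
             key=lambda name: reach([first, name]))
stepwise = (first, second)

# Exhaustive: try every pair.
exhaustive = max(combinations(REACHED, 2), key=reach)

print(stepwise, reach(stepwise))       # ('AA', 'BB') 13
print(exhaustive, reach(exhaustive))   # ('CC', 'DD') 14
```
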
****************************************
* the REACH.RESULTS file: *
* using OMIT to drop some *
* (or all) of the first 6 statistics *
****************************************
OMIT size pct.of.max.reach,
The default is for the reach.results and
freq.results files to have six numeric
values before the items appear. These are:
SIZE
RANK
REACH
PCT.REACHED
PCT.OF.MAX.REACH
FREQ
An OMIT phrase can be used to drop any number
of them, including all of them. This may
reduce the number of print passes needed to list the file.
One OMIT phrase applies to both results files.
OMIT, in other words, can be used to cause
a better looking listing. In LIST itself,
using BLANK.MISSING would convert the dashes
that represent missing in LIST output into
blanks. Also, using SKIP 2 when there is one
extra line helps appearances.
***************************************
* the REACH.RESULTS file: *
* using REACH.DETAILS to select *
* which (if any) extra lines should *
* be written for each combination *
***************************************
REACH.DETAILS cumulative.pct,
When a reach.results file is written,
the items within each combination are ordered
by their reach contribution within the
combination. This is always done.
In addition, an extra line is written for
each group which shows the cumulative reach
as each item is added.
That is the default, but it can be changed.
As many as four extra lines are possible:
(1) cumulative, the increasing reach as
each successive item is added.
This is the default.
(2) separate, which has the additional
reach provided by each successive item.
(3) cumulative.pct, the percent of the
cases reached as each item is added.
(4) separate.pct, the additional percent of
cases reached by each successive item.
REACH.DETAILS can be followed by:
(1) NONE by itself, no lines are written.
(2) ALL by itself, 4 lines are written.
(3) one or more of CUMULATIVE, SEPARATE,
CUMULATIVE.PCT and SEPARATE.PCT.
The requested lines would be written.
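As an illustration of the four detail lines, the Python sketch below computes them for the best group of example 1 (v2, v3, v4 on file ddd). It is not P-STAT output. It assumes the percentages are taken over all 16 usable cases, as PCT.REACHED is; the text does not state which base the detail lines use.

```python
NAMES = ["v1", "v2", "v3", "v4", "v5"]
# File ddd from the text: responses on v1..v5 for the 16 cases.
DDD = (
    [[1, 1, 0, 0, 0]] * 6      # cases 9001-9006
    + [[0, 1, 0, 0, 0]]        # case 9007
    + [[0, 0, 1, 0, 0]] * 4    # cases 9008-9011
    + [[0, 0, 0, 1, 0]] * 3    # cases 9012-9014
    + [[0, 0, 0, 0, 2]] * 2    # cases 9015-9016
)

def reach(group):
    """Cases with a positive response on any named item in group."""
    return sum(1 for row in DDD
               if any(row[NAMES.index(n)] > 0 for n in group))

ordered = ["v2", "v3", "v4"]   # already in order of reach contribution
n_cases = len(DDD)             # assumed base for the percentages

cumulative = [reach(ordered[:k]) for k in range(1, len(ordered) + 1)]
separate = [now - before for now, before in zip(cumulative, [0] + cumulative)]
cumulative_pct = [100 * c / n_cases for c in cumulative]
separate_pct = [100 * s / n_cases for s in separate]

print(cumulative)       # [7, 11, 14]
print(separate)         # [7, 4, 3]
print(cumulative_pct)   # [43.75, 68.75, 87.5]
print(separate_pct)     # [43.75, 25.0, 18.75]
```
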
***************************
* the FREQ.RESULTS file *
***************************
FREQ.RESULTS fff 500,
optional output p-stat system file.
This file holds the combinations with the
best FREQ values.
They are in descending order on FREQ.
Within ties on FREQ, the rows are in
descending order on REACH.
The default is to write the 100 best
combinations for each size.
If an integer like 500 follows the file
name, that many are written for each size.
The item names in a combination are ordered by
the freq contribution that each in turn adds.
The FREQ.RESULTS file has the same variables
as the REACH.RESULTS file.
***************************************
* the FREQ.RESULTS file: *
* using FREQ.DETAILS to select *
* which (if any) extra lines should *
* be written for each combination *
***************************************
FREQ.DETAILS cumulative.pct,
When a freq.results file is written,
the items within each combination are ordered
by their freq contribution within the
combination. This is always done.
In addition, an extra line is written for
each group which shows the cumulative freq
as each item is added.
That is the default, but it can be changed.
Two extra lines are possible:
(1) cumulative, the increasing total freq as
each successive item is added. Default.
(2) separate, which has the additional
freq provided by each successive item.
FREQ.DETAILS can be followed by:
(1) NONE by itself, no lines are written.
(2) ALL by itself, 2 lines are written.
(3) one or both of CUMULATIVE and SEPARATE.
The requested lines would be written.
********************************************
* other optional output file identifiers *
********************************************
REACH.SUMMARY qqq 200,
optional output p-stat system file.
This file tells you how many combinations
had each of the reach values that were found.
The default is to write the 100 best reach
values for each size being processed.
If an integer like 200 follows the file name,
that many would be written for each size.
Each row in the reach.summary file contains:
(1) SIZE: the combination size.
(2) RANK: the rank within size.
(3) REACH: a reach value.
(4) NUMBER.OF.COMBOS: the number of
combinations that have that reach value.
(5) PCT.OF.COMBOS: the percent of the
combinations that have that reach value.
(6) LOWEST.FREQ: The lowest freq value in
the combinations at that reach value.
(7) HIGHEST.FREQ: The highest freq value in
the combinations at that reach value.
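To make the layout of a reach summary concrete, the Python sketch below tabulates the ten size-3 groups of file ddd (unweighted, default threshold) by reach value. It is an illustration of the columns listed above, not P-STAT output; the tuple layout and the names used are invented.

```python
from collections import defaultdict
from itertools import combinations

# File ddd from the text: responses on v1..v5 for the 16 cases.
DDD = (
    [[1, 1, 0, 0, 0]] * 6 + [[0, 1, 0, 0, 0]]
    + [[0, 0, 1, 0, 0]] * 4 + [[0, 0, 0, 1, 0]] * 3
    + [[0, 0, 0, 0, 2]] * 2
)

def score(group):
    """Unweighted (reach, freq) for a group of item positions."""
    reach = freq = 0
    for row in DDD:
        hits = sum(1 for i in group if row[i] > 0)
        if hits:
            reach += 1
            freq += hits
    return reach, freq

# Collect the freq values found at each reach value.
freqs_at = defaultdict(list)
for group in combinations(range(5), 3):
    reach, freq = score(group)
    freqs_at[reach].append(freq)

n_combos = sum(len(freqs) for freqs in freqs_at.values())
# Rows: (rank, reach, number of combos, pct of combos, lowest freq, highest freq)
summary = [
    (rank, reach, len(freqs), 100 * len(freqs) / n_combos,
     min(freqs), max(freqs))
    for rank, (reach, freqs)
    in enumerate(sorted(freqs_at.items(), reverse=True), start=1)
]
for row in summary:
    print(row)
```

The first row printed is (1, 14, 1, 10.0, 14, 14): one combination, 10 percent of the ten tried, has the best reach of 14.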
FREQ.SUMMARY qqq 200,
optional output p-stat system file.
This file tells you how many combinations
had each of the freq values that were found.
The default is to write the 100 best freq
values for each size being processed.
If an integer like 200 follows the file name,
that many would be written for each size.
Each row in the freq.summary file contains:
(1) SIZE: the combination size.
(2) RANK: the rank within size.
(3) FREQ: a freq value.
(4) NUMBER.OF.COMBOS: the number of
combinations that have that freq value.
(5) PCT.OF.COMBOS: the percent of the
combinations that have that freq value.
(6) LOWEST.REACH: The lowest reach value
in the combinations at that freq value.
(7) HIGHEST.REACH: The highest reach value
in the combinations at that freq value.
FULL.OUTPUT fff 200,
optional output p-stat system file.
This has the results of combinations
in the order that they were processed.
The default is to write the results of the
initial 5,000 combinations that were tried,
regardless of size.
If an integer like 25,000 follows the file
name, up to that many would be written.
Such a large number should be used with
caution, and only if really needed, because
this can create a very large file.
Each row in the full.output file contains:
(1 ) the reach value for the combination.
(2 ) the FREQ value for the combination.
(3+) the positions of the items that
make up the combination.
If the USE.NAMES identifier is also
used, the names of the items will
be used instead of the positions.
Names, however, make the file larger.
************************************************
* passing results to the TURF.SCORES command *
* using a TEMPLATE file *
************************************************
The TURF.SCORES command needs to know the items and options
to be used in the scoring. These can be supplied within the
TURF.SCORES command itself. However, if you want to score
the best result from a TURF run, it is easier to have TURF
write a template file which TURF.SCORES can read directly.
This option cannot be used when several sizes are being run.
TEMPLATE ttt,
optional output p-stat system file.
This contains the names of the items
that comprised the best combination.
It also contains information about the
options (like weighting) that were used.
This file can be given to the TURF.SCORES
command, to score the cases on the combination
contained in the file. TURF.SCORES then writes
an output file that has the reach score for
each case on that combination.
It is then quite easy to investigate the
demographics of the reached cases.
****************************
* a simple TEMPLATE file *
****************************
The following shows a typical template file
for a combination of 5 items, with variable
www being used as a caseweight, and
no other options in use.
            item      case      response   reach
  items     weights   weights   weights    threshold
  VAR3      1         www       no         1
  VAR4      1         -         -          -
  VAR5      1         -         -          -
  VAR13     1         -         -          -
  VAR23     1         -         -          -
*********************************************
* a typical final report produced by TURF *
*********************************************
---------TURF analysis for file work2 completed----------
| OPTIONS: none |
| |
| 100 items were used in the analysis. |
| |
| 1,000 cases were read and used. |
| 973 cases had at least one positive response, |
| making that the maximum possible reach. |
| |
| SIZE 6 evaluated 1,192,052,400 combinations: |
| 941 was the best REACH, found in 1 combination. |
| 1,956 was the FREQ value in that combination. |
| 1,983 was the best FREQ in any size 6 combination. |
| |
| The FREQ score for a combination is the count |
| of the non-zero responses for that combination, |
| summed over the reached cases. |
| |
| REACH.RESULTS file work101 has the 100 |
| combinations with the highest reach scores. |
| The items are ordered by their REACH contribution. |
| Cumulative reach is shown. |
| |
| Time: 18 minutes, 35.5 seconds. |
---------------------------------------------------------
**********************************************
* processing speed: cases and combinations *
**********************************************
There are three components whose effect on the speed
of a TURF run is more or less linear:
(1) more cases. Twice the cases for the same analysis
will take twice the time.
(2) more combinations to be tested.
30 items taken 6 at a time (i.e., 30,6) has 593,775
combinations, 30,7 has 2,035,800.
That is 3.4 times as many combinations.
Therefore, it will take 3.4 times longer to run.
(3) CPU speed. Since the data and the results are held
within memory during a run, going from an 800 MHz chip
to a 2.4 GHz chip should be about 3 times faster.
**************************************************
* processing speed: effects of various options *
**************************************************
TURF speed is also greatly affected by the options chosen.
The fastest run is one that uses none of the options:
for example, TURF INFILE, SIZE 6, REACH.RESULTS OUTFILE $.
If this takes one second, how much longer do the various
options take ?
(1) 10 seconds if adding just CASE WEIGHTING.
(2) 28 seconds if using response weighting, item weighting
or reach threshold (with or without case weighting).
************************************
* processing speed: output files *
************************************
The various output files take very little extra time.
A 29,7 run for 500 cases with no output files took 2.5
seconds. Adding the default sized (best 100) REACH.RESULTS
or FREQ.RESULTS files made the runtime 2.7 seconds;
doing both took 2.8 seconds.
Asking for the 20,000 best REACH.RESULTS results instead
of the default best 100 took 3.1 seconds instead of 2.7.
The REACH.SUMMARY and FREQ.SUMMARY output files take a little
more time than the REACH.RESULTS and FREQ.RESULTS files.
The FULL.OUTPUT file takes almost no extra time since there
is no sort management involved.
***********************************************
* ------how the reach scoring is done------ *
* determining if a case has been "reached" *
* on a given combination of items *
***********************************************
INPUT FILE. A TURF run is done on an input file
that contains some numeric items to be analyzed.
The number of analysis items (NV) can be from 1 to 210,
but will often be 20 or 30 or so.
The file can contain thousands of cases.
CASE.WEIGHTS. The file may also have a weight variable
which provides a weight for each case.
If not, a weight of one is assumed for each case.
SIZE. A combination size needs to be provided. This is the
number of items to be examined in each pass over the data.
Suppose NV is 30 and the size is 7. A pass over the data
will be done for every different combination of 7 of the
30 items, causing 2,035,800 passes.
In each pass, the number of cases that have been reached
by the current combination of items is counted.
The goal is to identify the combination that reaches the
largest number of the cases.
RESPONSE.WEIGHTS. The response values themselves can be
used as weights that reflect the intensity of the response.
Using this option causes the responses for each case
to be placed in memory without any change.
When response.weights is not used, the responses for each
case are stored in 0/1 form; 0 if the response was indeed
zero, and 1 if the response was any value greater than zero.
This takes much less memory space.
ITEM.WEIGHTS. The items themselves are assumed to be
equally important. In other words, the default is for each of
the NV items to have a weight of one in the reach scoring.
Different weights can be provided for some or all of the
items. These are read from a file associated with the
ITEM.WEIGHTS option.
REACH.THRESHOLD. Finally, the reach threshold, which
defaults to one, can be changed to 3, for example, by saying
REACH.THRESHOLD 3. The threshold can be fractional, like 3.5.
A case is reached when its reach score equals or exceeds the
reach threshold. Suppose we are scoring a case on a
combination that consists of items V2, V5, V11 and V17.
Remember, the responses are stored internally as 0 or 1
except when the RESPONSE.WEIGHTS option is in use.
The reach score for a given case is:
V2 response times V2's item weight, plus
V5 response times V5's item weight, plus
V11 response times V11's item weight, plus
V17 response times V17's item weight.
If that score equals or exceeds the reach threshold,
the case's caseweight is added to the number of cases
that have been reached for that combination.
A weight of one is used when there is no CASE.WEIGHTS variable.
When responses and items are unweighted and the
threshold is one, a case will have been reached when it
has a positive response on any item in the combination.
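The scoring rule above can be written out as a short Python sketch and checked against examples 2, 3 and 4. This is an illustration of the rule as described, not the P-STAT implementation; file ddd is hard-coded, and the function and parameter names are invented.

```python
from itertools import combinations

NAMES = ["v1", "v2", "v3", "v4", "v5"]
# File ddd from the text as (caseweight www, responses on v1..v5).
DDD = (
    [(1, [1, 1, 0, 0, 0])] * 6      # cases 9001-9006
    + [(1, [0, 1, 0, 0, 0])]        # case 9007
    + [(1, [0, 0, 1, 0, 0])] * 4    # cases 9008-9011
    + [(1, [0, 0, 0, 1, 0])] * 3    # cases 9012-9014
    + [(3, [0, 0, 0, 0, 2])] * 2    # cases 9015-9016
)

def reach(group, case_weights=False, response_weights=False,
          item_weights=None, threshold=1):
    """Sum of caseweights over cases whose reach score meets the threshold."""
    item_weights = item_weights or {}
    total = 0
    for www, row in DDD:
        score = 0
        for name in group:
            value = row[NAMES.index(name)]
            if not response_weights:
                value = 1 if value > 0 else 0    # 0/1 storage
            score += value * item_weights.get(name, 1)
        if score >= threshold:
            total += www if case_weights else 1
    return total

def best_group(**options):
    """Try every size-3 group; return (group, reach) for the best one."""
    group = max(combinations(NAMES, 3), key=lambda g: reach(g, **options))
    return group, reach(group, **options)

print(best_group(case_weights=True))                     # example 2
print(best_group(response_weights=True, threshold=2))    # example 3
print(best_group(item_weights={"v3": 2}, threshold=2))   # example 4
```

The three lines printed give v2-v3-v5 with reach 17, v1-v2-v5 with reach 8, and v1-v2-v3 with reach 10, matching the examples.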
**********************************************
* ------how the freq scoring is done------ *
**********************************************
The FREQ score for a combination is the sum of the
freq scores of the cases that were reached.
If the case.weights option is not in use, consider each
case to have a weight of one.
A case's freq score depends on the options in use.
(1) no use of response.weights or item.weights:
Count the positive responses within a case
on the variables in the combination.
Multiply that count by the case's weight.
(2) response weights are used, but not item weights:
Sum the positive responses within a case
on the variables in the combination.
Multiply that sum by the case's weight.
(3) item weights are used, but not response weights:
Sum the item weights for those variables in the
combination that have a positive value.
Multiply that sum by the case's weight.
(4) response weights and item weights are in use:
Sum the item weight times the response value
for those variables in the combination that
have a positive value.
Multiply that sum by the case's weight.
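The four cases reduce to one formula: for each positive response, multiply a response term (the value itself, or 1) by an item term (the item weight, or 1), sum, and multiply by the case's weight. A Python sketch of one case's freq score follows. It is an illustration, not the P-STAT code; the function name and argument layout are invented, and the sample rows are taken from file ddd.

```python
def case_freq(responses, item_weights, case_weight=1,
              use_response_weights=False, use_item_weights=False):
    """Freq score of one reached case on one combination.

    responses and item_weights hold the values for the items in the
    combination, in the same order.
    """
    total = 0
    for value, weight in zip(responses, item_weights):
        if value > 0:
            r = value if use_response_weights else 1
            w = weight if use_item_weights else 1
            total += r * w
    return total * case_weight

# Example 2: case 9015 on v2-v3-v5, caseweight 3.
print(case_freq([0, 0, 2], [1, 1, 1], case_weight=3))            # 3
# Example 3: v1-v2-v5 with response weights; 6 cases plus 2 cases.
ex3 = (6 * case_freq([1, 1, 0], [1, 1, 1], use_response_weights=True)
       + 2 * case_freq([0, 0, 2], [1, 1, 1], use_response_weights=True))
print(ex3)                                                       # 16
# Example 4: v1-v2-v3 with v3 weighted 2; 6 cases plus 4 cases.
ex4 = (6 * case_freq([1, 1, 0], [1, 1, 2], use_item_weights=True)
       + 4 * case_freq([0, 0, 1], [1, 1, 2], use_item_weights=True))
print(ex4)                                                       # 20
```
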
*****************************
* limitations on run size *
*****************************
ITEMS: Using lots of items can be done, but only with sensible
combination sizes. For example, one might look at 4 items out of 200
(64.7 million passes), but even 6 out of 200 would become excessive
(82.4 billion passes).
COMBINATION SIZE: the maximum combination size is 60 items. One could
look at 16 out of 24, for example, or even 40 out of 45. However, 60
out of 210 is so large that it will never finish. The P-STAT command
PUT( COMBINATIONS(50,7))$ returns the number of combinations that
7 out of 50 would require, for example, and may be useful in estimating
the time of a prospective run.
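The pass counts quoted in this section can also be checked
outside P-STAT. The sketch below uses Python's math.comb
(Python 3.8 or later); the helper name passes is hypothetical.

```python
from math import comb


def passes(n_items, size):
    """Number of combinations of `size` items drawn from `n_items`."""
    return comb(n_items, size)


# Run sizes mentioned in this help text.
for size, n_items in [(4, 200), (6, 200), (7, 50), (10, 50)]:
    print(f"{size} out of {n_items}: {passes(n_items, size):,} combinations")
```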
HOW LARGE A RUN CAN BE DONE: as described above, it depends on the
number of cases, the number of combinations, the options used, and the
speed of the PC itself.
If you are considering a large run, 10 billion combinations or more, you
might try a smaller run first, get its time, and use the ratio of the
combinations to estimate the time needed for the larger run.
IS 10 OUT OF 50 POSSIBLE: this is 10.27 billion combinations and would
undoubtedly take quite a few hours, but is possible. The progress meter
ticks every million combinations, so you can easily tell how long a run
will take once it starts.
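The ratio estimate described above can be sketched in Python.
This is a rough illustration only: it assumes the same cases
and options in both runs and that time grows in proportion to
the number of combinations. The function name estimate_seconds
is hypothetical, and the trial figures are the one-minute run
of 8 out of 40 mentioned earlier in this help text.

```python
from math import comb


def estimate_seconds(trial_seconds, trial_items, trial_size, items, size):
    """Scale a timed trial run by the ratio of combination counts."""
    ratio = comb(items, size) / comb(trial_items, trial_size)
    return trial_seconds * ratio


# Trial: 8 out of 40 took about 60 seconds. Target: 10 out of 50.
seconds = estimate_seconds(60, 40, 8, 50, 10)
print(f"estimated time: about {seconds / 3600:.1f} hours")
```

The actual time depends on the machine, the number of cases
and the options in use, so a figure like this is best treated
as an order-of-magnitude guide.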
********************************
* limitations on memory size *
********************************
MEMORY CONSTRAINTS FOR INPUT: the input data must fit in memory. (We
really don't want to read the data afresh from disk for each of 50
million different combinations.)
In most situations memory should not be a problem because the data is
usually stored very compactly.
The final report shows how much of the input data area was used; that
line, however, is omitted when less than 50% was needed.
MEMORY CONSTRAINTS FOR OUTPUT: the output files (except for the
FULL.OUTPUT file) are collected in memory in sort order as the run
progresses. The default sizes cause no problems.
If one asks for 20,000 results in the REACH.RESULTS or FREQ.RESULTS
files, the run will be slightly slower but the file should fit.
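To see why a larger results limit costs a little time and
memory, consider one way of keeping only the best N results in
sorted order while combinations are scored. The Python sketch
below is an assumption made for illustration; it does not
describe how P-STAT is actually implemented, and the name
keep_best is hypothetical.

```python
import bisect


def keep_best(scored_combinations, limit):
    """Keep the `limit` highest-scoring (score, combination) pairs.

    The list is held in ascending order, so the weakest kept
    result is always at position 0 and is cheap to drop.
    """
    best = []
    for entry in scored_combinations:
        if len(best) < limit:
            bisect.insort(best, entry)
        elif entry > best[0]:
            bisect.insort(best, entry)
            del best[0]
    return best[::-1]   # best result first


# Made-up scores for four combinations; keep the top two.
scored = [(5, ("v1", "v2")), (9, ("v1", "v3")),
          (1, ("v2", "v3")), (7, ("v2", "v4"))]
print(keep_best(scored, 2))
```

In such a scheme, memory grows with the limit rather than with
the number of combinations evaluated.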
------------HELP ON turf.scores-----------
TURF.SCORES xin, template ttt, out xout $
TURF.SCORES xin, items v2 v4 to v8, carry case.num, out xout $
*************************
* command description *
*************************
TURF.SCORES computes the REACH score on a specified
combination of items for each case of an input file.
These scores are written to an output file.
The calculations are identical to those in the TURF command.
The output file will have the items used in the scoring, the
reach score, and any "carried" variables. Carried variables
are usually variables that identify the individual cases,
facilitating demographic breakdowns of the reached cases.
The command must be given the names of the variables (items)
that make up the combination to be scored.
In addition, the defaults can be changed for various options.
These are: case weighting, item weighting, response weighting
and the reach threshold.
This information can be supplied in 2 ways:
(1) by providing a TEMPLATE file (created in a TURF command).
(2) by providing the controls as part of this command.
********************************************
* identifiers when using a TEMPLATE file *
********************************************
TURF.SCORES file,        this supplies the input data file.
                         Required.
TEMPLATE file,           this is a file, created in a TURF run,
                         that contains the settings that were
                         used, and the names of the items
                         that made up the best combination.
                         Required.
OUT file,                name for the result file. Required.
CARRY vars,              variables that should be carried over
                         from the input file to the OUT file,
                         even though they are not involved in
                         the execution of the command.
                         Optional.
************************************************
* identifiers when NOT using a TEMPLATE file *
************************************************
TURF.SCORES file,        this supplies the input data file.
                         Required.
ITEMS vars,              names of the variables (items) that
                         make up the combination to be scored.
                         Ranges can be used: var2 to var5.
                         Required.
ITEM.WEIGHTS numbers,    weights of the items.
                         The default is to treat all items the
                         same, i.e., with weights of 1. Optional.
CASE.WEIGHTS var,        name of a variable to be used for case
                         weighting when computing the overall
                         reach for the dataset. Optional.
RESPONSE.WEIGHTS,        causes the actual values of the items
                         to be used in the scoring instead of
                         just zero/one. Optional.
REACH.THRESHOLD number,  the value that a case must achieve to
                         be 'reached'. The default is one.
                         Optional.
OUT file,                name for the result file. Required.
CARRY vars,              variables that should be carried over
                         from the input file to the OUT file,
                         even though they are not involved in
                         the execution of the command.
                         Optional.
******************************
* contents of the OUT file *
******************************
The OUT file will have the following variables:
the items in the combination that was scored,
the caseweight variable (if there was one), and
any CARRY variables that were requested.
In addition, two variables are added:
(1) REACH.SCORE, the score of each case for the combination.
This is set to M1 if there are negative or missing
values for the case in the combination items.
It is also M1 if the case has a non-positive caseweight.
(2) REACH.CATEGORY, was the case reached, given its score
and the threshold in use?
It is M1 when reach.score is M1.
It is zero when reach.score is less than the threshold.
It is one when reach.score satisfies the threshold.
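The rules for the two added variables can be sketched in
Python. This is an illustration under stated assumptions, not
P-STAT code: the function name score_case is hypothetical, and
Python's None is used here as a stand-in for the missing value
M1.

```python
M1 = None  # stand-in for P-STAT's missing value M1 (assumption of this sketch)


def score_case(values, threshold=1, case_weight=1,
               item_weights=None, response_weights=False):
    """Return (reach_score, reach_category) for one case."""
    if case_weight <= 0:
        return M1, M1                  # non-positive caseweight
    if any(v is M1 or v < 0 for v in values):
        return M1, M1                  # negative or missing item value
    if item_weights is None:
        item_weights = [1] * len(values)
    score = 0
    for value, weight in zip(values, item_weights):
        if not response_weights:
            value = 1 if value > 0 else 0
        score += value * weight
    category = 1 if score >= threshold else 0
    return score, category


print(score_case([1, 1, 0], threshold=2))      # score satisfies the threshold
print(score_case([0, 1, 0], threshold=2))      # score below the threshold
print(score_case([1, M1, 0], threshold=2))     # missing item value
```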
****************
* an example *
****************
The TURF documentation shows a 16-case file named ddd.
The following command reads that file and computes reach
scores using items v1, v2 and v5. The response.weights
option is selected, and a threshold of 2 is used.
Variable case.id is carried across into out file sss.
turf.scores ddd, items v1 v2 v5,
response.weights,
reach.threshold 2,
carry case.id,
out sss $
*******************************
* the report produced by *
* running the above command *
*******************************
--------------TURF.SCORES completed--------------
| The reach scoring was done using these items: |
| v1 v2 v5                                      |
|                                               |
| 16 cases were read from file ddd.             |
| The reach.threshold was 2.                    |
| The threshold was met by 8 cases.             |
| The RESPONSE.WEIGHTS option was in use.       |
| 6 variables were written to file sss.         |
-------------------------------------------------
*********************************
* the output file produced by *
* running the above command *
*********************************
case                  reach     reach
  id   v1   v2   v5   score  category
9001    1    1    0       2         1
9002    1    1    0       2         1
9003    1    1    0       2         1
9004    1    1    0       2         1
9005    1    1    0       2         1
9006    1    1    0       2         1
9007    0    1    0       1         0
9008    0    0    0       0         0
9009    0    0    0       0         0
9010    0    0    0       0         0
9011    0    0    0       0         0
9012    0    0    0       0         0
9013    0    0    0       0         0
9014    0    0    0       0         0
9015    0    0    2       2         1
9016    0    0    2       2         1
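The listing above can be checked by hand or with a short
Python sketch. The item values are copied from the listing;
the function name score_row is hypothetical, and the code only
mimics the scoring rules described in this help text.

```python
def score_row(v1, v2, v5, threshold=2, response_weights=True):
    """Reach score and category for one case on items v1, v2 and v5."""
    values = [v1, v2, v5]
    if not response_weights:
        values = [1 if v > 0 else 0 for v in values]
    score = sum(values)
    return score, (1 if score >= threshold else 0)


# (case.id, v1, v2, v5) for the 16 cases shown in the listing.
rows = (
    [(9001 + i, 1, 1, 0) for i in range(6)]       # cases 9001 to 9006
    + [(9007, 0, 1, 0)]
    + [(9008 + i, 0, 0, 0) for i in range(7)]     # cases 9008 to 9014
    + [(9015, 0, 0, 2), (9016, 0, 0, 2)]
)

scored = {case_id: score_row(v1, v2, v5) for case_id, v1, v2, v5 in rows}
reached = sum(category for _, category in scored.values())
print(f"{len(rows)} cases read, threshold met by {reached} cases")
```

Cases 9015 and 9016 meet the threshold only because the
response.weights option lets the value 2 count in full. In
this sketch, with response_weights=False their score is 1 and
only six cases meet the threshold.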
For more information:
email support@pstat.com
phone 609-466-9200