A job definition is an XML file which contains processing instructions to be applied to a template gep file. The complete process requires a job file and a gep file as a template. The instructions in the job file will be applied to copies of the gep file which are created when the job file is processed by GeneXproServer. Here is an example of a very simple job file:
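A minimal job file matching the description that follows might look like the sketch below. The attribute names are the ones documented in the sections further down; the feedback interval of 5 seconds and the start run type are assumptions made for illustration.

```xml
<!-- Sketch only: feedback='5' and type='start' are assumed values -->
<job filename='CreditApproval.gep' path='c:\work' feedback='5' async='yes'>
    <!-- First run: stops after 30 minutes -->
    <run id='1' type='start' stopcondition='minutes' value='30'/>
    <!-- Second run: stops after 45 minutes -->
    <run id='2' type='start' stopcondition='minutes' value='45'/>
</job>
```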

This job file uses the CreditApproval.gep file in the folder c:\work as the template. It processes two runs in parallel (async='yes'), the first for 30 minutes and the second for 45 minutes. Each run is a copy of the original CreditApproval.gep file and will be placed in one of two sub-folders named CreditApproval_1 and CreditApproval_2. By default GeneXproServer generates a report for each new run that includes logging and summary information.

To start a run, use the following command (assuming that you chose to add GeneXproServer to the PATH during installation):

gxps50c.exe c:\work\jobdefinition.xml

The following image shows the transition between processing two runs simultaneously and finishing the second run which will run for longer:

Finally when all processing is done the contents of the folder c:\work will be:

And inside the CreditApproval_1 you will find the following files:

These include the two new gep files (CreditApproval_000001.gep and CreditApproval_000002.gep), several report files for each run and a summary report in xml and html (Report.xml and Report.htm).

This describes a simple job, but many options can be added to the process, such as running external processes before and after the job and before and after each run, code generation, testing, and so on. These options are detailed below.

Job Node

In a job definition file the job node is the top node and encloses all the commands in the file. The job itself supports the following attributes:

Required Attributes:

filename: The name of the gep file including the extension.

path: The fully qualified path to the folder where the gep file is. All processing will be done in this folder.

feedback: The amount of time in seconds between updates to the screen.

Optional Attributes:

usesubfolder: Create a folder for the job. When set to 1 (Auto) the folder will be called filename_x, where x is an integer starting at 1. If set to 2 (Fixed), the attribute subfoldername is required and must contain a valid file name.

Values:

0 (None)

1 (Auto)

2 (Fixed)

Default: 1 (Auto)

subfoldername: The name of the folder for the job processing; it is ignored unless usesubfolder is set to 2 (Fixed).

report: Generate a report for the job.

Values:

yes

no

Default: yes

createconsolidatedrun: At the end of the job processing, create a new gep file that contains the best model of each run in the job, selected according to one of the criteria below.

Values:

0 (Don't Create)

1 (Select by best training fitness)

2 (Select by best validation fitness)

3 (Select by best training accuracy)

4 (Select by best validation accuracy)

5 (Select by best training r-square)

6 (Select by best validation r-square)

7 (Select by best training favorite statistic)

8 (Select by best validation favorite statistic)

Default: 0 (Don't Create)

async: Process the runs in parallel.

Values:

yes

no

Default: yes

logtofile: Create log files of the run process. Creates one log file per run. The log files are XML files and are named after the run (for example: CreditApproval_000001.xml).

Values:

yes

no

Default: yes
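As an illustration of how the optional attributes combine, the sketch below shows a job node that writes its output to a fixed sub-folder, consolidates the best models by validation fitness and processes the runs one at a time. The attribute names are those listed above; the feedback value and the folder name NightlyBatch are hypothetical.

```xml
<!-- Sketch only: feedback value and sub-folder name are hypothetical -->
<job filename='CreditApproval.gep' path='c:\work' feedback='10'
     usesubfolder='2' subfoldername='NightlyBatch'
     report='yes' createconsolidatedrun='2'
     async='no' logtofile='yes'>
    <run id='1' type='continue' stopcondition='generations' value='50'/>
</job>
```

Because usesubfolder is set to 2 (Fixed), subfoldername must be present; with the default of 1 (Auto) it could be left out.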

Run Node

The job node can contain any number of run nodes. Each run represents a unit of processing, which can be processing a run from scratch, continuing the processing of an existing run, or performing other tasks defined by other nodes, such as code generation and predictions.

The run node is by far the most complex element in the job file because it can contain many different nodes and instructions. Nevertheless, the simplest run node needs only an id, a type, a stop condition and a value. For example, the run node below attempts to improve the selected model with a run of 50 generations, much as if you had pressed the Continue button in the Run Panel in GeneXproTools.

<run id='1' stopcondition='generations' value='50' type='continue' />

The mandatory run node attributes are:

id: Must be an integer and must be unique. That is, no two runs can have the same id.

type: The action to perform with the run. The last four values correspond to the buttons in the Run Panel of GeneXproTools and process the run either by starting from scratch or by starting from an existing model. Idle is a special case, used when the models already in the run are needed to generate code or predictions. In this case no new models are created and the attributes stopcondition and value are optional.

Values:

idle

start

continue

simplify

complexify

stopcondition: Defines the unit for the duration of the run.

Values:

generations

hours

minutes

seconds

value: The number of generations (integer), hours, minutes or seconds (all float), as defined in stopcondition, to process the run.

Pre and Post Processing (Job and Run)

The nodes preprocessing and postprocessing can be added inside the job or inside each run. The job's preprocessing and postprocessing nodes are used to start external applications when the job starts and just before the job ends. Similar nodes can be added to each of the runs and represent processes that are started just before the run is processed and right after the run ends. The path to the application is defined in the path attribute; the application can run without showing its window if the hide attribute is set to yes; and the job processor waits for the application to finish when the synchro attribute is set to yes.

path: The fully qualified path to the application to be started.

arguments: A string that is passed to the process being created.

hide: If set to yes then the user interface of the process being fired is hidden.

Values:

yes

no

synchro: Defines if the job processor waits for the process to finish running.

Values:

yes

no
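The attributes above could be combined as in the following sketch, which starts one external application before the job, one after the first run and one just before the job ends. The executables, their arguments and the order of the child nodes inside the job are hypothetical; only the node and attribute names come from this section.

```xml
<!-- Sketch only: the executables and their arguments are hypothetical -->
<job filename='CreditApproval.gep' path='c:\work' feedback='5'>
    <!-- Started when the job starts; the job processor waits for it to finish -->
    <preprocessing path='c:\tools\exportdata.exe' arguments='/table Credit' hide='yes' synchro='yes'/>
    <run id='1' type='start' stopcondition='minutes' value='30'>
        <!-- Started right after this run ends; the job processor does not wait -->
        <postprocessing path='c:\tools\notify.exe' arguments='run1' hide='no' synchro='no'/>
    </run>
    <!-- Started just before the job ends -->
    <postprocessing path='c:\tools\archive.exe' arguments='c:\work' hide='yes' synchro='yes'/>
</job>
```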

Data Loading

GeneXproServer can load new training and validation/test datasets into each run or it can completely replace the original data in the run. This is done with the dataset node, which contains a number of attributes and one child node, the connection node.

The dataset node attributes are:

type: Defines whether the data is loaded into the training dataset or into the validation/test dataset.

Values:

training

validation

records: The number of records to load. When set to "all" all records in the source are loaded.

The connection node determines the type of source data to use. There are five types of connection (database, file, excel, internal and gepfile) that contain information that is specific to each type of data source.

In this example we are selecting the top 500 records of the table Cancer_Test and loading them into the validation dataset.
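A dataset node of the kind just described might look like the sketch below. The node and attribute names are the ones documented in this section; the connection string (provider, server and database names) is a placeholder, the responselast format is an assumed choice, and limiting the load with records='500' rather than inside the SQL statement is also an assumption.

```xml
<!-- Sketch only: the connection string is a placeholder and must be adapted -->
<dataset type='validation' records='500'>
    <connection type='database' format='responselast'>
        <oledbconnectionstring>Provider=SQLOLEDB;Data Source=myserver;Initial Catalog=mydb;Integrated Security=SSPI;</oledbconnectionstring>
        <!-- Passed to the database as is; GeneXproServer does not validate it -->
        <sqlstatement>SELECT * FROM Cancer_Test</sqlstatement>
    </connection>
</dataset>
```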

Database connection attributes:

type: Must be database.

format: The format of the data.

Values:

responselast: The response (or dependent) variable is the last column in a tabular dataset.

responsefirst: The response variable is the first column in a tabular dataset.

timeseries: The dataset is composed of a single column.

geneexpression: The dataset is in the gene expression matrix format (transposed).

The database connection contains two nodes:

oledbconnectionstring: A connection string to the database. The connection string must be compatible with ADO (ActiveX Data Objects) technology. The connection string must be set as the value of the node.

sqlstatement: An SQL statement. This is not validated or parsed by GeneXproServer and it is assumed to be correct. It can be of any length and contain any SQL statement that is compatible with the database server in use. The SQL must be set as the value of the node.

In this example the Excel file CreditApproval.xlsx contains a spreadsheet named Train_Val which contains the data to load in the range $A$1:$P$11. Since the columns attribute is set to all, all columns will be loaded, but since the dataset attribute records is set to 100, only the top 100 records will be loaded. It is more efficient to set both these attributes to all and select the exact range.
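The Excel case described above could be written as in the following sketch. The sheet name, range, columns and records values come from the description; the training dataset type, the responselast format and the folder of the Excel file are assumptions.

```xml
<!-- Sketch only: dataset type, format and file location are assumed -->
<dataset type='training' records='100'>
    <connection type='excel' format='responselast'>
        <!-- The value of the sheet node is the path to the Excel file -->
        <sheet name='Train_Val' range='$A$1:$P$11' columns='all'>c:\work\CreditApproval.xlsx</sheet>
    </connection>
</dataset>
```

To load only some columns, columns would instead hold the column labels separated by the pipe character, for example columns='Age|Income' with hypothetical labels.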

Excel connection attributes:

type: Must be excel.

format: The format of the data.

Values:

responselast: The response (or dependent) variable is the last column in a tabular dataset.

responsefirst: The response variable is the first column in a tabular dataset.

timeseries: The dataset is composed of a single column.

geneexpression: The dataset is in the gene expression matrix format (transposed).

The Excel connection node has a single child node called sheet that must contain the path to the Excel file as its value. This Excel file must contain at least one spreadsheet with data, and the first row of data must hold the names of the columns (labels).

The sheet node also requires the following attributes:

name: The name of the spreadsheet.

range: The range of columns and rows to load. This range is passed to Excel as is and must be a valid Excel range.

columns: Defines which columns are loaded from the range defined above. To load all columns set columns to all. To load a subset of the columns this attribute should be set to a list of column names separated by the pipe (|) character. The names of the columns are the values defined in the top row (labels).

Changing Settings

The settings node changes the settings of a run through setting child nodes, each with a key and a value; for example, Genes can be set to 5 and HeadSize to 10. The keys and values correspond to the settings in GeneXproTools with the same name. These values are not validated, so you must ensure that the correct type is used or the run file may become corrupted. The list of settings that can be changed is shown in the settings page.
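The change of Genes to 5 and HeadSize to 10 could be written as in the sketch below. The settings and setting nodes with their key and value attributes appear in the selection example later in this document; the enclosing run and its attribute values are assumed for illustration.

```xml
<!-- Sketch only: the enclosing run and its values are assumed -->
<run id='1' type='start' stopcondition='generations' value='500'>
    <settings>
        <setting key='Genes' value='5'/>
        <setting key='HeadSize' value='10'/>
    </settings>
</run>
```

Both settings change the structure of the models, which is why the sketch uses a run that starts from scratch.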

Some settings change the structure of the models and in this case GeneXproServer gives a warning and clears all the models in the run. The settings in question are:

HeadSize

Genes

LinkingFunction

TimeSeriesDelayTime

TimeSeriesEmbeddingDimension

UseRNC

ConstantsPerGene

Changing the Function Set

The functions node allows adding and removing functions as well as changing the weight of the functions that are part of the function set:

The functions node can contain any number of function nodes which specify the action and the symbol to change. See the example:
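A functions node making the three changes discussed next might look like this sketch. The action values and attribute names are the documented ones; the exact spelling of the Pow symbol and the weight given on the remove action (included only because all three attributes are required) are assumptions.

```xml
<!-- Sketch only: the symbol spelling and the weight on 'remove' are assumed -->
<functions>
    <!-- Add the power function with a weight of 2 -->
    <function action='add' symbol='Pow' weight='2'/>
    <!-- Raise the weight of addition to 5 -->
    <function action='set' symbol='+' weight='5'/>
    <!-- Remove multiplication from the function set -->
    <function action='remove' symbol='*' weight='1'/>
</functions>
```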

In this example we are adding the function Pow (power) to the function set with a weight of 2; we are increasing the weight of the addition (+) function to 5; and, finally, we are removing the multiplication (*) function from the function set.

When you are continuing a run (Continue, Simplify and Complexify run options) you must ensure that you are not removing a function that is used in a model of the original run and that you are not adding a function with higher arity than the maximum arity of the run. If you are starting a run from scratch then the only limitation is that you should not remove all the functions from the function set.

Each function node must contain the following three attributes:

action: whether to add, remove or change the weight of the function.

Values:

add: Adds the function to the function set.

remove: Removes the function from the function set.

set: Changes the weight of the function.

symbol: the representation of the function (see the Functions entry for a list of supported symbols).

weight: the weight of the function in the function set (integer).

If you add a function which has higher arity than the max arity of the current function set, then GeneXproServer gives a warning and clears all the models in the run. The same happens if the removal of a function causes the max arity of the function set to change.

Time Series Prediction Modes

In GeneXproServer and GeneXproTools the Time Series Prediction category can run in two prediction modes: Testing and Prediction. It is also possible to switch back and forth between prediction modes; in GeneXproServer this is done with the transform node. This node has one attribute that determines which prediction mode the run is being converted to:

The Classification and Logistic Regression categories support datasets with multiple classes in the response variable. With GeneXproServer it is possible to change which of these classes is assigned the value 1 (all the other classes are assigned the value 0). This transformation is applied before processing the run and uses the transform node:

The example above sets the class IrisSetosa to be the singled out class. Note that the name of the singled out class must exactly match one of the classes in the response variable or the transformation will fail.

Testing

At the end of a run with a validation/test dataset, GeneXproServer by default tests only the last model on the validation/test dataset; all the intermediate models are left untested. If you want to test all the models, or a number of models, against the validation/test dataset, you can use the test node:

In this example the run will be processed for 30 minutes and, at the end of the run, all the models will be tested on the validation dataset. If the value of the whichmodels attribute were 10 instead of all, for example, only the last 10 models would be tested:

Finally, you can specify the dataset to test by adding a dataset attribute, which can have the values training, validation or both. The first tests the training set; the second, which is the default and can be omitted, tests the validation set; and both tests both datasets. This can be very useful if you change the fitness function or the favorite statistic and want to recalculate their values for a large number of models. This functionality is similar to the Refresh/Test buttons in the History Panel of GeneXproTools.

In this case all the models will be tested against the training and validation datasets.

whichmodels: Either all or a positive integer.

dataset: Defines the dataset to be tested.

Values:

training

validation

both
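The three variants discussed in this section could be sketched as follows. The test node and its whichmodels and dataset attributes are as documented; the run ids and the start run type are assumed for illustration.

```xml
<!-- Sketch only: run ids and type='start' are assumed -->
<!-- Test all models on the validation dataset (the default dataset) -->
<run id='1' type='start' stopcondition='minutes' value='30'>
    <test whichmodels='all'/>
</run>
<!-- Test only the last 10 models on the validation dataset -->
<run id='2' type='start' stopcondition='minutes' value='30'>
    <test whichmodels='10'/>
</run>
<!-- Test all models on both the training and validation datasets -->
<run id='3' type='start' stopcondition='minutes' value='30'>
    <test whichmodels='all' dataset='both'/>
</run>
```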

Pre-Selecting and Selecting Models

There are two selection nodes: the preselect and the select nodes. Both set a model to be the active or current model according to a number of criteria or to a model index. The preselect operation sets the model to active before the run starts and is useful when you want to continue a run from a model that was not the active model in the original run. The select node changes the active model at the end of a run and can be useful when generating code from models that are not the last one.

<run id='1' type='continue' stopcondition='generations' value='500'>
    <!-- Select model number 3 before starting the run -->
    <preselect criteria='modelindex' value='3'/>
    <!-- Select the model with the best validation fitness at the end of the run -->
    <select criteria='bestvalidationfitness'/>
</run>
<run id='2' type='continue' stopcondition='generations' value='500'>
    <settings>
        <setting key='FavoriteStatistic' value='36'/>
    </settings>
    <test whichmodels='all' dataset='both'/>
    <!-- Select model with best average training and validation fitness before starting the run -->
    <preselect criteria='avgtrainingvalidationfitness'/>
    <!-- Select model with best average training and validation favorite statistic at the end of the run -->
    <select criteria='avgtrainingvalidationfavorite'/>
</run>

In the first run of this example GeneXproServer selects model number 3, then improves that model for 500 generations, after which it sets the model with the best fitness value in the validation dataset to be the active model. When the select node is not specified GeneXproServer always selects the model with the best training fitness.

In the second run of this example GeneXproServer selects the model with the best average fitness value in the training and validation datasets to be used as the seed model in another optimization run of 500 generations. Then at the end of the run it selects the model with the best average value for the favorite statistic in the training and validation datasets. The extra code for setting the favorite statistic (the area under the ROC curve in this case) and testing all models on the training and validation datasets must be included if you want to apply any of the selection criteria that involve favorite statistics.

criteria: The type of selection.

Values:

besttrainingfitness

bestvalidationfitness

lastmodel

firstmodel

random

besttrainingfavorite

bestvalidationfavorite

avgtrainingvalidationfavorite

avgtrainingvalidationfitness

modelindex

value: Integer that matches the model index in a run. Only used when criteria is set to modelindex.

Converting Model to Code

GeneXproServer lets you convert any model to any of the 19 supported programming languages or to your own custom programming language. To this end it uses the convert node. This node has several attributes that give you control over the language, the format of the code, the output type for Classification and Logistic Regression runs, and the location where the code is saved.

This example converts the last model to JavaScript and saves it to the file MyModel.js (filename) next to the run. The code will use the labels for the variable names (uselabels) and will output the most likely class (outputtype).
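A convert node matching this description might look like the sketch below. The attribute names are the ones documented in this section; the spelling of the language value, the text format and the use of an idle run (which creates no new models) are assumptions.

```xml
<!-- Sketch only: the language spelling, format and run type are assumed -->
<run id='1' type='idle'>
    <!-- No extension on filename: GeneXproServer appends the language extension -->
    <convert language='JavaScript' uselabels='yes' format='text'
             filename='MyModel' outputtype='mostlikelyclass'/>
</run>
```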

language: Any of the 19 supported programming languages or the name of a custom language.

uselabels: When set to yes the code uses the variable labels for the variable names in the model. Otherwise it uses the default d0, d1, …, dn representation.

Values:

yes

no

format: The format of the model. This allows exporting the model for external systems (xml) or to serve to web clients directly (json).

Values:

text

json

xml

filename: The name of the file where the converted model will be saved. It should not have an extension since GeneXproServer will add the programming language extension to the end.

outputtype: This attribute corresponds to the Model Output Types used in GeneXproTools Classification and Logistic Regression runs.

Values:

rawmodel

mostlikelyclass

probability1

grammartype: This is only applicable in Logistic Synthesis runs and corresponds to the type of gates used to build the model's code.

Values:

allgates

notandoronly

nandonly

noronly

muxsystem

reedmullersystem

Consolidated Runs

Very long runs can produce a very large number of models that need to be analyzed at the end of a job. To help with model selection GeneXproServer can pick a particular model from each run in a job according to a certain criterion and add them all to a new run file. Usually you would select the best model of each run according to the selection criterion you are interested in.

In this example we have 7 runs that will each run for some time, and at the end GeneXproServer will create a file named BreastCancer_consolidated.gep with 7 models, one selected from each of the generated runs because it had the best accuracy in the validation set.
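A job of this shape could be sketched as follows. The value 4 for createconsolidatedrun selects by best validation accuracy, as listed below; the template file name BreastCancer.gep is inferred from the name of the consolidated file, and the path, feedback value, run type and durations are placeholders.

```xml
<!-- Sketch only: path, feedback, run type and durations are placeholders -->
<job filename='BreastCancer.gep' path='c:\work' feedback='5' createconsolidatedrun='4'>
    <run id='1' type='start' stopcondition='minutes' value='30'/>
    <run id='2' type='start' stopcondition='minutes' value='30'/>
    <run id='3' type='start' stopcondition='minutes' value='30'/>
    <run id='4' type='start' stopcondition='minutes' value='30'/>
    <run id='5' type='start' stopcondition='minutes' value='30'/>
    <run id='6' type='start' stopcondition='minutes' value='30'/>
    <run id='7' type='start' stopcondition='minutes' value='30'/>
</job>
```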

createconsolidatedrun: Create a new run file with models selected from the generated runs in the job.

Values:

0: Don't Create

1: Select based on best Training Fitness

2: Select based on best Validation Fitness

3: Select based on best Training Accuracy

4: Select based on best Validation Accuracy

5: Select based on best Training R-square

6: Select based on best Validation R-square

7: Select based on best Training Favorite Statistic

8: Select based on best Validation Favorite Statistic

Making Predictions (Time Series Prediction)

GeneXproServer lets you automate the creation of predictions and can export them to a number of formats. This can be achieved using xml similar to the example below:

In this example the run is processed for 0.3 minutes and then 3 predictions are generated using the last model. These predictions are saved in xml format to a file that is derived from the run’s filename (DowJones_000001_predictions.xml in this case).

quantity: The number of predictions to generate (positive integer).

format: The format used to lay the predictions in the file.

Values:

text

json

xml

filename: Either a filename or the word auto. When set to auto the predictions file will be named after the run name with the suffix _predictions.
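The prediction setup described in this section might be written as in the sketch below. The quantity, format and filename attributes are as documented; the element name predict is a guess, since this section names only the attributes, and the start run type is also assumed.

```xml
<!-- Sketch only: 'predict' is an assumed element name and type='start' is assumed -->
<run id='1' type='start' stopcondition='minutes' value='0.3'>
    <!-- Generate 3 predictions and save them as XML, named after the run -->
    <predict quantity='3' format='xml' filename='auto'/>
</run>
```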

Sync/Async Processing

The current version of GeneXproServer introduces a very efficient parallel processing algorithm that allows it to scale linearly to a large number of cores. By default all runs are now processed in parallel, but you can revert to single run processing by setting the async attribute of the job node to no. This may be important when you are using external fitness functions that cannot service more than one run at a time.

Each run is processed in a separate Windows process, so the runs are completely isolated from each other. GeneXproServer queues all the runs at the beginning of a job but processes only as many runs simultaneously as there are cores available.

In this example, which assumes a 4-core CPU and that all runs are equal in structure and dataset, GeneXproServer will process the first 4 runs simultaneously until minute 30, when run 1 finishes and GeneXproServer starts run 5. At minute 45 runs 2 to 4 will end and runs 6 and 7 will start. Run 5 will end 30 minutes later and the last two will run for a further 15 minutes. These values will vary due to the random nature of the algorithms, but the total processing time will be close to a fourth of the time it would take to process these runs serially.