A new system for runtime estimation and credit

Terminology

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

Application performance depends on other factors as well,
such as the speed of the host's memory system.
So a given job might take the same amount of CPU time
on 1 GFLOPS and 10 GFLOPS hosts.
The efficiency of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

For our purposes, the peak FLOPS of a device
is based on single or double precision, whichever is higher.

BOINC's estimate of the peak FLOPS of a device may be wrong,
e.g. because the manufacturer's formula is incomplete or wrong.

The first credit system

In the first iteration of BOINC's credit system,
"claimed credit" C of job J on host H was defined as

C = H.whetstone * J.cpu_time

There were then various schemes for taking the
average or min claimed credit of the replicas of a job,
and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.
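The first system can be sketched as follows (an illustrative sketch, not BOINC's actual code; the field names and the averaging-based granting scheme are assumptions based on the description above):

```python
# Sketch of the first, Peak-FLOPS-based credit system.
# Field names (whetstone, cpu_time) follow the formula in the text;
# granted_credit() shows one of the "various schemes" (averaging replicas).

def claimed_credit(host_whetstone_gflops: float, cpu_time_s: float) -> float:
    """Claimed credit: C = H.whetstone * J.cpu_time."""
    return host_whetstone_gflops * cpu_time_s

def granted_credit(claims: list[float]) -> float:
    """One granting scheme: average the claims of a job's replicas."""
    return sum(claims) / len(claims)

# A 1 GFLOPS host and a 10 GFLOPS host run the same job in the same CPU time:
c1 = claimed_credit(1.0, 3600)    # 3600.0
c2 = claimed_credit(10.0, 3600)   # 36000.0
grant = granted_credit([c1, c2])  # 19800.0 -- far below the faster host's claim
```

This makes the fairness problem concrete: the 10 GFLOPS host claims 36000 but is granted 19800, while the slow host is granted far more than it claimed.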

The problem with this system is that,
for a given app version, efficiency can vary widely between hosts.
In the above example, the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

Furthermore, the credit granted to a given host for a
series of identical jobs could vary widely,
depending on the hosts it was paired with by replication.
This seemed arbitrary and unfair to users.

The second credit system

We then switched to the philosophy that
credit should be proportional to the FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home's application allowed counting of FLOPs,
and they adopted this system,
adding a scaling factor so that average credit per job
was the same as the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system has several problems:

It doesn't address GPUs properly; projects using GPUs
have to write custom code.

Projects that can't count FLOPs still have device neutrality problems.

It doesn't prevent credit cheating when single replication is used.

Goals of the new (third) credit system

Limited project neutrality: different projects should grant
about the same amount of credit per host-hour, averaged over hosts.
Projects with GPU apps should grant credit in proportion
to the efficiency of the apps.
(This means that projects with efficient GPU apps will
grant more credit than projects with inefficient apps. That's OK).

Cheat-resistance.

A priori job size estimates and bounds

For each job, the project supplies

an estimate of the FLOPs used by a job (wu.fpops_est)

a limit on FLOPs, after which the job will be aborted
(wu.fpops_bound).

Previously, inaccuracy of fpops_est caused problems.
The new system still uses fpops_est,
but its primary purpose is now to indicate the relative sizes of jobs.

Averages of FLOP count and elapsed time
are normalized by fpops_est (see below),
and if fpops_est is correlated with runtime,
these averages will converge more quickly.

Notes:

A posteriori estimates of job size may exist also,
e.g., an iteration count reported by the app.
They aren't cheat-proof, and we don't use them.

Peak FLOP Count

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job J is issued to a host,
the scheduler computes peak_flops(J)
based on the resources used by the job and their peak speeds.

When a client finishes a job and reports its elapsed time T,
we define peak_flop_count(J), or PFC(J) as

PFC(J) = T * peak_flops(J)

The credit for a job J is typically proportional to PFC(J),
but is limited and normalized in various ways.
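The two definitions above can be sketched as follows (illustrative only; the device speeds and usage counts are assumptions, since the real scheduler derives them from the job's assigned resources):

```python
# Sketch of peak_flops(J) and PFC(J), per the definitions in the text.

def peak_flops(cpu_peak: float, ncpus: float,
               gpu_peak: float = 0.0, ngpus: float = 0.0) -> float:
    """Peak FLOPS of the devices allocated to the job."""
    return cpu_peak * ncpus + gpu_peak * ngpus

def pfc(elapsed_time_s: float, peak: float) -> float:
    """Peak FLOP count: PFC(J) = T * peak_flops(J)."""
    return elapsed_time_s * peak

# A CPU job using one 5 GFLOPS core for 1000 seconds of elapsed time:
print(pfc(1000, peak_flops(cpu_peak=5.0, ncpus=1)))  # 5000.0 (GFLOPs of peak capacity)
```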

Notes:

PFC(J) is not reliable;
cheaters can falsify elapsed time or device attributes.

We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
(e.g., a CPU job that does lots of disk I/O)
PFC() won't reflect this. That's OK.
The key thing is that BOINC allocated the device to the job,
whether or not the job used it efficiently.

peak_flops(J) may not be accurate; e.g., a GPU job may take
more or less CPU than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
dynamically determines the CPU usage.
For now, though, we'll just use the scheduler's estimate.

For projects (CPDN) that grant partial credit via
trickle-up messages, substitute "partial job" for "job".
These projects must include elapsed time and result ID
in the trickle message.

Computing averages

The policies described below involve computing averages
of various quantities.
This computation must take into account:

The quantities being averaged may gradually change over time
(e.g. average job size may change)
and we need to track this.
This is done as follows: for the first N samples
we take the straight average.
After that we use an exponentially-weighted average with parameter A.
The choice of N and A depends on the entity involved;
for app versions (which typically get thousands of jobs per day)
we might use N=100 and A=.001.
For hosts (which typically get a few jobs per day)
we might use N=10 and A=.01.

To reduce the effect of erroneously huge samples,
samples after the first are capped at X times the current average.
X depends on the entity:
maybe 10 for hosts, 100 for app versions.

We keep track of the number of samples,
and use an average only if its number of samples
is above a sample threshold.
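The three requirements above can be combined into one small accumulator; this is a sketch under the parameter names from the text (N, A, X), not the server's actual implementation:

```python
# Sketch of the averaging scheme: straight average for the first N samples,
# then an exponentially-weighted average with parameter A; samples after the
# first are capped at X times the current average.

class RunningAverage:
    def __init__(self, n_straight: int, alpha: float, cap: float):
        self.n_straight = n_straight   # N: length of the straight-average phase
        self.alpha = alpha             # A: weight of each new sample afterwards
        self.cap = cap                 # X: max ratio of sample to current average
        self.nsamples = 0
        self.avg = 0.0

    def update(self, sample: float) -> None:
        if self.nsamples > 0:
            # clamp erroneously huge samples
            sample = min(sample, self.cap * self.avg)
        self.nsamples += 1
        if self.nsamples <= self.n_straight:
            # straight (arithmetic) average over the first N samples
            self.avg += (sample - self.avg) / self.nsamples
        else:
            # exponentially-weighted moving average thereafter
            self.avg += self.alpha * (sample - self.avg)

    def above_threshold(self, threshold: int) -> bool:
        return self.nsamples >= threshold

# Host-style parameters (N=10, A=.01, X=10); the 1000.0 outlier is capped at 10:
ra = RunningAverage(n_straight=10, alpha=0.01, cap=10.0)
for s in [1.0, 1.0, 1.0, 1000.0]:
    ra.update(s)
```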

Statistics maintained by the server

The server maintains the following statistics:

host_app_version.pfc_avg

for each app version V and host H,
the average of PFC(J)/wu.fpops_est for jobs completed by H using V.

app_version.pfc_avg

the average of PFC(J)/wu.fpops_est for all jobs
completed by the app version.

Sanity check

If PFC(J) > wu.fpops_bound,
J is assigned a "default PFC" D, and the job is not used to update statistics.
D is determined as follows:

If app.min_avg_pfc is defined then

D = app.min_avg_pfc * wu.fpops_est

Otherwise

D = wu.fpops_est
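The sanity check can be sketched as follows (illustrative; `None` stands for "app.min_avg_pfc not defined"):

```python
# Sketch of the sanity check: jobs whose PFC exceeds wu.fpops_bound get a
# default PFC D and are excluded from the statistics.

def default_pfc(fpops_est: float, min_avg_pfc: "float | None") -> float:
    """D, per the two cases in the text."""
    if min_avg_pfc is not None:
        return min_avg_pfc * fpops_est
    return fpops_est

def checked_pfc(pfc: float, fpops_est: float, fpops_bound: float,
                min_avg_pfc: "float | None") -> tuple:
    """Return (PFC to use, whether the sample may update statistics)."""
    if pfc > fpops_bound:
        return default_pfc(fpops_est, min_avg_pfc), False
    return pfc, True
```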

Cross-version normalization

A given application may have multiple versions
(e.g., CPU, multi-thread, and GPU versions).
If jobs are distributed uniformly to versions,
all versions should get the same average granted credit.
To make this so, we scale PFC as follows.

For each app, we periodically compute cpu_pfc
(the weighted average of app_version.pfc over CPU app versions)
and similarly gpu_pfc.
We then compute X as follows:

If there are only CPU or only GPU versions,
and at least 2 versions are above sample threshold,
X is their average (weighted by # samples).

If there are both, and at least 1 of each is above sample
threshold, let X be the min of the averages.

If X is defined, then for each app version

app_version.pfc_scale = (X/app_version.pfc_avg)

The PFC of the app version's jobs are scaled by this factor.

If X is defined, then we set

app.min_avg_pfc = X

Otherwise, if an app version is above sample threshold, we set

app.min_avg_pfc = app_version.pfc_avg
app_version.pfc_scale = 1
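The computation of X and the scale factors can be sketched like this (illustrative; versions are modeled as (resource, pfc_avg, nsamples) tuples, and the threshold value is an assumption):

```python
# Sketch of cross-version normalization, following the cases in the text.

def weighted_avg(versions):
    """Average of pfc_avg, weighted by number of samples."""
    total = sum(n for _, _, n in versions)
    return sum(avg * n for _, avg, n in versions) / total

def compute_x(versions, threshold):
    """Return X, or None if undefined."""
    ok = [v for v in versions if v[2] >= threshold]
    cpu = [v for v in ok if v[0] == "cpu"]
    gpu = [v for v in ok if v[0] == "gpu"]
    if cpu and gpu:
        # both kinds above threshold: min of the per-resource averages
        return min(weighted_avg(cpu), weighted_avg(gpu))
    same = cpu or gpu
    if len(same) >= 2:
        # only one kind: average of at least 2 versions above threshold
        return weighted_avg(same)
    return None

# A CPU version and a much less efficient GPU version (higher pfc_avg):
versions = [("cpu", 0.1, 500), ("gpu", 1.0, 500)]
x = compute_x(versions, threshold=100)        # min(0.1, 1.0) = 0.1
scales = [x / avg for _, avg, _ in versions]  # pfc_scale per version
```

Here the GPU version's jobs are scaled down by a factor of 10, so both versions grant the same average credit.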

Notes:

Version normalization is only applied if at least two
versions are above sample threshold.

Version normalization addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e. the ratio of actual FLOPs to peak FLOPs is much less).
To a certain extent, this mechanism shifts the system
towards the "Actual FLOPs" philosophy,
since credit is granted based on the most efficient app version.
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.

If jobs are not distributed uniformly among versions
(e.g. if SETI@home VLAR jobs are done only by the CPU version)
then this mechanism doesn't work as intended.
One solution is to create separate apps for separate types of jobs.

Cheating or erroneous hosts can influence app_version.pfc_avg to some extent.
This is limited by the "sanity check" mechanism,
and by the fact that only validated jobs are used.
The effect on credit will be negated by host normalization
(see below).
There may be an adverse effect on cross-version normalization.
This could be eliminated by computing app_version.pfc_avg
as the sample-median value of host_app_version.pfc_avg.

Host normalization

The second normalization is across hosts.
Assume jobs for a given app are distributed uniformly among hosts.
Then the average credit per job should be the same for all hosts.

To achieve this, we scale PFC by the factor

app_version.pfc_avg / host_app_version.pfc_avg

This scaling is only done if both statistics are above sample threshold.
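A sketch of this rule (illustrative argument names; the threshold value is an assumption):

```python
# Sketch of host normalization: scale claimed PFC by
# app_version.pfc_avg / host_app_version.pfc_avg, but only when both
# averages are above the sample threshold.

def host_normalized_pfc(pfc: float,
                        av_pfc_avg: float, av_nsamples: int,
                        hav_pfc_avg: float, hav_nsamples: int,
                        threshold: int) -> float:
    if av_nsamples >= threshold and hav_nsamples >= threshold:
        return pfc * (av_pfc_avg / hav_pfc_avg)
    return pfc

# A host half as efficient as average (its pfc_avg is twice the version's)
# gets its claimed PFC halved:
print(host_normalized_pfc(100.0, 2.0, 1000, 4.0, 50, threshold=10))  # 50.0
```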

There are some cases where hosts are not sent jobs uniformly:

job-size matching (smaller jobs sent to slower hosts)

GPUGrid.net's scheme for sending some (presumably larger)
jobs to GPUs with more processors.

The normalization by wu.fpops_est handles this
(assuming that it's set correctly).

Notes:

For apps with large variance of job sizes,
the host normalization mechanism is vulnerable to
a type of cheating called "cherry picking".
A mechanism for defeating this is described below.

The host normalization mechanism reduces the claimed credit of hosts
that are less efficient than average,
and increases the claimed credit of hosts that are more efficient
than average.

Anonymous platform

For jobs done by anonymous platform apps,
the server knows the devices involved and can estimate PFC.
It maintains host_app_version records for anonymous platform,
and it keeps track of PFC and elapsed time statistics there.
There are separate records per resource type.
The record's app_version_id encodes the app ID and the resource type
(-2 for CPU, -3 for NVIDIA GPU, -4 for ATI).

If app.min_avg_pfc is defined
and host_app_version.pfc_avg is above sample threshold,
we normalize PFC by the factor

app.min_avg_pfc/host_app_version.pfc_avg

Otherwise the claimed PFC is

app.min_avg_pfc * wu.fpops_est

If app.min_avg_pfc is not defined, the claimed PFC is

wu.fpops_est
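The three cases above can be sketched as one function (illustrative; `None` stands for "not defined"):

```python
# Sketch of claimed PFC for anonymous-platform jobs, per the cases in the text.

def anon_claimed_pfc(pfc: float, fpops_est: float,
                     min_avg_pfc: "float | None",
                     hav_pfc_avg: float, hav_nsamples: int,
                     threshold: int) -> float:
    if min_avg_pfc is not None:
        if hav_nsamples >= threshold:
            # normalize PFC by app.min_avg_pfc / host_app_version.pfc_avg
            return pfc * (min_avg_pfc / hav_pfc_avg)
        # host stats below threshold: fall back to the job-size estimate
        return min_avg_pfc * fpops_est
    # app.min_avg_pfc not defined
    return fpops_est
```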

Notes:

In the current design, anonymous platform jobs don't
contribute to app.min_avg_pfc,
but it may be used to determine their credit.
This may cause problems:
e.g., suppose a project offers an inefficient version
and volunteers make a much more efficient version
and run it anonymous platform.
They'd get an unfairly low amount of credit.
This could be fixed by creating app_version records
representing all anonymous platform apps of a given
platform and resource type.

Summary

Given a validated job J, we compute

the "claimed PFC" F

a flag "approx" that is true if F is an approximation
and may not be comparable with other instances of the job

Cross-project version normalization

If an application has both CPU and GPU versions,
the version normalization mechanism figures out
which version is most efficient and uses that to reduce
the credit granted to less-efficient versions.

If a project has an app with only a GPU version,
there's no CPU version for comparison.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
violating limited project neutrality.

A solution to this: if an app has only GPU versions,
then for each version V we let
S(V) be the average scaling factor
for that resource type among projects that have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.

Projects will export the following data:

for each app version
app name
platform name
recent average granted credit
plan class
scale factor

The BOINC server will collect these from several projects
and will export the following:

for each plan class
average scale factor (weighted by RAC)

We'll provide a script that identifies app versions
for GPUs with no corresponding CPU app version,
and sets their scaling factor based on the above.
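The central server's RAC-weighted averaging might look like this sketch (field names assumed from the export list above; this is not the actual script):

```python
# Sketch of the per-plan-class average scale factor, weighted by
# recent average credit (RAC), computed from data exported by projects.

def avg_scale_factor(entries) -> float:
    """entries: list of (scale_factor, recent_average_credit) per app version."""
    total_rac = sum(rac for _, rac in entries)
    return sum(sf * rac for sf, rac in entries) / total_rac

# Two projects' CUDA versions, say, with different scale factors and RAC:
print(avg_scale_factor([(0.2, 1000.0), (0.4, 3000.0)]))  # 0.35
```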

This means that no special cheat-prevention scheme
is needed for single replication;
in this case, granted credit = claimed credit.

However, two kinds of cheating still have to be dealt with:

One-time cheats

For example, claiming a PFC of 1e304.

This is handled by the sanity check mechanism,
which grants a default amount of credit
and treats the host with suspicion for a while.

Cherry picking

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

When sending a job to a host,
if scale_probation is true,
set host_scale_time to now+X, where X is the app's delay bound.

When a job is successfully validated,
and now > host_scale_time,
set scale_probation to false.

If a job times out or errors out,
set scale_probation to true,
max the scale factor with 1,
and set host_scale_time to now+X.

When computing claimed credit for a job,
if now < host_scale_time, don't use the host scale factor.

The idea is to use the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.
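The four rules above form a small state machine over the host_app_version fields named in the text; this sketch is illustrative (times are seconds, and the method names are assumptions):

```python
# Sketch of the cherry-picking defense using the scale_probation and
# host_scale_time fields described in the text.

class HostAppVersion:
    def __init__(self, delay_bound: float):
        self.delay_bound = delay_bound    # X: the app's delay bound
        self.scale_probation = True
        self.host_scale_time = 0.0

    def on_send(self, now: float) -> None:
        # when sending a job, extend the probation window if on probation
        if self.scale_probation:
            self.host_scale_time = now + self.delay_bound

    def on_validated(self, now: float) -> None:
        # a job validated after the window ends probation
        if now > self.host_scale_time:
            self.scale_probation = False

    def on_timeout_or_error(self, now: float, scale: float) -> float:
        # timeouts/errors restart probation; "max the scale factor with 1"
        self.scale_probation = True
        self.host_scale_time = now + self.delay_bound
        return max(scale, 1.0)

    def use_host_scale(self, now: float) -> bool:
        # don't use the host scale factor while inside the window
        return now >= self.host_scale_time
```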

Because this mechanism is punitive to hosts
that experience actual failures,
it's selectable on a per-application basis (default off).

In addition, to limit the extent of cheating
(in case the above mechanism is defeated somehow)
the host scaling factor will be min'd with a constant (say, 10).

Error rate, host punishment, and turnaround time estimation

Unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU)
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same is true of turnaround time.

So we'll move the "error_rate" and "turnaround_time"
fields from the host table to host_app_version.

The host punishment mechanism is designed to deal with malfunctioning hosts.
For each host the server maintains max_results_day.
This is initialized to a project-specified value (e.g. 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.

This should also be per-app-version,
so we'll move "max_results_day" from the host table to host_app_version.

App plan functions

App plan functions no longer have to make a FLOPS estimate.
They just have to return the peak device FLOPS.

The scheduler adjusts this,
using the elapsed time statistics,
to get the app_version.flops_est it sends to the client
(from which job durations are estimated).

Scheduler changes

When dispatching an anonymous platform job,
set result.app_version_id to -2/-3/-4 depending on the resource type.

update host_app_version.host_scale_time for
app versions for which jobs are being sent
and for which scale_probation is set.

Validator changes

To reduce DB access, validator maintains a vector of app_versions.
This is appended to by assign_credit_set().
At the start of every validator pass, the pfc and expavg_credit
fields of the app versions are saved.
Updates are accumulated,
and at the end of the validator pass (before sleep())
the incremental changes are written to the DB.
This scheme works correctly even with multiple validators per app.

The updating of app_versions is done in such a way that
we pick up changes to pfc_scale by the feeder.

The app record is reread at the start of each scan,
in case its min_avg_pfc has been changed by the feeder.

check_set() no longer returns credit (leave arg there for now)

update host_app_version.scale_probation in is_valid()

don't grant credit in is_valid()

compute and grant credit in handle_wu()

Feeder changes

If we're the "main feeder" (mod = 0, or mod not used),
update app_version.pfc_scale and app.min_avg_pfc every 10 minutes.

Copyright (c) 2014 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.