In the first set of experiments, we attempted to determine how the
quality of ATTac-2001's hotel price predictions affects its performance.
To this end, we devised seven price prediction schemes, varying
considerably in sophistication and inspired by approaches taken
by other TAC competitors, and incorporated these schemes into our
agent. We then played these seven agents against one another
repeatedly, with regular retraining as described below.

Following are the seven hotel prediction schemes that we used, in
decreasing order of sophistication:

ATTac-2001_s:
This is the ``full-strength'' agent based on boosting that was used
during the tournament. (The subscript s denotes sampling.)

Cond_s:
This agent samples prices from the empirical distribution of prices
from previously played games, conditioned only on the closing time of the
hotel room (a subset of the features used by
ATTac-2001_s).
In other words, it collects all historical hotel prices and breaks them
down by the time at which the hotel closed (as well as room type, as
usual).
The price predictor then simply samples from the collection of prices
corresponding to the given closing time.

Simple_s:
This agent samples prices from the empirical distribution of prices
from previously played games, without regard to the closing time of the
hotel room (but still broken down by room type). It uses a subset of
the features used by
Cond_s.
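A minimal sketch of these two empirical sampling schemes, assuming a flat list of historical records; the room types, closing times, and prices below are invented for illustration and are not taken from actual TAC logs:

```python
import random

# Hypothetical historical records: (room_type, closing_time, price).
HISTORY = [
    ("towers", 1, 120.0), ("towers", 1, 135.0), ("towers", 2, 180.0),
    ("towers", 2, 210.0), ("shanties", 1, 90.0), ("shanties", 2, 160.0),
]

def sample_conditional(room_type, closing_time, rng=random):
    """Sample a closing price from games in which this room type
    closed at the given time (the conditional scheme)."""
    prices = [p for (r, t, p) in HISTORY
              if r == room_type and t == closing_time]
    return rng.choice(prices)

def sample_unconditional(room_type, rng=random):
    """Sample a closing price from all games for this room type,
    ignoring closing time (the simpler scheme)."""
    prices = [p for (r, t, p) in HISTORY if r == room_type]
    return rng.choice(prices)
```

The only difference between the two schemes is the filter applied to the historical collection before drawing a sample.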

ATTac-2001_ev,
Cond_ev,
Simple_ev: These agents
predict in the same way
as their corresponding predictors above, but instead of returning a
random sample from the estimated distribution of hotel
prices, they deterministically return the expected value of the
distribution. (The subscript ev denotes expected value, as introduced in
Section 2.1.)
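The expected-value variants amount to replacing the random draw with the mean of the same empirical collection; a sketch, again with invented data:

```python
from statistics import mean

# Invented historical (room_type, closing_time, price) records.
HISTORY = [("towers", 1, 120.0), ("towers", 1, 135.0), ("towers", 2, 180.0)]

def predict_expected_value(room_type, closing_time):
    """Deterministically return the mean of the empirical price
    distribution instead of a random sample from it."""
    prices = [p for (r, t, p) in HISTORY
              if r == room_type and t == closing_time]
    return mean(prices)
```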

CurrentBid:
This agent uses a very simple predictor that always predicts that the
hotel room will close at its current price.

In every case, whenever the price predictor returns a price that is
below the current price, we replace it with the current price (since
prices cannot go down).
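Both the CurrentBid rule and the floor applied to every predictor reduce to one-line operations; a sketch (the function names are ours):

```python
def current_bid_prediction(current_price):
    """CurrentBid: predict that the room closes at its current price."""
    return current_price

def clamp_to_current(raw_prediction, current_price):
    """Hotel prices in TAC ascending auctions never decrease, so any
    predicted price below the current quote is raised to that quote."""
    return max(raw_prediction, current_price)
```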

In our experiments, we added an eighth agent, EarlyBidder, inspired by
the livingagents agent. EarlyBidder used ATTac-2001_ev to predict
closing prices, determined
an optimal set of purchases, and then placed bids for these goods at
sufficiently high prices to ensure that they would be purchased
($1001 for all hotel rooms, just as livingagents did in
TAC-01) right after the first flight quotes. It then never revised
these bids.

Each of these agents requires training, i.e., data from previously
played games. However, we are faced with a sort of ``chicken and
egg'' problem: to run the agents, we need to first train the agents
using data from games in which they were involved, but to get this kind of
data, we need to first run the agents. To get around this problem, we
ran the agents in phases. In Phase I, which consisted of 126 games,
we used training data from the seeding, semifinals and finals rounds
of TAC-01. In Phase II, lasting 157 games, we retrained the agents
once every six hours using all of the data from the seeding,
semifinals and finals rounds as well as all of the games played in
Phase II. Finally, in Phase III, lasting
622 games, we continued to retrain the agents once every six hours,
but now using only data from games played during Phases I and II, and
not including data from the seeding, semifinals and finals rounds.
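The phase-dependent choice of training data can be summarized as a small selector; a sketch under the phase definitions above (the argument names are ours):

```python
def training_set(phase, tac01_games, phase1_games, phase2_games):
    """Return the games used for retraining in each experimental phase.
    tac01_games: seeding, semifinal, and final games from TAC-01."""
    if phase == 1:
        return list(tac01_games)
    if phase == 2:
        # TAC-01 rounds plus the games played so far in Phase II.
        return list(tac01_games) + list(phase2_games)
    # Phase III: only games from Phases I and II, no TAC-01 data.
    return list(phase1_games) + list(phase2_games)
```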

Table 12:
The average relative scores (± standard deviation) for eight agents in
the three phases (Phase I, Phase II, and Phase III) of our controlled
experiment in which the hotel prediction algorithm was varied. The
relative score of an agent is its score minus the average score of all
agents in that game. The agent's rank within each phase is shown in
parentheses.

Table 12 shows how the agents performed in
each of these
phases.
Much of what we observe in this table is consistent with our
expectations.
The more sophisticated boosting-based agents (ATTac-2001_s
and ATTac-2001_ev) clearly dominated the agents based on simpler
prediction schemes.
Moreover, with continued training, these agents improved markedly
relative to EarlyBidder.
We also see that the performance of the simplest agent, CurrentBid,
which does not employ any kind of training, declines significantly
relative to the other, data-driven agents.

On the other hand, there are some phenomena in this table that were
very surprising to us. Most surprising was the failure of sampling to
help. Our strategy relies heavily not only on estimating hotel
prices, but also on taking samples from the distribution of hotel prices.
Yet these results indicate that using expected hotel price, rather
than price samples, consistently performs better. We speculate that
this may be because an insufficient number of samples are being used
(due to computational limitations) so that the numbers derived from
these samples have too high a variance. Another possibility is that
the method of using samples to estimate scores consistently
overestimates the expected score because it assumes the agent can
behave with perfect knowledge for each individual sample--a property
of our approximation scheme. Finally, as our algorithm uses sampling
at several different points (computing hotel expected values, deciding
when to buy flights, pricing entertainment tickets, etc.), it is
quite possible that sampling is beneficial for some decisions while
detrimental for others.
For example, when directly comparing versions of the algorithm with
sampling used at only subsets of the decision points, the data
suggests that sampling for the hotel decisions is most beneficial,
while sampling for the flights and entertainment tickets is neutral at
best, and possibly detrimental. This result is not surprising given
that the sampling approach is motivated primarily by the task of
bidding for hotels.

We were also surprised that Cond_s and Cond_ev eventually
performed worse than the less sophisticated Simple_s and
Simple_ev. One possible explanation is that the simpler model
happens to give predictions that are just as good as the more
complicated model, perhaps because closing time is not terribly
informative, or perhaps because the adjustment to price based on
current price is more significant. Other things being equal, the
simpler model has the advantage that its statistics are based on all
of the price
data, regardless of closing time, whereas the conditional model makes
each prediction based on only an eighth of the data (since there are
eight possible closing times, each equally likely).
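This data-fragmentation argument can be made concrete with the usual standard-error formula; the price standard deviation and sample size below are purely illustrative:

```python
from math import sqrt

sigma = 40.0   # hypothetical std. dev. of closing prices
n = 8000       # hypothetical number of historical prices

# Standard error of the empirical mean: sigma / sqrt(sample size).
se_unconditional = sigma / sqrt(n)      # all n prices
se_conditional = sigma / sqrt(n / 8)    # one eighth per closing time

# Conditioning on closing time inflates the standard error by sqrt(8).
ratio = se_conditional / se_unconditional
```

All else being equal, each conditional estimate is thus noisier by a factor of sqrt(8), about 2.8.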

In addition to agent performance, it is possible to measure the
inaccuracy of the eventual predictions, at least for the non-sampling
agents.
For these agents, we measured the root mean squared error of the
predictions made in Phase III. These were: 56.0 for ATTac-2001_ev,
66.6 for Simple_ev, 69.8 for CurrentBid, and 71.3 for Cond_ev.
Thus, we see that the lower the error of the predictions
(according to this measure), the higher the score.
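The error measure used here is ordinary root mean squared error over prediction/outcome pairs, which can be sketched as:

```python
from math import sqrt

def rmse(predictions, outcomes):
    """Root mean squared error of point predictions against
    realized closing prices."""
    pairs = list(zip(predictions, outcomes))
    return sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))
```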