ABSTRACT

This article investigates how machine learning methods might enhance current garbage collection techniques so that they contribute to more adaptive solutions. Machine learning is concerned with programs that improve with experience. Machine learning techniques have been successfully applied to a number of real-world problems, such as data mining, game playing, medical diagnosis, speech recognition and automated control. Reinforcement learning provides an approach in which an agent interacts with the environment and learns by trial and error rather than from direct training examples. In other words, the learning task is specified by rewards and penalties that indirectly tell the agent what it is supposed to do instead of telling it how to accomplish the task. In this article we outline a framework for applying reinforcement learning to optimize the performance of conventional garbage collectors.

In this project we have researched an adaptive decision process that makes decisions regarding which garbage collection technique should be invoked and how it should be applied. The decision is based on information about the memory allocation behavior of currently running applications. The system learns through trial and error to take the optimal actions in an initially unknown environment.

1 Introduction

JRockit™, the Java™ Virtual Machine (JVM) constructed by Appeal Virtual Machines and now owned by BEA and named WebLogic JRockit, was designed recognizing that all applications are different and have different needs. Thus, a garbage collection technique and a garbage collection strategy that work well for one particular application may perform poorly for another. To achieve good performance over a broad spectrum of different applications, various garbage collection techniques with different characteristics have been implemented. However, any garbage collection technique requires a strategy that allows it to adapt its behavior to the current context of operation. Over the past few years, the need for better and more adaptive strategies has become apparent.

Imagine that a JVM is running a program X. For this program, it might be best to garbage collect according to a rule Y: whenever Y becomes true, the JVM garbage collects. However, this might not be the optimal strategy for another program X'. For X', rule Y' might be the best choice. Combining rules Y and Y' does not have to be complicated, but consider writing a combined rule that works really well for hundreds of programs. How does the JVM implementer know that a rule that works really well for many programs does not perform badly on others? Providing startup parameters for controlling the rule heuristics is a good start, but it cannot adapt over time to a dynamic environment that has different needs at different points of time.

The idea is to let a learning decision process decide which garbage collection technique to use and how to use it, instead of having static rules make these decisions at run time. The learning decision process selects, among the different state-of-the-art garbage collection techniques in JRockit™, the one that is best suited for the current application and platform.

The objective of this investigation is to find out if machine learning is able to contribute to improved performance of a commercial product. Theoretically, machine learning could contribute to more adaptive solutions, but is such an approach feasible in practice? This paper is concerned with the question whether and, if so, how a learning decision process can be used for more dynamic garbage collection in a modern JVM, such as JRockit.

1.1 Paper Overview

Section 2 relates the paper to previous work and in Section 3 we present the problem specification. Section 4 provides a survey of the reinforcement learning method that has been used. Section 5 presents possible situations of a system that uses a garbage collector in which a learning decision process might perform better than a regular garbage collector. Section 6 covers the design of the prototype and is followed by a presentation of experimental results, a discussion of future developments and conclusions in Sections 7, 8 and 9.

2 Related work

To the best of our current knowledge, there has been no other attempt to utilize reinforcement learning in a JVM. Therefore, we are not able to provide references to similar approaches for that particular problem. Many papers on garbage collection techniques include some sort of heuristic for when the technique should be applied, but these heuristics are usually quite simple: straightforward, based on general rules, and not taking the specific characteristics of the application into account.

Brecht et al. [7] provide an analysis of when garbage collection should be invoked and when the heap should be expanded in the context of a Boehm-Demers-Weiser (BDW) collector. However, they do not introduce any adaptive learning but instead investigate the characteristic properties of different heuristics.

3 Problem Specification

The problem to solve is: how to design an automatic and learning decision process for more dynamic garbage collection in a modern JVM.

Unlike some other garbage collection techniques, such as parallel garbage collection and stop-and-copy, concurrent garbage collection starts to garbage collect before the memory heap is full. A full heap would cause all application threads to stop, which would not be necessary if the concurrent garbage collector had started in time, since a concurrent garbage collector allows running applications to run concurrently with some phases of the garbage collection. For further reading about garbage collection, see references [2, 6, 8, 9, 13, 14].

An important issue, when it comes to concurrent garbage collection in a JVM, is to decide when to garbage collect. Concurrent garbage collection must not start too late, or else the running program may run out of memory. Neither must it be invoked too frequently, since this causes more garbage collections than necessary and thereby disturbs the execution of the running program. The key idea in our approach is to find the optimal trade-off between time and memory resources by letting a learning decision process decide when to garbage collect [2, 6, 8, 9, 13, 14].

4 Reinforcement Learning

Reinforcement learning methods solve a class of problems known as Markov Decision Processes (MDPs). If it is possible to formulate the problem at hand as an MDP, reinforcement learning provides a suitable approach to its solution [3, 4, 5].

1. Environment → State (s_t) + Reward (r_t) → Decision process
2. Decision process → Action (a_t) → Environment
3. Environment → new State (s_{t+1}) + new Reward (r_{t+1})

Figure 1 The figure shows a model of a reinforcement learning system. First the decision process observes the current state and reward, then the decision process performs an action that affects the environment. Finally the environment returns the new state and the obtained reward.

Figure 1 depicts the interaction between an agent and its environment in a typical reinforcement learning setting. The agent perceives the current state of the environment by means of the state signal s_t, upon which it responds with a control action a_t.

More formally, a policy is a mapping from states to actions π: S×A → [0, 1], in which π(s, a) denotes the probability with which the agent chooses action a in state s. As a result of the action taken by the agent in the previous state, the environment transitions to a new state s_{t+1}. Depending on the new state and the previous action, the environment might pay a reward to the agent. The scalar reward signal indicates how well the agent is doing with respect to the task at hand. However, the reward for desirable actions might be delayed, leaving the agent with the temporal credit assignment problem of figuring out which actions lead to desirable states of high reward. The objective for the agent is to choose those actions that maximize the sum of future discounted rewards:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + …

The discount factor γ ∈ [0, 1] favors immediate rewards over equally large payoffs to be obtained in the future, similar to the notion of an interest rate in economics [1, 3, 5].
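To make the discounting concrete, here is a small worked example of ours (the report itself does not include one), using the value γ = 0.9 that the prototype in Section 6 later adopts: if the only non-zero reward is a penalty r_{t+3} = -10 received three steps in the future, it contributes γ³ · (-10) = 0.9³ · (-10) = -7.29 to R_t, whereas the same penalty received immediately contributes the full -10. The further away a penalty lies, the less it weighs on the current decision.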

Notice that usually the agent knows neither the state transition nor the reward function, nor do these functions need to be deterministic. In the general case the system behavior is determined by the transition probabilities P(s_{t+1} | s_t, a_t) for ending up in state s_{t+1} if the agent takes action a_t in state s_t, and the reward probabilities P(r_t | s_t, a_t) for obtaining reward r_t for the state-action pair s_t, a_t.

A state signal that succeeds in retaining all relevant information about the environment is said to have the Markov property. In other words, in an MDP the probability of the next state of the environment only depends on the current state and the action chosen by the agent, and does not depend on the previous history of the system [1, 3, 5].

A reinforcement learning task that satisfies the Markov property is an MDP. More formally: if t indicates the time step, s is the state of the environment, a is an action taken by the agent and r is a reward, then the environment and the task have the Markov property if and only if [5]:

Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t} = Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, …, r_1, s_0, a_0}

If it is possible to define a way of representing states such that all relevant information for making a decision is retained in the current state, the garbage collection problem becomes an MDP. Therefore, a prerequisite for being able to use reinforcement learning methods successfully is to find a way to represent states in a correct manner [1, 3, 5].

In theory, the agent is required to have complete information about the state of the environment in order to guarantee asymptotic convergence to the optimal solution. However, fast learning is often much more important than a guarantee of eventually optimal performance. In practice, many reinforcement learning schemes are still able to achieve good behavior in a reasonable amount of time even if the Markov property is violated [10].

Whereas dynamic programming requires a model of the environment for computing the optimal actions, reinforcement learning methods are model free and the agent obtains knowledge about its environment through interaction. The agent explores the environment in a trial-and-error fashion, observing the rewards obtained from taking various actions in different states. Based on this information the agent updates its beliefs about the environment and refines the policy that decides what action to take next [4, 5].

4.1 Temporal-Difference Learning

There are mainly four different approaches to solving Markov decision processes: Monte Carlo, temporal-difference, actor-critic and R-learning. For further discussion of these methods, see references [5, 6, 12, 15].

What distinguishes temporal-difference learning methods from the other methods is that they update their beliefs at each time step. In application environments where the memory allocation rate varies a lot over time, it is important to observe the amount of available memory at each time step. Hence temporal-difference learning seems to be well suited for solving the garbage collection problem [3, 5, 11, 15].

Temporal-difference learning is based on a value function, referred to as the Q-value function, which estimates the value of taking a certain action in a certain state. At each time step the algorithm performs an action and observes the new state and the obtained reward. Based on these observations, the algorithm updates its beliefs – the policy – and thereby, in theory, improves its behavior at each time step [3, 5, 11, 15].

There are mainly two different approaches when it comes to temporal-difference methods: Q-learning and SARSA (State, Action, Reward, new State, new Action). This project has investigated the SARSA approach, since it is an on-policy method. On-policy means updating the policy that is being followed, i.e. the policy improves while being used. Further issues regarding how to use this method are discussed below; a minimal code sketch of the update follows.
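The following sketch is ours, not taken from the report: a tabular SARSA(0) update of the kind described above, with hypothetical class and method names (SarsaAgent, selectAction, update). A real implementation inside a JVM would use the tile-coded state representation discussed in Section 4.3 rather than a plain table.

import java.util.Random;

// Minimal tabular SARSA(0) sketch; states and actions are small integers.
// All names are illustrative and only demonstrate the update rule.
class SarsaAgent {
    private final double[][] q;    // Q(s, a)
    private final double alpha;    // learning rate
    private final double gamma;    // discount factor
    private final double epsilon;  // exploration probability
    private final Random rnd = new Random();

    SarsaAgent(int numStates, int numActions, double alpha, double gamma, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    // Epsilon-greedy action selection (see Section 4.2).
    int selectAction(int state) {
        if (rnd.nextDouble() < epsilon) {
            return rnd.nextInt(q[state].length);           // explore
        }
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) best = a;    // exploit
        }
        return best;
    }

    // SARSA update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    void update(int s, int a, double r, int sNext, int aNext) {
        double tdError = r + gamma * q[sNext][aNext] - q[s][a];
        q[s][a] += alpha * tdError;
    }
}

Because SARSA is on-policy, the action aNext used in the update is the action the agent actually executes next, which is exactly the "policy improves while being used" property the report relies on.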
4.2 Exploring vs. Exploiting

In reinforcement learning problems the agent is confronted with a trade-off between exploration and exploitation. On the one hand, it should maximize its reward by always choosing the action a = argmax_a' Q(s, a') that has the highest Q-value in the current state s. However, it is also important to explore other actions in order to learn more about the environment. Each time the agent takes an action it faces two possible alternatives. One is to execute the action that, according to the current beliefs, has the highest Q-value. The other possibility is to explore a non-optimal action with a lower expected Q-value but higher uncertainty. Due to the probabilistic nature of the environment, an uncertain action of lower expected Q-value might ultimately turn out to be superior to the current best-known action. Obviously there is a risk that taking the sub-optimal action diminishes the overall reward. However, it still contributes to the knowledge about the environment, and therefore allows the learning program to take better actions with more certainty in the future [4, 5, 11, 12].

There are three common exploration strategies for choosing actions: the greedy algorithm, the ε-greedy algorithm and the soft-max algorithm. The greedy algorithm is not of interest here, since the garbage collection problem requires exploration. Both of the other two algorithms are well suited for the garbage collection problem; the ε-greedy algorithm is the choice we made.

The ε-greedy algorithm chooses the estimated best action most of the time, but with a small probability ε a random action is selected instead. The probability of choosing a random action is decreased over time, which satisfies both the need for exploration and the need for exploitation [1, 5].

4.3 Generalization

Another common problem is environments that have continuous, and consequently infinitely many, states. In this case it is not possible to store state-action values (Q-values) in a simple look-up table. A look-up table representation is only feasible when states and actions are discrete and few. Function approximation and generalization are solutions to this problem [3, 12].

Generalization is a way of handling continuous values of state features. As is the case for the garbage collection problem, generalization of the state is needed. Alternative approaches, other than generalization, for approximating the Q-value function are regression methods and neural networks [4, 6]. However, the approach used during this project was generalization.

There are mainly four approaches for generalizing states and actions: coarse coding, tile coding, radial basis functions and Kanerva coding. For further reading about these methods see references [3, 5, 6].

Coarse coding is a generalization method using a binary vector, where each index of the vector represents a feature of the state, either present (1) or absent (0). Tile coding is a form of coarse coding where the state features are grouped together in partitions of the state space. These partitions are called tilings, and each element of a partition is called a tile. The more tilings there are, the more states are affected by the reward achieved and share the knowledge obtained from a performed action. On the other hand, the system becomes exponentially more complex depending on how many tilings are used [3, 5].

Tile coding is particularly well suited for use on sequential digital computers and for efficient online learning, and is therefore used in this project [5].

5 State Features and Actions of the General Garbage Collection Problem

In the sections below, some state features, actions and underlying reward features that could be applied in a memory management system are presented. Discussions of how they may be represented are also provided.

5.1 Possible State Features

A problem in defining state features and rewards for a Markov decision process is the fact that the evolution of the state is, to a large extent, governed by the running application, as it determines which objects on the heap are no longer referenced and how much new memory is allocated. The garbage collector can only partially influence the amount of available memory, in that it reduces fragmentation of the heap and frees the memory occupied by dead objects. Therefore, it is often difficult to decide whether to blame the garbage collection strategy or the application itself for exceeding the available memory resources.

In the following paragraphs we present some suggestions for possible state features. Some state features might be difficult to calculate accurately at run time. For example, if the free memory were distributed across several lock-free caches, the number of free bytes would be hard to measure, or would at least take a prohibitively long time to measure correctly. We therefore have to assume that approximations of these parameters are still accurate enough to achieve reasonably good behavior.

A fragmentation factor that indicates what fraction of the heap is fragmented is of interest. Fragments are chunks of free memory that are too small (<2 kB) to belong to the free-list, from which new memory is allocated. As the heap becomes highly fragmented, garbage collection should be performed more frequently. This is desirable as it might reduce fragmentation by collecting dead objects adjacent to fragments. As a result, larger blocks of free memory may appear that can be reused for future memory allocation. In other words, garbage collection should be performed when the heap contains a large number of non-referenced, small blocks of free memory.

It is important to keep track of how much memory is available in the heap. Based on this information the reinforcement learning system is able to decide at which percentage of allocated memory it is most rewarding to perform a certain action, for instance to garbage collect.

If the rate at which the running program allocates memory can be determined, it would be possible to estimate at what point in time the application will run out of memory, and hence when to start garbage collection at the latest.

If it is possible to estimate how much processor time is actually spent on executing instructions of the running program, this factor could be used as a state feature. However, when using a concurrent garbage collector it is very difficult to measure the exact time spent on garbage collection versus the time used by the running application. Hence, this measurement will either be impossible to obtain or the information will be highly inaccurate.

The average size of newly allocated objects might provide valuable information about the running application that can be utilized by the garbage collector. Another feature of the same category is the average age of newly allocated objects, if measurable. The amount of newly allocated objects is another possible feature.

5.2 State Representation

Each observable system parameter described in the previous section constitutes a feature of the current state. Tile coding, see Section 4.3, is used to map the continuous feature values to discrete states. Each tiling partitions the domain of a continuous feature into tiles, where each tile corresponds to an interval of the continuous feature.

The entire state is represented by a string of bits, with one bit per tile. If the continuous state value falls within the interval that constitutes the tile, the corresponding bit is set to 'one', otherwise it is set to 'zero':

• The tile contains the current state feature value → 1
• The tile does not contain the current state feature value → 0

For example, a particular state is represented by a vector s = [1, 1, 0, …, 1, 0, 1], where each bit denotes the absence or presence of the state feature value in the corresponding tile.
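To make the bit-string representation concrete, here is a minimal sketch of our own (the class name and the interval boundaries are illustrative assumptions, not the prototype's actual tiling) of how one tiling could map a continuous feature value to a binary vector with one bit per tile:

// Hypothetical sketch of the bit-string state representation described above:
// one tiling partitions a continuous feature (e.g. free memory in percent)
// into intervals, producing one bit per tile.
class TileCoder {
    private final double[] boundaries;   // upper bounds of each tile interval

    TileCoder(double[] boundaries) {
        this.boundaries = boundaries;
    }

    // Returns a binary vector with exactly one bit set: the tile whose
    // interval contains the current feature value.
    int[] encode(double featureValue) {
        int[] bits = new int[boundaries.length];
        for (int i = 0; i < boundaries.length; i++) {
            if (featureValue <= boundaries[i]) {
                bits[i] = 1;
                return bits;
            }
        }
        bits[boundaries.length - 1] = 1;  // clamp values beyond the last interval
        return bits;
    }
}

// Example with invented boundaries:
// new TileCoder(new double[] {10, 20, 40, 100}).encode(15.0) yields [0, 1, 0, 0].

Several such tilings, one per feature and combination of features, are concatenated to form the full state vector s described above.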
5.3 Possible Rewards

To evaluate the current performance of the system, quantifiable values for the goals of the garbage collector are desired. The objectives of a garbage collector (see references [6, 9, 13, 14]) concern maximizing end-to-end performance and minimizing long interruptions of the running application caused by garbage collection. These goals provide the basis for defining appropriate scalar rewards and penalties.

A necessity when deciding the reward function is to decide which states or events are good and which are bad. In a garbage-collecting environment there are a lot of situations that are neither bad nor good per se but might ultimately lead to a bad (or good) situation. This dynamic aspect adds another level of complexity to the environment. It is in the nature of the problem that garbage collection always intrudes on the processing time of the running program and always constitutes an extra cost. Therefore, no positive rewards are given; all reinforcement signals are penalties, either for consuming computational resources for garbage collection or, even worse, for running out of memory. The objective of the learning process is to minimize the discounted accumulated penalties incurred over time.

A fundamental rule for imposing penalties is to punish all activities that consume processing time from the running program. For instance, a punishment is imposed every time the system performs a garbage collection. An alternative is to impose a penalty proportional to the fraction of time spent on garbage collection compared to the total run time of the program.

Another penalty criterion is to punish the system when the average pause time exceeds an upper limit that is considered still tolerable by the user. It is also important to assure that the number of pauses does not exceed the maximum allowed number of pauses. If the average pause time is high and the number of pauses is low, the situation may be balanced by taking less time-consuming actions more frequently. If both are high, a penalty might be in order.

When using a concurrent collector, a severe penalty must be imposed if the running program runs out of memory and as a result has to wait until a garbage collection is completed, since this is the worst possible situation to arise.

At first, it seems like a good idea to impose a penalty proportional to the amount of occupied memory. However, even if the memory is occupied up to 99 % this does not cause a problem, as long as the running application terminates without exceeding the available memory resources. In fact, this is the most desirable case, namely that the program terminates requiring no garbage collection but still never runs out of memory. Therefore, directly imposing penalties for the occupation of memory is not a good idea.

The ratio of freed memory after a completed garbage collection compared to the amount of allocated memory in the heap prior to garbage collection provides another possible performance metric. This parameter gives an estimate of how much memory has been freed. If the amount is large there is nothing to worry about, as illustrated to the left in Figure 2. If the amount of freed memory is low and the size of the free-list is low as well, problems may occur and hence the garbage collector should be penalized. The latter situation, illustrated to the right in Figure 2, might occur if a running program has a lot of long-living objects and runs for a long time, so that most of the heap is occupied.

Figure 2 A good situation with a high freeing rate is illustrated to the left. A worse situation is illustrated to the right, where there is little memory left in the heap although a garbage collection has just occurred. This last situation may cause problems.

When using compacting garbage collectors, it is interesting to observe the success rate of memory allocation in the most fragmented area of the heap. The actual amount of new memory allocated in the fragmented area of the heap is compared to the theoretical limit of available memory in the case of no fragmentation at all. An illustration of some possible situations is shown in Figure 3. It is desirable that 100 % of the newly allocated memory is allocated in the most fragmented area of the heap, in order to reduce fragmentation. A penalty is imposed that is inversely proportional to the ratio between the actually allocated memory and its theoretical limit in the best possible case.

Figure 3 Allocation of objects A (size 2), B (size 1) and C (size 3) in a fragmented and a non-fragmented heap (1). In the upper right (2), half of the newly allocated memory (3/6 = 50 %) was successfully allocated in the fragmented heap, while 5/6 = 83 % could theoretically have been. In the lower left (3) the same percentage (3/6 = 50 %) was successfully allocated in the fragmented heap, although space for all newly allocated objects (6/6 = 100 %) exists in the fragmented area. In the lower right (4) all newly allocated objects (6/6 = 100 %) were successfully allocated in the fragmented heap.
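The report only states that the penalty should be inversely proportional to this success ratio; the following sketch is one possible concretization of ours, with an invented scaling constant and method names:

// Sketch of a penalty based on how much of the newly allocated memory ended up
// in the fragmented part of the heap, compared to what was theoretically possible.
final class FragmentationPenalty {
    private static final double SCALE = 10.0;   // illustrative scaling constant

    // bytesInFragmentedArea: new memory actually placed in the fragmented area
    // theoreticalLimit: what could have been placed there with perfect packing
    static double penalty(long bytesInFragmentedArea, long theoreticalLimit) {
        if (theoreticalLimit == 0) {
            return 0.0;  // nothing could have been allocated there anyway
        }
        // Clamp the ratio to avoid an unbounded penalty when nothing fits.
        double successRatio =
            Math.max(0.05, (double) bytesInFragmentedArea / theoreticalLimit);
        // No penalty at 100 % success, growing as the ratio drops.
        return -SCALE * (1.0 / successRatio - 1.0);
    }
}

// For the situation in panel 2 of Figure 3, penalty(3, 5) uses a success ratio
// of 0.6 and yields roughly -6.7; panel 4 gives a ratio of 1.0 and no penalty.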

If the memory management relies on global structures that need a lock to be accessed, taking the lock ought to be punished. This might be the case for memory free-lists, caches etc.

The more time a compacting garbage collector spends on iterating over the free-list (for an explanation see references [13, 14]), the more it should be penalized. A long garbage collection cycle is an indicator of a fragmented heap. High fragmentation in itself is not necessarily bad, but the iteration consumes time otherwise available to the running application, which is why such a situation should be punished.

When it comes to compacting garbage collectors, a measurement of the effectiveness of a compaction provides a possible basis for assigning a reward or a penalty. If there was no need for compacting, the section in question must have been non-fragmented. Accordingly, such a situation should be assigned a reward.

There is one possible desirable configuration to which a reward, rather than a penalty, should be assigned, namely when a compacting collector frees large, connected chunks of memory. The opposite, where the garbage collector frees only a small amount of memory while the running program is still allocating objects, could possibly be punished in a linear way, like some of the other reward situations described above.

5.4 Possible Actions

Whether or not to invoke garbage collection at a certain point in time is the most important decision for the garbage collection strategy to take. Therefore, the set of possible actions taken by the prototype discussed in a later section is reduced to this binary decision.

When the free memory is not large enough and the garbage collection fails to free a sufficiently large amount of memory, a possible remedy is to increase the size of the heap. It is also of interest to be able to decrease the heap size, if a large area of the heap never becomes allocated. Deciding whether to increase or decrease the heap size can constitute an action. If a change is needed, a complementary decision is to decide the new size of the heap.

To save heap space, or rather to use the available heap more effectively, a decision whether or not to compact the heap could also be of interest. In addition, the action could specify how much and which section of the heap to compact.

To handle synchronization between allocating threads of the running program, a technique using lock-free Thread Local Areas (TLAs) is usually employed. Each allocating thread is allowed to allocate memory within only one TLA at a time and, vice versa, only one thread is permitted to allocate memory in a particular TLA. The garbage collection strategy could determine the size of each TLA and how to distribute the TLAs between the threads.

When allocating large objects, a Large Object Space (LOS) is often used, especially in cases where generational garbage collectors are considered, in order to avoid moving large objects. Deciding the size of the LOS, and how large an object has to be in order to be considered a large object, are additional issues for the reinforcement learning decision process to consider.

To reduce garbage collection time, smaller free blocks might not be added to a free-list during a sweep phase. The memory block size is the minimum size a free memory block must have in order to be added to the free-list. Different applications may have different needs with respect to this parameter.

How many generations are optimal for a generational garbage collector? With the current implementation it is only possible to decide, prior to starting the garbage collector, whether it operates with one or two generations. It might be possible, even today, to reduce the number of generations from two to one, but not to increase them during run time. For future generational garbage collectors it would be of interest to let the system vary the size of the different generations. If a promotion rate is available, this is a factor that might be interesting for the system to vary as well.

If the garbage collector uses an incremental approach, deciding the size of the heap area that is collected at a time might be an interesting aspect to consider. The same applies to deciding whether to use the concurrent approach, in conjunction with the factors of how many garbage collection steps to perform at a time and how long the system should pre-clean (for an explanation see reference [14]).

6 The Prototype

The state features used in the prototype are the current amount of available memory, s1, and the change in available memory, s2, calculated as the difference between s1 at the previous time step and s1 at the current time step.
There is only one binary decision to make, namely whether to garbage collect or not. Hence, the action set contains only two actions {0, 1}, where 1 represents performing a garbage collection and 0 represents not performing a garbage collection.

The tile coding representation of the state in the prototype was chosen to be one 10x2 tiling in the case where only s1 was used. In the case where both state features were used, the tile coding representation was chosen to be one 10x7x2 tiling, one 10-tiling, one 7-tiling and one 10x7 tiling. A non-uniform tiling was chosen, in which the tile resolution is increased for states of low available memory and a coarser resolution is used for states in which memory occupancy is still low. The tiles for feature s1 correspond to the intervals [0, 4], [4, 8], [8, 10], [10, 12], [12, 14], [14, 16], [16, 18], [18, 20], [22, 26] and [30, 100]. The tiles for feature s2 are at the resolution [<0], [0-2], [3-4], [5-6], [7-8], [9-10] and [>10].

The reward function of the prototype imposes a penalty (-10) for performing a garbage collection. The penalty for running out of memory is set to -500. It is difficult to specify the quantitative trade-off between using time for garbage collection and running out of memory. In principle the latter situation should be avoided at all costs, but a too large penalty in that case might bias the decision process towards too frequent garbage collection. Running out of memory is particularly undesirable because a concurrent garbage collector is used: a concurrent garbage collector must stop all threads if the system runs out of memory, and avoiding such stops is the major purpose of using a concurrent garbage collector in the first place.
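The reward constants above are simple enough to state directly in code. The following fragment is only a sketch with invented names; it mirrors the two penalties the prototype uses:

// Illustrative reward function for the prototype: all rewards are penalties.
final class PrototypeReward {
    static final double GC_PENALTY = -10.0;    // performing a garbage collection
    static final double OOM_PENALTY = -500.0;  // running out of memory

    static double reward(boolean performedGc, boolean ranOutOfMemory) {
        double r = 0.0;
        if (performedGc) {
            r += GC_PENALTY;
        }
        if (ranOutOfMemory) {
            r += OOM_PENALTY;
        }
        return r;
    }
}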
The probability p that determines whether to pick the action with the highest Q-value or a random action for exploration evolves over time according to the formula:

p = p0 * e^(-t / C)

where p0 = 0.5 and C = 5000 in the prototype, which means that random actions are chosen with decreasing probability until approximately 25000 time steps have elapsed. A time step t corresponds to about 50 ms of real time between two decisions of the reinforcement learning system.

The learning rate α decreases over time according to the formula stated below:

α = α0 * e^(-t / D)

where α0 = 0.1 and D = 30000 in the prototype. The discount factor γ is set to 0.9.
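As a minimal illustration of these schedules (again a sketch of ours, with invented names), the exploration probability and learning rate at a given time step follow directly from the formulas above:

// Exponential decay schedules for the prototype's learning parameters.
final class DecaySchedules {
    static final double P0 = 0.5;      // initial exploration probability
    static final double C = 5000.0;    // exploration decay constant
    static final double ALPHA0 = 0.1;  // initial learning rate
    static final double D = 30000.0;   // learning-rate decay constant
    static final double GAMMA = 0.9;   // discount factor

    static double explorationProbability(long t) {
        return P0 * Math.exp(-t / C);       // p = p0 * e^(-t/C)
    }

    static double learningRate(long t) {
        return ALPHA0 * Math.exp(-t / D);   // alpha = alpha0 * e^(-t/D)
    }
}

// For example, explorationProbability(25000) is about 0.0034, so after roughly
// 25000 time steps random actions have become rare, as described above.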

The test application used for evaluation is designed to exhibit a very dynamic memory allocation behavior. The memory allocation rate of the test application alternates randomly between different behavior cycles. A behavior cycle consists of either 10000 or 20000 iterations of either a low or a high memory allocation rate. The time performance of the RLS is measured during a behavior cycle as the number of milliseconds required to complete the cycle.

6.1 Interesting Comparative Measurements

The performance of the garbage collector in JRockit ought to be compared to the performance when using the reinforcement learning system for deciding when to garbage collect, not only in terms of time performance but also in terms of the reward function. The reward function is based on the throughput and the latency of a garbage collector, and the underlying features of the reward function are hence suitable for extracting comparable results for the two systems.

However, learning a proper garbage collection policy should take a reasonable amount of time, as otherwise the reinforcement learning system would be of little practical value. The first step of an evaluation of the RLS is to verify that learning and adaptation actually occur at all, namely that the system improves its performance over time. The learning success is measured by the average reward per time step. Analyzing the time evolution of the Q-function provides additional insight into the learning progress.

7 Results

One of the main objectives of this project is the identification of suitable state features, underlying reward features and action features for the dynamic garbage-collection learning problem. An additional objective is the implementation of a simple prototype and the evaluation of its performance on a restricted set of benchmarks, in order to investigate whether the proposed machine learning approach is feasible in practice.

This section compares the performance of a conventional JVM with a JVM using reinforcement learning for making the decision of when to garbage collect. The JVM using reinforcement learning is referred to as the RLS (Reinforcement Learning System) and the conventional JVM is JRockit.

Since JRockit is optimized for environments in which the allocation behavior changes slowly, environments where the allocation behavior changes more rapidly might cause degraded performance of JRockit. In these environments it is of special interest to investigate whether an adaptive system, such as an RLS, is able to perform equally well or even better than JRockit.

Figure 4 shows the results of using the RLS and JRockit for the test application described in Section 6. Due to the random distribution of behavior cycles, a direct cycle-to-cycle comparison of these two different runs is not meaningful. Instead, the accumulated time performances, illustrated in Figure 4, are used for comparison. As may be seen in the lower chart, the RLS performs better than JRockit in this dynamic environment. This confirms the hypothesis that an RLS is able to outperform an ordinary JVM in a dynamic environment.

Figure 4 The figure shows the accumulated time performance of the RLS and JRockit when running the application with behavior cycles of random duration and memory allocation rate. The upper chart shows the performances during the first 20 behavior cycles and the lower chart shows the performances during 20 behavior cycles after approximately 50000 time steps. Notice that lower values correspond to better performance.

Figure 5 illustrates the accumulated penalty for the RLS compared to JRockit. In the beginning the RLS runs out of memory a few times, as shown in the graph labeled "penalty RLS for running out of memory", but after about 15000 time steps it learns to avoid running out of memory. The lower chart shows the current average penalty of the RLS and JRockit. After about 20000 time steps the RLS has adapted its policy and achieves the same performance as JRockit. The results show that the RLS in principle is able to learn a policy that can compete with the performance of JRockit. The test session only takes about an hour, which is a reasonable learning time for offline learning (i.e. following one policy while updating another) of long-running applications. Also, no attempt has been made within this project to optimize the parameters of the RLS, such as exploration and learning rate, in order to minimize learning time.

Figure 5 The upper chart illustrates the accumulated penalty for the RLS compared to JRockit. The lower chart illustrates the average penalty as a function of time. For the RLS, the penalty due to garbage collection and the penalty due to running out of memory are shown separately.

The accumulated penalty over the time period between time step 30000 and 50000, after the RLS completed learning, has been calculated to -8400. The corresponding accumulated penalty for JRockit for the same period of time was calculated to -8550. This shows that the results of the RLS are comparable to the results of JRockit. The values verify the results presented above: the RLS performs equally well or even slightly better than JRockit in an intentionally dynamic environment.

In the following we analyze the learning process in more detail by looking at the time evolution of the Q-function for the single-feature case that only considers the amount of free memory.

The upper chart in Figure 6 compares the Q-function for both actions, namely to garbage collect or not to garbage collect, after approximately 2500 time steps. Notice that the RLS always prefers the action with the higher Q-value. The probability p of choosing a random action is still very high, and garbage collection is randomly invoked frequently enough to prevent the system from running out of memory. On the other hand, the high frequency of random actions during the first 5000 time steps leads the system to avoid deliberate garbage collection actions altogether. In other words, it always favors not garbage collecting, in order to avoid the penalty of -10 units for the alternative action.

Initially, the system does not run out of memory, due to the high frequency of randomly performed garbage collections. The only thing the system has learned so far is that it is better not to garbage collect than to garbage collect. Notice that the system has not yet learned anything for states of low free memory, as those states have not yet occurred. The difference in Q-value between the two actions is -10, which corresponds exactly to the penalty for performing a garbage collection. This makes sense insofar as the successor state after performing a garbage collection is similar to the state prior to garbage collection, namely a state in which the amount of available memory is still high.

The middle chart in Figure 6 shows the Q-function after approximately 10000 time steps. The probability of choosing a random action has now decreased to the extent that the system actually runs out of memory. Once that happens the RLS incurs a large penalty, and thereby learns to deliberately take the alternative action, namely to garbage collect in states of low available memory.

The lower chart in Figure 6 illustrates the Q-function after approximately 50000 time steps. At this point the Q-values for the different states have already converged. Garbage collection is invoked once the amount of available memory becomes lower than approximately 12 %. This policy is optimal considering the limited state information available to the RLS, the particular test application and the specific reward function.

Figure 6 The figure shows the development of the state-action value function, the Q-function, over time. Each chart plots Q(s1) against s1 for the two actions "garbage collection" and "no garbage collection". The upper chart shows the Q-function after approximately 2500 time steps, the middle chart after approximately 10000 time steps and the lower chart after approximately 50000 time steps, after which it remains constant.

The performance comparison between the RLS and JRockit suggests further investigation of reinforcement learning for dynamic memory management. Given that this first version of the prototype only considers a single state feature, it would be interesting to investigate the performance of an RLS that takes additional, and possibly more complex, state features into consideration. Additional state features might enable the RLS to take more informed decisions and thereby achieve even better performance.

In Figure 7 the accumulated time performance of the RLS using one state feature (1F2T), the RLS using two state features (2F5T) and JRockit (JR) are compared. In the case of two state features, five tilings (instead of only two) were used in order to achieve better generalization across the higher-dimensional state space. In order to illustrate the effect of five tilings, the time performance of an RLS using two state features but only two tilings (2F2T) is also shown in the charts of Figure 7. The upper chart illustrates the performance of the four systems in the initial stage, during which the RLS is adapting its policy. The lower chart shows the performance after approximately 50000 decisions (time steps). The graphs show that the RLS using two state features and five tilings does not perform better than the RLS using only one state feature or JRockit. However, the system using five tilings is significantly better than the RLS using two state features and two tilings.

The main reason for the inferior behavior is probably that the new feature increases the number of states, and that converging to the correct Q-values and the optimal policy therefore requires more time. The decision boundary is more complex than in the case of only a single state feature. The number of states for which the RLS has to learn that it runs out of memory, if it does not perform a garbage collection, has increased, and thereby also the complexity of the learning task.

Figure 7 The figure shows the accumulated time performance of JRockit (JR) compared to the RLS using one state feature (1F2T) and two RLS configurations using two state features but different tilings (2F2T and 2F5T). The upper chart covers the first 40 behavior cycles and the lower chart covers behavior cycles after approximately 50000 time steps.

Another consequence of the increased number of states is that the system runs out of memory more often. To some extent Q-function approximation (i.e. tile coding, function approximation) provides a remedy to this problem. Further investigation regarding this aspect is needed; see the discussion in Section 8.

To provide some standard measurement results, the best RLS, i.e. the RLS using only one state feature, is compared to the JRockit version used in the previous test sessions with respect to SPECjbb2000 scores. In Figure 8 the results of a test session with full occupancy from the beginning are presented. As mentioned before, the RLS is learning until the 30000th time step (decision).

Figure 8 The figure illustrates the performance (score in ops/sec over time steps) of the RLS using one state feature compared to JRockit during a SPECjbb2000 session with full occupancy from the beginning.

The average performance scores of both systems are presented in Table 1. As may be observed, the use of the RLS for the decision of when to garbage collect improves the average performance by 2 %. That number already includes the learning period. If the learning period of the RLS is excluded (i.e. the score is measured after approximately 30000 decisions), the average improvement when using the RLS is 6 %.

Table 1 The table shows the average performance results of the RLS using one state feature and JRockit, when running SPECjbb2000 with full occupancy.

System            Average score (learning incl.)   Average score (learning excl.)
JRockit           22642.86                         23293.98
RLS               23093.08                         24775.43
Improvement (%)   1.99                             6.36
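As a quick consistency check (our calculation, not part of the original report), the improvement figures follow directly from the table values: (23093.08 - 22642.86) / 22642.86 ≈ 1.99 % including the learning period, and (24775.43 - 23293.98) / 23293.98 ≈ 6.36 % with the learning period excluded.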

JRockit considers only one parameter for real applications depends on the environment and thethe decision of when to garbage collect. The requirements on the JVM.performance of the RLS was not improved using twoFrom the results in the case of two state features, itstate features, likely due to the enlarged state space.becomes clear that using multiple state featuresThe question remains, whether the performance of thepotentially results in more complex decision surfacesRLS improves if additional state information isthan simple standard heuristics. Observations have alsoavailable and the time for exploration is increased. Thebeen made that there exists an evident trade-offpotential strength of the RLS might reveal itself betterbetween using more state features, in order to makeif the decision is based on more state features thanmore optimal decisions, and the increased time requiredJRockit uses currently.for learning due to an enlarged state space.Another important aspect is online vs. offlineFrom the above results one can learn that the use of aperformance. How much learning can be afforded, orreinforcement learning system is particularly useful ifshall only online-performance be considered? That ofan application has a complex dynamic memorycourse is also a design issue for JRockit, which reliesallocation behavior, which is why a dynamic garbageon a more precise definition of the concrete objectivescollector was proposed in the first place. It isand requirements of a dynamic Java Virtual Machine.noteworthy to observe that machine learning through anadaptive and optimizing decision process can replace ahuman designed heuristic such as JRockit that operateswith a dynamic threshold.This article is an excerpt of the project report 12. Precup, D., Sutton, R. S. and Dasgupta, S.Reinforcement Learning for a Dynamic JVM [6], which (2001). Off-policy temporal-difference learning withmay be obtained by contacting the author at: function approximation. School of computer science,eva.andreasson@appeal.se. McGill University, Montreal, Quebec, Canada andAT & T Shannon laboratory, New Jersey, USA.10 References13. Printezis, T. (2001). Hot-swapping between aLiterature mark&sweep and a mark&compact garbagecollector in a generational environment. Department1. Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-of Computing Science, University of Glasgow,dynamic programming. Athena Scientific, Belmont,Glasgow, Scotland, UK.Massachusetts, USA.14. Printezis, T.; Detlefs, D. (1998). A2. Jones, R. and Lins, R. (1996). Garbage collection –generational mostly-concurrent garbage collector.algorithms for automatic dynamic memoryDepartment of Computing Science, University ofmanagement. John Wiley & Sons Ltd., Chichester,Glasgow, Glasgow, Scotland, UK; Sun MicrosystemsEngland, UK.Laboratories East, Massachusetts, USA.3. Mitchell, T. M. (1997). Machine learning. McGraw15. Tsitsiklis, J. N. and Van Roy, B. (1997). AnHill, USA.analysis of temporal-difference learning with4. Russell, S. J. and Norvig, P. (1995). Artificial function approximation. Laboratory for informationand decision systems, MIT, Cambridge,intelligence – a modern approach. Prentice-Hall,Inc., Englewood Cliffs, New Jersey, USA. Massachusetts, USA.5. Sutton, R. S. and Barto, A. G. (1998). Reinforcementlearning – an introduction. MIT Press, Cambridge,Massachusetts, USA.Papers6. Andreasson, E. (2002). Reinforcement Learning for adynamic JVM. KTH/Appeal Virtual Machines,Stockholm, Sweden.7. Brecht, T., Arjomandi, E., Li, C. and Pham, H.(2001). 
Controlling garbage collection and heapgrowth to reduce the execution time of javaapplications. ACM Conference, OOPSLA, Tampa,Florida, USA.8. Flood, C. H. and Detlefs, D.; Shavit, N.; Zhang, X.(2001). Parallel garbage collection for sharedmemory multiprocessors. Sun MicrosystemsLaboratories, USA; Tel-Aviv University, Israel;Harvard University, USA.9. Lindholm, D. and Joelson, M. (2001). Garbagecollectors in JRockit 2.2. Appeal Virtual Machines,Stockholm, Sweden. Confidential.10. Pack Kaelbling, L.; Littman, M. L. and Moore,A. W. (1996). Reinforcement Learning: A Survey.Journal of Artificial Intelligence Research, Volume 4.11. Pérez-Uribe, A. and Sanchez, E. (1999). Acomparison of reinforcement learning with eligibilitytraces and integrated learning, planning andreacting. Concurrent Systems Engineering Series,Vol. 54, IOS Press, Amsterdam.