shanir@cs.tau.ac.il

In software transactional memory (STM) systems, it is useful to isolate a memory region accessed by one thread from all others, so that it can then operate on it "privately", that is, without the instrumentation overhead of inter-transactional synchronization. Allowing transactions to implicitly privatize memory is a source of major performance degradation in state-of-the-art STMs. The alternative, to explicitly declare and guarantee privacy only when needed, has been argued to be too tricky to be useful for general programming.

This paper proposes private transactions, a simple intermediate that combines the ease of use of implicit privatization with the efficiency that can be obtained from explicitly knowing which regions are private.

We present a new scalable quiescing algorithm for implicit privatization using private transactions, applicable to virtually any STM algorithm, including the best performing TL2/LSA-style STMs. The new algorithm delivers virtually unhindered performance at all privatization levels when private transactions involve work, and even in the extreme case of empty private transactions, allows for a scalable "pay as you go" privatization overhead that depends on the privatization level.

1. Introduction

One goal of transactional memory algorithms is to allow programmers to use transactions to simplify the parallelization of existing algorithms. A

common and useful programming pattern is to isolate a memory segment accessed by some thread, with the intent of making it inaccessible to other threads. This "privatizes" the memory segment, allowing the owner access to it without having to use the costly transactional protocol (for example, a transaction could unlink a node from a transactionally maintained concurrent list in order to operate on it or to free the memory for reallocation).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

TRANSACT '10 Date, City. Copyright 2010 ACM [to be supplied]... $10.00

Today, many of the best performing lock-based software transactional memory (STM) algorithms [4-6] use a variation of the TL2/LSA [4, 11] style global-clock algorithm using invisible reads. By invisible reads, we mean that the STM does not know which, or even how many, readers are accessing a given memory location. Not having to track readers is the key to many efficient STMs, but it is also a problem if one wishes to allow privatization.

Allowing transactions to implicitly privatize memory is a source of major performance degradation in such STMs. (One should note that STMs that use centralized data structures, such as RingSTM [14] or compiler-assisted coarse-grained locking schemes [10], can provide implicit privatization without the need for explicit visible readers.) The alternative, to explicitly declare and guarantee privacy only when needed, has been argued to be too tricky to be useful for general programming.

Why is guaranteeing implicit privatization such a problem? Consider a transaction that has just privatized a memory segment. Even though the segment cannot be accessed by any other transaction (executing on the same or another processor) after the transaction commits, prior to the commit, latent transactional loads and stores might be pending. These latent loads and stores, executed by transactions that accessed the segment before it was isolated, can still read from and write into the shared memory segment that was intended to be isolated. This unexpected behavior is known as the "privatization problem." It results in unexpected changes to the contents of the isolated shared memory, which may have been reallocated and which (although resident in shared memory) is intended to be outside of the transactionally shared data region. Other unexpected, generally asynchronous, behaviors can also occur.

For example, consider the scenario in Figure 1: an invisible-read based transaction by a thread P removes a node from a linked list. Thus, once the transaction completes, the node will no longer be reachable by other threads and P will be able to operate on it privately. However, before P completes its transaction, another transaction Q reads the pointer, and is poised to read a value from the node. P has no way of detecting Q. This is because Q reads the pointer invisibly, and will not see any trace of P touching the location in the node since P is operating on it non-transactionally. As a result, even though Q is doomed to fail (once it revalidates the locations it read and detects the pointer has changed), in the interim it can perform illegal operations, such as the divide-by-0 error shown in the figure. This is an example of a privatization problem that one must overcome.

Figure 1. Privatization Pathology Example

One solution to the privatization problem is to establish programming constraints against concurrent transactional and non-transactional access to the same set of shared memory locations. However, this is a solution we would like to avoid.

Another solution is to add privatization capabilities to a transactional memory. The transactional memory can employ "explicit privatization," where the programmer explicitly designates regions passing out of transactional use to be quiesced, waiting for any pending transactional loads and stores to complete before the memory is allowed to be accessed non-transactionally. Programming explicit segment quiescence is complex and error prone. For example, it is insufficient for a transaction to explicitly privatize a segment from the transactionally shared data region before modifying that segment.

Alternately, transactions can employ "implicit privatization," where the STM automatically manages the memory accessibility/lifecycle issues. A paper by Cheng et al. [15] describes an implicit privatization mechanism that quiesces threads instead of shared memory regions, potentially impacting overall scalability. As we said, the problem with implicit privatization techniques to date is that they hinder the scalability of most of the best performing STMs.

The current state of the art is thus that there exist several highly scalable STM algorithms that operate well without providing privatization, but do not scale well when implicit privatization capabilities are added (see the TL2-IP algorithm in [3] and see [1, 8, 9, 13]). This is the situation we wish to rectify.

This paper proposes private transactions, a simple intermediate approach that combines the ease of use of implicit privatization with the efficiency that can be obtained from explicitly knowing which regions are private. The idea behind private transactions is simple. The user will use implicit privatization to privatize memory segments just as before, but will declare, using a special keyword or keywords, which code segments he/she expects to be executed privately.

From the programmer's point of view, a private transaction is thus a declaration of the code elements that are expected to be private: it is the programmer's responsibility to make sure that the selected locations within the private transaction are indeed not accessible to successful transactions. It is the STM's responsibility to make sure that unsuccessful transactions do not violate the transactional barrier and access these privatized regions.

We believe private transactions will not add an extra burden to programmers beyond that of implicit privatization, because the programmer must in any case know what he expects to be private following the act of privatization! (This is definitely true for newly written code, and for legacy code in which the programmer is not performing guesswork while, say, replacing locks with transactions.) Notice that there is no limitation on the code that can be called within a private transaction; in particular, one can call external libraries that know nothing about the transactions executing in the system.

What do we gain from the private transaction declaration? Our gain is twofold. We remain within the transactional model (i.e. the private transaction is a transaction, not a "barrier" whose semantics and rules of use with respect to other transactions are unclear), and we can algorithmically guarantee efficient execution of the privatized code (it will run at the pace of un-instrumented code), placing only a limited computational burden on regular non-private transactions. In other words, unlike with the standard model of implicit privatization, the privatization overhead using private transactions has a pay-as-you-go nature: the less privatization, the less overhead.

An important contribution of our paper is a new scalable quiescing algorithm for implicit privatization using private transactions, applicable to virtually any STM algorithm, including the TL2/LSA-style STMs. We note that for those who do not wish to use the private transactions programming model, this quiescing algorithm can still be used at the end of privatizing transactions to guarantee efficient implicit privatization.

To show the power of the new quiescing technique, we apply it to the latest version of the TL2 STM. We then compare our new TL2P algorithm, that is, TL2 with private transaction capability, to TL2-IP, that is, TL2 with a known implicit privatization mechanism based on shared reader counters.

As we show, the new algorithm is highly scalable. In a realistic situation in which private transactions include work (see Figure 5), it delivers virtually unhindered performance. In a less realistic, trying benchmark in which private transactions do not include work, it delivers great performance at low privatization levels and, unlike former techniques as exemplified by TL2-IP, remains scalable (though not as efficient as TL2) even with 100% privatization. We believe it can be applied to many existing STMs, allowing them to maintain scalability while providing low overhead privatization capabilities.

Interestingly, non-transactional data structures, such as those in the Java concurrency package, suffer from privatization issues. For example, a record removed from a red-black tree cannot be modified without worrying that some other thread is concurrently reading it. Our new private transaction mechanism offers a scalable way to add privatization to such structures.

2. Private Transactions

A private transaction is a transaction accessing memory locations that cannot be accessed by any other successful transaction.

The idea behind private transactions is simple. The programmer, using a regular transaction, privatizes certain memory segments by making them inaccessible to any thread that starts executing after the completion of this transaction. The programmer also declares, using the special private transaction keyword or keywords, which code segments he/she expects will be executed privately.

The private transaction is thus a declaration of the code elements that are expected to be private after regular successful transactions have privatized them. It is the STM's responsibility to make sure that unsuccessful transactions do not violate the regular transactional barrier.

Thus, in the classical linked list example, a programmer will use a regular transaction to privatize a linked list node, and then place all code accessing this node within the private transaction, knowing it is no longer accessible. If he/she correctly privatized using a regular transaction, the private transaction semantics will be guaranteed; otherwise, as with any buggy program, all bets are off.
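The linked list pattern above can be sketched as follows. This is a minimal single-threaded sketch, not the paper's implementation: the TX_START/TX_END and TXP_START/TXP_END macros (named after the labels in Figure 2) are hypothetical and stubbed out as no-ops, and node_t is an assumed list type.

```c
#include <stddef.h>

/* Hypothetical STM entry points, stubbed as no-ops for illustration.
 * A real STM would instrument loads/stores inside TX_START/TX_END and
 * run the quiescing wait barrier at TXP_START. */
#define TX_START()   /* begin a regular transaction */
#define TX_END()     /* commit it */
#define TXP_START()  /* begin a private transaction: wait barrier */
#define TXP_END()    /* end the private transaction */

typedef struct node { int key; int value; struct node *next; } node_t;

/* Privatize: transactionally unlink the node with the given key.
 * Once the transaction commits, the node is unreachable to any
 * transaction that starts later. */
static node_t *list_remove(node_t **head, int key) {
    node_t *prev = NULL, *cur;
    TX_START();
    for (cur = *head; cur != NULL; prev = cur, cur = cur->next) {
        if (cur->key == key) {
            if (prev) prev->next = cur->next; else *head = cur->next;
            break;
        }
    }
    TX_END();
    return cur;  /* NULL if the key was not found */
}
```

After list_remove returns, the caller wraps its uninstrumented accesses to the node in TXP_START()/TXP_END(), exactly the division of responsibility described in the text.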

We believe private transactions will therefore not add an extra burden to programmers beyond that of implicit privatization, because the programmer must in any case know what he expects to be private following the act of privatization! Notice that there is no limitation on the code that can be called within a private transaction; in particular, one can call external libraries that know nothing about the transactions executing in the system.

3. Implementing Private Transactions

Our privatization technique can be added to any existing STM without changing it. The main idea, which we will call a quiescing barrier, is well known: track in a low overhead fashion when threads start transactions and when they end them. Using this tracking data, privatization can be provided on demand by waiting for all active transactions to complete before continuing with the execution of a private transaction. However, past attempts to make this type of algorithm scale failed because the mechanisms used to implement the quiescing barrier incurred too large an overhead, and this overhead was exacerbated by the requirement to privatize 100% of the transactions: there was no declaration of when it was actually required.

Figure 2. Two threads execute transactions. We see the dynamic array used to track quiescing information and bars tracking the execution phases of the two threads. Upon start and finish, the threads update the dynamic array slot associated with each one of them. When Thread 1 executes a private transaction, it will execute a wait barrier, waiting for Thread 2 because it detects that Thread 2 is in the middle of a transaction execution.

Here we combine the use of private transactions with a very low overhead shared array to achieve a lightweight quiescing scheme.

The transactional tracking mechanism, depicted in Figure 2, is implemented as follows. The quiescing barrier needs to know about the threads that are transactionally active. We thus assign every thread a slot in an array which the barrier scans. For this to be efficient, we use an array proportional to the maximal concurrency level, and use a leasing mechanism [7] together with a dynamic resizing capability. We will explain shortly how this is done. A thread will indicate, in its array slot, whether it is running an active transaction, and will add an increasing per-thread transaction number. The respective fields are the IsRun boolean flag, indicating whether the thread is executing a transaction, and TxCount, a byte-sized counter which is incremented upon every transaction start.

During a transaction's start:

1. Map Thread Slot: The thread id is mapped to an index inside the array that the barrier scans. An explanation will follow shortly.

2. Increase Thread's Counter: The TxCount counter is increased by one to indicate that a new transaction has started.

3. Update Run Status: The transactional active status flag IsRun is set to TRUE.

4. Execute a Memory Barrier: A write-read memory barrier instruction is executed to ensure that the other threads see that the current thread is active.

At the transaction's end:

1. Update the Run Status: The status flag IsRun is set to FALSE. (No need for a memory barrier.)

As can be seen, the operations by any given thread amount, in the common case, to a couple of load and store instructions and a memory barrier.
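The per-transaction slot updates above can be sketched with C11 atomics. This is a hedged sketch rather than the paper's code: the slot type and function names are assumptions, and the write-read barrier is rendered as a sequentially consistent fence.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One slot in the scanned array; field names follow the text. */
typedef struct {
    atomic_bool  is_run;    /* IsRun: thread is inside a transaction */
    atomic_uchar tx_count;  /* TxCount: byte-sized per-thread counter */
} tx_slot_t;

void tx_start(tx_slot_t *slot) {
    /* steps 2-3: bump the counter and publish the run flag */
    atomic_fetch_add_explicit(&slot->tx_count, 1, memory_order_relaxed);
    atomic_store_explicit(&slot->is_run, true, memory_order_relaxed);
    /* step 4: write-read barrier so scanners see the updates */
    atomic_thread_fence(memory_order_seq_cst);
}

void tx_end(tx_slot_t *slot) {
    /* clear the run flag; no barrier needed per the text */
    atomic_store_explicit(&slot->is_run, false, memory_order_release);
}
```

The fast path is exactly what the text claims: two stores (plus the counter increment) and one fence per transaction.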

In more detail, the quiescing barrier records the current run status of the transactionally active threads and waits for the completion of each of them. It uses the following four arrays of MAX_THREADS length: th_tx_counts[] holds the stored counters of the transactionally active threads, th_ids[] holds the stored thread ids of the transactionally active threads, and th_checked[] indicates for which active threads the waiting condition still needs to be tracked. Also, we maintain a global variable CurNumberOfSlots that holds the current number of assigned slots in the global slots array th_slots[], in which every assigned slot has a pointer to an associated TxContext.

Using these variables, the quiescing barrier algorithm proceeds as follows:

1. Store the Active Threads' Status: For every transactionally active thread with id thread_id, store the thread's TxCount and thread_id to the waiting thread's context fields th_tx_counts[thread_id] and th_ids[thread_id]. Also set the waiting thread's th_checked entry at thread_id to FALSE, indicating that one needs to wait for this thread's progress.

2. Wait For Progress: For every tracked stored thread status which needs to be checked for progress, check if the thread is still running and its TxCount counter is equal to the stored one. If so, we need to wait, therefore start this step again. Return when all the threads we waited for have made progress.
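The two barrier steps can be sketched as below, assuming each active thread publishes an IsRun flag and a TxCount in a global slot array as described earlier. Names and sizes are illustrative, not the paper's.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

#define MAX_THREADS 64

typedef struct {
    atomic_bool  is_run;
    atomic_uchar tx_count;
} slot_t;

slot_t th_slots[MAX_THREADS];            /* one slot per leased thread */
atomic_int cur_number_of_slots;          /* CurNumberOfSlots */

/* Quiescing barrier: wait until every transaction that was active
 * when we started has completed (or moved on to a new transaction). */
void quiesce(void) {
    unsigned char th_tx_counts[MAX_THREADS];
    bool th_checked[MAX_THREADS];
    int n = atomic_load(&cur_number_of_slots);

    /* Step 1: snapshot the status of all active slots */
    for (int i = 0; i < n; i++) {
        if (atomic_load(&th_slots[i].is_run)) {
            th_tx_counts[i] = atomic_load(&th_slots[i].tx_count);
            th_checked[i] = false;   /* must wait for this one */
        } else {
            th_checked[i] = true;    /* already quiescent */
        }
    }

    /* Step 2: spin until every recorded thread has made progress */
    bool done;
    do {
        done = true;
        for (int i = 0; i < n; i++) {
            if (th_checked[i]) continue;
            if (atomic_load(&th_slots[i].is_run) &&
                atomic_load(&th_slots[i].tx_count) == th_tx_counts[i]) {
                done = false;        /* same transaction still running */
            } else {
                th_checked[i] = true;
            }
        }
        if (!done) sched_yield();
    } while (!done);
}
```

Note that a thread whose TxCount has changed is treated as having made progress even if it is running again: the transaction we snapshotted has necessarily completed, which is all privatization safety requires.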

The number of threads in an application can be much higher (in the thousands) than the actual concurrency level at any given moment, which is typically bounded by the number of hardware threads (in the tens). Therefore, we maintain an array of "leased" slots, proportional to the expected concurrency, to which the transactionally active threads are mapped. A lease [7] is a temporary ownership of a lock on a location, one that is either released by a thread or revoked if it is held beyond the specified duration. The allocated array itself can be much larger than the number of active threads, but we keep a counter indicating its actual active size at any point. The scanning thread, the one checking the barrier, need only traverse the array up to its actual active size.

When a thread starts a transaction, it checks if it owns its assigned array slot. If it does, the thread continues as usual. Otherwise, the thread picks a slot in the array. If the hashed entry is free then the thread takes it, and otherwise it tries to steal the slot. The thread will succeed in stealing the slot only if the slot's lease time or timeout, from the last active run of the thread which owns the slot, has passed. If the timeout has not expired, then the thread tries to acquire another slot. If no slot can be acquired, the thread adds a new slot to the end of the array and assigns itself to it.

In the array, every thread's context ctx has an is_slot_valid boolean variable indicating whether the thread's slot is valid (assigned and not stolen), and an is_slot_steal_in_process boolean variable indicating whether some thread is trying to steal the current thread's slot.

The Assert Thread Slot procedure works as follows:

1. Check for a Steal: If some other thread is in the process of stealing the current thread's slot, then spin on is_slot_steal_in_process.

2. Register a Slot: Compute the slot number of the thread by hashing its thread id; for example, slot_number = thread_id mod number_of_cores. If the slot with the computed number is free, try to acquire it using a Compare-And-Swap (CAS) to write into it a pointer to the thread's record ctx. If the slot is not free, or the CAS failed, then Try To Steal Slot. If the stealing failed, try to steal any other assigned slot. If all the stealing attempts failed, allocate a new slot in the dynamic array.
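The registration step can be sketched as a single CAS on an atomic slot pointer. This is a hedged sketch: the type and function names are assumptions, and the fallback paths (stealing, dynamic allocation) described in the text are elided.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NUM_CORES 8

typedef struct tx_context { int unused; } tx_context_t;  /* stand-in for
                                                            the per-thread
                                                            record ctx */

/* Global slot array: each entry is an atomic pointer to the owning
 * thread's context, NULL when free. */
tx_context_t *_Atomic th_slots[NUM_CORES];

/* Register a Slot: hash the thread id to an index and try to claim
 * the slot with CAS. Returns the slot index, or -1 when the slot is
 * occupied and the caller must fall back to Try To Steal Slot. */
int register_slot(long thread_id, tx_context_t *ctx) {
    int slot = (int)(thread_id % NUM_CORES);  /* the hash from the text */
    tx_context_t *expected = NULL;
    if (atomic_compare_exchange_strong(&th_slots[slot], &expected, ctx))
        return slot;     /* claimed the free slot */
    return -1;           /* occupied: caller tries to steal */
}
```

The CAS guarantees that two threads hashing to the same index cannot both claim a free slot; the loser takes the stealing path.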

The Try To Steal Slot procedure checks for the slot timeout and, if it has expired, acquires the slot. It works as follows:

1. Check For a Timeout: Check if the slot's owning thread's timeout from its last active transaction has expired. If not, return failure.

2. Check For a Steal: Check if some other thread is already trying to steal that slot by looking at the is_slot_steal_in_process value of the slot's owning thread. If it is, return failure. Otherwise, try to CAS the is_slot_steal_in_process field to TRUE. If the CAS fails, return failure, and otherwise continue to the next step.

3. Validate the Slot Status: Check that the slot owner has not changed, and check the timeout expiration again. If all checks are positive, continue to the next step, and otherwise return failure.

4. Steal the Slot: Assign the slot's value to be a pointer to the stealing thread's context and set its is_slot_valid to TRUE. In addition, set the previous owning thread's is_slot_valid and is_slot_steal_in_process flags to FALSE.
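The four stealing steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the context layout, the timestamp source, and the lease duration are all invented for the example, and field names follow the text.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-thread context; field names follow the text. */
typedef struct tx_context {
    atomic_bool is_slot_valid;
    atomic_bool is_slot_steal_in_process;
    atomic_long last_active;   /* timestamp of the last active transaction */
} tx_context_t;

#define LEASE_TIMEOUT 1000     /* assumed lease duration, arbitrary units */

static long now(void) { return (long)time(NULL) * 1000; }

/* Try To Steal Slot: returns true iff the slot (an atomic pointer to
 * the owner's context) was re-pointed at the thief's context. */
bool try_to_steal_slot(tx_context_t *_Atomic *slot, tx_context_t *thief) {
    tx_context_t *owner = atomic_load(slot);
    if (owner == NULL) return false;
    /* 1. the owner's lease must have expired */
    if (now() - atomic_load(&owner->last_active) < LEASE_TIMEOUT)
        return false;
    /* 2. claim the exclusive right to steal via CAS on the owner's flag */
    bool expected = false;
    if (!atomic_compare_exchange_strong(&owner->is_slot_steal_in_process,
                                        &expected, true))
        return false;
    /* 3. revalidate: owner unchanged and lease still expired */
    if (atomic_load(slot) != owner ||
        now() - atomic_load(&owner->last_active) < LEASE_TIMEOUT) {
        atomic_store(&owner->is_slot_steal_in_process, false);
        return false;
    }
    /* 4. steal: re-point the slot, mark thief valid and owner invalid */
    atomic_store(slot, thief);
    atomic_store(&thief->is_slot_valid, true);
    atomic_store(&owner->is_slot_valid, false);
    atomic_store(&owner->is_slot_steal_in_process, false);
    return true;
}
```

The CAS in step 2 ensures at most one thief per slot at a time, and the revalidation in step 3 closes the race in which the owner resumed (or the slot changed hands) between steps 1 and 2.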

Allocation of a new slot is done when all the steals have failed. The procedure works as follows:

1. Allocate a New Slot: Increment the CurNumberOfSlots global limit variable using a CAS.

2. Initialize the Allocated Slot: Assign the slot's value to be a pointer to the thread's context and set its is_slot_valid to TRUE.

In order to garbage collect the expired slots, we periodically execute a maintenance operation which checks for expired slots and frees them. This same operation can adjust CurNumberOfSlots according to the actual number of slots with unexpired leases.

The main purpose of this complex dynamic slot allocation algorithm is to avoid scanning an array proportional to the number of threads in the system, and instead scan only those threads which are transactionally active.

As we show, the complete mechanism is lightweight and delivers scalable performance.

The end result of this algorithm is the un-instrumented execution of privatized code, with no limitation on the code that can be called within a private transaction: in particular, one can call external libraries that know nothing about the transactions executing in the system.

4. Outline of correctness

For lack of space we do not discuss private transaction semantics and only briefly outline why our algorithm is correct. In a nutshell, each private transaction is preceded by a traversal through the privatization barrier, recording all active transactions. We assume all private transaction memory regions are not accessible to successful transactions. Thus, by waiting until all active transactions have completed, and given that private locations can no longer be reached by newly started transactions, privacy is guaranteed.

5. Empirical Performance Evaluation

Many of the scalable lock-based STM algorithms in the literature use a TL2-style locking and global-clock based scheme, differing perhaps in details such as the order of lock acquisition and the abort and retry policies [4-6, 11, 12]. Most of these algorithms do not support privatization because of its high overhead. We will therefore provide an evaluation of our new privatization algorithm by adding it to the most efficient known version of the TL2 algorithm, one using a GV6 clock scheme [4]. We call this new version of TL2 supporting private transactions TL2P.

We would have loved to provide a comparison of TL2P to the STM of Saha et al. [10], which provides privatization via a global transactional quiescing mechanism, but unfortunately it is only available using the authors' specific STM compiler framework, which cannot be applied to our algorithm.

This section therefore presents a comparison of the vanilla TL2 algorithm with a GV6 counter, the TL2-IP algorithm that provides implicit privatization for TL2, and our new TL2P algorithm supporting implicit privatization with private transactions. The microbenchmarks include the (now standard) concurrent red-black tree structure and a randomized work-distribution benchmark.

The red-black tree was derived from the java.util.TreeMap implementation found in the Java 6.0 JDK. That implementation was written by Doug Lea and Josh Bloch. In turn, parts of the Java TreeMap were derived from Cormen et al. [2]. We would have preferred to use the exact Fraser-Harris red-black tree, but that code was written to their specific transactional interface and could not readily be converted to a simple form.

The red-black tree implementation exposes a key-value pair interface of put, delete, and get operations. If the key is not present in the data structure, put will insert a new element describing the key-value pair. If the key is already present in the data structure, put will simply update the value associated with the existing key. The get operation queries the value for a given key, returning an indication of whether the key was present in the data structure. Finally, delete removes a key from the data structure, returning an indication of whether the key was found to be present in the data structure. The benchmark harness calls put, get and delete to operate on the underlying data structure. The harness allows the proportion of put, get and delete operations to be varied by way of command line arguments, as well as the number of threads, trial duration, initial number of key-value pairs to be installed in the data structure, and the key range. A key range of 2K elements generates a small tree, while a range of 20K elements creates a large tree, implying a larger transaction size for the set operations. We report the aggregate number of successful transactions completed in the measurement interval, which in our case is 10 seconds.

In the random-array benchmark each worker thread loops, accessing random locations. The transaction length can be constant or variable. While overly simplistic, we believe our random access benchmark captures critical locality-of-reference properties found in actual programs. We report the aggregate number of successful transactions completed in the measurement interval, which in our case is 10 seconds.

In our benchmarks we "transactified" the data structures by hand, explicitly adding transactional load and store operators. Ultimately we believe that compilers should perform this transformation. We did so since our goal is to explore the mechanisms and performance of the underlying transactional infrastructure, not the language-level expression of "atomic." Our benchmarked algorithms included:

TL2: The transactional locking algorithm of [4], using the GV4 global-clock algorithm that attempts to update the shared clock in every transaction, but only once: even if the CAS fails, it continues on to validate and commit. We use the latest version of TL2 which (through several code optimizations, as opposed to algorithmic changes) has about 25% better single-threaded latency than the version used in [4]. This algorithm is representative of a class of high-performance lock-based algorithms such as [6, 12, 16].

TL2-IP: A version of TL2 with an added mechanism to provide implicit privatization. Our scheme, which we discovered independently in 2007 [3], was also discovered by Marathe et al. [8], who in turn attribute the idea to Detlefs et al. It works by using a simplistic GV1 global clock advanced with CAS [4] before the validation of the read-set. We also add a new egress global variable, whose value "chases" the clock in the manner of a ticket lock. We opted to use GV1 so we could leverage the global clock as the incoming side of a ticket lock. In the transactional load operator each thread keeps track of the most recent GV (global clock) value that it observed, and if it has changed since the last load, we refresh the thread-local value and revalidate the read-set. That introduces a validation cost that is in the worst case quadratic. These two changes, serializing egress from the commit and revalidation, are sufficient to give TL2 implicit privatization. They solve both halves of the implicit privatization problem. The first half is the window in commit where a thread has acquired write locks and validated its read-set, but some other transaction races past and writes to a location in the first thread's read-set, privatizing a region into which the first thread is about to write. Serializing egress solves that problem. The second half of the problem is that one can end up with zombie reader transactions if a thread reads some variable and then accesses a region contingent or dependent on that variable, but some other thread stores into that variable, privatizing the region. Revalidating the read-set avoids that problem by forcing the first thread to discover the update, causing it to self-abort.
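The clock-plus-egress "ticket lock" structure described above can be sketched as follows. This is an illustrative sketch, not the TL2-IP code: the clock advance is shown as a fetch-and-add, which is equivalent here to the CAS loop the text describes, and names are invented.

```c
#include <stdatomic.h>

/* GV1 global clock: the incoming side of the ticket lock. */
atomic_long gv;
/* egress chases gv; commits leave in clock order. */
atomic_long egress;

/* At commit: advance the clock; the new value is this commit's ticket. */
long commit_begin(void) {
    return atomic_fetch_add(&gv, 1) + 1;
}

/* Serialize egress: a commit may complete only after every commit
 * with an earlier ticket has completed. */
void commit_end(long ticket) {
    while (atomic_load(&egress) != ticket - 1)
        ;  /* spin: an earlier committer has not left yet */
    atomic_store(&egress, ticket);
}
```

This is precisely what makes TL2-IP's egress a serialization point: every committer spins until all earlier ticket holders have exited, which is also why, as the benchmarks later show, heavy clock traffic from long transactions hurts TL2-IP.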

TL2P: This is the same TL2 algorithm without any internal changes, to which we added the private transaction support mechanism.

5.1 Red-Black Tree Benchmark

In the red-black tree benchmark, we varied the fraction of transactions with privatization. In the top two graphs in Figure 3, private transactions involve no computation, stressing the quiescing mechanism. We can see that under these extreme circumstances, in all the cases, unlike TL2-IP, the TL2P scheme is scalable at all levels of privatization. This is quite surprising, because one might think that as more threads run, one needs to scan more entries in the dynamic array when performing the private transaction. But as can be seen from the graphs, this does not impose a significant overhead on the quiescence mechanism. The TL2P algorithm with 20% mutations pays a maximum penalty of 15% for the 10% privatization case, 35% for 50% privatization, and 50% for 100% privatization. With 4% mutations (not shown in the graphs), perhaps a realistic level of mutation on a search structure, the maximum performance penalty in TL2P for 100% privatization, which is like implicit privatization, is no more than 25%. And if the privatization is only partial, say 10%, the penalty is just 11%! In general we can see that the TL2P algorithm with no private transaction usage is a little slower than TL2. This is because of the internal counters used for thread transactional tracking; they impose some minor overhead above standard TL2.

Figure 3. Throughput when private transactions do no work. A 2K sized Red-Black Tree on a 128 thread Niagara 2, with 25% puts and 25% deletes, and 10% puts and 10% deletes, for TL2, TL2-IP, and TL2P, varying the percentage of private transactions: 100%, 50%, and 10%.

5.2 Random Array Benchmark

In the random-array benchmark we vary the privatization density and the transaction patterns. The goal is to estimate the penalty private transactions pay for different transaction lengths. For short transactions, we use 32 reads per reader and 16 read-modify-write operations per writer. We use 128 reads and 64 read-modify-write operations for the long transaction case. To mimic the heterogeneous case, the reader length is randomized between 1-128 reads, and the writer between 1-64 read-modify-write accesses.

In Figure 4 we can see that the performance in the short and long transaction benchmarks is nearly the same. In both we see a 20% penalty for TL2P in the 100% privatization case, but the TL2-IP performance is different in the long transaction case, caused by a heavier use of the global clock, which is affected by long transactions.

Figure 4. Throughput when private transactions do no work. A 4M sized Random-Array on a 128 thread Niagara 2, for short transactions of 32 reads per reader and 16 read-modify-write operations per writer, long transactions of 128 reads per reader and 64 read-modify-write per writer, and heterogeneous transactions with random [1-128] reads per reader and random [1-64] read-modify-write per writer.

The heterogeneous benchmark creates a higher penalty than the constant-length transactions. That is because the private transaction barrier waits for all the active threads, and the probability that it will wait for one long transaction is higher as there are more threads in parallel. Therefore, the penalty for the quiescence is higher, but as the privatization level decreases to 50% and 10%, the performance improves. This is the case where the on-demand privatization approach saves the situation and allows TL2P to achieve good results.

Figure 5 shows the situation when private transactions involve computation. In this case the computation consists of a sequence of random reads approximately 10-15 times longer than the privatizing transaction. Because these are reads, TL2 can run them even though it has no privatization. Here one can see that TL2P (in blue) provides virtually the same performance as TL2 (in red) at all levels of privatization, confirming the potential of the private transaction quiescing technique. In contrast, TL2-IP (in orange) does not scale at any level.

In summary, we see that the simple quiescence privatization technique added to the TL2 STM provides TL2 with privatization support which delivers great scalable performance under realistic conditions, and takes advantage of partial privatization under full stress when there are empty private transactions.

6. Conclusion

We presented the first scalable approach for privatizing TL2/LSA-style invisible-read-based STM algorithms. Private transactions offer a simple intermediate approach that combines the ease of use of implicit privatization with the efficiency that can be obtained from explicitly knowing which regions are private. The result is a "pay as you go" cost for privatization, and a framework, private transactions, that will hopefully allow for further compiler and other optimizations that will make privatization a low cost addition, not just for STMs but perhaps for concurrent data structures in general.

The quiescing algorithm at the basis of the private transaction methodology is of independent value, as it can be used as a privatization barrier within STMs or in the context of other data structures.

7. Acknowledgements

This paper was supported in part by grants from the European Union under grant FP7-ICT-2007-1 (project VELOX), as well as grant 06/1344 from the Israeli Science Foundation, and a grant from Sun Microsystems.