SimPL: An Effective Placement Algorithm


Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov
University of Michigan, Department of EECS, 2260 Hayward St., Ann Arbor, MI
{mckima, ejdjsy,

Abstract

We propose a self-contained, flat, force-directed algorithm for global placement that is simpler than existing placers and easier to integrate into timing-closure flows. It maintains lower-bound and upper-bound placements that converge to a final solution. The upper-bound placement is produced by a novel rough legalization algorithm. Our placer SimPL outperforms mpl6, NTUPlace3, FastPlace3, APlace2 and Capo simultaneously in runtime and solution quality, running 6.4 times faster than mpl6 and reducing wirelength by 2% on the ISPD 2005 benchmark suite.

I. INTRODUCTION

Global placement currently remains at the core of physical design and is a gating factor for downstream optimizations during timing closure [2]. Despite impressive improvements reported by researchers [15] and industry software in the last five years, state-of-the-art algorithms and tools for placement suffer several key shortcomings, which are becoming more pronounced at recent technology nodes. These shortcomings fall into four categories: (i) speed, (ii) solution quality, (iii) simplicity and integration with other optimizations, (iv) support for multithreaded execution. We propose the SimPL algorithm, which simultaneously improves results in the first three categories and lends itself naturally to parallelism on multicore CPUs.

State-of-the-art algorithms for global placement form two families: (i) force-directed placers, such as Kraftwerk2 [2], FastPlace3 [22] and RQL [23], and (ii) non-linear optimization techniques, such as APlace2 [12], NTUPlace3 [7] and mpl6 [6]. Force-directed algorithms model total net length by a quadratic function of cell locations and minimize it by solving a large sparse system of linear equations. To discourage cell overlap, forces are added that pull cells away from high-density areas.
These forces are modeled by pseudopins and pseudonets, which extend the original quadratic function [11]. They are updated after each linear-system solve until iterations converge. Non-linear optimization models net length by more sophisticated differentiable functions with linear asymptotic behavior, which are then minimized by advanced numerical analysis techniques [12]. Cell density is modeled by functional terms, which are more accurate than forces, but also require updates after each change to placement [7], [12]. Algorithms in both categories are directly used in the industry or closely resemble those in industry placers.

Tools based on non-linear optimization achieve the best results reported for academic implementations [7] and EDA vendor tools, but are significantly slower, which is problematic for modern flat SoC placement instances with tens of millions of movable objects. To scale the basic non-linear optimization framework, all tools in this family employ netlist clustering and multilevel extensions, sometimes at the cost of solution quality. Such multilevel placers perform many sequential steps, obstructing efficient parallelization. Moreover, clustering and refinement do not fully benefit from modern multicore CPUs. Due to their complexity, multilevel placers are also harder to maintain, improve, and combine with other physical-design techniques. In particular, clustered netlists complicate accurate static timing analysis, congestion maps and physical synthesis, such as performance-driven buffering, gate sizing, fanin/fanout optimization, cloning, etc. [2]. Hence, timing-closure flows often repeat global placement 3-4 times, alternating it with timing analysis, physical synthesis and congestion improvement.

State-of-the-art force-directed placers tend to run many times faster than non-linear optimization, but also use multilevel extensions in their most competitive configurations. Their solution quality is mixed.
FastPlace3 underperforms mpl6 and NTUPlace3 [7], but the industry tool RQL, closely related to FastPlace, slightly outperforms these two non-linear placers. Kraftwerk2 is the only competitive flat placer (i.e., it does not use clustering) and rivals other force-directed placers in speed. However, it lags behind in solution quality. Its implementation poses several challenges, such as quickly solving Poisson's equation, ensuring the convergence of iterations and avoiding halos over macros. Our experience indicates that the performance of Kraftwerk2 can be uneven, and stability can only be achieved with some loss of solution quality [13]. State-of-the-art placers are described in the book [15] and recent journal papers [3], [7], [2].

In this work, we develop a new, self-contained technique for global placement that ranks as a flat force-directed placement algorithm. It maintains lower-bound and upper-bound placements that converge to a final solution. The upper-bound placement is produced by a novel rough legalization algorithm based on geometric top-down partitioning and non-linear scaling. Our implementation outperforms published
placers simultaneously in solution quality and speed on standard benchmarks. Our algorithm is simpler, and our attempts to improve overall results using additional modules and extensions from existing placers (such as netlist clustering [6], [12], [22], iterative local refinement (ILR) [22], and median-improvement (boxplace) [13]) were unsuccessful.

In the remainder of this paper, Section II describes the building blocks from which our algorithm was assembled. Section III introduces our key ideas and articulates our solution of the force-modulation problem. The SimPL algorithm is presented in Section IV. Extensions and the use of parallelism are discussed in Section V. Empirical validation is described in Section VI, and Section VII summarizes our results.

II. ESSENTIAL CONCEPTS AND BUILDING BLOCKS

Given a netlist $N = (E, V)$ with nets $E$ and nodes (cells) $V$, global placement seeks node locations $(x_i, y_i)$ such that the area of nodes within any rectangular region does not exceed the area of (cell sites in) that region (see footnote 1). Some locations of cells may be given initially and fixed. The interconnect objective optimized by global placement is the Half-Perimeter WireLength (HPWL). For node locations $\vec{x} = \{x_i\}$ and $\vec{y} = \{y_i\}$, $\mathrm{HPWL}_N(\vec{x}, \vec{y}) = \mathrm{HPWL}_N(\vec{x}) + \mathrm{HPWL}_N(\vec{y})$, where

$\mathrm{HPWL}_N(\vec{x}) = \sum_{e \in E} \left[ \max_{i \in e} x_i - \min_{i \in e} x_i \right]$    (1)

Efficient optimization algorithms often approximate $\mathrm{HPWL}_N$ by differentiable functions, as illustrated next.

Quadratic optimization. Consider a graph $G = (E_G, V)$ with edges $E_G$, vertices $V$ and edge weights $w_{ij} > 0$ for all edges $e_{ij} \in E_G$. The quadratic objective $\Phi_G$ is defined as

$\Phi_G(\vec{x}, \vec{y}) = \sum_{i,j} w_{i,j} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]$    (2)

Its x & y components are cast in matrix form [3], [2]

$\Phi_G(\vec{x}) = \frac{1}{2} \vec{x}^T Q_x \vec{x} + \vec{c}_x^T \vec{x} + \mathrm{const}$    (3)

The Hessian matrix $Q_x$ captures connections between pairs of movable vertices, while vector $\vec{c}_x$ captures connections between movable and fixed vertices.
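As a small illustration of Equation (1), the following Python sketch evaluates HPWL for a toy netlist. The function names and the representation of nets as lists of cell indices are assumptions made for this example, not part of the SimPL implementation.

```python
from typing import Sequence


def hpwl_1d(nets: Sequence[Sequence[int]], coords: Sequence[float]) -> float:
    """Sum over nets of (max - min) pin coordinate along one axis, as in Eq. (1)."""
    return sum(
        max(coords[i] for i in net) - min(coords[i] for i in net)
        for net in nets
    )


def hpwl(nets: Sequence[Sequence[int]],
         xs: Sequence[float],
         ys: Sequence[float]) -> float:
    """HPWL_N(x, y) = HPWL_N(x) + HPWL_N(y)."""
    return hpwl_1d(nets, xs) + hpwl_1d(nets, ys)


# Hypothetical toy netlist: a two-pin net and a three-pin net over four cells.
example_nets = [[0, 1], [1, 2, 3]]
example_xs = [0.0, 4.0, 1.0, 6.0]
example_ys = [0.0, 3.0, 5.0, 1.0]

# x contributes 4 + 5, y contributes 3 + 4.
print(hpwl(example_nets, example_xs, example_ys))  # 16.0
```

Because the max and min terms are not differentiable, placers replace this objective with smooth surrogates such as the quadratic form in Equations (2) and (3).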
When $Q_x$ is non-degenerate, $\Phi_G(\vec{x})$ is a strictly convex function with a unique minimum, which can be found by solving the system of linear equations $Q_x \vec{x} = -\vec{c}_x$. Solutions can be quickly approximated by iterative Krylov-subspace techniques, such as the Conjugate Gradient (CG) method and its variants [19]. Since $Q_x$ is symmetric positive definite, CG iterations provably minimize the residual norm. The convergence is monotonic [21], but its rate depends on the spectral properties of $Q_x$, which can be enhanced by preconditioning. In other words, we solve the equivalent system $P^{-1} Q_x \vec{x} = -P^{-1} \vec{c}_x$ for a non-degenerate matrix $P$, such that $P^{-1}$ is an easy-to-compute approximation of $Q_x^{-1}$. Given that $Q_x$ is diagonally dominant, we chose $P$ to be its diagonal, also known as the Jacobi preconditioner. Our placement algorithm (Section IV-C) deliberately enhances diagonal dominance in $Q_x$.

Footnote 1: In practice, this constraint is enforced for bins of a regular grid.

The Bound2Bound net model [2]. To represent the HPWL objective by the quadratic objective, the netlist $N$ is transformed into two graphs, $G_x$ and $G_y$, that preserve the node set $V$ and represent each two-pin net by a single edge with weight 1/length. Larger nets are decomposed depending on the relative placement of vertices: for each p-pin net, the extreme nodes (min and max) are connected to each other and to each internal node by edges, with the following weight

$w^{B2B}_{x,ij} = \frac{1}{(p-1)\,|x_i - x_j|}$    (4)

For example, 3-pin nets are decomposed into cliques with edge weight $1/(2l)$, where $l$ is the length of a given edge. In general, this quadratic objective and the Bound2Bound (B2B) net decomposition capture the HPWL objective exactly, but only for the given placement. As locations change, the error may grow, necessitating multiple updates throughout the placement algorithm. Most quadratic placers use the placement-independent star or clique decompositions, so as not to rebuild $Q_x$ and $Q_y$ many times [3], [22], [23].
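The exactness claim for the B2B model can be checked numerically. The sketch below, in Python, builds the B2B edges of one net along the x axis using the weight in Equation (4) and evaluates the quadratic objective of Equation (2) on them. The function names and the small `eps` guard against coincident pins are assumptions of this example.

```python
from typing import List, Sequence, Tuple


def b2b_edges(net: Sequence[int], x: Sequence[float],
              eps: float = 1e-9) -> List[Tuple[int, int, float]]:
    """B2B decomposition of one net along one axis: list of (i, j, weight)."""
    p = len(net)
    if p < 2:
        return []
    order = sorted(net, key=lambda i: x[i])
    lo, hi = order[0], order[-1]          # boundary (min and max) pins

    def weight(i: int, j: int) -> float:
        # Eq. (4); eps (an assumption) avoids division by zero for coincident pins.
        return 1.0 / ((p - 1) * max(abs(x[i] - x[j]), eps))

    edges = [(lo, hi, weight(lo, hi))]
    for k in order[1:-1]:                 # each internal pin connects to both bounds
        edges.append((k, lo, weight(k, lo)))
        edges.append((k, hi, weight(k, hi)))
    return edges


def quadratic_cost(edges: Sequence[Tuple[int, int, float]],
                   x: Sequence[float]) -> float:
    """Sum of w_ij * (x_i - x_j)^2 over the given edges, as in Eq. (2)."""
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)


# Hypothetical 4-pin net; its x-extent (HPWL along x) is 6 - 0 = 6.
xs = [0.0, 4.0, 1.0, 6.0]
net_edges = b2b_edges([0, 1, 2, 3], xs)
print(len(net_edges), quadratic_cost(net_edges, xs))
```

For a p-pin net this yields 2p - 3 edges, and at the placement used to compute the weights the quadratic cost equals the net's extent; once cells move, the weights are stale and the match is only approximate.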
Yet, the B2B model uses fewer edges than cliques (p > 3), avoids new variables used in stars, and is more accurate than both stars and cliques [2].

III. KEY IDEAS IN OUR WORK

Analytic placement techniques first minimize a function of interconnect length, neglecting overlaps between standard cells, macros, etc. This initial step places many cells in densely populated regions, typically around the center of the layout. Cell locations are then gradually spread through a series of placement iterations, during which interconnect length slowly increases, converging to a final overlap-free placement (a small amount of overlap is often allowed and later resolved during detailed placement).

Our algorithm also starts with pure interconnect minimization, but its next step is unusual: most overlaps are removed using a fast rough legalizer based on top-down geometric partitioning and non-linear scaling. Locations of movable objects in the legalized placement serve as anchors to coerce the initial locations into a configuration with less overlap, by adding pseudonets to baseline force-directed placement [11]. Each subsequent iteration of our algorithm produces (i) an illegal placement that underestimates the final result through the linear system solver, and (ii) an almost-legal placement that overestimates the final
result through rough legalization. The gap between the lower and upper bounds helps monitor convergence (Section IV-C).

Solving the force-modulation problem. A key innovation in SimPL is the interaction between the lower-bound and the upper-bound placements: it ensures convergence to a no-overlap solution while optimizing interconnect length. It solves two well-known challenges in analytic placement: (1) finding directions in which to spread the locations (force orientation), and (2) determining the appropriate amount of spreading (force modulation) [13], [23]. This is unlike previous work, where spreading directions are typically based on local information, e.g., placers based on non-linear optimization use gradient information and require a large number of expensive iterations. Kraftwerk2 [2] orients spreading forces according to solutions of Poisson's equation, providing a global perspective and speeding up convergence. However, this approach does not solve the force-modulation problem, as articulated in [13] (see footnote 2). The authors of RQL [23], which can be viewed as an improvement on FastPlace, revisit the force-modulation problem and address it by a somewhat ad hoc limit on the magnitude of spreading forces. In our work, the rough legalization algorithm (Section IV-B), invoked at each iteration, determines both the direction and the magnitude of spreading forces. It is global in nature, accounts for fixed obstacles, and preserves relative placement to ensure interconnect optimization and convergence. Our placement algorithm does not require exotic components, such as a Poisson-equation solver used by Kraftwerk; our C++ implementation is self-contained.

Global placement with look-ahead. The legalized upper-bound placements constructed at every iteration of our placer can be viewed as look-ahead. They pull cell locations in lower-bound placements not just away from dense regions, but also toward the regions where space is available.
Such area look-ahead is particularly useful around fixed obstacles, where local information does not offer sufficient guidance. While not explored in this paper, similar congestion look-ahead and timing look-ahead based on legalized placements can be used to integrate our placement algorithm into modern timing-closure flows.

IV. OUR GLOBAL PLACEMENT ALGORITHM

Our placement technique consists of three phases: initial placement, global placement iterations and post-global placement (Figure 1). Initial placement, described next, is mostly an exercise in judicious application of known components. Our main innovation is in the global placement phase. Post-global placement is straightforward, given the current state of the art.

Footnote 2: The work in [13] performs force modulation with line search but is not currently competitive with the state of the art.

Fig. 1. The SimPL algorithm uses the placement-dependent B2B net model, which is updated on every iteration. Gap refers to the difference between upper and lower bounds.

A. Initial Placement

Our initial-placement step is conceptually similar to those of other force-directed placers [2], [22], [23]: it entirely ignores cell areas and overlaps, so as to minimize a quadratic approximation of total interconnect length. We found that this step notably impacts the final result. Therefore, unlike FastPlace3 [22] and RQL [23], we use the more accurate Bound2Bound (B2B) net model from [2] reviewed in Section II. After the first quadratic solve, we rebuild the circuit graph because the B2B net model is placement-dependent. We then alternate quadratic solves and graph rebuilding until HPWL stops improving. In practice, this requires a small number of iterations (5-7), regardless of benchmark size, because the relative ordering of locations stabilizes quickly.

B. Rough Legalization

Consider a set of cell locations with a significant amount of overlap as measured using bins of a regular grid.
Rough legalization changes the global positioning of those locations, seeking to remove most of the overlap (with respect to the grid) while preserving the relative ordering. This task can be formulated at different geometric scales by varying the grid. The quality of rough legalization is measured by its impact on the entire placement flow. Our rough legalization is based on top-down recursive geometric partitioning and non-linear scaling, as outlined in Algorithm 1.
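The first step of Algorithm 1, identifying γ-overfilled bins and grouping adjacent ones, can be sketched in Python as follows. The grid representation (nested lists of per-bin cell area and available site area), the 4-neighbor adjacency and the function name are assumptions of this example; the paper specifies only that adjacent overfilled bins are clustered by breadth-first search.

```python
from collections import deque
from typing import List, Sequence, Tuple

Bin = Tuple[int, int]


def overfilled_clusters(cell_area: Sequence[Sequence[float]],
                        site_area: Sequence[Sequence[float]],
                        gamma: float) -> List[List[Bin]]:
    """Group gamma-overfilled bins into clusters of 4-adjacent bins via BFS."""
    rows, cols = len(cell_area), len(cell_area[0])
    # A bin is overfilled when cell area exceeds gamma times available site area.
    over = [[cell_area[r][c] > gamma * site_area[r][c] for c in range(cols)]
            for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    clusters: List[List[Bin]] = []
    for r in range(rows):
        for c in range(cols):
            if not over[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, cluster = deque([(r, c)]), []
            while queue:
                br, bc = queue.popleft()
                cluster.append((br, bc))
                for nr, nc in ((br + 1, bc), (br - 1, bc), (br, bc + 1), (br, bc - 1)):
                    if 0 <= nr < rows and 0 <= nc < cols \
                            and over[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            clusters.append(sorted(cluster))
    return clusters


# Hypothetical 3x3 grid with 10 units of site area per bin.
cells = [[9, 9, 1], [1, 1, 1], [1, 1, 9]]
sites = [[10] * 3 for _ in range(3)]
print(overfilled_clusters(cells, sites, gamma=0.8))
```

Comparing cell area against γ times site area, rather than dividing, keeps the test well defined for bins fully covered by obstacles, where the available site area is zero.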

Algorithm 1. Rough Legalization by Top-down Geometric Partitioning and Non-linear Scaling

Input: maximum allowed density γ, where 0 < γ < 1; floorplan with obstacles; placement of cells
Queue of rectangles Q = ∅

Identify γ-overfilled bins and cluster them // Fig. 2(a)
foreach cluster c do
    Find a minimal rectangular region R ⊇ c with density(R) ≤ γ
    Q.enqueue(R)
    while !Q.empty() do
        B = Q.dequeue()
        M = {movable cells in B}
        Find axis-aligned cutline Cc to evenly split cell area in M
        Find axis-aligned cutline Cb to evenly partition B
        (S0, S1) = {two sub-regions of B created by cutline Cc}
        M0 = {movable cells in S0}
        M1 = {movable cells in S1}
        Perform NON-LINEAR SCALING on M0 to Cb // Fig. 3
        Perform NON-LINEAR SCALING on M1 to Cb // Fig. 3
        if Area(B) > .1 LayoutArea then
            (B0, B1) = {two sub-regions of B created by cutline Cb}
            Q.enqueue(B0)
            Q.enqueue(B1)
        end if
    end while
end foreach

Handling density constraints. For each grid bin of a given regular grid, we calculate the total area of contained cells $A$ and the total available area of cell sites $A^+$. A bin is γ-overfilled if its cell density $A / A^+$ exceeds the given density limit 0 < γ < 1. Adjacent γ-overfilled bins are clustered by Breadth-First Search (BFS), and rough legalization is performed on such clusters. For each cluster, we find a minimal containing rectangular region with density ≤ γ (these regions can also be referred to as clusters). A key insight is that overlap removal in a region that is filled to capacity is more straightforward, because the absence of whitespace leaves less flexibility for interconnect optimization. If relative placement must be preserved, overlap can be reduced by means of x- and y-sorting with subsequent greedy packing. The next step, non-linear scaling, implements this intuition, but relies on the cell-area cutline Cc chosen in Algorithm 1 and shifts it toward the median of available area Cb, so as to equalize densities in the two sub-regions (Figure 2).

Fig. 2. (a), (b) Clustering of overfilled bins in Algorithm 1 and adjustment of cell-area to region-area median by non-linear scaling (also see Figure 3). Movable cells are shown in blue, obstacles in solid gray.

Non-linear scaling in one direction is illustrated in Figure 3, where a new region was created by a vertical cutline Cb during rough legalization. This region is further subdivided into vertical stripes parallel to Cb. First, cutlines are drawn along the boundaries of obstacles present in this region. Each vertical stripe created in this process is further subdivided (by up to 10 evenly distributed cutlines) if its width exceeds 1/10 of the region's width. Movable cells in the region are then sorted by their distance from Cb and greedily packed into the stripes in that order. For each stripe, we calculate the available site area $A^+$ and consider the stripe filled when the area of assigned cells reaches $\gamma A^+$. Cell locations within each stripe are linearly scaled from current locations (non-linearity arises from different scaling in different stripes).

Fig. 3. Non-linear scaling in a region with obstacles (I): the formation of Cb-aligned stripes (II), cell sorting by distance from Cb (III), greedy cell positioning (IV).

Rough legalization applies non-linear scaling in alternating directions, as illustrated in Figure 4 on one of the ISPD 2005 benchmarks. Here, a region R is selected that contains overfilled bins, but is wide enough to facilitate overlap removal. R is first partitioned by a vertical cutline, after which non-linear scaling is applied in the two new sub-regions. Subsequently, rough legalization (Algorithm 1) considers each sub-region individually and selects different horizontal cutlines. Four rounds of non-linear scaling follow, spreading cells over the region's expanse (Figure 4). Despite a superficial similarity to cell-shifting in FastPlace [22], our non-linear scaling does not use cell locations to define bins/ranges, or map ranges onto a uniform grid.

Fig. 4. Non-linear scaling after the first vertical cut and two subsequent horizontal cuts (ADAPTEC1) from intermediate steps between iterations 0 and 1 in Figure 7.

Cutline shifting. Median-based cutlines are neither necessary nor sufficient for good solution quality. We therefore adopt a fast cutline positioning technique from [17]. When obstacles cover <2% of a region's area, we find a cutline position Cc to minimize net cut, with <55% of cell area per partition. We record the ratio ρ of cell areas in the two partitions and adjust the region's Cb cutline to the position that partitions the region's area with the same ratio ρ.

C. Global Placement Iterations

Using legalized locations as anchors. Solving an unconstrained linear system results in a placement with a significant amount of overlap. To pull cells away from their initial positions, we gradually perturb the linear system. As explained in Section IV-B, at each iteration of our global placement, top-down geometric partitioning and scaling generates a roughly legalized solution. We use these legalized locations as fixed, zero-area anchors connected to their original cells with artificial two-pin pseudonets.

Fig. 5. An anchor with a pseudonet.

Furthermore, following the discussion in Section II, we note that connections to fixed locations do not increase the size of the Hessian matrix $Q$, and only contribute to its diagonal elements. This enhances diagonal dominance, the condition number of $P^{-1} Q$, and the convergence rate of Jacobi-preconditioned CG. In addition to weights given by the B2B net model on pseudonets, we control cell movement and iteration convergence by multiplying each pseudonet weight by an additional factor $\alpha > 0$ computed as $\alpha = 0.01\,(1 + \mathrm{iterationNumber})$.
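The effect of anchors on the linear system can be sketched on a one-dimensional toy problem. The Python code below is an illustration under stated assumptions: dense list-of-lists matrices, the helper names `jacobi_pcg` and `add_anchor_pseudonets`, the `eps` guard, and a right-hand side `b` that stands for the negated vector of Equation (3) are all choices of this example rather than details of the SimPL implementation. Each pseudonet is treated as a two-pin net with B2B weight 1/length, scaled by α.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def jacobi_pcg(Q: Matrix, b: Sequence[float], x0: Sequence[float],
               tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Solve Q x = b by conjugate gradients with the diagonal of Q as preconditioner."""
    n = len(b)

    def matvec(v: Sequence[float]) -> List[float]:
        return [sum(Q[i][j] * v[j] for j in range(n)) for i in range(n)]

    x = list(x0)
    Qx = matvec(x)
    r = [b[i] - Qx[i] for i in range(n)]
    z = [r[i] / Q[i][i] for i in range(n)]      # Jacobi: z = P^{-1} r, P = diag(Q)
    p = list(z)
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        if math.sqrt(sum(ri * ri for ri in r)) < tol:
            break
        Qp = matvec(p)
        step = rz / sum(p[i] * Qp[i] for i in range(n))
        x = [x[i] + step * p[i] for i in range(n)]
        r = [r[i] - step * Qp[i] for i in range(n)]
        z = [r[i] / Q[i][i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


def add_anchor_pseudonets(Q: Matrix, b: Sequence[float], x: Sequence[float],
                          anchors: Sequence[float], alpha: float,
                          eps: float = 1e-9) -> Tuple[Matrix, List[float]]:
    """Return copies of (Q, b) with one anchor pseudonet per cell added."""
    Q2 = [row[:] for row in Q]
    b2 = list(b)
    for i, a in enumerate(anchors):
        w = alpha / max(abs(x[i] - a), eps)     # alpha times the 2-pin B2B weight
        Q2[i][i] += w                           # only the diagonal changes
        b2[i] += w * a                          # anchor acts like a fixed pin
    return Q2, b2


# Hypothetical system: two movable cells joined by a unit-weight edge,
# each also tied by a unit-weight edge to a fixed pin (at 0 and at 10).
Q0 = [[2.0, -1.0], [-1.0, 2.0]]
b0 = [0.0, 10.0]
lower = jacobi_pcg(Q0, b0, [0.0, 0.0])           # about [3.33, 6.67]
Q1, b1 = add_anchor_pseudonets(Q0, b0, lower, anchors=[2.0, 8.0], alpha=1.0)
print(lower, jacobi_pcg(Q1, b1, lower))          # cells move toward 2 and 8
```

Since the anchors touch only diagonal entries and the right-hand side, the sparsity pattern of the matrix is unchanged while its diagonal dominance increases, which is the property the Jacobi preconditioner benefits from.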
At early iterations, small α values weaken spreading forces, giving greater significance to interconnect and more freedom to the linear system solver. As the relative ordering of cells stabilizes, increasing α values boost the pull toward the anchors and accelerate the convergence of lower bounds and upper bounds.

Grid resizing. To identify γ-overfilled bins, we overlay a uniform grid over the entire layout. The grid size is initially set to $S_{init}$ = 2 2 to accelerate the rough legalization. However, in order to accurately capture the amount of overlap, the grid size decreases by β = 1.6 at each iteration of global placement until it reaches 2× the average movable cell size. Grid resizing also affects the clustering of γ-overfilled bins during rough legalization (Section IV-B), effectively limiting the amount of cell movement and encouraging convergence at later iterations. A progression of global placement is annotated with HPWL values in Figure 7. The upper-bound placements on the right appear blocky in the first iteration, but gradually refine with grid resizing.

Fig. 6. Lower and upper bounds for HPWL, the amount of overlap at each iteration & HPWL of the legal placement (ADAPTEC1).

Convergence criteria. A convergence criterion similar to that in Section IV-A can be adopted in global placement. We alternate (1) rough legalization, (2) updates to anchors and the B2B net model, and (3) solution of the linear system, until HPWL of solutions generated by rough legalization stops improving. Unlike in the initial placement step, however, HPWL values of upper-bound solutions oscillate during the first 5-10 iterations, as illustrated in Figure 6. To prevent premature convergence, we monitor the gap between the lower and upper bounds. Global placement continues until the gap is reduced and stops improving.
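One possible reading of the gap-based stopping rule is sketched below in Python. The relative tolerance, the patience window and the function name are assumptions of this example, since the text does not give numeric thresholds; the HPWL values are made-up illustrative numbers, not measurements.

```python
from typing import Sequence


def gap_converged(lower: Sequence[float], upper: Sequence[float],
                  rel_tol: float = 0.05, patience: int = 2) -> bool:
    """True when the upper/lower gap has shrunk and stopped improving.

    Assumed rule: the gap is below its initial value, and none of the last
    `patience` gaps improves on the best earlier gap by more than rel_tol.
    """
    gaps = [u - l for l, u in zip(lower, upper)]
    if len(gaps) <= patience:
        return False
    best_before = min(gaps[:-patience])
    reduced = gaps[-1] < gaps[0]
    stalled = all(g > best_before * (1.0 - rel_tol) for g in gaps[-patience:])
    return reduced and stalled


# Illustrative (made-up) lower-bound and upper-bound HPWL per iteration.
lb_hpwl = [50.0, 60.0, 66.0, 69.0, 70.0, 70.2, 70.3, 70.31]
ub_hpwl = [150.0, 120.0, 95.0, 85.0, 80.0, 79.9, 80.0, 80.0]

print(gap_converged(lb_hpwl[:5], ub_hpwl[:5]))  # gap still shrinking quickly
print(gap_converged(lb_hpwl, ub_hpwl))          # gap has flattened out
```

Monitoring the gap rather than the upper bound alone makes the check less sensitive to the early oscillation of upper-bound HPWL described above.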
On the ISPD 2005 benchmark suite, this convergence criterion entails iterations of global placement. The final set of locations (global placement) is produced by the last rough legalization as indicated in Figure 1.

V. EXTENSIONS AND IMPROVEMENTS

The algorithm in Section IV can be improved in terms of runtime and solution quality.

A. Selecting Windows for Rough Legalization

During early global iterations, most movable cells of the lower-bound placement reside near the center of the layout region (Figure 7). In such cases, there is usually one expanded minimal rectangular region (cluster) that will encompass most of the γ-overfilled bins. However, as global iterations progress, γ-overfilled bins will be scattered around the layout region, and multiple clusters of bins may exist. In our implementation, we process γ-overfilled bins in the decreasing order of density. Each expansion stops when the cluster's density drops to γ or the cluster abuts the boundaries of previously processed clusters. This strategy may generate incompletely expanded clusters, especially in mid-stages of global placement iterations. However, as the densest bins are processed first, the number of regions with peak density is guaranteed to decrease at every iteration except when the peak density itself decreases. At each iteration of global placement, rough legalization is repeated up to ten times until maximal density is decreased below γ.

Fig. 7. A progression of global placement snapshots from different iterations and algorithm steps (ADAPTEC1). IP=Initial Placement, RL=Rough Legalization, LSS=Linear System Solver. Left-side placements show lower bounds and right-side placements show upper bounds. Panel titles: HPWL=4.484e+7 (IP); HPWL=1.51e+8 (RL); HPWL=5.556e+7 (LSS); HPWL=1.173e+8 (RL); HPWL=6.496e+7 (LSS); HPWL=9.28e+7 (RL); HPWL=6.824e+7 (LSS); HPWL=8.572e+7 (RL).

B. Speeding up Placement Using Parallelism

Further speed-up is possible for SimPL on workstations with multicore CPUs. Runtime bottlenecks in the sequential variant of the SimPL algorithm (Section VI-A), namely updates to the B2B net model and the CG solver, can be parallelized. Given that the B2B net model is separable, we process the x and y cases in parallel. When more than two cores are available, we split the nets of the netlist into equal groups that can be processed by multiple threads. To parallelize the CG solver, we applied a coarse-grain row partitioning [1] scheme to the Hessian matrix Q, where different blocks of rows are assigned to different threads. A critical kernel operation in CG is the Sparse Matrix-Vector multiply (SpMxV). Memory bandwidth is a known performance bottleneck in a uniprocessor environment [9], and its impact is likely to worsen when multiple cores access the main memory through a common bus. We reduce the memory bandwidth demand of SpMxV by using the CSR (Compressed Sparse Row) [19] memory layout for the Hessian matrix Q.

As part of our empirical validation, we ran SimPL on an 8-core AMD-based system with four dual-core CPUs. Single-thread execution was compared to eight-thread execution. Solution quality did not appreciably change, but memory usage increased by 5%, whereas runtime of global placement iterations was reduced by 2-3 times. The initial placement stage was accelerated by about 4 times. While CG remained the runtime bottleneck of SimPL on 8 cores, rough legalization, which we have not yet parallelized, became a close second (> 2%). In addition to thread-level parallelism, our implementation makes use of SSE instructions (through g++ intrinsics) that perform several floating-point operations at once. However, the speed-up they provided to global placement was only several percent and not worth the development effort. The overall speed-up due to parallelism varies between different hardware systems, as it depends on the relation between CPU speed and memory bandwidth.

VI. EMPIRICAL VALIDATION

Our implementation was written in C++ and compiled with g++. Unless indicated otherwise, benchmark runs were performed on an Intel Core i7 Quad CPU Q66 Linux workstation running at 3.2GHz, using only one CPU core. We compared SimPL to other academic placers on the ISPD 2005 placement contest benchmark suite. Focusing on global placement, we delegate final legalization (into rows and sites) and detailed placement to FastPlace-DP [16], but post-process it by a greedy cell-flipping algorithm from Capo [5]. HPWL of solutions produced by each placer is computed by the GSRC Bookshelf Evaluator [1].

A. Analysis of Our Implementation

The SimPL global placer is a stand-alone tool that includes I/O, initial placement and global placement iterations. Living up to its name, it consists of fewer than 5,000 lines of C++ code and relies only on standard C++ libraries. There are four command-line parameters that affect performance: two for grid resizing (initial and step), and two for pseudonet weighting (initial and step). In all experiments we used default values described in Section IV. Running in a single thread, SimPL completes the entire ISPD 2005 benchmark suite in 1 hour 2 minutes, placing the largest benchmark, BIGBLUE4 (2.18M cells), in 36 minutes using 2.1GB of memory.

We report the runtime breakdown on BIGBLUE4 according to Figure 1, excluding 1.4% runtime for I/O. Initial placement takes 5.2% of total runtime, of which 3.9% is spent in CG, and 1.3% in building B2B net models and sparse matrices for CG. Global placement iterations take 36.2%, of which 18% is in the CG solver, and 7.9% is in sparse matrix construction and B2B net modeling. Inserting pseudonets takes 1.3%, and rough legalization 9%. Post-global placement takes 57.2%, predominantly in detailed placement. Greedy orientation improvement and HPWL evaluation were almost instantaneous.

TABLE I. Legal HPWL (×1E6) and total runtime (minutes) comparison on the ISPD 2005 benchmark suite. Each placer ran as a single thread on a 3.2GHz Linux workstation. HPWL was computed by the GSRC Bookshelf Evaluator [1]. Per-placer columns (HPWL, Runtime): APlace2.0, Capo1.5, FastPlace3.0, mpl6, SimPL.

Benchmark    Size (#cells)
ADAPTEC1     211K
ADAPTEC2     255K
ADAPTEC3     452K
ADAPTEC4     496K
BIGBLUE1     278K
BIGBLUE2     558K
BIGBLUE3     1.1M
BIGBLUE4     2.18M
Average

B. Comparisons to State-of-the-art Placers

We compared SimPL to other placers whose binaries are available to us. Our requests for NTUPlace3 binaries went unanswered, but NTUPlace3 results [7] are very similar to mpl6, which we compare to SimPL. We run each available placer (see footnote 3), including SimPL, in default mode and show results in Table I. The HPWL results reported by APlace2 [12], Capo1.5 [18] and mpl6 [6] were confirmed by the GSRC Bookshelf Evaluator. However, FastPlace3 [22] reported lower HPWL by 0.25% to 0.95%. For consistency, we report the readings of the GSRC Bookshelf Evaluator.
SimPL found placements with the lowest HPWL for seven out of eight circuits in the ISPD 2005 benchmark suite (no parameter tuning to specific benchmarks was employed). On average, SimPL obtains wirelength improvement of 8.26%, 18.7%, 3.85%, and 1.96% versus APlace2, Capo1.5, FastPlace3 and mpl6, respectively. SimPL was also the fastest among the placers on seven out of eight circuits, as well as on average. It is 6.4 times faster than mpl6, which appears to be the strongest pre-existing placer. SimPL is 1.22 times faster than FastPlace3.0, which has been the fastest academic placer so far.

Footnote 3: The KraftWerk2 binary we obtained did not run on our system.

While we managed to obtain almost all best-performing academic placers in binaries, RQL reportedly outperforms mpl6 in HPWL by a small amount [23]. Comparing our HPWL results to numbers in [23], we observe four wins for SimPL and four losses. RQL is 3.1 times faster than mpl6, making it more than twice as slow as SimPL.

VII. CONCLUSIONS AND FUTURE WORK

In this work, we developed a new, flat, force-directed algorithm for global placement. Unlike other state-of-the-art placers, it is rather simple, and our self-contained implementation includes fewer than 5,000 lines of C++ code. The algorithm is iterative and maintains two placements: one computes a lower bound and the other an upper bound on final wirelength. These two placements interact, ensuring stability and fast convergence of the algorithm. The upper-bound placement is produced by a new rough legalization algorithm, based on top-down geometric partitioning and non-linear scaling, and converges to final cell locations. In contrast, all analytic algorithms we reviewed (both force-directed and non-linear) derive their final solution from a lower-bound placement. The use of upper-bound placements offers a solution to the force-modulation problem [13], [23] and removes the need for the so-called hold forces used by several force-directed placers.
As discussed in Section III, upper-bound placements perform an area look-ahead (see footnote 4) that is instrumental in the handling of layout obstacles. APlace2, NTUPlace3, mpl6, as well as some force-directed placers, model obstacles by additional smoothened penalty terms in the objective function. Not only do such terms introduce extra work, but they also add imprecision to modeling. For similar reasons, SimPL avoids netlist clustering used by other placers. We have implemented several other techniques

Footnote 4: The concept of area look-ahead was proposed in [8] for block-packing by nested bisection, where it checks if a given bisection admits a legal block packing in each partition. Area look-ahead was not used in [8] to spread standard cells from dense regions.
