Jan Gray.
Building a RISC System in an FPGA.
In Circuit Cellar Ink, Number 116--118, March, April, May, 2000.
[bibtex entry]
[URL]
Tutorial on building custom processors optimized for implementation on FPGAs.

Monty Denneau.
The Yorktown Simulation Engine.
In Proceedings of the 19th Design Automation Conference,
p. 55--59, 1982.
[bibtex entry]
[URL]
A pre-FPGA logic simulation engine, used to simulate logic
before building hardware.
It includes most of the ideas behind multicontext FPGAs.

Jonathan Rose and Robert Francis and David Lewis and Paul Chow.
Architecture of Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency.
In IEEE Journal of Solid-State Circuits,
Volume 25,
Number 5,
pp. 1217--1225,
October, 1990.
[bibtex entry]
[DOI 10.1109/4.62145]
Why did we start with 4-LUT FPGAs? But more than that,
this is a beautiful example of formulating a clean question
about architecture, defining a parameterized space,
identifying a cost model, and exploring the space to
find the best option.

Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald Minnich, Douglas Sweely, and Daniel Lopresti.
Building and Using a Highly Programmable Logic Array.
In IEEE Computer, Volume 24, Number 1,
pp. 81--89, 1991.
[bibtex entry]
[DOI 10.1109/2.67197]
One of the early FPGA computing systems; using a board of FPGAs, it demonstrated performance exceeding that of supercomputers on a specialized problem (DNA sequence matching). The entire capacity of one of these boards is smaller than that of one of today's midrange FPGAs.

Jason Cong and Yuzheng Ding.
FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs.
In IEEE Transactions on Computer-Aided Design, Volume 13, Issue 1,
pp. 1--12, January, 1994.
[bibtex entry]
[DOI 10.1109/43.273754]
How to cover logic with LUTs;
a nice observation that the problem can be reframed from logic packing
to IO cuts. The use of dynamic programming and max flow is algorithmically
elegant. There is a wealth of improvements and more sophisticated
versions since this, but it's worth starting here for the cleanness
of this basic problem formulation.

Larry McMurchie and Carl Ebeling.
PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs.
In Proceedings of the International Symposium on Field-Programmable Gate Arrays,
pp. 111--117, 1995.
[bibtex entry]
[DOI 10.1145/201310.201328]
The basic routing algorithm around which virtually all FPGA
routing is built today. Dispenses with separate global/detail
phases and uses adaptive costs to sort out congestion.
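
The adaptive-cost idea can be illustrated with a minimal sketch. It assumes a node cost of the form (base + history) * present-congestion penalty, unit base costs, two-terminal nets, and unit node capacity; the function names, the toy graph, and the parameters `pres_fac`, `pres_mult`, and `hist_fac` are hypothetical choices for illustration, not the paper's implementation.

```python
import heapq


def shortest_path(graph, cost, src, dst):
    """Dijkstra over per-node costs; returns the node list from src to dst."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph.get(u, ()):
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError(f"{dst} unreachable from {src}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def negotiate_routes(graph, nets, capacity=1, max_iters=20,
                     pres_fac=0.5, pres_mult=2.0, hist_fac=1.0):
    """Rip up and reroute every net each iteration until no node is overused."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    occ = {n: 0 for n in nodes}      # nets currently using each node
    hist = {n: 0.0 for n in nodes}   # accumulated congestion history
    routes = {}
    for _ in range(max_iters):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):   # rip up this net's old route
                occ[n] -= 1

            def cost(n):
                # Assumed cost form: (base + history) * present penalty.
                over = max(0, occ[n] + 1 - capacity)
                return (1.0 + hist[n]) * (1.0 + pres_fac * over)

            path = shortest_path(graph, cost, src, dst)
            for n in path:
                occ[n] += 1
            routes[name] = path
        overused = [n for n in nodes if occ[n] > capacity]
        if not overused:
            return routes
        for n in overused:                   # remember where congestion occurred
            hist[n] += hist_fac * (occ[n] - capacity)
        pres_fac *= pres_mult                # sharing gets costlier over time
    raise RuntimeError("no legal routing found")


# Toy graph: both nets prefer the shared node "m"; each has a longer detour.
graph = {
    "s1": ["m", "a1"], "a1": ["a2"], "a2": ["t1"],
    "s2": ["m", "b1"], "b1": ["b2"], "b2": ["b3"], "b3": ["t2"],
    "m": ["t1", "t2"],
}
nets = {"n1": ("s1", "t1"), "n2": ("s2", "t2")}
routes = negotiate_routes(graph, nets)
```

In this toy case both nets first claim "m"; the history term then pushes the net with the cheaper detour off the shared node, without any separate global-routing step.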

John Villasenor and Chris Jones and Brian Schoner.
Video Communications using Rapidly Reconfigurable Hardware.
In IEEE Transactions on Circuits and Systems for Video Technology,
Volume 5,
Number 6,
pp. 565--567, December 1995.
[bibtex entry]
[DOI 10.1109/76.475899]
Early article articulating and demonstrating the idea of using rapid run-time reconfiguration to run large tasks on smaller FPGA systems.

André DeHon.
DPGA Utilization and Application.
In Proceedings of the International Symposium on Field-Programmable Gate
Arrays, pp. 115--121, February, 1996.
[bibtex entry]
[DOI 10.1145/228370.228387]
What would you do with a multicontext FPGA and what benefits does it offer?

Ethan Mirsky and André DeHon.
MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources.
In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines,
pp. 157--166, April, 1996.
[bibtex entry]
[DOI 10.1109/FPGA.1996.564808]
Early coarse-grained reconfigurable architecture that allows flexible organization of units and instruction distribution. The basic element is a composable 8-bit functional unit with a 256-byte memory/register file that can also be used to hold dynamic instructions.

Brian Von Herzen.
Signal Processing at 250 MHz using High-Performance FPGAs.
In Proceedings of the International Symposium on Field-Programmable Gate Arrays,
pp. 62--68, 1997.
[bibtex entry]
[DOI 10.1145/258305.258313]
Early and inspiring demonstration that FPGAs can operate productively at very high clock rates by paying careful attention to spatial locality and pipelining.

Jan Rabaey.
Reconfigurable Computing: The Solution to Low Power Programmable DSP.
In Proceedings of the 1997 IEEE International Conference on Acoustics,
Speech, and Signal Processing,
Volume 1, pp. 275--278, April, 1997.
[bibtex entry]
[DOI 10.1109/ICASSP.1997.599622]
Early paper making the case for the energy efficiency of reconfigurable architectures and including an early comparison of energy among processors, FPGAs, and ASICs.

W. Bruce Culbertson, Rick Amerson, Richard Carter, Phil Kuekes, and Greg Snider.
Defect Tolerance on the TERAMAC Custom Computer.
In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines,
pp. 116--123, April, 1997.
[bibtex entry]
[DOI 10.1109/FPGA.1997.624611]
Shows how the reconfigurability of the FPGA can be used to map around
defects in the fabricated IC or board-level system.
An early paper giving a full-system demonstration of the benefits of component-specific mapping.

Steve Trimberger and Dean Carberry and Anders Johnson and Jennifer Wong.
A Time-Multiplexed FPGA.
In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines,
pp. 22--28, April, 1997.
[bibtex entry]
[DOI 10.1109/FPGA.1997.624601]
How to add multicontext support to a mostly conventional FPGA architecture base.

Vaughn Betz and Jonathan Rose.
VPR: A New Packing, Placement, and Routing Tool for FPGA Research.
In Proceedings of the International Conference on Field-Programmable
Logic and Applications
(published as LNCS-1304),
pp. 213--222, Springer, 1997.
[bibtex entry]
[URL]
A good placer coupled with a good version of PathFinder
and targeted at island-style FPGAs. The free availability
of this high-quality tool has provided a baseline standard
for FPGA architectural work for over a decade.

Peichen Pan and Chih-Chang Lin.
A New Retiming-based Technology Mapping Algorithm for LUT-based FPGAs.
In Proceedings of the International Symposium on Field-Programmable Gate
Arrays, pp. 35--42, February, 1998.
[bibtex entry]
[DOI 10.1145/275107.275118]
Optimally solves LUT mapping and retiming simultaneously;
there are so few things we can solve optimally, and so few
things we can afford to address together, that it's refreshing
to see formulations where you can provide optimal results
across multiple traditional levels of decomposition.
As with FlowMap, there are later papers which take this
further and provide more efficient and general solutions,
but the earlier papers introduce the cleanest problems
and key ideas.

Vaughn Betz and Jonathan Rose.
How Much Logic Should Go in an FPGA Logic Block?
In IEEE Design and Test of Computers, Volume 15, Number 1,
pp. 10--15, 1998.
[bibtex entry]
[DOI 10.1109/54.655177]
A paper explaining the move to "island-style" FPGAs.
Why do we use clusters with multiple LUTs?

Vaughn Betz, Jonathan Rose, and Alexander Marquardt.
Architecture and CAD for Deep-Submicron FPGAs.
Kluwer Academic Publishers, 1999.
[bibtex entry]
[URL]
Classic book on FPGA architecture and CAD. Describes VPR and island-style
FPGAs. While the technology is dated, this book provides the best single
introduction to FPGA organization and implementation issues as well as a
description of the popular clustering, placement, and routing algorithms
used for physical mapping of designs to FPGAs.

André DeHon.
Balancing Interconnect and Computation in a Reconfigurable Computing Array
(or, why you don't really want 100% LUT utilization).
In Proceedings of the International Symposium on Field-Programmable Gate
Arrays, pp. 69--78, February, 1999.
[bibtex entry]
[DOI 10.1145/296399.296431]
Since interconnect is the dominant area (and delay and energy) contributor
on FPGAs, architectural optimizations which try to provide adequate
interconnect to use all the logic may be quite inefficient; this paper turns
the question around and asks how the two should be balanced together.
It provides a clean, parameterized formulation of this tradeoff.

Timothy Callahan and John Hauser and John Wawrzynek.
The Garp Architecture and C Compiler.
In IEEE Computer, Volume 33, Number 4,
pp. 62--69, 2000.
[bibtex entry]
[DOI 10.1109/2.839323]
Details one of the earliest architectures for using a reconfigurable array as a coprocessor attached to a microprocessor, including a compiler capable of automatically extracting application kernels for execution on the reconfigurable array.

Keith Underwood.
FPGAs vs. CPUs: Trends in Peak Floating-Point Performance.
In Proceedings of the International Symposium on Field-Programmable Gate
Arrays, pp. 171--180, February, 2004.
[bibtex entry]
[DOI 10.1145/968280.968305]
Article pointing out that FPGA floating-point performance was catching up with that of microprocessors and was on track to surpass microprocessor floating-point performance for many tasks.

Ian Kuon and Jonathan Rose.
Measuring the Gap Between FPGAs and ASICs.
In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
Volume 26,
Number 2,
pp. 203--215,
February, 2007.
[bibtex entry]
[DOI 10.1109/TCAD.2006.884574]
Modern effort to quantify the relative area, power, and delay of FPGAs compared to ASICs.

John Wawrzynek, David Patterson, Mark Oskin, Shih-Lien Lu, Christoforos Kozyrakis, James C. Hoe, Derek Chiou, and Krste Asanovic.
RAMP: Research Accelerator for Multiple Processors.
In IEEE Micro, Volume 27, Number 2,
pp. 46--57, 2007.
[bibtex entry]
[DOI 10.1109/MM.2007.39]
An important, modern reconfigurable platform for emulation and simulation. With the growth in FPGA capacity, this effort can contemplate the emulation of systems containing hundreds to thousands of processor cores, where each FPGA is modeling several processors.

Shih-Lien L. Lu and Peter Yiannacouras and Taeweon Suh and Rolf Kassa and Michael Konow.
A Desktop Computer with a Reconfigurable Pentium.
In Transactions on Reconfigurable Technology and Systems,
Volume 1,
Number 1,
March, 2008.
[bibtex entry]
[DOI 10.1145/1331897.1331901]
Demonstration that a Pentium processor can be implemented on less than half of a modern FPGA.