Abstract

We present a thorough characterization of the access patterns in blogspace, which comprises a rich interconnected web of blog postings and comments by an increasingly prominent user community that collectively define what has become known as the blogosphere. Our characterization of over 35 million read, write, and management requests spanning a 28-day period is done at three different levels. The user view characterizes how individual users interact with blogosphere objects (blogs); the object view characterizes how individual blogs are accessed; the server view characterizes the aggregate access patterns of all users to all blogs. The more-interactive nature of the blogosphere leads to interesting traffic and communication patterns, which are different from those observed for traditional web content. We identify and characterize novel features of the blogosphere workload, and we show the similarities and differences between typical web server workloads and blogosphere server workloads. Finally, based on our main characterization results, we build a new synthetic blogosphere workload generator called GBLOT, which aims at mimicking closely a stream of requests originating from a population of blog users. Given the increasing share of blogspace traffic, realistic workload models and tools are important for capacity planning and traffic engineering purposes.