Timothy B. Lee

My friend Ray Lehman points to an intriguing opportunity to expand public access to government data: insurance regulation. The United States has a decentralized, state-based system for regulating the insurance industry. Insurance companies are required to disclose data on their premiums, claims, assets, and many other topics to state regulators for each state in which they do business. These data are then shared with the National Association of Insurance Commissioners (NAIC), a private, non-profit organization that aggregates them and then sells access to the combined database. Ray tells the story:

The major clients for the NAIC’s insurance data are market analytics firms like Charlottesville, Va.-based SNL Financial and insurance rating agency A.M. Best (Full disclosure: I have been, at different times, an employee at both firms) who repackage the information in a lucrative secondary market populated by banks, broker-dealers, asset managers and private investment funds. While big financial institutions make good use of the data, the rates charged by firms like Best and SNL tend to be well out of the price range of media and academic outlets who might do likewise.

And where a private stockholder interested in reading the financials of a company whose shares he owns can easily look up the company’s SEC filings, a private policyholder interested in, say, the reserves held by the insurer he has entrusted to protect his financial future…has essentially nowhere to turn.

However, Ray points out that the recently enacted Dodd-Frank legislation may change that, as it creates a new Federal Insurance Office. That office will collect data from state regulators and likely has the option to disclose those data to the general public. Indeed, Ray argues, the Freedom of Information Act may even require that the data be disclosed to anyone who asks. The statute is ambiguous enough that in practice it’s likely to be up to FIO director Michael McRaith to decide what to do with the data.

I agree with Ray that McRaith should make the data public. As several CITP scholars have argued, free bulk access to government data has the potential to create significant value for the public. These data could be of substantial value for journalists covering the insurance industry and academics studying insurance markets. And with some clever hacking, it could likely be made useful for consumers, who would have more information with which to evaluate the insurance companies in their state.

In my research on privacy problems in PACER, I spent a lot of time examining PACER documents. In addition to researching the problem of “bad” redactions, I was also interested in learning about the pattern of redactions generally. To this end, my software looked for two redaction styles. One is the “black rectangle” redaction method I described in my previous post. This method sometimes fails, but most of these redactions were done successfully. The more common method (around two-thirds of all redactions) involves replacing sensitive information with strings of XXs.

Out of the 1.8 million documents it scanned, my software identified around 11,000 documents that appeared to have redactions. Many of them could be classified automatically (for example “123-45-xxxx” is clearly a redacted Social Security number, and “Exxon” is a false positive) but I examined several thousand by hand.
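The classification logic can be sketched with a couple of regular expressions. This is an illustrative reconstruction, not the software actually used in the study; the patterns and category labels below are assumptions:

```python
import re

# Heuristic sketch: classify strings of Xs found in document text as
# likely redactions or false positives. These patterns are illustrative
# assumptions, not the study's actual rules.

REDACTED_SSN = re.compile(r"\b\d{3}-\d{2}-[xX]{4}\b|\b[xX]{3}-[xX]{2}-\d{4}\b")
WORD_WITH_XX = re.compile(r"\b\w*[xX]{2,}\w*\b")

def classify(snippet):
    """Return 'ssn', 'candidate', or 'ignore' for a text snippet."""
    if REDACTED_SSN.search(snippet):
        return "ssn"                      # e.g. "123-45-xxxx"
    for match in WORD_WITH_XX.finditer(snippet):
        token = match.group()
        # Ordinary words containing "xx" (e.g. "Exxon") are false
        # positives; pure runs of Xs look like redactions.
        if set(token.lower()) == {"x"}:
            return "candidate"            # flag for review by hand
    return "ignore"
```

Anything the automatic rules can't resolve falls into the "candidate" bucket for manual examination, which is roughly how the several thousand hand-checked documents would arise.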

Here is the distribution of the redacted documents I found.

Type of Sensitive Information       No. of Documents
Social Security number                          4315
Bank or other account number                     675
Address                                          449
Trade secret                                     419
Date of birth                                    290
Unique identifier other than SSN                 216
Name of person                                   129
Phone, email, IP address                          60
National security related                         26
Health information                                24
Miscellaneous                                     68
Total                                           6208

To reiterate the point I made in my last post, I didn’t have access to a random sample of the PACER corpus, so we should be cautious about drawing any precise conclusions about the distribution of redacted information in the entire PACER corpus.

Still, I think we can draw some interesting conclusions from these statistics. It’s reasonable to assume that the distribution of redacted sensitive information is similar to the distribution of sensitive information in general. That is, assuming that parties who redact documents do a decent job, this list gives us a (very rough) idea of what kinds of sensitive information can be found in PACER documents.

The most obvious lesson from these statistics is that Social Security numbers are by far the most common type of redacted information in PACER. This is good news, since it’s relatively easy to build software to automatically detect and redact Social Security numbers.
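A minimal sketch of such an automatic redactor, assuming the conventional nnn-nn-nnnn format (a production tool would need additional patterns, such as unhyphenated nine-digit runs, and guards against false positives):

```python
import re

# Sketch of automated SSN redaction. The hyphenated pattern is an
# assumption; real filings also contain other formats.

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssns(text):
    """Replace anything shaped like an SSN with a fixed placeholder."""
    return SSN.sub("XXX-XX-XXXX", text)
```

Because the pattern is so regular, this is one of the few categories where courts could plausibly screen filings automatically before they are published.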

Another interesting case is the “address” category. Almost all of the redacted items in this category—393 out of 449—appear in the District of Columbia District. Many of the documents relate to search warrants and police reports, often in connection with drug cases. I don’t know if the high rate of redaction reflects the different mix of cases in the DC District, or an idiosyncratic redaction policy voluntarily pursued by the courts and/or the DC police but not by officials in other districts. It’s worth noting that the redaction of addresses doesn’t appear to be required by the federal redaction rules.

Finally, there’s the category of “trade secrets,” which is a catch-all term I used for documents whose redactions appear to be confidential business information. Private businesses may have a strong interest in keeping this information confidential, but the public interest in such secrecy here is less clear.

To summarize: of the 6208 redacted documents, 4315 contain Social Security numbers that could be redacted automatically by software, 449 contain addresses whose redaction doesn’t seem to be required by the rules of procedure, and 419 contain “trade secrets” whose release will typically harm only the party that failed to redact them.

That leaves around 1000 documents that would expose risky confidential information if not properly redacted, or about 0.05 percent of the 1.8 million documents I started with. A thousand documents is worth taking seriously (especially given that there are likely to be tens of thousands in the full PACER corpus). The courts should take additional steps to monitor compliance with the redaction rules and sanction parties who fail to comply with them, and they should explore techniques to automate the detection of redaction failures in these categories.
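For readers checking the arithmetic, the residual figure works out as follows (a trivial calculation using only the numbers quoted above):

```python
# Checking the arithmetic from the summary above.
total_redacted = 6208
ssns, addresses, trade_secrets = 4315, 449, 419

residual = total_redacted - ssns - addresses - trade_secrets
print(residual)                     # 1025, the "around 1000" documents

# As a share of the 1.8 million documents scanned (roughly 0.05 percent):
print(residual / 1_800_000)
```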

But at the same time, a sense of perspective is important. This tiny fraction of PACER documents with confidential information in them is a cause for concern, but it probably isn’t a good reason to limit public access to the roughly 99.9 percent of documents that contain no sensitive information and may be of significant benefit to the public.

Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary’s PACER system offers the public online access to hundreds of millions of court records. The judiciary’s rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don’t always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.

To understand the problem, it helps to know a little bit about how computers represent graphics. The simplest image formats are bitmap or raster formats. These represent an image as an array of pixels, with each pixel’s color represented by a numeric value. The PDF format uses a different approach, known as vector graphics, which represents an image as a series of drawing commands: lines, rectangles, runs of text, and so forth.

Vector graphics have important advantages. Vector-based formats “scale up” gracefully, in contrast to raster images, which look “blocky” at high resolutions. Vector graphics also do a better job of preserving a document’s structure. For example, text in a PDF is represented by a sequence of explicit text-drawing commands, which is why you can cut and paste text from a PDF document but not from a raster format like PNG.

But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a “what you see is what you get” quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple “layers.” There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it’s been redacted, but the text is still “under” the box. And often extracting that information is a simple matter of cutting and pasting.

So how many PACER documents have this problem? We’re in a good position to study this question because we have a large collection of PACER documents: 1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles; it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, approximately 2000 contained redaction rectangles. (There were also about 3500 documents that were redacted by replacing text with strings of Xes. I also excluded documents that had been redacted by Carl Malamud before he donated them to our archive.)
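The detection idea can be sketched in a few lines. The study’s actual software was built on the CAM::PDF Perl library; the Python below is an illustrative reconstruction that scans a decoded, uncompressed content stream, and the size thresholds are assumptions. In PDF syntax, `0 0 0 rg` selects black fill, `x y w h re` defines a rectangle, and `f` fills it:

```python
import re

# Sketch of redaction-rectangle detection over a decoded PDF content
# stream. Real streams are usually compressed (FlateDecode) and must be
# decoded first; the width/height thresholds are illustrative guesses.

BLACK_RECT = re.compile(
    r"0 0 0 rg\s+"                                  # black nonstroking color
    r"([\d.]+) ([\d.]+) ([\d.]+) ([\d.]+) re\s+f"   # filled rectangle
)

def find_redaction_rects(content_stream):
    """Return (x, y, w, h) tuples for black rectangles wide and short
    enough to plausibly be covering a line of text."""
    rects = []
    for m in BLACK_RECT.finditer(content_stream):
        x, y, w, h = map(float, m.groups())
        if w > 30 and 5 < h < 30:                   # shaped like a redaction bar
            rects.append((x, y, w, h))
    return rects

# A toy stream: text drawn first, then a black bar painted over it.
stream = "BT /F1 12 Tf 72 700 Td (secret) Tj ET\n0 0 0 rg\n70 698 120 14 re f"
print(find_redaction_rects(stream))                 # [(70.0, 698.0, 120.0, 14.0)]
```

Note that in the toy stream the `(secret)` text is still present under the bar, which is exactly the failure mode described above.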

Next, my software checked whether these redaction rectangles overlapped with text. It identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of these (about 130) appear to be from commercial litigation, in which parties unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.

Implications

PACER reportedly contains about 500 million documents. We don’t have a random sample of PACER documents, so we should be careful about trying to extrapolate to the entire PACER corpus. Still, it’s safe to say there are thousands, and probably tens of thousands, of documents in PACER whose authors made unsuccessful attempts to conceal information.

It’s also important to note that my software may not be detecting every instance of redaction failures. If a PDF was created by scanning in a paper document (as opposed to generated directly from a word processor), then it probably won’t have a “text layer.” My software doesn’t detect redaction failures in this type of document. This means that there may be more than 194 failed redactions among the 1.8 million documents I studied.

A few weeks ago I wrote a letter to Judge Lee Rosenthal, chair of the federal judiciary’s Committee on Rules of Practice and Procedure, explaining this problem. In that letter I recommend that the courts themselves use software like mine to automatically scan PACER documents for this type of problem. In addition to scanning the documents they already have, the courts should make it a standard part of the process for filing new documents with the courts. This would allow the courts to catch these problems before the documents are made available to the public on the PACER website.

My code is available here. It’s experimental research code, not a finished product. We’re releasing it into the public domain using the CC0 license; this should make it easy for federal and state officials to adapt it for their own use. Court administrators who are interested in adapting the code for their own use are especially encouraged to contact me for advice and assistance. The code relies heavily on the CAM::PDF Perl library, and I’m indebted to Chris Dolan for his patient answers to my many dumb questions.

Getting Redaction Right

So what should litigants do to avoid this problem? The National Security Agency has a good primer on secure redaction. The approach it recommends (completely deleting sensitive information in the original word processing document, replacing it with innocuous filler such as strings of XXes as needed, and then converting it to a PDF) is the safest. The NSA primer also explains how to check for other potentially sensitive information that might be hidden in a document’s metadata.

Of course, there may be cases where this approach isn’t feasible because a litigant doesn’t have the original word processing document or doesn’t want the document’s layout to be changed by the redaction process. Adobe Acrobat’s redaction tool has worked correctly when we’ve used it, and Adobe probably has the expertise to do it correctly. There may be other tools that work correctly, but we haven’t had an opportunity to experiment with them so we can’t say which ones they might be.

Regardless of the tool used, it’s a good idea to take the redacted document and double-check that the information was removed. An easy way to do this is to simply cut and paste the “redacted” content into another document. If the redaction succeeded, no text should be transferred. This method will catch most, but not all, redaction failures. A more rigorous check is to remove the redaction rectangles from the document and manually observe what’s underneath them. One of the scripts I’m releasing today, called remove_rectangles.pl, does just that. In its current form, it’s probably not user-friendly enough for non-programmers to use, but it would be relatively straightforward for someone (perhaps Adobe or the courts) to build a user-friendly version that ordinary users could use to verify that the document they just attempted to redact actually got redacted.
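The cut-and-paste check can also be scripted. The sketch below assumes the text layer has already been extracted (for example with a library such as pypdf, via `PdfReader(...).pages[n].extract_text()`); the function name and term list are hypothetical:

```python
# Sketch of the cut-and-paste check in code form: after redacting,
# extract the document's text layer and scan it for strings that were
# supposed to be removed. The term list here is hypothetical.

def redaction_leaks(extracted_text, sensitive_terms):
    """Return the sensitive terms still present in the text layer.
    An empty result is necessary but not sufficient: scanned pages
    with no text layer will always pass this check."""
    haystack = extracted_text.lower()
    return [t for t in sensitive_terms if t.lower() in haystack]

# Text pulled from a "redacted" PDF whose black box only covered the SSN:
leaked = redaction_leaks("Name: J. Doe  SSN: 123-45-6789", ["123-45-6789"])
print(leaked)                      # ['123-45-6789'] -> redaction failed
```

Like manual cut-and-paste, this catches most but not all failures, which is why removing the rectangles and looking underneath remains the more rigorous check.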

One approach we don’t endorse is printing the document out, redacting it with a black marker, and then re-scanning it to PDF format. Although this may succeed in removing the sensitive information, we don’t recommend this approach because it effectively converts the document into a raster-based image, destroying useful information in the process. For example, it will no longer be possible to cut and paste (non-redacted) text from a document that has been redacted in this way.

Bad redactions are not a new problem, but they are taking on a new urgency as PACER documents become increasingly available on the web. Correct redaction is not difficult, but it does require both knowledge and care by those who are submitting the documents. The courts have several important roles they should play: educating attorneys about their redaction responsibilities, providing them with software tools that make it easy for them to comply, and monitoring submitted documents to verify that the rules are being followed.

This research was made possible with the financial support of Carl Malamud’s organization, Public.Resource.Org.

On Tuesday Judge Denny Chin rejected a proposed settlement in the Google Book Search case. My write-up for Ars Technica is here.

The question everyone is asking is what comes next. The conventional wisdom seems to be that the parties will go back to the bargaining table and hammer out a third iteration of the settlement. It’s also possible that the parties will try to appeal the rejection of the current settlement. Still, in case anyone at Google is reading this, I’d like to make a pitch for the third option: litigate!

Google has long been at the forefront of efforts to shape copyright law in ways that encourage innovation. When the authors and publishers first sued Google back in 2005, I was quick to defend the scanning of books under copyright’s fair use doctrine. And I still think that position is correct.

Unfortunately, in 2008 Google saw an opportunity to strike a separate truce with the publishing industry that placed Google at the center of the book business and left everyone else out in the cold. Because of the peculiarities of class action law, the settlement would have given Google the legal right to use hundreds of thousands of “orphan” works without actually getting permission from their copyright holders. Competitors who wanted the same deal would have had no realistic way of getting it. Googlers are a smart bunch, and so they took what was obviously a good deal for them even though it was bad for fair use and online innovation.

Now the deal is no longer on the table, and it’s not clear if it can be salvaged. Judge Chin suggested that he might approve a new, “opt-in” settlement. But switching to an opt-in rule would undermine the very thing that made the deal so appealing to Google in the first place: the freedom to incorporate works whose copyright status was unclear. Take that away, and it’s not clear that Google Book Search can exist at all.

Moreover, I think the failure of the settlement may strengthen Google’s fair use argument. Fair use exists as a kind of safety valve for the copyright system, to ensure that it does not damage free speech, innovation, and other values. Although formally speaking judges are supposed to run through the famous four-factor test to determine what counts as a fair use, in practice an important factor is whether the judge perceives the defendant as having acted in good faith. Google has now spent three years looking for a way to build its Book Search project using something other than fair use, and has come up empty. This underscores the stakes of the fair use fight: if Judge Chin ruled against Google’s fair use argument, it would mean that it is effectively impossible to build a book search engine as comprehensive as the one Google has built. That outcome doesn’t seem consistent with the constitution’s command that copyright promote the progress of science and the useful arts.

In any event, Google may not have much choice. If it signs an “opt-in” settlement with the Authors Guild and the Association of American Publishers, it’s likely to face a fresh round of lawsuits from other copyright holders who aren’t members of those organizations — and they might not be as willing to settle for a token sum. So if Google thinks its fair use argument is a winner, it might as well test it now before it’s paid out any settlement money. And if it’s not, then this business might be too expensive for Google to be in at all.

As promised, the official Freedom to Tinker predictions for 2011. These predictions are the result of discussions that included myself, Joe Hall, Steve Schultze, Wendy Seltzer, Dan Wallach, and Harlan Yu, but note that we don’t individually agree with every prediction.

DRM technology will still fail to prevent widespread infringement. In a related development, pigs will still fail to fly.

Copyright and patent issues will continue to be stalemated in Congress, with no major legislation on either subject.

Momentum will grow for HTTPS by default, with several major websites adding HTTPS support. Work will begin on adding HTTPS-by-default support to Apache.

Despite substantial attention by Congress to online privacy, the FTC won’t be granted authority to mandate Do Not Track compliance.

Some advertising networks and third-party Web services will begin to voluntarily respect the Do Not Track header, which will be supported by all the major browsers. However, sites will have varying interpretations of what the DNT header requires, leading to accusations that some purportedly DNT-respecting sites are not fully DNT-compliant.

Congress will pass an electronic privacy bill along the lines of the principles set out by the Digital Due Process Coalition.

The seemingly N^2 patent lawsuits among all the major smartphone players will be resolved through a grand cross-licensing bargain, cut in a dark, smoky room, whose terms will only be revealed through some congratulatory emails that leak to the press. None of these lawsuits will get anywhere near a courtroom.

Android smartphones will continue gaining market share, mostly at the expense of BlackBerry and Windows Mobile phones. However, Android’s gains will mostly be at the low end of the market; the iPhone will continue to outsell any single Android smartphone model by a wide margin.

2011 will see the outbreak of the first massive botnet/malware that attacks smartphones, most likely iPhone or Android models running older software than the latest and greatest. If Android is the target, it will lead to aggressive finger-pointing, particularly given how many users are presently running Android software that’s a year or more behind Google’s latest—a trend that will continue in 2011.

Mainstream media outlets will continue building custom “apps” to present their content on mobile devices. They’ll fall short of expectations and fail to reverse the decline of any magazines or newspapers.

At year’s end, the district court will still not have issued a final judgment on the Google Book Search settlement.

The market for Internet set-top boxes like Google TV and Apple TV will continue to be chaotic throughout 2011, with no single device taking a decisive market share lead. The big winners will be online services like Netflix, Hulu, and Pandora that work with a wide variety of hardware devices.

Online sellers with device-specific consumer stores (Amazon for Kindle books, Apple for iPhone/iPad apps, Microsoft for Xbox Live, etc.) will come under antitrust scrutiny, and perhaps even be dragged into court. Nothing will be resolved before the end of 2011.

With electronic voting machines beginning to wear out but budgets tight, there will be much heated discussion of electronic voting, including antitrust concern over the e-voting technology vendors. But there will be no fundamental changes in policy. The incumbent vendors will continue to charge thousands of dollars for products that cost them a tiny fraction of that to manufacture.

Pressure will continue to mount on election authorities to make it easier for overseas and military voters to cast votes remotely, despite all the obvious-to-everybody-else security concerns. While counties with large military populations will continue to conduct “pilot” studies with Internet voting, with grandiose claims of how they’ve been “proven” secure because nobody bothered to attack them, very few military voters will cast actual ballots over the Internet in 2011.

In contrast, where domestic absentee voters are permitted to use remote voting systems (e.g., systems that transmit blank ballots that the voter returns by mail) voters will do so in large numbers, increasing the pressure to make remote voting easier for domestic voters and further exacerbating security concerns.

At least one candidate for the Republican presidential nomination will express concern about the security of electronic voting machines.

Multiple Wikileaks alternatives will pop up, and pundits will start to realize that mass leaks are enabled by technology trends, not just by one freaky Australian dude.

The RIAA and/or MPAA will be sued over their role in the government’s actions to reassign DNS names owned by allegedly unlawful web sites. Even if the lawsuit manages to get all the way to trial, there won’t be a significant ruling against them.

Copyright claims will be asserted against players even further removed from underlying infringement than Internet/online Service Providers: domain name system participants, ad and payment networks, and upstream hosts. Some of these claims will win at the district court level, mostly on default judgments, but appeals will still be pending at year’s end.

A distributed naming system for Web/broadcast content will gain substantial mindshare and measurable US usage after the trifecta of attacks on Wikileaks DNS, COICA, and further attacks on privacy-preserving or anonymous registration in the ICANN-sponsored DNS. It will go even further in another country.

ICANN still will not have introduced new generic TLDs.

The FCC’s recently-announced network neutrality rules will continue to attract criticism from both ends of the political spectrum, and will be the subject of critical hearings in the Republican House, but neither Congress nor the courts will overturn the rules.

The tech policy world will continue debating the Comcast/Level 3 dispute, but Level 3 will continue paying Comcast to deliver Netflix content, and the FCC won’t take any meaningful actions to help Level 3 or punish Comcast.

Comcast and other cable companies will treat the Comcast/Level 3 dispute as a template for future negotiations, demanding payments to terminate streaming video content. As a result, the network neutrality debate will increasingly focus on streaming high-definition video, and legal academia will become a lot more interested in the economics of Internet interconnection.

We’re running a little behind this year, but as we do every year, we’ll review the predictions we made for 2010. Below you’ll find our predictions from 2010 in italics, and the results in ordinary type. Please notify us in the comments if we missed anything.

(1) DRM technology will still fail to prevent widespread infringement. In a related development, pigs will still fail to fly.

(2) Federated DRM systems, such as DECE and KeyChest, will not catch on.

Work on DECE (now renamed UltraViolet) continues to roll forward, with what appears to be broad industry support. It remains to be seen if those devices will actually work well, but the format seems to have at least “caught on” among industry players. We haven’t been following this market too closely, but given that KeyChest seems to mostly be mentioned as an also-ran in UltraViolet stories, its chances don’t look as good. Verdict: Mostly wrong.

(3) Content providers will crack down on online sites that host unlicensed re-streaming of live sports programming. DMCA takedown notices will be followed by a lawsuit claiming actual knowledge of infringing materials and direct financial benefits.

Like their non-live brethren, live streaming sites like Justin.tv have received numerous DMCA takedown notices for copyrighted content. At the time of this prediction, we were unaware of the lawsuit against Ustream by a boxing promotional company, which began in August 2009. Nonetheless, the trend has continued. In the UK, there was an active game of cat-and-mouse between sports teams and illicit live restreaming sources for football (ahem: soccer) and cricket, which make much of their revenue from selling tickets to live matches. In some cases, a number of pubs were temporarily closed when their licenses were suspended in the face of complaints from content providers. In the US, Zuffa, the parent company of the mixed martial arts production company Ultimate Fighting Championship, sued when a patron at a Boston bar connected his laptop to one of the bar’s TVs to stream a UFC fight from an illicit site (Zuffa is claiming $640k in damages). In July, Zuffa subpoenaed the IP addresses of people uploading its content. And last week UFC sued Justin.tv directly for contributory and vicarious infringement, inducement, and other claims (RECAP docket). Verdict: Mostly right.

(4) Major newspaper content will continue to be available online for free (with ads) despite cheerleading for paywalls by Rupert Murdoch and others.

Early last year, the New York Times announced its intention to introduce a paywall in January 2011, and that plan still seems to be on track, but it didn’t actually happen in 2010. The story is the same at the Philly Inquirer, which is considering a paywall but hasn’t put one in place. The Wall Street Journal was behind a paywall already. Other major papers, including the Los Angeles Times, the Washington Post, and USA Today, seem to be paywall-free. The one major paper we could find that did go behind a paywall is the Times of London, which did so in July, with predictably poor results. Verdict: Mostly right.

(5) The Supreme Court will strike down pure business model patents in its Bilski opinion. The Court will establish a new test for patentability, rather than accepting the Federal Circuit’s test. The Court won’t go so far as to ban software patents, but the implications of the ruling for software patents will be unclear and will generate much debate.

The Supreme Court struck down the specific patent at issue in the case, but it declined to invalidate business method patents more generally. It also failed to articulate a clear new test. The decision did generate plenty of debate, but that went without saying. Verdict: Wrong.

(6) Patent reform legislation won’t pass in 2010. Calls for Congress to resolve the post-Bilski uncertainty will contribute to the delay.

Another prediction that works every year. Verdict: Right.

(7) After the upcoming rulings in Quon (Supreme Court), Comprehensive Drug Testing (Ninth Circuit or Supreme Court) and Warshak (Sixth Circuit), 2010 will be remembered as the year the courts finally extended the full protection of the Fourth Amendment to the Internet.

The Supreme Court decided Quon on relatively narrow grounds and deferred on the Fourth Amendment questions on electronic privacy, and the Ninth Circuit in Comprehensive Drug Testing dismissed the lower court's privacy-protective guidelines for electronic searches. However, the big privacy decision of the year was in Warshak, where the Sixth Circuit ruled strongly in favor of the privacy of remotely stored e-mail. Paul Ohm said of the decision: “It may someday be seen as a watershed moment in the extension of our Constitutional rights to the Internet.” Verdict: Mostly right.

(8) Fresh evidence will come to light of the extent of law enforcement access to mobile phone location-data, intensifying the debate about the status of mobile location data under the Fourth Amendment and electronic surveillance statutes. Civil libertarians will call for stronger oversight, but nothing will come of it by year’s end.

Even though we didn’t learn anything significant and new about the extent of government access to mobile location data, the debate around “cell-site” tracking privacy certainly intensified, in Congress, in the courts and in the public eye. The issue gained significant public attention through a trio of pro-privacy victories in the federal courts and Congress held a hearing on ECPA reform that focused specifically on location-based services. Despite the efforts of the Digital Due Process Coalition, no bills were introduced in Congress to reform and clarify electronic surveillance statutes. Verdict: Mostly right.

(9) The FTC will continue to threaten to do much more to punish online privacy violations, but it won’t do much to make good on the threats.

As a student of the FTC’s Chief Technologist, I’m not touching this one with a ten-foot pole.

(10) The new Apple tablet will be gorgeous but expensive. It will be a huge hit only if it offers some kind of advance in the basic human interface, such as a really effective full-sized on-screen keyboard.

(11) The disadvantages of iTunes-style walled garden app stores will become increasingly evident. Apple will consider relaxing its restrictions on iPhone apps, but in the end will offer only rhetoric, not real change.

Apple’s iPhone faced increasingly strong competition from Google’s rival Android platform, and it’s possible this could be attributed to Google’s more liberal policies for allowing apps to run on Android devices. Still, iPhones and iPads continued to sell briskly, and we’re not aware of any major problems arising from Apple’s closed business model. Verdict: Wrong.

(12) Internet Explorer’s usage share will fall below 50 percent for the first time in a decade, spurred by continued growth of Firefox, Chrome, and Safari.

There’s no generally accepted yardstick for browser usage share, because there are so many different ways to measure it. But Wikipedia has helpfully aggregated browser usage share statistics. All five metrics listed there show the usage share falling by between 5 and 10 percent over the last year, with current values ranging from 41 to 61 percent. The mean of these statistics is 49.5 percent, and the median is 46.94 percent. Verdict: Right.
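The aggregation above is simple to reproduce. Here's a minimal sketch with Python's statistics module; the five figures below are illustrative placeholders chosen to land on the aggregates quoted in the text, not the actual Wikipedia numbers.

```python
from statistics import mean, median

# Hypothetical IE usage-share figures (percent) from five different metrics.
# These are stand-in values, not the real Wikipedia data.
ie_share = [41.0, 44.0, 46.94, 54.56, 61.0]

print(f"mean:   {mean(ie_share):.2f}")    # 49.50
print(f"median: {median(ie_share):.2f}")  # 46.94
```

Note that the median sits below 50 percent even though two of the five metrics are above it, which is why the two summary statistics can point in different directions for a prediction like this one.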

(13) Amazon and other online retailers will be forced to collect state sales tax in all 50 states. This will have little impact on the growth of their business, as they will continue to undercut local bricks-and-mortar stores on prices, but it will remove their incentive to build warehouses in odd places just to avoid having to collect sales tax.

(14) Mobile carriers will continue locking consumers in to long-term service contracts despite the best efforts of Google and the handset manufacturers to sell unlocked phones.

Google’s experiment selling the Nexus One directly to consumers via the web ended in failure after about four months. T-Mobile, traditionally the nation’s least restrictive national wireless carrier, recently made it harder for consumers to find its no-contract “Even More Plus” plans. It’s still possible to get an unlocked phone if you really want one, but you have to pay a hefty premium, and few consumers are bothering. Verdict: Right.

(15) Palm will die, or be absorbed by Research In Motion or Microsoft.

This prediction was almost right. Palm’s Web OS didn’t catch on, and in April the company was acquired by a large technology firm. However, that firm was HP, not RIM or Microsoft. Verdict: Half right.

(16) In July, when all the iPhone 3G early adopters are coming off their two-year lock-in with AT&T, there will be a frenzy of Android and other smartphone devices competing for AT&T’s customers. Apple, no doubt offering yet another version of the iPhone at the time, will be forced to cut its prices, but will hang onto its centralized app store. Android will be the big winner in this battle, in terms of gained market share, but there will be all kinds of fragmentation, with different carriers offering slightly different and incompatible variants on Android.

Almost everything we predicted here happened. The one questionable prediction is the price cut, but we’re going to say that this counts. Verdict: Right.

(17) Hackers will quickly sort out how to install their own Android builds on locked-down Android phones from all the major vendors, leading to threatened or actual lawsuits but no successful legal action taken.

(18) Twitter will peak and begin its decline as a human-to-human communication medium.

We’re not sure how to measure this prediction, but Twitter recently raised another $200 million in venture capital and its users exchanged 250 billion tweets in 2010. That doesn’t look like decline to us. Verdict: Wrong.

(19) A politician or a candidate will commit a high-profile “macaca”-like moment via Twitter.

We can’t think of any good examples of high-profile cases that severely affected a politician’s prospects in the 2010 elections, like the “macaca” comment did to George Allen’s 2006 Senate campaign. However, there were a number of more low-profile gaffes, including Sarah Palin’s call for peaceful Muslims to “refudiate” the “Ground Zero Mosque” (the New Oxford American Dictionary named refudiate its word of the year), then-Senator Chris Dodd’s staff mis-tweeting inappropriate comments, and a software glitch that caused the U.S. embassy in Beijing to tweet that the air quality one day was “crazy bad”. Verdict: Mostly wrong.

(20) Facebook customers will become increasingly disenchanted with the company, but won’t leave in large numbers because they’ll have too much information locked up in the site.

In May 2010, Facebook once again changed its privacy policy to make more Facebook user information available to more people. On two occasions, Facebook has faced criticism for leaking user data to advertisers. But the site doesn’t seem to have declined in popularity. Verdict: Right.

(21) The fashionable anti-Internet argument of 2010 will be that the Net has passed its prime, supplanting the (equally bogus) 2009 fad argument that the Internet is bad for literacy.

Wired declared the web dead back in August. Is that the same thing as saying the Net has passed its prime? Bogus arguments all sound the same to us. Verdict: Mostly right.

(22) One year after the release of the Obama Administration’s Open Government Directive, the effort will be seen as a measured success. Agencies will show eagerness to embrace data transparency but will find the mechanics of releasing datasets to be long and difficult. Privacy (how to deal with personal information available in public data) will be one major hurdle.

Many people are calling this open government’s “beta period.” Federal agencies took the landmark step in January by releasing their first “high-value” datasets on Data.gov, but some advocates say these datasets are not “high value” enough. Agencies also published their plans for open government—some were better than others—and implementation of these promises has indeed been incremental. Privacy has been an issue in many cases, but it’s often difficult to know the reasons why an agency decides not to release a dataset. Verdict: Mostly right.

(23) The Open Government agenda will be the bright spot in the Administration’s tech policy, which will otherwise be seen as a business-as-usual continuation of past policies.

As we noted above, the Obama administration has had a pretty good record on open government issues. Probably the most controversial tech policy change has been the FCC’s adoption of new network neutrality rules. These weren’t exactly a continuation of Bush administration policies, but they also didn’t go as far as many activist groups wanted. And we can’t think of any other major tech policy changes. Verdict: Mostly right.

In my last post I told a story about the Level 3/Comcast dispute that portrays Comcast in a favorable light. Now here’s another story that casts Comcast as the villain.

Story 2: Comcast Abuses Its Market Power

As Steve explained, Level 3 is an “Internet Backbone Provider.” Level 3 has traditionally been considered a tier 1 provider, which means that it exchanges traffic with other tier 1 providers without money changing hands, and bills everyone else for connectivity. Comcast, as a non-tier 1 provider, has traditionally paid Level 3 to carry its traffic to places Comcast’s own network doesn’t reach directly.

Steve is right that the backbone market is highly competitive. I think it’s worth unpacking why this is in a bit more detail. Let’s suppose that a Comcast user wants to download a webpage from Yahoo!, and that both are customers of Level 3. So Yahoo! sends its bits to Level 3, which passes them along to Comcast. And traditionally, Level 3 would bill both Yahoo! and Comcast for the service of moving data between them.

It might seem like Level 3 has a lot of leverage in a situation like this, so it’s worth considering what would happen if Level 3 tried to jack up its prices. There are reportedly around a dozen other tier 1 providers that exchange traffic with Level 3 on a settlement-free basis. This means that if Level 3 over-charges Comcast for transit, Comcast can go to one of Level 3’s competitors, such as Global Crossing, and pay it to carry its traffic to Level 3’s network. And since Global Crossing and Level 3 are peers, Level 3 gets nothing for delivering traffic to Global Crossing that’s ultimately bound for Comcast’s network.

A decade ago, when Internet Service Retailers (to use Steve’s terminology) were much smaller than backbone providers, that was the whole story. The retailers didn’t have the resources to build their own global networks, and their small size meant they had relatively little bargaining power against the backbone providers. So the rule was that Internet Service Retailers charged their customers for Internet access, and then passed some of that revenue along to the backbone providers that offered global connectivity. There may have been relatively little competition in the retailer market, but this didn’t have much effect on the overall structure of the Internet because no single retailer had enough market power to go toe-to-toe with the backbone providers.

A decade of consolidation and technological progress has radically changed the structure of the market. About 16 million households now subscribe to Comcast’s broadband service, accounting for at least 20 percent of the US market. This means that a backbone provider that doesn’t offer fast connectivity to Comcast’s customers will be putting itself at a significant disadvantage compared with companies that do. Comcast still needs access to Level 3’s many customers, of course, but Level 3 needs Comcast much more than it needed any single Internet retailer a decade ago.

Precedent matters in any negotiated relationship. You might suspect that you’re worth a lot more to your boss than what he’s currently paying you, but by accepting your current salary when you started the job you’ve demonstrated you’re willing to work for that amount. So until something changes the equilibrium (like a competing job offer), your boss has no particular incentive to give you a raise. One strategy for getting a raise is to wait until the boss asks you to put in extra hours to finish a crucial project, and then ask for the raise. In that situation, not only does the boss know he can’t lose you, but he knows that you know it, and therefore that you aren’t likely to back down.

Comcast seems to have pursued a similar strategy. If Comcast had simply approached Level 3 and demanded that Level 3 start paying Comcast, Level 3 would have assumed Comcast was bluffing and said no. But when Level 3 won the Netflix contract, Level 3 suddenly needed a rapid and unexpected increase in connectivity to Comcast. And Comcast bet, correctly as it turned out, that Level 3 was so desperate for that additional capacity that it would have no choice but to pay Comcast for the privilege.

If Comcast’s gambit becomes a template for future negotiations between backbone providers and broadband retailers, it could represent a dramatic change in the economics of the Internet. This is because it’s much harder for a backbone provider to route around a retailer than vice versa. As we’ve seen, Comcast can get to Level 3’s customers by purchasing transit from some other backbone provider. But traffic bound for Comcast’s residential customers has to go through Comcast’s network. And Level 3’s major customers—online content providers like Netflix—aren’t going to pay for transit services that don’t reach 20 percent of American households. So Level 3 is in a weak bargaining position.

In the long run, this could be very bad news for online businesses like Netflix, because its bandwidth costs would no longer be constrained by the robust competition in the backbone market. Netflix apparently got a good deal from Level 3 in the short run. But if a general practice emerges of backbone providers paying retailers for interconnection, those costs are going to get passed along to the backbone providers’ own customers, e.g. Netflix. And once the precedent is established that retailers get to charge backbone providers for connectivity, their ability to raise prices may be much less constrained by competition.

Conclusions

So which story is right? If I knew the answer to that I wouldn’t have wasted your time with two stories. And it’s worth noting that these stories are not mutually exclusive. It’s possible that Comcast has been looking for an opportunity to shift the balance of power with its transit providers, and the clumsiness of Level 3’s CDN strategy gave them an opportunity to do so in a way that minimizes the fallout.

One sign that story #2 might be wrong is that content providers haven’t raised much of a stink. If the Comcast/Level 3 dispute represented a fundamental shift in power toward broadband providers, you’d expect the major content providers to try to form a united front against them. Yet there’s nothing about the dispute on (for example) the Google Public Policy blog, and I haven’t seen any statements on the subject from other content providers. Presumably they’re following this dispute more closely than I am, and understand the business issues better than I do, so if they’re not concerned that suggests maybe I shouldn’t be either.

A final thought: one place where I’m pretty sure Level 3 is wrong is in labeling this a network neutrality dispute. Although the dispute was precipitated by Netflix’s decision to switch CDN providers, there’s little reason to think Comcast is singling out Netflix traffic for special treatment. In story #1, Comcast would be happy to deliver Netflix (or any other) content via a well-designed CDN; it just objects to having its bandwidth wasted. In story #2, Comcast’s goal is to collect payments for all inbound traffic, not just traffic from Netflix. Either way, Comcast hasn’t done anything that violates leading network neutrality proposals. Comcast is not discriminating, and hasn’t threatened to discriminate, against any particular type of traffic. And no, declining to upgrade a peering link doesn’t undermine network neutrality.

Like Steve and a lot of other people in the tech policy world, I’ve been trying to understand the dispute between Level 3 and Comcast. The combination of technical complexity and commercial secrecy has made the controversy almost impenetrable for anyone outside of the companies themselves. And of course, those who are at the center of the action have a strong incentive to mislead the public in ways that make their own side look better.

So building on Steve’s excellent post, I’d like to tell two very different stories about the Level 3/Comcast dispute. One puts Level 3 in a favorable light and the other slants things more in Comcast’s favor.

Story 1: Level 3 Abuses Its Customer Relationships

As Steve explained, a content delivery network (CDN) is a network of caching servers that help content providers deliver content to end users. Traditionally, Netflix has used CDNs like Akamai and Limelight to deliver its content to customers. The dispute began shortly after Level 3 beat out these CDN providers for the Netflix contract.

The crucial thing to note here is that CDNs can save Comcast, and other broadband retailers, a boatload of money. In a CDN-free world, a content provider like Netflix would send thousands of identical copies of its content to Comcast customers, consuming Comcast’s bandwidth and maybe even forcing Comcast to pay transit fees to its upstream providers.

Akamai reportedly installs its caching servers at various points inside the networks of retailers like Comcast. Only a single copy of the content is sent from the Netflix server to each Akamai cache; customers then access the content from the caches. Because these caches are inside Comcast’s network, Comcast never has to pay transit fees to receive the content they serve. And because there are many caches distributed throughout Comcast’s network (to improve performance), content delivered by them is less likely to consume bandwidth on expensive long-haul connections.
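The economics of in-network caching come down to a simple invariant: however many subscribers request a piece of content, the expensive origin fetch happens once. Here's a minimal, purely illustrative sketch of that idea (the class and method names are hypothetical, not any real CDN's API):

```python
# Toy model of an in-network CDN cache: the first request for a piece of
# content triggers one fetch from the origin (the part that can incur
# transit costs); every subsequent request is served from the local cache.

class CacheNode:
    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # callable: content_id -> content
        self.store = {}            # content_id -> cached copy
        self.origin_fetches = 0    # stands in for transit cost incurred

    def get(self, content_id):
        if content_id not in self.store:
            # Cache miss: the one time traffic crosses the network boundary.
            self.store[content_id] = self.fetch_from_origin(content_id)
            self.origin_fetches += 1
        # Cache hit: the content stays inside the retailer's network.
        return self.store[content_id]

# A thousand subscribers streaming the same movie cost one origin fetch.
cache = CacheNode(lambda cid: f"<bytes of {cid}>")
for _ in range(1000):
    cache.get("movie-42")
print(cache.origin_fetches)  # 1
```

This is why where the cache sits matters so much in the dispute: in the Akamai model the miss traffic and the hit traffic both stay inside Comcast's network, while in Level 3's model every hit still arrives over the peering link.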

Now Level 3 wants to enter the CDN marketplace, but it decides to pursue a different strategy. For Akamai, deploying its servers inside of Comcast’s network saves both Comcast and Akamai money, because Akamai would otherwise have to pay a third party to carry its traffic to Comcast. But as a tier 1 provider, Level 3 doesn’t have to pay anyone for connectivity, and indeed in many cases third parties pay it for connectivity. Hence, placing its CDN servers inside the Level 3 network is not only easier for Level 3; in some cases it might actually generate extra revenue, as Level 3’s customers have to pay for the extra traffic.

This dynamic might explain the oft-remarked-upon fact that Comcast seems to be simultaneously a peer and a customer of Level 3. Comcast pays Level 3 to carry traffic to and from distant networks that Comcast’s own network does not reach—doing so is cheaper than building its own worldwide backbone network. But Comcast is less enthusiastic about paying Level 3 for traffic that originates from Level 3’s own network (known as “on-net” traffic).

And even if Comcast isn’t paying for Level 3’s CDN traffic, it’s still not hard to understand Comcast’s irritation. When two companies sign a peering agreement, the assumption is typically that each party is doing roughly half the “work” of hauling the bits from source to destination. But in this case, because the bits are being generated by Level 3’s CDN servers, the bits are traveling almost entirely over Comcast’s network.

Hauling traffic all the way from the peering point to Comcast’s customers will consume more of Comcast’s network resources than hauling traffic from Akamai’s distributed CDN servers did. And to add insult to injury, Level 3 apparently only gave Comcast a few weeks’ notice of the impending traffic spike. So faced with the prospect of having to build additional infrastructure to accommodate this new, less efficient method for delivering Netflix bits to Comcast customers, Comcast asked Level 3 to help cover the costs.

Of course, another way to look at this is to say that Comcast (and other retailers like AT&T and Time Warner) brought the situation on themselves by over-charging Akamai for connectivity. I’ve read conflicting reports about whether and how much Comcast has traditionally charged Akamai for access to its network (presumably these details are trade secrets), but some people have suggested that Comcast charges Akamai for bandwidth and cabinet space even when their servers are deep inside Comcast’s own network. If that’s true, it may be penny wise and pound foolish on Comcast’s part, because if Akamai is not able to win big customers like Netflix, then Comcast will have to pay to haul that traffic halfway across the Internet itself.

In my next post I’ll tell a different story that casts Comcast in a less flattering light.

Ed posted yesterday about Google’s bombshell announcement that it is considering pulling out of China in the wake of a sophisticated attack on its infrastructure. People more knowledgeable than me about China have weighed in on the announcement’s implications for the future of US-Sino relations and the evolution of the Chinese Internet. Rebecca MacKinnon, a China expert who will be a CITP visiting scholar beginning next month, says that “Google has taken a bold step onto the right side of history.” She has a roundup of Chinese reactions here.

One aspect of Google’s post that hasn’t received a lot of attention is Google’s statement that “only two Gmail accounts appear to have been accessed, and that activity was limited to account information (such as the date the account was created) and subject line, rather than the content of emails themselves.” A plausible explanation for this is provided by this article (via James Grimmelmann) at PC World:

Drummond said that the hackers never got into Gmail accounts via the Google hack, but they did manage to get some “account information (such as the date the account was created) and subject line.”

That’s because they apparently were able to access a system used to help Google comply with search warrants by providing data on Google users, said a source familiar with the situation, who spoke on condition of anonymity because he was not authorized to speak with the press.

Obviously, this report should be taken with a grain of salt since it’s based on a single anonymous source. But it fits a pattern identified by our own Jen Rexford and her co-authors in an excellent 2007 paper: when communications systems are changed to make it easier for US authorities to conduct surveillance, it necessarily increases the vulnerability of those systems to attacks by other parties, including foreign governments.

Rexford and her co-authors point to a 2006 incident in which unknown parties exploited vulnerabilities in Vodafone’s network to tap the phones of dozens of senior Greek government officials. According to news reports, these attacks were made possible because Greek telecommunications carriers had deployed equipment with built-in surveillance capabilities, but had not paid the equipment vendor, Ericsson, to activate this “feature.” This left the equipment in a vulnerable state. The attackers surreptitiously switched on the surveillance capabilities and used them to intercept the communications of senior government officials.

It shouldn’t surprise us that systems built to give law enforcement access to private communications could become vectors for malicious attacks. First, these interfaces are often backwaters in the system design. The success of any consumer product is going to depend on its popularity with customers. Therefore, a vendor or network provider is going to deploy its talented engineers to work on the public-facing parts of the product. It is likely to assign a smaller team of less-talented engineers to work on the law-enforcement interface, which is likely to be both less technically interesting and less crucial to the company’s bottom line.

Second, the security model of a law enforcement interface is likely to be more complex and less well-specified than the user-facing parts of the service. For the mainstream product, the security goal is simple: the customer should be able to access his or her own data and no one else’s. In contrast, determining which law enforcement officials are entitled to which information, and how those officials are to be authenticated, can become quite complex. Greater complexity means a higher likelihood of mistakes.

Finally, the public-facing portions of a consumer product benefit from free security audits from “white hat” security experts like our own Bill Zeller. If a publicly-facing website, cell phone network or other consumer product has a security vulnerability, the company is likely to hear about the problem first from a non-malicious source. This means that at least the most obvious security problems will be noticed and fixed quickly, before the bad guys have a chance to exploit them. In contrast, if an interface is shrouded in secrecy, and only accessible to law enforcement officials, then even obvious security vulnerabilities are likely to go unnoticed and unfixed. Such an interface will be a target-rich environment if a malicious hacker ever does get the opportunity to attack it.

This is an added reason to insist on rigorous public and judicial oversight of our domestic surveillance capabilities in the United States. There has been a recent trend, cemented by the 2008 FISA Amendments, toward law enforcement and intelligence agencies conducting eavesdropping without meaningful judicial (to say nothing of public) scrutiny. Last month, Chris Soghoian uncovered new evidence suggesting that government agencies are collecting much more private information than has been publicly disclosed. Many people, myself included, oppose this expansion of domestic surveillance on civil liberties grounds. But even if you’re unmoved by those arguments, you should still be concerned about these developments on national security grounds.

As long as these eavesdropping systems are shrouded in secrecy, there’s no way for “white hat” security experts to even begin evaluating them for potential security risks. And that, in turn, means that voters and policymakers will be operating in the dark. Programs that risk exposing our communications systems to the bad guys won’t be identified and shut down. Which means the culture of secrecy that increasingly surrounds our government’s domestic spying programs not only undermines the rule of law, it’s a danger to national security as well.

One sentiment I’ve seen a number of people express about our release of RECAP is illustrated by this comment here at Freedom to Tinker:

Technically impressive, but also shortsighted. There appears a socialistic cultural trend that seeks to disconnect individual accountability to ones choices. $.08 a page is hardly burdensome or profitable, and clearly goes to offset costs. If additional taxes are required to make up the shortfall RECAP seems likely to create, we all will pay more in general taxes even though only a small few ever access PACER.

Now, I don’t think anyone who’s familiar with my work would accuse me of harboring socialistic sympathies. RECAP has earned the endorsement of my colleague Jim Harper of the libertarian Cato Institute and Christopher Farrell of the conservative watchdog group Judicial Watch. Those guys are not socialists.

Still, there’s a fair question here: under the model we advocate, taxpayers might wind up picking up some of the costs currently being borne by PACER users. Why should taxpayers in general pay for a service that only a tiny fraction of the population will ever use?

I think there are two answers. The narrow answer is that this misunderstands where the costs of PACER come from. There are four distinct steps in the process of publishing a judicial record. First, someone has to create the document. This is done by a judge in some cases and by private litigants in others. Second, someone has to convert the document to electronic format. This is a small and falling cost, because both judges and litigants increasingly produce documents using word processors, so they’re digital from their inception. Third, someone has to redact the documents to ensure private information doesn’t leak out. This is supposed to be done by private parties when they submit documents, but they don’t always do what they’re supposed to, necessitating extra work by court personnel. Finally, the documents need to be uploaded to a website where they can be downloaded by the public.

The key thing to understand here is that the first three steps are things that the courts would be doing anyway if PACER didn’t exist. Court documents were already public records before PACER came onto the scene. Anyone could (and still can) drive down to the courthouse, request any unsealed document they want, and make as many photocopies as they wish. Moreover, even if documents weren’t public, the courts would likely still be transitioning to an electronic system for their internal use.

So this means that the only additional cost of PACER, beyond the activities the courts would be doing anyway, is the web servers, bandwidth, and personnel required to run the PACER web sites themselves. But RECAP users impose no additional load on PACER’s servers. Users download RECAP documents directly from the Internet Archive. So RECAP is entirely consistent with the principle that PACER users should pay for the resources they use.

I think there’s also a deeper answer to this question, which is that it misunderstands the role of the judiciary in a free society. The service the judiciary provides to the public is not the production of individual documents or even the resolution of individual cases. The service it provides is the maintenance of a comprehensive and predictable system of law that is the foundation for our civilization. You benefit from this system whether or not you ever appear in court because it gives you confidence that your rights will be protected in a fair and predictable manner. And in particular, you benefit from judicial transparency because transparency improves accountability. Even if you’re not personally interested in monitoring the judiciary for abuses, you benefit when other people do so.

This is something I take personally because I’ve done a bit of legal reporting myself. I obviously get some direct benefits from doing this—I sometimes get paid and it’s always fun to have people read my work. But I like to think that my writing about the law also benefits society at large by increasing public understanding and scrutiny of the judicial system. And charging for access to the law will be most discouraging to people like me who are trying to do more than just win a particular court case. Journalists, public interest advocates, academics, and the like generally don’t have clients they can bill for the expense of legal research, so PACER fees are a significant barrier.

There’s no conflict between a belief in free markets and a belief that everyone is entitled to information about the legal system that governs their lives. To the contrary, free markets depend on the “rules of the game” being fair and predictable. The kind of judicial transparency that RECAP hopes to foster only furthers that goal.

Freedom to Tinker is hosted by Princeton's Center for Information Technology Policy, a research center that studies digital technologies in public life. Here you'll find comment and analysis from the digital frontier, written by the Center's faculty, students, and friends.