Apache Spark is a fast and general engine for large-scale data processing.

It is claimed to run programs up to 100 times faster than Hadoop MapReduce for in-memory workloads.

Everyone knows SQL. But traditional databases are not good at handling large amounts of data. Nevertheless, SQL is a good DSL for data processing, and it is much easier to understand Spark if you have seen a similar query implemented in SQL.
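To make that correspondence concrete, here is a hedged sketch (the table, columns, and data are invented for illustration): the SQL runs here against SQLite, and the comment shows how roughly the same aggregation might look as a Spark DataFrame operation.

```python
import sqlite3

# Hypothetical data. In Spark SQL, the same query could be written against a
# DataFrame roughly as: df.groupBy("dept").count()  (PySpark, an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ann", "eng"), ("Bob", "eng"), ("Cid", "ops")])

rows = conn.execute(
    "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)  # [('eng', 2), ('ops', 1)]
```

If you can read the SQL, the Spark chain is the same shape: group, then count.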

Erlang/OTP 18.0 is a new major release with new features, quite a few improvements, as well as some incompatibilities.
A non-functional but major change in this release is the change of license to the Apache License 2.0.

Some highlights of the release are:

Starting from 18.0, Erlang/OTP is released under the Apache License 2.0

erts: The time functionality has been extended. This includes a new API for
time, as well as “time warp” modes which alter the behavior when system time changes. You are strongly encouraged to use the new API instead of the old API based on erlang:now/0, which has been deprecated since it is a scalability bottleneck.
For more information, see the Time and Time Correction chapter of the ERTS User’s Guide: http://www.erlang.org/doc/apps/erts/time_correction.html
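The distinction the new API draws can be illustrated with a Python analogue (an assumption for illustration, not Erlang code): time.time() is system ("wall clock") time and can jump when the clock is corrected, while time.monotonic() behaves like erlang:monotonic_time/0 and never goes backwards, making it safe for measuring elapsed time.

```python
import time

wall_1 = time.time()       # system time; may warp on NTP/clock correction
mono_1 = time.monotonic()  # monotonic time; guaranteed never to go backwards

time.sleep(0.01)

mono_2 = time.monotonic()
assert mono_2 >= mono_1    # always holds, even if the system clock is reset
print(f"elapsed: {mono_2 - mono_1:.4f}s")
```

The "time warp" modes govern how the runtime reconciles these two clocks when the system clock changes underneath it.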

erts: Besides the API changes and time warp modes, many scalability and performance improvements regarding time management have been made. Examples are:

scheduler-specific timer wheels,

scheduler-specific BIF timer management,

parallel retrieval of monotonic time and system time on operating systems that support it.

erts: The previously introduced “eager check I/O” feature is now enabled by default.

erts/compiler: enhanced support for maps. Big maps now use a HAMT (Hash Array Mapped Trie) representation internally, which makes them more efficient. There is now also support for variables as map keys.

dialyzer: The -dialyzer() attribute can be used for suppressing warnings
in a module by specifying functions or warning options.
It can also be used for requesting warnings in a module.

ssl: Removed default support for SSL 3.0 and added a padding check for TLS 1.0 due to the POODLE vulnerability.

ssl: Removed default support for RC4 cipher suites, as they are considered too weak.

stdlib: Allow maps for supervisor flags and child specs

stdlib: New functions in ets:

take/2, which works the same as ets:delete/2 but also returns the deleted object(s).
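A rough Python analogue of what take/2 buys you (the dict here only illustrates the pattern, not ETS itself): delete and retrieve in one operation, instead of a lookup followed by a separate delete.

```python
# Like ets:take/2: remove the entry and get the stored object back in one step.
table = {"k1": "v1", "k2": "v2"}

value = table.pop("k1")

print(value)          # v1
print("k1" in table)  # False
```

In a concurrent table, collapsing the two steps also removes a race window between the lookup and the delete.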

A major holiday approaches in the United States (July 4th): a time when budget-puffing terror alerts are issued, fatal automobile accidents surge, and driving-while-intoxicated arrests jump, the usual marks of a US holiday.

If you spend some time with Erlang/OTP 18, you can greet your co-workers who survive the long weekend, albeit with frayed nerves from long proximity to family members and hangovers to boot, with some new tricks.

The aim of this blog post is to introduce open source business intelligence technologies and to explore data using open source tools like D3.js, DC.js, Node.js and MongoDB.

Over the span of this post we will see the importance of the various components that we are using and we will do some code based customization as well.

The Need for Visualization:

Visualization is the so-called front end of modern business intelligence systems. I have been around in quite a few big data architecture discussions, and to my surprise I found that most of the discussions are focused on the backend components: the repository, the ingestion framework, the data mart, the ETL engine, the data pipelines, and then some visualization.

I might be biased in favor of visualization technologies, as I have been working on them for a long time. Needless to say, visualization is as important as any other component of a system. I hope most of you will agree with me on that. Visualization is instrumental in inferring trends from the data, spotting outliers and making sense of data points.

What they say is right: a picture is indeed worth a thousand words.

The components of our analysis and their function:

D3.js: A JavaScript-based visualization engine which renders interactive charts and graphs based on the data.

DC.js: A JavaScript-based wrapper library for D3.js which makes plotting the charts a lot easier.

Crossfilter.js: A JavaScript-based data manipulation library. Works splendidly with DC.js. Enables two-way data binding.

Node.js: Our powerful server which serves data to the visualization engine and also hosts the web pages and JavaScript libraries.

MongoDB: The resident NoSQL database which serves as a fantastic data repository for our project.
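The "two way data binding" Crossfilter.js enables can be sketched in a Python analogue (records and field names invented for illustration): filtering on one dimension immediately reshapes the grouped totals that every chart displays.

```python
# Hypothetical records; a crossfilter "dimension" filters all groups at once.
records = [
    {"city": "NY", "amount": 10},
    {"city": "SF", "amount": 5},
    {"city": "NY", "amount": 7},
]

def group_by(rows, key):
    # Sum the "amount" field per distinct value of `key`.
    out = {}
    for r in rows:
        out[r[key]] = out.get(r[key], 0) + r["amount"]
    return out

# Unfiltered totals per city.
print(group_by(records, "city"))   # {'NY': 17, 'SF': 5}

# Applying a filter on the "amount" dimension re-shapes every other group,
# which is the two-way binding DC.js charts rely on.
filtered = [r for r in records if r["amount"] >= 7]
print(group_by(filtered, "city"))  # {'NY': 17}
```

In the real stack, clicking a bar in one DC.js chart applies such a filter, and every other chart bound to the same crossfilter redraws.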

[I added links to the components.]
…

A very useful walk through of interactive data visualization using open source tools.

It does require a time investment on your part but you will be richly rewarded with skills, ideas and new ways of thinking about visualizing your data.

I don’t know that you will agree with Betsy’s conclusion but it is an interesting read.

…
Fourteen years ago the authors of the Agile Manifesto said unto us: all technical problems are people problems that manifest technically. In doing so they repeated what Peopleware’s DeMarco and Lister had said fourteen years before that. We cannot break the endless cycle of broken frameworks and buggy software by pretending that broken, homogenous [sic] communities can produce frameworks that meet the varied needs of a broad developer base. We have known this for three decades.
…

The “homogeneous community” in question is, of course, white males.

I have no idea if the founders of the languages she mentions are all white males or not. But for purposes of argument, let’s say that the founding communities in question are exclusively white males. And intentionally so.

OK, where is the comparison case of language development that demonstrates that a group more inclusive in terms of gender, race, sexual orientation, and religion would produce less broken frameworks and less buggy software, by some specified measure?

I understand the point that frameworks and code are currently broken and buggy, no argument there. No need to repeat that or come up with new examples.

The question that interests me, and I suspect would interest developers and customers alike, is: where are the frameworks or code that are less buggy because they were created by more inclusive communities?

Inclusion will sell itself, quickly, if the case can be made that inclusive communities produce more useful frameworks or less buggy code.

In making the case for inclusion, citing studies that groups are more creative when diverse isn’t enough. Point to the better framework or less buggy code created by a diverse community. That should not be hard to do, assuming such evidence exists.

Make no mistake, I think discrimination on the basis of gender, race, sexual orientation, religion, etc. is not only illegal, it is immoral. However, the case for non-discrimination is harmed by speculative claims of improved results that are not based on facts.

Where are those facts? I would love to be able to cite them.

PS: Flames will be deleted. With others, I fought gender/racial discrimination in organizing garment factories where the body heat of the workers was the only heat in the winter, only to be betrayed by a union more interested in dues than justice for workers. Defeating discrimination requires facts, not rhetoric. (Recall that it was Brown v. Board of Education that pioneered the use of social science data in education litigation. They offered facts, not opinions.)

Datasets published on the Web are accessed and experienced by consumers in a variety of ways, but little information about these experiences is typically conveyed. Dataset publishers many times lack feedback from consumers about how datasets are used. Consumers lack an effective way to discuss experiences with fellow collaborators and explore referencing material citing the dataset. Datasets as defined by DCAT are a collection of data, published or curated by a single agent, and available for access or download in one or more formats. The Dataset Usage Vocabulary (DUV) is used to describe consumer experiences, citations, and feedback about the dataset from the human perspective.

By specifying a number of foundational concepts used to collect dataset consumer feedback, experiences, and cite references associated with a dataset, APIs can be written to support collaboration across the Web by structurally publishing consumer opinions and experiences, and to provide a means for data consumers and producers to advertise and search for published open dataset usage.

From Status of This Document:

This is a draft document which may be merged with the Data Quality Vocabulary or remain as a standalone document. Feedback is sought on the overall direction being taken as much as the specific details of the proposed vocabulary.

This document specifies a processing model and syntax for general purpose inclusion. Inclusion is accomplished by merging a number of XML information sets into a single composite infoset. Specification of the XML documents (infosets) to be merged and control over the merging process is expressed in XML-friendly syntax (elements, attributes, URI references).
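Python's standard library ships a small XInclude processor, which makes the merging model easy to see. This sketch uses an in-memory loader in place of fetching the referenced resource (the document and href are invented for illustration):

```python
from xml.etree import ElementTree, ElementInclude

# A composite document: the xi:include element marks where another
# information set should be merged in.
doc = ElementTree.fromstring(
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter1.txt" parse="text"/>'
    '</book>'
)

def loader(href, parse, encoding=None):
    # Stand-in for resolving the href; a real loader would read the resource.
    return "Chapter one text."

# Merge the referenced content into the single composite infoset.
ElementInclude.include(doc, loader=loader)
print(doc.text)  # Chapter one text.
```

After processing, the xi:include element is gone and the included text has become part of the host document, which is exactly the "single composite infoset" the specification describes.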

XML’s promise of dynamic documents, composed from data stores, other documents, etc., does get realized, but not nearly as frequently as it should.

I’m looking for XML Inclusions (XInclude) to be another step away from documents as static containers.


In the context of a content-based music retrieval system or archiving digital audio data, genre-based classification of songs may serve as a fundamental step. In earlier attempts, researchers have described song content by a combination of different types of features. Such features include various frequency and time domain descriptors depicting signal aspects; perceptual aspects have also been combined with these. A listener perceives a song mostly in terms of its tempo (rhythm), periodicity, pitch and their variation, and based on those recognises the genre of the song. Motivated by this observation, in this work, instead of dealing with a wide range of features we have focused only on perceptual aspects like melody and rhythm. In order to do so, audio content is described based on pitch, tempo, amplitude variation pattern and periodicity. The dimensionality of the descriptor vector is reduced and finally, random sample consensus (RANSAC) is used as the classifier. Experimental results indicate the effectiveness of the proposed scheme.

A new approach to classification of music, but that’s all I can say since the content is behind a pay-wall.

One way to increase the accessibility of texts would be for tenure committees to not consider publications as “published” until they are freely available from the author’s webpage.

That one change could encourage authors to press for the right to post their own materials and to follow through with posting them as soon as possible.

Feel free to forward this post to members of your local tenure committee.

A test pilot has some very, very bad news about the F-35 Joint Strike Fighter. The pricey new stealth jet can’t turn or climb fast enough to hit an enemy plane during a dogfight or to dodge the enemy’s own gunfire, the pilot reported following a day of mock air battles back in January.

“The F-35 was at a distinct energy disadvantage,” the unnamed pilot wrote in a scathing five-page brief that War Is Boring has obtained. The brief is unclassified but is labeled “for official use only.”

The U.S. Air Force, Navy and Marine Corps — not to mention the air forces and navies of more than a dozen U.S. allies — are counting on the Lockheed Martin-made JSF to replace many if not most of their current fighter jets.

And that means that, within a few decades, American and allied aviators will fly into battle in an inferior fighter — one that could get them killed … and cost the United States control of the air.
…

A close friend recently said that I shouldn’t complain about vendors making money off of the government in return for little or no useful goods or services. He called it, “…breaking their rice bowls….”

Perhaps so but the result of thousands, if not hundreds of thousands, of people not speaking up when the government is billed for little or no useful goods or services is the $1 Trillion Lockheed Martin F-35 Flying Coffin.

Not only do such projects damage the military capability of the United States, they also degrade the military forces of every country that buys one of these buggy, flammable and easy-to-defeat aircraft.

I’m sure it can stand off and fire missiles with great accuracy, but so can a land-based cruise missile launcher. For a lot less money.

Foreign countries should be rushing to cancel orders for the Lockheed Martin F-35 Flying Coffin and invest in innovative military solutions. Highly sophisticated missile systems designed to degrade aircraft delivery platforms for example. Or electronic warfare and anti-aircraft missile defenses.

The jsonlite package provides a powerful JSON parser and generator that has become one of the standard methods for getting data in and out of R. We discuss some recent additions to the package, in particular support for streaming (large) data over http(s) connections. We then introduce the new mongolite package: a high-performance MongoDB client based on jsonlite. MongoDB (from “humongous”) is a popular open-source document database for storing and manipulating very big JSON structures. It includes a JSON query language and an embedded V8 engine for in-database aggregation and map-reduce. We show how mongolite makes inserting and retrieving R data to/from a database as easy as converting it to/from JSON, without the bureaucracy that comes with traditional databases. Users that are already familiar with the JSON format might find MongoDB a great companion to the R language and will enjoy the benefits of using a single format for both serialization and persistence of data.

In case your recent history is a bit rusty, phosgene was one of the terror weapons of World War I. It accounted for 85% of the 100,000 deaths from chemical gas. Not as effective as, say, sarin, but no slouch.

Don’t run to the library, online guides or the FBI for recipes to make phosgene at home. Its use in industrial applications should give you a clue as to an alternative to home-made phosgene. Use of phosgene violates the laws of war, so being a thief as well should not trouble you.

No, I don’t have a list of locations that make or use phosgene, but then DHS probably doesn’t either. They are more concerned with terrorists using “nuclear weapons” or “gamma-ray bursts“. One is mechanically and technically difficult to do well and the other is impossible to control.

The idea of someone using a dual-wheel pickup and a plant pass to pick up and deliver phosgene gas is too simple to have occurred to them.

If you are pitching topic maps to a science/chemistry oriented audience, these podcasts make a nice starting point for expansion. To date there are two hundred and forty-two (242) of them.

Countless learning tasks require awareness of time. Image captioning, speech synthesis, and video game playing all require that a model generate sequences of outputs. In other domains, such as time series prediction, video analysis, and music information retrieval, a model must learn from sequences of inputs. Significantly more interactive tasks, such as natural language translation, engaging in dialogue, and robotic control, often demand both.

Recurrent neural networks (RNNs) are a powerful family of connectionist models that capture time dynamics via cycles in the graph. Unlike feedforward neural networks, recurrent networks can process examples one at a time, retaining a state, or memory, that reflects an arbitrarily long context window. While these networks have long been difficult to train and often contain millions of parameters, recent advances in network architectures, optimization techniques, and parallel computation have enabled large-scale learning with recurrent nets.

Over the past few years, systems based on state of the art long short-term memory (LSTM) and bidirectional recurrent neural network (BRNN) architectures have demonstrated record-setting performance on tasks as varied as image captioning, language translation, and handwriting recognition. In this review of the literature we synthesize the body of research that over the past three decades has yielded and reduced to practice these powerful models. When appropriate, we reconcile conflicting notation and nomenclature. Our goal is to provide a mostly self-contained explication of state of the art systems, together with a historical perspective and ample references to the primary research.

Lipton begins with an all too common lament:

The literature on recurrent neural networks can seem impenetrable to the uninitiated. Shorter papers assume familiarity with a large body of background literature. Diagrams are frequently underspecified, failing to indicate which edges span time steps and which don’t. Worse, jargon abounds while notation is frequently inconsistent across papers or overloaded within papers. Readers are frequently in the unenviable position of having to synthesize conflicting information across many papers in order to understand but one. For example, in many papers subscripts index both nodes and time steps. In others, h simultaneously stands for link functions and a layer of hidden nodes. The variable t simultaneously stands for both time indices and targets, sometimes in the same equation. Many terrific breakthrough papers have appeared recently, but clear reviews of recurrent neural network literature are rare.

Unfortunately, Lipton gives no pointers to where the variant practices occur, leaving the reader forewarned but not forearmed.

Still, this is a survey paper with seventy-three (73) references over thirty-three (33) pages, so I assume you will encounter various notation practices if you follow the references and current literature.

Capturing variations in notation, along with where they have been seen, won’t win the Turing Award but may improve the CS field overall.
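For readers new to the area, the core recurrence behind the models the survey covers, h_t = tanh(W·x_t + U·h_{t-1} + b), can be sketched at toy scale (scalar weights and invented values, purely for illustration):

```python
import math

W, U, b = 0.5, 0.9, 0.0  # a one-unit "network": scalar weights, no training
h = 0.0                  # initial hidden state

for x in [1.0, 0.0, 0.0]:             # the input sequence, one step at a time
    h = math.tanh(W * x + U * h + b)  # the state carries context forward

# The later inputs were zero, yet h is still nonzero: the state "remembers"
# the first input, the arbitrarily long context window described above.
print(round(h, 4))
```

The overloaded-notation complaint Lipton quotes is about exactly these symbols: in some papers the subscript on h indexes time, in others it indexes nodes.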

Before you credit this report too much, consider the following points:

Crunching the Survey Numbers

MeriTalk, on behalf of Splunk, conducted an online survey of 150 Federal and 152 State and Local cyber security pros in March 2015. The report has a margin of error of ±5.6% at a 95% confidence level. (slide 15)
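The quoted margin of error is easy to sanity-check, assuming the ±5.6% refers to the combined sample of 150 + 152 respondents (my assumption; the slide does not say which):

```python
import math

n = 150 + 152                     # combined sample size
moe = 1.96 * math.sqrt(0.25 / n)  # worst case p = 0.5 at 95% confidence
print(f"±{moe * 100:.1f}%")       # ±5.6%
```

Note that for either 150-person subgroup alone, the same formula gives roughly ±8%, so conclusions drawn about Federal versus State/Local respondents separately are fuzzier than the headline figure suggests.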

An effort to capture anomalies from medical imaging, package them with other data, and deliver them for use by clinicians.

If you think of each medical image as representing a large amount of data, the underlying idea is to filter out all but the most relevant data, so that clinicians are not confronting an overload of information.

In network terms, rather than displaying all of the current connections to a network (the ever popular eye-candy view of connections), displaying only those connections that are different from all the rest.

The same technique could be usefully applied in a number of “big data” areas.

From the post:

Medical Sieve is an ambitious long-term exploratory grand challenge project to build a next generation cognitive assistant with advanced multimodal analytics, clinical knowledge and reasoning capabilities that is qualified to assist in clinical decision making in radiology and cardiology. It will exhibit a deep understanding of diseases and their interpretation in multiple modalities (X-ray, Ultrasound, CT, MRI, PET, Clinical text) covering various radiology and cardiology specialties. The project aims at producing a sieve that filters essential clinical and diagnostic imaging information to form anomaly-driven summaries and recommendations that tremendously reduce the viewing load of clinicians without negatively impacting diagnosis.

Statistics show that eye fatigue is a common problem with radiologists, as they visually examine a large number of images per day. An emergency room radiologist may look at as many as 200 cases a day, and some of these imaging studies, particularly lower body CT angiography, can run to as many as 3,000 images per study. Due to the volume overload, and the limited amount of clinical information available as part of imaging studies, diagnosis errors, particularly relating to coincidental diagnosis cases, can occur. With radiologists also being a scarce resource in many countries, it will be even more important to reduce the volume of data to be seen by clinicians, particularly when it has to be sent over low-bandwidth teleradiology networks.

MedicalSieve is an image-guided informatics system that acts as a medical sieve filtering the essential clinical information physicians need to know about the patient for diagnosis and treatment planning. The system gathers clinical data about the patient from a variety of enterprise systems in hospitals including EMR, pharmacy, labs, ADT, and radiology/cardiology PACs systems using HL7 and DICOM adapters. It then uses sophisticated medical text and image processing, pattern recognition and machine learning techniques guided by advanced clinical knowledge to process clinical data about the patient to extract meaningful summaries indicating the anomalies. Finally, it creates advanced summaries of imaging studies capturing the salient anomalies detected in various viewpoints.

Medical Sieve is leading the way in diagnostic interpretation of medical imaging datasets guided by clinical knowledge with many first-time inventions including (a) the first fully automatic spatio-temporal coronary stenosis detection and localization from 2D X-ray angiography studies, (b) novel methods for highly accurate benign/malignant discrimination in breast imaging, and (c) the first automated production of the AHA guideline 17-segment model for cardiac MRI diagnosis.

For more details on the project, please contact Tanveer Syeda-Mahmood (stf@us.ibm.com).

You can watch a demo of our Medical Sieve Cognitive Assistant Application here.

Curious: How would you specify the exclusions of information? So that you could replicate the “filtered” view of the data?

Replication is a major issue in publicly funded research these days. No reason for that to be any different for data science.

A great first step but I don’t find country level visualizations (or agency level accountability) all that compelling. There is $X amount of tax avoidance in country Y but that lacks the impact of naming the people who are evading the taxes, perhaps along with a photo for the society pages and their current location.

The New York Philharmonic played its first concert on December 7, 1842. Since then, it has merged with the New York Symphony, the New/National Symphony, and had a long-running summer season at New York’s Lewisohn Stadium. This Performance History database documents all known concerts of all of these organizations, amounting to more than 20,000 performances. The New York Philharmonic Leon Levy Digital Archives provides an additional interface for searching printed programs alongside other digitized items such as marked music scores, marked orchestral parts, business records, and photos.

Location

Geographic location of concert (countries are identified by their current name; for example, even though the orchestra played in Czechoslovakia, it is now identified in the data as the Czech Republic)

Venue

Name of hall, theater, or building where the concert took place

Date

Full ISO date used, but ignore TIME part (1842-12-07T05:00:00Z = Dec. 7, 1842)

Time

Actual time of concert, e.g. “8:00PM”

Works Info: the fields below are repeated for each work performed on a program. By matching the index number of each field, you can tell which soloist(s) and conductor(s) performed a specific work on each of the concerts listed above.

worksConductorName

Last name, first name

worksComposerTitle

Composer Last name, first / TITLE (NYP short titles used)

worksSoloistName

Last name, first name (if multiple soloists on a single work, delimited by semicolon)

worksSoloistInstrument

Instrument played by the soloist (if multiple soloists on a single work, delimited by semicolon)
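Working with these fields is straightforward. A hedged sketch (field values invented, following the formats described above) for stripping the artifact TIME part of the Date field and splitting a multi-soloist field:

```python
from datetime import datetime

# Keep the date, ignore the spurious time component noted in the field docs.
stamp = "1842-12-07T05:00:00Z"
concert_date = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").date()
print(concert_date.isoformat())  # 1842-12-07

# Multiple soloists on one work are semicolon-delimited (names are invented).
soloists = [s.strip() for s in "Doe, Jane; Roe, Richard".split(";")]
print(soloists)  # ['Doe, Jane', 'Roe, Richard']
```

Matching the index positions of the split soloist, instrument, and conductor lists is how you reassemble which performers played which work, per the note above.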

At Grammarly, the foundation of our business, our core grammar engine, is written in Common Lisp. It currently processes more than a thousand sentences per second, is horizontally scalable, and has reliably served in production for almost 3 years.

We noticed that there are very few, if any, accounts of how to deploy Lisp software to modern cloud infrastructure, so we thought that it would be a good idea to share our experience. The Lisp runtime and programming environment provides several unique, albeit obscure, capabilities to support production systems (for the impatient, they are described in the final chapter).
…

An inspirational story about Lisp, along with tips on features you are unlikely to find elsewhere. A good read and worth the time.

Since the OPM is still running COBOL, I am sure one of your favorite agencies is still crunching Lisp. You might need to get them to upgrade.

I know, not nearly as interesting as talking about Raquel Welch, but someone has to. 😉

From the post:

In recent years, we have witnessed a big growth of the Web of Data due to the enthusiasm shown by research scholars, public sector institutions and some private companies. Nevertheless, no rigorous processes for creating or mapping data have been systematically followed in most cases, leading to uneven quality among the different datasets available. Though low quality datasets might be adequate in some cases, these gaps in quality in different datasets sometimes hinder the effective exploitation, especially in industrial and production settings.

In this context, there are ongoing efforts in the Linked Data community to define the different quality dimensions and metrics to develop quality assessment frameworks. These initiatives have mostly focused on spotting errors as part of independent research efforts, sometimes lacking a global vision. Further, to date, no significant attention has been paid to the automatic or semi-automatic repair of Linked Data, i.e., the use of unattended algorithms or supervised procedures for the correction of errors in linked data. Repaired data is susceptible to receiving a certification stamp, which together with reputation metrics of the sources can lead to trusted linked data sources.

The goal of the Workshop on Linked Data Repair and Certification is to raise the awareness of dataset repair and certification techniques for Linked Data and to promote approaches to assess, monitor, maintain, improve, and certify Linked Data quality.

There is a call for papers with the following deadlines:

Paper submission: Monday, July 20, 2015

Acceptance Notification: Monday August 3, 2015

Camera-ready version: Monday August 10, 2015

Workshop: Monday October 7, 2015

Now that linked data exists, someone has to undertake the task of maintaining it. You could make links in linked data into topics in a topic map and add properties that would make them easier to match and maintain. Just a thought.

Every journey we take on the web is unique. Yet looked at together, the questions and topics we search for can tell us a great deal about who we are and what we care about. That’s why today we’re announcing the biggest expansion of Google Trends since 2012. You can now find real-time data on everything from the FIFA scandal to Donald Trump’s presidential campaign kick-off, and get a sense of what stories people are searching for. Many of these changes are based on feedback we’ve collected through conversations with hundreds of journalists and others around the world—so whether you’re a reporter, a researcher, or an armchair trend-tracker, the new site gives you a faster, deeper and more comprehensive view of our world through the lens of Google Search.

Real-time data

You can now explore minute-by-minute, real-time data behind the more than 100 billion searches that take place on Google every month, getting deeper into the topics you care about. During major events like the Oscars or the NBA Finals, you’ll be able to track the stories most people are searching for and where in the world interest is peaking. Explore this data by selecting any time range in the last week from the date picker.
…

When I think of topic maps that I can give you as examples, they involve taxes, Castrati, and other obscure topics. My favorite use case is an ancient text annotated with commentaries and comparative linguistics based on languages no longer spoken.

I know what interests me but not what interests other people.

Thoughts on using Google Trends to pick “hot” topics for topic mapping?

If you read the press release, you will miss these goodies from the complaint:

…
28. The FBI built a functional silencer at Sullivan’s request. That silencer does not bear the required serial number,7 and is not registered to Sullivan or any person in the National Firearms Registration and Transfer Record.

29. The FBI sent a package containing the silencer to Sullivan’s home at 5470 Rose Carswell Road, Morganton, North Carolina, according to Sullivan’s instructions. At approximately 4:15 p.m. on June 19, 2015, Sullivan’s mother picked up the mail, to include the package containing the silencer, from the mailbox and returned to the house. FBI surveillance confirmed Sullivan was in the house when his mother entered with the silencer.

30. On June 19, 2015, the FBI conducted a search of 5470 Carswell Road, Morganton, North Carolina, pursuant to the consent of Sullivan’s mother and a federal search warrant. Among other things, the FBI found the silencer delivered to Sullivan earlier that day, which was hidden under plastic in a crawlspace accessible from the basement of the home….
…

How did all this start?

10. On April 21, 2015, Sullivan’s father placed a “911” call to request police assistance at the family residence at 5470 Rose Carswell Road, Morganton, North Carolina. Sullivan’s father said: “I don’t know if it is ISIS or what, but he [Sullivan] is destroying Buddhas, and figurines and stuff.” He stated that Sullivan was destroying their “religious” items, had done so before, and this time Sullivan poured gasoline on some such items to burn them. Sullivan’s father added: “I mean, we are scared to leave the house.” Sullivan could be heard in the background stating: “why are you trying to say I am a terrorist?” and words to that effect, multiple times. Sullivan complained in the background that his father was only mentioning the religious items, and asked his father to tell the police he had destroyed other objects as well. Sullivan could be heard stating that “they” were going to put Sullivan “in jail my whole life,” or, alternatively: “they are not going to put me in jail. They are going to kill me.”

Of course, rather than a referral to mental health services, an FBI undercover agent made contact with Sullivan on June 6, 2015. You can read the recounting of the bizarre conversations with Sullivan in the complaint. It is an image file, so I have to re-type anything that appears in the blog.

According to the news release Sullivan was charged with:

one count of attempting to provide material support to ISIL,

one count of transporting and receiving a silencer in interstate commerce with intent to commit a felony, and

one count of receipt and possession of an unregistered silencer, unidentified by a serial number.

True enough, a person disturbed enough to:

Sullivan complained in the background that his father was only mentioning the religious items, and asked his father to tell the police he had destroyed other objects as well.

How’s that for an answer to the complaint you are destroying religious items? You want to point out to the police you are destroying other stuff too?

Sullivan was suffering from paranoid delusions but rather than getting him help, the FBI set him up for being charged with attempting to assist ISIS and two silencer violations that occurred only because the FBI built and mailed him a silencer.

Victimizing the mentally ill pads the FBI terrorist statistics and serves to further the fictional war on terrorism.

Posted in Government, Law | Comments Off on FBI Builds Silencers For The Mentally Ill

When Gabriel Weinberg launched a new search engine in 2008, I doubt even he thought it would gain any traction in an online world dominated by Google.

Now, seven years on, Philadelphia-based startup DuckDuckGo – a search engine that launched with a promise to respect user privacy – has seen a massive increase in traffic, thanks largely to ex-NSA contractor Edward Snowden’s revelations.

Since Snowden began dumping documents two years ago, DuckDuckGo has seen a 600% increase in traffic (but not in China – just like its larger brethren, it’s blocked there), thanks largely to its unique selling point of not recording any information about its users or their previous searches.

Such a huge rise in traffic means DuckDuckGo now handles around 3 billion searches per year.
…

DuckDuckGo does not track its users. Instead, it makes money by displaying ads based on keywords from your search string.

Hmmm, what if, instead of keywords from your search string, you pre-qualified yourself for ads?

Say, for example, I have a topic map fragment that pre-qualifies me for new books on computer science, bread baking, and waxed dental floss. When I use a search site, it uses those “topics” or keywords to display ads to me.

Advertisers benefit because their ads are displayed to people who have qualified themselves as interested in their products. I don’t know what the difference in click-through rate would be, but I suspect it would be substantial.
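As a minimal sketch of the idea, here is how an ad server might match ads against topics a user has volunteered, with no tracking at all. Every name here (the function, the inventory, the topics) is invented for illustration:

```python
# Hypothetical sketch: the user declares interest "topics" up front,
# and the ad server matches its inventory against them. No search
# history or tracking data is involved.

def select_ads(user_topics, ad_inventory):
    """Return ads whose keywords overlap the user's declared topics."""
    topics = {t.lower() for t in user_topics}
    return [ad for ad, keywords in ad_inventory.items()
            if topics & {k.lower() for k in keywords}]

inventory = {
    "New in CS Press": {"computer science", "algorithms"},
    "Artisan Bread Kit": {"bread baking", "sourdough"},
    "Lawn Mower Sale": {"gardening"},
}

ads = select_ads({"computer science", "bread baking", "waxed dental floss"},
                 inventory)
# Matches the book and bread-kit ads; the mower ad is skipped.
```

The click-through question above reduces to whether self-declared topics predict interest better than inferred search keywords do.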

Since a European Court of Justice ruling last year, individuals have the right to request that search engines remove certain web pages from their search results. Those pages usually contain personal information about individuals.

The BBC has decided to make clear to licence fee payers which pages have been removed from Google’s search results by publishing this list of links. Each month, we’ll republish this list with new removals added at the top.

We are doing this primarily as a contribution to public policy. We think it is important that those with an interest in the “right to be forgotten” can ascertain which articles have been affected by the ruling. We hope it will contribute to the debate about this issue. We also think the integrity of the BBC’s online archive is important and, although the pages concerned remain published on BBC Online, removal from Google searches makes parts of that archive harder to find.

The pages affected by delinking may disappear from Google searches, but they do still exist on BBC Online. David Jordan, the BBC’s Director of Editorial Policy and Standards, has written a blog post which explains how we view that archive as “a matter of historic public record” and, thus, something we alter only in exceptional circumstances. The BBC’s rules on deleting content from BBC Online are strict; in general, unless content is specifically made available only for a limited time, the assumption is that what we publish on BBC Online will become part of a permanently accessible archive. To do anything else risks reducing transparency and damaging trust.
…

Kudos to the BBC for demonstrating the extent of the censorship implied by the EU’s “right to be forgotten.” The “right to be forgotten” combines ignorance of technology with eurocentrism at its very worst. Not to mention being futile when directed at a search engine.

Just to get you started, here are the links from the post:

One caveat: when looking through this list it is worth noting that we are not told who has requested the delisting, and we should not leap to conclusions as to who is responsible. The request may not have come from the obvious subject of a story.

[Searching on several phrases and NERC (North American Electric Reliability Corporation), I have been unable to find the entire slide deck.]

Did you catch the line:

Information is power; sharing is seen as loss of power

You can use topic maps for sharing, but how much sharing you choose to do is up to you.

For example, assume your department is responsible for mapping data for ETL operations. Each analyst is using state of the art software to create mappings from field to field. In the process of creating those mappings, each analyst learns enough about those fields to make sure the mapping is correct.

Now one or more of your analysts leave for other positions. All the ad hoc knowledge they had of the data fields has been lost. With a topic map, you could have been accumulating power as each analyst discovered information about each data field.

If management requests the mapping you are using, you output the standard field to field mapping, with none of the extra information that you have accumulated for each field in a topic map. The underlying descriptions remain solely in your possession.
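The scenario above can be sketched in a few lines. This is a hypothetical illustration, not any particular topic map API: each mapping entry carries the notes analysts accumulate, and the export for management strips them out:

```python
# Hypothetical sketch: each analyst records the field-to-field mapping
# plus the knowledge that justified it. Exporting for management emits
# only the bare mapping; the accumulated notes stay in the topic map.
# All field names and notes are invented for illustration.

field_map = {
    "CUST_NO": {"target": "customer_id",
                "notes": ["legacy IDs below 1000 are test accounts",
                          "zero-padded to 8 digits since 2009"]},
    "ORD_DT":  {"target": "order_date",
                "notes": ["stored as DD/MM/YYYY before the migration"]},
}

def export_public_mapping(topic_map):
    """The standard field-to-field mapping, minus the private notes."""
    return {source: entry["target"] for source, entry in topic_map.items()}

print(export_public_mapping(field_map))
# {'CUST_NO': 'customer_id', 'ORD_DT': 'order_date'}
```

When an analyst leaves, the notes remain in the map instead of walking out the door with them.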

With topic maps, you can share a little or a lot, your call.

PS: You can also encrypt the values you use for merging in your topic map. That could enable different levels of merging within one map, based on security clearance. An example would be a topic map resource accessible by people with varying security clearances. (CIA/NSA take note.)
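One way that clearance-gated merging might work, sketched here as an assumption rather than an implementation: store an HMAC of each subject identifier keyed by a clearance-level secret, so that two topics merge only when their digests match, i.e. only for holders of that clearance’s key. The secrets and identifiers below are invented:

```python
# Hypothetical sketch of clearance-gated merging: instead of storing a
# subject identifier in the clear, store an HMAC of it keyed by a
# clearance-level secret. Topics merge only when digests match, which
# requires holding that clearance's key.

import hashlib
import hmac

def merge_key(subject_identifier: str, clearance_secret: bytes) -> str:
    return hmac.new(clearance_secret,
                    subject_identifier.encode(), hashlib.sha256).hexdigest()

secret_ts = b"top-secret-level-key"     # held only by TS-cleared users
secret_lo = b"unclassified-level-key"   # held by everyone

a = merge_key("subject:field-notes", secret_ts)
b = merge_key("subject:field-notes", secret_ts)
c = merge_key("subject:field-notes", secret_lo)

assert a == b  # same subject, same clearance key: the topics merge
assert a != c  # without the TS key, the merge never happens
```

A user with only the lower-level key sees an unmerged map; handing them the higher-level key unlocks the additional merges without changing the map itself.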