_Decoding simple regex features to match complex text patterns._
Many data science, analyst, and technology professionals have encountered regular expressions at some point. This esoteric, miniature language is used for matching complex text patterns, and it looks mysterious and intimidating at first. However, regular expressions (also called "regex") are a powerful tool that requires only a small time investment to learn. They are supported almost everywhere there is data: analytical and technology platforms including SQL, Python, R, Alteryx, Tableau, LibreOffice, Java, Scala, .NET, and Go. Major text editors and IDEs like Atom Editor, Notepad++, Emacs, Vim, IntelliJ IDEA, and PyCharm also support searching files with regular expressions.
The ubiquity of regular expressions must mean they offer universal utility, and, surprisingly, they do not have a steep learning curve. If you frequently find yourself manually scanning documents or parsing substrings just to identify text patterns, you might want to give them a look. Especially in data science and data engineering, they can assist in a wide spectrum of tasks, from wrangling data to qualifying and categorizing it.
In this article, I will cover enough regular expression features to make them useful for a great majority of tasks you may encounter.
SETTING UP
You can test the examples I am about to walk through in a number of places. I recommend Regular Expressions 101, a free web-based application for testing a regular expression against text inputs. As we go through these examples, type the regular expression pattern into the "Regular Expression" field and a sample text into the "Test String" field. You will then immediately see in the right panel whether a full or partial match succeeded, along with a broken-down explanation of what your regex is doing (see Figure 1).
Figure 1. The regex101.com site is a helpful tool to test regular expressions against text inputs.
For Python, you can also import and use the native re package as shown below. The fullmatch() function will accept a regex pattern and an input string to test against. It will return a match object if a full match exists.
import re

result = re.fullmatch(pattern="[A-Z]{2}", string="TX")

if result:
    print("match")
else:
    print("Doesn't match")
Now that you are set up, we will walk through all the major functionalities offered by regular expressions.
LITERALS AND SPECIAL CHARACTERS
A regular expression matches a broad or specific text pattern, and is strictly read left-to-right. It is input as a text string itself, and will compile into a mini program built specifically to identify that pattern. That pattern can be used to match, search, substring, or split text.
Most characters, including alphabetic and numeric characters, have no special functionality and literally represent those characters. For instance, a regex of TX will only match the string TX.
REGEX: TX INPUT: TX MATCH: true
REGEX: TX INPUT: AZ MATCH: false
However, a small subset of characters have special functionalities we will learn throughout this article. These characters include the following:
[\^$.|?*+()
If you want to treat these characters as literals, you need to precede them with an escape \. To create a literal regex that matches $180, we need to escape the dollar sign so it matches an actual dollar sign. Otherwise it will be treated as an "end-of-line" character, which we will learn about later.
REGEX: \$180 INPUT: $180 MATCH: true
Conversely, putting a \ on certain letters will yield a special character. One of the most common is \s, which will match any whitespace.
REGEX: Lorem\sIpsum INPUT: Lorem Ipsum MATCH: true
CHARACTER RANGES
For a given position in a string, we can qualify only a range of characters. To match a string containing a character of 0, 1, or 3 followed by an F, X, or B, we can declare a regular expression with character ranges inside square brackets [].
REGEX: [013][FXB] INPUT: 1X MATCH: true
REGEX: [013][FXB] INPUT: 1Z MATCH: false
You can also define a consecutive span of letters or numbers by putting a - between them. For example, we can qualify a character that is any digit 1 through 4, followed by any uppercase letter A through Z.
REGEX: [1-4][A-Z] INPUT: 1X MATCH: true
REGEX: [1-4][A-Z] INPUT: 51 MATCH: false
You can also qualify multiple ranges on a single character. For instance, we can qualify the first character in a two-character string to be an uppercase letter, a lowercase letter, or a number.
REGEX: [A-Za-z0-9][0-9] INPUT: i5 MATCH: true
REGEX: [A-Za-z0-9][0-9] INPUT: 1X MATCH: false
To negate characters, meaning you want anything but the specified characters, start your character range with a caret ^. For example, we can qualify non-vowel characters:
REGEX: [^AEIOU] INPUT: X MATCH: true
REGEX: [^AEIOU] INPUT: E MATCH: false
If you want a literal dash - character to be part of the character range, declare it first in the range.
REGEX: [-0-9][0-9] INPUT: -9 MATCH: true
REGEX: [-0-9][0-9] INPUT: 99 MATCH: true
ANCHORS
Sometimes you will want to qualify the start ^ and end $ of a line or string. This can be handy if you are searching a document and want to qualify the start or end of a line as part of your regular expression. You can use this regular expression to match all numbers that start a line in a document as shown here:
^[0-9]
Figure 2. Using Atom Editor to search for numbers that start a line.
Conversely, an end-of-line $ can be used to qualify the end of a line. Below is a regular expression that will match numbers that are the last character on a line.
[0-9]$
Depending on your environment, using both the start-of-line ^ and end-of-line $ together can be helpful to force a full match and ignore partial ones. This is because qualifying the start ^ and end $ of a string forces everything between them to be the only contents allowed in the input.
REGEX: [0-9][0-9] INPUT: 1432 MATCH: true
REGEX: ^[0-9][0-9]$ INPUT: 1432 MATCH: false
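If you are testing in Python rather than regex101, the same distinction shows up in the re package: search() allows a partial match anywhere in the input, while fullmatch() (or anchors) requires the entire input to match. A quick sketch:

import re

# search() succeeds on a partial match anywhere in the input
print(re.search("[0-9][0-9]", "1432"))      # matches "14"

# fullmatch() requires the entire input to match the pattern
print(re.fullmatch("[0-9][0-9]", "1432"))   # None

# anchors force full-match behavior in environments that only search
print(re.search("^[0-9][0-9]$", "1432"))    # None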
QUANTIFIERS
A critical feature of regular expressions is quantifiers, which repeat the preceding clause of a regular expression.
FIXED REPETITIONS
For instance, it is a bit redundant to express [A-Z] three times to match three uppercase letters.
REGEX: [A-Z][A-Z][A-Z] INPUT: YCA MATCH: true
Instead, we can follow the [A-Z] with a quantifier {3} to specify repeating that character range three times, as in [A-Z]{3}. This accomplishes the same task as [A-Z][A-Z][A-Z], but more succinctly expresses it as three repetitions.
REGEX: [A-Z]{3} INPUT: YCA MATCH: true
We can use the regular expression below to match a 10-digit phone number with dashes.
REGEX: [0-9]{3}-[0-9]{3}-[0-9]{4} INPUT: 470-127-7501 MATCH: true
REGEX: [0-9]{3}-[0-9]{3}-[0-9]{4} INPUT: 75663-2372 MATCH: false
MIN AND MAX REPETITIONS
You can also express a minimum and maximum number of allowable repetitions. [A-Z]{2,3} will require a minimum of 2 repetitions but a maximum of 3.
REGEX: [A-Z]{2,3} INPUT: YCA MATCH: true
REGEX: [A-Z]{2,3} INPUT: AZ MATCH: true
Leaving the second argument empty while keeping the comma results in an unbounded maximum, effectively specifying only a minimum. Below is a regex that will match a minimum of two alphanumeric characters.
REGEX: [A-Za-z0-9]{2,} INPUT: YZ1 MATCH: true
REGEX: [A-Za-z0-9]{2,} INPUT: YZSDjhfhSBH2342SDFSDFsdfw123412 MATCH: true
0 OR 1 REPETITION (A.K.A., OPTIONAL)
There are a couple of shorthand symbols for common quantifiers. For instance, a question mark ? is the same as {0,1}, which makes that part of the regex optional. If you want two uppercase alphabetic characters to optionally be preceded by a number, you can do so like this:
REGEX: [0-9]?[A-Z]{2} INPUT: BC MATCH: true
REGEX: [0-9]?[A-Z]{2} INPUT: 3BC MATCH: true
As you start combining different operations, a regular expression can start to look overwhelming. But the secret is to read a regex left to right. Looking at the case above, you can interpret it as, "I'm looking for an optional number, followed by an uppercase alphabetic character repeated two times."
Taking our phone number example earlier, we can make the dashes now optional as shown here:
REGEX: [0-9]{3}-?[0-9]{3}-?[0-9]{4} INPUT: 470-127-7501 MATCH: true
REGEX: [0-9]{3}-?[0-9]{3}-?[0-9]{4} INPUT: 4701277501 MATCH: true
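Here is a quick way to verify this pattern in Python, reusing the fullmatch() approach from the setup section (the phone numbers are the sample values from above):

import re

pattern = "[0-9]{3}-?[0-9]{3}-?[0-9]{4}"

for phone in ["470-127-7501", "4701277501", "75663-2372"]:
    # fullmatch() returns None when the whole input does not match
    if re.fullmatch(pattern, phone):
        print(phone, "matches")
    else:
        print(phone, "does not match")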
1 OR MORE REPETITIONS
A + is a shorthand for {1,}, which requires a minimum of 1 repetition, but will capture any number of repetitions after that.
REGEX: [XYZ]+ INPUT: Z MATCH: true
REGEX: [XYZ]+ INPUT: XYZZZYZXXX MATCH: true
REGEX: [XYZ]+[0-9]+ INPUT: XYZZZYZXXX2374676128963453452990 MATCH: true
0 OR MORE REPETITIONS
A * is a shorthand for {0,}, which makes whatever it is quantifying completely optional, but will capture as many repetitions as it can if they do exist.
REGEX: [0-3]+[XYZ]* INPUT: 34 MATCH: true
REGEX: [0-3]+[XYZ]* INPUT: 34YYXZZ MATCH: true
WILDCARDS
A dot . is a wildcard for any character, making it the broadest operator you can use. It will match not just alphabetic or numeric characters, but also whitespaces, newlines, punctuation, and any other symbols.
REGEX: ... INPUT: B/C MATCH: true
REGEX: .{3} INPUT: B/C MATCH: true
REGEX: H.{3}O INPUT: HELLO MATCH: true
A common operation you may see is .*, which allows 0 or more repetitions of any character. This is often used to match any text, making it function as an "everything" wildcard. This can be helpful when using regular expressions as qualifiers: if you do not want a given parameter to restrict anything, just make it a .*.
REGEX: .* INPUT: AsdfSJDFJSVdsfBLKJXCasdBNVJWB$TJ$@#ASDFSD@ MATCH: true
REGEX: .* INPUT: Alpha MATCH: true
GROUPING
It can be helpful to group up parts of a regular expression in parentheses, often to use a quantifier on that whole group. For instance, if you want to qualify an uppercase letter followed by three numeric digits, but want to repeat that whole operation with a quantifier, you can do so like this:
REGEX: ([A-Z][0-9]{3})+ INPUT: A563 MATCH: true
REGEX: ([A-Z][0-9]{3})+ INPUT: A563X264 MATCH: true
REGEX: ([A-Z][0-9]{3}-?)+ INPUT: A563-X264-C578 MATCH: true
If we wanted to identify phone numbers (with optional dashes -), but make the area code (the first three digits) optional, we can do so like this:
REGEX: ([0-9]{3}-)?[0-9]{3}-?[0-9]{4} INPUT: 470-127-7501 MATCH: true
REGEX: ([0-9]{3}-?)?[0-9]{3}-?[0-9]{4} INPUT: 127-7501 MATCH: true
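One Python-specific caveat when grouping: re.findall() returns the contents of capture groups rather than whole matches, so scanning a document for these phone numbers calls for a non-capturing group (?:...). A small sketch, using made-up text:

import re

text = "Call 470-127-7501 or 127-7501 for details."

# a capturing group makes findall() return only the group's contents
print(re.findall("([0-9]{3}-)?[0-9]{3}-?[0-9]{4}", text))    # ['470-', '']

# (?:...) groups without capturing, so whole matches come back
print(re.findall("(?:[0-9]{3}-)?[0-9]{3}-?[0-9]{4}", text))  # ['470-127-7501', '127-7501']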
ALTERNATION
Alternation is expressed with a | and essentially operates as an "OR". It alternates two or more valid patterns where at least one of those patterns must match in that position.
For instance, if we want to capture 5-digit U.S. ZIP codes that end in "35" or "75," we can tail a repeated numeric range with a (35|75). We must group that in parentheses so the | does not mangle the 35 with the repeated numeric range.
REGEX: [0-9]{3}(35|75) INPUT: 75035 MATCH: true
REGEX: [0-9]{3}(35|75) INPUT: 75062 MATCH: false
Sometimes an alternator is used simply to qualify a set of literal values. For instance, if I want to only match ALPHA, BETA, and GAMMA, I can use an alternator to achieve this.
REGEX: ALPHA|BETA|GAMMA INPUT: BETA MATCH: true
REGEX: ALPHA|BETA|GAMMA INPUT: DELTA MATCH: false
PREFIXES AND SUFFIXES
Especially when you are scanning documents, it can be helpful to qualify something that precedes or follows your targeted text without capturing it. Prefixes and suffixes allow this, and can be leveraged with (?<=regex) and (?=regex) respectively, where "regex" is the pattern for the head or tail you want to qualify but not include.
For instance, if I want to extract numbers that are preceded by uppercase letters, but I don't want to include those letters, I can use a prefix like this:
REGEX: (?<=[A-Z]+)[0-9]+ INPUT: ALPHA12 MATCH: 12
REGEX: (?<=[A-Z]+)[0-9]+ INPUT: 167 MATCH: false
A suffix works similarly, but matches a tail without including that tail.
REGEX: [0-9]+(?=[A-Z]+) INPUT: 12ALPHA MATCH: 12
REGEX: [0-9]+(?=[A-Z]+) INPUT: 167 MATCH: false
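These prefixes and suffixes are known more formally as lookbehind and lookahead. One caveat if you test them in Python: the re module only supports fixed-width lookbehinds, so the (?<=[A-Z]+) pattern above will raise an error there, while a fixed-width version such as (?<=[A-Z]) works. A minimal sketch:

import re

# lookahead: match digits only when they are followed by uppercase letters
match = re.search("[0-9]+(?=[A-Z]+)", "12ALPHA")
print(match.group() if match else "no match")   # 12

# fixed-width lookbehind: match digits preceded by one uppercase letter
match = re.search("(?<=[A-Z])[0-9]+", "ALPHA12")
print(match.group() if match else "no match")   # 12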
CONCLUSIONS
It is important to remember that you often only need to make a regular expression as specific as your data demands, depending on how predictable that data is. Qualifying a number with [0-9.]+ will work to match an IP address such as 172.18.83.200. But keep in mind it will also match 237476231.345342342334.23423756756856234, which is definitely not an IP address. If you do not know your data well, you should probably err on the side of being more specific, as demonstrated in this Stack Overflow question.
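For example, a slightly more specific (though still imperfect) sketch limits each group to one to three digits and requires exactly four dot-separated groups. It still allows out-of-range values like 999, but it rejects the garbage above:
REGEX: ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ INPUT: 172.18.83.200 MATCH: true
REGEX: ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ INPUT: 237476231.345342342334.23423756756856234 MATCH: false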
Regular expressions may seem niche, but they can rise up heroically to the most unexpected tasks in your day-to-day work. Hopefully this article has helped you feel more comfortable with regular expressions and find them useful. They can assist in data munging, qualification, categorization, and parsing as well as document editing.
Continue reading An introduction to regular expressions.

_Prototyping, Quantum Algorithms, 6ed Unix Commentary, and Quantum Computing_
* Prototyping: The Scientific Method of Business (Luke Wroblewski) -- _Recruit the right people so your feedback comes from your target audience._ Can't say that enough.
* Quantum Algorithms: An Overview (Nature) -- readable for classic computer scientists such as myself.
* Lions' Commentary on 6ed Unix -- the source code, annotated. This was one of the few ways of learning how Unix worked (Back In The Day), and it's still an interesting glimpse at how to build a "simple" operating system. Note line 2238's famous comment.
* Interactive Introduction to Quantum Computing -- _This is part one of a two-part series for those who want to learn a little about quantum computing, but lack the mathematics and quantum physics background required by many of the introductions out there. It covers some of the basics of quantum computing, such as qubits, state phases, and quantum interference. Part 2 goes on to look at quantum search._
Continue reading Four short links: 13 December 2017.

_Learned Indexes, Text Tables, Weaponized Ed Data, and Bad Feedback Loops_
* The Case for Learned Index Structures -- _Our initial results show that by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly, though, we believe that the idea of replacing core components of a data management system through learned models has far-reaching implications for future systems designs and that this work just provides a glimpse of what might be possible._ (via Simon Willison)
* tty-table -- _displays ASCII tables in your terminal emulator or browser console. Word wrap, padding, alignment, colors, Asian character support, per-column callbacks, and you can pass rows as objects or arrays. Backward compatible with Automattic/cli-table. _
* Weaponization of Ed Data (Audrey Watters) -- _2017 made it clear, I’d like to think, that the dangers of education technology and its penchant for data collection aren’t simply a matter of a potential loss of privacy or a potential loss of data. The stakes now are much, much higher._
* Money as Instrument of Change (YouTube) -- asked about exploiting human behaviour in social media, the former VP of User Growth, Mobile & International at Facebook says, _The short-term dopamine-driven feedback loops that we have created are destroying how society works._ The whole talk is interesting. And sweary. (via Gizmodo)
Continue reading Four short links: 12 December 2017.

_Every line of business must have access to the digital tools needed to innovate at the edge._
Every company is now a software company. Digital transformation allows even large enterprises to adapt to changes in markets and customers at lightning speed, responding with new products, new processes, and new business models. Digital transformation doesn’t just require new technology; it requires a new, more agile mindset. Every line of business must have access to the digital tools needed to innovate at the edge, and it’s the job of the core IT team to provide them.
Digital transformation relies on connecting data and systems, people and processes. Integration technologies have traditionally formed the nervous system of a large enterprise, connecting systems and moving data. But the human nervous system doesn’t just connect and sense; it also acts on data in real time. A digital business technology platform augments the intelligence of a digital business: building on the ability to connect and sense, it learns and acts automatically, enabling the next stage of your digital transformation. Unlike a biological nervous system, an enterprise’s digital business technology platform will reach beyond the traditional boundaries of the business to run in the cloud or on devices at the edge, including the Internet of Things (IoT).
A digital business platform will build on traditional integration technologies, extending them to deal with microservices, serverless architectures, event-driven architectures, machine learning, and edge intelligence. Agile development methodologies, DevOps processes, bimodal IT, and other cultural changes are pieces in a digital business jigsaw.
Every industry sector faces different business challenges, and will therefore emphasize different aspects of a digital business platform.
Manufacturing, transportation, utilities, and other industries with large capital investments need to maximize the value of their infrastructures and improve operational efficiency. Adding intelligence at the edge so IoT devices can sense and act locally will be an essential component of a digital business platform in these sectors.
Banks and telecommunications companies want to redefine their relationships with their customers, creating the kind of seamless, omnichannel environments at which digital retailers have excelled. Personalizing interactions with customers will require digital businesses to combine machine learning models with real-time streaming data.
Retailers and travel companies, which must provide both a personalized customer experience and efficient delivery or transport, face both of these challenges. Every enterprise wants to reduce costs and increase the speed of innovation. Every enterprise needs to make faster and better decisions with machine learning and real-time analytics. A robust digital business platform can support these goals.
A HYBRID INTEGRATION PLATFORM IS KEY
Gartner has defined “pervasive integration” as the integration of on-premise and cloud applications and data sources, business partners, clients, mobile apps, social networks, and IoT devices to enable organizations to pursue digital business.
The foundation of a digital business platform is a hybrid integration platform (HIP) that meets the organization’s pervasive integration needs by supporting different types of users, from integration specialists to business users and modern app developers, and diverse deployment models: on-premises, cloud, mobile, and edge devices for the IoT.
Two distinct kinds of software development take place in a HIP: core IT, which encompasses the kind of data center activities now associated with software development in most organizations, and edge IT, which encompasses innovation by business users to quickly meet new needs for information.
CORE INTEGRATION NEEDS
Data centers that are currently the core of IT in companies will take on a new role: building applications in the cloud or on-premises, deploying them to container-based platforms, and exposing them to the rest of the business in the form of microservices providing access through APIs.
Because all access to data and corporate services will be through the APIs developed in-house, the API development teams require management across the full life cycle of creation, publishing, operations, and maintenance. Services supporting such development are called integration platform as a service (iPaaS). These services enforce the policies defined by the enterprise’s shared cloud governance team and ensure best practices across the life cycle.
EDGE INTEGRATION NEEDS
Innovation can also happen among the business users in the organization. Some will build new applications on top of the APIs provided by the core IT department. Services facilitating such development at the edge are called application platform as a service (aPaaS). Business users can also create new integration flows by combining core IT services, supported by integration service as a service (iSaaS) products. These allow business users with no coding skills to integrate data between various cloud services.
The IoT introduces special requirements because it sometimes requires real-time decision-making that cannot be relegated to the cloud. Two or more tiers of processing may therefore take place: after doing real-time work at the edges, an IoT integration gateway can filter and aggregate data from the sensors and send relevant information back to a private datacenter or public cloud.
_To learn more about how pervasive integration provides a framework for a digital business platform, get the free ebook "Integration and the Path to Becoming a Digital Business."_
_This post is part of a collaboration between O'Reilly and TIBCO. See our statement of editorial independence._
Continue reading How enterprises can build a digital business platform with pervasive integration.

_Programming Falsehoods, Money Laundering, Vulnerability Markets, and Algorithmic Transparency_
* Falsehoods Programmers Believe About Programming -- I feel like "understanding programming" is like learning about science in school: it's a progressive series of "well, actually it's more complicated than that" until you're left questioning your own existence. (Descartes would tell us _computo ergo sum_.)
* Kleptocrat -- _You are a corrupt politician, and you just got paid. Can you hide your dirty money from The Investigator and cover your tracks well enough to enjoy it?_ The game is made by _a global investigative firm that specializes in tracing assets._ A+ for using games to Share What You Know. (via BoingBoing)
* Economic Factors of Vulnerability Trade and Exploitation -- _In this paper, we provide an empirical investigation of the economics of vulnerability exploitation, and the effects of market factors on likelihood of exploit. Our data is collected first-handedly from a prominent Russian cybercrime market where the trading of the most active attack tools reported by the security industry happens. Our findings reveal that exploits in the underground are priced similarly or above vulnerabilities in legitimate bug-hunting programs, and that the refresh cycle of exploits is slower than currently often assumed. On the other hand, cybercriminals are becoming faster at introducing selected vulnerabilities, and the market is in clear expansion both in terms of players, traded exploits, and exploit pricing. We then evaluate the effects of these market variables on likelihood of attack realization, and find strong evidence of the correlation between market activity and exploit deployment._ (via Paper a Day)
* Principles for Algorithmic Transparency (ACM) -- _Awareness; Access and redress; Accountability; Explanation; Data provenance; Auditability; and Validation and Testing._ (via Pia Waugh)
Continue reading Four short links: 11 December 2017.

_A simple framework for implementing message-based, user-initiated CRUD operations._
A message-based microservices architecture offers many advantages, making solutions easier to scale and expand with new services. The asynchronous nature of interservice interactions inherent to this architecture, however, poses challenges for user-initiated actions such as create-read-update-delete (CRUD) requests on an object. CRUD requests sent via messages may be lost or arrive out of order. Or, multiple users may publish request messages nearly simultaneously, requiring a repeatable method for resolving conflicting requests. To avoid these complexities, user-initiated actions are often treated synchronously, typically via direct API calls to object management services. However, these direct interactions compromise some of the message-based architecture’s benefits by increasing the burden of managing and scaling the object-management service.
The ability to handle these scenarios via message-based interactions can prove particularly useful for managing objects that can be modified both by direct user interaction and by service-initiated requests, as well as for enabling simultaneous user updates to an object. In this blog post, we discuss an asynchronous pattern for implementing user-initiated requests via decoupled messages in a manner that can handle requests from multiple users and handle late-arriving messages.
A SIMPLE SCENARIO: TEAM MANAGEMENT
To illustrate our pattern, let us consider a Team Service that manages teams in a software-as-a-service solution. Teams have multiple users, or team members, associated with them. Teams also have two attributes that any team member can request to update: screen-name and display color-scheme. In this scenario, only one user, User A, has permission to make requests to create or delete any team. Any user can request to read the state of any team. All permissions and security policies, including team membership, are managed by a separate authorization service that can be accessed synchronously. User interaction is facilitated by a user-interaction service, which allows users to publish CRUD requests to the message bus. Those events are then retrieved by the Team Service, which stores all allowed requests into an event registry and collapses that request history into the current state of a team upon receipt of a READ request.
While the addition of a user-interaction service increases the complexity of the user’s interaction with Team Service, it results in a more scalable and efficient architecture overall. The user-interaction service can be used to publish user requests for any other service as well, so that scaling up to accommodate bursts in user activity can be confined to the user-interaction service and applied to other services secondarily as needed. This approach also minimizes network traffic throughout the architecture, as all external calls are handled by this service. And while this post describes a solution that checks policies at the time of message receipt, a more sophisticated user-interaction service could implement policy checking at the time of publication.
Let us consider a use case where multiple users are attempting to update and read the state of a team. Figure 1 shows the request messages, the time they were published to the message bus, the time they were received by the Team Service, and the messages published to the message bus by the Team Service in response.
Figure 1. Incoming request messages and output messages published based on the message-based interprocess communication pattern described in this post. Figure courtesy of Arti Garg & Jon Powell.
The output messages published by the Team Service based on our pattern demonstrate how we achieve eventual consistency with the users’ desired team state. Our Team Service denies all requests the requesting user does not have permission to make. It also denies any UPDATE or DELETE request associated with a team that does not already exist. Team state is determined by the most recent updates to each team attribute, following an event-sourcing pattern based on request publication time. In this example, we note that message 4 is a “late arrival.” Although it arrived after message 3, it resulted in no change to the Team A state. By 10:12:02, Team A is in a state consistent with all user requests, despite requests coming from multiple users and late-arriving messages. In our implementation, as illustrated by message 9, we include a message time-out for late arrivals beyond a maximum latency time (TIME_OUT_LATENCY = 10 min).
SOLUTION APPROACH
We can achieve the output shown in Figure 1 using two simple sets of rules: business rules and state aggregation rules. Business rules govern how to process incoming request messages. These rules are applied to messages in the order they are received. State aggregation rules govern how to resolve state based on request message history. These rules are applied to the messages in the event registry in the order they were published to the message bus by the requestor. This distinction is critical for approaching consistency with the requestors' desired state as quickly as possible: state should be resolved based on the order requests are published, not the order they are received. This ensures that late-arriving messages do not supersede more recent requests that conflict with them.
This example applies the following business rules (a Python sketch follows the list):
* For all messages, check (i) message format, (ii) whether the request was received more than TIME_OUT_LATENCY time after it was published, and (iii) whether the requestor has permission to make the request.
* Publish request DENIED events to the message bus if any of these checks fail.
* For valid messages, query state of target team.
* Publish request DENIED events to the message bus for the following:
  * UPDATE and DELETE requests for teams that do not already exist.
  * CREATE requests for teams that already exist.
* Publish all other CREATE, UPDATE, DELETE requests to the event registry.
* Query target team state and publish Team CREATED, Team UPDATED, or Team DELETED if there is a change in team state.
* For READ requests, publish team STATE message to the message bus.
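To make these rules concrete, below is a minimal Python sketch of a handler that applies them to each incoming message. The message bus and authorization objects, the field names, and the resolve_state() function (sketched after the aggregation rules below) are illustrative assumptions, not our production interfaces:

TIME_OUT_LATENCY = 600  # seconds; the 10-minute limit for late arrivals

def handle_request(msg, registry, bus, authz, received_at):
    team_id, action = msg["team_id"], msg["action"]

    # deny timed-out or unauthorized requests (format checks omitted)
    too_late = received_at - msg["published_at"] > TIME_OUT_LATENCY
    if too_late or not authz.allows(msg["user_id"], action, team_id):
        bus.publish({"event": "DENIED", "request": msg})
        return

    state = resolve_state(registry, team_id)  # query current team state
    if action == "READ":
        bus.publish({"event": "STATE", "team": team_id, "state": state})
        return

    # deny CREATE on existing teams, and UPDATE/DELETE on missing ones
    exists = state is not None
    if (action == "CREATE" and exists) or (action != "CREATE" and not exists):
        bus.publish({"event": "DENIED", "request": msg})
        return

    registry.append(msg)  # allowed requests enter the append-only registry
    new_state = resolve_state(registry, team_id)
    if new_state != state:  # publish Team CREATED/UPDATED/DELETED on change
        bus.publish({"event": "Team " + action + "D", "team": team_id, "state": new_state})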
In this example, we resolve state from the messages in the event registry by applying the following state aggregation rules to implement the event-sourcing pattern (again, a Python sketch follows the list):
* Determine whether the target team exists:
  * Find all CREATE and DELETE request messages for the target team.
  * If the most recently published request is CREATE, the team exists.
  * If the most recently published request is DELETE, the team does not exist.
  * If there are no CREATE or DELETE requests for a target team, it does not exist.
* If the team exists, find all request messages associated with the target team that were published after the most recently published CREATE request in the event registry.
* The value for each team attribute is the value set in the most recently published message where it is set or updated. Any updates to that attribute in messages published earlier are ignored.
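A matching sketch of the state aggregation rules, with the same illustrative assumptions; the event registry is treated as a list of allowed request dicts carrying a published_at timestamp:

def resolve_state(registry, team_id):
    # order this team's allowed requests by publication time, not arrival time
    events = sorted(
        (e for e in registry if e["team_id"] == team_id),
        key=lambda e: e["published_at"],
    )

    # existence: the most recently published CREATE or DELETE wins
    last_create = None
    for event in events:
        if event["action"] == "CREATE":
            last_create = event
        elif event["action"] == "DELETE":
            last_create = None
    if last_create is None:
        return None  # no CREATE, or a DELETE was published after the last one

    # attributes: the most recently published UPDATE after the last CREATE wins
    state = dict(last_create.get("attributes", {}))
    for event in events:
        if (event["action"] == "UPDATE"
                and event["published_at"] > last_create["published_at"]):
            state.update(event["attributes"])
    return state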
SOLUTION DESIGN
We implement our pattern using a simple architecture that consists of: i) an append-only event registry that stores allowed events, ii) a state engine, and iii) a business rules engine. Notably, our architecture does not include a persistent object store, which is common in a command query responsibility segregation (CQRS) pattern. This is because the addition of an object store would require adding processes and architectural components to maintain its consistency with the event registry, introducing the possibility of conflicting “sources of truth” between the event registry and the object store. Replacing the object store with an event registry and a state engine allows us to reap the advantages of event sourcing, such as auditability and traceability, without additional complexity. And our simple algorithms allow us to quickly resolve incoming request messages, including late arrivals, to respond to READ requests. Figure 2 illustrates our architecture design.
Figure 2. Service architecture. This figure illustrates a simple architecture for implementing our proposed pattern. Figure courtesy of Arti Garg & Jon Powell.
This simple service design minimizes latency in the write path for incoming request messages. And while resolving state does require aggregating multiple messages in the event registry, our algorithm is simple and efficient. By sorting messages by publication time, we can determine state by parsing messages backward in time until the state of all attribute values is known. Regardless of how many UPDATE requests are published, our algorithm only requires parsing enough messages to know the value for each attribute.
In cases where the object being managed is more complex, where there are additional rules governing user interactions or how state is determined, or where more detailed security policies are needed, our pattern can readily be augmented while maintaining the same architecture. As the event registry grows, we can improve efficiency and reduce registry query load by adding refinements such as caching the most recently resolved state for each object, or periodically adding resolved state to the registry and limiting queries to messages received after state publication.
SUMMARY
Event sourcing and asynchronous, message-based interprocess interactions provide an attractive architectural principle for building a scalable, flexible, microservice-based software. User-initiated CRUD operations, however, create challenges for these patterns due to the possibility of lost or late-arriving messages. In this blog post, we suggest a simple architecture and set of algorithms for asynchronously handling user-initiated CRUD requests. A notable feature of our solution is the lack of an object store, which results in a greatly simplified architecture over a traditional CQRS solution. We replace the object store with an immutable, append-only event registry and a state engine, which provides the auditability advantages of the event-sourcing pattern. Overall, our solution provides a simple framework for implementing message-based, user-initiated CRUD operations that can be easily modified for a variety of applications.
Continue reading Handling user-initiated actions in an asynchronous, message-based architecture.

_Books for Young Engineers, Fake News, Digital Archaeology, and Bret Victor_
* Books for Budding Engineers (UCL) -- a great list of books for kids who have a STEM bent.
* Data-Driven Analysis of "Fake News" -- _In sheer numerical terms, the information to which voters were exposed during the election campaign was overwhelmingly produced not by fake news sites or even by alt-right media sources, but by household names like "The New York Times," "The Washington Post," and CNN. Without discounting the role played by malicious Russian hackers and naïve tech executives, we believe that fixing the information ecosystem is at least as much about improving the real news as it is about stopping the fake stuff._ A lot of data to support this conclusion. (via Dean Eckles)
* Digital Archaeology -- papers from a conference, whose highlights were tweeted here. In case you thought for a second there was some corner of the world that software wasn't going to eat.
* Dynamicland -- rumours of Bret Victor's new AR project about computing with space. See also the Twitter account showing off goodies.
Continue reading Four short links: 8 December 2017.

_The O’Reilly Data Show Podcast: Christine Hung on using data to drive digital transformation and recommenders that increase user engagement._
In this episode of the Data Show, I spoke with Christine Hung, head of data solutions at Spotify. Prior to joining Spotify, she led data teams at The New York Times and at Apple (iTunes). Since she has led teams at three different companies, I wanted to hear her thoughts on digital transformation, and I wanted to know how she approaches the challenge of building, managing, and nurturing data teams.
I also wanted to learn more about what goes into building a recommender system for a popular consumer service like Spotify. Engagement should clearly be the most important metric, but there are other considerations, such as introducing users to new or “long tail” content.
Continue reading Machine learning at Spotify: You are what you stream.

_Measurement, Value, Privacy, and Openness_
* Emerging Gov Tech: Measurement -- _presenters from inside and outside of government to share how they were using measurement to inform decision-making._ The hologram reminding people to dump biosecurity material was nifty, and the Whare Hauora project is much needed in a country with a lot of dank draughty houses.
* When Is a Dollar Not a Dollar -- _a dollar of cost savings is worth one dollar to the customer, but a dollar of extra revenue is usually worth dimes or pennies (depending on the customer's profit margin)._
* Learning with Privacy at Scale (Apple) -- about their differential privacy work. Their attention to detail is lovely. _Whenever an event is generated on-device, the data is immediately privatized via local differential privacy and temporarily stored on-device using data protection, rather than being immediately transmitted to the server. After a delay based on device conditions, the system randomly samples from the differentially private records subject to the above limit and sends the sampled records to the server. _
* When Open Data is a Trojan Horse: The Weaponization of Transparency in Science and Governance -- _We suggest that legislative efforts that invoke the language of data transparency can sometimes function as ‘‘Trojan Horses’’ through which other political goals are pursued. Framing these maneuvers in the language of transparency can be strategic, because approaches that emphasize open access to data carry tremendous appeal, particularly in current political and technological contexts._
Continue reading Four short links: 7 December 2017.

_A look at the rise of the deep learning library PyTorch and simultaneous advancements in recommender systems._
In the last few years, we have experienced a resurgence of neural networks owing to the availability of large data sets, increased computational power, innovation in model building via deep learning, and, most importantly, open source software libraries that make these techniques accessible to non-researchers. In 2016, the rapid rise of the TensorFlow library for building deep learning models allowed application developers to take state-of-the-art models and put them into production. Deep learning-based neural network research and application development is currently a very fast-moving field. As such, in 2017 we have seen the emergence of the deep learning library PyTorch. At the same time, researchers in the field of recommendation systems continue to pioneer new ways to increase performance as the number of users and items increases. In this post, we will discuss the rise of PyTorch, and how its flexibility and native Python integration make it an ideal tool for building recommender systems.
DIFFERENTIATING PYTORCH AND TENSORFLOW
The commonalities between TensorFlow and PyTorch stop at both being general-purpose analytic problem-solving libraries and both using the Python language as their primary interface. PyTorch's roots are in dynamic libraries such as Chainer, where execution of operations in a computation graph takes place immediately. This is in contrast to the TensorFlow-style design, where the computation graph is compiled and executed as a whole. (Note: Recently, TensorFlow has added Eager mode, which allows dynamic execution of computation graphs.)
THE RISE OF PYTORCH
PyTorch was created to address challenges in the adoption of its predecessor library, Torch. Due to the low popularity of, and general unwillingness among users to learn, the programming language Lua, Torch—a mainstay in computer vision for several years—never saw the explosive growth of TensorFlow. Both Torch and PyTorch are primarily developed, managed, and maintained by the team at Facebook AI Research (FAIR). PyTorch has seen a rise in adoption due to a native Python-style imperative programming model already familiar to researchers, data scientists, and developers of popular Python libraries such as NumPy and SciPy. This imperative, flexible approach to building deep learning models allows for easier debugging compared to a compiled model. Whereas in a compiled model errors will not be detected until the computation graph is submitted for execution, in a Define-by-Run-style PyTorch model, errors can be detected and debugging can be done as models are defined. This flexible approach is notably important for building models where the model architecture can change based on input. Researchers focusing on recurrent neural network (RNN)-based approaches for solving language understanding, translation, and other variable sequence-based problems have taken a particular liking to this Define-by-Run approach. Lastly, the built-in automatic differentiation feature in PyTorch gives model builders an easy way to perform the error-reducing backpropagation step.
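As a small illustration of the Define-by-Run style (a sketch, assuming PyTorch is installed), operations execute as soon as they are written, ordinary Python control flow can reshape the graph on every run, and autograd differentiates whatever actually executed:

import torch

x = torch.randn(3, requires_grad=True)

# eager execution: y holds real values as soon as this line runs
y = x * 2

# the graph can depend on runtime values, using plain Python control flow
while y.norm() < 10:
    y = y * 2

# automatic differentiation walks the recorded operations backward
loss = y.sum()
loss.backward()
print(x.grad)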
Late in the summer of 2017, with release 0.2.0, PyTorch achieved a significant milestone by adding distributed training of deep learning models, a common necessity to reduce model training time when working with large data sets. Furthermore, the ability to translate PyTorch models to Caffe2 (another library from FAIR) was added via the Open Neural Network Exchange (ONNX). ONNX allows those struggling to put PyTorch into production to generate an intermediate representation of the model that can be transferred to the Caffe2 library for deployment from servers to mobile devices. Certainly, using ONNX one can also transfer PyTorch models to other participating libraries.
Recently, we have seen further validation of PyTorch’s rise with problem-solving approaches built on top of the library. The engineering team at Uber, the popular ride-sharing company, has built Pyro, a universal probabilistic programming language using PyTorch as its back end. The decision to use PyTorch was driven by the ability to perform native automatic differentiation and construct gradients dynamically, which is necessary for random operations in a probabilistic model. Another development of note was when the popular deep learning training site fast.ai announced it was switching future course content to be based on PyTorch rather than Keras-TensorFlow. In addition to the core PyTorch features, the fast.ai team noted the use of PyTorch by a majority of the top scorers in Kaggle’s “Understanding the Amazon from space” challenge, which uses satellite data to track the human footprint in the Amazon rainforest. The fast pace of adoption and extensibility is making PyTorch a library of choice for researchers and application developers alike.
ANOTHER EXPLOSIVE TREND: RECOMMENDER SYSTEMS
Just as software libraries for deep learning are growing, the growth of user-generated content and user behavior signals has been an explosive trend in the last decade. In our vast ocean of consumption choices, the need for improved methods of curation has become ever more important. For the last few decades, recommender systems have led the way in tailoring user experience to align user interests with the correct product, content, or action. With growing numbers of users and items, the ability to perform simple deductive recommendation (wine-cheese-crackers, James Bond-Mission Impossible-Jason Bourne movies) has become challenging. Techniques such as memory-based collaborative filtering, which uses similarity-based measures to perform recommendation, do not perform well once user and item data becomes sparse, as is the case with most content and product applications. Take, for example, a small system with 100K users and 10K items. It is unlikely every user has experienced/purchased/rated more than 100 items. The resulting user-item matrix will be extremely sparse, making it difficult to provide valid recommendations that are not purely random guesses.
An alternate approach to user-item-based distance measurement is to learn the underlying relationships between users and items to build a predictive model for recommendation. For example, let’s say our goal is to recommend movies to a user that are predicted to be rated 4+ stars out of 5. In this rule-learning-based approach, data scientists typically divide the historical preference data into train, test, and validation sets, as one would in supervised learning models. Commonly used rule-learning techniques such as alternating least squares and support vector machines were state of the art in the prior decade. Among the many recent advances in recommender systems, two key concepts help solve the challenges faced in large-scale systems: Wide & Deep Learning for Recommender Systems (by a team at Google), and deep matrix factorization (about which several papers have been written by other researchers).
The core idea behind Wide & Deep Learning is to jointly train both the wide and the deep networks. In our example, a wide network is used to learn the underlying rule that would generate a high rating for a recommendation request and item pair. Meanwhile, the sparse user behavior vectors are mapped to a dense representation using a state-of-the-art feature-vector transformation model (for example, word2vec). A deep neural network is trained using these dense vectors as input with targeted rating as output. This approach was put into production in the Google Play Store for Mobile App recommendation using the champion-challenger deployment model. The deep matrix factorization concept attempts to learn the non-linear relationships between users and items. This model is implemented by using user-item pair as input to the neural network with the predicted rating as the output.
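As an illustrative sketch (not the exact architectures from the papers above), a deep matrix factorization model in PyTorch can be as small as two embedding tables for the sparse user and item IDs, concatenated and passed through dense layers that learn the non-linear interactions:

import torch
import torch.nn as nn

class DeepMF(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        # dense, learned representations of sparse user and item IDs
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        # dense layers learn non-linear user-item interactions
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, user_ids, item_ids):
        pair = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=1)
        return self.mlp(pair).squeeze(1)  # one predicted rating per pair

# the small example system from above: 100K users, 10K items
model = DeepMF(n_users=100_000, n_items=10_000)
ratings = model(torch.tensor([0, 42]), torch.tensor([7, 99]))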
It naturally follows that the fast-rising PyTorch library should be used to test these new approaches for recommender systems. In March 2018 at the Strata Data Conference in San Jose, we will do exactly that in a tutorial format. We will use the popular MovieLens data set to build traditional, Wide & Deep, and deep matrix factorization models for recommendation.
Continue reading When two trends fuse: PyTorch and recommender systems.

_Thoughts on "We are the people they warned you about."_
Chris Anderson recently published "We are the people they warned you about," a two-part article about the development of killer drones. Here's the problem he's wrestling with: "I'm an enabler ... but I have no idea what I should do differently."
That's a good question to ask. It's a question everyone in technology needs to ask, not just people who work on drones. It's related to the problem of ethics at scale: almost everything we do has consequences. Some of those consequences are good, some are bad. Our ability to multiply our actions to internet scale means we have to think about ethics in a different way.
The second part of Anderson's article gets personal. He talks about writing code for swarming behavior after reading _Kill Decision_, a science fiction novel about swarming robots running amok. And he struggles with three issues. First (I'm very loosely paraphrasing Anderson's words), "I have no idea how to write code that can't run amok; I don't even know what that means." Second, "If I don't write this code, someone else will—and indeed, others have." And third, "Fine, but my code (and the other open source code) doesn't exhibit bad behavior—which is what the narrator of _Kill Decision_ would have said, right up to the point where the novel's drones became lethal."
How do we protect ourselves, and others, from the technology we invent? Anderson tries to argue against regulatory solutions by saying that swarming behavior is basically math; regulation is essentially regulating math, and that makes no sense. As Anderson points out, Ben Hamner, CTO of Kaggle, tweeted that regulating artificial intelligence essentially means regulating matrix multiplication and derivatives. I like the feel of this _reductio ad absurdum_, but neither Anderson nor I buy it—if you push far enough, it can be applied to anything. The FCC regulates electromagnetic fields; the FAA regulates the Bernoulli effect. We can regulate the effects or applications of technology, though even that's problematic. We can require AI systems to be "fair" (if we can agree on what "fair" means); we can require that drones not attack people (though that might mean regulating emergent and unpredictable behavior).
A bigger issue is that we can only regulate agents that are willing to be regulated. A law against weaponized drones doesn't stop the military from developing them. It doesn't even prevent me from building one in my basement; any punishment for violation comes after the fact. (For that matter, regulation rarely, if ever, happens before the technology has been abused.) Likewise, laws don't prevent governments or businesses from abusing data. As any speeder knows, it's only a violation if you get caught.
A better point is that, whether or not we regulate, we can't prevent inventions from being invented, and once invented, they can't be put back into the box. The myth of Pandora's box is powerful and resonant in the 21st century. The box is always opened. It's always _already_ opened; the desire to open the box, the desire to find the box and then open it, is what drives invention in the first place.
Since our many Pandora's boxes are inevitably opened, and since we can't predict (or even mitigate) the consequences of opening them in advance, perhaps we should look at the conditions under which those boxes are opened. The application of any technology is determined by the context in which it was invented. Part of the reason we're so uncomfortable with nuclear energy is that it has been the domain of the military. A large part of the reason we don't have thorium reactors, which can't melt down, is that thorium reactors aren't useful if you want to make bombs.
If this is so, perhaps the solution is opening the box in an environment where it does the least harm. Paradoxically, that means opening the box in public, not in private. My claim is that putting an invention into a public space inevitably makes that invention safer. Military research in many countries is no doubt building autonomous killer drones already. This being the case, does developing open source drone software make us more or less safe? When invention takes place in public, we (the public) know that it exists. We can become aware of the risks; we have some control over the quality of the code; just as many eyes can find the bugs, many minds can think about the consequences. And many minds can think about how to defend against those consequences.
That argument isn't as strong as I'd like. We can make it stronger by expanding our concept of an "invention" to include everything that makes the invention work, not just the code. Cathy O'Neil has frequently written about the danger of closed, opaque data models, most recently in Weapons of Math Destruction. Openness and safety are allies. Regulation is a useful tool, though a tool that's not as powerful as we'd like to think.
Regulation or not, we won't prevent the technology from being invented. By inventing in public, inventions can get the scrutiny and critical examination they need. I think Anderson would like to make this point, but isn't really comfortable with it. I share that discomfort (whether it's his or not), but I think it's unavoidable. That may be all the safety we get.
Continue reading Who, me? They warned you about me?.

_For stack scalability, elasticity at the business logic layer should be matched with elasticity at the caching layer._
Ever-increasing numbers of people are using mobile and web apps, and the average time spent using them continues to grow. This use comes with expectations of real-time response, even during peak access times. Modern, cloud-native applications have to stand up to all this application demand. In addition to user-initiated requests, applications respond to requests from other applications or handle data streamed in from sensors and monitors. Lapses in service will add friction to customer touch points, cause big declines in operational efficiency, and make it difficult to take advantage of fleeting sales opportunities during periods of high activity, like Cyber Monday.
WHAT ARE YOUR OPTIONS FOR DEALING WITH DEMAND?
Scaling large monoliths dynamically on demand becomes impractical as these systems grow, because they become increasingly fragile and the scope of what can safely be changed narrows. Meeting demands like Cyber Monday by scaling up a large, clunky deployment of a monolith, and then scaling it back down when the higher capacity is no longer needed, simply does not work.
Continue reading How to bring fast data access to microservice architecture with in-memory data grids.

_Without the proper cataloging, curation, and security that self-service data platforms allow, companies are left vulnerable to cybersecurity threats and misinformation._
In our personal lives, data makes the world go round. You can answer almost any question in a second thanks to Google. Booking travel anywhere on the planet is just a few clicks away. And your smartphone has apps for pretty much anything you can think of, and more. It’s a great time to love coffee and craft beer. And it’s a great time to be a consumer of data. Life has never been better.
When we get to work, however, our relationship with data isn’t nearly as friendly. While everyone’s job depends on data, most of us struggle to use it as seamlessly as we do in our personal lives. At work, data is hard to find. It’s slow to access. Each data set has its own tools and “cheat sheet” to use successfully. Frequently, the data we need isn’t available to us in the shape we need, so we open a ticket with IT, then wait and hope for the best. Collaborating with other users to work with data is far from simple, and typically the solution is to copy the data into Excel and email it to someone. Alternatively, we might set up a BI server or database we manage ourselves, potentially on a server hiding in a closet or under someone’s desk.
WE’VE SEEN THIS RODEO BEFORE
If this sounds familiar, it should. For the past decade, workers have found ways to work around limitations in the hardware and software provided by IT—a trend we call “shadow IT.” Workers started to bring their own laptops, iPads, and smartphones until IT either made these devices available or adopted “bring your own device” policies. Popular apps like Evernote, Dropbox, and Gmail, as well as cloud service providers like AWS and Google Cloud, quickly became everyday tools for millions of people in the workplace, opening up companies to massive security vulnerabilities that most are still trying to address today.
Companies learned that simply shutting down access to these systems and keeping the status quo was not an option. They learned to improve the quality of hardware and software in order to keep people using governed systems, removing the need to take matters into their own hands.
A NEW TREND: SHADOW ANALYTICS
What we saw with software and hardware over the past decade is now happening with data, a phenomenon we call “shadow analytics.” People want to do their jobs, and they’ll find a way to be more productive if IT organizations don’t provide the right tools. They are frustrated with their inability to access and use data, and they’re finding workarounds by moving data into ungoverned environments that sidestep the essential controls put in place by organizations. For example, users download data into spreadsheets, load data into cloud applications, and even run their own database and analytics software on their desktops.
Shadow analytics creates an environment where users can reach misleading conclusions. Because data is disconnected from the source, users can lose important updates in their copies, and the answers to questions they developed may no longer apply. In addition, with each user managing their own copy of the data, each copy can be wrong in different ways. As a result, IT organizations are frequently asked, “My colleague and I have different answers to an essential question—why?”
WHY SHADOW ANALYTICS IS A BIG DEAL
Data is the greatest asset—and the greatest liability—of most organizations. Cybersecurity threats are evolving rapidly. In the past few years, phishing attacks and intellectual property theft have grown by over 50%, ransomware is up over 160%, and the average time to detect compromises has reached 200 days. Consider a few of the recent headlines for major companies such as Equifax, Yahoo, and Target—billions of people were affected.
The main driver for the expansion of these threats is that the surface area for possible cyberattacks has radically increased in the past decade, and the number of threat actors has exploded. With each new device (e.g., smartphones, IoT, automated sensors), each new application, and every copy of data, a new vulnerability is created that potentially opens the door to the entire organization.
Every organization’s data and systems are potential targets for attack from armies of hackers-for-hire, well-organized criminal gangs, and state-sponsored initiatives. Threats are becoming more sophisticated with the emergence of social engineering, advanced persistent threats (APTs), ransomware, and fraud committed through digital identity theft. Cybersecurity software and services are not enough, because once you've lost control over your data, your chances of airtight protection are slim, even with the most sophisticated network and endpoint security systems.
A NEW PARADIGM EMERGES: SELF-SERVICE DATA
Organizations have focused on several functional areas to improve the safety, accessibility, and usability of their data. Today, organizations understand that self-service data is essential to avoid the risks associated with shadow analytics. With self-service data, organizations can now provide data consumers with an experience that makes them more productive than they would be taking matters into their own hands.
Let’s take a closer look at the essential functional areas of data analytics to examine their importance and the benefits of a self-service approach.
Each functional area below is described in terms of why it is important (including where shadow analytics falls short) and what the self-service data approach looks like.

DATA ACCELERATION
Why is this important? The nature of analysis and data science is iterative. Data consumers ask questions that lead to new ideas and follow-on questions. Each query needs to be interactive, no matter the source or size of the data, and using any tool, such as Tableau or Python.
With shadow analytics, users create BI extracts or OLAP cubes to accelerate access to data. Because each person works independently, many redundant copies are created, each potentially ungoverned and disconnected from the source. In addition, these copies are slow to update and create additional cognitive load on the user (i.e., which cube do I connect to for a given query?).
Self-service data approach: It is impractical to scan all data for each query. For decades, systems have applied techniques to accelerate access to data, such as indexes, sorting, partitioning, and aggregating data to support various query patterns. Traditionally, these optimizations are created by an administrator, and end users must understand which optimization is best for a given query. In the self-service data approach, these optimizations are invisible to the end user: the system must be able to use them when appropriate without relying on the end user, and it must be capable of autonomously identifying the best optimizations and adapting to emerging query patterns over time.

DATA CATALOG
Why is this important? Data consumers struggle to find data that is important to their work. Not all data is created equal—it is important to identify specific data sets as vetted and authoritative for all users.
With shadow analytics, there is no central catalog. Instead, users keep private notes about data sources and data quality, meaning there is no governance and no vetted sense of meaning across the organization.
Self-service data approach: The catalog is automatic—as new data sources are brought online, the system must discover the underlying schema automatically, and it must adapt as the source evolves. Organizations develop rich semantic descriptions of their data that should be searchable as well. In addition, data sets that are created by end users must also be cataloged for easy discovery and analysis.

DATA VIRTUALIZATION
Why is this important? It is virtually impossible for an organization to centralize all data in a single system, yet analytical tools, including BI tools like Tableau and data science tools like Python and R, assume that all data resides in a single relational database.
With shadow analytics, data is moved from one system into a format that is accessible by tools such as Tableau or Python, typically CSV. This creates a copy that is ungoverned and disconnected from the source data.
Self-service data approach: Data consumers need to be able to access all data sets equally well, regardless of the underlying technology or location of the system. Access should be through SQL, as it is widely supported by all tools and well understood by most users.

DATA CURATION
Why is this important? There is no single “shape” of data that works for everyone. Each data consumer needs data in a particular form that is useful for the task at hand. This can mean filtering data in various ways, blending multiple data sets together, converting data types, formatting the data in different ways, and more.
With shadow analytics, curation is performed by making copies of data that are typically ungoverned and disconnected from their sources.
Self-service data approach: Data consumers need the ability to interact with data sets from the context of the data itself, not exclusively from simple metadata that fails to tell the whole story. Data consumers should be capable of reshaping data to their own needs without writing any code or learning new languages. In the self-service data approach, these capabilities are provided without making copies of the data—no organization wants thousands of copies of its data.

DATA LINEAGE
Why is this important? As data is accessed by data consumers and different processes, it is important to track the provenance of the data, who accessed it, how it was accessed, what tools were used, and what results were obtained. In cases of sensitive data, erroneous data, or data breaches, it is critical to be able to establish the full lineage of the data.
In shadow analytics, data is accessed and copied independent of any governing process. There is no clear record of data lineage or custody.
Self-service data approach: With each data consumer capable of creating data sets for themselves, data lineage becomes paramount—no company wants to govern thousands of copies of each data set. It is critical that this lineage be tracked automatically—organizations cannot rely on end users to record and register their work in a central system themselves. Instead, as users reshape and share data sets with one another through a virtual context, a self-service data platform can seamlessly track these actions and all states of data along the way, providing full audit capabilities as well.

OPEN SOURCE
Why is this important? Because data is essential to every area of every business, the underlying data formats and technologies used to access and process the data should be open source. Organizations should not be locked into a specific vendor or commercial model.
In shadow analytics, users decide for themselves which tools are used, including proprietary tools from unknown vendors and cloud services whose access is not controlled by the organization (e.g., if the employee leaves, how will the data be accessed?).
Self-service data approach: Self-service data platforms build on open source standards like Apache Parquet, Apache Arrow, and Apache Calcite to store, query, and analyze data from any source. In addition, the end-user interface is also open source and runs in any modern browser, while providing access to visualization and analytical tools over open standards like ODBC, JDBC, and REST.

SECURITY CONTROLS
Why is this important? Organizations safeguard their data assets with security controls that govern authentication (you are who you say you are), authorization (you can perform specific actions), auditing (a record of the actions you take), and encryption (you can only read the data if you have the right key). In shadow analytics, users download data into environments that are outside these central controls, exposing companies to unnecessary risk.
Self-service data approach: Self-service data platforms integrate with the organization’s existing security controls, such as LDAP and Kerberos. They respect the controls of underlying data sources and do not create copies of data that sit outside those controls.
CONCLUSION
In terms of data, companies need to find the right balance between control and convenience—control of the data and systems in a safe and auditable way, and convenience for end users so they don’t invent ways to work around these controls. Self-service data platforms are a new open source approach that helps to prevent shadow analytics. They preserve and extend existing security controls, and give data consumers a way to use data that makes them more productive than taking matters into their own hands.
Continue reading How self-service data avoids the dangers of “shadow analytics”.

_Learn how the Defense Advanced Research Projects Agency (DARPA) has spurred significant advances in the promising field of synthetic biology._
GOOGLE, VENTURE CAPITAL, AND BIOPHARMA
Biotech is a different world from Silicon Valley’s tech scene, and one with its own ways and traditions. Where Silicon Valley has a reputation for being brash and anarchistic, biotech developments are perceived—even if it’s inaccurate—as being bound by regulation and a cat’s cradle of connections with academia, big pharma, and government. But they have one thing in common: an overlapping venture capital community ready to fund the next big thing, be it cloud computing or CRISPR.
And, for many biology startups, there’s something unexpected: Google is there.
Continue reading DARPA and the future of synthetic biology.

_The O'Reilly Security Podcast: The objectives of agile application security and the vital need for organizations to build functional security culture._
In this episode of the Security Podcast, I talk with Rich Smith, director of labs at Duo Labs, the research arm of Duo Security. We discuss the goals of agile application security, how to reframe success for security teams, and the short- and long-term implications of your security culture.
Continue reading Rich Smith on redefining success for security teams and managing security culture.

_TouchID for SSH, Pen Testing Checklist, Generativity, and AI Data_
* SeKey -- _an SSH agent that allows users to authenticate to UNIX/Linux SSH servers using the Secure Enclave_.
* Web Application Penetration Testing Checklist -- a useful checklist of things to poke at if you’re doing a hygiene sweep.
* The Bullet Hole Misconception -- _Computer technology has not yet come close to the printing press in its power to generate radical and substantive thoughts on a social, economical, political, or even philosophical level._ I really like this metric of success.
* AI Index (Stanford) -- _This report aggregates a diverse set of data, makes that data accessible, and includes discussion about what is provided and what is missing. Most importantly, the AI Index 2017 Report is a starting point for the conversation about rigorously measuring activity and progress in AI in the future._
Continue reading Four short links: 6 December 2017.

_Using the keras TensorFlow abstraction library, the method is simple, easy to implement, and often produces surprisingly good results._
The multitude of methods jointly referred to as “deep learning” have disrupted the fields of machine learning and data science, rendering decades of engineering know-how almost completely irrelevant—or so common opinion would have it. Of all these, one method that stands out in its overwhelming simplicity, robustness, and usefulness is the transfer of learned representations. Especially in computer vision, this approach has brought about unparalleled capability, accessible to practitioners of all levels, making previously insurmountable tasks as easy as from keras.applications import *.
Put simply, the method dictates that a large data set should be used in order to learn to represent the object of interest (image, time-series, customer, even a network) as a feature vector, in a way that lends itself to downstream data science tasks such as classification or clustering. Once learned, the representation machinery may then be used by other researchers, and for other data sets, almost regardless of the size of the new data or computational resources available.
In this blog post, we demonstrate the use of transfer learning with pre-trained computer vision models, using the keras TensorFlow abstraction library. The models we will use have all been trained on the large ImageNet data set, and learned to produce a compact representation of an image in the form of a feature vector. We will use this mechanism to learn a classifier for species of birds.
There are many ways to use pre-trained models, the choice of which generally depends on the size of the data set and the extent of computational resources available. These include:
* FINE TUNING: In this scenario, the final classifier layer of a network is swapped out and replaced with a softmax layer of the right size for the current data set, while keeping the learned parameters of all other layers. This new structure is then further trained on the new task.
* FREEZING: The fine-tuning approach requires relatively large computational power and larger amounts of data. For smaller data sets, it is common to “freeze” the first several layers of the network, meaning the parameters of the pre-trained network are not modified in these layers, while the remaining layers are trained on the new task as before. (A minimal sketch of this approach follows this list.)
* FEATURE EXTRACTION: This method is the loosest usage of pre-trained networks. Images are fed forward through the network, and a specific layer (often a layer just before the final classifier output) is used as a representation. No training at all is performed with respect to the new task. This image-to-vector mechanism produces an output that may be used in virtually any downstream task.
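To make the freezing approach concrete, here is a minimal Keras sketch. This is our illustration, not code from the original experiment; it uses ResNet50’s standard 224x224 input size and a 200-class head matching the CUB-200 task described below:

from keras.applications import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Load the pre-trained base without its ImageNet classifier head.
base = ResNet50(include_top=False, weights="imagenet", input_shape=(224, 224, 3))

# "Freeze" the pre-trained layers so their parameters are not updated.
for layer in base.layers:
    layer.trainable = False

# Attach a new softmax head sized for the new task (here, 200 bird species).
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(200, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X, y, ...)  # only the new head's weights are trained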
In this post, we will use the feature extraction approach. We will first use a single pre-trained deep learning model, and then combine four different ones using a stacking technique. We will classify the CUB-200 data set. This data set (brought to us by vision.caltech) contains 200 species of birds, and was chosen, well...for the beautiful bird images. Figure 1. 100 random birds drawn from the CUB-200 data set. Image courtesy of Yehezkel Resheff.
First, we download and prepare the data set. On Mac/Linux, this is done by:
curl http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz | tar -xz
Alternatively, just download and unzip the file manually.
The following describes the main elements in the process. We omit the import and setup code in favor of more readable and flowing text. The full code is available in this GitHub repo.
We start by loading the data set. We will use a utility function (here) to load the data set with images of a specified size. The constant CUB_DIR points to the “images” directory inside the “CUB_200_2011” folder, which was created when unzipping the data set.
X, y = CUB200(CUB_DIR, size=(244, 244)).load_dataset()
To begin, we will use the ResNet50 model (see paper and keras documentation) for feature extraction. Notice that we use images sized at 244x244 pixels. All we need in order to generate vector representations of the entire data set are the following two lines of code:
X = preprocess_input(X)
X_resnet = ResNet50(include_top=False, weights="imagenet", pooling="avg").predict(X)
The preprocess_input function performs some normalizations that were done on the original training data (ImageNet) with which the model was built, namely subtraction of the mean channel-wise pixel value. ResNet50.predict does the actual transformation, returning a vector of size 2048 to represent each of the images. When first called, the ResNet50 constructor[1] will download the pre-trained parameter file; this may take a while, depending on your internet connection. These feature vectors are then used in a cross-validation procedure with a simple linear SVM classifier:
clf = LinearSVC()
results = cross_val_score(clf, X_resnet, y, cv=3, n_jobs=-1)
print(results)
print("Overall accuracy: {:.3}".format(np.mean(results) * 100.))

[ 0.62522158  0.62344583  0.62852745]
Overall accuracy: 62.6
With this simple approach, we obtain 62.6% accuracy on the 200-class data set. Not bad! In the following section, we will use several pre-trained models and a stacking approach to try to improve this result.
The intuition behind using more than one pre-trained model is the same as in any case of using more than one set of features: they will hopefully provide some non-overlapping information, allowing superior performance when combined.
The approach we will use to combine the features derived from the four pre-trained models (VGG19, ResNet, Inception, and Xception) is generally referred to as “stacking.” Stacking is a two-stage approach, where the predictions of a set of models (base classifiers) are aggregated and fed into a second-stage predictor (meta classifier). In this case, each of the base classifiers will be a simple logistic regression. The probabilistic outputs of these are then averaged and fed into a linear SVM, which provides the final decision.
base_classifier = LogisticRegression
meta_classifier = LinearSVC
We start off with the sets of features (X_vgg, X_resnet, X_incept, X_xcept) generated from each of the pre-trained models, as in the case of ResNet above (please refer to the git repo for the full code). As a matter of convenience, we stack the feature sets into a single matrix, but keep the boundary indexes so that each model may be directed to the correct set.
X_all = np.hstack([X_vgg, X_resnet, X_incept, X_xcept])
inx = np.cumsum([0] + [X_vgg.shape[1], X_resnet.shape[1], X_incept.shape[1], X_xcept.shape[1]])
We will use the great mlxtend extension library, which makes stacking exceedingly easy. For each of the four base classifiers, we construct a pipeline that consists of selecting the appropriate features, followed by a LogisticRegression.
pipes = [make_pipeline(ColumnSelector(cols=list(range(inx[i], inx[i + 1]))), base_classifier())
         for i in range(4)]
The stacking classifier is defined and configured to use the average probabilities provided by each of the base classifiers as the aggregation function.
stacking_classifier = StackingClassifier(classifiers=pipes,
                                         meta_classifier=meta_classifier(),
                                         use_probas=True,
                                         average_probas=True,
                                         verbose=1)
Finally, we are ready to test the stacking approach:
results = cross_val_score(stacking_classifier, X_all, y, cv=3, n_jobs=-1)
print(results)
print("Overall accuracy: {:.3}".format(np.mean(results) * 100.))

[ 0.74221322  0.74194367  0.75115444]
Overall accuracy: 74.5
With this method of stacking of individual pre-trained model-based classifiers, we obtain 74.5% accuracy—a substantial improvement over the single ResNet model (one could try each of the other models on their own in the same way to see how they compare).
In summary, this blog post describes the method of using multiple pre-trained models as feature extraction mechanisms, and a stacking method to combine them, for the task of image classification. This method is simple, easy to implement, and most often produces surprisingly good results.
_This post is a collaboration between O’Reilly and TensorFlow. See our statement of editorial independence._

[1] Running data through deep learning models, like deep learning in general, is usually done on a GPU. However, if you are using a low-end laptop GPU, some of the models we use here might not fit in memory, leading to an out-of-memory exception. If this is the case, you should force TensorFlow to run everything on the CPU by putting everything deep-learning related under a with tf.device("/cpu:0"): block.

_Analog Computing, Program Synthesis, Midwestern Investment, and Speed Email_
* A New Analog Computer (IEEE) -- _Digital programming made it possible to connect the input of a given analog block to the output of another one, creating a system governed by the equation that had to be solved. No clock was used: voltages and currents evolved continuously rather than in discrete time steps. This computer could solve complex differential equations of one independent variable with an accuracy that was within a few percent of the correct solution._
* Barliman -- _Barliman is a prototype "smart editor" that performs real-time program synthesis to try to make the programmer’s life a little easier. Barliman has several unusual features: given a set of tests for some function foo, Barliman tries to "guess" how to fill in a partially specified definition of foo to make all of the tests pass; given a set of tests for some function foo, Barliman tries to prove that a partially specified definition of foo is inconsistent with one or more of the tests; given a fully or mostly specified definition of some function foo, Barliman will attempt to prove that a partially specified test is consistent with, or inconsistent with, the definition of foo._
* Investing in the Midwest (NYT) -- Steve Case closes a fund backed by every tech billionaire you’ve heard of, for investing in midwestern businesses. _Mr. Schmidt of Alphabet said he was sold on the idea from the moment he first heard about it. “I felt it was a no-brainer,” he said. “There is a large selection of relatively undervalued businesses in the heartland between the coasts, some of which can scale quickly.”_
* Email Like a CEO -- see also How to Write Email with Military Precision. (via Hacker News)
Continue reading Four short links: 5 December 2017.

_Find out how to get your voice heard and bring a positive impact to a receptive audience._
The 2018 Fluent CFP closes in a few days — on December 8, 2017. Don’t be alarmed! You still have plenty of time to send in your proposal. Here’s why I really, really hope you do.
1. OUR INDUSTRY ALWAYS NEEDS NEW VOICES
I can’t even begin to express how much we want to see fresh faces and hear new voices on our stages. Yes, we all love seeing talks by industry rockstars (and we’ll definitely have our share of those), but what makes Fluent an important event is that it’s also a launchpad for the next generation of industry leaders. (Hint: This means you.)
Continue reading 3 reasons why you should submit a proposal to speak at Fluent 2018.

_Campaign Cybersecurity, Generated Games, Copyright-Induced Style, and Tech Ethics_
* Campaign Cybersecurity Playbook -- _The information assembled here is for any campaign in any party. It was designed to give you simple, actionable information that will make your campaign’s information more secure from adversaries trying to attack your organization—and our democracy._
* Games By Angelina -- _The aim is to develop an AI system that can intelligently design videogames, as part of an investigation into the ways in which software can design creatively._ The creator’s GitHub account has some interesting procedural generation projects, too. (via MIT Technology Review)
* Every Frame a Painting -- _Nearly every stylistic decision you see about the channel—the length of the clips, the number of examples, which studios’ films we chose, the way narration and clip audio weave together, the reordering and flipping of shots, the remixing of 5.1 audio, the rhythm and pacing of the overall video—all of that was reverse engineered from YouTube’s Copyright ID. [...] So, something that was designed to restrict us ended up becoming our style. And yet, there were major problems with all of these decisions. We wouldn’t realize it until years later, but by creating such a simple, approachable style that skirted the edge of legality, we pretty much cut ourselves off from our most ambitious topics._
* Love the Sin, Hate the Sinner (Cory Doctorow) -- the best review of Tim’s new book that I’ve seen. _[T]he reason tech went toxic was because unethical people made unethical choices, but those choices weren’t inevitable or irreversible._
Continue reading Four short links: 4 December 2017.

_Creepy Kid Videos, Cache Smearing, Single-Image Learning, and Connected Gift Guide_
* /r/ElsaGate -- Reddit community devoted to understanding and tackling YouTube’s creepy kid videos, from business models to software used to create them.
* Cache Smearing (Etsy) -- to solve the problem where one key is so hot it overloads a single cache server, a technique for turning a single key into multiple keys so the load can be spread over several servers. (A brief sketch follows this list.)
* Deep Image Prior -- _Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, superresolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them, and to restore images based on flash/no flash input pairs._
* Privacy Not Included (Mozilla) -- shopping guide for connected gifts, to help you know whether they respect your privacy or not. (most: not so much)
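To illustrate the cache-smearing idea, here is a minimal sketch of my own (not Etsy’s implementation); the key format, copy count, and generic cache client are all assumptions:

import random

N_COPIES = 8  # assumed fan-out; the real value would be tuned per key

def smeared_key(key):
    # Turn one hot key into one of N synthetic keys; a hashing cache
    # client will then distribute the copies across several servers.
    return "%s:%d" % (key, random.randrange(N_COPIES))

def write_smeared(cache, key, value):
    # Writes must update every copy so reads stay consistent.
    for i in range(N_COPIES):
        cache.set("%s:%d" % (key, i), value)

def read_smeared(cache, key):
    # Reads hit one copy at random, spreading the load.
    return cache.get(smeared_key(key))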
Continue reading Four short links: 1 December 2017.

_Object Models, Open Source Voice Recognition, IoT OS, and High-Speed Robot Wars_
* Object Models -- _a (very) brief run through the inner workings of objects in four very dynamic languages_. Readable and informative. (via Simon Willison)
* Mozilla Releases Open Source Voice Recognition and Voice Data Set -- _we have included pre-built packages for Python, NodeJS, and a command-line binary that developers can use right away to experiment with speech recognition._ The data set features samples from _more than 20,000 people, reflecting a diversity of voices globally_.
* FreeRTOS -- Amazon adds sync and promises OTA updates. Very clever from Amazon: this is foundational software for IoT.
* Japanese Sumo Robots (YouTube) -- omg, the speed of these robots. (via BoingBoing)
Continue reading Four short links: 30 November 2017.

_New technology is allowing doctors to diagnose and treat heart disease faster and more efficiently._
Detection of heart sounds in the early 1800s and before was limited to direct contact between the physician’s ear and the patient’s chest. Auscultation changed markedly in 1816 with René Laënnec’s invention of the stethoscope, though the wooden tube prototype was far from the bi-aural apparatus of modern medical practice. Throughout the years, the stethoscope has witnessed minor adjustments to improve material quality and ease of use; however, the fundamental design has remained largely the same. Biotech startup Eko addresses this stagnancy with a digital take on a tool utilized by over 30 million clinicians around the world.
Like many pioneers, Berkeley-based company Eko began with a question: if there is a technical gap in cardiology, how can it be resolved? Its answer came in the form of a digital stethoscope known as CORE, which transmits heart sound data straight to a clinician’s compatible device. While CORE represented an unprecedented improvement in auscultation, according to cofounder and COO Jason Bellet, the company’s newest device provides an equally, if not more, powerful tool.
Continue reading DUO: Connecting the home to the hospital.

_Scale changes the problems of privacy, security, and honesty in fundamental ways._
For the past decade or more, the biggest story in the technology world has been scale: building systems that are larger, that can handle more customers, and can deliver more results. How do we analyze the habits of tens, then thousands, then millions or even billions of users? How do we give hordes of readers individualized ads that make them want to click?
As technologists, we’ve never been good at talking about ethics, and we’ve rarely talked about the consequences of our systems as they’ve grown. Since the start of the 21st century, we’ve acquired the ability to gather and (more important) store data at global scale; we’ve developed the computational power and machine learning techniques to analyze that data; and we’ve spawned adversaries of all creeds who are successfully abusing the systems we have created. This “perfect storm” makes a conversation about ethics both necessary and unavoidable. While the ethical problems we face are superficially the same as always (privacy, security, honesty), scale changes these problems in fundamental ways. We need to understand how these problems change. Just as we’ve learned how to build systems that scale, we need to learn how to think about ethical issues at scale.
Let’s start with the well-known and well-reported story of the pregnant teenager who was outed to her parents by Target’s targeted marketing. Her data trail showed that she was buying products consistent with being pregnant, so Target sent her coupon circulars advertising the baby products she would eventually need. Her parents wondered why their daughter was suddenly receiving coupons for disposable diapers and stretch-mark cream, and drew some conclusions.
Many of us find that chilling. Why? Nothing happened that couldn’t have happened at any small-town pharmacy. Any neighborhood pharmacist could notice that a girl had added some weight and was looking at a different selection of products. The pharmacist could then draw some conclusions, and possibly make a call to her parents. The decision to call would depend on community values: in some cultures and communities, informing the parents would be the pharmacist’s responsibility, while others would value the girl’s privacy. But that’s not the question that’s important here, and it’s not why we find Target’s action disturbing.
The Target case is chilling because it isn’t a case about a single girl and a single pregnancy. It’s about privacy at scale. It’s a case about everyone who shops at any store larger than a neighborhood grocery. The analysis that led Target to send coupons for diapers is the same analysis they do to send coupons to you and me. Most of the time, another piece of junk mail goes into the trash, but that’s not always the case. If a non-smoker buys a pack of cigarettes, do their insurance rates go up? If an alcoholic buys a six-pack, who finds out? What can be gathered from our purchase histories and conveyed to others, and what are the consequences? And who is making decisions about how to use this data?
When _nobody_ can presume that their actions are private, we’re in a different ethical world. The actions of a human pharmacist aren’t comparable to the behavior of systems that put everyone’s privacy at risk. Our obsession with scale amplifies problems that might be innocent enough if they could be addressed individually. An individual’s need for privacy may depend on context and personal choice; scale ignores both context and choice. Scale creates a different set of ethical problems—and it’s a set of problems we haven’t thought through.
There are several aspects of the Target case (and cases like it) that deserve thought. First, who is responsible? It is difficult, if not impossible, to talk about ethics without agents who are accountable for their decisions. A local pharmacist can make a decision, and can bear responsibility for that decision. In Target’s case, though, the circular was sent by a piece of software that had no idea that it was engaging in problematic behavior. It was doing what it was supposed to do: analyzing buying patterns and sending coupons. The word “idea” itself is revealing: software doesn’t have “ideas,” but we instinctively feel the need to assign agency to something, some actor that makes an informed decision. Is the programmer who built the system accountable for how it is used? Is the data scientist who created the model? It’s unlikely that either the programmer or the data scientist has any idea what the system is actually doing in the real world, and certainly they have no control over how it is deployed. Is the "management" that ordered the system and specified its behavior responsible? That sounds more concrete, but scratch the surface and you’ll find a murky collective; one might as well say "the stockholders."
Second, exposing a pregnant teenage girl to her parents was clearly an "unforeseen consequence": nobody designed the system to do that. Programmers, analysts, managers, and even stockholders certainly need to think more about the consequences of their work; all too often, unforeseen consequences could have been foreseen. However, I can’t be too hard on people for not imagining all possible consequences. The possible consequences of any action easily spin out to infinity, and expecting humans to anticipate them invites paralysis.
Third, collecting and using personal data isn’t entirely negative: collecting medical data from millions of patients can lead to new treatments, or to earlier ways of detecting serious diseases. What if the teenager’s buying patterns indicated that she was self-medicating for a serious medical condition, such as preeclampsia? Does that merit an automated intervention? There are good ways to use data at scale, and they can’t be cleanly separated from the bad ways.
I don’t want to presuppose any answer to these questions; ethics is ultimately about discussion and dialog, rather than any one person’s opinion. I do want to suggest, though, that scale changes the issues. We need to start thinking about ethics at scale.
Here’s another example: any decent thief can pick the lock on your house. We know how to think about that. But an attack against internet-enabled locks could potentially unlock all the locks, anywhere in the world, simultaneously. Is that somehow a different issue, and if so, how do we think about it? (While I was writing this, news came out of the first attack against Amazon’s Key service.)
Building an ethical argument around the legal system is dubious, but that may give us a way in. I doubt that you could sue the lock manufacturer if someone picked the lock on your front door. That falls into the "shit happens" category. You could possibly sue if the lock was faulty, or if it had a particularly shoddy design. But almost any lock can be picked by someone with the right tools and skills. However, I can easily imagine a class action lawsuit against a lock manufacturer whose locks were "picked" _en masse_ because of a software vulnerability. Anthem Blue Cross has agreed to pay millions of dollars to settle lawsuits over a data breach; people are lining up to sue Equifax. Would a lock manufacturer be any different?
As in the Target case, we see that agency is obscure; we don’t know who’s responsible. We’re almost certainly dealing with unforeseen consequences, and on many levels. An attack against a smart lock could take advantage of vulnerabilities in the lock itself, the vendor’s data center, the homeowner’s cell phone, the locking app, or even the cell phone provider. The failure could be the consequence of an incredibly subtle bug, a forgotten security update, or a default password. Whatever the cause, the failure is amplified to internet scale. While it’s not clear who would bear responsibility if the world’s smart locks were hacked, it is clear that we need to start thinking about safety at a global scale, and that thought process is qualitatively different from thinking about individual locks.
A final example: "fake news" isn’t a new thing. We’ve all known cranks who believe crazy stuff and waste your time telling you how they were abducted by aliens. We’ve smiled at the grocery store tabloids. What’s scary now is fake news at scale: it’s not one crank wasting one person’s time, but one crank whose idea gets propagated to literally billions. Except that this news doesn’t come from a crank, but from a professional agent of a hostile government. And the "people" passing the news along aren’t people; they’re sock puppet identities, bots created purely for the purpose of propagating misinformation. And the scale at which this takes place transcends even the most powerful press.
As danah boyd said at her Strata NY keynote, this is no longer a simple social media issue; it’s a security issue. What happens when you poison the data streams that feed the "artificial intelligences" that tell us what to read? I have an ethical commitment to free speech; but when free speech becomes a computer security issue, it’s a different game. I can defend someone’s right to propagate absurd news stories without approving their conduct. But how do we think about intentionally propagating deceptive speech at scale? I won’t defend someone’s right to log into my computer systems and modify data without my permission; should I defend someone’s right to poison the data streams that determine which stories Google and Facebook send to their readers? What are the responsibilities of those who build and maintain those data streams? The ethics of scale around "fake news" certainly needs to account for the platforms (such as Facebook) that are, as Renée DiResta has said, "tailor-made for a small group of vocal people to amplify their voices."
Whether the issue is privacy, safety, honesty, or any other issue, the ability of our systems to amplify problems to internet scale changes the problem itself. I could have come up with many examples. Banks routinely deny loans, and it’s certainly unethical to loan money to someone who won’t be able to repay, or to refuse loans for reasons unrelated to the applicant’s ability to pay, but what happens when loan applications are denied at scale? Are entire classes of people treated unfairly? Are loans routinely denied to people who come from certain neighborhoods, work at certain occupations, or have certain medical conditions? Informers used to identify opponents of a political regime one at a time; now, face recognition can potentially identify every attendee at a protest or a rally. These problems are superficially the same as they were decades ago—but when scaled, they change completely.
The ethics of scale differs at least in part because of the "fellow travelers" that we’ve seen: the problems of hidden agency and unforeseen consequences. The tools that we use to achieve scale, by nature, hide agency. Is a judge responsible for sentencing a prisoner, or is that responsibility given to an algorithm that hides control? Do responsible humans create advertising campaigns, or do we delegate those tasks to software? If an algorithm rejects a credit application, who ensures that the decision was fair and unbiased? We can’t address the ethics of scale without talking about the people—not the algorithms—responsible for decisions, and we are right to be wary of systems for which no one seems accountable.
The problem of unforeseen consequences is perhaps the greatest irony of the connected internet age. The internet itself is nothing but an unforeseen consequence. Back in the 1970s, it was an interesting DARPA-funded experiment. None of the internet’s inventors could have foreseen its future, and they would probably have designed it differently if they had. Back in the early 1990s, when the public internet was young, it was supposed to bring about world peace by facilitating communication and understanding; and just over a decade later, we proudly proclaimed that social media enabled the Arab Spring. Those of us who shared that naivete also share responsibility: a less naive culture might, in due time, have created a Facebook or a Twitter that wasn’t so vulnerable to "fake news." Indeed, everything from the Morris worm and the first email spam to the Equifax attack is an unforeseen consequence.
It’s not possible to foresee all consequences, let alone eliminate them, and obsessing over those consequences may well paralyze us and prevent us from doing good. The novelty of any invention makes it even more difficult to predict how the consequences will play out; who would have thought that Mitt Romney’s remark about “binders full of women” would have started an internet meme? But thinking about ethics and participating in an ethical discussion about software at scale requires us to foresee some of those consequences, and think about their effects before all of us become the victims. And as time goes on, we need to become less naive. Once we’ve seen how the reaction to a chance remark can propagate like wildfire through social media, and even into Amazon product reviews, we should be aware of how our systems can be manipulated and gamed.
Immanuel Kant’s “categorical imperative” may help us to think about ethics at scale. "Act according to the principle which you would want to be a universal law" says that we should think carefully about the kind of world we are creating. Are we building systems optimized to maximize profit for a small group of stakeholders, or are we building a system that will be better for all of humanity? What are the consequences of our actions and creations to individuals, but also multiplied to all the inhabitants of our world? We need to look at bigger pictures, and we certainly should be more skeptical about our abilities than we were in the early days of the internet. The ACM’s Code of Ethics and Professional Conduct is a good starting point for a discussion, as are organizations such as Data & Society and NYU’s AI Now. Many colleges and universities now offer classes on data ethics. But many people in the software industry have yet to join the discussion.
We need to think boldly about the concrete, everyday problems we face, many of which are problems of our own making. We’re not talking about a hypothetical AI that might decide to turn the world into paper clips. We’re talking about systems that are already working among us, defining the world in which we live—and wasting our time on arcane hypothetical issues will only prevent us from solving the real problems. As Tim O’Reilly says, our Skynet moment has already happened: we live in a web of entangled, partially intelligent systems, designed to maximize objective functions that were designed with no thought for our well-being.
It’s time to put those systems under our control. It’s time for the businesses, from Google and Facebook (and Target) to the newest startups, to realize that their future isn’t tied up in short-term profits, but in building a better world for their users. Their business opportunity doesn’t lie in building echo chambers or in placing personalized ads, but in helping to create a world that’s more fair and just, even at scale. To build better systems and businesses, we need to become better people. And to be better people, we must learn to think about ethics: not just personal ethics, but ethics at scale.
Continue reading Ethics at scale.

_The O’Reilly Media Podcast: Gayle Sheppard, Saffron AI Group at Intel, and David Thomas, Bank of New Zealand._
In this episode of the O’Reilly Media Podcast, I spoke with Gayle Sheppard, vice president and general manager of Saffron AI Group at Intel, and David Thomas, chief analytics officer for Bank of New Zealand (BNZ). Our conversations centered around the utility of artificial intelligence in the financial services industry.
ASSOCIATIVE MEMORY AI: ENABLING MACHINES TO FIGHT FINANCIAL CRIME WITH HUMAN-LIKE REASONING
According to Sheppard, associative memory AI technologies are best thought of as reasoning systems that combine the memory-based learning seen in humans—recognizing patterns, spotting anomalies, and detecting new features almost instantly—with data. Compared to traditional machine learning methods, Sheppard says, associative memory AI unifies multiple data sources—both structured and unstructured—without relying on pre-defined models. Associative memory AI reasons on that data to deliver insights quickly, accurately, and with less training data—in some cases, using as little as 20% of the available training set. Furthermore, transparency is built into the fabric of associative memory AI, so one can more easily explain the system’s path to insight.
Applications of associative memory AI in the enterprise are varied. “Our strategy is to build comprehensive decision systems for financial services, supply chain management, and manufacturing and defense. ... These systems combine what we think are the best of learning approaches, such as deep learning, traditional statistical machine learning, associative learning, and others. Our goal is to deliver a sum that is much greater than its individual parts.” Intel has developed a sharp focus on the financial services industry, with its October launch of the Intel Saffron Anti-Money Laundering (AML) Advisor. Sheppard described four challenges and opportunities that Intel sees in the financial services industry:
* Financial institutions—specifically banks and insurers—collect data at massive scale, and this data is expected to double every two years. Human and machine-generated data is growing 10 times faster than traditional business data.
* Structured, transactional data has dominated their systems, but almost everything in banking is customer-pattern based, or unstructured. Excluding this data from analysis significantly reduces the relevance of its outcomes.
* Models become stale quickly, but criminals are constantly evolving. It can take up to nine months to update statistical models, from design to test, for banking customers. That’s _before_ these new models can even be put into production. “The fastest time for the industry to deploy simple model changes is considered five to six months. That’s a lot of time for crime to run rampant for these institutions,” said Sheppard.
* Financial organizations can store data on several hundred systems of record. How can customers efficiently access and analyze this data that’s spread across multiple locations? It’s incredibly time consuming and expensive to move that data, particularly in an industry governed by tight regulations.
If associative memory AI can address these four key challenges, then where does that leave investigators, analysts, and auditors—the human decision-makers? As Sheppard explains, AI technologies go to work for the human element. They find similarity/anomaly/novelty patterns, make recommendations and predictions, and provide explanations behind their insights in order to make the human decision better, faster, and more confident. “This full investigative analytic capability that we can provide for humans plus the evidence to explain why we believe a certain course of action should be taken is a terrific productivity aid to the human investigator.”
TAKING BANKING INTO THE FUTURE WITH LEAN ANALYTICS AND AI
Thomas extended the narrative by talking about the evolving business model for financial services, how banks are navigating increasingly strict regulations, and how BNZ is leveraging data and modern tools to engage with its customers. We discussed the opportunities for AI in this new landscape, and how BNZ is using natural language processing, cognitive computing, and neural networks to empower their customers to meet their financial goals. Thomas noted that Intel has been instrumental in not only configuring BNZ’s architecture, but also in thinking about their customer strategy. BNZ is participating in Intel Saffron’s Early Adopter Program, which was recently launched with the goal of targeting institutions that aim to lead the pack on innovation in financial services.
Below are some highlights from our conversation:
INSIGHTS FROM UNSTRUCTURED DATA

I think the use of unstructured data is increasing at quite a pace. Historically, we have had very structured data, but we’ve now set up our Cloudera Hadoop-based data lake, are starting to use natural language processing more and more, and are utilizing bots to understand what our customers are saying. The insight you can get from what customers have actually said, as opposed to what they’ve said in research or in a survey, just takes it to a whole new level of getting closer to the customer.

PROVEN APPROACHES

In terms of the tools, probably the things we’ve tried the most and really appreciated are: we have changed some operating models (we are using lean analytics, which is a lot of hypothesis-driven work, utilizing data scientists but combining them with customer experience teams), taking on bite-sized customer problems or customer opportunities, and completing three-week sprints. That’s been really successful for us. If you can do a three-week sprint and come out with eight or nine prototype interventions that have been tested with customer data, you can move the organization at quite a pace. We have also started developing bots and playing with those on both the sales and service side, and we’re investing heavily in our digital teams, supported by analytics, and thinking quite a lot about the global evolution toward digital banks and how you blend that with human interactions as well.

NAVIGATING THE PACE OF CHANGE

The first thing we had to do, particularly working with the board, is define what AI is. ... And we did need to present it as almost a progression, broken into chunks—natural language processing, cognitive computing, neural networks, and machine learning. ... The biggest lesson we’ve certainly had as an organization is that information or data insights, or whatever variation you call it, has got to be an ongoing build. All of our strategies look like staircases. We’re very, very mindful that the pace of change is substantial and we need partners to stay up with that change, but we need to be constantly moving forward.

_Avoiding State Surveillance, Parallel Algorithms, Smart Tactics, and Voting Security_
* The Motherboard Guide to Avoiding State Surveillance -- a lot of good advice, even if you’re not at risk from a nation state (e.g., don’t run your own mail server).
* A Library of Parallel Algorithms (CMU) -- what it says on the box. See also CMU’s "Algorithm Design: Parallel and Sequential" book.
* EFF’s Clever Tactic (Cory Doctorow) -- _when you argue about DRM, the pro-DRM side always says that all this stuff is an unfortunate side-effect of the law, and that they’re really only trying to stop pirates, promise and cross my heart. So, here’s what we did at the W3C: we proposed a membership rule that would allow members to use DRM law to sue anyone who infringed their copyrights—but took away their rights to sue people who were breaking DRM for some other reason, like adapting works for people with disabilities, or investigating critical security flaws, or creating legal, innovative new businesses._ Needless to say, they didn’t go for that proposal, which revealed their true motives.
* Cybersecurity of Voting Machines (Matt Blaze) -- his written testimony before Congress. _I offer three specific recommendations: (1) Paperless DRE voting machines should be immediately phased out from U.S. elections in favor of systems, such as precinct-counted optical scan ballots, that leave a direct artifact of the voter’s choice. (2) Statistical “risk limiting audits” should be used after every election to detect software failures and attacks. (3) Additional resources, infrastructure, and training should be made available to state and local voting officials to help them more effectively defend their systems against increasingly sophisticated adversaries._
Continue reading Four short links: 29 November 2017.

_O’Reilly Media Podcast: David Hsieh, of Qubole, in conversation with John Slocum, of MediaMath._
In a recent episode of the O’Reilly Media Podcast, David Hsieh, senior vice president of marketing at Qubole, sat down with John Slocum, vice president of MediaMath’s data management platform (DMP), to discuss DataOps in the media industry. “DataOps” refers to the promotion of communication between formerly siloed data, teams, and systems. As discussed in _Creating a Data-Driven Enterprise with DataOps_, a report published by Qubole and O’Reilly in 2016, DataOps leverages process change, organizational realignment, and technology to facilitate relationships between everyone who handles data: developers, data engineers, data scientists, analysts, and business users. As a programmatic advertising platform, MediaMath has a unique lens into the shifting business models across the media industry, and how DataOps is playing a role in those shifts.
During the podcast, Hsieh and Slocum discussed how data has transformed the culture and overall goals of organizations in the media industry in the past 10 years, and shared some best practices for companies that are just embarking on their journey toward becoming data driven.
Here are some highlights from the conversation:
GREATER FOCUS ON OUTCOMES, RETURN ON INVESTMENT

What we’re seeing specifically evolve over the past 10 years since the introduction of programmatic is that clients are more focused on outcomes than they previously were. Outcomes being return on marketing investment, return on spend, whereas previous goals might have just been to spend a particular budget. It might have been driving a particular number of clicks or visitors and driving reach, but with data, incorporating that into our analytics and optimization in our platform, we’re able to get a sense of what clients should expect to achieve and help them achieve that. We find our most sophisticated clients are able to differentiate themselves from their competition. We see data providing that differentiation.

MediaMath has long offered simple aggregated reporting that will help advertisers understand the performance they’re seeing in their campaigns. ... That was certainly good enough the first few years of MediaMath’s operation, but what we started to see maybe four or five years ago was a lot of demand for more granular insight. More custom insight. Some of our more sophisticated clients were asking for the ability to see performance by a specific sample of audience data. They may not want to see performance aggregated in a particular campaign or a strategy, but they might want to sample performance elsewhere and look for audiences, which may not even be specific, defined audience segments, to see what’s popping. ... That really requires storing the data at a user level.

We saw this thirst for more answers that the account team suspected they could find in the data, and they just needed a tool set to access that data and work with it, and they couldn’t wait for the analytics team at the time, or data engineers, to answer all of those questions for them. There was really a hunger for self-service access to this data, which Qubole started providing in the form of an analytics platform that somebody who wasn’t a data engineer could work with and use effectively to start asking and answering those questions of the data. I think that’s common in data-driven organizations.
There are definitely some ETL processes on the incoming data that we need to do—some data scrubbing, some compression, some processing prior to exposing that user-level data back in our platform. We’re using AWS services to manage the data, specifically, so we’re able to permission that data appropriately using roles and assigning the proper user privileges on top of that data. We don’t want to take our source of truth and expose it across the organization to everyone with read/write/delete privileges, because we don’t want anyone messing up that data set. So, thinking that through, ensuring that you have some admin control over the data sets you’re exposing, and then a basic user group or user role that keeps the newer or less practiced users from getting into trouble, is important. Then work with those users as well: identify the folks who need or want access to this tool, and train them on the basics. Write a query using partitions on the data so you’re not kicking off massive MapReduce jobs that are going to tie up your cluster for the rest of the afternoon. Little things like that, which seem kind of simple to a more experienced analyst, are important to communicate out to a larger user base.
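As a hypothetical illustration of that partitioning advice (the table and column names here are invented, not MediaMath’s), a query that filters on the partition column scans a single day of data rather than the full history:

# Hive-style query string; event_date is assumed to be the partition column.
query = """
    SELECT user_id, COUNT(*) AS impressions
    FROM impressions_log
    WHERE event_date = '2017-11-27'  -- partition filter prunes the scan
    GROUP BY user_id
"""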

START AT THE BEGINNING: KNOW THE PROBLEM YOU’RE TRYING TO SOLVE

I think the most important thing to think about before pursuing a data-driven approach is the problem that you’re trying to solve. Ultimately, what are you trying to do with data, what do you think you can do with data, or what do you suspect you can do with data? You might not know the exact answer to that question, and that’s totally fine. For MediaMath, what we think we can do with data is drive outcomes and performance for our clients, and we have a variety of questions that stem from that overarching goal. ... Who do you want helping you get to that objective? How do you see that happening? What tool(s) do you need to get there, and what’s the approach that you want to take? The rest is execution.

_Code for One, Grid Component, Tinder Data, and Engineering Reorg_
* Structure -- _He wrote Structur. He wrote Alpha. He wrote mini-macros galore. Structur lacked an “e” because, in those days, in the Kedit directory eight letters was the maximum he could use in naming a file. In one form or another, some of these things have come along since, but this was 1984 and the future stopped there. Howard, who died in 2005, was the polar opposite of Bill Gates—in outlook as well as income. Howard thought the computer should be adapted to the individual and not the other way around. One size fits one. The programs he wrote for me were molded like clay to my requirements—an appealing approach to anything called an editor._ Personalized software is a wonderful luxury. Programmers forget how rare it is. (via Clive Thompson)
* React Data Grid -- open source _Excel-like grid component built with React_.
* What Tinder Knows (Guardian) -- the UK laws that let you request this data are wonderful; without them, we’d have little idea how much of our lives we reveal.
* How We Reorganized Instagram’s Engineering Team While Quadrupling Its Size (HBR) -- _Once we decided to reorg, the first thing we did was determine our desired outcomes as a team. We gathered our leadership in a room and came up with 20 different outcomes—from speed to cost efficiency—and prioritized them, No. 1 to No. 20. We picked our top five outcomes, which became our organizational principles: Minimize dependencies between teams and code; Have clear accountability with the fewest decision-makers; Groups have clear measures; Top-level organizations have roadmaps; Performance, stability, and code quality have owners._
Continue reading Four short links: 28 November 2017.

_A survey of usage, access methods, projects, and skills._
If you’re an IT professional, software engineer, or software product manager, over the past few years, you’ve likely considered using modern data platforms such as Apache Hadoop; NoSQL databases like MongoDB, Cassandra, and Kudu; search databases like Solr and Elasticsearch; in-memory systems like Spark and MemSQL; and cloud data stores such as Amazon Redshift, Google BigQuery, and Snowflake. But are these modern data technologies here to stay, or are they a flash in the pan, with the traditional relational database still reigning supreme?
In the Spring of 2017, Zoomdata commissioned O’Reilly Media to create and execute a survey assessing the state of the data and analytics industry. The focus was on understanding the penetration of modern big and streaming data technologies, how data analytics are being consumed by users, and what skills organizations are most interested in staffing. Nearly 900 people from a diverse set of industries, as well as government and academia, responded to the survey. Below is a preview of some of the insights provided by the survey.
MODERN DATA PLATFORMS HAVE ECLIPSED RELATIONAL DATABASES AS A MAIN DATA SOURCE
Of course, relational databases continue to be the core of online transactional processing (OLTP) systems. However, one of the most interesting findings was that when asked about their organization’s main data sources, less than one-third of survey respondents listed the relational database, with around two-thirds selecting non-relational sources. This is a clear indication that these non-relational data platforms have firmly crossed the chasm from early adopters into mainstream use.
Of further interest is the fact that just over 40% of respondents indicated their organizations are using what could be categorized as “modern data sources” such as Hadoop, in-memory, NoSQL, and search databases as a main data source. These modern data sources are optimized to handle what is often referred to as the “three V’s” of big data: very large data volumes; high velocity streaming data; and high variety of unstructured and semi-structured data, such as text and log files.
Drilling further into the details, analytic databases (19%) and Hadoop (14%) were the two most popular non-relational sources. Analytic databases are a category of SQL-based data stores such as Teradata, Vertica, and MemSQL that typically make use of column-store and/or massively parallel processing (MPP) to greatly speed up the kinds of large aggregate queries used when analyzing data. Hadoop, as many readers know, is a software framework used for distributed storage and processing of very large structured and unstructured data sets on computer clusters built from commodity hardware.
Download the full report to learn about other findings we uncovered in this survey, including:
* The proportion of organizations with big data projects in production and under development
* How important different levels of data freshness are to organizations
* The most popular streaming data platforms
* The leading technical skills for which organizations are staffing
* Whether organizations are consuming analytics via standalone BI apps or as analytic components embedded into other business applications and processes
_This post is part of a collaboration between O'Reilly and Zoomdata. See our statement of editorial independence._
Continue reading The state of data analytics and visualization adoption.

_Using analytics to improve your product doesn’t have to be complicated._
As a product manager, you need to understand who your users are in order to build a great product that meets your users’ needs. That is easy to say but much harder to do. One approach to understanding your users and finding new ways to improve your product is through analytics. However, analytics tools can seem complicated to configure and even once configured, much of the data included in the reports can seem overwhelming or, worse yet, meaningless.
Using analytics to improve your product doesn’t have to be complicated. There are simple ways to start, and you don’t have to measure everything. Instead, you want to use analytics tools to measure just a few aspects of your product’s performance and your user’s interests. Even a little bit of data collected and analyzed can help you make better decisions. To help you get started, here are five questions every product manager should ask about their users and their product, along with some ideas on how to begin answering these questions.
1. WHAT WORDS DO PEOPLE USE TO DESCRIBE YOUR PRODUCT?
One of the more critical things to measure is how the people who use your product talk about whatever it is your product does. At a basic level, these words offer clues to the problems users face and the types of solutions they are seeking. This helps you decide what features to build and what improvements to make. By knowing what users care about the most, you can also get ideas on how to prioritize the long list of tasks.
Knowing how your users speak can give you even more information. By knowing the words people use, you can get insight into the minds of your users. As you study how people speak, patterns will emerge about the user’s motivations and intentions for working with products like yours. How big is this problem they are facing? Is this a systemic problem or something new? Are your users optimistic or pessimistic about the problems your product solves? Do they view your product as an essential aspect of their lives or their company’s success?
When you get inside people’s minds, you can do a better job connecting with your users. As you adjust your product’s features to align with these thoughts, your users will feel like your product “gets” them because your product will talk and think the way they talk and think.
There are numerous ways to learn how your users speak—through analyzing forums, support tickets, and more. But doing some type of word analysis of forum posts, tickets, social shares, or other similar written conversations with users can get tricky fast. A simpler way to begin is to research the keywords people type into Google related to your product and your industry. Keyword research tools aren’t just for search engine marketers—they are a vital first step for product managers to understand something about your users and what they want from your product.
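As a toy sketch of that kind of word analysis (the ticket text below is invented, and a real analysis would want stemming and a fuller stopword list), even a simple frequency count starts to surface the vocabulary users lean on:

```python
from collections import Counter
import re

# Toy word analysis: count the terms users actually use in tickets or
# forum posts. The sample texts below are invented for illustration.
tickets = [
    "Can't export my report to CSV",
    "Export keeps timing out on large reports",
    "How do I schedule a weekly report export?",
]

STOPWORDS = {"my", "to", "on", "do", "i", "a", "the", "how", "keeps", "out"}

words = Counter()
for text in tickets:
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in STOPWORDS:
            words[word] += 1

print(words.most_common(5))  # e.g., [('export', 3), ('report', 2), ...]
```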
2. WHAT CONTENT OR PAGES ARE MOST VISITED OR ARE PEOPLE MOST INTERESTED IN?
Next, you need to know how people interact with the pages (or screens) within your product. By reviewing log files or analytics tools, you can review all the pages people accessed. At a basic level, this tells you how often people accessed those pages, which can help you prioritize updates. Pages accessed more often might deserve more attention or more updates. Along with this, you can also dig deeper into the analytics tools to see which pages people went to before or after a particular page, giving you a deeper understanding of how the pages within your product work together to create an experience for your users.
The biggest problem people run into when accessing page reports is that there are usually hundreds, thousands, or even hundreds of thousands of pages within your product. Your product might generate dozens of unique, dynamic URLs for each user based on specific configuration settings. This can be too many pages to comprehend, leaving the product team overwhelmed trying to figure out how to make sense of the report.
To make reviewing the pages accessed in your product more effective, you can utilize content grouping features available in web analytics tools. Pages related to Feature A are grouped here, support pages are grouped there, and the administrative pages are in their own group. By organizing your product’s pages in this way, you can move beyond trying to understand how people use individual pages and can instead understand how people use a whole section of pages. This makes your life simpler and helps you get to the relevant data points you need to view faster.
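If your analytics tool lacks a content-grouping feature, the same rollup is easy to approximate from raw pageview data. Here is a minimal sketch; the path prefixes, group names, and sample pageviews are all hypothetical:

```python
from collections import Counter

# Sketch of content grouping: roll individual page URLs up into a handful
# of named groups. Prefixes, group names, and pageviews are invented.
GROUPS = {
    "/search": "Feature A: search",
    "/alerts": "Feature B: alerts",
    "/help":   "Support",
    "/admin":  "Administration",
}

def group_for(path: str) -> str:
    # First matching prefix wins; anything unmatched falls into "Other".
    for prefix, group in GROUPS.items():
        if path.startswith(prefix):
            return group
    return "Other"

pageviews = ["/search?q=foo", "/search?q=bar", "/help/faq", "/admin/users"]
print(Counter(group_for(p) for p in pageviews))
```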
As you review the pages and screens people access, remember to take note of what isn’t in the reports. Too often, we get lost reviewing all the pages people visit, seeing which pages or groups of pages are most frequently viewed. But we want to find the places nobody goes as well. So, as you review the pages or groups of pages people did access, keep an eye out for what is missing, too. Maybe some pages are being ignored because users didn’t know the information contained on these pages was even available. Of course, people might know they could access pages but don’t because they aren’t interested in what those pages offer. Either way, by exploring the ignored areas, you learn something new about your product and your users.
3. WHAT FEATURES ARE PEOPLE MOST INTERESTED IN?
Along with looking at the pages people access, we want to know what features are used. Many product managers already track this at a high level. For instance, you might know that people use the feature letting them search your database, but nobody is using the feature to get email alerts related to those searches. That is an important first step for measuring product use.
Beyond knowing which features people use or don't use, we want to go deeper into the metrics to understand how each feature is used. If the feature relies on a form, which fields of the form do people use or not use? Which links or buttons within a particular part of the product are clicked, and which are ignored? For information-rich areas of your product, are people reading the information your product provides? These questions can be answered in a variety of ways, but one of the more effective means of addressing them is event tracking in analytics tools. The good news is that setting up basic event tracking requires modifying just a little bit of code. In other cases, you only need to copy in open source scripts, with no coding knowledge required. This makes it possible for even the least technical person to start using event tracking to monitor how people interact with each part of a particular feature.
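In web analytics tools the tracking snippet is usually a line or two of vendor JavaScript; as a vendor-neutral sketch of the same idea, here is a minimal server-side event logger in Python (the field names and log format are invented for illustration, not any particular vendor's schema):

```python
import json
import time

# Minimal stand-in for the event tracking described above: record which
# feature, which control, and which user, then aggregate the log later.
def track_event(feature: str, action: str, user_id: str,
                log_path: str = "events.log") -> None:
    event = {
        "ts": time.time(),
        "feature": feature,  # e.g., "saved-search"
        "action": action,    # e.g., "email-alert-toggle clicked"
        "user": user_id,
    }
    # One JSON object per line keeps the log easy to parse downstream.
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

track_event("saved-search", "email-alert-toggle clicked", "u123")
```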
Another approach is to use heatmap tools. Heatmaps create a visual report of where people are interacting or engaging with your product. By reviewing a heatmap, you can easily see what parts of the page users scrolled to and what links or buttons users clicked on the most. Like with the event tracking method, heatmaps help you understand how users engage with your product and how you could modify your product to encourage deeper engagement.
4. HOW MUCH TIME DO PEOPLE SPEND USING OUR PRODUCT?
A key part of product usage is the time people spend using your product. You want to avoid the most common mistake people make using this metric: assuming there is a “right” amount of time people should spend. Some products, when performing at their best, will require people to spend a lot of time interacting with the various features offered. In the case of other products, people spending a lot of time interacting with features may actually indicate a problem.
To avoid this mistake, decide how much time you would expect people to spend before you begin looking at how much time they actually do spend. Is your product designed for quick or long usage sessions? This is where surveying or interviewing your users can help you set your own benchmarks for how much time users should spend when working with your product. From there, you can use the time metrics to determine how your users are interacting with your product.
As you measure time, keep in mind there are two numbers you’ll want to track: total and active time. Let’s go through an example. People may open up your web app, interact with it a little, then browse away to do something else while leaving your product open in a background tab. They might return later to use it some more. The time people are interacting with your product is the active time. Total time is the active time added to the time people had your product open in the background.
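A small worked example makes the distinction concrete. Assuming a session is a list of interaction timestamps, and treating any gap longer than an idle threshold as background time (both the timestamps and the threshold below are invented):

```python
# Total time spans first-to-last event in the session; active time sums
# only the gaps short enough to count as real interaction.
IDLE_THRESHOLD = 60  # seconds; beyond this we assume a background tab

events = [0, 20, 45, 50, 600, 615, 640]  # seconds since session start

total_time = events[-1] - events[0]
active_time = sum(
    gap for gap in (b - a for a, b in zip(events, events[1:]))
    if gap <= IDLE_THRESHOLD
)

print(total_time)   # 640 -- includes the 550s spent in a background tab
print(active_time)  # 90  -- 20 + 25 + 5 + 15 + 25
```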
Knowing both of these numbers can help you gauge user behavior. Do people interact with your product and then leave, rarely (if ever) leaving your product open in the background? Or, are people almost always leaving your product open in the background while they go do something else? This begins to change how quickly you want to time people out of a session, forcing the user to log back in. Additionally, if many of your users leave your product open in a background tab only to return some time later, you may need to add reminders to help people remember where they left off when last interacting with your product.
5. HOW OFTEN DO PEOPLE USE OUR PRODUCT?
Finally, you want to know how frequently people interacted with your product. Do people log in or interact with your product multiple times per day? Or are they interacting with your product more irregularly? Here again, there is no “normal” frequency you should be working toward. Your product is unique, and the frequency of use will differ. For a product helping people complete daily tasks, you would expect to have daily product usage. But for a product helping people track something like monthly expenses, you might only see people return after a few weeks of not using the product at all.
As you think about the frequency of product usage, remember that it isn't just a login area, app, or website where people may interact with your product. Those are the easier places to measure frequency of use inside an analytics tool, but another place that is often forgotten is product usage within emails. When people receive emails from your product, such as alerts or notices, you need to track these interactions as well. Do people open the emails? Do they click on the links in the email? By tracking this information, you can capture all the ways people interact with your product and get a fuller picture of how frequently they use it.
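One simple way to get at frequency, sketched below with invented user IDs and dates, is to bucket every interaction (app sessions and email clicks alike) by user and ISO week:

```python
from collections import defaultdict
from datetime import date

# Sketch: interactions per user per week, counting email clicks as
# product usage too. Sample user IDs, dates, and channels are invented.
interactions = [
    ("u1", date(2017, 11, 1), "app"),
    ("u1", date(2017, 11, 3), "email-click"),
    ("u1", date(2017, 11, 20), "app"),
    ("u2", date(2017, 11, 2), "app"),
]

by_user_week = defaultdict(int)
for user, day, _channel in interactions:
    year, week, _weekday = day.isocalendar()
    by_user_week[(user, year, week)] += 1

for key, count in sorted(by_user_week.items()):
    print(key, count)  # e.g., ('u1', 2017, 44) 2
```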
FINAL THOUGHTS
As you begin using analytics to understand how people use your product, remember that every product is unique. Your job isn’t to measure yourself against an average product or some “ideal” standard. Rather, the process of measuring your product requires finding the right metrics that will help you understand how to make your product work better for your users. Don’t worry about measuring everything you possibly can—there is always more to measure. The five questions discussed above are intended to give you ideas on where to begin measuring your users and your product.
If you are looking for even more ideas on what to measure in your product, check out my video series, _Product Management Core Skills: Using Analytics To Inform Product Design_.
Continue reading 5 things every product manager should know how to measure.

_PV Growth, Digital Rights, Unit Testing, and Open Source Innovation_
* Photovoltaic Growth: Reality vs. Projections of the International Energy Agency -- that graph.
* Digital Rights in Australia -- _three aims: to assess the evolving citizen uses of digital platforms, and associated digital rights and responsibilities in Australia and Asia, identifying key dynamics and issues of voice, participation, marginalization and exclusion; to develop a framework for establishing the rights and legitimate expectations that platform stakeholders—particularly everyday users—should enjoy and the responsibilities they may bear; to identify the best models for governance arrangements for digital platforms and for using these environments as social resources in political, social, and cultural change._
* Unit Testing Doesn’t Affect Codebases the Way You Would Think -- nice approach to checking hypotheses like "unit testing results in fewer lines of code per method," with results (it doesn't).
* Capabilities for Open Source Innovation (Allison Randal) -- _Over the past decade, I’ve been researching open source and technology innovation, partly through employment at multiple different companies that engage in open source, and partly through academic work toward completing a Master’s degree and soon starting a Ph.D. The heart of this research is looking into what makes companies successful at open source and also at technology innovation. It turns out there are actually many things in common between the two._
Continue reading Four short links: 27 November 2017.