Tag Archives: data

Getting back into the swing of meetups again, last night I went to the MTLData meetup – a group of data scientists and enthusiasts who are looking to raise the profile of data science in Montreal. The event featured a panel discussion on the topic of ‘Build vs Buy?’ when considering software for data solutions.

Data as liquid

The issues were very familiar to me from considering EDRM and DAM systems, which made me think about the way data has changed as an asset, and how the management and security of data now have to take account of its ‘liquid’ nature. This adds another layer of complexity. Data still needs to be archived as a ‘record’ for many reasons (regulatory compliance, business continuity, archival value…) but for a data-driven organisation, the days of rolling back to ‘yesterday’s version of the database’ seem like ancient history. Data assets are also complex in that they are subject to many levels of continuous processing, so the software that manages the processing also has to be robust.

The metaphor of data flowing around the organisation like water seems especially telling. If there is a system failure, you can’t necessarily just turn off the tap of data, and so your contingency plans need to include some kind of ‘emergency reservoir’ so that data that can’t be processed immediately does not get lost and the flow can be re-established easily.
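To make the ‘emergency reservoir’ idea a little more concrete, here is a minimal sketch in Python. It is only an illustration of the principle, not something the panel described: the `Reservoir` class and the `process` function it wraps are hypothetical names, and a real system would keep the backlog on durable storage rather than in memory.

```python
from collections import deque


class Reservoir:
    """Holds records that cannot be processed yet, in arrival order.

    `process` is a hypothetical downstream function that raises an
    exception when the processing system is unavailable.
    """

    def __init__(self, process):
        self.process = process
        self.backlog = deque()  # the 'emergency reservoir' (in memory only here)

    def submit(self, record):
        """Accept a record; it waits in the backlog until it can be processed."""
        # Always queue first, so ordering is preserved behind older records.
        self.backlog.append(record)
        return self.drain()

    def drain(self):
        """Process waiting records oldest-first; stop at the first failure."""
        while self.backlog:
            try:
                self.process(self.backlog[0])
            except Exception:
                return False  # downstream still unavailable; nothing is lost
            self.backlog.popleft()  # remove only after successful processing
        return True
```

The design choice worth noticing is that a record is only removed from the backlog after it has been processed successfully, so a failure part-way through leaves the flow ready to be re-established from exactly where it stopped.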

Build vs Buy?

The issues highlighted by the panel included costs – available budget, restrictions from finance departments, balance between in-house and outsourced spending (again all familiar in EDRM and DAM procurement), privacy, security, ability to maintain a system, and availability of skills. Essentially, the decision comes down to balancing risks, which will be unique to each team and each business. In terms of deciding whether to build something in-house, availability of in-house resource is an obvious consideration, but Marc-Antoine stressed the importance of thinking through what added value a bespoke build could offer, as opposed to other ways the team could be spending their time. For example, if there are no off-the-shelf or open source products that match requirements, if there is value in owning the IP of a new product, if risks can be kept low, and resources are available, a build might be worthwhile.

There are risks associated with all three of the main options – a big vendor is less likely to go bust, but sometimes they can be acquired, sometimes they can stop supporting a product or particular features, and they can be very costly. Open source has the advantage of being free, but relies on ad hoc communities to maintain and update the code base, and how vibrant and responsive each specific community is, or will remain, can vary. Open source can be a good option for low risk projects – such as proof-of-concept, or for risk tolerant startups with plenty of in-house expertise to handle the open source code themselves.

AI future

The conversation diverged into a discussion of the future of AI, which everyone seemed to agree was going to become a standard tool for most businesses eventually. Jeremy noted that AI at the moment is being sought after for its scarcity value, to give early adopters an edge over the competition, while Maxime suggested that early advantage is likely to fade, just as it has with data science. Data analysis is now so ubiquitous, even small businesses are involved to a certain extent. Jeremy pointed out that it is hard to maintain a competitive edge based on the scarcity of data itself, as data can so easily be copied and distributed, but knowing how to make intelligent use of the data is a scarce commodity. Making connections and managing data in a very tailored specific way could even be a way for organisations to compete with Google, who have more data than anyone else, but are not necessarily able to answer all questions or have the most useful insights into specific problems.

The value of meaning

I was intrigued by this, as it validates the role of semantics – data without meaning is useless – and the importance of the imaginative and creative leaps that humans can make, as well as the moral and social reasoning that humans can bring. With reports of early AI systems reflecting existing biases and prejudices, and with disasters like the SimSimi chatbot causing social problems such as bullying amongst youngsters, the need for a real human heart to accompany artificial intelligence seems ever more important.

Scarcity of understanding?

Someone asked if the panel thought companies would soon need ‘Chief Intelligence Officers’ in the way that many now have ‘Chief Data Officers’. The panel did not seem particularly enthusiastic about the idea (“it sounds like something that you do with founders when you put them out to pasture”) but I think it would be a fascinating role. The BBC had someone to oversee ethics and advise on editorial ethics issues. Perhaps it is in the skills of a Chief Intelligence Officer – someone who can combine an understanding of how data, information, knowledge and wisdom interact, whether within IT systems or beyond, with an understanding of social issues and problems – that the scarcity value lies. Insight, imagination, and compassion could be the skills that will give the competitive edge. In the AI future, could a Chief Intelligence Officer make the difference between a company that succeeds by asking the right questions, not just of its data or its customers, but of itself, and one that fails?

I was particularly fascinated by the ideas suggested by Martin Dodge of mapping areas that are not “space” and what this means for the definition of a “map”. So, the idea of following the “path” of a device such as a phone through the electromagnetic spectrum brings a geographical metaphor into a non-tangible “world”. Conversely, is the software and code that devices such as robots use to navigate the world a new form of “map”? Previously, I have thought of code as “instructions” and “graphs” but have always thought of the “graph” as a representation of coded instructions, visualized for the benefit of humans, rather than the machines. However, now that machines are responding more directly to visual cues, perhaps the gap between their “maps” and our “maps” is vanishing.

Understanding how data mining works is going to become increasingly important. There is a huge gap in popular and even professional knowledge about what organisations can now do “under the surface” with our data. For a very clear and straightforward explanation of how social graphs work and why we should be paying attention, read Data Ghosts in the Facebook Machine.

Catching up on my reading, I found this post by Jonah Bossewitch: Pick a Corpus, Any Corpus and was particularly struck by his clear articulation of the growing information gulf between organizations and individuals.

I have since been thinking about the contrast between our localised knowledge organization systems and the semantic super-trawlers of the information oceans that are only affordable – let alone accessible – to the megawealthy. It is hard not to see this as a huge disempowerment of ordinary people, swamping the democratizing promise of the web as a connector of individuals. The theme has also cropped up in KIDMM discussions about the fragmentation of the information professions. The problem goes far beyond the familiar digital divide, beyond just keeping our personal data safe, to how we can render such meta-industrial scale technologies open for ordinary people to use. Perhaps we need public data mines to replace public libraries? It seems particularly bad timing that our public institutions – our libraries and universities – are under political and financial attack just at the point when we need them to be at the technological (and expensive) cutting edge.

We rely on scientists and experts to advise us on how to use, store and transport potentially hazardous but generally useful chemicals, radioactive substances, even weapons, and information professionals need to step up to the challenges of handling our new potentially hazardous data and data analysis tools and systems. I am reassured that there are smart people like Jonah rising to the call, but we all need to engage with the issues.

Back in the summer, I was very lucky to meet Jonah Bossewitch (thanks Sam!), an inspiring social scientist, technical architect, software developer, metadatician, and futurologist. His article The Bionic Social Scientist is a call to arms for the social sciences to recognise that technological advances have led to a proliferation of data. This is assumed to be unequivocally good, but is also fuelling a shadow science of analysis that is using data but failing to challenge the underlying assumptions that went into collecting that data. As I learned from Bowker and Star, assumptions – even at the most basic stage of data collection – can skew the results obtained, so any analysis of such data may well be built on shaky (or at the very least prejudiced) foundations. When this is compounded by software that analyses data, the presuppositions of the programmers, the developers of the algorithms, etc. stack assumption on top of assumption. Jonah points out that if nobody studies this phenomenon, we are in danger of losing any possibility of transparency in our theories and analyses.

As software becomes more complex and data sets become larger, it is harder for human beings to perform “sanity checks” or apply “common sense” to the reports produced. Results that emerge from de facto “black boxes” of calculation, based on collections of information so huge that no lone unsupported human can hope to grasp them, are very hard to dispute. The only possibility of equal debate is amongst other scientists, and probably only those working in the same field. Helen Longino’s work on science as social practice emphasised the need for equality of intellectual authority, but how do we measure that if the only possible intellectual peer is another computer? The danger is that the humans in the scientific community become even more like high priests guarding the machines that utter inscrutable pronouncements than they are currently. What can we do about this? More education, of course, with the academic community needing to devise ways of exposing the underlying assumptions and the lay community needing to become more aware of how software and algorithms can “code in” biases.

This appears to be a rather obscure academic debate about subjectivity in software development, but it strikes to the heart of the nature of science itself. If science cannot be self-correcting and self-criticising, can it still claim to be science?

A more accessible example is offered by a recent article claiming that Facebook filters and selects updates. This example illustrates how easy it is to allow people to assume a system is doing one thing with massed data when in fact it is doing something quite different. Most people think that Facebook’s “Most Recent” updates provide a snapshot of the latest postings by all your friends, and if you haven’t seen updates from someone for a while, it is because they haven’t posted anything. The article claims that Facebook prioritises certain types of update over others (links take precedence over plain text) and updates from certain people. Doing this risks creating an echo chamber effect, steering you towards the people who behave how Facebook wants them to (essentially, posting a lot of monetisable links) in a way that most people would never notice.

Another familiar example is automated news aggregation – an apparently neutral process that actually involves sets of selection and prioritisation decisions. Automated aggregations used to be based on very simple algorithms, so it was easy to see why certain articles were chosen and others excluded, but very rapidly such processing has advanced to the point that it is almost impossible (and almost certainly impractical) for a reader to unpick the complex chain of choices.

In other words, there certainly is a ghost in the machine, it might not be doing what we expect, and so we really ought to be paying attention to it.

I heard a talk by Ben Shneiderman about information visualisation yesterday for the Cambridge Usability Professionals Group. (It was ironic that I had a “locational usability” problem and was almost late, having made the novice error of trying to find Microsoft Research in the William Gates Building – which is indeed named after Microsoft’s Bill Gates foundation – but Microsoft Research in Cambridge was set up by Roger Needham, so it is in his building!)

The talk itself was very easy to follow with lively demonstrations of a number of visualisation tools. Shneiderman was very careful to point out that you need to have a good question and good data to get good results from information visualisation, and that it is no panacea, but when it works, it is fantastically powerful. One of the key strengths is that it makes it easy to spot outliers or anomalies in huge masses of data, particularly when there is a general underlying correlation. It is almost impossible to detect trends in a big spreadsheet full of numbers, but convert that into a visual form and the trends leap out. This means that you can see at a glance things like which companies’ stocks are rising when all the others are falling. Of course, graphs are nothing new, but the range of analytical tools that are now available means you can quickly pick out things like spikes and shapes in your data in a way that would have been painstaking previously. There are also very important applications in medical research and diagnosis, as the ability to track the order in which certain events happen helps researchers establish whether one condition causes another and could even be used to generate personal health alerts.
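The point about outliers standing out against an underlying correlation can also be sketched numerically. The following is my own minimal illustration, not one of the tools shown in the talk: it fits a straight line by ordinary least squares and flags the points that sit unusually far from it. The function names and the threshold of two standard deviations are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def find_outliers(xs, ys, threshold=2.0):
    """Return indices of points lying far from the fitted trend line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # every point is exactly on the line
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Ten points following y = 2x, except the point at index 5.
xs = list(range(10))
ys = [2 * x for x in xs]
ys[5] = 40
print(find_outliers(xs, ys))
```

A plot makes the same anomaly visible at a glance, which is really Shneiderman's argument; the calculation only confirms what the eye picks out.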

I liked the smart-money style treemaps (although the choice of red-green can’t be great for anyone who is colourblind). I found the marumushi newsmap fun, but not much more informative than traditional newspapers, mainly because the newsmap crams in more words than I can take in. Newspapers are really pretty good at writing headlines that work, and you can usually see at a glance what today’s top story is anyway – it’s the one in big letters at the top! However, if you need an aggregation of global news for international comparison, the newsmap does give you quick access to a lot of international sources all together.

One of the great pleasures of these events is getting to talk to other people who are there, and I met a fascinating researcher who had been monitoring the importance of stories by keyword frequency, showing that when something happens you get a burst of news activity around the relevant keywords, a ripple effect, and then it dies away. By looking at those patterns you can produce a measure of the impact of different events.
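I don't know how the researcher actually calculated this, but the general idea of a burst followed by a ripple can be sketched in a few lines of Python. Everything here is an assumption for illustration: the daily keyword counts are invented, the first week is taken as a "quiet" baseline, and the impact is simply the total number of mentions above that baseline on the burst days.

```python
from statistics import mean, pstdev


def find_bursts(daily_counts, baseline_days=7, threshold=3.0):
    """Return (burst_days, impact) for a series of daily keyword counts.

    The first `baseline_days` values are assumed to be a quiet period.
    A later day counts as part of a burst when it exceeds the baseline
    mean by more than `threshold` standard deviations.
    """
    quiet = daily_counts[:baseline_days]
    base, spread = mean(quiet), pstdev(quiet)
    cutoff = base + threshold * spread
    burst_days = [
        day
        for day in range(baseline_days, len(daily_counts))
        if daily_counts[day] > cutoff
    ]
    # Impact: total mentions above the normal level across the burst days.
    impact = sum(daily_counts[day] - base for day in burst_days)
    return burst_days, impact


# Invented counts: a quiet week, then a spike that ripples and dies away.
counts = [2, 3, 2, 3, 2, 3, 2, 40, 20, 8, 3, 2]
days, impact = find_bursts(counts)
print(days, round(impact, 1))
```

Comparing the impact figure for different keywords would then give a rough way of ranking events, although a perfectly flat baseline (zero spread) would make any rise at all count as a burst, so a real measure would need something more careful.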