
Our vision
To be the world's leading provider of high quality, globally relevant International Standards through its members and stakeholders.

Our mission
ISO develops high quality voluntary International Standards that facilitate international exchange of goods and services, support sustainable and equitable economic growth, promote innovation and protect health, safety and the environment.

Our process
Our standards are developed by experts all over the world who work on a volunteer or part-time basis. We sell International Standards to recover the costs of organizing this process and making standards widely available. Please respect our licensing terms and copyright to ensure this system remains independent. If you would like to contribute to the development of ISO standards, please contact the ISO Member Body in your country: members.htm

This document has been prepared by: ISO/IEC JTC 1, Information technology
Cover photo credit: ISO/CS, 2015

Copyright protected document
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopy, or posting on the internet or intranet, without prior permission. Permission can be requested from either ISO at the address below or ISO's member body in the country of the requester:
ISO 2015, Published in Switzerland
ISO copyright office
Case postale 56
CH-1211 Geneva 20

1 Scope
This document does the following:
- Survey the existing ICT landscape for key technologies and relevant standards/models/studies/use cases and scenarios for Big Data from JTC 1, ISO, IEC and other standard setting organizations;
- Identify key terms and definitions commonly used in the area of Big Data; and
- Assess the current status of Big Data standardization market requirements, identify standards gaps, and propose standardization priorities to serve as a basis for future JTC 1 work.

2 Terms and definitions

2.1 Terms defined elsewhere
This document uses the following terms defined elsewhere:

2.1.1 capability
quality of being able to perform a given activity [SOURCE: ISO :2004]

2.1.2 cloud computing
paradigm for enabling network access to a scalable and elastic pool of shareable physical or virtual resources with self-service provisioning and administration on-demand [SOURCE: Recommendation ITU-T Y.3500 | ISO/IEC 17788:2014]

2.1.3 framework
structure expressed in diagrams, text, and formal rules which relates the components of a conceptual entity to each other [SOURCE: ISO :2014]

2.1.4 Internet of Things
integrated environment, inter-connecting anything, anywhere at anytime [SOURCE: ISO/IEC JTC 1 SWG 5 Report:2013]

2.1.5 lifecycle
evolution of a system, product, service, project or other human-made entity from conception through retirement [SOURCE: ISO/IEC/TR :2011]

2.1.6 ownership
legal right of possession, including the right of disposition, and sharing in all the risks and profits commensurate with the degree of ownership interest or shareholding, as demonstrated by an examination of the substance, rather than the form, of ownership arrangements [SOURCE: ISO :2011]

2.1.7 privacy
right of individuals to control or influence what information related to them may be collected and stored and by whom and to whom that information may be disclosed [SOURCE: ISO/TS 17574:2009]

2.1.8 provenance
information on the place and time of origin or derivation or a resource or a record or proof of authenticity or of past ownership [SOURCE: ISO 19153:2014]

2.1.9 relational model
data model whose structure is based on a set of relations [SOURCE: ISO/IEC :1999]

2.1.10 repository
collection of all software-related artefacts belonging to a system or the location/format in which such a collection is stored [SOURCE: ISO/IEC IEEE 24765:2010]

2.1.11 role
set of activities that serves a common purpose [SOURCE: Recommendation ITU-T Y.3502 | ISO/IEC 17789:2014]

2.1.12 security
all aspects related to defining, achieving, and maintaining confidentiality, integrity, availability, non-repudiation, accountability, authenticity, and reliability of a system [SOURCE: ISO/IEC 15288:2008]

2.1.13 sensor
device that observes and measures a physical property of a natural phenomenon or man-made process and converts that measurement into a signal [SOURCE: ISO/IEC :2013]

2.1.14 smart grid
electric grid system, which is characterized by the use of communication networks and the control of grid components and loads [SOURCE: ISO/IEC/TR 27019:2013]

2.1.15 streaming data
data passing across an interface from a source that is operating continuously [SOURCE: ISO/IEC :2011]

2.1.16 traceability
property that allows the tracking of the activity of an identity, process, or an element throughout the supply chain [SOURCE: ISO/IEC :2013]

2.2 Terms defined in this report
This document defines the following terms:

2.2.1 Big Data Analytics
analytical functions to support the integration of results derived in parallel across distributed pieces of one or more data sources. This is a rapidly evolving field both in terms of functionality and the underlying programming model

2.2.2 Big Data Engineering
storage and data manipulation technologies that leverage a collection of horizontally coupled resources to achieve a nearly linear scalability in performance

2.2.3 Big Data Models
logical data models (relational and non-relational) and processing/computation models (batch, streaming, and transactional) for the storage and manipulation of data across horizontally scaled resources

2.2.4 Big Data Paradigm
distribution of data systems across horizontally coupled independent resources to achieve the scalability needed for the efficient processing of extensive datasets

NIST      National Institute of Standards and Technology
NoSQL     Not Only Structured Query Language
OGC       Open Geospatial Consortium
OASIS     Organization for the Advancement of Structured Information Standards
POSIX     Portable Operating System Interface
RA        Reference Architecture
RFID      Radio-Frequency Identification
SC        Standards Committee
SDO       Standards Development Organization
SG        Study Group
SOA       Service-Oriented Architecture
SQL       SQL Query Language
SQL/MM    SQL Multimedia
SWG       Special Working Group
TPC       Transaction Processing Performance Council
W3C       World Wide Web Consortium
XML       Extensible Markup Language

3 Introduction to Big Data
In recent years, the term "Big Data" has emerged to describe a new paradigm for data applications. New technologies tend to emerge with a lot of hype, but it can take some time to tell what is new and different. While Big Data has been defined in a myriad of ways, the heart of the Big Data paradigm is data that is too big (volume), arrives too fast (velocity), changes too fast (variability), contains too much noise (veracity), or is too diverse (variety) to be processed within a local computing structure using traditional approaches and techniques. The technologies being introduced to support this paradigm have a wide variety of interfaces, making it difficult to construct tools and applications that integrate data from multiple Big Data sources. This report identifies potential areas for standardization within the Big Data technology space.

3.1 General concept of Big Data
Big Data is used as a concept that refers to the inability of traditional data architectures to efficiently handle the new data sets. The characteristics that force a new architecture in order to achieve efficiencies are the data-at-rest characteristics of volume and variety (data from multiple domains or types), and the data-in-motion characteristics of velocity (rate of flow) and variability (principally referring to a change in velocity).
Each of these characteristics results in different architectures or different data lifecycle process orderings to achieve needed efficiencies. A number of other terms (often starting with the letter "V") are also used, but several of these refer to the analytics and not to big data architectures.

The new big data paradigm occurs when the scale of the data at rest or in motion forces the management of the data to be a significant driver in the design of the system architecture. Fundamentally, the big data paradigm represents a shift in data system architectures from monolithic systems with vertical scaling (faster processors or disks) to horizontally scaled systems that integrate a loosely coupled set of resources. This shift occurred 20-some years ago in the simulation community, when scientific simulations began using massively parallel processing (MPP) systems. By splitting the code and data across independent processors in different combinations, computational scientists were able to greatly extend their simulation capabilities. This of course introduced a number of complications in such areas as message passing, data movement, latency in the consistency across resources, load balancing, and system inefficiencies while waiting on other resources to complete their tasks. The big data paradigm represents this same shift, again using different mechanisms to distribute code and data across loosely coupled resources in order to provide the scaling in data handling that is needed to match the scaling in the data.

The purpose of storing and retrieving large amounts of data is to perform analysis that produces additional knowledge about the data. In the past, the analysis was generally accomplished on a random sample of the data.

The Big Data Paradigm consists of the distribution of data systems across horizontally coupled independent resources to achieve the scalability needed for the efficient processing of extensive data sets.
With the new Big Data Paradigm, analytical functions can be executed against the entire data set or even in real-time on a continuous stream of data. Analysis may even integrate multiple data sources from different organizations. For example, consider the question "What is the correlation between insect-borne diseases, temperature, precipitation, and changes in foliage?" To answer this question, an analysis would need to integrate data about the incidence and location of diseases, weather data, and aerial photography.

While we certainly expect a continued evolution in the methods to achieve efficient scalability across resources, this paradigm shift (in analogy to the prior shift in the simulation community) is a one-time occurrence, at least until a new paradigm shift occurs beyond this crowdsourcing of processing or data systems across multiple horizontally coupled resources.

Big Data Engineering is the storage and data manipulation technologies that leverage a collection of horizontally coupled resources to achieve a nearly linear scalability in performance.

New engineering techniques in the data layer have been driven by the growing prominence of data types that cannot be handled efficiently in a traditional relational model. The need for scalable access to structured and unstructured data has led to software built on name-value/key-value pairs or columnar (big table), document-oriented, and graph (including triple-store) paradigms.
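As a minimal sketch of how a key-value store might be spread over horizontally coupled resources, the following Python example assigns each key to one of several "nodes" by hashing. The class name `ShardedKeyValueStore`, the use of in-memory dictionaries as stand-ins for nodes, and the modulo placement rule are all assumptions made for illustration; the report does not prescribe any particular partitioning scheme.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store spread over independent 'nodes' (plain dicts).

    Hypothetical illustration only: a real system would place each
    node on a separate machine and handle replication and failures.
    """

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)


store = ShardedKeyValueStore(node_count=4)
for i in range(1000):
    store.put(f"sensor-{i}", {"reading": i})

# Number of keys held by each node; the hash spreads them roughly evenly.
sizes = [len(node) for node in store.nodes]
print(sizes)
```

Because each key is owned by exactly one node, reads and writes for different keys can proceed on different nodes without coordination, which is the property that lets capacity grow by adding nodes. Simple modulo placement forces most keys to move when the node count changes, so this sketch should not be read as a recommended design.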

Non-Relational Models refers to logical data models such as document, graph, key-value and others that are used to provide more efficient storage of and access to non-tabular data sets.

NoSQL (alternately called "no SQL" or "not only SQL") refers to datastores and interfaces that are not tied to strict relational approaches.

Big Data Models refers to logical data models (relational and non-relational) and processing/computation models (batch, streaming, and transactional) for the storage and manipulation of data across horizontally scaled resources.

Schema-on-read: big data are often stored in the raw form in which they were produced; the schema needed for organizing (and often cleansing) the data is discovered, and the data are transformed, as the data are queried. This is critical because, for many analytics to run efficiently, the data must be structured to support the specific algorithms or processing frameworks involved.

Big Data Analytics is rapidly evolving both in terms of functionality and the underlying programming model. Such analytical functions support the integration of results derived in parallel across distributed pieces of one or more data sources.

The Big Data paradigm has other implications arising from these technical innovations. The changes are not only in the logical data storage, but also in the parallel distribution of data and code in the physical file system and in direct queries against this storage.

The shift in thinking causes changes in the traditional data lifecycle. One description of the end-to-end data lifecycle categorizes the steps as collection, preparation, analysis and action. Different big data use cases can be characterized in terms of the data set characteristics at-rest or in-motion, and in terms of the time window for the end-to-end data lifecycle. Data set characteristics change the data lifecycle processes in different ways, for example in the point of the lifecycle at which the data are placed in persistent storage.
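The schema-on-read idea described above can be illustrated with a short Python sketch. The sample records, the field names `device` and `temp_c`, and the function `read_with_schema` are hypothetical; the point is only that raw lines are persisted untouched and that typing and cleansing happen when the data are read.

```python
import json

# Raw events persisted exactly as produced (hypothetical sample data).
RAW_LINES = [
    '{"device": "a1", "temp_c": "21.5", "ts": "2015-01-01T00:00:00"}',
    '{"device": "a2", "temp_c": "n/a", "ts": "2015-01-01T00:01:00"}',
    'not json at all',
    '{"device": "a1", "temp_c": "22.0", "ts": "2015-01-01T00:02:00"}',
]


def read_with_schema(lines):
    """Apply the schema (and cleansing) only when the data are read."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {
                "device": str(record["device"]),
                "temp_c": float(record["temp_c"]),
            }
        except (ValueError, KeyError, TypeError):
            # Records that do not fit this query's schema are skipped;
            # the raw line itself stays in storage unchanged.
            continue


rows = list(read_with_schema(RAW_LINES))
mean_temp = sum(row["temp_c"] for row in rows) / len(rows)
print(rows, mean_temp)
```

A different query could apply a different schema to the same raw lines, for example one that keeps the timestamp and ignores the temperature, without any change to what was stored.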
In a traditional relational model, the data are stored after preparation (for example, after the extract-transform-load and cleansing processes). In a high velocity use case, the data are prepared and analysed for alerting, and only then are the data (or aggregates of the data) given persistent storage. In a volume use case, the data are often stored in the raw state in which they were produced, prior to the application of the preparation processes to cleanse and organize the data. The consequence of persisting data in its raw state is that a schema or model for the data is only applied when the data are retrieved, known as schema on read.

A further consequence of big data engineering is often referred to as "moving the processing to the data, not the data to the processing". The implication is that the data are too extensive to be queried and transmitted to another resource for analysis, so the analysis program is instead distributed to the data-holding resources, with only the results being aggregated on a different resource. Since I/O bandwidth is frequently the limiting resource in moving data, another approach would be to embed query/filter programs within the physical storage medium.

At its heart, Big Data refers to the extension of data repositories and processing across horizontally scaled resources, much in the same way the compute-intensive simulation community embraced massively parallel processing two decades ago. In the past, classic parallel computing applications utilized a rich set of communications and synchronization constructs and created diverse communications topologies. In contrast, today, with data sets growing into the petabyte and exabyte scales, distributed processing frameworks offering patterns such as map-reduce provide a reliable, high-level, commercially viable compute model based on commodity computing resources, dynamic resource scheduling, and synchronization techniques.

3.2 Definition of Big Data
The term "Big Data" is used in a variety of contexts with a variety of characteristics. To understand where standards will help support the big data paradigm, we have to reach some level of consensus on what the term really means. This report uses the following working definition of "Big Data":

Big Data is a data set(s) with characteristics (e.g. volume, velocity, variety, variability, veracity, etc.) that for a particular problem domain at a given point in time cannot be efficiently processed using current/existing/established/traditional technologies and techniques in order to extract value.

The above definition distinguishes Big Data from business intelligence and traditional transactional processing while alluding to a broad spectrum of applications that includes them. The ultimate goal of processing Big Data is to derive differentiated value that can be trusted (because the underlying data can be trusted). This is done through the application of advanced analytics against the complete corpus of data regardless of scale. Parsing this goal helps frame the value discussion for Big Data use cases:
- Any scale of operations and data: Utilizing the entire corpus of relevant information, rather than just samples or subsets. It is also about unifying all decision-support time-horizons (past, present, and future) through statistically derived insights into deep data sets in all those dimensions.
- Trustworthy data: Deriving valid insights either from a single-version-of-truth consolidation and cleansing of deep data, or from statistical models that sift haystacks of dirty data to find the needles of valid insight.
- Advanced analytics: Faster insights through a variety of analytic and mining techniques from data patterns, such as long tail analyses, micro-segmentations, and others, that are not feasible if you are constrained to smaller volumes, slower velocities, narrower varieties, and undetermined veracities.

A difficult question is what makes Big Data "big", or how large a data set has to be in order to be called big data. The answer is an unsatisfying "it depends". Part of the issue is that "big" is a relative term, and with the growing density of computational and storage capabilities (e.g. more power in smaller, more efficient form factors), what is considered big today will likely not be considered big tomorrow. Data are considered "big" if the use of the new scalable architectures provides improved business efficiency over other traditional architectures. In other words, the functionality cannot be achieved in something like a traditional relational database platform.

Big data essentially rests on the self-referencing viewpoint that data are big because they require scalable systems to handle them, and architectures with better scaling have come about because of the need to handle big data.
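Subclause 3.1 describes moving the processing to the data and mentions map-reduce as one pattern that distributed processing frameworks offer. The following Python sketch illustrates the idea under simplifying assumptions: the `PARTITIONS` list stands in for data held on separate nodes, and the function names `map_partition` and `reduce_counts` are hypothetical. It runs in a single process and is not a model of any particular framework.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions, each standing in for data held on a separate node.
PARTITIONS = [
    ["error", "ok", "ok"],
    ["ok", "error", "timeout"],
    ["ok"],
]


def map_partition(records):
    """Runs 'at' the data-holding node: returns a small summary only."""
    return Counter(records)


def reduce_counts(left, right):
    """Merges two partial summaries on the aggregating resource."""
    return left + right


# Only the per-partition summaries travel; the raw records stay put.
partials = [map_partition(partition) for partition in PARTITIONS]
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

The design point is that the amount of data crossing the network depends on the size of the summaries rather than on the size of the partitions, which is why the pattern suits cases where I/O bandwidth is the limiting resource.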

3.3 Organizational drivers of Big Data
The key drivers for Big Data in organizations are about realizing value in any of several ways:
- Insight: enable discovery of deeper, fresher insights from all enterprise data resources
- Productivity: improve efficiency, effectiveness, and decision-making
- Speed: facilitate more timely, agile response to business opportunities, threats, and challenges
- Breadth: provide a single view of diverse data resources throughout the business chain
- Control: support tighter security, protection, and governance of data throughout its lifecycle
- Scalability: improve the scale, efficiency, performance, and cost-effectiveness of data/analytics platforms

3.4 Key characteristics of Big Data
The key characteristics of Big Data focus on volume, velocity, variety, veracity, and variability. The following subclauses go into further depth on these characteristics.

3.4.1 Volume
Traditionally, the data volume requirements for analytic and transactional applications were in sub-terabyte territory. However, over the past decade, more organizations in diverse industries have identified requirements for analytic data volumes in the terabytes, petabytes, and beyond. Estimates produced by longitudinal studies started in 2005 [8] show that the amount of data in the world is doubling every two years. Should this trend continue, by 2020, there will be 50 times the amount of data as there had been in [...]. Other estimates indicate that 90 % of all data ever created was created in the past 2 years [7]. The sheer volume of the data is colossal; the era of a trillion sensors is upon us. This volume presents the most immediate challenge to conventional information technology structures. It has stimulated new ways of providing scalable storage across a collection of horizontally coupled resources, and a distributed approach to querying.

Briefly, the traditional relational model has been relaxed for the persistence of newly prominent data types.
These logical non-relational data models, typically lumped together as NoSQL, can currently be classified as Big Table, Name-Value, Document and Graph models. A discussion of these logical models was not part of the phase one activities that led to this document.

3.4.2 Variety
Traditionally, enterprise data implementations for analytics and transactions operated on a single structured, row-based, relational domain of data. However, increasingly, data applications are creating, consuming, processing, and analysing data in a wide range of relational and non-relational formats including structured, unstructured, semi-structured, documents and so forth from diverse application domains.

Traditionally, a variety of data was handled through transforms or pre-analytics to extract features that would allow integration with other data through a relational model. Given the wider range of data formats, structures, timescales and semantics that are desirable to use in analytics, the integration of this data becomes more complex. This challenge arises because the data to be integrated could be text from social networks, image data, or a raw feed directly from a sensor source.

The Internet of Things is the term used to describe the ubiquity of connected sensors, from RFID tags for location, to smart phones, to home utility meters. The fusion of all of this streaming data will be a challenge for developing a total situational awareness. Big Data Engineering has spawned data storage models that are more efficient for unstructured data types than a relational model, causing a derivative issue for the mechanisms to integrate this data. It is possible that the data to be integrated for analytics may be of such volume that it cannot be moved in order to integrate, or it may be that some of the data are not under the control of the organization creating the data system. In either case, the variety of big data forces a range of new big data engineering in order to efficiently and automatically integrate data that is stored across multiple repositories and in multiple formats.

3.4.3 Velocity
Velocity is the speed/rate at which the data are created, stored, analysed and visualized.
Traditionally, most enterprises separated their transaction processing and analytics. Enterprise data analytics were concerned with batch data extraction, processing, replication, delivery, and other applications. But increasingly, organizations everywhere have begun to emphasize the need for real-time, streaming, continuous data discovery, extraction, processing, analysis, and access.

In the big data era, data are created in real-time or near real-time. With the availability of Internet-connected devices, wireless or wired, machines and devices can pass on their data the moment it is created. Data flow rates are increasing with enormous speed and variability, creating new challenges to enable real-time or near real-time data usage. Traditionally this concept has been described as streaming data. As such, there are aspects of this that are not new, as companies such as those in telecommunications have been sifting through high volume and velocity data for years. The new horizontal scaling approaches do, however, add new big data engineering options for efficiently handling this data.

3.4.4 Variability
Variability refers to changes in data rate, format/structure, semantics, and/or quality that impact the supported application, analytic, or problem. Specifically, variability is a change in one or more of the other Big Data characteristics. Impacts can include the need to refactor architectures, interfaces, processing/algorithms, integration/fusion, storage, applicability, or use of the data.

The other characteristics directly affect the scope of the impact for a change in one dimension. For example, in a system that deals with petabytes or exabytes of data, refactoring the data architecture and performing the necessary transformation to accommodate a change in structure from the source data may not be feasible, even with the horizontal scaling typically associated with big data architectures. In addition, the trend to integrate data from outside the organization to obtain more refined analytic results, combined with the rapid evolution in technology, means that enterprises must be able to adapt rapidly to data variations.

3.4.5 Veracity
Veracity refers to the trustworthiness, applicability, noise, bias, abnormality and other quality properties in the data. Veracity is a challenge in combination with other Big Data characteristics, but is essential to the value associated with or developed from the data for a specific problem/application. Assessment, understanding, exploiting, and controlling Veracity in Big Data cannot be addressed efficiently and sufficiently throughout the data lifecycle using current technologies and techniques.
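The velocity and variability subclauses above treat variability partly as a change in data rate. As a rough sketch of how such a change might be noticed, the Python example below counts arrivals in fixed-size windows and flags windows whose count differs from the previous one by a chosen factor. The window size, the factor of 2, the sample arrival times and both function names are assumptions made for illustration, not values taken from the report.

```python
from collections import Counter


def events_per_window(timestamps, window_seconds):
    """Count arrivals in consecutive fixed-size (tumbling) windows."""
    counts = Counter(int(t // window_seconds) for t in timestamps)
    if not counts:
        return []
    last_window = max(counts)
    return [counts.get(i, 0) for i in range(last_window + 1)]


def flag_variability(rates, factor=2.0):
    """Return indices of windows whose rate changed by at least `factor`."""
    flagged = []
    for i in range(1, len(rates)):
        low, high = sorted((rates[i - 1], rates[i]))
        # max(low, 1) avoids treating a jump from zero as an infinite ratio.
        if high >= factor * max(low, 1):
            flagged.append(i)
    return flagged


# Hypothetical arrival times in seconds: steady at first, then a burst.
arrivals = [0, 5, 10, 15, 20, 21, 22, 23, 24, 25, 26, 27]
rates = events_per_window(arrivals, window_seconds=10)
print(rates, flag_variability(rates))
```

A production system would compute this incrementally over an unbounded stream rather than over a list held in memory; the list form is used here only to keep the sketch short.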
3.5 Roles in Big Data ecosystem
The different functional roles within a typical Big Data ecosystem are as follows:
- Data Provider: introduces new data or information feeds into the ecosystem
- Big Data Application Provider: executes a life cycle (collection, processing, dissemination) controlled by the system orchestrator to implement specific vertical applications requirements and meet security and privacy requirements
- Big Data Framework Provider: establishes a computing fabric (computation and storage resources, platforms, and processing frameworks) in which to execute certain transformation applications while protecting the privacy and integrity of data
- Data Consumer: includes end users or other systems who utilize the results of the Big Data Application Provider
- System Orchestrator: defines and integrates the required data application activities into an operational vertical system
- Security and Privacy: the role of managing and auditing access to and control of the system and the underlying data including management and tracking of data provenance
- Management: the overarching control of the execution of a system, the deployment of the system, and its operational maintenance

3.6 Security and privacy for Big Data
Security and Privacy issues arise in any distributed computing environment. These issues are exacerbated by Big Data for a number of reasons.

3.6.1 Issues
Much of the value of Big Data comes from combining data from different sources. Combining data in this manner can provide context. Thus, data that may not have been intelligible on its own can be mined for private information given enough context. Some Big Data comes from social media and medical records and inherently contains private information. While social media sites may not do much to protect their users, analysis of such data, particularly in the presence of context, must protect privacy.

Big Data may be gathered from diverse end points and brought together for a variety of applications. There may be more types of actors than just providers and consumers, primarily data owners such as mobile users and social network users. Some actors may be devices that ingest data streams for still different data consumers. Moreover, the volume of Big Data necessitates storage in multi-tiered storage media, some of which may store aggregated data. Aggregation and the movement of data between applications and tiers can lose provenance and metadata information and open the door to privacy violations.

Security and Privacy are important both for data quality and for protection. Big Data frequently moves across individual boundaries to group, community of interest, state, national, and international boundaries. Provenance, a component of veracity, addresses the problem of understanding the data's original source and what has been done with the data. One approach is through the use of metadata, though the problem extends beyond metadata maintenance. Provenance also encompasses information assurance for the methods through which information was collected. For example, when sensors are used, traceability to calibration, version, sampling and device configuration is needed.
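One way to picture the metadata approach to provenance mentioned above is a record that travels with each sensor reading. The Python sketch below is an assumption-laden illustration: the class names `SensorProvenance` and `Reading`, every field name, and the use of a SHA-256 fingerprint over the serialized record are hypothetical choices, not something the report specifies.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class SensorProvenance:
    """Hypothetical provenance metadata for a sensor-derived value."""
    device_id: str
    firmware_version: str
    calibration_date: str
    sampling_hz: float
    configuration: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Reading:
    value: float
    unit: str
    provenance: SensorProvenance

    def fingerprint(self):
        # Hash of value plus provenance: any later change to either
        # produces a different fingerprint.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


origin = SensorProvenance("th-17", "1.4.2", "2015-03-01", 1.0, {"gain": 2})
reading = Reading(21.5, "degC", origin)
recalibrated = Reading(
    21.5, "degC",
    SensorProvenance("th-17", "1.4.2", "2015-06-01", 1.0, {"gain": 2}),
)
print(reading.fingerprint() != recalibrated.fingerprint())
```

Keeping the fingerprint alongside aggregated or moved data gives a downstream consumer a way to check that the provenance it received matches what was recorded at collection. As the text notes, this addresses only the metadata part of the problem.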
The universal attribute of data ownership must be addressed in the context of the security and privacy of Big Data. Ownership is an attribute (which may or may not be visible to users) that associates data with one or more entities who own or can influence what can be done with the data (for example, you influence but cannot change your credit record). In databases, ownership confers the privileges to create, read, update, and delete data. Transparency of ownership enables trust and control for data owners, as well as openness and utility for enterprises and society. Maintaining data provenance allows traceability through the data lifecycle and tracks data ownership and change.

Distributed programming frameworks developed to support volume and velocity were not necessarily designed with security in mind. Malfunctioning computing nodes might leak confidential data. Partial infrastructure attacks could compromise a significantly large fraction of the system due to high levels of connectivity and dependency. If the system does not enforce strong authentication among geographically distributed nodes, rogue nodes can be added that can eavesdrop on confidential data.

Data search and selection can lead to privacy or security policy concerns because results can be provided without provenance, and access control policies may be lost in the search and selection process. It is often unclear what capabilities are provided by a provider in this respect. A combination of user competency and system protections is likely needed, including the exclusion of databases that enable re-identification.

Because there may be disparate processing steps between the data owner, provider, and data consumer, the integrity of data coming from end points must be ensured. End-to-end information assurance practices for Big Data (for example, for verifiability) are not dissimilar from those of other systems, but must be designed on a larger scale.

Retargeting traditional relational database security to non-relational databases has been a challenge. These systems were not designed with security in mind, and security is usually relegated to middleware.

The movement and aggregation of data between applications has led to a requirement for systematically analysing the threat models and for research and development of novel techniques. The threat model for network-based, distributed, auto-tier systems includes the following major scenarios: confidentiality and integrity, provenance, availability, consistency, collusion attacks, roll-back attacks and recordkeeping disputes.

The flip side of having volumes of data is that analytics can be performed to detect security breach events. This is an instance where Big Data technologies can fortify security.

Big Data systems will exert stresses upon security and privacy aspects of conventional applications and data produced by those applications. The potential of Big Data analytics, whether a current option or merely a future possibility, creates a natural bias against discarding data. Inconvenient archives could be relegated to specialized uses, rather than a recognized design pattern which relegates data to an intentionally degraded access modality.
Security and privacy frameworks will evolve as Big Data systems are deployed more widely, but much Big Data may be collected through legacy applications that did not benefit from those frameworks and did not anticipate Big Data uses.

Requirements development for Big Data systems will emphasize extensibility and scalability in ways that set the stage for greater threats to security and privacy. For example, systems will be architected to one day incorporate real time feeds from devices that are part of the Internet of Things even if those feeds are not yet available.

While the U.S. Safe Harbor privacy principles have been criticized as inadequate, a few of them serve to highlight areas in which Big Data systems are likely to be tested: notice, choice, onward data transfer, security, data integrity, the access of an individual to correct or delete data, and effective enforcement of these guidelines. Each of these areas is challenging enough in traditional IT settings. The productive use of derived, indirect and correlated data in Big Data will further amplify the need for increased control; however, current trust exchange technologies do not address many Safe Harbor needs.

The human element in privacy and security for Big Data will also be transformed in ways not easily anticipated. As more data becomes available through Big Data analytics engines, there will be more analysts, some of them less well versed in best practices for preserving security and privacy. Similarly, analysts will likely gain access to data whose provenance and usage they are comparatively unfamiliar with. The security and privacy problems created through human agents will vary from benign to accidental to malicious.

Security and privacy measures in Big Data must scale nonlinearly. Consider first the scope of existing regulations, such as the EU General Data Protection Regulation, APEC Cross Border Privacy Rules, the Privacy Act of 1974 and the California Right to Privacy. Then consider what new regulations are likely to emerge to address perceived and real risks as the public and regulators become aware of Big Data capabilities. For instance, the HIPAA guidance on "minimum necessary" use and disclosure could not have anticipated the many possible uses, salutary or otherwise, for personal health records. Architects who believed they understood the full scope of audit, forensic, compliance, civil rights and risk elements of security and privacy may come to feel otherwise. The relative comfort of many isolated information systems will, for many practitioners, become a relic of a less ubiquitously connected past.

Information assurance, including responsibility for resilience and reliability, draws on different specializations within computing, but Big Data is likely to give them a security and privacy face to the public. Technical solutions must take this into account.

Recommendations

It is of paramount importance that Big Data systems be designed with security in mind from the ground up rather than have it emerge as an afterthought, which often leads to the adoption of ad hoc solutions built on unsystematic and vague threat models. If there is no global perspective on security, then fragmented solutions may not offer any security at all and often impart a false sense of safety.

Data aggregation and dissemination should be secured inside the context of a formal, understandable framework. This process should be an explicit part of a data consumer's contract with the data owner.

Privacy-preserving mechanisms are needed for Big Data, such as for Personally Identifiable Information, so that provenance information and data access policies are not lost.
Anonymization and obfuscation of some data values can be used to protect sensitive information. For example, geographic location may be generalized to a village or a town rather than the exact coordinates.

The availability of data and its current status to data consumers is often an important aspect of Big Data. In some settings, this may dictate a need for public or closed-garden portals and ombudsman-like roles for data at rest.

While the context of the data, in terms of its structure, might be a standard Big Data problem, the payload might be encrypted to enforce confidentiality. However, traditional encryption technology hinders organization of data based on semantics. The aim of encryption is to provide semantic security, which means that the encryption of any value is indistinguishable from the encryption of any other value. Data encrypted using known standard and/or commercial algorithms cannot be searched, ordered, added, etc. While some basic processing operations can be performed on data encrypted using the emerging homomorphic algorithms, it will take time until this approach matures and becomes applicable to real-life scenarios.
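The generalization of geographic location mentioned above can be sketched as follows. This is an illustrative assumption rather than a method prescribed by this report: instead of mapping coordinates to a named village or town, which would require a gazetteer, the sketch snaps coordinates to a coarse grid cell. The function name and the default cell size of 0.1 degrees are hypothetical choices.

```python
import math
from typing import Tuple


def generalize_location(lat: float, lon: float, cell_deg: float = 0.1) -> Tuple[float, float]:
    """Snap a coordinate pair to the south-west corner of its grid cell.

    All points inside the same cell map to the same output, so the exact
    position is no longer recoverable from the generalized value.
    """
    snapped_lat = math.floor(lat / cell_deg) * cell_deg
    snapped_lon = math.floor(lon / cell_deg) * cell_deg
    # Rounding removes floating-point noise introduced by the multiplication.
    return (round(snapped_lat, 6), round(snapped_lon, 6))


# Two distinct points in the same cell become indistinguishable.
point_a = generalize_location(46.2044, 6.1432)
point_b = generalize_location(46.2099, 6.1999)
```

A larger cell size gives stronger generalization at the cost of analytic precision. Grid snapping alone does not guarantee anonymity in sparsely populated areas, where a single cell may contain very few individuals.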

4 Relevant standardization activities

This clause describes standardization activities related to Big Data, including those of ISO/IEC JTC 1, in order to identify standards gaps. The current content is based on an informal survey by this Study Group and contributions from other SDOs. Specific Big Data standards are being developed by a variety of well-established SDOs and industry consortia as outlined in Table 1. The following subclauses provide additional details on activities by those organizations that relate to Big Data.

Table 1: The mission and key members of major consortia for Big Data standardization

SDO/Consortium | Area of interest for standardization | Main deliverables
ISO/IEC JTC 1/SC 32 | Data management and interchange, including database languages, multimedia object management, metadata management, and e-business. | e-business standards, including role negotiation; metadata repositories, model specification, metamodel definitions; SQL; and object libraries and application packages built on (using) SQL.
ISO/IEC JTC 1/SC 38 | Standardization for interoperable Distributed Application Platform and Services, including Web Services, Service Oriented Architecture (SOA), and Cloud Computing. | Cloud Data Management Interfaces, Open Virtualization Format, Web Services Interoperability.
ITU-T SG13 | Cloud computing for Big Data. | Cloud computing based big data requirements, capabilities, and use cases.
W3C | Web and Semantic related standards for markup, structure, query, semantics, and interchange. | Multiple standards including ontology specification standards, data markup, query, access control, and interchange.
Open Geospatial Consortium | Geospatial related standards for the specification, structure, query, and processing of location related data. | Multiple standards related to the encoding, processing, query, and access control of geospatial data.
Organization for the Advancement of Structured Information Standards | Information access and exchange. | A set of protocols for interacting with structured data content such as OData; standards for security, Cloud computing, SOA, Web services, the Smart Grid, electronic publishing, emergency management, and other areas.
Transaction Processing Performance Council | Benchmarks for Big Data systems. | Specification of the TPC Express Benchmark™ for Hadoop systems and the related kit.
TM Forum | Enable enterprises, service providers and suppliers to continuously transform in order to succeed in the digital economy. | Share experiences to solve critical business challenges including IT transformation, business process optimization, big data analytics, cloud management, and cyber security.

4.1 ISO/IEC JTC 1/SC 32

ISO/IEC JTC 1/SC 32, titled "Data management and interchange", currently works in several distinct, but related, areas of Big Data technology. SQL is already adding new features to support Big Data. In addition, SQL has been supporting bi-temporal data, two forms of semi-structured data (XML and JSON), and multidimensional arrays. SQL implementations are known to exist that use storage engines built on several NoSQL technologies, including name-value pairs, big table, and document stores.

Metadata efforts have focused on two major areas: (1) the specification and standardization of data elements, including the registration of those data elements (essentially, a repository for data element definitions); and (2) the definition of metamodels (to describe data and application models) and definitions of those models themselves.

4.2 ISO/IEC JTC 1/SC 38

ISO/IEC JTC 1/SC 38, titled "Distributed application platforms and services (DAPS)", currently works in several areas related to the Big Data paradigm: Cloud Data Management Interfaces; Open Virtualization Format; and Web Services Interoperability.

4.3 ITU-T SG13

ITU-T SG13 Question 17 initiated a new draft Recommendation on Big Data (Y.BigData-reqts) [6], titled "Requirements and capabilities for cloud computing based big data", in July. The scope of Y.BigData-reqts is: an overview of cloud computing based big data; cloud computing based big data requirements; cloud computing based big data capabilities; and cloud computing based big data use cases and scenarios.

4.4 W3C

Most W3C work revolves around the standardization of Web technologies. Given that one of the primary contributors to the growth of Big Data has been the growth of the Internet and World Wide Web (WWW), many of the developing standards around web technologies must deal with the challenges inherent in Big Data.
