Security concerns of Open Data publishing

The security considerations we would like to discuss in this blog post cover these two aspects:

securing the IT environment of the data publisher organisation

security (in the sense of "protection") of the published data

The basis for the security evaluation is the finding that data publication, and the subsequent interaction with the users who work with the data, almost always takes place in a different security context than production data processing in the organisation.

Best practice is therefore to place the infrastructure necessary for data publication in a separate computing environment. This measure is typically implemented at the network layer: the data publication infrastructure is usually placed in the network segment reserved for the organisation's Internet-facing servers, since these already provide public access to data.

One of the primary security benefits of the Open Data approach is that published data are not subject to any confidentiality requirements – they are public at all times. If you want this assumption to effectively simplify data publishing, make sure that non-public information is not included, in any form, in the data publishing infrastructure. For example, if you use the ODN software and place it in the DMZ or a similar security zone, all inputs should already be "cleaned" of data that must not be disclosed – such cleaning should not be implemented in ODN itself using its data transformations.
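The cleaning step can be sketched as a small transformation run in the internal environment, before any record is exported to the publishing zone. The field names below are illustrative assumptions, not part of any real dataset:

```python
# Hypothetical sketch: strip non-public fields from records *before* they
# leave the internal environment for the DMZ. Field names are assumptions.
NON_PUBLIC_FIELDS = {"national_id", "internal_note", "contact_email"}

def clean_record(record: dict) -> dict:
    """Return a copy of the record with all non-public fields removed."""
    return {k: v for k, v in record.items() if k not in NON_PUBLIC_FIELDS}

record = {"name": "Office A", "budget": 12000, "internal_note": "draft"}
print(clean_record(record))  # → {'name': 'Office A', 'budget': 12000}
```

The key point is where this runs: inside the internal production environment, so that nothing confidential ever reaches the publishing infrastructure in the first place.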

More interesting is the question of data integrity and authenticity. The solution depends mostly on how much liability the publisher wants to attach to the data. There is a whole continuum of options here, but typically it is one of two situations: the data are provided for information purposes only, or the data can be used as legally binding.

It is necessary to note that even with "informative" data, the provider is to some extent responsible for ensuring the accuracy of the data (and thus their integrity) – he is usually legally obliged to do so. A violation of data integrity can have many negative effects, from damage to the reputation of the organisation to an excessive load on the personnel tasked with solving the resulting problems (especially communication with users). If data are published for informational purposes only, it is normally sufficient to protect their integrity at the same level as the organisation's web servers or sites. The authenticity of the data is then achieved at the level of metadata, i.e. by declaration.

If the data are intended to be usable for legally binding purposes, high attention should be given to ensuring their integrity and authenticity. It should be noted that in this area there is no generally accepted technical standard for machine-processed data. Best practice is to implement the mechanism ensuring data integrity and authenticity outside the data publishing infrastructure, i.e. in the internal production environment of the organisation. The mechanism most often used is an electronic signature, applied either to the entire dataset (which, however, prevents partial access to the data through an API) or separately to selected dataset entities (e.g. if the data are in tabular form, each row of the table is signed separately). The authenticity of the data is then guaranteed through the signing certificate (whose subject is the data provider) and the relevant certification path to a trust anchor.
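A minimal sketch of the per-row variant, assuming the Ed25519 scheme from the third-party `cryptography` package (the post does not prescribe an algorithm). In practice the private key would correspond to a certificate whose subject is the data provider, and signing would happen in the internal environment:

```python
# Hedged sketch of per-row signing; key handling and row format are assumptions.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()   # kept in the internal environment
public_key = private_key.public_key()        # distributed to data consumers

rows = [
    {"year": 2014, "region": "Bratislava", "amount": 12000},
    {"year": 2014, "region": "Kosice", "amount": 8000},
]

def canonical(row: dict) -> bytes:
    """Stable byte representation, so signer and verifier agree on the input."""
    return json.dumps(row, sort_keys=True).encode("utf-8")

# Sign each row separately, so partial access through an API stays verifiable.
signed_rows = [
    {"row": row, "signature": private_key.sign(canonical(row)).hex()}
    for row in rows
]

# A consumer can verify a single row without downloading the whole dataset;
# verify() raises InvalidSignature if the row was tampered with.
entry = signed_rows[0]
public_key.verify(bytes.fromhex(entry["signature"]), canonical(entry["row"]))
```

The trade-off mentioned above is visible here: signing the whole dataset would need only one signature, but then a consumer fetching a few rows via an API could not check them in isolation.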

It would seem that if there is no need to address confidentiality, there is also no need for access control. However, access control is often important for the efficient management of the resources used by data consumers – especially network capacity and the utilisation of the servers making the data available. Overall, it is necessary to plan and implement a solution for data availability: if the data are to be used seriously, users need a certain guarantee of availability, or they must at least be rigorously informed about the service level parameters. Such information should include the acceptable use policy for all resources of the infrastructure (e.g. capacity limits for data downloads, the allowed frequency of API queries).
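An allowed query frequency is typically enforced with a rate limiter. The token-bucket sketch below is one common way to do it (the post does not name a specific mechanism); the rates are illustrative:

```python
# Hypothetical token-bucket limiter: `rate` tokens refill per second,
# bursts allowed up to `capacity` requests.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # refill speed (tokens/second)
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. advertise in the acceptable use policy: 10 queries/s, bursts of 20
limiter = TokenBucket(rate=10.0, capacity=20)
```

Publishing the chosen `rate` and `capacity` values in the acceptable use policy keeps the limit transparent to users rather than a surprise.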

Indeed, availability protection is usually the reason for detailed monitoring of the use of the data publishing infrastructure and its services. We recommend monitoring the current status and resource usage, as well as storing historical data for possible subsequent analysis.
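In the spirit of that recommendation, a monitor only needs to do two things: record every request, and keep the raw events around for later analysis. A minimal sketch (dataset names and client addresses are made up for illustration):

```python
# Hypothetical usage monitor: records every request and retains the raw
# event log so historical usage patterns can be analysed later.
import time
from collections import Counter

class UsageMonitor:
    def __init__(self):
        self.events = []  # (timestamp, dataset, client) – historical record

    def record(self, dataset: str, client: str) -> None:
        self.events.append((time.time(), dataset, client))

    def top_datasets(self, n: int = 5):
        """Most frequently requested datasets – a basis for capacity planning."""
        return Counter(d for _, d, _ in self.events).most_common(n)

monitor = UsageMonitor()
monitor.record("budget-2014", "198.51.100.7")
monitor.record("budget-2014", "203.0.113.9")
monitor.record("contracts", "198.51.100.7")
```

In a real deployment the event log would go to persistent storage (or an existing log pipeline) instead of a Python list, but the separation between "current status" queries and the retained history is the same.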

When enabling access through an API, good practice is not to create direct access to the database or application server of the organisation's production environment, but to use a separate tool for processing API requests. The main reasons are security (protecting internal systems from unwanted external access) and protection of resources (guaranteeing that internal servers and infrastructure are not overloaded, regardless of the volume of user requests). If direct access to the production systems is needed, the requirements for maintaining security should be included in the design from the initial phases of creating the publication infrastructure.
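The decoupling described above can be sketched as an API layer that answers only from a snapshot exported by the internal systems, so user traffic never touches the production database. All names here are illustrative:

```python
# Hypothetical sketch: the public API serves a snapshot pushed by an
# internal, outbound-only export job – never the production database itself.
class SnapshotAPI:
    def __init__(self):
        self._snapshot = {}  # last exported copy of the published data

    def refresh(self, exported_data: dict) -> None:
        """Called periodically by the internal export job (push direction only)."""
        self._snapshot = dict(exported_data)

    def query(self, key):
        # User requests only ever read the snapshot, so production load
        # stays constant no matter how many requests arrive.
        return self._snapshot.get(key)

api = SnapshotAPI()
api.refresh({"budget-2014": {"total": 12000}})
```

Because the export job pushes data outward and the API never connects inward, a compromise or overload of the public layer cannot propagate back into the production environment.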

There are special tools created to simplify this task; for example, the ODN software produced by the COMSODE project is an easily deployable, self-contained, Open Source package that is nevertheless maintained and supported.

About the author:

Ľubor Illek received a degree in Informatics from Comenius University, Faculty of Mathematics, Physics and Informatics. Since 1998 he has been active in information security and standardisation activities. Since 2003 he has been a member of the Slovak informatics society. In 2009 he was one of the founding members of the Society for Open Information Technologies (SOIT), a non-profit civic association of people who advocate the use of open information technologies in diverse areas of society, with a main focus on Slovakia. It brings together experts from the field of information technology to promote the idea of openness in public policy making, access to data, and the use of OSS.