
The International Journal on Advances in Security is published by IARIA. ISSN: journals site: contact: Responsibility for the contents rests upon the authors and not upon IARIA, nor on IARIA volunteers, staff, or contractors. IARIA is the owner of the publication and of editorial aspects. IARIA reserves the right to update the content for quality improvements. Abstracting is permitted with credit to the source. Libraries are permitted to photocopy or print, providing the reference is mentioned and that the resulting material is made available at no cost. Reference should mention: International Journal on Advances in Security, issn vol. 7, no. 3 & 4, year 2014, The copyright for each included paper belongs to the authors. Republishing of same material, by authors or persons or organizations, is not allowed. Reprint rights can be granted by IARIA or by the authors, and must include proper reference. Reference to an article in the journal is as follows: <Author list>, <Article title> International Journal on Advances in Security, issn vol. 7, no. 3 & 4, year 2014, <start page>:<end page>, IARIA journals are made available for free, providing the appropriate references are made when their content is used. Sponsored by IARIA Copyright 2014 IARIA

A Secure Logging Framework with Focus on Compliance

Felix von Eye, David Schmitz, and Wolfgang Hommel
Leibniz Supercomputing Centre, Munich Network Management Team, Garching n. Munich, Germany

Abstract

Handling log messages securely, for example, on servers or embedded devices, has often relied on cryptographic message authentication codes (MACs) to ensure log file integrity: Any modification or deletion of a log entry will invalidate the MAC, making the tampering evident. However, organizational security requirements regarding log files have changed significantly over the decades. For example, European privacy and personal data protection laws mandate that certain information, such as IP (internet protocol) addresses, must only be stored for a certain retention period, typically seven days. Traditional log file security measures, however, do not support the delayed deletion of partial log message information for such compliance reasons. This article presents SLOPPI (secure logging with privacy protection and integrity), a three-tiered log management framework with focus on integrity management and compliance as well as optional support for encryption-based confidentiality of log messages.

Keywords-log file management; secure logging; compliance; log message encryption; privacy by design.

I. INTRODUCTION

For secure logging, von Eye et al. presented SLOPPI [1], a framework for secure logging with privacy protection and integrity, which is extended in this article. This framework helps to ensure that log files, independent of the storage format, fulfill the well-known goals of information technology security:

- The log's integrity must be ensured: Neither a malicious administrator nor an attacker who has successfully compromised a system shall be able to delete or modify existing log entries or to insert bogus ones.
- The log shall not violate compliance criteria. For example, European data protection laws regulate the retention of personal data, which includes, among many others, user names and IP addresses. These restrictions also apply to log entries according to several German courts' verdicts that motivate the presented approach; details are part of previous work [2].
- The confidentiality of log entries shall be safeguarded; i.e., read access to log entries shall be confined to an arbitrary set of users.
- The availability of log entries shall be ensured.

The security of log files is a very important aspect of the overall security concept of a service or device. Many attacks or resource abuse cases can be detected by analyzing log files as part of a forensics process, just like system or service breakdowns. With the knowledge embedded in log data, administrators and forensics experts are often able to reconstruct how an attacker intruded into the system or to find the root cause of a system failure.

Because of this concentrated information, log data is often a primary target of attackers once they have compromised the system. On the one hand, an attacker could erase the whole log file to cover up the traces. This is very efficient, but it also gives a clear indication that something went wrong on the system, which will most likely arouse the system administrator's suspicion or trigger an automated alert. When this happens, the administrator is able to detect the attack very quickly, even if not every detail can be reconstructed. On the other hand, an attacker could change some of the log entries in a way that the manipulation is not obvious.

The approach presented in this article focuses on the second scenario. In any case, an attacker would be able to delete the log files entirely, which cannot be prevented as long as the log file resides on a fully compromised machine.
Even if the log files can be stored on an external system, such as a log server, the attacker may be able to disrupt the connection between the system and the log server, e.g., by firewalling the connection, once the system has been hacked and the attacker has obtained administrator privileges. Besides this, there are also many systems that cannot be connected to a central log server, e.g., because of the mobility of the systems or because the organization is too small to operate a dedicated central log server.

This motivates the SLOPPI approach [1], which allows administrators to protect their log files against unwanted changes, while the deletion of log files, e.g., by a regular log rotation, is still possible. The SLOPPI architecture allows administrators to keep long-term logs, while privacy-related parts of the log file can be deleted after a well-defined period of time.

In this article, the SLOPPI approach, which has been substantially improved since its introduction in [1], is presented. The primary limitation of the previous SLOPPI approach was that the deletion of a log entry also deleted any other information inside this log entry. It was therefore not possible to keep parts of the information, e.g., for statistical or diagnostic reasons. In this paper, we deal with this drawback and present an improved and more detailed approach, which is now able to keep some predefined parts of the information.

This work is motivated by the large-scale distributed environment of the SASER-SIEGFRIED project (Safe and Secure European Routing) [3], in which more than 50 project partners design and implement network architectures and technologies for secure future networks. The project's goal is to remedy security vulnerabilities of today's IP layer networks in the 2020 timeframe. Thereby, security mechanisms for future networks are designed based on an analysis of the currently predominant security problems on the IP layer, as well as upcoming issues such as vendor-created loopholes and SDN-based (software defined network) traffic anomaly detection. The project focuses on inter-domain routing, and routing decisions are based on security metrics that are part of log entries sent by active network components to central network management systems; therefore, the integrity of this data must be protected, providing a use case that is similar to traditional intra-organizational log file management applications.

The remainder of this article is structured as follows: Section II introduces the terminology and notation that is used throughout the article. In Section III, the related work and state of the art as well as its influence on the design of SLOPPI are discussed. SLOPPI's architecture and workflows are presented in depth in Section IV. The process for verifying the integrity of SLOPPI log files is specified in Section V. Before the article's conclusion, Section VI analyzes the security properties of SLOPPI.

II. TERMINOLOGY

In this article, a few terms and symbols are used to avoid ambiguity. These symbols and terms have the following meaning:

- The untrusted device $U$ is the special focus of this work; it could be, for example, a web server or a Linux system. As a matter of its regular operations, $U$ produces log data, which is saved in one or more log files. As $U$ is a system that is not necessarily hardened in any way, it must be assumed that $U$ may be compromised by an attacker and that, therefore, the log data is not guaranteed to be trustworthy, i.e., the security goals confidentiality, integrity, and availability cannot reliably be achieved. However, the SLOPPI approach can be used to ensure the integrity and compliance of log data produced by $U$, making it reliable under this specific aspect.
- A trusted machine $T$. All related work requires a separate machine $T \neq U$. The working assumption in the related work is that $T$ is secure, trustworthy, and not under the control of an attacker at any point in time. In the presented approach, $T$ is no longer needed as a fully-fledged computer system. To ensure a uniform notation, $T$ is also used in the following sections in the meaning of a trusted storage for a security key, e.g., a piece of paper that is written on with a pencil. As long as this piece of paper cannot be read by an outside attacker, it can be considered trusted enough. There are certainly other solutions, for example, saving the key on a USB (universal serial bus) memory device, but this article does not focus on the hardening of $T$ because offline and analog solutions are sufficient.
- The verifier $V$. Related work often differentiates between $T$ and $V$; $V$ is then only responsible for verifying the integrity and compliance of a log file or log stream. In this case, $T$ is only used to store the needed keys, and $V$ does not have to be as trustworthy as $T$. Also in this case, $T$ is able to modify any log entry, while $V$ is not.

These symbols are used for cryptographic operations:

- A strong cryptographic hash function $H$, which has to be a one-way function, i.e., a function that is easy to compute but hard to reverse, e.g., SHA-256($m$) (Secure Hash Algorithm) or Keccak.
- $\mathrm{HMAC}_k(m)$ (hash-based message authentication code): the message authentication code of the message $m$ using the key $k$.

SLOPPI does not prescribe which particular functions should be used for cryptographic operations; instead, they should be chosen specifically based on each implementation's security requirements and constraints, such as available processing power, system and data sensitivity, and induced storage overhead.
Furthermore, without loss of generality, the terms log file, log entry, and log message are distinguished as follows:

- A log file is an ordered set of log entries. The order is implied by the order in which log messages are received by SLOPPI. In line-based logs, a log entry normally corresponds to one line of the log file. Otherwise, a log entry consists of all information that is related to one event. For example, on a typical Linux system, the file /var/log/messages is a text-based log file and each line therein is a log entry; log entries are written in chronological order to this log file. For massively parallel operations, the resulting order is determined by the implementation of a syslog-style system service at run-time based on criteria outside the scope of this article. Log entries cannot be re-ordered once they have been written to a log file in the SLOPPI architecture, which is consistent with related work. Other log file formats, such as the proprietary binary Microsoft Windows event log format, can also be used with SLOPPI; for simplicity, however, all the examples given in this article refer to text-based log files with one log entry per line of text.
- Log messages are the payload of log entries; typically, log messages are human-readable character strings that are created by applications or system/device processes. Besides a log message, a log entry includes metadata, such as a timestamp and information about the log message source. Log messages typically have an application-specific structure of their own, which SLOPPI exploits for its slicing technique as detailed below.

As shown in Figure 4, the SLOPPI approach uses several log files, which are related to each other in the following way:

- The master log $L_m$ is the root of the SLOPPI data structure and is only used twice a day, first to generate and then to close a new integrity stream for the so-called daily log $L_d$.
- The daily log $L_d$ is basically used to minimize the storage needs of the master log $L_m$, which must never be deleted and therefore shall not contain any personal data or otherwise potentially compliance-offending content; otherwise, no complete verification of the integrity of all log messages could be performed. $L_d$ is kept as long as necessary and contains a new integrity stream for the application logs $L_a$.
- The application logs $L_a$, e.g., the access.log or the error.log generated by an Apache web server or the firewall.log created by a local firewall, are restarted from scratch once per day and contain all information generated by the related processes; they are extended by integrity check data. Please note that intervals other than 24 hours can be chosen arbitrarily, but daily log file rotation is most commonly used, and the term is used here for its clarity. After an arbitrarily specified retention time (seven days in the SASER scenario), these logs, which contain data protected by privacy law, have to be cleaned up or purged completely because of legal constraints in Germany and various other countries.

It is important to mention that a simple deletion of whole log files or log entries would also remove any information about attempted intrusions and other attack sources. This would cover for an intruder who could otherwise be detected by analyzing the log files, so the log files should be analyzed periodically before this automated deletion. Time periods other than full days, a rotation that is based, e.g., on a maximum number of log entries per log file, and other deletion periods may be applied, but for the sake of simplicity daily logs and a seven-day deletion period are used for the remainder of this article. This setup is also used in the SASER project and currently recommended for IT services operated by European providers.

Extending [1], this article presents details about how the application logs can be handled with a fine-grained policy that allows keeping as much information as needed for arbitrarily longer than the mentioned seven days, while still being compliant.
We also discuss log message encryption to ensure confidentiality and how SLOPPI can be used to secure structured log messages, for example, in order to remove substrings from log messages for privacy reasons, to serve as a data source for business intelligence tools, and to facilitate the visualization of security-related log entries.

III. RELATED WORK

With the exception of Section III-A, none of the related work offers a possibility to fulfill compliance, as it is not possible to delete log entries or parts thereof a posteriori. Conversely, the approach summarized in Section III-A does not address the integrity issues.

A. Privacy-enhancing log rotation

Metzger et al. presented an organization-wide concept for privacy-enhancing log rotation in [4]. In this work, log entries are deleted by log file rotation after a period of seven days, which is a common retention period in Germany based on several privacy-related verdicts. Based on surveys, Metzger et al. identified more than 200 different types of log entry sources that contain personal information in a typical higher education data center. Although deleting log entries after seven days seems to be a simple solution, the authors discuss the challenges of implementing and enforcing a strict data retention policy in large-scale distributed environments.

B. Forward integrity

Bellare and Yee introduced the term Forward Integrity in [5]. This approach is based on the combination of log entries with message authentication codes (MACs). Once a new log file is started, a secret $s_0$ is generated on $U$, which has to be sent in a secure way to a trusted $T$. This secret is necessary to verify the integrity of a log file.

Fig. 1. The basic idea behind Forward Integrity as suggested in [5]: the secrets form a hash chain $s_{i+1} = H(s_i)$, and each log entry $l_i$ is stored together with $\mathrm{HMAC}_{s_i}(l_i)$.
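The key evolution of Fig. 1 can be illustrated with a minimal sketch. It assumes SHA-256 for $H$ and HMAC-SHA-256 for the MAC; the function names `next_secret`, `append_entries`, and `first_invalid` are hypothetical and not taken from [5]. The sketch only shows the chaining logic and does not perform the secure erasure of used secrets that the scheme requires.

```python
import hashlib
import hmac
from typing import List, Optional, Tuple


def next_secret(secret: bytes) -> bytes:
    """Evolve the secret: s_{i+1} = H(s_i), with SHA-256 assumed for H."""
    return hashlib.sha256(secret).digest()


def append_entries(s0: bytes, messages: List[bytes]) -> List[Tuple[bytes, bytes]]:
    """Return (entry, tag) pairs where tag = HMAC_{s_i}(l_i)."""
    log = []
    secret = s0
    for entry in messages:
        tag = hmac.new(secret, entry, hashlib.sha256).digest()
        log.append((entry, tag))
        # A real logger would securely erase the used secret at this point;
        # Python cannot guarantee memory wiping, so the name is only rebound.
        secret = next_secret(secret)
    return log


def first_invalid(s0: bytes, log: List[Tuple[bytes, bytes]]) -> Optional[int]:
    """Replay the chain from s_0 and return the index of the first entry
    whose tag does not match, or None if every entry verifies."""
    secret = s0
    for index, (entry, tag) in enumerate(log):
        expected = hmac.new(secret, entry, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return index
        secret = next_secret(secret)
    return None
```

A verifier that holds $s_0$ would call `first_invalid(s0, log)`; every entry from the returned index onward would then have to be treated as untrustworthy.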
Once the first log entry $l_0$ is written to the log file, the HMAC of $l_0$ based on the key $s_0$ is calculated and also written to the log file. To protect the secret $s_0$, $s_1 = H(s_0)$ is then calculated, which is the new secret for the next log entry $l_1$. To prevent an attacker from easily creating or modifying log entries, the old and already used secret key for the MAC function is securely erased after this calculation. Because of the characteristics of one-way functions, it is not possible for an attacker to derive the previous key in feasible time. Figure 1 shows the underlying idea.

In their approach, in order to verify the integrity of the log file, $V$ has to know the initial key to verify all entries in sequential order. If a log entry and its MAC do not correspond, the log file has been corrupted from this point on, and any subsequent entry is no longer trustworthy.

However, the strict use of forward integrity also prohibits authorized changes to log entries; for example, if personal data shall be removed from log entries after seven days, the old MAC must be thrown away and a new MAC has to be calculated. While this is not a big issue from a computational complexity perspective, it means that the integrity of old log entries may be violated during this rollover if $U$ has meanwhile been compromised.

C. Encrypted log files

Schneier and Kelsey developed a cryptographic scheme to secure encrypted log files in [6]. They motivated the encryption of each log entry with the need for confidential logging, e.g., in financial applications. Figure 2 shows the process of saving a new log entry. Any log entry $D_j$ on $U$ is encrypted with the key $K_j$, which in turn is built from the secret $A_j$ (in this article $s_j$) and an entry type $W_j$. This entry type allows $V$ to only verify predefined log entries. Some more information is stored in a log entry, namely $Y_j$ and $Z_j$, which are used to allow the verification of a log entry without the need to decrypt $D_j$. Therefore, only $T$ is able to modify the log files. However, this approach does not allow for the deletion of log entries or parts thereof because then the verification would inevitably break.

Fig. 2. The Schneier and Kelsey approach taken from [6].

D. Public key encryption

Holt used a public/private-key-based verification process in his approach to allow a complete disjunction of $T$ and $V$ in [7]. For this purpose, a limited number of public/private-key pairs is generated. The public keys are stored in a meta-log entry, which is signed with the first private key, which should be securely erased afterwards. All other log entries are also signed with the precomputed private keys. If there are no more keys left, a new limited number of public/private-key pairs is generated.

The main benefit of this approach is that the verifier $V$ cannot modify any log entry because it only knows the public keys, which can be used for verifying the signature but do not allow any inference of the used private key.

Fig. 3. The Holt approach taken from [7].

E. Aggregated Signatures

In scenarios where disk space is the limiting factor, it is necessary that the signature, which protects each single log entry, does not take much space.
In all of the approaches sketched above, the disk usage by signatures is within $O(n)$, where $n$ is the total number of log entries. To deal with a more space-constrained scenario, Ma and Tsudik presented a new signature scheme, which aggregates all signatures of the log entries, in [8]. This approach uses archiving so that the necessary amount of disk space is reduced to only $O(1)$.

The main drawback of this approach is that a manipulation of a single log entry would break the verification process, yet the verifier is not able to determine which (presumably modified) entry causes the verification process to fail. As a consequence, it is also not possible to delete or to modify log entries, e.g., to remove personal data after reaching the maximum retention time.

F. BAF

Yavuz and Ning specified how log entries can be secured by using blind signatures in [9]. Their approach uses the log entry combined with the actual number of the log entry. For example, if the log entry $D_n$ is the $n$th entry in the log file, $D_n$ is combined with the number $n$ to prevent an attacker from reordering the log entries. This result is hashed and modified with the secret key $(a, b)$ by using a simple addition and multiplication modulo a large prime $p$. As in all other approaches, the secret key is updated and the previous version securely deleted.

The most interesting result of this approach is that verification is possible for a verifier $V$, while it is not possible for $V$ to modify any entry in the log file. This property is normally only satisfied by public/private key schemes, which are typically very expensive to compute.

IV. THE SLOPPI ARCHITECTURE AND WORKFLOWS

SLOPPI, as presented in [1], follows the classic client-server architecture of well-known POSIX (Portable Operating System Interface) logging mechanisms, such as syslog-ng and rsyslog.
Any SLOPPI implementation therefore is a continuously running process, which offers interfaces, such as an application programming interface (API) and IPC (inter-process communication), TCP (Transmission Control Protocol), or UDP (User Datagram Protocol) sockets, to receive new log messages from various system services, applications, or remote servers. After internal processing, log entries are stored in plain-text files in the local file system, where they can be processed with whichever log file viewing mechanism the local users are familiar with; alternatively, log entries can be forwarded to remote SLOPPI servers where they are treated in the same manner.

If the application log files make use of the optional encryption, SLOPPI tools can be used to decrypt the log entries using standard input and output channels, typically in combination with POSIX pipes. Similar tools can be used to strip any SLOPPI-specific information from log entries so that any other log file processing tools can be used for parsing and processing the application log files even if they are not aware of the extensions brought by the SLOPPI data format.

As SLOPPI has been designed with compliance regulations as its primary motivation, ensuring integrity and allowing for log file rotation without cryptographic re-keying are its core functionality. In the following sections, we first describe how the master log, the daily log, and the application log are intertwined to achieve these properties. Afterwards follows a discussion of the various optional functionalities for the application logs, e.g., the encryption of application log messages.

A. The SLOPPI log file hierarchy

The analysis of the related work shows that there is no solution yet that fulfills both necessary minimum requirements for log files: integrity and compliance, where the latter requires making changes to integrity-checked log entries once they have reached a certain age. The SLOPPI approach combines key operations from previous approaches in a new way to achieve both characteristics.

As introduced in Section II, several types of log files, each of which is handled slightly differently, are used for the framework. They form the following hierarchy as shown in Figure 4:

- The master log file is the root of the SLOPPI data format. It is created only once and must not be deleted. If it is deleted, e.g., by an attacker, integrity checking is no longer possible.
- The daily log files are, as implied by their name, created on a daily basis. Although other rollover periods could be used, such as hourly or weekly, we refer to them as daily logs for the sake of simplicity and because they are a de-facto industry standard. Daily log files can be deleted after a retention period, for which 7 days is the standard setting; it is, however, recommended to delete the affected application log files first.
- Application log files are the only log files in which actual payload log messages are stored; both the master log and the daily logs only contain SLOPPI-specific meta-information. There can be an arbitrary number of application log files depending on how many files all the log information should be scattered across. SLOPPI supports the typical syslog-like distribution of log messages to log files based on the originating host, application, log level, and log message content in an administrator-configured manner. While storing a log message in exactly one log file is the usual mode of operation, the same log message can be logged to multiple log files if desired, or thrown away without being written to a file, and therefore without influencing the integrity mechanism. As an alternative to local log files, log entries can also be forwarded to remote SLOPPI services, which treat them similarly to locally logged messages; communication is secured using TLS (Transport Layer Security) connections.

At the very core of SLOPPI, the master log file $L_m$ has to be secured. As it has only a few entries per day, it can be protected using a public key scheme, e.g., RSA, which is described in detail in Section IV-B. In the subsequent sections, the two keys of a public key scheme are called signing key ($k_{\text{sign}}$) and authentication key ($k_{\text{auth}}$). The daily log file $L_d$ is then considered in Section IV-C. Similar to the master log file, it only has very few entries per day, but these must already be considered too many for using public key schemes, so a symmetric key scheme is the better choice. Finally, application log file details and options are presented in Section IV-D.

B. The SLOPPI Master Log File

As stated above, the master log file $L_m$ contains data that is essential for the SLOPPI approach to protect the integrity of the log files.
To protect $L_m$, the following steps are necessary:

1) Log Initialization: Whenever a new master log is initialized, $U$ generates an authentication key ($k^{1}_{\text{auth}}$) and a signing key ($k^{1}_{\text{sign}}$). These two keys are needed to protect the log file now (using $k^{1}_{\text{sign}}$) and to verify it later (using $k^{1}_{\text{auth}}$). As the verification key should not be stored on $U$, it is sent to $T$ over a secure connection, e.g., a TLS connection. As mentioned above, it is not necessary that $T$ is a computer system, as $k^{1}_{\text{auth}}$ could also be written on a piece of paper by the administrator. In most cases, however, $T$ can be assumed to be a specially secured and encrypted database. After sending the authentication key, $U$ securely deletes $k^{1}_{\text{auth}}$.

$U$ can now initialize the master log file by saving the first message "STARTING LOG FILE" in the log file as described next. For this step, $k^{1}_{\text{sign}}$ is the currently used secret. It is important to note that the master log is normally generated only once per SLOPPI instance.

2) Saving New Log Entries: Let $m$ be the log message of the log entry to be stored in the log file. As the master log file has only a few entries per day, it can be assumed that there is enough time between saving the last entry and the current one to generate a new authentication/signing key pair $(k^{n+1}_{\text{auth}}, k^{n+1}_{\text{sign}})$, while $k^{n}_{\text{sign}}$ is the current secret. $U$ now generates

$m' = (\text{timestamp}, m, k^{n+1}_{\text{auth}})$

and computes

$e = \mathrm{Enc}_{k^{n}_{\text{sign}}}(m')$.

The result $e$ is the new log entry, which is written to the log file. Immediately after calculating the encrypted result, the keys $k^{n}_{\text{sign}}$ and $k^{n+1}_{\text{auth}}$, which are not needed anymore, are erased securely from the system. Now the master log file only contains fully encrypted data and $k^{n+1}_{\text{sign}}$ is the secret for the next log entry.

The motivation for the data format used for $m'$ is the following:

- The timestamp is used to verify the time at which a new event is logged in the master log. An abnormally high or low rate of entries in a specific time interval can indicate a system failure or an attack. To prevent any changes of the timestamp, it is also part of the encrypted data.
- The log message $m$ has to be encrypted to protect the main content of the log entry.
- $k^{n+1}_{\text{auth}}$ is the verification key for the next log entry. This is the application of the forward integrity approach, because the information necessary to decrypt and to verify the next step is available once the previous verification and decryption step is completed. As this data is very worthy of protection, it is naturally part of the encrypted data.

For example, let the first log message be $m$ = "STARTING LOG FILE" and the current secret $k^{1}_{\text{sign}}$. After generating $(k^{2}_{\text{auth}}, k^{2}_{\text{sign}})$, $U$ composes $m'$ from the current timestamp, the message "STARTING LOG FILE", and $k^{2}_{\text{auth}}$, and computes $e = \mathrm{Enc}_{k^{1}_{\text{sign}}}(m')$, which is written to the log file. Now, $k^{1}_{\text{sign}}$ and $k^{2}_{\text{auth}}$ are deleted securely.

Fig. 4. Overview of all relevant log files. The master log holds a START entry followed by one start and one stop entry per day; each daily log runs from START to END and holds the start and stop entries of the application logs; each application log runs from START to END and holds the actual log entries.

3) Closing the Log File: During regular SLOPPI operation, there should be no need to close the master log file. In case of a system restart, a serious failure, or when the storage requirements of the master log grow too much, however, it can be desirable to restart the master log. If the master log file has to be closed gracefully, the last message "CLOSING LOG FILE" is saved into the log file. In this case it is not necessary to generate a new key pair, and the next authentication key is therefore irrelevant. To fulfill the data format defined above, the log entry still has to contain the next authentication key, which is set to an empty string.
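The three steps above can be written out as one chain. The following is only a summary under assumed notation: the timestamps $t_n$, the index $N$ of the closing entry, and the decryption operator $\mathrm{Dec}$ are introduced here for illustration and do not appear in the text above.

```latex
\begin{align*}
e_1 &= \mathrm{Enc}_{k^{1}_{\mathrm{sign}}}\bigl(t_1,\ \text{``STARTING LOG FILE''},\ k^{2}_{\mathrm{auth}}\bigr) \\
e_n &= \mathrm{Enc}_{k^{n}_{\mathrm{sign}}}\bigl(t_n,\ m_n,\ k^{n+1}_{\mathrm{auth}}\bigr), \qquad 1 < n < N \\
e_N &= \mathrm{Enc}_{k^{N}_{\mathrm{sign}}}\bigl(t_N,\ \text{``CLOSING LOG FILE''},\ \text{``''}\bigr) \\[4pt]
\mathrm{Dec}_{k^{n}_{\mathrm{auth}}}(e_n) &= \bigl(t_n,\ m_n,\ k^{n+1}_{\mathrm{auth}}\bigr)
\end{align*}
```

Read from top to bottom, a verifier that obtains $k^{1}_{\text{auth}}$ from $T$ can open $e_1$, which yields $k^{2}_{\text{auth}}$ for $e_2$, and so on until the closing entry, whose key field is empty.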
4) Content of Log Messages: As $L_m$ is used as a meta log, which does not contain any application or system messages, the content of the log messages $m$ is now specified. As mentioned before, the daily log is encrypted with a symmetric crypto scheme. Every day a new daily log is initialized by the system. The name and location of the created daily log is the variable $p_1$. Furthermore, the variable $p_2$ is the first entry in $L_d$, and finally the variable $p_3$ denotes the key necessary for the log initialization step. $m$ is then the concatenation of $p_1$, $p_2$, and $p_3$, e.g., /var/log/ log;STARTING LOG FILE;VerySecretKey, together with $H(p_1, p_2, p_3)$. Because manipulations of $L_m$ need to be detected, it is necessary that $m$ also contains a hash value of $p_1$, $p_2$, and $p_3$. With the knowledge of $H(p_1, p_2, p_3)$ it is possible to detect exactly where the decryption process failed.

C. The SLOPPI Daily Log File

The main reason to use the daily log is to reduce the storage space requirements of the master log. It is quite unusual that the master log is initialized a second time if the system is running normally. There are roughly two entries per day, which have to be stored over a long time. The daily log can be deleted after all application logs mentioned in this specific daily log have been deleted. Depending on the number of applications running on a server, it is not unusual that many more than one application log is used on a system.

1) Log Initialization: Every day a new daily log has to be initialized if a daily log rotation is configured on the system. Otherwise, another initialization interval is used for starting a new daily log. At the beginning, $U$ generates a symmetric key $k_{\text{sym}}$, which is necessary for both encryption and verification. This key is the initial secret and has to be stored in a trusted space, e.g., on $T$. As the SLOPPI approach does not need a separate $T$, the already existing $L_m$ is used as a trusted third party.
As described above, it is unlikely that an attacker can extract information from $L_m$ because this log file is fully encrypted. If the attacker has already compromised the system, no new log message can be trusted any longer and the attacker is able to modify any computation step. Therefore, it is secure to store the key inside $L_m$. The key $k_{sym}$ is written together with the name and path of the current log file. This information is combined with the first message STARTING LOG FILE. The secret currently in use is now $k_{sym}$. U can then initialize the daily log file by saving the first message STARTING LOG FILE in the log file as described next.
2) Saving New Log Entries: To encrypt the important information of the daily log, a symmetric key scheme is used, e. g., AES (Advanced Encryption Standard). In contrast to the master log, not all information stored in $L_d$ can be encrypted, because parts are needed in plain text during later steps, as detailed below. Let $(m, \bar{m})$ be the message that has to be stored in the log entry, where $m$ is the part to be encrypted and $\bar{m}$ the part that remains in plain text, and let $k^{old}_{sym}$ be the secret key currently in use. The precise meaning of $(m, \bar{m})$ is defined below along with the log message content. U now randomly chooses a new secret key $k^{new}_{sym}$. Because a symmetric key scheme is used, this step is not computationally expensive. Analogous to the master log processing, U generates an extended message with the structure $m' = (\mathit{timestamp}, m, k^{new}_{sym}, H(\bar{m}, (\mathit{timestamp}, m, k^{new}_{sym})))$. With this, the log entry $e = (\bar{m}, \mathrm{Enc}_{k^{old}_{sym}}(m'))$ can be computed and written to the log file. The hash value is stored for verification purposes, so that the exact log entry where a manipulation took place can be detected. Immediately after calculating the encrypted result, the key $k^{old}_{sym}$, which is no longer needed, is erased securely from the system. Now $k^{new}_{sym}$ is the secret for the next log entry.
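The entry format above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the SHA-256 counter-mode keystream is a toy stand-in for AES so that the example runs with the standard library only, JSON is an assumed serialization, and the names `append_entry` and `verify_chain` are hypothetical. The sketch shows that every entry carries the key for its successor, so removing or altering an entry breaks verification from that point on.

```python
import hashlib
import hmac
import json
import secrets


def _keystream(key, nonce, length):
    # Toy keystream (SHA-256 in counter mode); an assumed stand-in for AES.
    blocks = []
    for counter in range((length + 31) // 32):
        blocks.append(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
    return b"".join(blocks)[:length]


def enc(key, plaintext):
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def dec(key, blob):
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def entry_hash(plain, timestamp, secret, k_new_hex):
    # H(m_bar, (timestamp, m, k_new)) over an assumed JSON serialization.
    data = json.dumps([plain, [timestamp, secret, k_new_hex]]).encode()
    return hashlib.sha256(data).hexdigest()


def append_entry(log, k_old, timestamp, secret, plain, closing=False):
    """Append e = (m_bar, Enc_{k_old}(m')) and return the key for the next entry."""
    k_new = b"" if closing else secrets.token_bytes(32)  # empty key when closing
    digest = entry_hash(plain, timestamp, secret, k_new.hex())
    m_prime = json.dumps([timestamp, secret, k_new.hex(), digest]).encode()
    log.append((plain, enc(k_old, m_prime)))
    return k_new  # the caller must now discard k_old


def verify_chain(log, k_first):
    """Return the index of the first entry that fails to verify, or None."""
    key = k_first
    for index, (plain, blob) in enumerate(log):
        try:
            timestamp, secret, k_new_hex, digest = json.loads(dec(key, blob))
            ok = hmac.compare_digest(digest, entry_hash(plain, timestamp, secret, k_new_hex))
            next_key = bytes.fromhex(k_new_hex)
        except (ValueError, TypeError):
            return index  # garbage after decryption: wrong key or altered entry
        if not ok:
            return index
        key = next_key
    return None
```

Python cannot guarantee that a discarded key is actually wiped from memory, so the secure erasure that the scheme requires is outside the scope of this sketch.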
The motivation for the data format used for $m'$ is the following: The timestamp records when a new event is logged in the daily log. An abnormally high or low rate of entries in a specific time interval can indicate a system failure or an attack. To prevent any changes to the timestamp, it is also part of the encrypted data. The log message $m$ has to be encrypted to protect the main content of the log entry. Besides $m$, there is also information $\bar{m}$ that has to remain in plain text because the SLOPPI tools need it during typical use. $k^{new}_{sym}$ is the verification key for the next log entry. This applies the forward integrity approach: all information needed to decrypt and verify the next entry becomes available once the previous verification and decryption step has completed. As this key is highly sensitive, it again is naturally part of the encrypted data.
3) Closing the Log File: If the daily log file has to be closed, the last message CLOSING LOG FILE is saved. In this case no new key needs to be generated, so the next authentication key is irrelevant. To comply with the data format defined above, the log entry must still contain a next authentication key, which is set to the empty string. This message, the file name and location, the MAC of the entire log file, and the last generated key are then stored in the master log.
4) Content of Log Messages: In the daily log, there are five types of messages. In contrast to the master log, which contains only meta information that is relevant solely for verification, the daily log also contains information that is needed during day-to-day use of the system. Therefore, these parts of a message have to remain unencrypted. In the following, $\varepsilon$ denotes the empty string of zero bytes.
STARTING LOG FILE. This message consists only of the string, which is encrypted: $(m, \bar{m}) = (\text{STARTING LOG FILE}, \varepsilon)$.
CLOSING LOG FILE. This message consists only of the string, which is encrypted: $(m, \bar{m}) = (\text{CLOSING LOG FILE}, \varepsilon)$.
START APPLICATION LOG. This message contains, in plain text, the timestamp when the application log was initialized (in most cases the same timestamp that is used above for saving the log file) as well as the file name and location of the application log. This information is saved in plain text inside the daily log because it is needed to identify the application logs that are connected to the specific daily log. Furthermore, the message contains the initialization key, the first message, and the file name and location of the application log in encrypted form. The encryption step happens when the log entry is saved to the daily log. The initialization key of the application log is saved in the daily log because the daily log is the trusted third party T for the application log. This leads to the encrypted part $m$ = (START APPLICATION LOG, initialization key, first message, file name, location) and the plain-text part $\bar{m}$ = (START APPLICATION LOG, timestamp, file name, location).
STOP APPLICATION LOG. Similar to the start message, this message contains in plain text a timestamp, the file name, and the location of the application log. The encrypted parts, which are also encrypted when the log entry is saved, are the last key, the last message, and the file name and location of the application log. This leads to the encrypted part $m$ = (STOP APPLICATION LOG, last key, last message, file name, location) and the plain-text part $\bar{m}$ = (STOP APPLICATION LOG, timestamp, file name, location).

ROTATING APPLICATION LOG. When log rotation takes place, the SLOPPI tool must be able to determine which new log file was previously protected by the SLOPPI approach and which daily log this new log file is assigned to. For this reason, the message contains a timestamp and both file names, i. e., before and after the rotation, in plain text. As in the previous message type, the file names and locations are also part of the cipher text for security reasons. This leads to the encrypted part $m$ = (ROTATING APPLICATION LOG, file name before, file name after) and the plain-text part $\bar{m}$ = (ROTATING APPLICATION LOG, timestamp, file name before, file name after).
It is important that all application logs that have their starting message in a daily log also write their stopping message to the same daily log. This is the reason why some contents of the log file remain in plain text. Otherwise, the logging engine would have to remember which application log is connected to which daily log. This also means that a daily log may still be open when the next daily log is initialized. The ROTATING APPLICATION LOG message may appear in later daily logs because the specifics of the log rotation algorithm are not known; a log rotation might, for instance, be performed only once a week.
D. The SLOPPI Application Log Files
In general, the application log files $L_a$ can be protected by any approach presented in Section III and are not a mandatory part of the SLOPPI architecture. These log files can also be deleted for compliance reasons. It is also possible to use log rotation techniques to fulfill local data protection policies. It should be noted that any information about an attacker that is not detected during this period will be lost and cannot be recovered.
But this is not a drawback of the presented framework, because it is necessary to comply with data protection legislation, especially in Europe and Germany, which mandates erasing any privacy-protected data after seven days. In scenarios where the log files can only be read after a longer offline period, e. g., on low-power sensor network devices, the deletion period should be set individually so that an administrator is able to analyze the log data before it is deleted.
SLOPPI's core components ensure that the essential requirements of secure logging are fulfilled: integrity and compliance. However, in high-security environments, additional characteristics are required, such as the confidentiality of application log entries with selective deletion of personal data. In this section, we present modular extensions to the SLOPPI architecture and workflows to provide such additional functionality. Due to the openness of SLOPPI's architecture, arbitrary other modules can also be implemented.
1) Application Log Message Encryption: In many cases application log files should be in plain text so that other applications or human administrators can inspect them. In other cases, however, it is appropriate to protect the content of the log files from prying eyes. The basic SLOPPI approach does not require the log entries in the application log files to be encrypted. For cases where encryption is needed, SLOPPI provides an extension that supports different encryption methods for the most common scenarios, as discussed below. In general, any existing encryption algorithm may be used for securing particular application logs. The encryption extension is therefore specified abstractly, assuming an encryption method that uses the secrets $k_{sign}$ (encryption) and $k_{auth}$ (decryption) to secure a particular application log $L_a$.
Naturally, $k_{sign} = k_{auth}$ if a symmetric encryption scheme is used, which is generally recommended due to the amount of data that is typically logged to application logs. Based on this assumption, there are still several options for how to actually encrypt the application log messages. The choice must be based on the underlying scenario; the most common ones are discussed in the following:
Sometimes different persons or groups are responsible for the administration of a system or service. As this sharing of responsibility usually leads to more than one application log, split along the different groups, it has to be defined whether all application log files use the same key pair for encryption or a different one for each log file.
Log messages in application logs can be very critical. For example, a bank may log each money transfer to the application log to prove that the transfer succeeded or failed. Assume that in this case it is unacceptable for a log message to be readable even for a short period of time. This yields the requirement that each log message is encrypted with a key that differs from the keys of the previous and the following message.
As SLOPPI should be runnable on nearly every system, it is sometimes difficult to generate a new randomly chosen key every time (especially in the "one message, one key" case). For this scenario, SLOPPI also supports auto-derivation of the encryption key by using an appropriate key derivation function. This function can, for example, be based on Keccak or other algorithms that also support the generation of arbitrary-length key material.
In general, log file analysis should be done nearly in real time to detect unusual events. For this it is not necessary to take a look into the log file after a predefined period of time, e. g., after the daily log rotation.
If this is the case, it is sufficient to analyze the log files while the verification process is running, so the log file encryption can start, for example, with a randomly chosen key pair. Otherwise, the verification and the log file analysis, i. e., the log file decryption, are different processes. In that case, the decryption key has to be deduced by applying an appropriate key derivation function. Obviously, the initial secret then has to be stored on T; otherwise, decryption would also be possible for any attacker.
The encryption extension for SLOPPI supports any combination of the options described above. Which options are chosen in a specific scenario depends on the characteristics of the application log file and the environment, e. g., the number of log lines per day, the maximum and average rate at which log lines are added, and the computing power and memory resources of the system, weighed against the confidentiality required in the face of potential attackers within a one-day period. Even if the single key-pair option is chosen, i. e., one key per day for one or more log files, a potential attacker will not be able to read any application log files of prior days, but may be able to read the log file of the current day, as the single key pair is still accessible.
In any case, all necessary information about the choices made has to be stored initially in the respective start application log message in the daily log file. Therefore, the content of the start application log message has to be extended in comparison to basic SLOPPI: the encrypted part $m$ = (START APPLICATION LOG, initialization key, first message, file name, location) has to be changed to $m$ = (START APPLICATION LOG, initialization key, encryption method, initial key or derivation info, iteration used?, chaining used?, first message, file name, location), where encryption method identifies the encryption algorithm used and which of the extensions above is in use. The field initial key or derivation info is where the (initial) decryption key is stored. If this key is computable from previous information, the key is replaced by information about the key derivation function used. The last two optional parts, iteration used and chaining used, indicate which of the extensions described above are in use. The plain-text part $\bar{m}$ = (START APPLICATION LOG, timestamp, file name, location) does not require any changes.
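The auto-derivation option can be sketched as follows. This is only an illustration under stated assumptions: SHAKE-256 (a Keccak-based extendable-output function available in Python's hashlib) is used as the key derivation function, and the domain-separation label and the function names are hypothetical rather than part of SLOPPI.

```python
import hashlib

# Hypothetical label that separates this use of SHAKE-256 from any other.
REKEY_LABEL = b"sloppi-rekey"


def derive_next_key(current_key, length=32):
    """Derive k_{i+1} from k_i; SHAKE-256 can emit key material of any length."""
    return hashlib.shake_256(REKEY_LABEL + current_key).digest(length)


def key_sequence(initial_key, count, length=32):
    """Recompute the first `count` keys from the initial secret alone."""
    keys = [initial_key]
    for _ in range(count - 1):
        keys.append(derive_next_key(keys[-1], length))
    return keys
```

Because the derivation is one-way, the logger can erase each key once its successor has been derived, while a verifier that obtains the initial secret from the daily log can recompute the whole sequence.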
All application log messages of the respective application log file have to be encrypted according to the choices made. Additionally, if the chaining mode was chosen, each newly selected decryption key for the next application log entry has to be concatenated with the message of the current application log entry before this concatenation is encrypted using the current encryption key. This leads to the following:
$\mathit{applogentry}_i = E_i(\mathit{decryptkey}_{i+1}; \mathit{msg}_i)$ if chaining is used,
$\mathit{applogentry}_i = E_i(\mathit{msg}_i)$ otherwise.
2) Application Log Message Enrichment: Log messages sent to SLOPPI can be parsed and enriched before being stored in an application log file. This brings the following benefits: The application-specific structure of log messages can be made explicit, for example, by marking data fields containing personal data or specific error codes. This simplifies processing SLOPPI application log files in log management tools or business intelligence software, such as Splunk. Recommendations on how the content of the log message should be visualized can be added.
Given a log message $m$ of length $n$, $c_1, \ldots, c_n$ denote the individual characters that $m$ is composed of. A slice is defined as a tuple $\langle c_i, c_j \rangle$ with $i \le j$ and denotes the boundaries of an arbitrary non-empty substring of $m$. Unique names are assigned to slices and can be used to describe their semantics. For example, if a network service's log message contains the IP address of a client, the resulting slice could be named ipaddress. The exact boundaries for each slice in any log message are determined by parsing rules. Typically, regular expressions will be used to parse a log message in order to determine the slice boundaries. If a parsing rule matches the given log message, the slice name and boundaries are appended to the log entry. For example, if an IPv4 address is detected in a log message starting at $c_i$ with $i = 27$, then the corresponding slice name and boundaries are appended to the log entry. In general, slices can be overlapping.
For example, a whole log message may be slice-tagged as service x's error message, while a substring may be tagged ipaddress. However, for the personal data anonymization procedure discussed below, care must be taken that the two types of slices used do not overlap, to ensure that one slice's hash value does not change when another slice is modified. Otherwise, the a-posteriori anonymization of log files would break the verification procedure.
Slices are useful for two slightly different use cases: On the one hand, the SLOPPI tools can be instructed to verify and decrypt only selected slices. This allows for a fine-grained access model in which system administrators can be restricted to the parts of single log file entries they are allowed to read, which is more detailed access management than the traditional per-file or per-log-entry model found in most of today's implementations. On the other hand, slice names and ranges can be fed to other log entry processing tools, such as business intelligence software and log file visualization tools. This not only saves the overhead of parsing the same log message multiple times in different processing tools; its true power lies in the fact that the slicing can be integrity-checked. For example, it becomes obvious whether an IPv4 address has been recognized as such by looking at the slice names; compliance violations, such as not checking which parts of log messages contain personal data that must be removed after the maximum retention time, become easy for an auditor to spot. Also, administrators do not need to wait until the maximum retention time has been reached to verify whether an anonymization batch run will modify the correct parts of the log messages.
E. Generalized Privacy Protection for the Application Logs
As introduced in the previous section, SLOPPI also provides semantic tagging of log entries. This can be used to identify privacy-relevant data.
With this knowledge it is possible to anonymize these relevant parts of the log file, while the remainder of the information can be stored untouched.
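The slice tagging from the enrichment extension can be sketched in a few lines of Python. The parsing rules, the sample log message, and the output structure are assumptions made for illustration; SLOPPI does not prescribe them. Positions are reported 1-based and inclusive to match the $c_i$, $c_j$ notation.

```python
import re

# Hypothetical parsing rules: slice name -> regular expression.
PARSING_RULES = {
    "ipaddress": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "username": re.compile(r"(?<=for )\S+(?= from)"),
}


def tag_slices(message):
    """Return (name, i, j) triples marking characters c_i..c_j of the message."""
    slices = []
    for name, pattern in PARSING_RULES.items():
        for match in pattern.finditer(message):
            # match.start() is 0-based; match.end() is exclusive, so it already
            # equals the 1-based inclusive end position.
            slices.append((name, match.start() + 1, match.end()))
    return slices


def enrich(message):
    """Attach the slice names and boundaries to the log entry."""
    return {"message": message, "slices": tag_slices(message)}
```

A downstream tool can then read the privacy-relevant substrings directly from the stored boundaries, for example `message[i - 1:j]` for each triple, without parsing the message again.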

In general, a log message is only a string without any semantics for the normal logging process, which is therefore unable to identify privacy-relevant data. With SLOPPI and the application log message enrichment extension, the logging process can be given the information needed to separate compliance-sensitive data from other text. As the goal is to enable the a-posteriori anonymization of log files, the content of each log message is split into two categories:
The part of the log that has to be anonymized later on.
Everything else.
To separate the data to be anonymized from all other data, the exact log format has to be defined a priori. For example, in SSH (Secure Shell) log files, the log entries always follow a uniform format: on a standard Debian web server, each entry begins with a timestamp, followed by the host name, the process name and number, and finally the log message, which contains the authentication method, the user name, the source IP address, and the port. In this example, only the user name and the IP address have to be anonymized to comply with the legal requirements. With the help of regular expressions, the positions of this information can be found. In other cases this might be more complicated, but in general each logging source has its own specified output format. If a log file contains too many mixed entries, it can always be split into several log files.
As written above, any approach of Section III can be used to protect the application log. In the following, the forward integrity approach is described; any other approach can be handled analogously. For integrity protection, a MAC of each log entry is used. This MAC would change if an anonymization were performed afterwards, and verification would become impossible.
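The effect just described, and the two-MAC remedy that the text introduces next, can be illustrated with a short Python sketch. It rests on several assumptions: HMAC-SHA256 stands in for the MAC, a single fixed key replaces the evolving forward-integrity keys, slices are given as non-overlapping 0-based (start, end) pairs with an exclusive end, and all function names are hypothetical.

```python
import hashlib
import hmac

PLACEHOLDER = "XXX"  # constant-length replacement string


def anonymize(message, slices):
    """Replace every (start, end) slice by the placeholder, right to left."""
    for start, end in sorted(slices, reverse=True):
        message = message[:start] + PLACEHOLDER + message[end:]
    return message


def mac(key, text):
    return hmac.new(key, text.encode(), hashlib.sha256).hexdigest()


def protect(key, message, private_slices):
    """Store the entry with one MAC over the original and one over the anonymized text."""
    return {
        "message": message,
        "mac_original": mac(key, message),
        "mac_anonymized": mac(key, anonymize(message, private_slices)),
    }


def verify(key, entry):
    """An entry verifies if its current text matches either of the two MACs."""
    current = mac(key, entry["message"])
    return (hmac.compare_digest(current, entry["mac_original"])
            or hmac.compare_digest(current, entry["mac_anonymized"]))
```

Processing the slices from right to left keeps the remaining boundaries valid even though the placeholder has a different length than the text it replaces.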
But with the knowledge of the relevant parts, it is possible to calculate two MACs. The first is the original MAC over all information, while the second MAC is calculated over the log message with the relevant parts already anonymized. Both MACs are appended to the log entry. The verification process can now check both MAC values: if no anonymization has been performed yet, the original MAC is used; otherwise, the verification step uses the second MAC. For this it is important that the SLOPPI anonymization uses the same anonymization string every time. To ensure human readability, i. e., so that administrators can see which parts of a log message have already been anonymized, a placeholder replacement string such as XXX or *** should be used. For simple implementation, it needs to be of constant length; this also avoids issues regarding the de-anonymization success probability when using variable-length strings.
In general, it is important to know which log entry has been anonymized at what time, because an attacker could otherwise anonymize any log entry himself to cover up his traces. To prevent this, SLOPPI generates a new log entry in the daily log, which contains the message LOG ANONYMIZATION and the information about which log file and which log entries are being anonymized, e. g., all log entries between two timestamps. To avoid race conditions, this log entry is created after an anonymization batch run has successfully finished.
F. Generalized Privacy Protection for the Application Logs with Partial Encryption
The extensions of Sections IV-D1 and IV-E can be combined to allow both the partial deletion of log entries, e. g., to support privacy for relevant data, and the partial encryption of log entries at the same time. Assume an application log file $L_a$ whose log messages are partitioned into different slices (i. e., tagged parts):
Different slices of application log messages are hashed separately, either iteratively or in chained mode, as described in Section IV-E; this is done using multiple initial hash secrets, which are stored in the respective start application log message for the application log file in the daily log file.
Particular slices of application log messages can additionally be encrypted iteratively, similar to Section IV-D1, but only for the particular slice.
Each slice is treated separately with its own encryption method and secret. Therefore, the method identifier needs to be stored along with the initial decryption key for each slice in the respective application log message for the application log file in the daily log file. In sum, this requires the slicing of log messages along with partial encryption of each slice and therefore adds some overhead to the application and daily log files.
V. SLOPPI LOG ENTRY VERIFICATION PROCEDURE
To verify log entries, the initial master key is needed. Each log entry in the master log is encrypted as $\mathrm{Enc}_{k^{n}_{sign}}(m')$ with $m' = (\mathit{timestamp}, m, k^{n+1}_{auth})$. To decrypt the message, only the authentication key is needed, which is stored during the log file initialization step. After the first log entry is decrypted, the authentication key to decrypt the second log entry is obtained implicitly, and so on. The first occurrence of a log entry that cannot be decrypted proves that the log files have been manipulated, by an attacker or by a malicious administrator trying to cover his traces. This step verifies the master log and yields the verification keys for the daily logs. As the entries in the daily log have the form $\mathrm{Enc}_{k_{sym}}(m')$ with $m' = (\mathit{timestamp}, m, k^{new}_{sym})$, the first entry can be decrypted using the symmetric key stored in the master log. The symmetric key for every subsequent entry is contained in the message payload of the previous log entry.
As above in the master log, it is not possible for an attacker to modify any log entry in such a way that the encryption step works correctly.
A. Master Log: Verification of Log Messages
To verify an existing master log, it is necessary to use the authentication key saved during the generation of the log file. With this key it is possible to decrypt the first entry, which leads to the next authentication key. This step can be repeated until the current last message or the literal CLOSING LOG FILE entry is reached. Because $m$ contains the necessary information about the daily log files, it is possible to verify any daily log that is still available. If a daily log has already been deleted, then this daily log and the connected application logs cannot be verified any more, but it is still possible to use the subsequent log entries of $L_m$.
Normally the master log file is not closed, because the master log is intended to run endlessly. Therefore, the last timestamp has to be near the current timestamp; the acceptable deviation depends on the daily log scenario. In the case of a failure, e. g., in the storage device, writing to the master log file may be interrupted suddenly. In this case, the last message of the master log is not the CLOSING LOG FILE message and probably has a somewhat outdated timestamp, depending on the daily log scenario. Verification is then only possible up to the last timestamp, just as if the master log had been closed regularly and no new master log had been started.
B. Daily Log: Verification of Log Messages
Because of the need to detect any manipulation of $L_m$, it is necessary that $e$ also contains a hash value of the message. With the knowledge of this hash value it is possible to detect where the decryption process failed. To verify an existing daily log, it is necessary to use the authentication key saved during the generation of the log file. With this key it is possible to decrypt the first entry, which in turn leads to the next authentication key. This step can be repeated until the current last message or the CLOSING LOG FILE entry is reached. Because $m$ contains the necessary information about the application log files, it is possible to verify any application log that is still available.
If an application log has already been deleted, then this application log cannot be verified any more, but it is still possible to use the subsequent log entries of $L_d$. In the case of a failure, e. g., in the storage device, writing to the daily log file may be interrupted suddenly. In this case, the last message of the daily log is not the expected CLOSING LOG FILE message; it may also have an outdated timestamp. Verification is then only possible up to the last timestamp, just as if the daily log had been closed regularly and no new daily log had been started. Furthermore, the transfer of data between the daily log and the master log can fail. In this case the master log does not receive the information that a daily log was initialized or closed. This leads to a point where the verification process fails, because the last secret keys have already been deleted and cannot be recovered to verify the last log entry.
C. Application Log: Verification of Log Messages
The procedure for verifying application log entries depends heavily on the options chosen for encryption. Basically, the key material required to begin the verification is part of the daily log. If the application log is not encrypted, the verification only checks the application log's integrity by calculating the hash values over the current application log file line by line and comparing the results with the values stored in the log file. Any mismatch indicates that the system has been compromised after the timestamp of the previous entry, i. e., the one before the offending line. If the application log is additionally encrypted, there are two verification methods.
On the one hand, full verification requires decrypting the application log file line by line, as the decryption key material required for the next line is either based on the current line's material (iterative mode) or stored encrypted in the current log entry (chaining mode). An application log file's full verification is therefore $O(n)$, with $n$ being the total number of lines in the application log file. Full verification is successful when all hash values match and the last line decrypts to yield the end application log line. On the other hand, the partial, or basic, verification omits any decryption steps. In other words, it just ensures that the integrity-protecting hash values are correct. The advantage of this approach is that basic verification can be performed without supplying the decryption key. It can therefore be used with automated tools that are not trusted enough to be provided with the decryption password.
VI. SLOPPI SECURITY ANALYSIS
In the SLOPPI approach, the authenticate-then-encrypt method is used to secure, for example, the daily log: the message $m$ is first authenticated by a hash function $H$, and then both $m$ and $H(m)$ are encrypted. As Krawczyk discovered, there are some problems with authenticate-then-encrypt, which make it vulnerable to chosen plaintext attacks [10]. On the other hand, with the right encryption method (e. g., CBC (Cipher Block Chaining) mode or stream ciphers), authenticate-then-encrypt is also secure. This security consideration has to be taken into account when the SLOPPI framework is implemented. Furthermore, it is possible to dispense with most of the calculated MACs by using authenticated encryption with associated data cipher modes [11]. With these cipher modes, the encryption step alone provides both encryption and authentication of the messages.
On the other hand, it has to be evaluated whether the performance and memory needs of these ciphers are comparable to those of traditional cryptographic schemes.
On the one hand, it is not possible for an attacker to gain information from the daily log or the master log, as they are both encrypted with well-known crypto schemes. On the other hand, it is not possible to delete any entry of the log files, because during the verification step the decryption of the entry would fail and the manipulation would become evident. Furthermore, assuming proper implementation, any used key is removed securely from memory immediately, so no attacker could restore it.

The only option left to an attacker is to be fast enough to gain access to the system and to shut the logging mechanism down before any log entry is written to disk. This may happen, e. g., when the attacker performs a DoS attack on the system before breaking in. However, implementations of the system's logging service can protect against this kind of attack by aggregating multiple identical events before writing them to application log files; the same approach can be used for future SLOPPI implementations.
The confidentiality of log messages is related to the feasibility of unauthorized decryption. For SLOPPI, the following aspects need to be considered:
The master log file is strongly encrypted in any of the operational modes described above, i. e., for the iterative mode as well as the chaining mode, which is applied to each individual application log file. The master log file uses public/private key encryption and is therefore secure as long as either of the two keys is kept out of an attacker's reach.
The daily log file is also strongly encrypted in any of the operational modes. However, it uses a symmetric encryption scheme, which means its security relies on generating high-quality random key material that an attacker must never see. Generating key material, e. g., by using pseudo-random number generators (PRNGs), typically works well on systems with good entropy pools, such as networked servers or interactively used machines. However, embedded systems without entropy-generating sensors in particular may be bad at generating new key material. In this case, it is mandatory to manually plant, before system deployment, a unique random number seed that provides enough input for further key derivation during the system's estimated run-time.
In the basic SLOPPI operating mode, only iteratively securely hashed key material is used. The initial hash secret is saved in the respective opening daily log message in the daily log.
Therefore, the confidentiality of the daily log file, which has been discussed above, is sufficient to ensure the secrecy of this keying material.

When SLOPPI is used to also encrypt log messages in the application log files, these log files make use of the same iterative or chained encryption: The initial decryption key $k_{m,d,1}$ is stored along with its initial hash secret in the respective daily log message about opening the application log; the confidentiality of the daily log protects this key material. With each log entry $i$, $i \geq 1$, in the application log, the entry meta-data also contains the next iteration of the decryption key, $k_{m,d,i+1}$.

When using SLOPPI with application log file encryption, the choice between iterative and chained key material generation deserves further discussion. The basis for both options is the generation of new key material in the beginning; it depends on the quality of the PRNG, similar to the daily log, which has been discussed above. Furthermore, key material that is no longer needed must be correctly and irrevocably erased from memory, i.e., after the first log message iteration or after the application log file has been closed. The confidentiality also depends on the strength and quality of the key derivation algorithm used and its mode of operation, e.g., counter mode, feedback mode, or double-pipeline iteration mode. Similar to encryption, many options are available and SLOPPI does not prescribe which one to use.

If SLOPPI is used without any iteration re-keying option, then an attacker may be able to read plain text messages that were written to the application log file before the attacker gained control of the system, but limited to the particular application log file of the current day. For any other, already closed application log files, i.e., those from earlier days, the key material used should already have been irrevocably erased from memory.
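The difference between iterative and chained key material generation can be sketched as follows. This is a minimal illustration under stated assumptions: HMAC-SHA256 with a fixed label stands in for whichever key derivation function an implementation chooses (SLOPPI does not prescribe one), and the names `next_key`, `key_sequence`, and `fresh_key` are hypothetical.

```python
import hashlib
import hmac
import secrets


def next_key(current_key: bytes) -> bytes:
    """Iterative mode: derive the following key with a one-way function."""
    # The label is an arbitrary constant chosen for this sketch.
    return hmac.new(current_key, b"sloppi-sketch-rekey", hashlib.sha256).digest()


def key_sequence(initial_key: bytes, count: int) -> list[bytes]:
    """Return the first `count` keys, where each key is next_key() of its predecessor."""
    keys = [initial_key]
    while len(keys) < count:
        keys.append(next_key(keys[-1]))
    return keys[:count]


def fresh_key() -> bytes:
    """Chaining mode: draw independent random key material for every entry."""
    return secrets.token_bytes(32)
```

In the iterative variant only the initial key needs good randomness, whereas the chained variant consumes new random material for every entry. Note that Python's immutable `bytes` objects cannot be reliably wiped, so the irrevocable erasure of old keys that the analysis requires would need a different language or a dedicated library.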
If the auto-iteration SLOPPI mode is used, which is based on key-derivation functions to generate new key material during re-keying, and the proper removal of no longer needed key material from memory is ensured, then the attacker is unable to access the plain text of the application log messages written up until the re-keying that is current at the time of system compromise. Finally, if the chaining SLOPPI mode is used, i.e., new key material is generated for each application log entry, the quality of the achieved protection again is based on the quality of the PRNG used and on the proper, reliable removal of outdated key material from memory. Attackers cannot gain access to the plain text of previously written log entries either. Depending on the cipher that is used, ideally both options only differ in the amount of new key material that needs to be generated.

Application log files can be deleted independently of their encryption status, which is required to fulfill compliance policies regarding retention time, without violating their integrity status, because the actual content of an application log message is irrelevant for the integrity check. The mode of encryption therefore does not interfere with regular SLOPPI operations.

It must not be neglected that SLOPPI does not address the availability of log files. Attackers can remove traces of attacks by simply deleting log files. Although this leads to an alarming situation, i.e., administrators can be informed about missing, truncated, or integrity-violating files, SLOPPI does not guarantee full traceability of all logged messages. Other techniques, such as using write-only memory, would be needed, as this problem is all but impossible to fix realistically in software, given today's operating systems and their potential vulnerabilities; i.e., even a write-once file system could be accessed in a reading fashion, e.g., after system reboot and re-parametrization by an attacker who gained administrative privileges.

VII. CONCLUSION

SLOPPI is a log file management framework that supports integrity-checking and confidentiality through encryption like other approaches, but has a special focus on compliance. Compliance with privacy and data protection acts can, at least in Europe, only be achieved by limiting the retention time of personal data, which may also be a part of log messages written to log files. SLOPPI therefore supports the fine-grained partial removal of log entries, while still ensuring properly working integrity monitoring of the other logged data. For example, German Internet service providers and any other providers of Internet-based services are obliged to remove personal data from log files after 7 days based on court verdicts that have become effective. Using the SLOPPI approach, log files older than the maximum personal data retention time can be modified to have all personal data removed, while still ensuring that the log file has not been tampered with otherwise.

In this article, we have presented the inner workings and security analysis of SLOPPI in more detail than in our previous work [1] and specified two extensions to the original SLOPPI architecture. First, we introduced the concept of slicing log messages and using the slices for semantic tagging, which can also be used in conjunction with external tools, such as business intelligence applications, for a-posteriori removal of personal data from log entries, and for applying encryption for fine-grained access management that protects read access to log file information. Second, we presented the two options of iterations and chaining to address re-keying for encrypting application log files, which were not part of the original SLOPPI specification. To this extent, SLOPPI has several important functional advantages over previous state-of-the-art logging mechanisms.
However, they come with the inherent overhead of additional cryptographic operations and, to a large extent, depend on the quality of random numbers generated for constructing key material, which can be a problem for low-interaction systems without sufficient entropy-generating pools. As future work on SLOPPI, it is therefore planned to support adaptive protection mechanisms that vary the strength of the key material used depending on the sensitivity of the data it is applied to. For example, stronger cryptography could be applied to log message slices containing personal data, while trivial log entries are only protected using weaker mechanisms.

ACKNOWLEDGMENT

Parts of this work have been funded by the German Ministry of Education and Research (FKZ: 16BP12309). The authors wish to thank the members of the Munich Network Management (MNM) Team for helpful comments on previous versions of this article. The MNM-Team, directed by Prof. Dr. Dieter Kranzlmüller and Prof. Dr. Heinz-Gerd Hegering, is a group of researchers at Ludwig-Maximilians-Universität München, Technische Universität München, the Universität der Bundeswehr München, and the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities.

REFERENCES

[1] F. von Eye, D. Schmitz, and W. Hommel, "SLOPPI - a framework for secure logging with privacy protection and integrity," in ICIMP 2013, The Eighth International Conference on Internet Monitoring and Protection, W. Dougherty and P. Dini, Eds. Roma, Italia: IARIA, Jun. 2013.
[2] W. Hommel, S. Metzger, H. Reiser, and F. von Eye, "Log file management compliance and insider threat detection at higher education institutions," in Proceedings of the EUNIS 12 congress, Oct. 2012.
[3] M. Hoffmann, "The SASER-SIEGFRIED project website." [Online]. Available: /SASER/SASER-b-Siegfried/saser-b-default.asp
[4] S. Metzger, W. Hommel, and H. Reiser, "Migration gewachsener Umgebungen auf ein zentrales, datenschutzorientiertes Log-Management-System," in Informatik. Springer, 2011.
[5] M. Bellare and B. S. Yee, "Forward integrity for secure audit logs," Department of Computer Science and Engineering, University of California at San Diego, Tech. Rep., Nov.
[6] B. Schneier and J. Kelsey, "Cryptographic support for secure logs on untrusted machines," in Proceedings of the 7th conference on USENIX Security Symposium, vol. 7. Berkeley, CA, USA: USENIX Association, Jan. 1998.
[7] J. E. Holt, "Logcrypt: forward security and public verification for secure audit logs," in ACSW Frontiers, ser. CRPIT, R. Buyya, T. Ma, R. Safavi-Naini, C. Steketee, and W. Susilo, Eds., vol. 54. Australian Computer Society, Jan.
[8] D. Ma and G. Tsudik, "A new approach to secure logging," in ACM Transactions on Storage, vol. 5, no. 1. New York, NY, USA: ACM, Mar. 2009, pp. 2:1-2:21.
[9] A. A. Yavuz and P. Ning, "BAF: an efficient publicly verifiable secure audit logging scheme for distributed systems," in ACSAC, 2009.
[10] H. Krawczyk, "The order of encryption and authentication for protecting communications (or: How secure is SSL?)," in Advances in Cryptology - CRYPTO 2001, ser. Lecture Notes in Computer Science, J. Kilian, Ed. Springer Berlin Heidelberg, 2001, vol. 2139.
[11] C. S. Jutla, "Encryption modes with almost free message integrity," in Advances in Cryptology - EUROCRYPT 2001, ser. Lecture Notes in Computer Science, B. Pfitzmann, Ed. Springer Berlin Heidelberg, 2001, vol. 2045.

Design and Application of a Secure and Flexible Server-Based Mobile eID and e-Signature Solution

Christof Rath, Simon Roth, Manuel Schallar, and Thomas Zefferer
Institute for Applied Information Processing and Communications
Graz University of Technology
Graz, Austria
{first name}.{last

Abstract: Electronic identities (eID) and electronic signatures are basic concepts of various applications and services from security-critical domains including e-government, e-business, and e-commerce. During the past years, server-based approaches have increasingly been followed to implement these concepts. Unfortunately, existing server-based eID and electronic-signature solutions are usually tailored to a specific use case or deployment scenario. This makes deploying these solutions in arbitrary application scenarios difficult. To overcome this issue, we propose a flexible server-based eID and electronic-signature solution that can be easily deployed in arbitrary application scenarios while still providing a sufficient level of security and usability. The feasibility of the proposed solution is demonstrated by means of a concrete implementation. Furthermore, the claimed flexibility of the developed solution is shown by integrating it into a production web-based time-tracking application. Its successful deployment and integration show that the proposed solution provides a secure and flexible alternative to existing eID and electronic-signature solutions and that it has the potential to improve the security of security-critical services and applications from arbitrary domains.

Keywords: e-government, e-business, eID, electronic identity, electronic signature, identity management, mobile security.

I. INTRODUCTION

With the rise of the digital society, remote identification of users has become an increasing challenge as a growing number of services have been moved to the Internet.
Design and development of concepts and solutions that provide remote identification of users have been a topic of interest for many years. We have recently contributed to this topic and have proposed and presented a server-based eID and electronic-signature solution that facilitates remote identification of users in arbitrary application scenarios [1]. In this article, we further delve into this topic and elaborate on our proposed solution.

In general, the need for reliable remote identification of users applies to public-sector applications (e-government) as well as to private-sector applications (e-commerce, e-business). Remote identification is usually achieved by means of a unique eID assigned to the user. An eID can for instance be a unique number, user name, or address. During authentication, the claimed identity (eID) is proven by the user.

Reliance on secret passwords is still the most popular and most frequently used authentication approach for online services. However, password-based authentication schemes have turned out to be insecure due to their vulnerability to phishing attacks and their poor usability, which often leads to the use of weak passwords that are easy to guess or easy to break [2][3].

Transactional online services from the e-government domain and related fields of application typically require reliable remote identification and authentication of users. Given the obvious drawbacks of password-based eID and authentication schemes in terms of security, two-factor authentication schemes have been developed for applications with high security requirements such as transactional e-government services. Current two-factor authentication schemes typically comprise the authentication factors possession and knowledge. Popular examples of two-factor authentication schemes are smart card based solutions.
During the authentication process, the user proves to be in possession of the eID token (i.e., the smart card) and proves knowledge of a secret PIN (personal identification number) that is specific to this eID token and that protects access to the token and to eID data stored on it.

In most cases, smart cards additionally enable users to create electronic signatures (e-signatures). For this purpose, the smart card additionally stores a secret signing key and features hardware-based signature-creation capabilities. Access to the signing key and to the smart card's signature-creation functionality is again protected by means of two-factor authentication.

Smart cards are an ideal technological choice to combine the concepts of eID and e-signature, as they are capable of implementing both eID and e-signature functionality. Thus, they are frequently used in security-critical fields of application such as e-business, e-banking, or e-government. For instance, various transactional e-government services that have been launched in Europe during the past years require users to authenticate themselves remotely with a personalized smart card and to complete online transactions by applying an electronic signature with the same card [4]. Unfortunately, smart card based solutions usually lack an appropriate level of usability, as they require users to obtain, install, and use an appropriate card-reading device in combination with the associated software [5].

Powered by the recent emergence of mobile communication technologies and motivated by the low user acceptance of smart card based eID and e-signature solutions, several mobile eID and e-signature solutions have been developed during the past years [6]. These solutions render the use of smart cards unnecessary, as they cover the authentication factor possession by means of the user's mobile phone. This way, mobile eID and e-signature solutions have the potential to significantly improve usability while maintaining a level of security comparable to smart card based solutions. This is supported by the fact that, e.g., in Austria, qualified signatures can be issued both with smart cards and with mobile eID and e-signature solutions.

Due to their improved usability compared to smart card based authentication schemes [5], mobile eID and e-signature solutions are in principle also suitable for use cases with lower security requirements.

Unfortunately, existing mobile eID and e-signature solutions are usually tailored to the requirements of specific use cases and fields of application. This applies to most mobile eID and e-signature solutions that have been introduced and launched worldwide during the past years. Due to their limitation to specific use cases, these solutions can hardly be used in different fields of application. As a result, most applications cannot benefit from the enhanced security and usability of existing mobile eID and e-signature solutions.

To overcome this problem, we propose a modular and flexible concept for mobile eID and e-signature solutions. The main idea behind the design of the proposed concept was to achieve a flexible solution and to maintain its compatibility with different use cases and application scenarios. Details of the proposed concept are presented in this article. In Section II, we start with a brief survey of existing mobile eID and e-signature solutions and discuss their strengths and limitations. We then derive requirements of a mobile eID and e-signature solution that is applicable in arbitrary application scenarios in Section III. In Section IV, we introduce a technology-agnostic architecture for a mobile eID and e-signature solution that meets all predefined requirements.
Based on the proposed architecture, we model three technology-agnostic processes that cover the required functionality in Section V. The practical applicability and feasibility of the proposed solution are assessed in Section VI by means of a concrete implementation. The compatibility of this implementation with existing security-critical applications is evaluated in Section VII. Finally, conclusions are drawn in Section VIII.

II. RELATED WORK

The reliable remote identification and authentication of users by means of two-factor based approaches has been a topic of interest for several years. For many years, smart cards have been the preferred technology to implement two-factor based authentication schemes. Thus, smart card based solutions have been introduced in several security-sensitive fields of application during the past decades. Especially in Europe, various countries, such as Austria [7], Estonia [8], Belgium [9], or Spain [10], have issued personalized smart cards to their citizens in order to reliably identify and authenticate them during transactional e-government procedures [4].

In most cases, smart cards do not only provide eID functionality but also enable users to create electronic signatures. This is of special importance in Europe, where electronic signatures can be legally equivalent to handwritten signatures according to the EU Directive 1999/93/EC [11]. The importance of electronic signatures is further strengthened by the EU Regulation on electronic identification and trusted services for electronic transactions in the internal market [12], which will soon replace EU Directive 1999/93/EC.

While smart cards work fine from a functional point of view, their usability is usually rather poor. This poor usability is mainly caused by the need for a card-reading device to physically connect the smart card to the user's computer.
The need for additional drivers and software to communicate with the smart card and to integrate its functionality into security-critical applications also decreases the usability of smart-card technology in general and of smart card based eID and e-signature solutions in particular. This has for instance been shown by Zefferer et al. [5], who have set up a thinking-aloud test with 20 test users to determine and compare the usability of different approaches to provide eID and e-signature functionality. The conducted usability test has shown that users clearly prefer solutions that do not require smart cards and card-reading devices.

To overcome usability limitations of smart card based solutions, several mobile two-factor based eID and e-signature solutions have been developed during the past years. Surveys of mobile eID and e-signature solutions have for instance been provided by Ruiz-Martinez et al. [6] and Pisko [13]. All these solutions have in common that the factor possession is not covered by a smart card but by the user's mobile phone. All mobile eID and e-signature solutions that comply with demanding legal requirements, such as those defined by the EU Signature Directive, include some kind of secure hardware element, which is able to securely store eID data and to carry out cryptographic operations. Depending on the realization and location of this secure hardware element, mobile eID and e-signature solutions can basically be divided into the following two categories:

1) SIM-based solutions: Solutions belonging to this category make use of the mobile phone's SIM (subscriber identity module) to securely store eID data and to carry out cryptographic operations. In most cases, the use of a special SIM is required, as off-the-shelf SIMs do not feature the required cryptographic operations. Access to eID data stored on the SIM and to cryptographic functionality provided by the SIM is typically protected by a secret PIN that is only known to the legitimate user.
This way, SIM-based solutions rely on two different authentication factors. The PIN covers the factor knowledge of the two-factor based authentication scheme. The factor possession is covered by the SIM itself, which is under physical control of the user. With regard to security, all SIM-based solutions share one conceptual drawback. As required cryptographic operations such as the creation of electronic signatures are carried out on the mobile end-user device, these operations and all the data that is processed by these operations are potentially prone to malware residing on this device. This is especially an issue on current popular smartphone platforms such as Android, which are known to be vulnerable to malware [14].

2) Server-based solutions: Server-based mobile eID and e-signature solutions implement the secure hardware element centrally, e.g., in a hardware security module (HSM) at the service provider. Such a solution has been proposed by Orthacker et al. [15]. The user's mobile phone neither implements cryptographic functionality nor stores eID data. However, the mobile phone is an integral component of the authentication process that is mandatory in order to gain access to centrally stored eID data and to create electronic signatures. Server-based solutions rely on two authentication factors. During signature-creation processes, the user needs to provide a secret password first. This password covers the authentication factor knowledge. Covering the authentication factor possession is more challenging. As the server-based secure hardware element is not under physical control of the user, this element cannot cover the authentication factor possession. This factor is again covered by the user's mobile phone, concretely by the user's SIM. To complete the authentication process, a one-time password is sent to the user's mobile device via SMS. This one-time password has to be returned by the user. This way, the user proves possession of the SIM, as the one-time password can only be received if the user has control over the SIM. With regard to security, server-based approaches are conceptually advantageous, as they do not require critical data to be present on potentially insecure and compromised mobile end-user devices. The weakest point of the server-based signature solution presented by Orthacker et al. [15] is probably the SMS-based user-authentication step, as SMS messages must no longer be assumed to be secure on certain smartphone platforms [14].

For both above-mentioned categories, concrete mobile eID and e-signature solutions have been developed and rolled out on a large scale. For instance, SIM-based mobile eID and e-signature solutions have been put into productive operation in Estonia [16] and Norway [17]. A server-based mobile eID and e-signature solution has been in productive operation in Austria since 2009 [18]. Most existing solutions are tailored to a specific legal framework (e.g., national laws) or to a certain identity system (e.g., to a specific national eID system).
For instance, the Austrian mobile eID and e-signature solution has been purpose-built for the Austrian official eID infrastructure and is based on data structures, protocols, and registers that are specific to the Austrian use case. The Austrian eID infrastructure has been discussed by Stranacher et al. [19] in more detail. Deploying this purpose-built solution in other countries would require major adaptations and cause additional costs. Similar limitations apply to most mobile eID and e-signature solutions that have been put into productive operation so far. Their purpose-built nature renders the use of these solutions in different fields of application difficult and expensive. This prevents a broad roll-out of mobile eID and e-signature solutions and prevents applications in general from benefiting from their improved security and usability.

III. REQUIREMENTS

The conducted survey of existing mobile eID and e-signature solutions has identified a lack of dynamically adaptable solutions that can easily be applied to arbitrary use cases. To tackle this issue, we propose a mobile eID and e-signature solution that can easily be used in arbitrary application scenarios. We have designed the proposed solution according to a set of requirements. These requirements have been extracted from an analysis of existing solutions and from published evaluations of these solutions such as the one presented in [5]. The derived requirements (R1-R5) are discussed in more detail in the following.

R1: Flexibility regarding external components: Mobile eID and e-signature solutions typically rely on external parties and components. Common examples of such components are certification authorities (CA), which bind a user's identity to her signing key, or identity databases (e.g., official person registers or company databases), which are required to derive eIDs for users.
A generic mobile eID and e-signature solution must not be limited to certain external components but must provide flexible means to integrate different external components (e.g., different CAs).

R2: Avoidance of token roll-outs: Long-term experience with smart card based solutions has shown that the roll-out of eID and e-signature tokens (e.g., smart cards, SIMs) causes additional (financial) effort and hence reduces user acceptance. Avoiding the need for roll-outs of such tokens is hence a key requirement for usable mobile eID and e-signature solutions.

R3: Usability: The often disappointing user acceptance of smart card based solutions shows that usability is an important success factor of eID and e-signature solutions. For mobile eID and e-signature solutions, the following aspects need to be considered in particular in order to achieve an appropriate level of usability:

R3a: Avoidance of installations: Usable solutions must not require the user to obtain, install, and maintain additional hardware or software, as this causes additional effort.

R3b: Platform and device independence: Usable solutions must not be restricted to certain computing platforms, operating systems, or end-user devices, as users want to access services everywhere and at any time irrespective of their current execution environment.

R3c: Location independence: Usable mobile eID and e-signature solutions must not be bound to a certain mobile network but must also be accessible when roaming in foreign networks.

R4: Security: Security is an important requirement, as mobile eID and e-signature solutions are mainly applied in security-sensitive fields of application such as e-government or e-commerce. Hence, mobile solutions must assure a level of security comparable to other two-factor based eID and e-signature solutions and must be able to comply with given legal requirements such as the EU Signature Directive [11].
R5: Easy and flexible deployment and operation: From the service operator's point of view, mobile signature solutions should support an easy and flexible deployment as well as an efficient operation, in order to save installation, set-up, and operation costs.

Based on these requirements, we propose a generic and adaptable mobile eID and e-signature solution, which removes limitations of existing solutions. We introduce and discuss the concept of our solution in the next sections before providing details on its implementation in Section VI.

IV. ARCHITECTURE

Mobile eID and e-signature solutions follow either a SIM-based or a server-based approach to store eID data and to create electronic signatures. Other approaches would be possible on smartphones but cannot be applied on standard mobile phones due to their limited capabilities. Considering the requirements defined in Section III, we have decided to follow a server-based approach for our solution. This means that a central HSM is responsible for protecting all eID data as well as for computing electronic signatures.

Since solutions based on server-side signatures have very limited hardware requirements on the user side, they are comparatively cheap, user-friendly, and flexible in their deployment, as no roll-out of tokens is required. This way, Requirement R2 and Requirement R5, which demand avoidance of token roll-outs and an easy and flexible deployment and operation, are fulfilled.

Furthermore, server-based approaches require no up-front investments in dedicated SIM cards and impose no requirements on the mobile network operators (MNOs); hence, the targeted user group is not limited to a single MNO or to certain MNOs. This reduces barriers and enhances usability. Advantages of server-based signature-creation approaches in terms of usability and user acceptance have also been discussed by Zefferer et al. [5]. Thus, reliance on a server-based approach assures that Requirement R3, which demands a sufficient level of usability, is met.

A theoretical concept of a server-based mobile signature solution and an approach to store users' private keys in a secure manner on a remote server have been proposed by Orthacker et al. [15]. The proposed solution fulfills the requirements of qualified electronic signatures as defined by EU Directive 1999/93/EC [11], which emphasizes the suitability of this concept for security-critical application scenarios. Furthermore, a server-based mobile eID and e-signature solution that is compliant with the EU Directive 1999/93/EC has been in productive operation in Austria for several years. This provides evidence that server-based solutions are capable of achieving a sufficient level of security and hence of meeting Requirement R4.
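The two authentication factors of server-based approaches described in Section II (a secret password for knowledge, a one-time password sent via SMS for possession) can be sketched as follows. This is only an illustration under stated assumptions: the article does not specify password hashing, OTP length, or function names, so PBKDF2, six digits, and the names `hash_password`, `new_otp`, and `authenticate` are hypothetical choices; SMS delivery and OTP expiry are left out.

```python
import hashlib
import hmac
import secrets


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash for storage (PBKDF2 is an assumption of this sketch)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def new_otp(digits: int = 6) -> str:
    """Create a random numeric one-time password to be sent to the user's phone."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))


def authenticate(password: str, salt: bytes, stored_hash: bytes,
                 sent_otp: str, returned_otp: str) -> bool:
    """Accept only if both the knowledge and the possession factor check out."""
    knowledge_ok = hmac.compare_digest(hash_password(password, salt), stored_hash)
    possession_ok = hmac.compare_digest(sent_otp.encode(), returned_otp.encode())
    return knowledge_ok and possession_ok
```

Both comparisons use a constant-time function so that response timing does not reveal how much of a guess was correct.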
At a high level, our solution defines three processes: registration, activation, and usage. These processes have different properties regarding computational effort and security constraints. During registration, which is mainly a matter of legal and organizational requirements, the identity of the user is verified. Usually, it is sufficient to perform the registration only once per user. During activation, a new eID including a signing key and a certificate is created for a registered user. Activation is required once per life span of an eID. In the usage process, created eIDs and signing keys are used by the user for authentication purposes and to create electronic signatures. Details of the three processes will be provided in the following section.

The architecture of our mobile eID and e-signature solution reflects the three processes defined above. This is illustrated in Figure 1. The entire architecture is split into an inner part and an outer part. Components implementing functionality of the activation and the usage processes are executed within these two parts. As shown in Figure 1, each part has its own database to store required internal data. This way, the architecture is mainly composed of two databases and the four core components Activation Outer, Activation Inner, Usage Outer, and Usage Inner as well as a central HSM as inner component. The split between inner and outer components is a security feature, as it reduces the impact of a data loss in case a service connected to the outside world gets compromised. Communication between outer and inner components happens via a limited, pre-defined set of commands over an encrypted channel. The separation of the core components allows for a very flexible deployment where, e.g., the activation parts can run on different machines, on a different network or, if the business process allows or demands it, without remote access at all.
Additionally, access rights can be granted more restrictively, as only the activation process requires write access to many fields in the databases. At the same time, it is also possible to deploy the complete service on a single machine, if this is the preferred deployment scenario. By defining separate components to cover the proposed solution's functionality, the chosen architecture meets Requirement R4 and Requirement R5, which demand a sufficient level of security as well as an easy and flexible deployment and operation.

Figure 1: Overview of Core Components. (Elements shown: Internet, Certification Authority, OTP Gateway, SIR Web Service, Activation Outer, Shadow Database, Usage Outer, Person Register, Activation Inner, Private Database, Usage Inner, Person Records, HSM.)

In addition to the four core components, the two databases and the HSM, the proposed architecture defines two internal and two external components. The external component OTP Gateway, which stands for one-time password gateway, is required during the registration, activation and usage processes to send OTPs or activation codes to users. The internal component SIR Web Service is necessary for receiving so-called Standard Identification Records (SIRs). These records enable offline registrations, which will be discussed in detail in the next section. The components Person Register and Certification Authority (CA) are required during the activation process. While the CA is an external component, the Person Register is an internal component, which usually connects to an external database. The purpose of these components will also be discussed in detail in the next section. By clearly separating these components from the core components of the proposed solution, Requirement R1, which demands flexibility regarding external components, is already fulfilled on the architectural level.
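The restriction of inner/outer communication to a limited, pre-defined set of commands can be illustrated with a minimal sketch. The command names, the class names, and the dictionary-based replies below are hypothetical and not taken from the described solution; the encrypted channel and the HSM are omitted.

```python
from enum import Enum


class Command(Enum):
    """Hypothetical commands an outer component may send inward."""
    CREATE_EID = "create_eid"
    CREATE_SIGNATURE = "create_signature"


class InnerComponent:
    """Dispatches only commands from the fixed set; everything else is refused."""

    def __init__(self):
        self._handlers = {
            Command.CREATE_EID: self._create_eid,
            Command.CREATE_SIGNATURE: self._create_signature,
        }

    def handle(self, name, payload):
        try:
            command = Command(name)  # raises ValueError for unknown names
        except ValueError:
            raise PermissionError(f"command not allowed: {name!r}") from None
        return self._handlers[command](payload)

    def _create_eid(self, payload):
        # Placeholder: a real inner component would involve the HSM here.
        return {"command": "create_eid", "status": "accepted"}

    def _create_signature(self, payload):
        # Placeholder for the signature-creation path.
        return {"command": "create_signature", "status": "accepted"}
```

A request such as `InnerComponent().handle("dump_database", {})` is rejected with a `PermissionError`, which mirrors the idea that a compromised outer service can only trigger the few operations the inner part exposes.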
The three processes, which build up our solution and cover its functionality, as well as all involved components, are described in detail in the following section.

V. PROCESSES

The entire functionality of the proposed technology-agnostic mobile eid and e-signature solution is covered by the processes registration, activation and usage. The purpose of these processes is discussed in more detail in the following subsections.

A. Registration Process

During the registration process, data necessary to unambiguously identify a user is collected. Each user has to

run the registration process at least once before being able to use the proposed solution. To complete the registration process, the user has to prove her identity, for example by means of a passport or an existing eid. In order to allow for a flexible setup of the registration process and to cover a broad range of legal and organizational requirements, the registration process has been designed to support different types of registration. These types of registration cover use cases from the e-government domain as well as use cases from related private-sector domains such as e-commerce or e-business. Furthermore, the proposed architecture is flexible enough to allow for an easy integration of further alternative registration types, in case they are required by the given use case. So far, the following four types of registration have been defined.

Registration via registration officer: The identity of the user is verified face-to-face by a registration officer (RO) using official IDs, e.g., a passport or a driving license. After the verification of the user's identity, the RO manually registers the user in the proposed solution by filling in the registration form with user-specific data.

Offline registration: This registration type takes place in an asynchronous way. A user-data form has to be filled in by an RO after identifying the user as in the registration type sketched above. After a validity check, the collected data has to be signed by the RO. The signed data is transmitted to the proposed solution, and an activation code linked to the data is generated and passed to the user. The registration can be completed by the user at a later date using the issued activation code.

Self registration: Self registration is carried out by the user herself with the help of an existing eid. While self-registration is common practice at online platforms, our solution relies on existing qualified eids for this purpose.
The system verifies the user's identity by means of the provided eid and enables her to complete the registration afterwards on her own. An RO is not required for this type of registration, as the verification of the identity must have happened before, during the activation of the existing eid.

Registration via trusted organization: Many organizations are legally required to identify their customers. Examples are banks or universities. If a trust relationship with these organizations is established, existing identification data from these organizations can be used to register new users.

Figure 2 illustrates the general registration process of the proposed solution. The basic goal of the registration process is the creation of a Standard Identification Record (SIR) for a specific user. The SIR can be created using the four registration types sketched above. Irrespective of the applied registration type, a SIR is created which unambiguously identifies a user and provides this user basic access to the proposed solution. Support of different types of registration allows for a very flexible setup of the registration process and covers a broad range of legal and organizational requirements regarding the registration process. This, in turn, contributes to a flexible operation of the proposed solution. This way, the proposed solution fulfills Requirement R5.

Figure 2: Registration.

B. Activation Process

After successful registration, users can run the activation process to create a new eid. For this purpose, the user needs to log in to the proposed solution. This is only possible if a valid SIR is available for the user (i.e., if the user has successfully completed the registration process) or if the user has already created an eid during an earlier activation process. In the former case, the user is unambiguously identified by means of the SIR. In the latter case, the user can log in using the already created eid.
After a successful log-in, the user can create a new eid. For this purpose, the user is asked to fill in the activation form. In general, the proposed solution supports multiple eids for each user. Therefore, the activation process can be run multiple times by each user. Each created eid can be managed separately. This enables users to have eids for different purposes, e.g., private and official affairs. During each activation process, a new cryptographic key pair is created for the user. This key pair can be used for subsequent signature-creation processes. Additionally, a certificate is issued to bind the user's identity to the created key. For each created eid, specific authentication data need to be defined by the user, including a secret signature password (which will be verified against a regular expression pattern defined by the system administrator) and a mobile-phone number. The user has to prove possession of the specified mobile phone. This is achieved by means of OTPs that are sent to the user through an OTP Gateway.

Figure 3: Activation.

In addition to the created key pair and the defined authentication data, identity-related information, like full name and birth date, is also assigned to the newly created eid. This information is obtained from the Person Register. The Person Register is a component that connects to an external database containing potential users of the service. Depending on the deployment and application scenario, this can be an existing official database like a central register of residence maintained by a public authority, an existing domain-specific database like the database of employees of a private-sector company, or a database specifically operated for this service that grows with every new registration. After fetching required identity-related information from the relevant database, the Person Register

returns a signed data structure that contains the unique eid of the applicant and also the public key of the created signature key pair. This way, it is possible to link a signature to a person for identification purposes without the need to embed the unique eid directly in the signing certificate. By clearly separating eid functionality from e-signature functionality, the users' privacy is assured. A similar concept has already been successfully applied in existing national eid solutions [20]. After completion of the activation process, a new eid has been created. This eid comprises signed identity-related information and a key pair (and certificate) for the creation of electronic signatures. Additionally, a secret password has been chosen and a mobile-phone number has been registered for the created eid, which are required during subsequent usage processes. The basic principle behind the activation process is illustrated in Figure 3.

C. Usage Process

After the successful completion of the activation process, the user can use the created eid to securely and conveniently authenticate at services and to create electronic signatures. To create an electronic signature or to authenticate, the user has to enter her phone number and signature password. If the data provided by the user can be verified, an OTP is sent to her mobile phone in order to verify possession of the mobile phone. If the user can prove possession of the mobile phone by entering the OTP, the requested signature creation is performed. The main concept behind the usage process is shown in Figure 4.

Figure 4: Usage.

VI. IMPLEMENTATION

Based on the proposed architecture and the defined processes, we have implemented a prototype to evaluate and demonstrate the applicability of our solution. This prototype has been named ServerBKU. The ServerBKU represents a server-based eid and e-signature solution.
In the following subsections, we elaborate on the technologies used to realize the ServerBKU and discuss in detail the implementation of the ServerBKU's main processes.

A. Choice of Technologies

We have built our prototype implementation on a set of well-known and production-ready Java-based frameworks and libraries. This way, an efficient development process has been achieved and the probability of implementation errors has been minimized. Furthermore, reliance on appropriate frameworks assures that the prototype meets the requirements defined in Section III. Employed development frameworks, used libraries, and their underlying technologies are briefly introduced in this section. The foundation of all implemented modules is the Spring Framework [21], which supports the development of modular and flexible software solutions. The underlying approach of the Spring framework that enables a flexible design is called dependency injection (DI). Following this approach, the dependencies of the various components are wired, i.e., injected, during runtime by the so-called inversion of control (IoC) container, a core component of the Spring framework. During development, concrete functionality, e.g., the OTP gateway, is abstracted by interfaces or base classes. Concrete implementations of the abstracted functionality are selected by configuring the IoC container. In the case of the OTP gateway, for instance, a special SMS-gateway implementation has been selected to implement the functionality of the OTP gateway interface. The flexible and easy selection of concrete implementations for abstract functionality enables a loose coupling of modules and allows software to be tailored to the specific needs of the use case at hand. The loose coupling of modules also facilitates test-driven development, as a single component can easily be tested without many dependencies. The concepts of DI and IoC are actually not unique to the Spring framework.
The Spring framework just implemented these concepts from the very beginning in 2002 in order to enable the development of flexible software. Apart from the concepts DI and IoC, the Spring framework also provides templating mechanisms for various common tasks. This minimizes the amount of boilerplate code, which in turn reduces the chances of copy-and-paste errors and keeps the source code slim and readable. A prominent example of templates provided by the Spring framework is the Hibernate template. Hibernate [22] is an object-relational mapping (ORM) library, i.e., entries of relational databases are mapped to Java objects and vice versa. This way, most of the specifics of an underlying database can be abstracted by Hibernate. This allows the database to be adapted to the needs of certain deployment scenarios. By relying on the Spring framework, our prototype implementation, i.e., the ServerBKU, achieves a sufficient level of flexibility as demanded by the requirements defined in Section III. In order to further improve the flexibility of the ServerBKU, a suitable proxy mechanism has been selected. This mechanism enables data exchange between different modules of the ServerBKU. For this purpose, the Java messaging service (JMS) API has been the technology of choice. Using this technology, the actual instance of an interface may run transparently on a different host. This way, it is possible to run the complete stack on a single machine or distribute the components over several servers. Apache ActiveMQ [23] has been chosen as implementation of the JMS API. Apache ActiveMQ supports redundancy and load-balancing mechanisms out of the box. Furthermore, all exchanged messages can be protected via TLS-secured channels. For this purpose, the IAIK isasilk library has been used, which provides an extensible and highly configurable implementation of SSL 2.0 and 3.0 and TLS 1.0 and 1.1.
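The dependency-injection idea described in this subsection for the OTP gateway can be sketched in a few lines. The sketch is written in Python with plain constructor injection rather than with Spring's IoC container, and all class and method names are hypothetical; the stand-in gateway only records messages instead of sending an SMS.

```python
from typing import Protocol


class OtpGateway(Protocol):
    """Abstract functionality: deliver a one-time password to a phone."""

    def send(self, phone_number: str, code: str) -> None: ...


class RecordingSmsGateway:
    """Stand-in for an SMS-based implementation; records instead of sending."""

    def __init__(self):
        self.sent = []

    def send(self, phone_number, code):
        self.sent.append((phone_number, code))


class ActivationService:
    """Depends only on the OtpGateway abstraction, not on a concrete class."""

    def __init__(self, otp_gateway: OtpGateway):
        self._otp_gateway = otp_gateway

    def challenge(self, phone_number, code):
        self._otp_gateway.send(phone_number, code)


# "Wiring": the concrete implementation is chosen in one place only.
gateway = RecordingSmsGateway()
service = ActivationService(gateway)
service.challenge("+43000000000", "123456")
```

Because `ActivationService` never names a concrete gateway, a test can inject the recording stand-in shown here while a deployment would inject a real SMS gateway, which is the loose coupling the text attributes to DI.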
Libraries provided by IAIK [24] have also been employed to implement required cryptographic operations. Concretely, the IAIK provider for the Java Cryptography Extension (IAIK JCE) has been used to implement relevant functionality. Furthermore, the ServerBKU relies on the IAIK ECCelerate library to implement functions related to elliptic curve cryptography. To access the hardware security module (HSM), the ServerBKU uses the IAIK PKCS#11 Provider and Wrapper. The wrapper provides the Java Native Interface (JNI) to the hardware-dependent PKCS#11 library, while the PKCS#11 Provider implements a JCE provider for a specific hardware

module. Table I gives an overview of the technologies used. Besides appropriate development frameworks and cryptographic libraries, a suitable technology to implement the required OTP gateway has also been selected. Concretely, the ServerBKU uses transaction numbers (TANs), which are generated randomly and delivered via an SMS gateway, to cover the functionality of OTPs and the OTP gateway. The employed SMS gateway operator provides a proprietary interface that enables the delivery of SMS messages via HTTP POST. To assure the security of the ServerBKU, appropriate technologies have been chosen to assess our implementation by means of systematic security analyses. To follow an approved approach, the ServerBKU has been evaluated with regard to the most recent critical risks according to OWASP [25]. Risks and flaws that OWASP proposes to investigate include, for example, different types of code injection, Cross Site Scripting (XSS), and Cross Site Request Forgery (CSRF). Analyses have been carried out using a white-box testing approach, as this method reveals most implementation errors. Several tools exist that facilitate such tests. Examples are Burp Suite [26] and several useful browser plugins that, for instance, allow the editing of cookies. The white-box approach gives the auditor knowledge of the internal structure of the project, such as the libraries and frameworks in use, as well as access to the source code. This way, the ServerBKU has been systematically and reliably assessed in terms of security.

B. Realization of Processes

Based on selected technologies, development frameworks, and libraries, the three processes defined in Section V have been implemented. The implementation of these processes is described in detail in the following subsections.

1) Registration Process: In this step, the user has to prove her identity.
Our implementation supports all types of registration defined in Section V. In a traditional setup, registration happens at the office of the RO. For this scenario, our implementation provides a web-based UI, through which the RO can register the user in the system by entering user data after identity verification. This UI is shown in Figure 5. However, in some situations it might be beneficial for the RO to travel from user to user. This requires means to carry out asynchronous offline registration, as access to the ServerBKU is potentially not available at the user's place. To support this type of registration, the ServerBKU supports registration of users via SIRs created offline. A SIR contains information to identify a person, information about the ID used to verify the identity of the user, a binding towards a hardware token, i.e., a mobile phone for the use case at hand, and the electronic signature of an RO. Alternatively, a SIR can also be signed by a trusted partner, e.g., a bank or a university. This corresponds to the fourth type of registration listed in Section V. SIRs can be created from data entered by the RO (or trusted partner) or by the user using additional software. During the creation of the SIR, an activation code is generated and delivered to the user, cf. Figure 2. Created SIRs must be sent to the ServerBKU's external SIR web service component via SOAP. The SIR web service verifies the validity of a provided SIR by means of its electronic signature. If this verification succeeds, the SIR is stored in the user database of the ServerBKU. The user can use the stored SIR together with the activation code at a later date to start the activation process. The ServerBKU supports different front-ends that enable this type of registration. Initially, we developed a simple, yet comprehensive, stand-alone application based on Spring MVC.
This application can be used on mobile devices in case of traveling ROs and supports the RO in creating SIRs (Figure 5). Furthermore, we developed an interface component that enables a traveling RO to take a picture of the ID of the user to be registered. The required data is extracted from this picture using optical character recognition (OCR). From these data, the required SIR is finally created. By supporting different means to create SIRs offline, the ServerBKU facilitates the offline registration of users.

Figure 5: Offline Registration.

To cover the last registration type, the ServerBKU provides a UI for the user. This UI is similar to the one developed for registrations via ROs. It allows the user to carry out a self-registration if she already has a trusted eid, e.g., a smart card. As the user is identified and authenticated by means of this eid, no RO is necessary to complete the registration process.

2) Activation Process: In this process, the user creates and activates a new mobile eid. The activation process again offers a web-based interface. It has been developed using JavaServer Faces (JSF) 2.1 [27] and Primefaces [28] for the front-end. The decision to use a different technology to create the UI is based on the rich set of UI components that is part of Primefaces. This facilitates development of a flexible, easy to

use, role/permission-based interface in a short amount of time. If the registration was performed in the classical way including an RO or as self-registration, the activation process starts automatically after registration. A pre-registered user can start the activation at any time and independently of the registration process by submitting the received activation code and her telephone number to a specific URL. This way, available user data is automatically pre-populated as far as possible in the provided activation form by extracting the corresponding data from the SIR retrieved from the database. Additionally, required data such as a signature password and a revocation password have to be entered by the user into the activation form. As users may activate an arbitrary number of mobile eids for each phone number, activated eids have to be distinguishable by the system. This is achieved by the SHA-512 hash of the phone number and the signature password that have been selected by the user. Consequently, passwords have to be unique per telephone number. As the phone number is usually constant, a unique password has to be chosen for each eid. To verify the user's phone number and the possession of the device, a random OTP is generated, sent to the user's phone, and queried at the web interface. If the user enters a wrong OTP too often or if the code has expired, the activation process is aborted. The length and appearance (e.g., numeric, alphanumeric, etc.) of this OTP as well as the number of trials and the time of validity are configurable. The user also has the possibility to resend the OTP a configured number of times in case the message gets lost on its way. After the user has proven possession of the mobile phone, a signing key pair for the user is created in the HSM. The private key is then wrapped by and exported from the HSM and stored securely, i.e., encrypted, in the ServerBKU's database. The encryption scheme is described later in this subsection.
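The way several eids under one phone number can be told apart by a SHA-512 hash may be sketched as follows. The text does not specify how phone number and signature password are combined before hashing; the UTF-8 encoding and the zero-byte separator used here are assumptions made only for illustration, as is the function name.

```python
import hashlib


def eid_lookup_key(phone_number: str, signature_password: str) -> str:
    """Return a hex SHA-512 digest that identifies one eid.

    Assumption: both values are UTF-8 encoded and joined with a zero byte
    so that ("12", "3x") and ("1", "23x") cannot produce the same input.
    """
    data = phone_number.encode("utf-8") + b"\x00" + signature_password.encode("utf-8")
    return hashlib.sha512(data).hexdigest()


private_eid = eid_lookup_key("+43000000000", "password-for-private-use")
official_eid = eid_lookup_key("+43000000000", "password-for-official-use")
```

The two keys differ although the phone number is identical, which is why the text requires the signature password to be unique per telephone number.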
Additionally, a certificate signing request (CSR) is generated. The public key is extracted from the CSR and sent to the Person Register together with additional data such as name and date of birth, which are required to identify the user in the Person Register. The Person Register returns a signed data structure that contains the unique eid of the user and the public key of the created signature key pair. The returned signed data structure is stored, again in encrypted form, in the ServerBKU's database. Subsequently, an end-user certificate is requested from the CA using the already created CSR. The obtained certificate is stored together with the private key and the created eid data in the database. The encryption of stored user data is based on a secret signature password, which the applicant chooses during the activation process. The ServerBKU relies on a hybrid encryption scheme as suggested by Orthacker et al. [15]. Here, the user has an additional encryption key pair ($K_{enc}^{pub}$ and $K_{enc}^{priv}$), which is generated alongside the signing key pair. The private key is then encrypted ($EK_{enc}^{priv}$) under the user's signature password ($PW_{sig}$) and stored in the database. This happens only once during the activation phase.

$$SK_{PW} = \mathrm{derive}(PW_{sig}) \tag{1a}$$
$$EK_{enc}^{priv} = \mathrm{encrypt}(K_{enc}^{priv}, SK_{PW}) \tag{1b}$$

To encrypt a plain message $M$, a random symmetric key $SK_{rand}$ is generated. This random secret key has to be encrypted for the user using her public encryption key (2b) and stored together with the cipher text (2a). However, this does not involve data from (1a) and (1b).

$$EM = \mathrm{encrypt}(M, SK_{rand}) \tag{2a}$$
$$ESK = \mathrm{encrypt}(SK_{rand}, K_{enc}^{pub}) \tag{2b}$$

This enables encryption of data on behalf of the user without knowledge of the user's signature password. The decryption, however, requires the consent of the user, which she gives by providing the signature password (3a).
$$SK_{PW} = \mathrm{derive}(PW_{sig}) \tag{3a}$$
$$K_{enc}^{priv} = \mathrm{decrypt}(EK_{enc}^{priv}, SK_{PW}) \tag{3b}$$
$$SK_{rand} = \mathrm{decrypt}(ESK, K_{enc}^{priv}) \tag{3c}$$
$$M = \mathrm{decrypt}(EM, SK_{rand}) \tag{3d}$$

After the generated certificate has been stored in the ServerBKU's database, the user gets a notification that the activation of the eid was successful. This finally completes the activation process. Apart from the actual activation process, the implemented user interfaces also provide additional functionality. For instance, an interface has been implemented for ROs to perform activations on behalf of someone else as a usability feature. Furthermore, an interface is provided for each user that facilitates the management of eids, both for the user and for a support team. This interface is shown in Figure 7. Finally, an administration UI has been developed that allows the definition and assignment of roles.

3) Usage Process: The usage process has been developed alongside the activation process and is therefore built on the same technologies, i.e., JSF [27] and Primefaces [28]. The interfaces are reduced to the bare minimum required for authenticating users and authorizing the creation of electronic signatures. This facilitates an easy integration of the ServerBKU into arbitrary third-party applications. The two forms that are used during the user-authentication process are shown in Figure 6.

Figure 6: Interface of the Usage Process. (a) Login. (b) TAN Verification.

The signature-creation process starts with the receipt of an appropriate HTTP POST request at the web interface provided by the ServerBKU. The system returns the form shown in Figure 6(a), where the signer has to provide her phone number and signature password. The signature password is used to decrypt a private key that is part of the hybrid encryption used to securely store user-related data.
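The password-dependent step of the hybrid scheme, i.e., the derive function in (1a) and (3a), can be sketched with the Python standard library. The text does not name the key-derivation function; PBKDF2-HMAC-SHA256, the per-user salt, the iteration count, and the 32-byte key length are assumptions chosen for illustration only.

```python
import hashlib
import secrets


def derive_key(signature_password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte symmetric key SK_PW from the signature password.

    Assumption: PBKDF2-HMAC-SHA256 stands in for the unspecified derive().
    """
    return hashlib.pbkdf2_hmac(
        "sha256", signature_password.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt would be generated once during activation and stored with the eid.
salt = secrets.token_bytes(16)
sk_pw_activation = derive_key("correct horse battery", salt)  # step (1a)
sk_pw_usage = derive_key("correct horse battery", salt)       # step (3a)
```

Both calls yield the same key, so the key that wrapped the private encryption key during activation can be recomputed during usage from the password alone; a wrong password yields a different key and the unwrapping in (3b) would fail.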
As a consequence, the user's signature key does not have to be decrypted before the two-factor authentication is complete, and the signature password does not have to be stored in a session.

Figure 7: Activation Management.

In order to prevent brute-force attacks, the account for a given phone number is locked for a configurable period of time if too many unsuccessful log-in attempts are detected. If the user authentication was successful, two random values are generated: the OTP, i.e., the TAN, and a reference value, which is displayed in the TAN verification form (Figure 6(b)) and in the SMS that is used to deliver the TAN. This way, a link is provided between the TAN and the current session. Next, the service sends the TAN via the OTP gateway to the user's mobile phone in order to verify its possession. After verifying the reference value received by SMS against the reference value displayed in the TAN verification form, the user enters the received TAN into the form. The form also provides a link to display the signature data. This enables the user to check the data to be signed prior to authorizing the signature creation. If the user has been successfully authenticated, the user data is read from the database and decrypted using the user's private key of the hybrid encryption scheme. Then, the still-wrapped private key of the signing key pair is loaded into the HSM, where it is unwrapped. Thus, the user's private signing key is never accessible in a usable form outside the HSM.
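The TAN step of the usage process can be sketched as follows. The class, its parameter defaults, and the SMS wording are hypothetical; the sketch only illustrates the described ingredients: a random TAN of configurable length and alphabet, a random reference value shown both in the form and in the SMS, a limited number of trials, and a limited validity period.

```python
import hmac
import secrets
import string
import time


class TanChallenge:
    """One TAN challenge bound to a session via a reference value."""

    def __init__(self, length=6, alphabet=string.digits, max_attempts=3,
                 validity_seconds=300, now=time.monotonic):
        self._now = now
        self.tan = "".join(secrets.choice(alphabet) for _ in range(length))
        self.reference = secrets.token_urlsafe(8)  # shown in form and SMS
        self._attempts_left = max_attempts
        self._expires_at = now() + validity_seconds

    def sms_text(self):
        # The user compares this reference value with the one in the form.
        return f"Reference value: {self.reference} TAN: {self.tan}"

    def verify(self, entered_tan):
        """Return True only for the correct TAN, in time, within the trials."""
        if self._attempts_left <= 0 or self._now() > self._expires_at:
            return False
        self._attempts_left -= 1
        ok = hmac.compare_digest(entered_tan.encode("utf-8"),
                                 self.tan.encode("utf-8"))
        if ok:
            self._attempts_left = 0  # a TAN authorizes a single signature
        return ok
```

Passing the clock as the `now` parameter is a design choice of this sketch that makes the expiry behaviour testable without waiting; a deployment would keep the default monotonic clock.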
Finally, the unwrapped key is used inside the HSM to create an electronic signature on behalf of the user. After successful completion of the signature-creation process, the unwrapped key is discarded and the created signature is returned to the requesting entity. VII. EVALUATION The ServerBKU shows that the server-based signature solution proposed in this article can be implemented in practice. To further evaluate its applicability in real-world use cases, we have deployed the ServerBKU in-house and linked it to an already existing application. We elaborate on this deployment and on the evaluation of the ServerBKU in this section. For this purpose, we first introduce the in-house application Timesheep, which has been used to evaluate the ServerBKU. We then show how the ServerBKU has been integrated into Timesheep to evaluate its applicability. A. Timesheep In the past, our organization used simple Excel sheets to record efforts, i.e., working hours, for employees and projects. Each employee had to fill out an Excel sheet with the efforts he spent on assigned projects. These Excel sheets were printed and had to be signed by the user by hand. After signing, the Excel sheets were forwarded to the group leaders. In the last step, responsible group leaders had to sign the Excel sheets themselves, in order to approve them. Signed Excel sheets were archived for project calculations. This process was cumbersome for several reasons including the following ones. Employees often forgot to fill out their Excel sheets in time and had to be reminded frequently.

Employees sometimes forgot to print and sign Excel sheets, which caused delays in project calculations.

Excel sheets had to be maintained and forwarded to group leaders manually, representing a potential source of error.

When out of office, group leaders were not able to sign Excel sheets, resulting in delays in project calculations.

To overcome these problems, our organization now uses a web-based time tracking tool called Timesheep. It tracks the efforts, i.e., working hours, for each employee and project. Timesheep runs on a virtual machine, which is only accessible from our internal network or through a Virtual Private Network (VPN) connection. Timesheep is built on technologies similar to those of the ServerBKU. It uses the Spring Framework [21] as a basis for modular and flexible development. Hibernate [22] is used to make the implementation independent from the underlying database. In addition, Spring Roo [29] has been used for fast prototyping. Spring Roo has been configured to use Spring Web MVC [30] as the web-rendering framework. To extend Spring Web MVC's tagx components, Prime UI [31], handsontable [32] and vis.js [33] have been used. Table I gives an overview of the technologies used.

TABLE I: Choices of Technologies Overview

Technology                          ServerBKU   Timesheep
Java                                X           X
Spring Framework                    X           X
Spring Roo                                      X
Spring Web MVC                                  X
handsontable                                    X
vis.js                                          X
Primefaces                          X
Prime UI                                        X
Hibernate                           X           X
Apache ActiveMQ                     X
IAIK isasilk                        X
IAIK JCE                            X           X
IAIK ECCelerate                     X
IAIK PKCS#11 Provider and Wrapper   X

Timesheep defines several roles within our organization, in order to model required functionality:

User: The first role is the role of normal users, i.e., employees. Users must have easy access to Timesheep to track their efforts. This is achieved by providing a simple web-based interface, which can be accessed by using any common web browser.

Group Leader: The second role is the role of group leaders.
Group leaders must be able to access tracked efforts of employees assigned to their group, check these efforts, and approve, i.e., sign, them. Furthermore, the group leaders must be able to plan projects based on the budget of a project and on the efforts assigned users can raise until the project deadline. Administrator: The third role is the role of the administrator. Timesheep was developed to minimize the efforts administrators have to do. New employees are automatically added to the system with the necessary information for Timesheep to work. This is achieved by linking Timesheep to our in-house domain log-in system. This approach has also been followed by the ServerBKU, in order to avoid double registrations and to achieve maximum comfort. Thus, employees were able to log-in to Timesheep and to the ServerBKU with their domain log-in name without an additional registration process. Financial Department: The fourth role is the role of the financial department. Every project has milestones and a defined end date. At every milestone and end date, the current costs of the project must be calculated and submitted. Every submission is double checked by external financial auditors. Therefore, the auditors need the efforts done by each user on each project. After the project calculations are accepted by the auditors, tracked efforts must be protected against subsequent changes. Based on these roles, Timesheep provides all required functionality to facilitate the tracking of efforts and the generation of time sheets. As all data are collected by one central web application and stored in one central database, required processes can be automated and delays caused by the manual tracking of efforts and processing of Excel sheets can be eliminated. Although Timesheep has significantly improved the tracking of efforts in our organization, room for improvement could still be identified. The main drawback of Timesheep was its reliance on handwritten signatures. 
Even though efforts are tracked in electronic form and stored centrally in a database, employees and group leaders still had to sign generated time sheets by hand. Even though means to sign documents electronically exist, these means were not integrated into Timesheep. To overcome this issue, we have integrated the ServerBKU into Timesheep. This way, Timesheep has been enhanced by means to electronically sign generated time sheets. Furthermore, this integration evaluates the practical applicability of the ServerBKU. The integration of the ServerBKU into Timesheep is discussed in the following section.
B. Combining Timesheep and ServerBKU
After all required efforts have been entered by the user and approved by the group leader, Timesheep creates a PDF-based time sheet called monthly timesheet. The monthly timesheet has to be signed by both the user and the responsible group leader. By integrating the ServerBKU, this signing process has been improved in terms of efficiency. For this purpose, we have integrated the ServerBKU's signature process seamlessly into Timesheep in order to achieve the best usability possible. Timesheep is organized into multiple modules as shown in Figure 8. Each module fulfills a specific purpose for our organization.
Time Tracking Module: This module offers a web-based interface for users to store their efforts.
Planning Module: This module provides group leaders with project planning and budget estimation based on users' contracts and employment status, and on project deadlines. Calculated values, e.g., hours, budgets, etc., are the so-called target values.
Monthly Timesheet Module: This module is responsible for generating and managing monthly timesheets. This module is also responsible for handling the signature process with the ServerBKU.
Financial Module: This module offers our financial department the possibility to check how much a project actually cost.
These values are the so-called actual values and may differ from the target values

defined by the Planning Module. Actual values will be submitted to external financial auditors.
Figure 8: Timesheep Modules.
Figure 8 shows that the monthly timesheet module is responsible for the generation and management of timesheets. Hence, this module needs to be enhanced, in order to integrate the ServerBKU into the timesheet management and signing process. The ServerBKU-based signature process of a timesheet is shown in Figure 9. This figure shows how the monthly timesheet module interacts with components of the ServerBKU to electronically sign generated time sheets.
Figure 9: Work Flow of the Signature Process.
In the initial Step (1), the user starts the signature-creation process by clicking a button in the browser. A new browser window opens, in which all further communication between the ServerBKU and the user takes place. In Step (2), the monthly timesheet to be signed is sent to the ServerBKU. In Step (3), the ServerBKU prepares a PDF Advanced Electronic Signature (PAdES) and displays an authentication form to the user as shown in Figure 6(a). After that step, the user enters their phone number and signature password (Step (4)). In the next step, the ServerBKU sends a generated TAN to the user's mobile phone (Step (5a)) and displays the TAN verification form (Step (5b)). Next (Step (6)), the user enters the received TAN in the TAN verification form as shown in Figure 6(b). If the TAN provided by the user was successfully verified, the ServerBKU signs the monthly timesheet and sends the signature back to Timesheep (Step (7)). The browser window opened in Step (1) closes.
Timesheep receives the signature, verifies it and notifies the user that the signature has been successfully verified. This is covered by Step (8). Signed monthly timesheets are stored in Timesheep's database and can be provided to external financial auditors to report efforts spent by users.
By integrating the ServerBKU into our in-house application Timesheep, two goals have been reached. First, the process of signing time sheets containing tracked efforts of employees has been improved in terms of efficiency and usability. Even though the performance of the developed solution has not been systematically measured so far, related work on the usability of server-based signature solutions indicates that these solutions are advantageous in terms of usability [5]. This is also supported by first practical experiences gained with the ServerBKU-enhanced Timesheep instance. These experiences show that the integration of the ServerBKU improves the user acceptance of Timesheep and helps to reduce delays in reporting efforts to the financial department. A systematic measurement of the concrete usability and performance improvement that has been reached by integrating the ServerBKU into Timesheep is regarded as future work. Second, integration of the ServerBKU into our in-house application Timesheep shows that the proposed server-based signature solution in general and its concrete implementation ServerBKU in particular are applicable in practice and can be smoothly integrated into existing applications. Thus, the proposed server-based signature solution and the ServerBKU have been evaluated successfully.
VIII. CONCLUSION
In this article, we have proposed, presented, and discussed an enhanced server-based eID and e-signature solution. Based on a set of relevant requirements, we have developed an appropriate architecture for the proposed solution first.
We have then carried this architecture over to a concrete implementation called ServerBKU using common state-of-the-art technologies. A test deployment of this implementation is publicly available online and can be accessed for test purposes [34]. Furthermore, the practical applicability of the ServerBKU has been evaluated by integrating it into the time-tracking tool Timesheep. Even though the ServerBKU is ready for productive use, there are still some open issues that are regarded as future work. First, we need to gain more practical experience with our solution, especially with regard to different deployment and application scenarios. Although first empirical results obtained by integrating the ServerBKU into the time-tracking tool Timesheep are promising, more experience is required to further develop and optimize our solution. Second, we want to systematically measure the efficiency of the ServerBKU, in order to identify potential usability limitations. For instance, Single Sign-on solutions could help to reduce required user interactions and, hence, improve efficiency and usability. While the concept of server-based eID and e-signature solutions is not completely new, the ServerBKU is the first one that is not tailored to a certain application scenario. While existing solutions such as the Austrian Mobile Phone Signature have been developed for a specific deployment scenario, the ServerBKU has been designed such that it can be easily integrated into arbitrary application and deployment scenarios. This way, the ServerBKU leverages the use of eID and e-signature functionality in arbitrary applications and

No Place to Hide: A Study of Privacy Concerns due to Location Sharing on Geo-Social Networks
Fatma S. Alrayes and Alia I. Abdelmoty
School of Computer Science & Informatics, Cardiff University, Wales, UK {F.S.Alrayes,
Abstract User location data collected on Geo-Social Networking applications (GeoSNs) can be used to enhance the services provided by such applications. However, personal location information can potentially be utilised for undesirable purposes that can compromise users' privacy. This paper presents a study of privacy implications of location-based information provision and collection on user awareness and behaviour when using GeoSNs. The dimensions of the problem are analysed and used to guide an analytical study of some representative data sets from such applications. The results of the data analysis demonstrate the extent of potential personal information that may be derived from the location information. In addition, a survey is undertaken to examine user awareness, concerns and subsequent attitude and behaviour given knowledge of the possible derived information. The results clearly demonstrate users' needs for improving their knowledge, access and visibility of their data sets as well as for means to control and manage their location data. Future work needs to investigate the current state of personal data management on GeoSNs and how their interfaces may be improved to satisfy the highlighted users' needs and to protect their privacy.
Keywords-location privacy; Geo-social networks; mobility patterns; privacy concerns.
I. INTRODUCTION
The proliferation and affordability of GPS-enabled devices are enabling individuals to accumulate an increasing amount of personal information, such as their mobility tracks, geographically tagged photos and events.
Embracing these new location-aware capabilities by social networks has led to the emergence of Geo-Social Networks (GeoSNs) that offer their users the ability to geo-reference their submissions and to share their location with other users. Subsequently, users can use location identifiers to browse and search for resources. GeoSNs include Location-Enabled Social Networks (LESNs), for example, Facebook, Twitter, Instagram and Flickr, where users' locations are supplementary identification of other primary data sets, and Location-Based Social Networks (LBSNs), for example, Foursquare and Yelp, where location is an essential key for providing the service. In addition to location data that describe the places visited by users, GeoSNs also record other personal information, such as users' friends, reviews and tips, possibly over long periods of time. Users' historical location information can be related to contextual and semantic information publicly available online and can be used to infer personal information and to construct a comprehensive user profile [1], [2]. Derived information in such profiles can include user activities, interests and mobility patterns [3], [4]. Such enriched location-based profiles can be considered useful if used to personalise and enhance the quality of the services provided by the social networking applications, for example, by recommending a place to visit on Foursquare or showing local trends on Twitter. However, they can potentially be used for undesirable purposes and can pose privacy threats ranging from location-based spam to possible threats by an adversary [5]. Users may not be fully aware of what location information is being collected, how the information is used and by whom, and hence can fail to appreciate the potential risks of disclosing their location information. In this paper, a study of location privacy of users when using GeoSNs is presented.
The aims are to investigate potential privacy implications of GeoSNs, as well as examine users' privacy concerns and attitude when using these networks. We demonstrate the privacy implications by identifying possible derived information from typical data sets collected by LBSNs for different types of users, as was shown in an earlier work [1]. In addition, a survey was undertaken to gauge users' understanding of and reaction to possible types of privacy threats resulting from the knowledge of their location information. Firstly, the dimensions of the problem are examined and the factors that can impact users' privacy are identified. These factors include the type of data collected, its visibility and accessibility by users, as well as the possible exploitation of these data by the application. Secondly, an analytical study is conducted using a representative data set to explore the location data content and the range of possible inferences that can be made from them. The frequency of usage of the networking application is used to classify users and in the analysis of their behavioural patterns. Finally, a survey was undertaken to examine users' awareness and concerns with respect to privacy implications of their location data and their needs to control access to their data on GeoSNs. Previous studies explored users' privacy concerns and attitude when sharing their location for social purposes, but presented limited evaluations using restricted application scenarios [6], [7]. Questionnaire analysis demonstrates a strong feasibility of inference of users' personal information that may pose a threat to their privacy on these networks. The survey also reveals users' concerns about their location privacy and their motivation to control their location information. The outcomes highlight the need for further work on improving the visibility

of the information collected, to allow users to better understand the implications of their location sharing activities and assess their need to control access to their location data sets. The rest of this work is organized as follows. Section II gives an overview of related work. In Section III, the dimensions of the location privacy problem in GeoSNs are discussed. Section IV describes the experiment conducted with a realistic data set to explore the spatiotemporal information content explicitly described and that may be inferred from the data. Section V builds on the results of Section IV by designing and deploying a questionnaire that explores users' awareness and attitude towards potential privacy threats. Discussion of the results and conclusions are presented in Section VI.
II. RELATED WORK
Security and privacy of online social networks is a general research area that includes evaluating potential privacy risks, as well as developing privacy-protection methods [8], [9], [10]. This paper focuses on the privacy implications of location-related information in GeoSNs. Two relevant questions to the problem studied are: to what extent is location privacy a potential concern for users in GeoSNs, and what sort of location-based inference is possible from the data collected in GeoSNs. In this section, related works on both issues are reviewed.
A. Users' Attitude and Privacy Concerns in Geo-Social Networks
Much interest has been witnessed over the past few years in studying users' attitude and concerns regarding location privacy and investigating how user-empowered location privacy protection mechanisms can influence their behaviour. Tsai et al. [6] developed a social location sharing application, where participants were capable of specifying time-based rules to share their location and were then notified of who viewed their locations.
Their findings suggested that the control given to users for setting their sharing preferences contributes to the reduction of the level of their privacy concern. Sadeh et al. [7] enabled users of their People Finder application to set rule-based location privacy controls by determining where, when and with whom to share their location; users were also notified when their location information was requested. Participants were initially reluctant to share their location information and then tended to be more comfortable over time. Patil et al. [11] developed a system to represent actual users' workplace, offering live feeds about users and their location, and asked users to define different levels of permissions for their personal information sharing. They found that participants were concerned most about their location information and that they utilised the permission feature to control this information. Another study by Kelley et al. [12] showed that users were highly concerned about their privacy, especially when sharing location information with corporate-oriented parties. Other works were carried out to examine how the employment of visualization methods may impact users' attitude to location privacy and their behaviour. Brush et al. [13] studied users' attitude towards their location privacy when using GPS tracking over long periods of time and questioned whether using some obfuscation techniques can address their concerns. Participants were concerned about revealing their home, identity and exact locations. They visually recognised and chose the obfuscation techniques they felt could best protect their location privacy. In addition, Tang et al. [14] investigated the extent to which presenting various visualizations of users' location history influences their privacy concerns when using location-sharing applications. They developed text-, map-, and time-based visualization methods and considered spatiotemporal properties of sharing historical location.
They noted that the majority of participants found visualization of location history to be more revealing and tended to prefer text-based presentation methods to limit the amount of data exposed. With regard to public GeoSNs, there are relatively few research works that examine privacy concerns of users. Lindqvist et al. [15] considered users' motivations in using Foursquare and questioned their privacy concerns. Their analysis showed that most of the participants had few concerns about their privacy and users who were more concerned about their privacy chose not to check into their private residence or to delay checking into places till after they leave, as a way of controlling their safety and privacy. A similar observation was noted by Jin et al. [16], where it was found that users were generally aware of the privacy of their place of residence and tended not to provide full home addresses and blocked access to their residential check-ins to other users. In summary, it is evident that location privacy presents a real concern to users in location-sharing applications, particularly as they become aware of the data they are providing. Previous studies may have been limited by several factors, including the size and representativeness of the sample user base used in the experiments conducted and the limited features of the proprietary applications used in testing [6], [7], [11], [12]. Moreover, as far as we are aware, no studies so far have considered the problem of location privacy on public LBSNs.
B. Location-Based Inference from GeoSNs
There are some studies that utilised publicly available information from GeoSNs in order to derive or predict users' location. In [17], Twitter users' city-level locations were estimated by only exploiting their tweet contents, with which it was possible to predict more than half of the sample within 100 miles of their actual place. Similarly, Pontes et al.
[18] examined how much personal information can be inferred from the publicly available information of Foursquare users and found the home cities of more than two-thirds of the sample within 50 kilometres. Sadilek et al. [19] investigated novel approaches for inferring users' location at any given time by taking advantage of knowing the GPS positions of their friends on Twitter. Up to 84% of users' exact dynamic locations were derived. Interestingly, Gao et al. [20] formulated predictive probability of the next check-in location by exploiting social-historical ties of some Foursquare users. They were able to predict with high accuracy possible new check-ins for places that users have not visited before by exploiting the correlation between their social network information and geographical distance in LBSNs [21]. Other works focussed on investigating the potential inference of social relationships between users of GeoSNs. Crandall et al. [22] investigated how social ties between people can be derived from spatial and temporal co-occurrence by using

publicly available data of geo-tagged pictures from Flickr. They found that relatively limited co-occurrence between users is sufficient for inferring high probability of social ties. Sadilek et al. [19] also formulated friendship predictions that derive social relationships by considering friendship formation patterns, content of messages of users and their location. They predicted 90% of friendships with accuracy beyond 80%. Additionally, Scellato et al. [23] investigated the spatial properties of social networks existing among users of three popular LBSNs and found that the likelihood of having a social connection decreases with distance. In [24], they developed a link prediction system for LBSNs by utilising users' check-in information and properties of places. 43% of all new links appeared between users with at least one check-in place in common and especially for those who have a friend in common. Studying and extracting spatiotemporal movement and activity patterns of users on GeoSNs attracted much research in recent years. Dearman et al. [25] exploited location reviews on Yelp in order to identify a collection of potential activities promoted by the reviewed location. They derived the activities supported by each location by processing the review text and validated their findings through a questionnaire. Noulas et al. [26] studied user mobility patterns in Foursquare by considering popular places and transitions between place categories. Cheng et al. [27] examined a large scale data set of users and their check-ins to analyse human movement patterns in terms of spatiotemporal, social and textual information associated with this data. They were able to measure user displacement between consecutive check-ins, distance between users' check-ins and their centre of mass, as well as the returning probability to venues.
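As an illustration only (this is not code from Cheng et al. [27]), the displacement and centre-of-mass measures mentioned above could be sketched in Python as follows. The sketch assumes that a user's check-ins are available as a time-ordered list of (latitude, longitude) pairs in degrees; all function names are hypothetical, and the centre of mass is approximated by the arithmetic mean of the coordinates, which is only reasonable for check-ins confined to a small region.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def consecutive_displacements(checkins):
    """Distance between each check-in and the one that follows it."""
    return [haversine_km(a, b) for a, b in zip(checkins, checkins[1:])]


def centre_of_mass(checkins):
    """Mean latitude and longitude (assumed approximation for small regions)."""
    n = len(checkins)
    mean_lat = sum(lat for lat, _ in checkins) / n
    mean_lon = sum(lon for _, lon in checkins) / n
    return (mean_lat, mean_lon)


def distances_to_centre(checkins):
    """Distance of every check-in from the user's centre of mass."""
    centre = centre_of_mass(checkins)
    return [haversine_km(p, centre) for p in checkins]
```

Under these assumptions, a user whose displacements and distances to the centre of mass are consistently small would be moving within a compact area, which is the kind of individual-level pattern that raises the privacy questions discussed in this paper.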
They also studied factors affecting users' movement and found a considerable relationship between users' mobility and geographic and economic conditions. More recently, Preotiuc-Pietro et al. [28] investigated the behaviour of thousands of frequent Foursquare users. They analysed users' movements including returning probability, check-in frequency, inter-event time, and place transition among each venue category. They were also able to group users based on their check-in behaviour, such as generic, businessmen or workaholics, as well as predict users' future movement. The above studies show a significant potential for deriving personal information from GeoSNs and hence also imply the possible privacy threats to users of these applications. Whereas previous studies considered mobility and behaviour of large user groups and determined general patterns and collective behaviour, in this work we consider the privacy implications for individual users, with the aim of understanding possible implied user profiles from location data stored in GeoSNs.
III. DIMENSIONS OF THE LOCATION PRIVACY PROBLEM ON GEOSNS
Four aspects of the data collected can be identified that can affect location privacy. These are: 1) the amount of data collected and its quality, 2) its visibility and accessibility, 3) its possible utilisation by potential users, and 4) the level of security offered to users by the application. This discussion focuses on the type of privacy-related questions that can be asked and the confidence level in the information that can be derived. Both factors can affect the degree of privacy concern to users. The study considers both LBSNs (Foursquare) and LESNs (Twitter), the difference in the way location data are acquired in both and the issues implied.
A. Location Data Collection
Here the types of data, its density and quality, as well as the methods of collection and storage are considered.
1) Method of Collection: Both LBSNs and LESNs depend on the user device to acquire the user's current location using GPS, wireless access points (WAP) or cellular networks. When using LBSNs, location data are collected automatically since location is mandatory to providing the service. In Foursquare specifically, the user's location is implicitly acquired on a continuous basis, even without using the service. Users' check-ins into specific places are verified against their estimated current location and recorded explicitly. In LESNs, the user's location data are collected only when location-based features are enabled and used. Some features require continuous collection of location data, for example, when tailoring trends to the user's location in Twitter. The mode of data collection, whether continuous or periodic, automatic or manual, will impact the volume of data collected and its accuracy, and hence also the degree of confidence in inferences made from the data.
2) Types of Data: The completeness and accuracy of location information are primary factors that determine the possible inferences made based on this information and the possible privacy threats to users. Three types of data can be associated with location data collected in GeoSNs: spatial, non-spatial and temporal.
Spatial semantics: These refer to any type of information that can be used to identify the places visited. In both LBSNs and LESNs, the user's location is identified as a point in space with a latitude and longitude. In LBSNs, users identify their locations explicitly, allowing for a rich definition of place identity, including place name, type classification and street address. On the other hand, location in LESNs is determined automatically by reverse geocoding the registered latitude and longitude coordinates, and thus carries a degree of inaccuracy and ambiguity. Increasingly, some LESNs are able to use resources from LBSNs for defining locations.
For instance, Instagram allows users to geotag their pictures using the Foursquare API [29]. Twitter also uses Google API for linking users' selected place names with a location on a map. Hence, in both cases it can be assumed that detailed and precise place identities visited by users may be stored by the applications.
Non-spatial semantics: Non-spatial semantics are other types of data about both users and places that may be associated with location information. These include explicit user data, as for example defined in their personal profiles on the application, or place-related data, such as reviews, tags and pictures. With the user's permission, applications will identify users and share their personal information. Rich place-related semantics may also be mined from resources on the web [30].
Temporal semantics: These represent the time of the user's visit to a place and the duration of their visit. In LBSN, the time of visit is registered by the user as they

check in to a place. The user's physical presence in the place may be validated by comparing their actual GPS coordinates with those of the place they check into. In LESN, a time stamp is encoded with the resource used, for example, a tweet location. However, in this case it is difficult to ascertain whether the user is intentionally visiting the place or happened to be passing by it. In both cases, further processing of the user tracks is needed to estimate the duration of the user's visit.
3) Data Volume: The amount of location data collected is another important factor to be considered and is dependent on the user's attitude and behaviour when using the application. The pattern of data logging and the frequency of usage will determine the density of the data collected over time and will thus influence the type of information that may be inferred from the data. For example, regular visits to specific places can determine routine mobility patterns, while incidental visits to other places can signify special events or activities.
B. Location Information Accessibility
Location information accessibility represents how much of the user's data are available and visible to others including the user, other users and third parties of the service. In terms of users' accessibility to their collected location and location-related data, GeoSNs provide only limited means for accessing these kinds of information. In Foursquare, users' previous check-in information is available in the form of check-in history, where users can view their visited venues, dates of visits and tips they made. These raw data provide only a limited view of the information content in the data, as discussed in the previous section. In Twitter, users can request to download their tweet history, but location information is not included in this data. As for information visibility, most of the users' information published on GeoSNs is available to their friends and can be visible to other users.
Generally, users of GeoSNs have limited control over the visibility and accessibility of their information by others, since the privacy settings provided to them are not adequate to manage all aspects of their information accessibility. In Foursquare, almost all of the user's information is publicly available by default and can be viewed by other users. This includes profile information, tips, likes, friends list, photos, badges, mayorships, and check-ins. Users are only able to block access to their check-ins and photos by setting their view to private. Similarly in Twitter, users' profiles and their tweets are public by default, and can be accessed by others. This means that location information attached to tweets is publicly available as well unless users mark their profile as private, where only followers can view their data. All of the publicly available users' information is accessible by third parties including users of the geo-social application APIs. Third parties can also have privileges to access the user's personal information. In the case of Foursquare, third parties can get check-in data in anonymous form, but Foursquare also indicates that it will share users' personal information with its business partners and whenever necessary in some situations, such as enforcement of law. Twitter, on the other hand, states that any content the user submits or displays through the service is available to its third parties without anonymity.
C. Location Data Exploitation
Location information exploitation refers to how the application or third parties can utilise the data and for which purposes. This dimension involves the actual exploitation of the user's location and location-related data that lead to posing various levels of privacy threats. It seems that GeoSNs have unlimited rights to utilise their users' data in any way, for any purpose as stated in their terms of use.
For example, Foursquare gives itself absolute privileges over using and manipulating user information as stated in their terms of use [31]: "By submitting User Submissions on the Site or otherwise through the Service, you hereby do and shall grant Foursquare a worldwide, non-exclusive, royalty-free, fully paid, sublicensable and transferable license to use, copy, edit, modify, reproduce, distribute, prepare derivative works of, display, perform, and otherwise fully exploit the User Submissions in connection with the Site, the Service and Foursquare's (and its successors' and assigns') business, including without limitation for promoting and redistributing part or all of the Site (and derivative works thereof) or the Service in any media formats and through any media channels (including, without limitation, third party websites and feeds)." Similarly, Twitter has the right to utilise users' data, including location information, in various ways, as stated in their terms of use [32]: "By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods." It is clear therefore that there are no commitments on GeoSNs as to how the data may be used or shared by the application or by other parties. In addition, the reasons for the potential exploitation of users' data are vague (e.g., to improve the services) or even not stated. Hence, by agreeing to the terms and conditions, users effectively are giving away their data and unconditional rights to the use of their data to the application.
D. Location Data Security
Location data security refers to the level of data protection provided by the application for securing the user's data against the risk of loss or unauthorized access.
In general, the fact that data are stored somewhere on servers opens the door to potential undeclared access and use, and hence it is almost impossible to guarantee the security of the user data. Foursquare declares that the security of users' information is not guaranteed and that any unauthorized entry or use, hardware or software failure, and other factors may compromise the security of user information at any time. Without any commitment to responsibility for data security, the application provider is declaring a possibly high risk of data abuse by any adversary or even by the application provider itself. Twitter states that it complies with the U.S.-E.U. and U.S.-Swiss Safe Harbor Privacy Principles of notice, choice, onward transfer, security, data integrity, access, and enforcement, but gives no additional explanation or examples of the situations or access methods that these laws apply to.

In the following section, a sample data set from an LBSN is used to explore and analyse the potential information content that can be derived from the location data.

IV. EMPIRICAL INVESTIGATION

This analysis is carried out using a real-world data set from Foursquare, as a typical example of an LBSN. The purpose is to demonstrate possible privacy implications in terms of personal information inferences and exploitation from user activity on GeoSNs. The effect of location data density and diversity on the possible inferences that can be made is analysed.

A. Dataset

The Foursquare dataset used in this analysis is provided by Jin et al. [16]. The dataset contains venue information and public check-ins for anonymised users around the wide area of Pittsburgh, USA, from 24 February 2012 to 22 July. Places on Foursquare are associated with pre-defined and structured place categories, e.g., Home, Office, Restaurant, etc. The data set contains 60,853 local venues, 45,289 users and 1,276,988 public check-ins of these users.

B. Approach and Tools Used

To study the possible impact of location data density on users' privacy, users of the dataset were first classified into groups based on their check-in frequency. A filter was initially imposed to disregard sparse user activity: users with fewer than five check-ins per month were removed from the dataset. The rest of the users were categorised into three groups based on their check-in frequency per day, namely moderate, frequent and hyper-active users, as shown in Table I. One representative user was selected from each group, namely the user whose average check-ins per day is nearest to the average for the whole group. Table II shows some statistics for the selected users.

The R statistical package was used for analyses and presentation of results. Mainly, the sqldf package was used for querying, linking and manipulating the data, and the ggplot2 package was used for the presentation of the results of the analysis [33].

C. Results

The analysis examined what kind of implicit user-related information, of a sort that can be considered private, may be extracted from the location data collected. A user's spatial location history can be extracted in the form of visits to venues and the exact times of such visits. The places visited are identified and described in detail. For example, user7105 visited "Kohl's", a department store with a recorded latitude and longitude, at 9 a.m. on Monday 27/2/2012. The basic information on venue check-ins can be analysed further and combined with other semantic information from the user profile to extract further information that can compromise the user's privacy. The analysis investigates the relationship between users and places visited, their mobility patterns, and the relationships between users and other users, as follows.

Degree of association between user and place. Relationships with individual place instances as well as with general place types or categories will be studied. Elements of interest include visit frequency and possible commuting habits, in terms of the association between the visit frequency of places and their location.

Spatiotemporal movement patterns. Visiting patterns to individual places or to groups of places can identify regular movement patterns. In addition, a change of visit patterns can also be a significant pointer to user activity.

Degree of association with other users. Relationships between users can be derived by studying their movement patterns and analysing their co-occurrence in place and time.

TABLE I: Statistics of user groups in the Foursquare dataset. Columns: Group Name, Check-ins Range in Total, Users Count, Check-ins Range per Day, Average Check-ins per Day. Groups: Moderate (total check-ins from 50), Frequent (from 301), Hyper-active (from 751).

TABLE II: Profiles of selected users (User9119, User7105, User2651). Factors: Number of total check-ins, Average check-ins per day, Number of visited venues, Number of visited venue categories, Number of visited venue main categories, Number of friends.

1) The Moderate User: The analysis results for user9119, selected from the moderate group, are as follows.

a) Degree of Association Between User and Place: The two venues most frequently visited by user9119 are "Penn Garrison", whose category is Home, and "USX Tower", whose category is Office, representing 44% and 36%, respectively, of the total check-ins. Home and Office are highly sensitive places, yet they represent 80% of this user's check-ins. Other place types visited with significantly lower frequency include Nightlife Spot (0.5%), Travel & Transport (0.27%), and Shop & Service (0.27%). User9119 is also interested in the Hockey, Garden Center and Museum place types. As could be predicted, most of the venues visited are close to Home and Office, whereas this user commutes further away to visit some less frequent venues such as Hockey Arena. Figure 1 shows this user's check-in frequency for different categories of venues, classified by the time of day. As can be seen from the figure, this user's association with sensitive places like home and place of work can be identified. In addition, a strong association with other place categories is also evident.

b) Spatiotemporal Movement Patterns: About 40% of this user's total check-ins occur at 9am, mostly in the Office, and at 7pm, mostly at Home. More than two-thirds of the check-ins are between 10am and 2pm and between 6pm and 11pm, which indicates that this user commutes more frequently during these hours. From the weekly patterns of movement, it can be seen that 71% of the venues were visited after 6pm. Mondays and Thursdays are when this user is most active, representing 41% of the check-ins.
User9119 tends to go to Nightlife Spots more frequently during working days, whereas visits to other specific place types occur only at weekends, including Salon or Barbershop, Coffee Shop and Garden Centre. This user typically starts commuting earlier on working days and visits more places than on weekends. Observing the check-ins by month shows that May and June are the most active months in terms of check-in frequency, comprising 60% of total check-ins, as well as in terms of the diversity of venue categories visited (99% of the total visited categories of venues occurred in those months, including the emergence of new categories such as Museum, Airport and Hotel). The user was least active in April. Figure 2 demonstrates this user's check-ins count in different categories of venues, classified by day and grouped by month. Some changes in this user's habits can be noticed as well, which can suggest a change of personal circumstances. For example, the user did not visit any Nightlife Spots in March and April and did not check in at any place on Sundays in June and July, including Home and Office. In addition, the user did not check in at any place for a period of a week between the 21st and 28th of April. User9119's last check-in before this week was on the 20th of April at Home. This may indicate a possible period of time off work in that week.

Figure 1: The moderate user's check-ins count, classified by the category of venues for different hours of the day.

c) Degree of Association with Other Users: Co-location is used here to denote that users have visited the same venue at the same time. This can be used as a measure of interest in a place and of relationships between users. User9119 was co-located in 6 unique venue categories with two (out of twenty) friends. He shared three co-occurrences with two friends: once with friend1236 at American Restaurant and twice with friend15229 at Office, which may indicate that friend15229 is a colleague at work. In fact, this user shared 95 co-occurrences with 52 other users, 90% of which were in the Office, suggesting that those users are probably work colleagues.
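The co-occurrence counts reported here depend on how "the same venue at the same time" is operationalised, which the text does not specify. The following is a minimal sketch only, written in Python although the analysis itself was carried out in R. It assumes that check-ins are available as (user, venue, timestamp) records and that two check-ins count as simultaneous when they fall within a fixed time window. The function name `co_occurrences`, the one-hour default window and the toy records are all illustrative assumptions, not part of the original study.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def co_occurrences(checkins, user, window=timedelta(hours=1)):
    """List (other_user, venue, own_time, other_time) tuples for check-ins by
    other users at the same venue within `window` of a check-in by `user`.

    The time window is an assumption; the paper does not define "same time".
    """
    # Group all check-ins by venue so only same-venue pairs are compared.
    by_venue = defaultdict(list)
    for uid, venue, ts in checkins:
        by_venue[venue].append((uid, ts))

    matches = []
    for venue, visits in by_venue.items():
        own_times = [ts for uid, ts in visits if uid == user]
        for uid, ts in visits:
            if uid == user:
                continue
            for own in own_times:
                if abs(ts - own) <= window:
                    matches.append((uid, venue, own, ts))
    return matches


# Toy records (user, venue category, timestamp); not taken from the dataset.
sample = [
    ("u1", "Office", datetime(2012, 3, 5, 9, 0)),
    ("u2", "Office", datetime(2012, 3, 5, 9, 20)),
    ("u3", "Office", datetime(2012, 3, 5, 15, 0)),
    ("u2", "Bar", datetime(2012, 3, 6, 21, 0)),
]
pairs = co_occurrences(sample, "u1")
```

With these toy records, only the check-in by `u2` at Office falls inside the window around the check-in by `u1`. Dropping the time condition and comparing venues alone would give the looser notion of having visited the same venue at any time.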
2) The Frequent User: The analysis results for user7105, selected from the frequent user group, are as follows.

a) Degree of Association Between User and Place: Similar to the moderate user, user7105's most checked-in venue category is Home, whose location is identified in detail. However, the second most visited venue is a specific restaurant, whose category is American Restaurant, representing 25% of the total check-ins and 28% of category check-ins. This visit pattern may indicate that this is the user's work place. The third most visited venue category for this user is Bar (4%), which is a subcategory of Nightlife Spot, representing about 7% of check-ins. Generally, the third most visited main category is Shop & Service, corresponding to 10% of check-ins, of which 40% are to Gas Station or Garage and 25% are to Drugstore or Pharmacy. User7105 is occasionally interested in visiting places described as Great Outdoors, Professional & Other Places and Arts & Entertainment. The majority of the most frequently visited venues are within close distance of Home and the American Restaurant, whereas user7105 commutes further away for other less frequently visited places, such as Medical Center.

b) Spatiotemporal Movement Patterns: Generally, about 20% of the check-ins occur from 10am to 12pm, half of which are at Home. In addition, user7105 tends to move the most between 3pm and 5pm, representing 23% of his total check-ins, spread over 46% of the visited venue categories. More than half of these check-ins are at "Atria's", which may indicate that the user starts his work shift in this place at that time. This hypothesis is supported by his subsequent check-ins, where 18% of the check-ins happen between 12am and 3am at Home, possibly when the user comes back from work. There is a high correlation in terms of place transition between Home and the American Restaurant.
When examining the weekly mobility, user7105 is most active on Tuesdays, followed by Saturdays, corresponding to 19% and 16%, respectively, of total check-ins. Noticeably, the majority of Friday and Tuesday check-ins occur at 12am, whereas the majority of Monday and Saturday check-ins occur at 4pm. Furthermore, this user visited the most diverse venues on Tuesdays, followed by Thursdays and Wednesdays, representing 53%, 43% and 38%, respectively, of the total visited categories. During the working week, this user tends to visit a Bar (5%), especially on Tuesdays, and Gas Station or Garage (4%). This is reasonable considering his working shifts. On weekends, Grocery or Supermarket and Drugstore or Pharmacy venues are among the top four visited categories, corresponding to 4% and 5%, respectively, of weekend check-ins. User7105's check-in patterns were regular over the whole period. However, this user's visits are more frequent and diversified in the month of March. Noticeably, about 28% of the check-ins between 12am and 3am occurred in March, indicating a possible change of lifestyle. Figure 3 presents this user's check-ins count in different categories of venues, classified by day and grouped by month.

Figure 2: The moderate user's check-ins count in different categories of venues, classified by day and grouped by month.

Figure 3: The frequent user's check-ins count in different categories of venues, classified by day and grouped by month.

c) Degree of Association with Other Users: User7105 had co-locations in 36 unique venues from 19 different categories with 7 friends. In particular, 26 co-locations are shared with friend38466 at 14 venue categories, including Coffee Shop, Bar, Fast Food Restaurant and Other Nightlife. Co-locations shared with the rest of the friends include Bar, Mexican Restaurant, Hospital and Government Building. Moreover, user7105 has 16 spatiotemporal co-occurrences at 14 unique venues from 6 different categories with two friends. Of these, 14 co-occurrences are with friend38466 at 6 different categories, mostly Bar, American Restaurant, and Sandwich Place, which can denote a close friendship between them. The other two co-occurrences are with friend15995 at American Restaurant on May 13th and June 17th. The place and time of this user's co-occurrences with friends are shown in Figure 4. Similarly, this user has 89 co-occurrences with other users, who are not stated as friends, at 29 unique venues, where 38% of these co-occurrences are at American Restaurant and 24% at Plaza.

3) The Hyper-Active User: The analysis results for user2651, selected from the hyper-active user group, are as follows.

a) Degree of Association Between User and Place: The venue most visited by this user is a Nightlife Spot, corresponding to 15% of total check-ins. Two Home venues were recorded, "My Back Yard" and "La Couch", representing 23% of the check-ins. Both home venues have the same location coordinates, implying that they are actually the same place. Automotive Shop, Pool and Italian Restaurant, representing 9%, 8% and 5%, respectively, of this user's total check-ins, indicate the user's interests and activities, swimming and Italian food in this case. A particular instance with the vague category of Building was among the top 10 most visited venues. Further investigation of this venue using the given place name revealed that this building is a place where an international summit for creative people is held [34], which may indicate that user2651 is an active participant in such an event. When considering the main category of the visited venues, this user generally visits Shop & Service, Nightlife Spot, Arts & Entertainment and Food on a regular basis, representing 17%, 14%, 11% and 10%, respectively, of this user's check-ins. User2651 also regularly visits Gas Station or Garage (4%) and Church (3%). The locations of the visited venues can be clustered into two main areas on a map, as illustrated in Figure 5. One area includes Home as well as other frequently visited venues such as Nightlife Spots and Gym or Fitness Center.
The other area includes mostly less frequently visited venues such as Hospital.

b) Spatiotemporal Movement Patterns: Overall, 53% of residential check-ins occur between 9am and 12pm. A significant number of check-ins (10%) occur at 2pm, of which almost two-thirds occur in an Automotive Shop. Check-in frequency reaches another peak between 11pm and 12am (18%), of which more than half are in a Nightlife Spot. Noticeably, this user tends to be more active at night, with about 70% of the check-ins registered after 6pm. In his case, weekends have a similar check-in frequency to the working week, but Sunday registers as the most active day in terms of check-in frequency. Moreover, user2651 checks in considerably less frequently at the Automotive Shop and the Pool on Wednesdays and Fridays, but checks in at the Automotive Shop and Nightlife Spot at weekends. This may indicate that he works shifts at weekends. User2651 has regular check-in patterns over the whole period. However, in the months of June and July, check-ins at Hotel and Pool increased significantly, representing 75% and 60%, respectively, of the total check-ins at these venues. Figure 6 demonstrates this user's check-ins count in different categories of venues, classified by day and grouped by month.

c) Degree of Association with Other Users: As with the other users, user2651 was co-located with 23 users at 12 distinct venues, and half of these co-occurrences happened in Bar, Automotive Shop and Grocery or Supermarket. User2651 is co-located in 27 unique venues from 19 categories with 9 friends, 13 of which are with friend12432 and 9 with a second friend. Most of the co-locations are in Nightlife Spots, Gas Station or Garage, Pool, Flower Shop and Bar.

The three dimensions analysed above will form the basis of the questionnaire design described in the next section.

Figure 4: Spatiotemporal tracks of the frequent user's co-occurrences with friends.

Figure 5: Coordinates of venues visited by the hyper-active user, considering the frequency of visit.

Figure 6: Count of check-ins for the hyper-active user in different categories of venues, classified by day and grouped by month.

V. USER STUDY

None of the related studies reviewed in Section II above has fully explored or focused on improving users' awareness and understanding of the potential privacy implications of sharing their location information on GeoSNs. Here, a survey is undertaken to examine the privacy concerns and behaviour of users of online social networks, in particular users' concerns about their location information. Three main aspects are addressed in this study: the extent of users' awareness of the terms of use they sign up to when using these applications, their understanding of and attitude to potential privacy implications, and how they may wish to control access to their personal information on these applications.

A. Study Design

The questionnaire was developed using Google Forms. Targeted participants were users of online social networking applications who use location features, e.g., adding location to their posts and photos and checking in when visiting places. A pilot study was first carried out to ensure the clarity and coherence of the survey. Four volunteers with no specific background completed the survey and provided valuable feedback on the wording and layout of the questions used. The survey was then disseminated widely within the university to staff and students and was also advertised on social networks through the author's account. A token incentive of 10 Amazon vouchers was offered to ten randomly chosen participants who completed the survey.

The questionnaire consists of four main sections. The first section collects background information on the participants and their use of GeoSNs. The next section examines users' knowledge of the terms of use and privacy policies of the applications, followed by a section studying the perception of possible inferences of personal information.
The last section is intended to capture users' attitude to privacy on social networks as well as their attitude to controlling their personal information.

B. Results

The questionnaire data were analysed using the R statistical package and the results are presented below. A total of 186 participants completed the survey, of which 60% are young adults in the age group 15-24, divided almost equally between males and females. The vast majority of participants (77%) use the services frequently (several times a day) and 72% of participants use the location services in GeoSNs. About 60% use location features on only one application. Adding locations to posts and pictures on Facebook was the most used service, corresponding to 47% of the total number of location services used. This is followed by adding location to tweets on Twitter, photo mapping pictures on Instagram, and checking in on Foursquare, representing 17%, 16% and 10%, respectively, as illustrated in Figure 7. In addition, most of the users noted that they sometimes use geosocial applications, with almost a fifth of users always using the location services. Foursquare users use their service more frequently than users of other services, and 25% of the users have linked their accounts on different social networking applications. The questionnaire was presented to participants in whole and takes roughly 10 minutes to complete. In what follows, the results from the different sections of the questionnaire are analysed.

Figure 7: Percentage of the type of location services used by the 186 participants.

1) Knowledge of Terms of Use and Privacy Policies for Social Networking Applications: Here, awareness of the terms of use and privacy policies is examined and analysed against users' profiles. In general, the majority of the users (81%) have not read the terms of use or privacy policies of the social networking applications they use. Users were presented with the following typical statements representing the terms of use relating to location information and were asked to indicate whether they were aware of the information in the statements. Note that the following statements are representative of the terms of use of all the GeoSNs in question. The results are shown in Figure 8, grouped by the frequency of use.

Term 1: The application collects and stores your precise location (as a place name and/or a GPS point), even if you mark your location as private, for a possibly indefinite amount of time.
Term 2: The application can use your location information in any way possible including sharing it with other applications or partners for various purposes (commercial or non-commercial).
Term 3: If you share your location information, your friends and any other users are able to access and use it in any way possible.
Term 4: The application can collect other personal information, such as your personal profile information and browsing history from other web applications.

More than half (53%) of users acknowledged awareness of all of the statements, and of those 73% have read the terms and policies. Most users (75%) are aware of statement 3, relating to the sharing of information with friends, but are generally unaware of statements 1 and 4, relating to how their location and other information may be collected and stored by the applications. It is interesting to note that frequent users of such applications are generally unaware of such statements (49%), as demonstrated in Figure 8.
Younger users aged between 15 and 34 tend to be more knowledgeable about these policies (60%), but gender does not seem to be a factor in these results.

Figure 8: Users' awareness of general terms and policies of GeoSNs (Term 1 to Term 4), grouped by the frequency of use.

2) Perceptions of Possible Privacy Implications: In this section, users' attitude towards the inference of personal information by the application is examined. In particular, the questions aim to gauge users' awareness of plausible inferences about their private places, activities at different times, their connections to other users, and possible knowledge of this information by the application. Participants were presented with 14 statements, shown below. They were then asked to indicate, for each statement, whether they were aware that the statement is possible and to score their reaction to the possibility of this statement as either OK, Uncomfortable or Very Worried. The first twelve statements refer to knowledge by the application itself, while the last two statements reflect the terms of use that suggest that the application can share the user's data with other users and third parties.

S1: I can guess where your home is.
S2: I can guess where your work place is.
S3: I know which places you visit and at what times.
S4: I can tell where you normally go and what you do in your weekends.
S5: I can tell you where you go for lunch or what you do after work.
S6: I know your favourite store (your favourite restaurant, your favourite coffee shop, etc.)
S7: I can guess what you do when you are in a specific place.
S8: I can guess when you are AWAY from home.
S9: I can guess when you are OFF work.
S10: I know who your friends are.
S11: I know when and where you meet up with your friends.
S12: I can guess which of your friends you see most.
S13: Other people can know where you are at any point in time.
S14: Other people can know what you are doing at any point in time.

In terms of awareness, users seem to be most aware of statements S1, S2 and S10, regarding the location of home, place of work and friends, representing 88%, 89% and 93%, respectively. On the other hand, users are least aware of statements S5, S13 and S14, which relate to other users' knowledge of personal mobility patterns and activities, representing 34%, 37% and 40%. The awareness level of the users is shown in Figure 9, grouped by the frequency of use. Despite a reasonable level of awareness about the plausibility of these statements, users seemed to be relatively concerned about their privacy: 66% of users' reactions were either Uncomfortable (41%) or Very Worried (25%), as can be seen in Figure 10(a). Over half of the responses to S2 (awareness of workplace, 53%) and S10 (awareness of friends, 65%) indicated no concern. On the other hand, participants were most concerned with S13 and S14, with the Very Worried category scoring 83% and 84%, respectively. S1 and S11, relating to the location of home and meetings with friends, were rated most Uncomfortable, corresponding to 53% and 51%, respectively. Statement S8, suggesting knowledge of the user's absence from home, and S13, indicating the possible knowledge of this information by other people, presented a significant source of worry to users, with 45% and 42%, respectively, indicating that they are Very Worried about these statements. It appears that users who read the terms and policies are more aware (by 9%) of the statements, while users who have not read the terms and policies were significantly more often Very Worried (by 21%) than other users. Moreover, there is a positive correlation between the age of the participant and their level of awareness: the level of awareness increases considerably with age group, with the oldest active age group (35 to 44 years) scoring 89%. Yet younger users, in the age group 15 to 34 years, tend to be relatively less concerned than older users (by 4%).
The level of users' concern increases as the frequency of use of the applications decreases: 76% of occasional users are concerned compared to 63% of frequent users. Users of Facebook and Instagram registered the highest degree of concern among all users of GeoSNs, scoring 63% and 62%, respectively, as shown in Figure 10(b). Again, gender does not seem to have any significant influence in this study.

Figure 9: Users' awareness about potential information inferences (S1-S14), grouped by frequency of use of GeoSNs.

3) Attitude to Privacy on Social Networks: The aim of this section of the questionnaire is to understand the users' reaction with regard to using the applications, given the knowledge of potential implications for privacy from the previous section. 61% of users stated that they would change the way they share their location information, 55% of whom are willing to stop sharing their location information completely, with the rest of the group indicating they would share it less often. Frequent users seem to be the most motivated to change their sharing behaviour (13% more than infrequent users), as illustrated in Figure 11, but they are also less willing to stop sharing the information and would prefer to share less frequently than the infrequent users (by 47%). Interestingly, users of location services are more tempted (by 10%) to change how they disclose their location information compared to users who have not used them. 57% of the first group of users want to share their location less frequently and 43% are willing to discontinue disclosing their location data. Younger users (15-34) are more willing to change their usage behaviour (by an average of 18%) and are even more willing to stop sharing location information completely (by an average of 10%) than older users.
In this case, it seems that female users are more motivated to change their attitude regarding location disclosure (by 11%) than males, yet 60% of male participants indicated their willingness to discontinue using location services.

4) Managing Personal Information: In this section, users' views on managing and controlling access to their location information are explored. This includes several aspects related to what information is stored, how it is shared or viewed by the application and by others, and whether users need to manage access to their information. The following statements were presented to the participants, who were asked to rate how often they would use them: All the time, Occasionally or Never.

C1: I would like to be able to turn off location sharing for specific durations of time.
C2: I would like to turn off location sharing when I visit specific types of places.
C3: I would like to decide how much of my location information history is stored and used by the application, for example use only my check-in history for the last 7 days.
C4: I would like to see the predicted personal information that the application stores about me based on my location information.
C5: I would like to decide how people see my current location, for example exact place name, or a rough indication of where I am.
C6: I would like to decide who can download my location information data.
C7: I would like to know, and control, which information can be shared with other Web applications.
C8: I would like to make my location information private, seen only by myself and by the people I choose.

Figure 10: (a) Users' reaction towards potential inferences, grouped by the inference statements (S1-S14). (b) Data in (a) grouped by the GeoSNs used.

Results are given in Figure 12(a) and show a significant desire to use these controls for location privacy. Overall, 76% of participants would like to apply those controls All the time, 20% are happy to apply them Occasionally, and only 4% of users will not consider these controls. In general, C2, C6, C7 and C8 were the most favoured controls, each scoring over 97% of users' responses. Controls C1, C6 and C7 were the controls most often chosen to be applied all the time, representing 91%, 88% and 86% of users' responses, respectively. It is worth noting that users of different location services have similar acceptance rates for these controls. Foursquare and Facebook users have the highest preference for applying the controls All the time, corresponding to 76% and 75%, respectively. A negative correlation appears to exist between users' tendency to use these privacy controls all the time and their age group. The youngest active age group has the highest desire for all-the-time application of controls, representing 78% of this group's responses. As expected, users who are tempted to change their location sharing behaviour have relatively higher motivation to use these controls, representing 97% of this group's responses (4% higher than users who are reluctant to change). The factors of gender, whether users read the applications' terms, and how frequently they use the social networks, as shown in Figure 12(b), seem to have minimal influence on their willingness to use these controls.

Figure 11: Users' attitude to location privacy risk, grouped by frequency of use of GeoSNs.
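The percentages reported for the controls are shares of each answer option per statement. As a rough illustration of that tabulation (the survey itself was analysed in R), the Python sketch below assumes the responses are held as a mapping from statement label to a list of answers. The data layout, the function name `response_shares` and the toy answers are hypothetical and are not the survey data.

```python
from collections import Counter


def response_shares(responses):
    """Return, per statement, the percentage of each answer option.

    `responses` maps a statement label (e.g. "C1") to a list of answers.
    Percentages are rounded to whole numbers, as in the reported results.
    """
    shares = {}
    for statement, answers in responses.items():
        counts = Counter(answers)
        total = len(answers)
        shares[statement] = {
            option: round(100 * n / total) for option, n in counts.items()
        }
    return shares


# Toy answers for two statements; illustrative only.
toy = {
    "C1": ["All the time", "All the time", "All the time", "Occasionally"],
    "C2": ["Never", "All the time"],
}
shares = response_shares(toy)
```

The same function can be applied to a subset of participants (for example, frequent users only) to obtain grouped shares of the kind plotted per user group.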
In the future it will be useful to undertake a longitudinal study that tracks user behaviour over time to understand the factors that may influence their attitude to location privacy, for example the impact of friends and age group. VI. DISCUSSIONS AND CONCLUSIONS The proliferation of location-based GeoSNs and the largescale uptake by users suggest the urgency and importance of studying privacy implications of personal information collected by these networks. Identifying user profiles is a goal of many businesses that is now commonly accepted by users for the purpose of improving the quality of service. However, GeoSNs do not explicitly present similar business goals and thus their motivations for collecting and sharing personal location information are not clear. Also, the issue is complicated as the data collected may be shared or accessed by other users and applications. The results of this study highlight the possible implications to user privacy and the need for developing means for raising the user awareness of these issues, and possibly also giving the user control on managing access to their data. The data analysis experiment conducted here shows the amount and types of personal information that can be inferred using location data. Users spatiotemporal mobility tracks can be analysed to identify where they are, where they are likely to be, and sometimes more significantly, where they are not present. Tracking user location data may also give indications to their preferred activities, places, habits and friendship community. As can be expected, the more frequent the applications are used, the more dense the spatiotemporal history of user data collected and the more certainty in the derived information extracted from this data. Whilst the statistical analysis carried

out in this study highlights some of the basic and interesting inferences that can be made, more sophisticated location-based inference methods can be developed to infer, for example, the probability of future movements, methods of transport and places visited. The now common practice of linking user accounts in several GeoSNs increases the availability of data and compounds the privacy risks to users, who sign up to different, possibly contradictory, terms of use and policies of different applications. For example, developers now use the Twitter API to collect user check-ins in Foursquare.

The questionnaire conducted in Section V provides valuable insights into many aspects of location privacy on the Social Web from the perspective of the end user. The main (and possibly only) means of communicating how the collected user information may be used and exploited by the application is the application's terms of use. It is clear from the results of the questionnaire that the majority of users, especially those who use location services, do not read the terms of use and policy documents. The findings also indicate that users are aware of the potential information, and possible derivatives thereof, stored by the application. However, it appears that they are also quite concerned about the privacy implications. These apparently contradictory findings may arise because such awareness and concerns are evident when users are actively questioned about these issues, but are somewhat screened from the users' minds during the continuous use of the application.

Figure 12: (a) Users' desire to use location privacy controls, grouped by statement of controls C1-C8 (see Section V-B4). (b) Data in (a) grouped by frequency of use of GeoSNs.
The study also suggests that users may not fully understand the privacy implications, where their level of concern was much more pronounced when faced with statements that indicate that other people may be aware of their location information in comparison to statements indicating that the application holds such information. The study reveals that there is a strong need for the users to be continuously aware of their data, how it is stored and to have the ability to control access to and visibility of their location data sets. Further research is needed into methods that enhance the communication of the information by the applications as well as methods to allow users to better understand and control their personal profiles on such networks.

A Generalized View on Pseudonyms and Domain Specific Local Identifiers
Lessons Learned from Various Use Cases

Uwe Roth
SANTEC, CRP Henri Tudor, L-1855 Luxembourg, Luxembourg

Abstract
Pseudonymisation as a data privacy concept for medical data is not new. The process of pseudonymisation gets difficult in concrete use-case setups with their different variations of data flow between those who collect, those who store, and those who access the data. In all cases, questions have to be answered about who has access to the demographics of a person, who has access to the pseudonym, and finally, who creates the pseudonym. Since a fundamental part of pseudonym creation depends on the identification of a person on the basis of their demographics, things get even more difficult in the case of unclear matching decisions, the management of wrong matches, or updates of demographic information. In this journal article, a unified view on pseudonyms is proposed. Pseudonyms are treated as local identifiers in an identifier domain, but in a domain that has no demographics. Additionally, persistent identifiers are introduced that allow the handling of updates and internal matching reconsiderations. Finally, two concepts for pseudonymisation are shown: First, a National Pseudonymisation Service is sketched with a focus on resistance against update problems and wrong matching decisions. It is designed to cover every possible variation of the exchange of local identifiers between a source of personal data and the storage destination. Second, an algorithm for the creation of a pseudonym from a person identifier is described. This algorithm is needed if the pseudonymisation is not performed by an external service but in-house, and if the number space of the pseudonyms is limited. Both solutions are suitable for a huge variety of pseudonymisation setups, as demanded by researchers of clinical trials and studies.
Keywords-patient privacy-enhancing technologies; secure patient data storage; pseudonymisation; local identifier; identifier domain.

I. INTRODUCTION

This article is an extended version of [1], which covers the algorithm for the generation of pseudonyms with a limited number of bits. Pseudonymisation is a process in which the demographics and identifiers of a person are removed from an information record and replaced by a pseudonym. This step is required to protect the privacy of patients in cases of secondary usage of medical data, e.g., for research or statistical purposes. In these cases, knowledge about the identity of the person is unnecessary and must therefore be protected against disclosure. In contrast to anonymisation, a pseudonym allows data from several sources to be linked to the same person, which helps to improve the quality of the research or statistics.

An example of the need for pseudonymisation is the storage of medical data, samples, blood, and urine in biobanks. Researchers are not interested in the identity of the person behind this material. A pseudonym is needed to link all samples that have been taken from the same person at different locations and during different collection events. The pseudonym not only allows the linkage to the same person but also protects the identity of the patient behind the sensitive data.

One part of this article describes a generalized concept of how identities of patients and their pseudonyms are used and managed (including identity matching and linkage of identifiers from different domains) to securely exchange data. Although these problems are discussed in many publications (e.g., [2] and [3]), this article gives a generalized overview of how a source-destination relation can be defined. The main idea behind the generalization is the concept of local identifiers of identifier domains that are either bound to demographics or not.
With the generalization of pseudonyms as local identifiers in a domain without demographics, transitions of identifiers between certain identifier domains become only a matter of permissions, e.g., permission to pseudonymise, permission to re-identify. So the main cases that are discussed in the article differentiate the variations of visibility of demographics, local identifiers and pseudonyms amongst the source of data and the destination storage. All cases can be implemented by the use of a pseudonymisation service as a trusted third party. The article defines the fundamental services of the pseudonymisation service that are needed to treat all identified cases. They have been specified for the National Pseudonymisation Service of Luxembourg, which is solely responsible for the management of persons and the transition of the identifiers between the different identifier domains. The National Pseudonymisation Service will not perform pseudonymisation on medical data, nor will it have access to medical data. With the provisioning of demographics in a certain domain (e.g., hospital, laboratory), the introduction of faulty data is likely. The update of such data might lead to a revised decision at the National Pseudonymisation Service, i.e., demographics from a certain domain now match a different

known person, or it is assumed that the person is not yet known. This has consequences on the destination side and requires an update of the pseudonym for some of the stored data. With the introduction of persistent identifiers that are linked to the initial matching decision, it becomes possible to update only the pseudonyms that are concerned.

Central or national pseudonymisation services run as Trusted Third Parties, for example in the Netherlands (ZorgTTP [4]); in Germany, the Patient Identifier (PID) generator in combination with a pseudonymisation service of TMF (Telematikplattform für Medizinische Forschungsnetze e.V.) is well known [5]. These solutions mainly provide global person identifiers for identified persons, which can be used to create domain-specific pseudonyms. Mechanisms and information to handle faulty matching decisions after the update of demographics are not foreseen. In the TMF solution, the visibility of the demographics and the pseudonym at source and destination is restricted by passing the (encrypted) medical data, together with the global identifier (from the PID generator), through the pseudonymisation service. Such a setup at national level would require that the National Pseudonymisation Service be able to access services in the research domains in order to push the pseudonymised data to them. As a consequence, researchers would need to maintain a service in their Demilitarized Zone (DMZ) that is able to receive the pseudonymised data. In the proposed solution, the pseudonymisation service acts only as a passive service that can be accessed from intranets without the need for a DMZ. Additionally, the solution does not need to pass medical data through the service and is therefore able to handle more requests per unit of time.
An alternative to the use of a National Pseudonymisation Service is a local in-house pseudonymisation, which means that the pseudonym is calculated either at the data source or at the storage destination from a given person identifier without the use of an external service. In such a setup, no matching decisions take place, and each person requires a stable person identifier.

In both cases (National Pseudonymisation Service or in-house pseudonymisation), the pseudonym number itself has to be calculated or determined at some point in time. There are several options to create a pseudonym from a given set of demographics. Some of these techniques are based on hashing or encryption of a unique identifying number of the person. Others simply choose a random number and link this number with the identity. Current hashing and encryption algorithms work with 128 bits minimum, which might be too much in some cases, e.g., if the pseudonym must be a 31-bit unsigned integer. In that case, the outcome of the process must be cropped to the desired bit length, which leads to an unpredictable risk of pseudonym collisions.

Research that takes smaller numbers of bits into account is known as small-domain pseudo-random permutation or small-domain cipher (e.g., [6][7][8]). Solutions based on this research use techniques that are also used in symmetric encryption (e.g., Advanced Encryption Standard AES [9]) or hashing algorithms (e.g., Secure Hash Algorithm SHA [10]): permutation, rotation, transformation, and diffusion of the given bit set of data. A similar research area that uses the same tools is Format Preserving Encryption (FPE) (e.g., [11]). Here, more emphasis is placed on the format of the encrypted block of data, which also includes the format at character or word level. The FALDUM Code [12], as another example, tries to create a code with error correction properties and good readability.
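To give a feeling for the collision risk of cropping mentioned above, the following minimal sketch crops a SHA-256 digest to 31 bits and estimates the collision probability with the standard birthday approximation. The function names, the choice of SHA-256 and the assumption of uniformly distributed cropped values are illustrative assumptions, not part of the article.

```python
import hashlib
import math


def cropped_hash(person_id: str, bits: int = 31) -> int:
    """Hash the identifier with SHA-256 and keep only the leading `bits` bits."""
    digest = hashlib.sha256(person_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)


def collision_probability(n: int, bits: int = 31) -> float:
    """Birthday approximation: probability that at least two of n
    uniformly distributed cropped values coincide."""
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))


if __name__ == "__main__":
    print(cropped_hash("patient-42"))  # hypothetical identifier
    for n in (1_000, 10_000, 100_000):
        print(n, round(collision_probability(n), 4))
```

Under this approximation, a population of about 100,000 persons already yields a collision probability of roughly 0.9 in a 31-bit space, which illustrates why a collision-free construction is attractive for small number spaces.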
For all proposals, it is difficult to estimate how secure these algorithms ultimately are and how difficult it is to recompute the person identifier from a given pseudonym. Cryptanalysis of existing symmetric encryption and hashing algorithms has shown that weaknesses can be found years after an algorithm has been proposed (e.g., [13]). Therefore, an alternative pseudonym calculation algorithm is proposed that calculates pseudonyms from a person identifier on the basis of a chosen primitive root of a fixed prime number. This calculation is more similar to asymmetric encryption techniques (e.g., the RSA algorithm [14]) or the Diffie-Hellman key exchange protocol [15]. The algorithm guarantees a collision-free pseudo-random distribution of the pseudonyms. The pseudonymisation algorithm acts as a one-way function if all of the calculation parameters are kept secret.

The article is structured as follows: In Section II, Methods, the concept of identifiers of persons and identifier domains and its relation to pseudonyms is introduced. Then, the main cases of data transmission between a source system and a destination storage are listed, including the different visibilities of local identifiers at source and destination. Persistent identifiers are introduced to solve two problematic cases that might become relevant when demographics are updated. Next, the number space of identifiers and the existence of demographics in a certain identifier domain are considered. In a setup with a National Pseudonymisation Service, properties and permissions of systems and domains have to be defined. Finally, the main identity-related services of the National Pseudonymisation Service are outlined. Since the National Pseudonymisation Service uses an existing Master Patient Index for matching decisions, aspects of this relation are discussed.
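As an aside to the outline, the number-theoretic idea behind the primitive-root calculation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions and not the article's actual algorithm, whose details are not given in this section: the prime 2^31 - 1, the base root 7, the exponent 65537 and all function names are hypothetical choices. Because a primitive root g generates every value in 1..p-1 exactly once, the mapping id -> g^id mod p is collision-free on that range.

```python
import math

P = 2**31 - 1  # assumed fixed prime; results fit into 31 bits
# Prime factors of P - 1 = 2 * 3^2 * 7 * 11 * 31 * 151 * 331
PRIME_FACTORS = (2, 3, 7, 11, 31, 151, 331)


def is_primitive_root(g: int, p: int, prime_factors) -> bool:
    """g is a primitive root mod p iff g^((p-1)/q) != 1 for every prime q dividing p-1."""
    return all(pow(g, (p - 1) // q, p) != 1 for q in prime_factors)


def derive_root(base_root: int, k: int, p: int) -> int:
    """If base_root is a primitive root and gcd(k, p-1) == 1,
    then base_root^k mod p is a primitive root as well."""
    if math.gcd(k, p - 1) != 1:
        raise ValueError("k must be coprime to p - 1")
    return pow(base_root, k, p)


def pseudonym(person_id: int, root: int, p: int = P) -> int:
    """Map person_id in 1..p-1 to root^person_id mod p (a permutation of 1..p-1)."""
    if not 1 <= person_id <= p - 1:
        raise ValueError("person_id must lie in 1..p-1")
    return pow(root, person_id, p)


if __name__ == "__main__":
    secret_root = derive_root(7, 65537, P)  # illustrative parameters only
    assert is_primitive_root(secret_root, P, PRIME_FACTORS)
    print(pseudonym(123456, secret_root))
```

With a modulus this small, the discrete logarithm is easy to compute once the root and the prime are known, which is consistent with the statement above that the calculation only acts as a one-way function while the parameters are kept secret.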
The section ends with a discussion of the creation of new local identifiers, especially the calculation of local identifiers with a small number of bits. In Section III, Results, the use of the National Pseudonymisation Service and of an in-house pseudonymisation solution is shown for existing use cases that have already been implemented or are being planned. The paper ends with Section IV, Conclusion and Future Work, in which the positive effects of the proposed solutions for researchers are outlined.

II. METHODS

The generalized concept of a National Pseudonymisation Service (NPS) and of an in-house pseudonymisation solution is based on use cases that have been identified by questioning various researchers in the field of clinical and population

based studies. First, some terms must be clarified; later, these cases will be discussed.

Figure 1. Identities and Domains

A. Identifier of persons and identifier domains

In the digital world, the use of identifiers for persons is quite common. Unique identifiers simplify the linkage of data of the same person. This linkage is quite complicated if only demographics (e.g., name, address, birthday) are given.

1) Local identifier and identifier domain

The concept of (local) identifiers for persons that are only valid in a certain local context is one of the basic concepts of the IHE Patient Identifier Cross Referencing (PIX) Integration Profile [16][17], as it is implemented inside hospitals or laboratories. Usually, different systems (e.g., storage systems, imaging systems) inside the same institution use different identifiers to identify the same person. The Patient Identifier Cross-reference Manager enables the systems to communicate with each other, even if they use different identifiers for the same person. This is solved by so-called identifier domains for the different systems. Usually, the same person should have only one identifier inside an identifier domain.

This concept can be used not only for the exchange of data inside an institution but also between different institutions (different domains), in which a person has different patient identifiers (local identifiers). The local identifier of a person in one domain is different from the local identifier of the same person in another domain. Without the help of the Patient Identifier Cross-reference Manager, it is difficult to translate the link of persons between the two domains.

The concept of local identifiers and identifier domains is used in the concept of the National Pseudonymisation Service. Identifier domains not only describe institutions but might also identify applications or application contexts, e.g., a national laboratory application or a clinical study about cancer.
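The cross-referencing of local identifiers between identifier domains described above can be sketched as a simple lookup structure. This is a conceptual toy and not the IHE PIX interface; the class name, method names, domain names and identifier values are all hypothetical.

```python
from collections import defaultdict


class CrossReferenceManager:
    """Toy stand-in for a patient identifier cross-reference manager."""

    def __init__(self):
        self._person_of = {}               # (domain, local_id) -> internal person key
        self._ids_of = defaultdict(dict)   # internal person key -> {domain: local_id}

    def register(self, person_key, domain, local_id):
        """Bind a local identifier of one domain to a person."""
        if domain in self._ids_of[person_key]:
            raise ValueError("person already has an identifier in this domain")
        if (domain, local_id) in self._person_of:
            raise ValueError("local identifier already in use in this domain")
        self._person_of[(domain, local_id)] = person_key
        self._ids_of[person_key][domain] = local_id

    def translate(self, from_domain, local_id, to_domain):
        """Return the person's local identifier in to_domain,
        or None if the person is not known there."""
        person_key = self._person_of[(from_domain, local_id)]
        return self._ids_of[person_key].get(to_domain)


if __name__ == "__main__":
    xref = CrossReferenceManager()
    xref.register("person-1", "imaging", "IMG-17")
    xref.register("person-1", "laboratory", "LAB-903")
    print(xref.translate("imaging", "IMG-17", "laboratory"))
```

The sketch enforces the rule that a person has at most one identifier per domain; the internal person key never leaves the manager, so two systems only ever see each other's local identifiers.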
The identifier domain is usually identified by a unique OID (Object Identifier) [18].

Figure 2. Different cases of transmitting data

2) Pseudonym

In the proposed concept, a pseudonym is seen as a local identifier inside an identifier domain where no demographics are available. Pseudonyms from different domains must be different. Having a local identifier from one domain must not allow the pseudonym from another domain to be calculated, unless the domain is responsible for the creation of the pseudonym. This requirement ensures that it is not possible to break the pseudonymisation on the basis of known identifiers.

3) 2nd-level pseudonym

Like a pseudonym, a 2nd-level pseudonym is only a local identifier in a certain identifier domain where no demographics are available. In this case, the source of data is a domain that identifies persons by pseudonyms and not by demographics. 2nd-level pseudonyms in an identifier domain can be linked to the same person, even if the 1st-level pseudonyms come from different domains.

Example (Figure 1): Medical data of a person are sent to Biobank A, which works with 1st-level pseudonyms. Medical data of the same person are sent to Biobank B, which works with different 1st-level pseudonyms. Data of both biobanks are sent to a researcher who works with 2nd-level pseudonyms. The researcher is able to link data of both biobanks to the same person if the 2nd-level pseudonym is the same. It is clear that such a scenario in reality requires approval by ethics commissions or data protection authorities.

B. Main cases of data transmission

With the terms local identifier and identifier domain established, it is possible to describe the main cases of transmitting data between a source and a destination. The described use cases cover cases that include the use of a

National Pseudonymisation Service and the use of an in-house pseudonymisation, with a stronger focus on the design of the National Pseudonymisation Service.

In the proposed setup, the communication between source and destination systems is direct, so no system that modifies the transmitted data is involved during the transmission of medical data between source and destination. This is true not only for the in-house pseudonymisation but also for the use of the National Pseudonymisation Service. The National Pseudonymisation Service is defined solely as a passive service that is used to identify persons and to request or manage local identifiers. It will not allow medical data to be passed through it from source to destination. Also, it will not perform a pseudonymisation of medical data on the fly (i.e., replace demographics in the medical data by pseudonyms).

The question that results from these restrictions is: What information is sent from source to destination (apart from the medical data) that allows the mapping of the medical data to a certain person at the destination? There are several options to answer this question. The five possible cases that describe these options are shown in Figure 2:

A. Demographics of the person are exchanged.
B. The local identifier from the destination domain is exchanged.
C. The local identifier from the source domain is exchanged.
D. The local identifier from a third domain is exchanged.
E. A warrant is exchanged that can be used by the destination to request the local identifier from its domain by showing the warrant.

None of the cases makes an assumption about how the data and information are transmitted between source and destination. The transmission does not have to be electronic only. Alternatively, this data could be sent by means of physical objects (e.g., as a barcode on paper or on a box).
1) Case A: Demographics of the person are exchanged

In this case, private data of a person is exchanged between source and destination together with the demographics of the person. The destination is therefore forced to link data from the same identity on the basis of the given demographics. This could be done by the use of a local Master Patient Index (MPI) or by the use of a National Pseudonymisation Service. In any case, it is clear that the resulting local identifier is not a pseudonym, as the identity of the person is known.

2) Case B: The local identifier from the destination domain is exchanged

In this case, private data of a person is exchanged between source and destination with the person's local identifier of the destination domain attached to it. The source therefore needs to calculate, determine or know the local identifier of the destination domain on the basis of its own local identifier or the known demographics of the given person. Alternatively, it needs to ask the National Pseudonymisation Service to provide this identifier of the destination. As a consequence, all source systems from all source domains will know the local identifiers or pseudonyms from the destination domain, but not vice versa. If a National Pseudonymisation Service is used, the source systems need permission to request the local identifier of the destination domain on the basis of their own local identifier or demographics.

3) Case C: The local identifier from the source domain is exchanged

In this case, private data of a person is exchanged between source and destination with the person's local identifier at the source domain attached to it. As a consequence, the destination system of the destination domain will know the local identifiers from the source domain, but not vice versa, so the local identifier or pseudonym that is used at the destination is hidden from all sources.
With in-house pseudonymisation, this case only makes sense if there is a single source; otherwise, it will be impossible to link identifiers from different sources to the same person. This limitation does not exist in a setup with a National Pseudonymisation Service, for which the destination needs permission to translate the local identifiers of the sources to its local domain identifier.

4) Case D: The local identifier from a third domain is exchanged

This case introduces a third identifier domain. It makes sense if such a third domain is created especially for the exchange between source and destination and is used nowhere else. In such a setup, local identifiers from a source will not be disclosed at the destination and vice versa. Source and destination systems must use the identifier of the third domain only during the exchange of the private data and not for the storage of the private data. This case allows different variations by using in-house pseudonymisation or the National Pseudonymisation Service for the transition of the identifier from the source to the third domain, and from the third domain to the destination domain. As for Case C, an in-house pseudonymisation between sources and third domain is only useful if there is a single source domain, because it is impossible to define a calculation or determination process that would translate local identifiers of the same person from different sources into the same identifier in the third domain. The translation between the identifier of the third domain and that of the destination domain can be performed in-house or at the National Pseudonymisation Service.

5) Case E: A warrant is exchanged

In this use case, private data of a person is exchanged between source and destination with a warrant attached to it. This use case requires the use of the National Pseudonymisation Service and does not work with in-house pseudonymisation.
The warrant is created and/or managed by the National Pseudonymisation Service on the basis of information provided by the source (e.g., local identifier or demographics). The

destination will then be able to retrieve the identifier belonging to the destination domain on the basis of the warrant. In this case, the source does not know the local identifiers or pseudonyms in the destination domain, and the destination does not know the demographics at the source domain. In contrast to Case D and the use of identifiers from a third domain, the warrant can be managed by the source and might be defined with a time-to-live. A warrant should not be used as a replacement for a local identifier, because warrants are not unique for the same person. Additionally, the National Pseudonymisation Service might delete the warrant from its systems after use. The warrant-based approach might also be used for the re-identification of patients. In that case, a warrant is requested by the destination on the basis of the pseudonym, and the source is able to re-identify the patient on the basis of the warrant.

C. Identifier management

Usually, hospital information systems or equivalent systems manage local identifiers for patients themselves. In this case, the local identifier is created inside the identifier domain of the data source. The identifier domain guarantees that the person behind the local identifier never changes, even if the demographics of that person change significantly. This is an important requirement. In the future, two identifiers might be merged because they have been identified as duplicates of the same person. But an identifier never changes its link to the individual person.

Not all identifier domains manage identifiers themselves. As an example, collection sites of a clinical study might be located at hospitals but have no access to the hospital information systems and therefore not to the local identifiers of that hospital. In that case, a new local identifier has to be created for the collection site domain. The National Pseudonymisation Service can take over this task on the basis of given demographics. In the National Pseudonymisation Service, it must be configured for each data source whether the source creates and manages identifiers itself or whether the National Pseudonymisation Service has to take responsibility for this.

Figure 3. Persistent identifier and initial identification

D. Persistent local identifier

The National Pseudonymisation Service decides, given a set of demographics, whether these demographics match the demographics of a known person or not. If the demographics of a person are updated at a source, this might lead to a different matching decision at the National Pseudonymisation Service, so that the demographics are linked to a different person. Sources that manage local identifiers in their domain are not affected by this decision, because the local identifier of the person at the source will not change. For sources and destinations whose local identifiers are managed by the National Pseudonymisation Service, things are different: some data sets with an associated local identifier might need to be changed in a way that reflects the new matching decision, i.e., the local identifiers of some data sets need to be updated too. Identifying these data sets on the basis of the current local identifier is not sufficient, because for some data sets from different sources the change must not be performed.

To solve this issue, an additional persistent local identifier is introduced. The persistent local identifier will never change, regardless of updates of demographics. It can be used to provide update information for exactly those entities that are affected by the update decision. The persistent identifier is an addition to the local identifier inside an identifier domain and is linked to the demographics that were used during the first identification step of the demographics from a source at the National Pseudonymisation Service.

Figure 4. Problematic case: Update affects destination

At some point in time, a data source needs to identify the demographics of a person at the National Pseudonymisation Service, either to make it aware of the local identifier in the source domain or to request a local identifier on the basis of the demographics. The persistent identifier is bound to that identification process.

In the example of Figure 3, Source A identifies John Doe together with its local identifier. Medical data that is sent from Source A to the destination is linked to that person at the destination via the local identifier 0167 and the persistent identifier. In the same example, Source B identifies the demographics of Sam Dae without a locally managed local identifier. This source will receive the local identifier 6724 from the National Pseudonymisation Service plus a persistent identifier. If Source B identifies a person with the same demographics in the future, it will receive the same local identifier 6724, but always with a different persistent identifier.

E. Problematic cases

The persistent identifier can be used to solve two problematic cases:
- Update affects destination
- Update affects source

1) Update affects destination

Two sources from different domains (example in Figure 4: Source A and Source B) provide demographics that lead to the same local identifier/pseudonym at the destination (0167). Then one of the sources (Source A) updates the demographics, and the National Pseudonymisation Service decides that the previous matching decision was wrong and that the new demographics belong to a different person. The local identifier (1234 of Source A) is therefore re-linked in the National Pseudonymisation Service to a different or new person. On the basis of the persistent identifier (3465), the destination can be informed to update the local identifier (0245). This affects only the medical data that has its origin in Source A.
One could argue that a persistent identifier could be avoided if the destination stored information about the source domain together with the local identifier. In the example, an update would then be: "Update data from Source A with local identifier 0167 to the new local identifier". This argument is true, but there are good data protection arguments for hiding the origin of the data at the destination. The persistent identifier in that case acts as a pseudonymisation of the source.

2) Update affects source

The National Pseudonymisation Service manages the local identifiers of a source domain (example in Figure 5: Source A), so the National Pseudonymisation Service provides the local identifier plus a persistent identifier after the identification of demographics. During two independent events, demographics are identified at the National Pseudonymisation Service that lead to the same local identifier at the source (1234) but with different additional persistent identifiers (2347, 5678). Later, one set of demographics (identified by local identifier plus persistent identifier: 1234 / 2347) is updated, and the National Pseudonymisation Service decides that the demographics belong to a different person than previously assumed (Figure 6). The data that was provided in that event at Source A therefore has the wrong local identifier and needs to be changed to the new local identifier (5667). This change does not affect the local identifier from the second event.
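Both problematic cases reduce to the same operation on the stored data: relabel exactly those records that are bound to one persistent identifier. The following minimal sketch illustrates this with the identifier values from the two examples above. The `Record` type, the `relink` function and the persistent identifier 7788 for Source B are hypothetical; the article does not specify a record layout or an update interface.

```python
from dataclasses import dataclass


@dataclass
class Record:
    local_id: str        # local identifier or pseudonym in the storing domain
    persistent_id: str   # bound to the initial identification event
    payload: str         # stands in for the medical data


def relink(records, persistent_id, new_local_id):
    """Apply an update notice: only records bound to persistent_id
    receive the new local identifier. Returns the number of changed records."""
    changed = 0
    for record in records:
        if record.persistent_id == persistent_id:
            record.local_id = new_local_id
            changed += 1
    return changed


if __name__ == "__main__":
    # Update affects destination: both sources mapped to pseudonym 0167.
    destination = [
        Record("0167", "3465", "data originating from Source A"),
        Record("0167", "7788", "data originating from Source B"),  # 7788 is made up
    ]
    relink(destination, "3465", "0245")
    print([r.local_id for r in destination])

    # Update affects source: two events with the same local identifier 1234.
    source_a = [
        Record("1234", "2347", "first collection event"),
        Record("1234", "5678", "second collection event"),
    ]
    relink(source_a, "2347", "5667")
    print([r.local_id for r in source_a])
```

Note that the destination never needs to store which source a record came from; the persistent identifier alone selects the affected records, in line with the data protection argument made above.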

Figure 5. Problematic case: Update affects source (initial state)

One might argue that such a use case is not likely, especially for information systems inside the source domain. In clinical studies, however, the management of identities sometimes depends on paper, Excel sheets or other unreliable tools. It is therefore not an unrealistic scenario that nurses collect samples at different collection events and use the National Pseudonymisation Service to retrieve a local identifier on the basis of re-typed, erroneous demographics, which later need to be updated.

F. Avoidance of persistent local identifiers

If the local identifiers of a source domain are managed by the National Pseudonymisation Service, the use of persistent local identifiers is one way to manage updates. An alternative approach avoids them: each matching request at the National Pseudonymisation Service that is performed without a local identifier leads to a new local identifier, regardless of whether the demographics match a known person or not. As with local identifiers that are managed by the sources, this identifier never changes, even after an update of the demographics. The National Pseudonymisation Service can be asked which local identifiers it assumes to belong to the same identity. This list might change after the demographics are updated for a given local identifier.

So there are two options to treat potential update problems in domains that do not create or manage local identifiers: either a persistent identifier is provided together with the local identifier, or a new local identifier is always created, even if the National Pseudonymisation Service assumes that the demographics belong to a known person.

G. Identifier domain and identifier number space

If the local identifier is created and managed by the National Pseudonymisation Service, it is suggested that the number is a purely (pseudo-)random integer from the range zero to a maximal number. It does not include any information that is linked to the demographics of the person. The maximum has to be defined per identifier domain at the National Pseudonymisation Service. It is up to the users of the local identifier whether they encode the number into a character representation or add error correction or error detection codes, e.g., to make it human readable. During communication with the National Pseudonymisation Service, only the integer representation must be used.

H. Identifier domain and availability of demographics

One can distinguish between domains where demographics are available and domains where they are not. Usually, domains without demographics are those where the identifier is seen as the pseudonym, but this is not always the case. There are cases where an identifier is linked to a person while the demographics of that person are not available. An example is the domain of health professionals: the ehealth ID of a health professional is not a pseudonym, but access to the demographics of the health professional is not necessarily available at the source.

This case is important, as a pseudonymisation of identifiers from a domain without demographics is generally possible (e.g., to pseudonymise the ehealth ID of health professionals). Since such identifiers are registered at the National Pseudonymisation Service without any demographics, a link to an existing person is never possible. To stay with the example: pseudonyms of health professionals are never linkable to pseudonyms of patients, even if the health professional and the patient are the same person.
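Returning to the number space of Section G, the following minimal sketch draws a random, unused integer from the range zero to a per-domain maximum and shows one optional character representation. The function names, the in-memory `used` set and the zero-padded display format are assumptions for illustration; the concept only requires that the integer itself carries no demographic information.

```python
import secrets


def new_local_identifier(max_id, used):
    """Draw an unused (pseudo-)random integer from 0..max_id."""
    if len(used) > max_id:
        raise ValueError("identifier domain is exhausted")
    while True:
        candidate = secrets.randbelow(max_id + 1)
        if candidate not in used:
            used.add(candidate)
            return candidate


def to_display(local_id, width=4):
    """Optional character representation for human readers (zero padded)."""
    return str(local_id).zfill(width)


used = set()
ids = [new_local_identifier(9999, used) for _ in range(100)]
```

Only the integer would be exchanged with the service; `to_display(167)` yields the form `0167` used in the examples.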

Figure 6. Problematic case: Update affects source (after update)

I. Identity Linking

If a source finds out that two local identifiers belong to the same person (even if the demographics are different), the source can perform a linkage request to tell the National Pseudonymisation Service that one local identifier will never be used again and that all data linked to the obsolete identifier belongs to the surviving local identifier.

Properties and permissions

Figure 7 gives an overview of the relationship between properties and rights concerning systems and identifiers. A system, e.g., a server, client application or user, belongs to one or more identifier domains. For a specific domain it is defined whether a system has certain permissions or not:

Provide demographics: Not all systems inside a domain should be allowed to provide demographics to the National Pseudonymisation Service. Some systems are only allowed to use the local identifier inside that domain.

Update demographics: In case of first contact, some systems must be allowed to provide demographics to the National Pseudonymisation Service. The update of demographics is a critical task that should only be permitted to selected systems.

Link identifier: Similar to the update of demographics, the linking of identifiers is a rare case that should only be done after the identification of duplicates in the local system is beyond question.

Retrieve demographics: This is the most critical task in the whole concept of the National Pseudonymisation Service. Retrieval of demographics on the basis of a given local identifier should only be possible in rare cases, e.g., re-identification of persons in case of important notifications. For reasons of data protection, the National Pseudonymisation Service will only provide the latest version of the demographics that has been provided by a system in that domain. Demographic details from other domains will not be accessible. This permission also only provides data if the domain itself manages demographics. Since domains that only have access to pseudonyms never provide demographics to the National Pseudonymisation Service, the retrieval of demographics in such a domain is excluded.

For a specific domain, properties define whether it is a source domain with demographics or a destination domain with pseudonyms:

Demographics available: A domain with demographics available is a source domain. Usually, a domain without demographics is not a source domain (except in a relation of 1st level pseudonym to 2nd level pseudonym).

Identifier managed by source: For source domains, it has to be defined whether the local identifier is managed inside the domain or has to be provided by the National Pseudonymisation Service. In the second case, it must be defined whether a persistent identifier is used to manage update conflicts or whether a new local identifier is always used. For destination domains without demographics, the National Pseudonymisation Service always manages the local identifier. It is not possible that the domain itself manages pseudonyms.

Number range of local identifier: Section G (Identifier domain and identifier number space) explains why the National Pseudonymisation Service only manages numbers as local identifiers. This property defines the range of the number space.

Figure 7. Properties and permissions

Notification required: Systems might fail or crash at the wrong moment. Some tasks might require a notification to ensure that the involved systems have stored the result of a request in their databases. If a system confirms a certain result, the responsibility for the use of the information moves from the National Pseudonymisation Service to the notifying system.

In the use cases described in Section B (Main cases of data transmission), all cases have a from-to relation with regard to the translation of local identifiers or the creation of warrants. These relations need to be defined as permissions for direct or warrant-based translations in the National Pseudonymisation Service.

Case B: A source system from a source domain has permission to translate its local identifier (From-Domain) directly to the destination domain (To-Domain).

Case C: A system from a destination domain has permission to translate the local identifier from a source domain (From-Domain) directly to its local identifier (To-Domain).

Case D: A source system from a source domain has permission to translate its local identifier (From-Domain) directly to a third domain (To-Domain), and a system from a destination domain has permission to translate the local identifier from a third domain (From-Domain) directly to its local identifier (To-Domain).

Case E: A direct translation of local identifiers is not permitted, so a translation requires the use of a warrant. A system of the source domain (From-Domain) has permission to create a warrant (Warrant by Source) or retrieve a warrant (Warrant by NPS) for the destination domain (To-Domain).

The data model of Figure 7 allows the definition of permissions that are not useful. In direct translations, the system must belong either to the from-domain or to the to-domain. In warrant-based translations, the system must belong to the from-domain.

J. Identity related services

As a result of the previous sections, the following services are required at the National Pseudonymisation Service. Services that are needed in the in-house setup are explicitly named. For simplicity, persistent local identifiers are mentioned in most of the descriptions, but it depends on the definition of the domain whether a persistent identifier has to be used and whether it is returned. Services for the notification of the reception of identifiers and warrants are not listed.

All functions require the "Identifier Domain" parameter. This parameter identifies the current domain of the calling system, since systems might belong to several domains. Each service is given as name (parameters) → result.

1) Register a person by demographics

Register Person (Identifier Domain, Demographics) → Local/Persistent Identifier

Returns a local identifier on the basis of demographics.

2) Register a person by demographics and local identifier

Register Identified Person (Identifier Domain, Local Identifier, Demographics)

In domains that manage the local identifiers by themselves, the service makes the National Pseudonymisation Service aware of the local identifier and its associated demographics in that identifier domain. No persistent identifiers are provided in self-managed domains. This function does not return any result.

3) Update of demographics of a person

Update Person (Identifier Domain, Local/Persistent Identifier, Demographics) → Local Identifier

If a local identifier has already been registered at the National Pseudonymisation Service, or the National Pseudonymisation Service has returned a local/persistent identifier, this function is used to update the demographics. This might lead to an update of the local identifier (see E.2).

4) Translate identifier at the source domain

Translate Identifier (Local Identifier Domain, Foreign Identifier Domain, Local/Persistent Identifier) → Foreign/Persistent Identifier

This is a simple translation of identifiers between the local (source) and the foreign (destination) domain. It is mainly the function that is needed in the in-house setup, so the local identifier of the foreign domain is calculated or determined on the basis of the local identifier only.

5) Translate identifier at the destination domain

Retrieve Identifier (Local Identifier Domain, Foreign Identifier Domain, Foreign/Persistent Identifier) → Local/Persistent Identifier

This service is similar to 4), but here the destination domain is calling the service. This changes the focus of the local-foreign relation: the destination requests its local identifier on the basis of the foreign identifier of the source.

6) Register a warrant, associated to a local identifier

Register Warrant (Local Identifier Domain, Foreign Identifier Domain, Local/Persistent Identifier, Warrant)

This function registers a warrant for a foreign domain with a given local identifier. In this case, the warrant is managed (provided) by the source. The warrant is only valid in the foreign domain to retrieve the local identifier of that domain.

7) Request a warrant, associated to a local identifier

Request Warrant (Local Identifier Domain, Foreign Identifier Domain, Local/Persistent Identifier) → Warrant

This function requests a warrant for a foreign domain with a given local identifier. In this case, the warrant is managed (provided) by the National Pseudonymisation Service. The warrant is only valid in the foreign domain to retrieve the local identifier of that domain.

8) Retrieval of the local identifier at the foreign domain on the basis of a warrant

Redeem Warrant (Local Identifier Domain, Warrant) → Local/Persistent Identifier

Given a warrant of the correct domain, this service allows the retrieval of the identifier in that domain.

9) Re-identification of demographics on the basis of a local identifier

Re-Identify Person (Local Identifier Domain, Local/Persistent Identifier) → Demographics

In case of re-identification requests, this service only provides the latest version of the demographics that have been registered in that domain. Demographics from different domains related to the same person are not accessible. This service might also be useful in the in-house setup.

10) Linking of local identifiers, in case of identified duplicates

Link Local Identifier (Local Identifier Domain, Obsolete Local Identifier, Surviving Local Identifier)

If a source that manages the local identifiers itself identifies duplicates, it should use this function to inform the National Pseudonymisation Service about the merge in the local (in-house) system.

11) Get updates of identifiers in the domain

Get Updates (Local Identifier Domain) → List of [Persistent Identifier: New Local Identifier]

In case of updates at the National Pseudonymisation Service, local identifiers might change for some data (see E. Problematic cases). The National Pseudonymisation Service is a passive service, so it only responds to requests. Identifier domains must call this service regularly to learn about the updates since their last request.

12) Identification of potential duplicates

Vigilance Request (Local Identifier Domain, Local/Persistent Identifier 1, Local/Persistent Identifier 2)

On the basis of the medical data, a destination domain might come to the conclusion that the given local identifiers are potential duplicates and belong to the same person. An alert is triggered at the identity vigilance of the National Pseudonymisation Service to check the case.

13) Identification of potential splits

Vigilance Request (Local Identifier Domain, Local Identifier, Persistent Identifier 1, Persistent Identifier 2)

On the basis of the medical data, a destination domain might come to the conclusion that the given local and persistent identifiers are potential splits and should belong to different persons. An alert is triggered at the identity vigilance of the National Pseudonymisation Service to check the case.

K. Matching of identities

The National Pseudonymisation Service uses an underlying Master Patient Index to find out whether the given demographics of a person are known (match) or identify an unknown person (no-match). The matching algorithm depends on mandatory demographics (first name, last name, gender, and birthday) and optional demographics (national social security number, zip code of the birthplace). Depending on the degree of agreement, the algorithm distinguishes true matches (the person is known with high probability), true non-matches (the person is unknown with high probability), and ambiguous matches (there is more than one potential candidate, or it is not clear whether the person is known or not). If the decision is not clear (ambiguous match), a new person is created in the system, and the identity vigilance is informed to solve the problem by requesting additional information from the involved domains.

Since the National Pseudonymisation Service acts as a shell around an existing Master Patient Index, the Master Patient Index service can be replaced at any time without affecting the pseudonymisation service.

L. Calculation of local identifiers

An important part of the entire process of identifying persons is the creation of the local identifiers of a domain. This calculation has to be done at the National Pseudonymisation Service, or locally in the in-house solution, for new persons or for persons that are accessed for the first time by an identifier domain. Domains that provide their own local identifiers are not affected by this question.
Each person that is managed by the National Pseudonymisation Service (or internally by its Master Patient Index) is represented by a person object. This object consists of the single best record of the demographics of the person plus an internal identifier of the object. Each local identifier of an identifier domain is linked to that object via the internal object identifier. The link is established during the registration of the person or the translation of identifiers between different domains. If an identifier does not exist at that time, it must be created.

In the in-house solution, the managed persons are usually stored inside a database with an associated person identifier. This might be an attribute of the database table, an identifier that was recorded together with the demographics (e.g., social security number), or a pseudonym that was already given with the data. If personal data needs to be delivered to a certain domain, the domain-specific identifier needs to be created, if this has not been done already.

In both cases, there are several options to create the identifier from the person identifier (object identifier, person identifier, pseudonym, social security number, etc.):

Take the next free number (last used number plus 1): In this case, all created numbers form a continuous running number. This must be avoided if the identifier is used as a pseudonym. If the original identifiers are already continuous numbers, a link could be established between the identifier and the time of creation of the person inside the system.

Choose a random number: The use of random numbers should be the preferred choice, but it requires the management of mapping tables (local identifier → person identifier). Such mapping tables can be used for the selective anonymisation of individuals: if the entry (local identifier → person identifier) is replaced with (local identifier → NULL), data that is stored with the local identifier can never be linked to the person again.

Calculate the identifier from the person identifier: If the management of mapping tables must be avoided (especially in the in-house setup) and selective anonymisation is not required, the calculation of the local identifier on the basis of the person identifier together with a secret is a good alternative to the random number. Good strategies are salted hashes (Hash(Salt + person identifier)) or encryption (Enc(Key, person identifier)). In both cases, the salt or the key is the secret that is linked to the identifier domain. This strategy is problematic if the calculated local identifier is limited by its data type. Example: the person identifier at the source is a 4-byte unsigned integer (32 bit), and the resulting local identifier must be of the same data type. Current hashing or encryption algorithms usually work with a minimum of 128 bit, so they are not suitable in the described case. Cropping the result to 32 bit is not an option, because this introduces a risk of collisions: for some person identifiers, the calculated local identifier would be the same. This behavior cannot be tolerated. For this special case, a new calculation algorithm is proposed.

M. Calculation of local identifier with small number of bits

The mathematics behind the calculation of the local identifier from a person identifier is based on selected primitive roots of fixed prime numbers, as used in the Diffie-Hellman protocol to ensure a secure key exchange [15]. First, some fundamental mathematics needs to be introduced.

1) Discrete logarithm

Given the equation

$b = a^i \bmod p$, with $p$ prime, $i \in \{1..p-1\}$ (1)

$i$ is called the discrete logarithm. This is equivalent to

$i = \log_a b \bmod p$, $i \in \{1..p-1\}$ (2)
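To make definitions (1) and (2) concrete, the following sketch computes $b$ from $i$ with Python's built-in three-argument `pow` and recovers $i$ from $b$ by exhaustive search. The toy values $p = 101$, $a = 2$ and $i = 57$ and the helper name `discrete_log_bruteforce` are assumptions chosen for illustration; they are far too small for any real use.

```python
def discrete_log_bruteforce(a, b, p):
    """Return i in 1..p-1 with a**i % p == b, or None; needs up to p-1 steps."""
    value = 1
    for i in range(1, p):
        value = (value * a) % p
        if value == b:
            return i
    return None


p, a = 101, 2        # toy values; 2 is a primitive root of 101
i = 57
b = pow(a, i, p)     # forward direction, equation (1)
recovered = discrete_log_bruteforce(a, b, p)   # inverse direction, equation (2)
```

The forward direction takes a number of multiplications proportional to the bit length of $i$, whereas the exhaustive search grows with $p$ itself.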

The calculation of $b$ is easy, but currently no efficient way exists to find the discrete logarithm $i$ for given $a$, $b$ and $p$. This statement is only true if $p$ is big enough to make the use of pre-calculated solution tables impossible and if no prior knowledge about $i$ exists that allows reducing the search space.

2) Primitive roots

The property of $a$ being a primitive root of the prime $p$ means that

$a^i \bmod p$, with $i = 1..p-1$ (3)

results in all values of $1..p-1$, with no value duplicated or missing. This property is relevant for creating collision-free local identifiers. Primitive roots have long been used to build good random number generators [19]. The proposed algorithm uses this knowledge to introduce pseudo-randomness into the series of pseudonyms.

3) Adaption for the calculation of the local identifier

With $k$ bits reserved for the local identifier, a prime number $p$ should be chosen that is, in the best case, the highest prime number lower than $2^k$. With the given $p$, the interval of possible person and local identifiers is $1..p-1$. The numbers that are invalid in the $k$-bit number space are $0$ and $p..2^k-1$. As an example: for $k=31$, the highest prime lower than $2^{31}$ is $2^{31}-1 = 2147483647$. In this case, only 0 and 2147483647 cannot be used as person and local identifier.

The difficulty of finding the discrete logarithm $i$ of the equation $a^i \bmod p$ is based on the assumption that $i$ is randomly distributed and that no information can be used to reduce the number of possible values. This may not be the case if the person identifier is used as exponent $i$. Two examples demonstrate the problem. In both cases, $i$ equals the person identifier $id$. In the first example, the exponent $i$ is a continuous number starting with 1, so the $n$-th local identifier belongs to the person identifier $n$. If an attacker is able to estimate the number of already managed persons, the number of potential $i$ is heavily reduced. In the second example, the person identifier is created from the birthday and a running number (e.g., for the 12th person born on March 23, 1985). Here, knowing that a person was born on a certain day limits the number of potential $i$ to 100.

To prevent prior knowledge about the person identifier $id$ from reducing the number of potential $i$, two processing steps are performed, including one non-linear step:

1. XOR (non-linear exclusive or): The person identifier is XORed with a constant $c \neq 0$ of $k$ bits.
2. EXPAND: The intermediate result is multiplied with an expansion factor $q$ modulo $p$ ($1 < q < p$).

Step 1 might lead to an invalid result that is outside the range of allowed values (i.e., $0$ or $p..2^k-1$). If this happens, the XOR must be reversed. If $p$ is close to $2^k$, the number of invalid values ($p..2^k-1$) is minimized, which lowers the risk of having to reverse the XOR step. Since $p$ is prime, the result of step 2 is still in the range $1..p-1$, without any duplicates. At that point, even with prior knowledge about the person identifier, no conclusions can be drawn about the exponent $i$ of the calculation $a^i \bmod p$ that would reduce the search space. Finally, the main calculation step $a^i \bmod p$ can be performed.

Unfortunately, if the prime number $p$ is small, it is possible to calculate all possible $b = a^i \bmod p$ and set up a solution table $b \to i$. For a prime smaller than $2^{31}$, at most 8 GiB are needed to set up such a table (1 GiB = $2^{30}$ bytes). Even for a prime smaller than $2^{40}$, a solution table of at most 5 TiB needs to be pre-calculated (1 TiB = $2^{40}$ bytes). Tables of that size fit into currently used RAM or hard disks and are no burden for potential attackers. A solution to this problem is to keep the primitive root $a$ secret as well. In that case, for given $b$ and $p$, a different $i$ fulfills the equation for each $a$.

The entropy of the secrets $a$, $q$ and $c$ that have been used so far might be insufficient to prevent brute-force attacks. So a final round of confusion is performed:

3. XOR (non-linear exclusive or): The intermediate result is XORed with a constant $d \neq 0$ of $k$ bits.
4. ROL (shift rotate left): The intermediate result is shift-rotated $s$ bits to the left ($s > 0$).

As with step 1, step 3 must be reversed if the result is invalid. If step 4 leads to an invalid value, it must be repeated until the intermediate result is in the allowed range. Neither strategy ever introduces duplicates. The calculated local identifier is the outcome of step 4. Figure 8 lists the entire algorithm as pseudocode. The effort for an attacker to re-identify the person identifier depends on the secrets $a$, $c$, $d$, $q$ and $s$, and the attacker needs to know some pairs of person and local identifiers to prove that the secrets have been identified correctly.

4) Example

All calculation steps of the local identifier for an example person identifier $id$ are shown in Figure 9. Let $k=31$ and the prime $p = 2^{31}-1 = 2147483647$. The initial value of $id$ is XORed with $c$. The expansion factor is $q$, and $a$ is a primitive root of $p$. The intermediate result is XORed with $d$. Finally, the intermediate result is shift-rotated left by $s=11$ bits; the outcome is the pseudonym.

5) Finding a primitive root

For a given prime number $p$ it is unnecessary to find all primitive roots to select the secret $a$; only one primitive root

is needed. The density of primitive roots is quite high, so approximately four random tries are required for $p = 2^{31}-1$ until a primitive root is found. To prove that a selected $a$ is a primitive root, the series $a^i \bmod p$ ($i = 1..p-1$) has to be checked. If $a^i \bmod p = 1$ with $i < p-1$, the series can be stopped and $a$ is not a primitive root. In that case, two exponents have been found that result in the same value: $a^{i+1} \bmod p = a = a^1 \bmod p$. The series can easily be calculated with

$a^0 \bmod p = 1$ (4)

$a^i \bmod p = a \cdot (a^{i-1} \bmod p) \bmod p$ for $i = 1..p-1$ (5)

This is a quite time-consuming process. A faster way is the following: first, all prime factors of $p-1$ have to be identified. In case of $p = 2^{31}-1$, the prime factors of $2^{31}-2 = 2147483646$ are 2, 3, 7, 11, 31, 151, and 331. The time to identify the prime factors has to be spent only once and does not affect the time to test the primitive root candidates. For each prime factor $f$ of $p-1$, the value $a^i \bmod p$ with $i = (p-1)/f$ needs to be calculated. $a$ is a primitive root of $p$ if none of the results equals 1. In the example, the series $a^{(p-1)/2} \bmod p$, $a^{(p-1)/3} \bmod p$, $a^{(p-1)/7} \bmod p$, ..., $a^{(p-1)/331} \bmod p$ needs to be calculated. These are at most seven calculations.

    FUNCTION calculateLocalIdentifier (id, k, a, p, c, d, q, s)
    BEGIN
        t1 := id XOR c                // XOR person identifier with secret c
        IF (t1 ∉ {1..p-1}) THEN       // if out of range,
            t1 := id                  // reverse the XOR
        END IF
        t2 := (t1 * q) mod p          // expand with secret q
        i  := t2                      // this is the exponent
        b  := a^i mod p               // the main calculation
        t3 := b XOR d                 // XOR with secret d
        IF (t3 ∉ {1..p-1}) THEN       // if out of range,
            t3 := b                   // reverse the XOR
        END IF
        t4 := t3 ROL s                // shift-rotate left by s bits
        WHILE (t4 ∉ {1..p-1}) DO      // if out of range,
            t4 := t4 ROL s            // repeat
        END WHILE
        lid := t4                     // the local identifier
        RETURN lid
    END

Figure 8. Pseudocode of the algorithm

Figure 9. Example calculation (steps shown: t1 = id XOR c; t2 = (t1 · q) mod p; b = a^t2 mod p; t3 = b XOR d; t4 = t3 ROL 11; lid = t4)

6) Calculating $a^i \bmod p$

For the calculation of $a^i \bmod p$ in the described algorithm, the pre-calculated value $a^{i-1} \bmod p$ is not available, so the recursion of equations (4) and (5) is not applicable. Alternatively, the calculation can be accelerated if $i$ is split into its binary representation of $k$ bits:

$i = (i_{k-1}, i_{k-2}, \dots, i_2, i_1, i_0)$ with $i_j \in \{0,1\}$ (6)

$i = \sum_{j=0}^{k-1} 2^j i_j$ with $i_j \in \{0,1\}$ (7)

Then

$a^i \bmod p = a^{\sum_{j=0}^{k-1} 2^j i_j} \bmod p$ (8)

$= \prod_{j=0}^{k-1} a^{2^j i_j} \bmod p$ (9)

$= \prod_{j=0}^{k-1} \left(a^{2^j} \bmod p\right)^{i_j} \bmod p$ (10)

This calculation is very fast if the values $a^{2^j} \bmod p$ are pre-calculated using

$a^{2^j} \bmod p = \left(a^{2^{j-1}} \bmod p\right)^2 \bmod p$ for $j = 1..k-1$ (11)

As an example, let $i = 25 = 11001_2$. Then

$a^{25} \bmod p = a^{2^4 \cdot 1} \, a^{2^3 \cdot 1} \, a^{2^2 \cdot 0} \, a^{2^1 \cdot 0} \, a^{2^0 \cdot 1} \bmod p$

$= a^{16} \cdot a^{8} \cdot 1 \cdot 1 \cdot a^{1} \bmod p$

$= \left((a^{16} \bmod p)(a^{8} \bmod p)(a^{1} \bmod p)\right) \bmod p$ (12)-(17)

7) Bit-depth of the secrets

The algorithm for the calculation of the local identifiers is useless if the secrets allow a brute-force attack. This is not the case if the entropy of the secrets is big enough. Furthermore, the calculation effort must still allow a high number of pseudonyms to be calculated per unit of time. Several secrets are used to calculate the pseudonym:

61 89 TABLE I. FACTS TABLE II. ENTROPY OF THE SECRET 4-byte signed integer 5-char base64 6-char base32 2-byte signed short integer Bits maximal positive value highest possible prime highest possible person identifer number of invalid values number of possible primitive roots of the prime The number of possible primitive roots can be calculated with Eulers φ-function and is φ(φ(p))= φ(p-1). The random number c that was used to XOR the exponent. The factor q that was used to expand the exponent. The primitive root a. The random number d that was used to XOR the intermediate result. The number of ROLs (left-shift-rotate) of the intermediate result s. As an example, the bit-depth of the secrets are calculated in case of data types that are usually used to store person identifiers 4-Byte signed integer: The number space is sufficient for a third of the entire living population on earth or four times the number of the living population of the European Union. 2-byte signed short integer: The number space is only useful for a small set of persons, e.g., for persons of a clinical study. 5 chars of base64-encoded numbers or 6 chars of base32-encoded numbers (in case of efficient human readability): The number space is sufficient for two times of the living population of the European Union but insufficient for the living population the People's Republic of China. With the information of Table I, the entropy of the secrets can be calculated that are used during the calculation (Table II). For integer and the encoded char-values, the secret with entropy of 124 bits is sufficient to avoid effective brute force attacks. This is void for short integer. Here the entropy of the secrets is only 64 bits. In that case, the calculation of the pseudonym must be performed in two rounds with different primitive roots, expansion factors, XORs and shift values. 
This does not fully double the entropy of the secrets because the final steps XOR and ROL are directly followed by another XOR step of the next round. All three steps can be simplified to only one XOR plus ROL. However, the entropy of the secret ( 111 bits) is sufficient today. Secret 4-byte signed integer 5-char base64 6-char base32 2-byte signed short integer a: primitive roots 29 bit 29 bit 13 bit q: expansion factor 31 bit 30 bit 16 bit c: XOR exponent 31 bit 30 bit 15 bit d: XOR result 31 bit 30 bit 15 bit s: ROL result 5 bit 5 bit 4 bit total 127 bit 124 bit 63 bit 8) Calculation speed There are only a few steps involved in the calculation of the pseudonym. The calculation of a i mod p is identified as the most time consuming calculation. The calculation is straightforward and avoids several rounds until the final result is available. Multiplications are always more time consuming than XOR or shift operations so it is assumed that the pseudonym calculation is slower that the competitive approaches. In the known scenarios, the number of pseudonymisation calculations per time is sufficient: Tests have shown that on average hardware (Intel Core 2 Duo, 2.66 GHz) thousand pseudonyms per second can be calculated. 9) Attacks Important for the evaluation of the algorithm is the resistance against attacks and the possibility for reidentification. It is known that for b = a i mod p (p prime, a primitive root of p) it is difficult to calculate the discrete logarithm i, if b, a, and p are known and p being big enough to avoid solution tables. In our case, also the primitive root a is unknown. On the other hand, there might be pre-knowledge about i. With the non-linear diffusion steps that base on the use of non-trivial secrets (e.g., q 1, c 0), the exponent is complex enough to make the information of the initial series useless. 
Brute force attacks will only be possible if an attacker is able to validate the set of parameters with a given set of person identifiers and their associated local identifiers. In the worst case, an attacker will only get both sets, not knowing which person identifier and local identifier are linked. Depending on the size of the set, it is likely that several secret sets lead to the same transformation of the set of person IDs to the set of pseudonyms. In case of leaked pairs of person plus local identifier, this information can only be used to perform a brute force attack. A recalculation of the used parameters is not possible.

10) Re-Identification

A fast re-calculation of the person identifier is possible if all secrets are known. In case of a small p and a given a, the solution table for b = a^i mod p is built quickly and every step of the entire calculation process can be reversed. Only if the solution table cannot be pre-calculated is it quicker to pseudonymise all known person identifiers again to find the correct local identifier.
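The statements above about primitive roots, solution tables, and the totals of Table II can be checked with a small sketch. It only covers the step b = a^i mod p and is not the full pseudonymisation algorithm (the expansion, XOR, and ROL steps are omitted). The prime p = 23, the root a = 5, and all function names are assumptions chosen for illustration; they do not appear in the original.

```python
def prime_factors(n):
    """Return the set of prime factors of n (trial division, small n only)."""
    factors, f = set(), 2
    while f * f <= n:
        while n % f == 0:
            factors.add(f)
            n //= f
        f += 1
    if n > 1:
        factors.add(n)
    return factors


def euler_phi(n):
    """Euler's totient: phi(n) = n * prod(1 - 1/r) over the prime factors r of n."""
    result = n
    for r in prime_factors(n):
        result -= result // r
    return result


def is_primitive_root(a, p):
    """For a prime p, a is a primitive root iff a^((p-1)/r) != 1 (mod p)
    for every prime factor r of p - 1."""
    return all(pow(a, (p - 1) // r, p) != 1 for r in prime_factors(p - 1))


def solution_table(a, p):
    """Map b = a^i mod p back to the exponent i for i = 1 .. p-1.
    Only feasible when p is small, as noted in the text."""
    return {pow(a, i, p): i for i in range(1, p)}


# Illustrative toy prime (assumption); real identifier spaces use far larger primes.
P = 23
ROOTS = [a for a in range(1, P) if is_primitive_root(a, P)]
TABLE = solution_table(5, P)

# Bit depths of the secrets as listed in Table II.
ENTROPY_BITS = {
    "4-byte signed integer": {"a": 29, "q": 31, "c": 31, "d": 31, "s": 5},
    "5-char base64 / 6-char base32": {"a": 29, "q": 30, "c": 30, "d": 30, "s": 5},
    "2-byte signed short integer": {"a": 13, "q": 16, "c": 15, "d": 15, "s": 4},
}
TOTALS = {name: sum(bits.values()) for name, bits in ENTROPY_BITS.items()}

if __name__ == "__main__":
    print("phi(p-1) =", euler_phi(P - 1), "primitive roots found:", len(ROOTS))
    print("table is collision free:", len(TABLE) == P - 1)
    print(TOTALS)
```

Because a primitive root generates every value from 1 to p-1 exactly once, the table has no collisions, which is the property the pseudonym calculation relies on; the same table is what makes re-identification fast once a and p are known and p is small.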

III. RESULTS

A National Pseudonymisation Service based on the described concept has been specified and is in the final phase of implementation in Luxembourg. The concept was developed after an intensive study of the demands had been carried out in Luxembourg. The National Pseudonymisation Service creates a shell around the National Master Patient Index that will be used in the National ehealth Platform of Luxembourg. This ensures that the National Pseudonymisation Service will cover all persons working or living in Luxembourg and that all persons are managed with high quality demographics. Matching difficulties of identities should therefore be an exception. Identity vigilance in case of uncertainty will be covered on a national level and no double structures have to be created. The use of the National Master Patient Index by the National Pseudonymisation Service does not affect the productivity of the used system. Both systems can be enhanced independently and update paths do not affect each other. The described functions and the possibility to adapt the properties of an identifier domain to several needs allow the use of the National Pseudonymisation Service in all Cases from B to E as described in Section B.

A. Case B: Cancer Register using National Pseudonymisation Service

The use of the National Pseudonymisation Service as described in Case B is planned for a cancer register. In the described use case, the sources have access to the clinical data of the patients and will send pseudonymized extracts of this data to the cancer register. Sources can be divided into those that manage their local identifier and those that do not manage a local identifier.
The process of sending data from the sources to the cancer register can be described as follows:

1a) Sources with a managed local identifier register the local identifier and demographics at the National Pseudonymisation Service.
[Register Identified Person: Source Domain (Managed); Local Identifier; Demographics of Patient]

1b) Sources with an unmanaged local identifier request a local identifier on the basis of demographics from the National Pseudonymisation Service.
[Register Person: Source Domain (Unmanaged); Demographics of Patient; Local/Persistent Identifier]

2) The source requests the pseudonym of the cancer register from the National Pseudonymisation Service.
[Translate Identifier: Source Domain (Managed/Unmanaged); Cancer Register Domain; Local/Persistent Identifier; Pseudonym/Persistent Identifier of Cancer Register]

3) The source sends medical data and pseudonym to the cancer register.
[Send Medical Data: Pseudonym/Persistent Identifier of Cancer Register; Medical Data]

4) The National Cancer Register stores medical data and pseudonym.
[Store Medical Data: Pseudonym/Persistent Identifier of Cancer Register; Medical Data]

B. Case E: Biobank using National Pseudonymisation Service

A Luxembourgish biobank currently uses the principles of Case E with a Trusted Third Party as pseudonymisation service. The migration of that service to the National Pseudonymisation Service is planned as soon as the service is available. The specialty of this concept is the use of the warrant. In the biobank case, cryo-boxes are sent by the biobank to the collection sites. If samples are collected from donors (specimen, blood, urine), the samples are put into the cryo-box, which is sent back to the biobank. The kit-id of the cryo-box acts as the warrant in the process of person identification and pseudonym retrieval.
The process can be described as follows:

1) The biobank sends cryo-boxes with unique kit-ids to the collection sites.
[Send Cryo-Box: Kit-ID]

2) Collection sites with an unmanaged local identifier request a local identifier on the basis of demographics from the National Pseudonymisation Service.
[Register Person: Collection Site Domain; Demographics of Donor; Local Identifier]

3) Collection sites take samples of donors and store them in cryo-boxes.
[Collect Samples: Cryo-box with Kit-ID; Samples of a Donor]

4) Collection sites send cryo-boxes to the biobank.
[Send Cryo-Box: Cryo-box with Kit-ID; Samples of a Donor]

5) The collection site registers the Kit-ID of the cryo-box as warrant at the National Pseudonymisation Service.
[Register Warrant: Collection Site Domain; Biobank Domain; Local Identifier; Kit-ID]

6) The biobank requests the pseudonym at the National Pseudonymisation Service on the basis of the Kit-ID of the received cryo-box.
[Redeem Warrant: Biobank Domain; Kit-ID; Pseudonym/Persistent Identifier]

7) The biobank stores the samples in its repository and links them in its Laboratory Information Management System (LIMS) with the pseudonym.
[Store and Manage Sample of Donor: Pseudonym/Persistent Identifier]

C. Case B: HIV Register using In-House Pseudonymisation

A local HIV register performs long-term studies on HIV. It was created several years ago and recently introduced the concept of in-house pseudonymisation to improve data privacy and data security. Since all tools and mechanisms had been implemented around an existing database structure, it was decided to keep the original database with all the medical data plus the demographics of the patient untouched. Persons who have direct contact to the patients fill this database. A tool is used to create pseudonymized copies of the original database that contain only research and study specific subsets of the original data. Therefore, the database model is only a subset of the original database model, but it is ensured that none of the used tools have to be adapted. Such extraction, transform and load tools are called ETL tools. The ETL tool will handle all mappings between both database models and finally will create the pseudonym out of a given person identifier. In the described case, the keeping of mapping tables (person identifier-to-random pseudonym) was not wanted, and the described techniques of hashing or encryption were also not suitable, since the data type of both the person identifier and the pseudonym is a 4-byte signed integer with 1 as the smallest, and as the highest possible values. With the algorithm that has been described in II.M. Calculation of local identifier with small number of bits, study specific pseudonyms are calculated out of the given person identifier.
For each study, a different set of secrets (as listed in Table II) is used as the calculation parameters. Since the identification of primitive roots is not an easy task, a tool was provided to identify primitive roots. D. Use cases in discussion Other Luxembourgish institutes are highly interested in using a National Pseudonymisation Service in the near future, either to secure their existing databases or for newly planned databases. For some of the analyzed use cases, the use of a National Pseudonymisation Service seems to be far too much and an in-house pseudonymisation is demanded. IV. CONCLUSION AND FUTURE WORK The use of a National Pseudonymisation Service solves several problems of researchers. It divides infrastructure and personal costs among all users of the national service. It ensures the quality of the underlying demographics that is ensured by the existence of a centralized identity vigilance that already exists for the underlying National Master Patient Index of the National Pseudonymisation Service. The team that performs identity vigilance on national level has permission to solve unclear matching decisions of the Master Patient Index by questioning all sources of demographics. Update of demographics and reconsiderations at the National Pseudonymisation Service about the matching of persons are manageable. With the use of the persistent identifiers, only selective data needs to be updated in case of change of identifiers. In case of new studies or trials that have to be approved by ethics commissions, questions about data protection will be asked. The use of mechanisms that already have been accepted on a national level will simplify the answering of these questions. If approved by the ethics commission and with given consent by patients, a National Pseudonymisation Service enables the exchange of data between different studies or trials and link data from the different sources to the same person, even if the sources only use pseudonyms. 
The given set of services and the various properties that can be configured for an identifier domain allow the implementation of all described cases A to E. There are always arguments for and against the implementation of a certain case, depending on the risk of disclosing sensitive information. Each designer of a clinical study or trial setup can decide which of the cases best suits the requirements and data privacy demands. Even with an up and running National Pseudonymisation Service, the use of in-house pseudonymisation might be the first choice, especially in case of limited participants in the setup. In that case the use of a national service might be far too much, and the costs for the service might be too high. The described algorithm for the creation of pseudonyms out of person identifiers then provides a collision free one-way pseudonymisation technique for small bit depth that still fulfills the requirements of a one-way function, if the secrets behind the calculation are kept secret. The past has shown that an up-and-running National Pseudonymisation Service improves the willingness to include pseudonymisation solutions already during the design phase of new research databases. This is good, since privacy-by-design strategies are more durable than security patches that are introduced in a later phase of development. It is expected that with the establishment of the National Pseudonymisation Service, local companies will link their software solutions to the national service. Alternatively, consultant companies will offer help in planning the integration of the National Pseudonymisation Service into future applications and in finding the correct setup (case A to E) that best suits the demands of the customer on data protection and disclosure risks.

An Evaluation Framework for Adaptive Security for the IoT in ehealth

Wolfgang Leister, Norsk Regnesentral, Oslo, Norway
Stefan Poslad, Queen Mary University, London, UK
Mohamed Hamdi, School of Communication Engineering, Tunisia
Habtamu Abie, Norsk Regnesentral, Oslo, Norway
Arild Torjusen, Norsk Regnesentral, Oslo, Norway

Abstract We present an assessment framework to evaluate adaptive security algorithms specifically for the Internet of Things (IoT) in ehealth applications. The successful deployment of the IoT depends on ensuring security and privacy, which need to adapt to the processing capabilities and resource use of the IoT. We develop a framework for the assessment and validation of context-aware adaptive security solutions for the IoT in ehealth that can quantify the characteristics and requirements of a situation. We present the properties to be fulfilled by a scenario to assess and quantify characteristics for the adaptive security solutions for ehealth. We then develop scenarios for patients with chronic diseases using biomedical sensors. These scenarios are used to create storylines for a chronic patient living at home or being treated in the hospital. We show numeric examples of how to apply our framework. We also present guidelines on how to integrate our framework into the evaluation of adaptive security solutions.

Keywords Internet of Things; evaluation framework; scenarios; assessment; ehealth systems; adaptive security.

I. INTRODUCTION

Wireless Body Sensor Networks (WBSNs) improve the efficiency of ehealth applications by monitoring vital signs of a patient using low-rate communication media and constitute an important part of the Internet of Things (IoT) by bringing humans into the IoT. However, the successful deployment of the IoT depends on ensuring security and privacy, which need to adapt to the processing capabilities and resource use of the IoT.
To evaluate such adaptive mechanisms we introduced evaluation scenarios specifically designed for applications in ehealth and proposed an evaluation framework [1]. This evaluation framework is extended in this study with a quantitative component that allows us to quantify the quality of security solutions. The Adaptive Security for Smart Internet of Things in ehealth (ASSET) project researches and develops risk-based adaptive security methods and mechanisms for IoT that will estimate and predict risk and future benefits using game theory and context awareness [2]. The security methods and mechanisms will adapt their security decisions based upon those estimates and predictions. The main application area of ASSET is health and welfare. Health organisations may deploy IoT-based services to enhance traditional medical services and reduce delay for treatment of critical patients. In a case study, we evaluate the technologies we developed for adaptive security using both simulation and implementation in a testbed based upon realistic cases. Blood pressure, electrocardiogram (ECG) and heart rate values can be gathered from patients and anonymised. The sensor data can be stored in different biomedical sensor nodes that are capable of communicating with any of the following connectivity options ZigBee, Wi-Fi, 3G, GPRS, Bluetooth, and 6LoWPAN. For instance, a smartphone with a suitable transceiver could act as an access point between sensor nodes and a medical centre. For the evaluation, we developed a set of scenarios to assess the adaptive security models, techniques, and prototypes that will be introduced in ASSET. These scenarios describe the foreseeable interactions between the various actors and the patient monitoring system based on IoT. In computing, a scenario is a narrative: it most commonly describes foreseeable interactions of user roles and the technical system, which usually includes computer hardware and software. 
A scenario has a goal, a time-frame, and scope. Alexander and Maiden [3] describe several types of scenarios, such as stories, situations (alternative worlds), simulations, story boards, sequences, and structures. Scenarios have interaction points and decision points where the technology under consideration can interact with the scenario. This means that the scenarios developed for a particular situation have to take into consideration the technologies used by the different actors. The importance of scenarios in the assessment of security solutions has been discussed in the literature [4], [5]. This work focuses on the development of scenarios that support the evaluation of adaptive security techniques for the IoT in ehealth. There are many definitions of the IoT. For instance, while the ITU-T [6] defines the IoT as a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies, the European Research Cluster on the Internet of Things (IERC) defines the IoT as A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where

physical and virtual things have identities, physical attributes, and virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network [7]. For our purposes we use Abie and Balasingham's shorter definition: IoT is a network of things [2]. Habib and Leister [8] present a review of IoT layer models, including the ITU-T IoT reference model [6].

The primary contribution of this study is the development of a quantitative framework for the assessment of adaptive security solutions on the basis of security, privacy, Quality of Service (QoS) requirements, and costs.

In Section II, the requirements and the proposed assessment framework are described, including metrics that make this framework quantifiable in order to enable comparison of various situations. We define the properties that must be fulfilled by a scenario to assess adaptive security schemes for ehealth. We show the interaction between the scenarios, the threats, and the countermeasures in an assessment framework for the ASSET project. In Section III, we describe the extension of a previously developed generic system model, which is used for the structure of the scenarios in Section III-A with different QoS requirements, contexts and adaptive security methods and mechanisms. These scenarios, first proposed by Leister et al. [9], include a patient monitored at home scenario, a hospital scenario, and an emergency scenario. These scenarios are reviewed and their adequacy for the evaluation of adaptive security techniques for the IoT is analysed. We propose storylines that can support requirements analysis, as well as adaptive security design, implementation, evaluation, and testing. Further, in Section IV, we present storylines for both the home monitoring scenario and the hospital scenario. These storylines are used in Section V to show how our framework can be applied to selected episodes of the home scenario and storyline.
In Section VI, we show how to use our framework in the context of adaptive security as defined by Abie and Balasingham [2]. Finally, Section VII discusses our framework and relates it to other work before Section VIII offers concluding remarks and future prospects.

II. THE ASSET EVALUATION FRAMEWORK

Designing the scenarios is of central significance for the ASSET project. They depict the operation of systems, here applied to IoT-based ehealth systems, in the form of actions and event sequences. In addition, scenarios facilitate the detection of threats and the identification of the solutions to cope with these threats. In a scenario-based assessment, a set of scenarios is developed to convey the design requirements. With regard to the specific objectives of IoT-based systems, the scenarios should capture two types of requirements:

1) Security requirements: Novel adaptive security and privacy mechanisms and methods are required to adapt to the dynamic context of the IoT and changing threats to them. Thus, the scenarios should be generic enough to capture the security needs for the data processed and exchanged within a patient monitoring system. This is particularly challenging because this system encompasses multiple networking technologies, data, users, and applications, addressing varying processing capabilities and resource use. In an assessment context, privacy and security requirements are related. Privacy addresses the ability to control the information one reveals about oneself over the Internet and who can access that information.

2) QoS requirements: QoS addresses the overall performance of a system regarding technical parameters. Unlike many traditional applications and services relying on communication networks, ehealth applications have stringent QoS requirements.
Items such as the communication delay, the quality of the communication channels, and the lifetime of the self-powered sensor nodes are crucial context parameters that have significant impact on the safety of the patient. The scenarios should highlight the needs in terms of QoS requirements and illustrate the dynamic interplay between these needs and the security requirements. Security and QoS mechanisms are interrelated. Adaptation of security mechanisms may impact the QoS and vice-versa. QoS requires adaptive security mechanisms to ensure appropriate level of QoS. While adapting poor security mechanisms can hamper the performance of QoS, an inappropriate QoS level can leak sensitive information about the importance of the service in question. Therefore, adaptation must consider both security and QoS together to achieve the best possible security and QoS levels. Otherwise, weaker security and/or less effective QoS guarantees may be the result. For example, the requirement of using stronger cryptographic algorithms could have negative impact on the performance or battery consumption. A. Requirements and Sets of States The ASSET scenarios appear as a component of an assessment framework that will serve to improve the applicability of the security techniques proposed in the frame of the project. The other components of the assessment framework are (i) a set of threats describing the actions that violate the security requirements, (ii) a set of security solutions that mitigate the threats, and (iii) a set of system states representing the dynamic context in which the patient monitoring system operates. Fig. 1 illustrates the ASSET assessment framework. The security and QoS requirements are the output of the scenario design activity. In other terms, the scenarios should give information about the set of reliable states from the security requirements, here denoted as S, and the set of states where the QoS is acceptable, here denoted as Q. 
The intersection of these sets is the set of desirable states, denoted in Fig. 1(a) by D (Desirable), where the security and QoS requirements are balanced. One of the intrinsic features of the ASSET scenarios is that the sets of security requirements and QoS requirements could vary in time and space. This will make the threats and the security solutions also vary in time and space. Threats

are viewed as potential events that may generate insecure system states, while countermeasures are intended to thwart the effects of these threats. The realisation of a threat reduces the set of secure states in the scenario of interest and affects the QoS. This is represented by the region I (Impact) in Fig. 1(b). This region represents a set of states that will not fulfil the security or QoS requirements if a threat is realised. The countermeasures or controls [10] will reduce both the likelihood of a threat being realised and the impact of an emerging threat. Hence, the size of the set of potentially insecure states is decreased. Fig. 1(c) illustrates the effect of the countermeasures through the region M (Mitigate). This region extends the set of secure states. Nonetheless, the countermeasures can have a negative effect on the QoS, represented by the region C (Cost), consisting of power, processing resources, memory, communication overhead, and cases where QoS requirements may not be fulfilled.

Fig. 1: The ASSET assessment framework. Panels (a), (b), and (c) each show the scenarios and the system states (context): (a) with the regions S, D, and Q; (b) with the regions S, I, D, and Q and the threats; (c) with the regions S, I, D, M, C, and Q, the threats, and the security solutions.

These elements are used in a scenario-based assessment framework to evaluate the strength of the adaptive security solutions. For instance, the scenarios allow us to evaluate the strength of the security controls to minimise the impact of threats in a given context. For adaptive security solutions, the proposed protection techniques will vary in time and space according to the context. This is not conveyed by the scenario representation of Fig. 1. To overcome this issue, we derive a set of storylines from the ASSET scenarios. These can be viewed as a sequential application of the scenarios in a way that the selection of the appropriate countermeasures must take into consideration: The space transition between scenarios.
Space encompasses much useful information that affects the security decision-making process. For instance, the location of the WBSN may increase or decrease its vulnerability. Moreover, mobility introduces significant challenges including horizontal and vertical handover management, i.e., managing handover on the same layer or within the same access technology and between different layers or different access technologies, respectively.

The time transitions between scenarios (with their implications on the context). The time interplay between the threats and countermeasures has a substantial and dynamic impact on the environment where the patient monitoring system is deployed. The amount of energy, memory, and processing resources are crucial parameters from the QoS perspective and the security solutions have to adapt accordingly. In addition, the state of the communication channel and the proper temporal interplay in all these contexts are important in the selection of the appropriate security decisions.

Fig. 2 illustrates the evolution of the storyline and the underlying impact on the context. Of course, the sequence of scenarios forming a storyline should be consistent so that it translates to a real-case situation.

Fig. 2: Illustration of context changes during the execution of a storyline. The use of the different shaded regions follows that of Fig. 1.

B. Making the ASSET Framework Quantifiable

Assessing the qualities of a given system state can be done by means of data given by human assessors and by means of objective data from measurements. Our goal is to establish an estimation function that takes measured data as input and which is a prerequisite to implement functionality for adaptive security. To establish such an estimation function, an assessment panel of users and specialists is queried to calibrate a function that uses measured data as input. A similar methodology has been used to estimate the quality of streamed video [11].
In the following we present how to assess a given system state by using human assessors. To make the ASSET framework quantifiable we define a real function $0 \le q(\text{system state}) \le 1$ that shall express

the degree of how well the requirements are fulfilled in the system state in question. A low value, below a given threshold, denotes that the system state in question is unacceptable, while a value close to 1 denotes that most requirements are well fulfilled. The function q is composed of three parts: 1) security requirements that need to be fulfilled, expressed in the function $q_S$; 2) degree of fulfilled QoS requirements, expressed in the function $q_Q$; and 3) costs that occur due to mitigation of threats, expressed in the function $q_C$. The function q is then composed of a product of all partial functions $q_i$, $i \in \{S, Q, C\}$:

$$q = \prod_i q_i^{w_i}.$$

The weights are real numbers $0 \le w_i < \infty$ and express the importance of a single $q_i$, large values indicating more importance. A weight $w_i = 1$ is considered neutral. The importance of each parameter is defined by the assessor according to the nature of the requirement before assessing the $q_i$ values.

The above definition has the disadvantage that the resulting q is sensitive to the number k of factors $q_i$ that are used to define it. To mitigate this we propose to replace the weights $w_i$ by $v_i = w_i / \sum_{j=1}^{k} w_j$, resulting in $\hat{q}_i = q_i^{w_i / \sum_{j=1}^{k} w_j}$. Thus, the value q is expressed by:

$$q = \prod_i \hat{q}_i = \prod_i q_i^{\,w_i / \sum_{j=1}^{k} w_j} \tag{1}$$

1) Security Requirements: Define $G_S = (S \setminus I) \cup M$ as a set of states where security requirements are fulfilled or threats are mitigated. For states j outside $G_S$ we define a deviation from the ideal requirements and a normalised distance $d_{Sj}$: $0 \le d_{Sj} \le 1$ according to a suitable metric to denote how far the current state is from ideal fulfilment of the requirement. We set $d_{Sj} = 1$ when deviations cannot be tolerated. Thus, we define the following function:

$$q_{Sj} = \begin{cases} 1 & \text{if state} \in G_S \\ 1 - d_{Sj} & \text{if state} \notin G_S \end{cases}$$

2) QoS Requirements: Define $G_Q = Q \setminus C$ as a set of states where all QoS requirements are fulfilled and possible effects from the mitigation are tolerable.
For states j outside $G_Q$ we define a deviation from the ideal QoS requirement and a normalised distance $d_{Qj}$: $0 \le d_{Qj} \le 1$ according to a suitable metric to denote how far the current state is from ideal fulfilment of the requirement. We set $d_{Qj} = 1$ when QoS requirements are insufficiently fulfilled. QoS requirements may be unfulfilled due to influences from the environment, or become unfulfilled due to adaptation. The latter could, for instance, happen if a security requirement to avoid eavesdropping was met by reducing signal strength, which could impact the available bandwidth or even data availability. Thus, we define the following function:

$$q_{Qj} = \begin{cases} 1 & \text{if state} \in G_Q \\ 1 - d_{Qj} & \text{if state} \notin G_Q \end{cases}$$

3) Mitigation Costs: Besides the effect on QoS there may be other costs implied by mitigation, e.g., real costs in payroll or material, changes to the environment, costs for the patient, virtual costs for a lower QoS, and so on. States with unacceptable costs are included in the area C. For costs outside C we define relative costs on a normalised scale $d_C$: $0 \le d_C \le 1$. We define the following function:

$$q_C = \begin{cases} 1 - d_C & \text{if costs} \notin C \\ 0 & \text{if costs} \in C \end{cases}$$

C. Assessment to define the $q_i$ values

To aid human assessors in assessing the values for $q_i$ (i.e., the value indicating how far a given requirement is from the ideal fulfilment) we propose to base the assessment on a set of questions that are evaluated based on a Likert scale [12]. A Likert scale is a psychometric scale commonly involved in research that employs questionnaires where the questions are to be answered from best to worst on a scale of n steps, where n is an odd integer number. If the questionnaire to be filled out by an assessor is designed so that each $q_i$ corresponds to one question on a Likert scale, we propose to use a function e that takes the response $\tilde{q}_i \in \mathbb{N}$ for $0 \le \tilde{q}_i \le n - 1$ as an argument. We use two approaches to express the $q_i$.
1) Linear Approach:

$$q_i = e_\alpha(\tilde{q}_i) = \frac{\tilde{q}_i}{n - 1} \tag{2}$$

2) Logarithmic Approach:

$$q_i = e_\beta(\tilde{q}_i) = \log_n(\tilde{q}_i + 1) \tag{3}$$

With the logarithmic approach, bad values have less impact than with the linear approach. There are some caveats on using a logarithmic function for values on a Likert scale, as noted by Nevill and Lane [13]. Particularly, the values on the Likert scale should express a continuous and rather equidistant increase of quality.

3) Other Methods: In case the questionnaire is designed in a way that several independent questions result in one value for $q_i$, Bayesian networks developed by Pearl and Russell [14] can be employed. However, we consider the design of the questionnaires and the use of Bayesian networks as future work. Note also that for Bayesian networks more data from an assessment are necessary than for the above mentioned methods.

While the Likert scale is useful for assessing opinions on a psychometric scale, i.e., subjective data, we also need to be able to assess objective data. In these cases, we set up a scale where discrete choices on a questionnaire are mapped to a scale similar to the Likert scale to reflect the quantity of data based on an objective value. This way of creating assessment data is quite common for assessments, such as in the estimation of the quality of software products in the OpenBRR [15, 16]. When objective data are used as input, e.g., as the result of measurements, these data on a continuous scale can be mapped

into the value range $0 \le q_i \le 1$ and used in eq. (1). Note, however, that the mapping function does not necessarily need to be linear, and a specific assessment phase may be necessary to develop a suitable function that maps the values into the value range $0 \le q_i \le 1$.

4) Assessment by Subject Panels: For an assessment, often several individuals are put into an assessment panel. These subjects perform the assessment individually while the results are put together into one assessment result. Further work needs to show whether it is more practicable to calculate individual q values and then calculate some mean value of these, or whether to calculate mean values for each $q_i$.

Fig. 3: Generic ehealth framework indicating the use cases in five levels (Extended from [17]).

III. EXTENDED GENERIC MODEL FOR EHEALTH SCENARIOS

In the following sections, we develop the scenarios of the ASSET project and show how storylines can be extracted. We also underline the role of the storyline in the assessment of adaptive security techniques for ehealth. Before delving into the details of scenario and storyline engineering, we highlight the major properties that a scenario should have in order to be useful for evaluating adaptive security.

Patient monitoring systems are a major data source in healthcare environments. During the last decade, the development of pervasive computing architectures based on the IoT has consistently improved the efficiency of such monitoring systems, thereby introducing new use cases and requirements. It is important that these monitoring systems maintain a certain level of availability and QoS, and that they are secure and protect the privacy of the patient. Previously, we have analysed the security and privacy for patient monitoring systems with an emphasis on wireless sensor networks [17] and suggested a framework for providing privacy, security, adaptation, and QoS in patient monitoring systems [18].
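The quantification of Sections II-B and II-C, i.e., the sets G_S and G_Q, the partial functions q_S, q_Q, and q_C, the Likert mappings of eqs. (2) and (3), and the weighted product of eq. (1), can be summarised in a minimal sketch. All function names, state labels, and numeric inputs below are hypothetical and chosen only for illustration; the sketch assumes that at least one weight is positive.

```python
import math


def likert_linear(response, n):
    """Eq. (2): map a Likert response in 0 .. n-1 linearly to [0, 1]."""
    return response / (n - 1)


def likert_log(response, n):
    """Eq. (3): map a Likert response in 0 .. n-1 to [0, 1] with log base n."""
    return math.log(response + 1, n)


def requirement_score(state, good_states, distance):
    """q_S or q_Q: 1 inside the good set, otherwise 1 - d with 0 <= d <= 1."""
    return 1.0 if state in good_states else 1.0 - distance


def cost_score(costs_unacceptable, relative_cost):
    """q_C: 0 if the costs fall into region C, otherwise 1 - d_C."""
    return 0.0 if costs_unacceptable else 1.0 - relative_cost


def overall_q(scores, weights):
    """Eq. (1): product of the q_i raised to the normalised weights w_i / sum(w_j)."""
    total = sum(weights)  # assumed to be > 0
    return math.prod(q ** (w / total) for q, w in zip(scores, weights))


# Hypothetical state labels for the regions of Fig. 1.
S = {"s1", "s2", "s3"}   # states fulfilling the security requirements
Q = {"s2", "s3", "s4"}   # states with acceptable QoS
I = {"s3"}               # states lost when a threat is realised
M = {"s3"}               # states recovered by countermeasures
C = {"s4"}               # states where mitigation costs break QoS

G_S = (S - I) | M        # security fulfilled or threats mitigated
G_Q = Q - C              # QoS fulfilled and mitigation effects tolerable

if __name__ == "__main__":
    state = "s4"
    q_s = requirement_score(state, G_S, distance=0.4)
    q_q = requirement_score(state, G_Q, distance=0.2)
    q_c = cost_score(costs_unacceptable=False, relative_cost=0.1)
    print(overall_q([q_s, q_q, q_c], weights=[2.0, 1.0, 1.0]))
    print(likert_linear(2, 5), likert_log(2, 5))
```

Normalising the weights makes the result independent of the number of factors: three equal scores with equal weights yield that same score as the overall q. For a mid-scale response on a five-step scale, the logarithmic mapping yields a higher value than the linear one, which matches the remark that bad values have less impact under the logarithmic approach.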
We divided patient monitoring systems into four Generic Levels (GLs): (0) the patient; (I) the personal sensor network; (II) devices in the closer environment following several scenarios; and (III) the healthcare information system. We review the generic model presented by Leister et al. [18] and extended by Savola et al. [19]. This extended generic model contains three new levels related to the monitoring of chronic diseases, the communication between multiple healthcare providers, and the communication between healthcare providers and medical research institutions, respectively. Consequently, the extended generic model is composed of five levels numbered from (0) to (IV) depending on the logical distance to the patient, to whom Level (0) is assigned. Multiple types are considered at Level (II). Note that only one of these types applies at a time. However, it must be possible to switch between the types in Level (II) depending on the activity of the patient. For this purpose, the communication between Levels (II) and (III) is two-way. The key levels of our extended generic model, as shown in Fig. 3, are as follows:

(0) Patient. This is the actual patient.

(I) Personal sensor network. The personal sensor network denotes the patient and the sensors measuring the medical data. These sensors are connected to each other in a WBSN. While this sensor network can be connected randomly, in most cases one special WBSN node is appointed to be a Personal Cluster Head (PCH), which forwards the collected data outside the range of the WBSN.

(IIa) Smart home. The patient is in a smart-home environment where the personal sensor network interacts with various networks and applications within this environment. The smart home infrastructure may be connected to a healthcare enterprise infrastructure using long-distance data communication.

(IIb) Mobility. The patient is mobile, e.g., using public or personal transportation facilities.
The personal sensor network of the patient is connected to the infrastructure of a healthcare enterprise via a mobile device, e.g., a mobile Internet connection.

(IIc) Paramedic. The WBSN is connected to the medical devices of an ambulance (car, plane, and helicopter) via the PCH. The devices of the ambulance can work autonomously, showing the patient status locally. Alternatively, the devices of the ambulance can communicate with an external healthcare infrastructure, e.g., at a hospital.

(IId) Intensive care/surgery. During an operation the sensor data are transferred to the PCH or directly to the hospital infrastructure over a local area network. The sensors are in a very controlled environment, but some sensors may be very resource limited due to their size, so extra transport nodes close to the sensors may be needed.

(IIe) Pre- and postoperative. During pre- and postoperative phases of a treatment, and for use in hospital bedrooms, the sensor data are transferred from the sensor network to the PCH and then to the healthcare information system.

(IIf) Chronic disease treatment. The WBSN data are used by healthcare personnel in non-emergency treatment of individual patients with a chronic disease.

(III) Healthcare information system. This is considered a trusted environment. It consists of the hospital network, the computing facilities, databases, and access terminals in the hospital.

(IVa) Inter-healthcare provider. Information is shared between different healthcare providers concerning medical information of an individual patient.

(IVb) Healthcare provider and research. Information is shared between healthcare providers and medical research organisations for the purposes of research, new solutions development, etc.

A. The Structure of the Scenarios

Through the potential interactions between these levels, the model can support the elaboration of multiple scenarios where the actors interact by switching from one level to another. The scenarios in healthcare using biomedical sensor networks are quite complex. Therefore, they need to be efficiently structured. We consider three main scenarios (hereafter denoted as overall scenarios) and we decompose them into sub-scenarios (hereafter denoted as core scenarios). Particular attention is given to the transitions between the core scenarios, since these transitions constitute substantial sources of threats.

Here, we consider three scenarios: a home scenario A shown in Fig. 4, a hospital scenario B shown in Fig. 6, and an emergency scenario C. Each of these overall scenarios contains a set of core scenarios, which are denoted by the scenario identifier A, B, or C, followed by a dash and the core scenario number in roman numerals. The transitions between these core scenarios model the interaction between the various components of the patient monitoring system. In this paper, we focus mostly on Scenario A, where the patient is supposed to be monitored outside the hospital while performing normal daily actions.
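The naming convention for core scenarios described above can be handled mechanically. The following Python fragment is a minimal sketch, not part of the ASSET project: the function names `roman_to_int` and `parse_core_scenario` are hypothetical, and the sketch assumes lower-case roman numerals as used in this article.

```python
import re

# Values of the roman digits needed for the core scenarios (i .. xiv).
_ROMAN = {"i": 1, "v": 5, "x": 10}

# Overall scenario letter, a dash, and a lower-case roman numeral.
_IDENTIFIER = re.compile(r"^([ABC])-([ivx]+)$")


def roman_to_int(numeral: str) -> int:
    """Convert a lower-case roman numeral such as 'xiv' to an integer."""
    total = 0
    for pos, ch in enumerate(numeral):
        value = _ROMAN[ch]
        following = numeral[pos + 1] if pos + 1 < len(numeral) else None
        # Subtractive notation: a smaller digit before a larger one is subtracted.
        if following is not None and _ROMAN[following] > value:
            total -= value
        else:
            total += value
    return total


def parse_core_scenario(identifier: str) -> tuple[str, int]:
    """Split an identifier such as 'A-ii' into ('A', 2)."""
    match = _IDENTIFIER.match(identifier)
    if match is None:
        raise ValueError(f"not a core scenario identifier: {identifier!r}")
    return match.group(1), roman_to_int(match.group(2))


print(parse_core_scenario("B-xiv"))
```

Splitting the identifier this way makes it easy to see that, e.g., A-xiv and B-xiv refer to the same core scenario used in two overall scenarios.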
To extract useful technical cases for the evaluation phase we need to structure the scenario according to the patient's actions and situation. TABLE I shows a list of core scenarios used in our work, which overall scenario they belong to, and which transitions are useful. Note that other transitions are theoretically possible, but these are either unlikely or can be achieved by combining a series of transitions, e.g., taking Core Scenario A-ii (moving) as an intermediate for Overall Scenario A. Omitting unlikely transitions helps to reduce the number of states when modelling the scenarios.

TABLE I: Overview of core scenarios. The bullets mark core scenarios that are part of the respective overall scenario.

core scenario & name        A  B  C   transition to core scenario
i     home monitoring       •         ii, xiv
ii    moving                •  •      i, iii, iv, viii, vi, v
iii   public transport      •         ii
iv    vehicle transport     •         ii
v     shop                  •         ii
vi    café                  •         ii
vii   doctor's office       •         ii, xiv
viii  waiting room          •  •      vii, ix, ii
ix    diagnosis                •  •   x, xi, xii, ii
x     operation                •      xi
xi    intensive care           •      xii
xii   observation              •      ii, xi, ix
xiii  accident                    •   xiv
xiv   ambulance             •  •  •   ix

Fig. 4: The Home Scenario with the underlying core scenarios and their transitions. [Nodes: A-i home situation, A-ii moving, A-iii public transport, A-iv vehicle transport, A-v shop, A-vi café visit, A-vii doctor's office, A-viii waiting room, A-xiv ambulance.]

B. The Structure of the Home Scenario

In Scenario A, a monitored patient can be in various contexts performing normal daily actions. For example, for a patient with diabetes the following situations apply:

- The patient is at home or a nursing home using monitoring equipment.
- The patient uses sensors and communicates electronically with the doctor's office.
- The patient uses specific monitoring equipment for diabetes.
- The patient visits the doctor's office regularly and uses public transport or a car to get there.
- In the waiting room the patient can communicate data to the health care infrastructure of the doctor's office.
- The patient regularly takes walking or jogging trips.
- The patient regularly visits a café with friends; this includes walking or commuting with public transport.
- In case of an emergency or planned surgery, not necessarily related to her condition, the patient may be sent to a hospital with an ambulance.

This list of situations is not yet a useful narrative. It needs to be structured and enriched with specific context information, such as the necessary devices of the IoT, the communication channels, and the actions of the involved actors. This is done in the core scenarios, each of which describes a specific part of an overall scenario, e.g., a situation a patient experiences. Each core scenario can be part of several overall scenarios.

1) Home Situation (monitored at home) (A-i): Biomedical sensors are employed in an environment where the patient is at home or in a nursing home. The patient is monitored by a WBSN, and the sensor data and alarms can be transmitted to medical centres and emergency dispatch units. The patient uses a smartphone with health-diary software that also implements personal health records (PHR) and stores measurements continuously. Here, the sensors may not be monitoring or transmitting the physiological patient data continuously in order to reduce battery power consumption. Instead, depending on a predefined algorithm, abnormal sensor data from certain sensors may activate an alarm that is sent to a central monitoring unit. On a regular basis, the patient transmits measurements to the medical information system at the doctor's office, thus synchronising the PHR with the medical information system; the patient also has an audio/video conversation where medical questions are discussed. During these sessions the patient may take pictures with the smartphone camera or perform other measurements.

In this scenario, the following characteristics are given:
1) Ease of use and non-intrusiveness are important issues.
2) Very low power consumption, enabling a long life span of the batteries, is required.
3) A network infrastructure is available, such as access to the Internet via LAN, WLAN, or mobile networks.
4) Limited mobility; handoff is possible, but infrequent.
5) Privacy and observability of signals are important requirements.

Core Scenario A-i can be split up into several detail scenarios that may depend on the patient's activities, time of the day, or context, as shown in Fig. 5. These sub-scenarios may include the generic scenario (A-i:1), sleeping (A-i:2), kitchen work (A-i:3), visiting the bathroom (A-i:4), taking a shower (A-i:5), training (A-i:6), receiving a visitor (A-i:7), or receiving a home care nurse (A-i:8).

Fig. 5: The detail-scenarios of the home-situation. [Nodes: A-i:1 default home situation, A-i:2 sleep, A-i:3 kitchen, A-i:4 bathroom, A-i:5 shower, A-i:6 training, A-i:7 visitor, A-i:8 home care.]

All these detail scenarios create different challenges regarding security and QoS that need to be addressed by adaptive security methods.
For example, when taking a shower, the sensors may need to be unmounted, while receiving visitors may create the need to give access to selected data or devices.

2) Moving (Walking, Jogging, Cycling) Scenario (A-ii): The patient does daily training, i.e., jogs in the nearby park, or takes shorter walks from home to the public transport, to the café, shop, or doctor's office. A common feature in these situations is that the patient needs to use a smartphone as a device that collects sensor data, using the mobile networks to transmit the data. When walking, jogging, or taking a bicycle ride in the park, many other people and their devices may interfere with the communication of the smartphone.

When walking in the woods, there may be several spots which are not covered by a mobile network. In this case, the signal is so weak that only emergency calls from another provider will work. While data traffic is not possible, SMS messages can be used to send data with very low bandwidth, possibly after several retries. For an average walking trip, this outage may last for some minutes. However, SMS is asynchronous and messages may take minutes to days to arrive. Thus, it may be quicker to wait until the user, if still mobile, moves to a region where there is network coverage.

3) Transport Scenarios: We consider two transport scenarios, one with public transport, and one with commuting by car.

Core Scenario A-iii presents a situation where a patient commutes to a doctor's office or to a café using public transport. Here, the patient needs to use a smartphone as a device that collects sensor data, using the mobile networks to transmit the data. Blind spots without connectivity to a mobile network, roaming, varying data transmission quality, etc., are parts of this scenario. This scenario can be applied to long-distance trains, planes, etc.

In Core Scenario A-iv the patient uses her own or another's (private) car to commute to a shop, a café, or the doctor's office.
Here, the patient needs to use a smartphone as a device that collects sensor data, using the mobile networks or networks installed or used in the car to transmit the data. Blind spots without connectivity to a mobile network, roaming, varying data transmission quality, etc., are parts of this scenario.

4) Shop Scenario (A-v): Another situation, defined by Core Scenario A-v, is when the patient is in a shop. In addition to the conditions of A-vi, the patient is given the opportunity to check that groceries comply with the patient's diet and allergy-prohibition plans, access information from the shop, and use a shopping list.

5) Café Scenario (A-vi): The patient visits a café. Here, the patient needs to use a smartphone as a device that collects sensor data, using mobile networks or the café's WLAN zone for data transfer. Switching between the WLAN and mobile networks may occur, the WLAN may be of varying quality, many other café visitors may interfere, or the WLAN may not actually be connected to the Internet.

6) Doctor's Office Scenario (A-vii): The patient is in the doctor's office, usually after some time in a waiting room (A-viii). Here, the patient can have extra sensors attached. These extra sensors, as well as the existing sensors, can communicate with the doctor's infrastructure either through the smartphone of the patient, or directly, depending on the needs. A doctor can change a sensor's characteristics, which requires the possibility to re-program the sensor devices.

7) Waiting Room Scenario (A-viii): The patient is in a waiting room at a doctor's office or a hospital. Patients that are known to the healthcare system can be connected from their smartphone to the healthcare network; here, specific actions for collecting data from the device or other preparations can be performed. Once the patient is in the range of the waiting room, the smartphone can transfer large amounts of stored patient data directly to the infrastructure of the medical centre via short-range communication, instead of using long-range mobile communication.

8) Other scenarios: In the scenario structure we foresee that the patient can undergo a transition to other core scenarios in a different overall scenario in order to cover situations that would otherwise be outside the scenario structure. For instance, a patient could get ill and be brought to a hospital in an ambulance (B-xiv), or an emergency situation could happen (Scenario C). Note that the use of devices in the IoT could be different in Scenarios A, B, and C: as an example, in an emergency situation the use of one of the patient's own sensors would not be possible in all cases.

C. The Structure of the Hospital Scenario

In Scenario B, the biomedical sensors are used in a hospital environment. Here, the patient is located in an operating room (OR) or intensive care unit (ICU) while undergoing intensive monitoring of vital physiological parameters. Additional sensors may be required during this procedure to monitor other physiological parameters. The patient may be moved between different rooms during the treatment, e.g., from the OR to the ICU, but monitoring must continue. The sensor data may need to be transferred over different wireless networks. The system should be able to cope with a breakdown in sensor nodes, new software updates, wireless network traffic congestion, and interference from other wireless networks and biomedical devices.

In Scenario B, a fixed network infrastructure is available between Levels (II) and (III) which can be accessed by the sink nodes of the biomedical sensor network. The scenario includes a complex communication environment.
Interference from co-existing wireless networks, mobile networks, and various medical facilities is possible; this may reduce the performance of the transmission. The network topology in this scenario is fixed, but changes to the network topology may happen while patients are moving or being moved from one place to another, possibly causing handoff to other gateways. However, roaming to other networks is not part of this scenario in order to stay within the hospital domain.

Note that scenarios that seem to be similar in Scenario B and in Scenario A may have differences that are not obvious. Thus, one cannot use reasoning performed in one scenario in another without checking the context and other conditions. For instance, A-vii (doctor's office) could be different from a similar situation in a hospital (B-ix) since the hospital is connected to a different kind of network infrastructure. Usually, the primary healthcare points (doctor's office) and hospitals have different security requirements and policies.

1) Hospital Diagnosis Scenario (B-ix): The patient is examined; extra sensors are attached, and existing sensors on the patient may be accessed both directly and via the patient's smartphone. In addition, NFC tags are used to identify objects. The medical personnel can re-configure and re-program the sensors during diagnosis.

Fig. 6: The Hospital Scenario with the underlying core scenarios and their transitions. [Nodes: B-ii moving, B-viii waiting room, B-ix diagnosis, B-x operation, B-xi intensive, B-xii observation, B-xiv ambulance; connections to Overall Scenario A.]

2) Hospital Operation Scenario (B-x): The patient is undergoing surgery; extra sensors are attached, and existing sensors on the patient are accessed directly by the hospital system rather than through the smartphone of the patient. In this scenario, the QoS is set very high, while security-wise the sensors are in a protected zone. The medical personnel can re-program the sensors during the operation.
3) Hospital Intensive Care Scenario (B-xi): The patient is in intensive care after an operation. Extra sensors are attached, and existing sensors on the patient may be accessed both through the patient's smartphone and directly through the hospital infrastructure. In addition, NFC tags are used to identify objects. In most cases, the smartphone will be used as PCH. The medical personnel can re-program the sensors during intensive care.

4) Hospital Observation Scenario (B-xii): The patient is in a room under normal observation; in contrast to the home situation, the patient's smartphone has direct access to the hospital systems and will deliver data directly with higher QoS through the secured hospital systems.

D. The Structure of the Emergency Scenario

The Emergency Scenario (C) presents an emergency situation where victims are provided with sensors, and patients are transported with an ambulance (car, helicopter, plane) and delivered to the emergency reception at a hospital. In Scenario C the use of sensors is not planned beforehand, health personnel must improvise, the identity of the patient may be unknown, and the infrastructure may be partially unavailable. Despite this, the expectation is that severely injured patients are stabilised and survive the transport to the emergency reception in the best condition possible.

We include the first scenario of the Hospital Scenario, the diagnosis phase when the patient arrives, in Core Scenario C-ix. Here, the rather unplanned interventions at the emergency site are adapted to the routines at the hospital.

1) Accident Site Scenario (C-xiii): This scenario is a disaster and accident response scenario where biomedical sensors are deployed to measure values such as blood pressure, temperature, pulse and ECG in an ad-hoc network at the site of an accident. Wired or wireless communications infrastructures

may be damaged or unavailable, and a large number of severely injured people may overwhelm the emergency field personnel. This could prevent them from providing efficient and effective emergency rescue. Biomedical sensor networks can be quickly deployed to monitor vital signs. A large number of injured can be monitored simultaneously.

In this scenario, the following characteristics are given:
1) The sensor network must operate autonomously, and needs a high degree of self-organisation. The network topology is highly dynamic. Therefore, the sensor nodes should be able to discover each other and set up a sensor network autonomously.
2) A fixed network infrastructure is not available; data transferred from Level (II) to Level (III) must use a mobile network or other specific wireless network, such as microwave or digital trunk communication.
3) The radio link may be unstable and the radio link quality may vary. Additionally, the communication environment is rather complex, since many sensor nodes may be deployed in a small area, possibly causing severe channel competition.
4) High degree of mobility. Handoffs are possible and may be frequent.
5) Blue-light functionality, i.e., being able to re-use sensors on short notice with high flexibility (short-cutting some of the usual procedures).

2) Ambulance Scenario (C-xiv): The patient is in an ambulance. The sensors on the patient are connected to the ambulance's information system, which is connected to a hospital infrastructure via a mobile network connection. The patient's sensors communicate either directly with the ambulance infrastructure or via the mobile phone. The ambulance and the patient's mobile phone may use different carriers. Some properties in this scenario are common with Core Scenario iv (vehicle transport). Note that once the patient is inside the ambulance, sensors should communicate with devices in the ambulance without involving the mobile carrier.

IV. STORYLINES FOR THE SCENARIOS

The set of overall scenarios, core scenarios, and transitions can be used to create storylines that serve as case studies in ASSET. We present the storylines developed for Scenarios A and B. Parts of these storylines will be used in the following analysis to evaluate the diverse functions in the IoT. We have not yet developed a storyline for Scenario C.

A. Storyline for the Home Scenario

We developed the storyline for the home scenario as follows: Petra has both a heart condition and diabetes. In a hospital, she had two sensors placed in or on her body: one heart sensor and one blood sugar sensor. In addition, she uses external sensors, e.g., for blood pressure and heart beat, inertial sensors, etc., as well as a camera. Inertial sensors can be used to detect if Petra falls in order to automatically call for help, while cameras could be used to assess her mood [20]. Petra is living in her home, which has been prepared for the monitoring system and is commissioned with the necessary data connections so that her vital signs can be periodically reported to the healthcare personnel in Levels (II) (nurse or doctor) or (III) (patient records) as introduced in Fig. 3; several technologies can be applied to achieve this. The patient monitoring system is set up so that the sensor data are transmitted wirelessly (several transmission technologies are possible) to a smartphone that acts as PCH. The PCH communicates with the hospital infrastructure (Level (III)).

1. Petra is now being monitored at home but data are acquired remotely (A-i); the following requirements are important:
   a. Petra wants the data related to her medical condition to remain confidential from neighbours, i.e., people close-by, but outside her home. The confidentiality requirement includes physiological data, location data, data retrieved from a smart-home environment, such as temperature and humidity, as well as other metadata and health records.
   b. Petra wants her data to remain confidential from visitors, i.e., people inside her home.
2. Petra takes a bath in her home (planned sensor acquisition disruption; A-i);
   a. the sensors are water-proof; the PCH is close enough to receive signals;
   b. the sensors need to be removed;
      i. a change in the values implicitly indicates the sensor removal; or
      ii. the patient must notify the PCH about the sensors going off-line;
3. Petra is sleeping and sensors fall off (unplanned sensor acquisition disruption; A-i).
4. Petra leaves her home for training outdoors or a stroll in the park nearby (A-ii);
   a. she is walking alone with her sensors communicating to the PCH;
   b. she meets an acquaintance, Linda, who has similar sensor equipment; note that Petra's sensors could communicate through Linda's sensor network; they continue walking together;
   c. when they walk further, Petra loses the communication channel to the health care institution because of the terrain. She could either connect through the open, mobile WLAN zones that are offered or use Linda's PCH as communication channel.
5. Petra leaves her home to visit her friends in a café (A-vi, A-ii, A-iii, A-iv).
6. Petra visits her regular doctor for a check-up; the doctor's office is within walking distance from her home (A-ii, A-vii, A-viii).
7. Petra becomes ill and is transported by an emergency ambulance to the hospital (B-xiv); transition to the Overall Hospital Scenario B.
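The storyline steps above can be checked against the transitions listed in TABLE I. The following Python fragment is a minimal sketch under stated assumptions: the transition sets are transcribed from TABLE I, the function name `first_invalid_transition` is hypothetical, and the order in which Petra passes through the core scenarios of steps 5 to 7 is an illustrative assumption, since the storyline only lists the core scenarios involved.

```python
# Transitions between core scenarios, transcribed from TABLE I.
TRANSITIONS = {
    "i": {"ii", "xiv"},
    "ii": {"i", "iii", "iv", "viii", "vi", "v"},
    "iii": {"ii"},
    "iv": {"ii"},
    "v": {"ii"},
    "vi": {"ii"},
    "vii": {"ii", "xiv"},
    "viii": {"vii", "ix", "ii"},
    "ix": {"x", "xi", "xii", "ii"},
    "x": {"xi"},
    "xi": {"xii"},
    "xii": {"ii", "xi", "ix"},
    "xiii": {"xiv"},
    "xiv": {"ix"},
}


def first_invalid_transition(path):
    """Return the first (source, target) pair not listed in TABLE I, or None."""
    for source, target in zip(path, path[1:]):
        if target not in TRANSITIONS.get(source, set()):
            return (source, target)
    return None


# Assumed orderings for storyline steps 5, 6, and 7 (illustrative only).
cafe_visit = ["i", "ii", "iii", "ii", "vi", "ii", "i"]
doctor_visit = ["i", "ii", "viii", "vii", "ii", "i"]
ambulance = ["i", "xiv", "ix"]
# Going from home directly to the café skips the intermediate A-ii (moving).
shortcut = ["i", "vi"]

for name, path in [("café", cafe_visit), ("doctor", doctor_visit),
                   ("ambulance", ambulance), ("shortcut", shortcut)]:
    print(name, first_invalid_transition(path))
```

A path that is rejected here is not necessarily impossible; as noted for TABLE I, unlikely transitions were omitted and can be expressed through an intermediate core scenario such as A-ii.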

B. Storyline for the Hospital Scenario

We developed the storyline for the hospital scenario as follows: Petra has both a heart condition and diabetes. One year ago, she had two sensors placed in or on her body: one heart sensor and one blood sugar sensor that both communicate wirelessly. In addition, she uses external sensors as described for the storyline of Scenario A. Petra suddenly gets ill while being at home. This is detected by the patient monitoring system installed at her home.

1. Petra is taken in an ambulance to the hospital (B-xiv). In addition to the sensors she is using, the paramedics use EEG and ECG sensors. The information from all sensors is available in the ambulance from three possible sources:
   a. information received directly from the sensors, available on the displays in the ambulance;
   b. information received from the PCH that Petra is using;
   c. information received from the healthcare records.
2. After the ambulance arrives at the hospital, Petra is moved to a room where diagnosis of her condition is performed (B-ix). Different sensors are used to find out her condition. These sensors are removed after diagnosis.
3. It becomes clear that Petra needs to undergo surgery (B-x). During surgery sensors are used to measure certain biomedical values. However, the medical procedure also creates electromagnetic noise in the same band that the data transmission between sensors uses.
4. After the surgery, Petra is moved to intensive care (B-xi) where a variety of sensors are used to observe her biomedical values.
5. After two days, Petra is moved to a recovery room with three other patients to allow time for her surgery wound to heal and for observation (B-xii). In addition to the heart and blood sugar sensors, two additional sensors are now used, but these will be removed after the observation phase is over. The two other patients in the same room are using the same kind of sensors.
   a. The sensors Petra is using transmit their readings to her PCH.
   b. Petra's additional sensors transmit their readings to a base station in the patients' room, while her ordinary sensors still report to her PCH.
6. Petra is discharged from hospital; transition to Overall Scenario A.

C. Applying the Storylines

As described by Savola and Abie [21] and Savola et al. [19], the data integrity, privacy, data confidentiality, availability, and non-repudiation requirements should be met for all core scenarios and communication levels presented in Section III, specifically end-user authentication and authorisation for scenarios in Levels (0)-(II), sensor and WBSN authentication for scenarios in Level (I), service provider user authentication for scenarios in Levels (III) and (IV), and service provider user authorisation in Levels (III) and (IV). This is also true for both storylines described above, since these scenarios apply to both storylines but in varying situations and contexts. The adaptive security requirements for both storylines can therefore be summarised as follows:

1) End-user authentication and authorisation: The adaptive authentication mechanisms must cope with changing context of use, security threats, and user behaviour in order to enforce context-aware authentication mechanisms in an efficient and usable manner.

2) Sensor and WBSN authentication: Adaptive authentication mechanisms must cope with critical decisions to be made by the end-user and the service provider user based on the sensor input in order to minimise the possibility of fake sensors in possibly varying situations.

3) Service provider user authentication: Adaptive authentication mechanisms must cope with changing demands depending on the privacy level and the official authorisation level for making treatment decisions.
4) Service provider user authorisation: Adaptive authorisation techniques must cope with setting adequate requirements and enforcing sufficient authorisation mechanisms based on the strength of the authentication, context, and user role.

5) Data integrity (all levels): Adaptive data integrity techniques must maintain adequate data integrity, especially during alarm situations, allowing patients' health security and longer-term treatment decisions.

6) Privacy and data confidentiality (all levels): Adaptive security decision-making must adapt to privacy and data confidentiality requirements based on the data processing needs, roles of stakeholders, regulations and legislation, and the privacy level of data indicated by privacy metrics. Since context can affect privacy, adaptive security must be able to adapt to different types of context such as time, space, physiological parameter sensing, environmental sensing, and noisy data. The context must also be collected and evaluated in real time in a secure and accurate manner.

7) Availability (all levels): Adaptive techniques must balance the load in the system and use resilience solutions to maintain adequate availability, which is critical for health and life.

8) Non-repudiation (all levels): Adaptive authentication mechanisms must ensure an adequate non-repudiation level despite changing conditions and selection of security controls.

Walking through these storylines or analysing them for threats will show that the above adaptive security requirements must be met for their success and proper functioning. For example, the security requirement pointed out in Step 1.a of the storyline is related to confidentiality and privacy, which are often emphasised in healthcare. Strong confidentiality algorithms, key distribution, associated processes, and compliance with appropriate privacy legislation and regulations are crucial.

V. EVALUATING THE HOME SCENARIO

We use selected parts of Scenario A to illustrate how to use the ASSET framework. We go through the scenario

description, and comment on the use of the framework. Note, however, that the numerical values are for illustration purposes. These values are based on rough estimates instead of a careful assessment. Different methods for assessment were proposed above in Section II-C, but applying and evaluating the different methods remain future work.

TABLE II: Numeric results for Example 1: applying the ASSET framework using the logarithmic approach from eq. (3). [Columns: $S_1$, $S_2$, $S_3$, $S_4$, $Q_1$, $Q_2$, $C$, each with $q$ and $\hat{q}$, and $q_{total}$; rows: weights $w_i$, Case I, Case II, Case III, Case IV, Case V, Case III+IV.]

Fig. 7: Visualising the results for $q_{total}$ from Example 1 for the cases I, II, III, IV, V, and III+IV. The dark blue bars represent the results using the logarithmic function $e_\beta$, as shown in eq. (3), while the light blue bars represent the results using the linear function $e_\alpha$, as shown in eq. (2).

TABLE III: The 11-value scale for $q_{S_2}$ of Example 1

$q_{S_2}$   Description
10    not observable outside apartment
9     barely observable in adjacent apartments; cannot be interpreted
8     barely observable in adjacent apartments; need advanced equipment to interpret
7     observable in parts of adjacent apartments, but not beyond
6     well observable in adjacent apartments, but not beyond
5     observable in range > 30m; on street
4     observable in range > 50m on street
3     observable in range > 100m on street
2     observable on street from running car
1     observable through wide-range network
0     n/a

A. Confidentiality and Observability

In the storyline of the Home Scenario, Petra is monitored at home with the requirement that she wants her data to be confidential for people inside and outside her home. Let us assume that the properties of data observability and data confidentiality are essential in this first case, i.e., are in $S$. Here, data observability means that a third party can observe the signal sent from a device and, thus, deduce the existence of this device and some meta-data.
For instance, neighbours of Petra may observe the signals from her sensors and make assumptions about her health conditions from this. As countermeasures, the apartment could be shielded or the signal strength of the sensors could be reduced. Shielding the apartment is too expensive, whereas reducing the signal strength could affect data availability, since some corners of Petra's apartment would not be covered. Data confidentiality means that a third party cannot interpret the received signals. Cryptographic methods and authentication are often used to assure data confidentiality. When threats occur, a countermeasure could be to use a different cryptographic method or authentication protocol. However, using a different cryptographic method could have a negative impact on performance or battery consumption.

For a numeric example, here denoted as Example 1, we use the following variables: q_S1 is the value for observability inside the apartment; q_S2 is the value for observability outside the apartment; q_S3 is the value for confidentiality; q_S4 is the value for availability; q_Q1 is the value for bandwidth; q_Q2 is the value for battery consumption; and q_C are other mitigation costs. Recall that the value of q_i indicates how far a given requirement is from the ideal fulfilment, where 1 is complete fulfilment of the requirement.

We use the following cases: I) the base case, i.e., the apartment is not shielded, rather simple encryption algorithms and authentication protocols are used, and sensors transmit at normal power; II) shielding the apartment; III) reducing transmission power; IV) using a different encryption algorithm; and V) using a different authentication protocol.

As outlined in Section II-B, for objective assessment we need to establish a scale using n steps, similar to the Likert scale. As an example, we present a possible scale with 11 values for the requirement q_S2 (observability outside the apartment) in TABLE III.
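The variables, cases, and scale described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: eqs. (1) to (3) are not reproduced in this section, so the weighting `q ** w` is a placeholder chosen because it keeps a value of 0 for one requirement as a veto on the whole alternative; it is not claimed to match the linear or logarithmic functions of the paper. The names `normalise`, `weighted`, `q_total`, `WEIGHTS`, and `CASES`, as well as all levels and weights, are hypothetical and are not the values of TABLE II.

```python
from math import prod

SCALE_STEPS = 10  # the 11-value scale of TABLE III runs from 0 to 10


def normalise(level, steps=SCALE_STEPS):
    """Map a scale level (0..steps) to q in [0, 1], where 1 is complete fulfilment."""
    if not 0 <= level <= steps:
        raise ValueError(f"level {level} is outside the scale 0..{steps}")
    return level / steps


def weighted(q, w):
    """Placeholder weighting q_hat = q ** w (an assumption, not eq. (2) or eq. (3))."""
    return q ** w


def q_total(levels, weights, weighting=weighted):
    """Multiply the weighted partial values of all requirements."""
    if levels.keys() != weights.keys():
        raise ValueError("levels and weights must cover the same requirements")
    return prod(weighting(normalise(levels[k]), weights[k]) for k in levels)


# Hypothetical weights and assessed levels, for illustration only.
WEIGHTS = {"S1": 1.0, "S2": 1.0, "S3": 1.0, "S4": 0.5, "Q1": 0.5, "Q2": 0.5, "C": 0.5}
CASES = {
    "base": {"S1": 8, "S2": 5, "S3": 6, "S4": 10, "Q1": 9, "Q2": 9, "C": 10},
    "reduced_power": {"S1": 9, "S2": 8, "S3": 6, "S4": 7, "Q1": 9, "Q2": 10, "C": 9},
}

# Rank the alternatives from best to worst by their aggregated value.
ranking = sorted(
    ((name, q_total(levels, WEIGHTS)) for name, levels in CASES.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Keeping the weighting function as a parameter means that the functions actually defined in eqs. (2) and (3) can be passed in without changing the aggregation or the ranking step.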
The value q_S2 = 0 is marked as not applicable to indicate that, for observability outside the apartment, no situation is considered totally unacceptable. Note that assigning q_S2 = 0 would imply q = 0 for this alternative, i.e., it would be marked as totally unacceptable.

In an experiment, we assessed the values for q_Si (i = 1...4), q_Qi (i = 1...2), and q_C by using a rough estimate. We also assigned values for the weights w_i using intuition; we are aware that these values need to be assessed more thoroughly at a later stage. The assessment values, weights, and results for q̂_i and q_total are shown in TABLE II for the logarithmic approach from eq. (3). We also applied the linear approach from eq. (2) to the same data. Both results for q_total are visualised in Fig. 7.

In our example we see that the logarithmic approach and the linear approach show similar behaviour with respect to ranking the alternatives. However, the logarithmic approach results in higher values and smaller differences between the intermediate values. In this particular example, a combination of cases III and IV gives the best result, while case II delivers the lowest result, which is reasonable.

Fig. 8: Visualising the results from Example 2 (Cases I, II, III, IV, and V). The dark blue bars represent the results using the logarithmic function e_β, as shown in eq. (3), while the light blue bars represent the results using the linear function e_α, as shown in eq. (2).

TABLE IV: Example 2 for applying the ASSET framework using the logarithmic approach from eq. (3). Columns: q_i for S_1, S_2, S_3, S_4, Q_1, Q_2, and C, and q_total; rows: weights w_i and Cases I, II, III, IV, and V.

B. Assessment of Changes in Time

As Example 2 we use the part of the storyline where Petra is taking a stroll in the park. We assume that her sensors are connected wirelessly to her smartphone in its function as a PCH, and that the PCH communicates with the health care infrastructure through a public wireless network offered by a telephony provider. Further, we assume that her smartphone can connect using a WLAN. In this example, we use different definitions for q_S1 and q_S2 by using the observability of the sensors and the PCH, respectively. We take into account effects for wide area networks that indicate that battery consumption is higher when the signal strength from the base station is weak or the connection is lost.

For a numeric example we use the following variables: q_S1 is the value for observability of the sensors; q_S2 is the value for observability of the PCH; q_S3 is the value for confidentiality; q_S4 is the value for availability; q_Q1 is the value for bandwidth; q_Q2 is the value for battery consumption; and q_C are other mitigation costs.
We use the following cases from the storyline of Scenario A: I) walking alone in the park; II) meeting Linda; III) losing connection; IV) connecting to an open, mobile WLAN; and V) using Linda's PCH as communication channel.

In an experiment, as above, we assessed the values for q_Si (i = 1...4), q_Qi (i = 1...2), and q_C by using a rough estimate and assigned values for the weights w_i using intuition. The assessment values q_i, weights, and q_total are shown in TABLE IV for the logarithmic approach from eq. (3). We also applied the linear approach from eq. (2) to the same data. Both results for q_total are visualised in Fig. 8.

In this example we see how the security situation changes due to changes of the context (I → II → III), i.e., when Petra meets Linda or Petra loses connection. This example also shows that the assessment can give a hint as to which of two possible actions (IV or V) promises the better security situation.

VI. APPLYING THE FRAMEWORK TO ADAPTIVE SECURITY

Abie and Balasingham [2] define the term adaptive security as a security solution that learns, and adapts to changing environment dynamically, and anticipates unknown threats without sacrificing too much of the efficiency, flexibility, reliability, and security of the IoT system. Abie and Balasingham present the Adaptive Risk Management (ARM) framework that is based on a feedback loop known from cybernetics [22] with the five measures (i) identify, (ii) analyse, (iii) plan, (iv) track, and (v) control. This results in four steps in the adaptation loop, aligned to ISO/IEC 27005:2008 [10] and the Plan Do Check Act (PDCA) model of ISO/IEC 27001:2005 [23].

Fig. 9: The Adaptive Security concept, adapted for the IoT by Abie [24].

Abie [24] presented a functional description of the concept of adaptive security for a message-oriented infrastructure; he adapted this concept to the IoT, as shown in Fig. 9.
He identified the following functionality as essential for implementing adaptive security:
a) being self-aware using a feedback loop and a history database;
b) being context-aware using sensors and feedback from other nodes in the IoT;
c) using security metrics to process the data from the sensors and the other nodes;
d) using risk and threat estimation and prediction;
e) using security metrics as defined by Savola et al. [19];
f) using methods such as Bayesian networks [25], game theory, Markov chains, etc. to support the threat estimation and prediction;
g) using a decision making module to enforce appropriate security and privacy level; and
h) communicating data to other nodes in the IoT.

A. Integrating the Estimation Function to Adaptive Security

In the adaptive security concept, the Monitor receives data from sensors, detectors, and other sources that are further used in the Analyser/Learner to make adaptive decisions. In this context, the ASSET evaluation framework can be used to provide the ground truth data a) to train the learning algorithms employed in the evaluation loop, and b) to evaluate whether the behaviour of the adaptive algorithms is reasonable.

For this, we use the following recipe: We use the storylines as in Section V, where we assess the values q_i for all useful cases that can appear and calculate q with the suitable weights. On the other hand, the Monitor receives k measured values from diverse sensors and detectors. These measurements are denoted as s_k. We postulate a function u(s_k) that ideally is designed such that u(s_k) = q for q as defined in eq. (1) for all relevant situations from the scenarios. In that way, using the function u, the input from the sensors and detectors will generate the same value as the assessment suggests. Alternatively, we can postulate functions u_i(s_k) where u_i(s_k) = q_i for all relevant situations. It is intended that the function u_i will generate the same value for each partial product as the assessment suggests.

The functions u(s_k) and u_i(s_k) could be instantiated in a learning phase. However, the adaptive security model can also handle this dynamically, so that the definition of these functions can vary over time. The evaluation of the functions u(s_k) and u_i(s_k), respectively, will be handled in the Analyser and Learner components using regression, Bayesian networks, game theory, or similar. On the basis of the evaluated values from these functions, the Adaptor takes the necessary decisions.

Fig. 10: Integration of the estimation framework into the adaptive security model.

Fig. 10 illustrates how the estimation function can be integrated into the adaptive security model. Based on the sets of system states, the assessment of the values q, q_Si, q_Qi, and q_C is performed using a panel of users and specialists. This is shown on the right side of Fig. 10. These values are used to calibrate the estimation model. When the estimation function is established, these values can also be compared with the estimation function for validation purposes. On the left side of Fig. 10, a part of the adaptation loop is shown with the components Monitor, Analyser, and Adaptor. The Monitor component retrieves data from sensors and adaptors to a set of measured parameters. Using the estimation model, the Analyser performs its tasks and forwards the calculated values to the Adaptor component.

B. Evaluation Methods

For the purpose of evaluating the behaviour of the adaptive security methods, we intend to employ the scenarios and storylines presented in Section V together with implementations in a lab [26], simulation, and formal reasoning [27]. In an evaluation, we will go through each situation of the storylines, and assess or calculate the values q, q_Si, q_Qi, q_C, u(s_k), and u_i(s_k) as necessary.

In a lab, one could build the equipment with all necessary functionality. According to the evaluation method, one will go through all states and situations defined by the storyline and assess or calculate all relevant values according to our framework. Thereafter, the adaptation algorithm will be applied, resulting in new states that are evaluated by assessing and calculating the relevant values. Comparing the calculated values u(s_k) and u_i(s_k) after each adaptation step with the desired and assessed values for q, q_Si, q_Qi, and q_C will give evidence of the behaviour of the adaptation algorithm. The goal is to evaluate whether the behaviour of the adaptation loop is close to the right decisions deduced from the assessment.

Note that in the absence of a lab, simulations or the use of formal methods can be considered. Here, instead of implementing the devices in real hardware, the essential functionality is implemented in a model, and simulation and model checking techniques are used [27], [28].

VII. DISCUSSION AND RELATED WORK

Our framework supports the evaluation of security solutions and provides a means to assess data for development and calibration of the estimation model. In this section, we discuss several issues regarding our framework. We also relate our work to frameworks that are described in the literature.

A. Issues and Concerns

The estimation model's design and the function u(s_k) are not the focus here, but our framework can calibrate an estimation model. In principle, methods from machine learning, such as regression analysis, Bayesian networks, fuzzy logic, or game theory can be used to develop the function u(s_k) introduced in Section VI. The assessed data will be used as training data, while only the measurable data from sensors will be used in the adaptive security concept. Using this concept, the estimation model and the function u(s_k) will respond with a sufficiently correct estimate as long as the particular case has been part of the scenarios and storylines used in the assessment. Cases that are not covered in the assessment can still be estimated, but we cannot predict the appropriateness of the estimate. Thus, the framework needs to monitor continuously whether all relevant cases are covered,
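As one possible illustration of how assessed data could serve as training data for the function u(s_k) introduced in Section VI, the following minimal Python sketch fits a linear model by ordinary least squares and then compares the estimates with the assessed values. The linear form of u is an assumption made here for brevity; the text names regression only as one option among several. The functions `solve`, `fit_u`, and `mean_abs_error`, and all measurements and q values, are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the situations do not determine u")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_u(measurements, assessed_q):
    """Least-squares fit of u(s) = c0 + c1*s1 + ... + ck*sk to the assessed q values."""
    rows = [[1.0] + list(s) for s in measurements]
    k = len(rows[0])
    # Normal equations: (A^T A) c = A^T q
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atq = [sum(r[i] * q for r, q in zip(rows, assessed_q)) for i in range(k)]
    coeffs = solve(ata, atq)

    def u(s):
        value = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], s))
        return min(1.0, max(0.0, value))  # keep the estimate inside [0, 1]

    return u


def mean_abs_error(u, measurements, assessed_q):
    """Average distance between the estimates u(s) and the assessed values q."""
    return sum(abs(u(s) - q) for s, q in zip(measurements, assessed_q)) / len(assessed_q)


# Hypothetical situations: two normalised sensor readings per situation,
# each paired with an assessed q from a (fictional) panel.
situations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
assessed = [0.2, 0.7, 0.5, 1.0, 0.6]

u = fit_u(situations, assessed)
print(f"mean absolute error on assessed cases: {mean_abs_error(u, situations, assessed):.6f}")
print(f"estimate for an unassessed case (0.5, 0.0): {u((0.5, 0.0)):.3f}")
```

The error is computed only on situations that were part of the assessment; the last line produces an estimate for a case outside the assessed set, for which, as noted above, the appropriateness of the estimate cannot be predicted.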
