Wednesday, December 21, 2011

2011 was a very busy year for WSO2 and specially for the Security team. We have 6 members in the team and it was hardly found a day all six were in Colombo WSO2 office...

Thilina was in Denmark in March for a customer engagement. In May he was in Switzerland and in July Thilina was in Denver for Cloud Identity conference. In November he was in Sweeden for another customer engagement.

Amila was in California from February to April. And back in US, in Raleigh, NC in September for a customer engagement. In October he was in New York for another customer engagement.

Asela was out during March/April for a customer engagement in North Carolina. In July Asela was in Denver for Cloud Identity conference and in New York once again for a customer engagement. In August he was at WSO2 Palo Alto office and did the Could Security workshop there with Asanka and me. Again in November he left to San Diego for a customer engagement. Just after one week he returned back to SL - he left to Germany for another customer engagement.

Hasini was out of the country in June - for a customer engagement in Indianapolis.

Manjula was in Thailiand in November to present WSO2 Identity Server at APICTA. And he left to Germany soon after that for a customer engagement - which finished in December.

I was in New York, Washington & Dallas in March for the WSO2 SOA Security & Identity Workshops and in July was in Denver for Cloud Identity conference and in New York once again for a customer engagement. In August I was at WSO2 Palo Alto office and did the Could Security workshop there with Asanka and Asela. Once again I was back in New York in October for another customer engagement and also for a workshop on Cloud Security. In November I was in Vancouver - Canada for the ApacheCon.

It's great to see the entire team is back in Colombo, by the end of the year...

Tuesday, December 20, 2011

OAuth 1.0 emerged from the large social providers like Facebook, Yahoo!, AOL, and Google. Each had developed its own alternative to the password anti-pattern. OAuth 1.0 reflected their agreement on a single community standard.

In 2009, an attack on OAuth 1.0 was identified which relied on an attacker initiating the OAuth authorization sequence, and then convincing a victim to finish the sequence – a result of which would be the attacker’s account at an (honest) client being assigned permissions to the victim’s resources at an (honest) RS.

OAuth 1.0a was the revised specification version that mitigated the attack.

In 2009, recognizing the value of more formalized standardization, that community contributed OAuth 1.0 to the IETF. It was within the IETF Working Group that the original OAuth 1.0 was reworked and clarified to become the Informative RFC 5849.

In 2010, Microsoft, Yahoo!, and Google created the Web Resource Authentication Protocol (WRAP), which was soon submitted into the IETF WG as input for OAuth 2.0. WRAP proposed significant reworking of the OAuth 1.0a model.

Among the changes were the deprecation of message signatures in favor of SSL, and a formal separation between the roles of ‘token issuance’ and ‘token reliance.’

Development of OAuth 2.0 in the IETF consequently reflects the input of both OAuth 1.0, OAuth 1.0a, and the WRAP proposal. It is fair to say that the very different assumptions about what are appropriate security protections between OAuth 1.0a and WRAP have created tensions within the IETG OAuth WG.

While OAuth 2.0 initially reflected more of the WRAP input, lately (i.e. fall 2010) there has been a swing in group consensus that the signatures of OAuth 1.0a that were deprecated by WRAP are appropriate and desirable in some situations. Consequently, signatures are to be added back as an optional security mechanism.

While many deployments of OAuth 1.0a survive, more and more OAuth 2.0 deployments are appearing – necessarily against a non-final version of the spec. For instance, Facebook, Salesforce, and Microsoft Azure ACS all use draft 10 of OAuth 2.0.

[The above paragraphs are direct extracts from the white-paper published by Ping Identity on OAuth]

OAuth provides a method for clients to access server resources on behalf of a resource owner (such as a different client or an end-user). It also provides a process for end-users to authorize third-party access to their server resources without sharing their credentials (typically, a username and password pair), using user-agent redirections.

In the traditional client-server authentication model, the client requests an access restricted resource (protected resource) on the server by authenticating with the server using the resource owner's credentials. In order to provide third-party applications access to restricted resources, the resource owner shares its credentials with the third-party. This creates several problems and limitations.

1. Third-party applications are required to store the resource owner's credentials for future use, typically a password in clear-text.

2. Servers are required to support password authentication, despite the security weaknesses created by passwords.

3. Third-party applications gain overly broad access to the resource owner's protected resources, leaving resource owners without any ability to restrict duration or access to a limited subset of resources.

4. Resource owners cannot revoke access to an individual third-party without revoking access to all third-parties, and must do so by changing their password.

5. Compromise of any third-party application results in compromise of the end-user's password and all of the data protected by that password.

OAuth addresses these issues by introducing an authorization layer and separating the role of the client from that of the resource owner.

The protocol centers on a three-legged scenario, delegating User access to a Consumer for resources held by a Service Provider. In many cases, a two-legged scenario is needed, in which the Consumer is acting on behalf of itself, without a direct or any User involvement.

OAuth was created to solve the problem of sharing two-legged credentials in three-legged situations. However, within the OAuth context, Consumers might still need to communicate with the Service Provider using requests that are Consumer-specific. Since the Consumers already established a Consumer Key and Consumer Secret, there is value in being able to use them for requests where the Consumer identity is being verified.

This specification defines how 2-legged OAuth works with OAuth 1.0. But it never became an IETF RFC.

With OAuth 1.0 - 2-legged OAuth includes two parties. The consumer and the service provider. Basically in this case consumer also becomes the resource owner. Consumer first needs to register a consumer_key and consumer_secret with the service provider. To access a Protected Resource, the Consumer sends an HTTP(S) request to the Service Provider's resource endpoint URI. The request MUST be signed as defined in OAuth Core 1.0 section 9 with an empty Token Secret.

All the requests to the Protected Resources MUST be signed by the Consumer and verified by the Service Provider. The purpose of signing requests is to prevent unauthorized parties from using the Consumer Key when making Protected Resources requests. The signature process encodes the Consumer Secret into a verifiable value which is included with the request.

OAuth does not mandate a particular signature method, as each implementation can have its own unique requirements. The protocol defines three signature methods: HMAC-SHA1, RSA-SHA1, and PLAINTEXT, but Service Providers are free to implement and document their own methods.

The Consumer declares a signature method in the oauth_signature_method parameter, generates a signature, and stores it in the oauth_signature parameter. The Service Provider verifies the signature as specified in each method. When verifying a Consumer signature, the Service Provider SHOULD check the request nonce to ensure it has not been used in a previous Consumer request.

The signature process MUST NOT change the request parameter names or values, with the exception of the oauth_signature parameter.

2-legged OAuth with OAuth 1.0 - the request to the protected resource will look like following.

1. Get a token from the authorization server
2. Use the token to access the resource server

Let's see how the above two steps work under 2-legged OAuth.

OAuth 2.0 defines a concept called - "authorization grant" - which is a credential representing the resource owner's authorization (to access its protected resources) used by the client to obtain an access token. This specification defines four grant types.

"Client Credentials" is the grant type which goes closely with 2-legged OAuth.

With "Client Credentials" grant type, the client can request an access token using only its client credentials (or other supported means of authentication) when the client is requesting access to the protected resources under its control.

Once the client makes this request to the authorization server - it will return back an access token to access the protected resource.

The access token returned back to the client could be either of type bearer of MAC.

The "mac" token type defined in ietf-oauth-v2-http-mac is utilized by issuing a MAC key together with the access token which is used to sign certain components of the HTTP requests by the client when accessing the protected resource.

The MAC scheme requires the establishment of a shared symmetric key between the client and the server. This is often accomplished through a manual process such as client registration.

The OAuth 2.0 specification offers two methods for issuing a set of MAC credentials to the client using..

1. OAuth 2.0 in the form of a MAC-type access token, using any supported OAuth grant type. [This is what we discussed above - an access token with 'MAC' type]

Bearer type is defined here. It's a security token with the property that any party in possession of the token (a "bearer") can use the token in any way that any other party in possession of it can. Using a bearer token does not require a bearer to prove possession of cryptographic key material (proof-of-possession).

When using Bearer type access tokens with 2-legged OAuth - the request to the protected resource will look like following.

Also - the issued access token from the Authorization Server to the client, has an 'scope' attribute. [2-legged OAuth with OAuth 1.O doesn't have this scope attribute as well as access token concept - so resource server has to perform authorization separately based on the resource client going to access]

The client should request access tokens with the minimal scope and lifetime necessary. The authorization server will take the client identity into account when choosing how to honor the requested scope and lifetime, and may issue an access token with a less rights than requested.

When securing APIs with OAuth - this 'scope' attribute can be bound to different APIs. So, the authorization server can decide whether to let the client access this API or not.

Thursday, December 15, 2011

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set.

IDC estimated the digital universe to be around 1.8 zettabytes by 2011.

How big is a zettabyte? It's one billion terabytes. The current world population is 7 billion - that is, if you give a hard disk of 250 billion GB for each person on the earth - still that storage won't be sufficient.

Many sources contribute to this flood of data...

1. The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Facebook hosts approximately 10 billion photos taking up one petabytes of storage.
3. Ancestry.com, the genealogy site, store around 2.5 petabytes of data.
4. The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
5. The Large Harden Collider near Geneva will produce about 15 petabytes of data per year.
6. Everyday people create the equivalent of 2.5 trillion bytes of data from sensors, mobile devices, online transactions & social networks.

Facebook, Yahoo! and Google found themselves collecting data on an unprecedented scale. They were the first massive companies collecting tons of data from millions of users.

They quickly overwhelmed traditional data systems and techniques like Oracle and MySql. Even the best, most expensive vendors using the biggest hardware could barely keep up and certainly couldn’t give them tools to powerfully analyze their influx of data.

In the early 2000’s they developed new techniques like MapReduce, BigTable and Google File System to handle their big data. Initially these techniques were held proprietary. But they realized making the concepts public, while keeping the implementations hidden, will benefit them - since more people will contribute to those and the graduates they hire will have a good understanding prior to joining.

Around 2004/2005 Facebook, Yahoo! and Google started sharing research papers describing their big data technologies.

In 2004 Google published the research paper "MapReduce: Simplified Data Processing on Large Clusters".

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in this paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Google's implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable.

A typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers and the system easy to use. Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

Doug Cutting who worked for Nutch, an open-source search technology project which are now managed through the Apache Software Foundation, read this paper published by Google and also another paper published by Google on Google's distributed file system [GFS]. He figured out GFS will solve their storage needs and MapReduce will solve the scaling issues they encountered with Nutch and implemented MapReduce and GFS. They named the GFS implementation for Nutch as the Nutch Distributed Filesystem [NDFS].

NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent sub project of Lucene called Hadoop and NDFS, became HDFS [Hadoop Distributed File System] - which is an implementation of GFS. During the same time Yahoo! extended their support for Hadoop and hired Doug Cutting.

At a very high-level, this is how HDFS works. Say we have a 300 MB file. [Hadoop also does really well with files of petabytes and terabytes.] The first thing HDFS is going to do is to split this up in to blocks. The default block size on HDFS right now is 128 MB. Once split-ed in to blocks we will have two blocks of 128 MB and another of 44 MB. Now HDFS will make 'n' number of ['n' is configurable - say 'n' is three] copies/replicas of each of these blocks. HDFS will now store these replicas in different DataNodes of the HDFS cluster. We also have a single NameNode, which keeps track of replicas and the DataNodes. NameNode knows where a given replica resides - whenever it detects a given replica is corrupted [DataNode keeps on running checksums on replicas] or the corresponding HDFS node is dowm, it will find out where else that replica is in the cluster and tells other nodes do 'n'X replication of that replica. The NameNode is a single point of failure - and two avoid that we can have secondary NameNode which in sync with the primary -and when primary is down - the secondary can take control. Hadoop project is currently working on implementing distributed NameNodes.

Again in 2006 Google published another paper on "Bigtable: A Distributed Storage System for Structured Data"

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size, petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. This paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and describes the design and implementation of Bigtable.

BigTable maps two arbitrary string values (row key and column key) and timestamp (hence three dimensional mapping) into an associated arbitrary byte array. It is not a relational database and can be better defined as a sparse, distributed multi-dimensional sorted map.

Basically BigTable discussed how to build a distributed data store on top of GFS.

HBase by Hadoop is an implementation of BigTable. HBase is a distributed, column oriented database which is using HDFS for it's underlying storage and supports both batch-style computation using MapReduce and point queries.

Amazon, published a research paper in 2007 on "Dynamo: Amazon’s Highly Available Key-value Store".

Dynamo, is a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. Apache Cassandra — brings together Dynamo's fully distributed design and BigTable's data model and written in Java - open sourced by Facebook in 2008. It is a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010. In fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon. However, Facebook abandoned Cassandra in late 2010 when they built Facebook Messaging platform on HBase.

Also, besides using the way of modeling of BigTable, it has properties like eventual consistency, the Gossip protocol, a master-master way of serving the read and write requests that are inspired by Amazon's Dynamo. One of the important properties, the Eventual consistency - means that given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.

I used the term 'NoSQL' when talking about Cassandra. NoSQL (sometimes expanded to "not only SQL") is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally.

The name "NoSQL" was in fact first used by Carlo Strozzi in 1998 as the name of file-based database he was developing. Ironically it's relational database just one without a SQL interface. The term re-surfaced in 2009 when Eric Evans used it to name the current surge in non-relational databases.

As per Marin Dimitrov, following are the use cases for NoSQL databases - in other words following are the cases where relational databases do not perform well.

1. Massive Data Volumes
2. Extreme Query Volume
3. Schema Evolution

With NoSQL, we get the advantages like, Massive Scalability, High Availability, Lower Cost (than competitive solutions at that scale), Predictable elasticity and Schema flexibility.

For application programmers the major difference between relational databases and the Cassandra is it's data model - which is based on BigTable. The Cassandra data model is designed for distributed data on a very large scale. It trades ACID-compliant data practices for important advantages in performance, availability, and operational manageability.

If you want to compare Cassandra with HBase, then this is a good one. Another HBase vs Cassandra debate is here.

Tuesday, December 6, 2011

What is meant by Symmetric ? Both the parties, client and server use the same key to encrypt and sign.

Now the question is how to establish this key..? Anyone of the two parties can generate the key, but.. how do we pass the generated key to the other end...

This is how it works in Web Services security...

1. Initiator generates a key
2. Signs/Encrypts the message with the generated key
3. Encrypts the key with the public key of the recipient
4. Builds an encrypted key element with the output from [3] and associates an ID with that element and stores it in-memory using ID as the key. [This is how Rampart stores it]

5. EncryptedKey element will be included in the Security header of the SOAP message going from the sender to the recipient.

Here what you see under CipherValue element is the encrypted generated key.

The value of KeyIdentifier, which is r3iHLvhEdbQLQGh0iuDzzJMBz40=, is the base64 encoded SHA1 value of the fingerprint of the recipient's public key. Looking at this fingerprint value, recipient can pick the corresponding private key to decrypt the message and get the generated key out.

Also, let's have a look at the Algorithm attribute of EncryptionMethod element. This is the algorithm used to encrypt the generated key - and which to use is based on the Algorithm Suite defined in your security policy. In this case I have used Basic256 as the Algorithm Suite - so it uses rsa-oaep-mgf1p as the asymmetric key wrapping algorithm.

Now let's see what happens at the recipient end...

1. Recipient gets the message
2. Finds the private key corresponding to the fingerprint value in EncryptedKey element
3. Decrypts the encrypted generated key
4. Decrypts the message and verifies the signature using the key from [3]
5. Generates the response
6. Signs/Encrypts the response from the same key from [3]
7. Now it generates the SHA1 of the EncryptedKey element it receives from the client and adds the base64 encoded value to the response
8. Sends the response to the client

Once the client gets the message, it will perform following validations..

1. Client gets the message
2. Goes through all the stored EncryptedKey elements in-memory to check whether the SHA1 hash of any of them matches with the hash value in the response. If any match found, that's the Encrypted key.
3. EncryptedKey element in the memory also maintains the generated key in clear text, so client can find it
4. From the key found in [3]. client validates the message

With Symmetric binding, only the recipient needed to have a public/private key pair.

But in Asymmetric binding both the parties should have their own key pairs.

Even it's Asymmetric, the encryption happens with a generated symmetric key - the reason is Asymmetric encryption is resource consuming and also cannot operate on a large amount of data... so, WS-Security specification recommends to use symmetric key encryption with a generated key, even with the Asymmetric binding.

The major difference between the Asymmetric and the Symmetric is the way Signature been handled.

With Symmetric binding, both the request and the response are signed using the same generated key.. But in Asymmetric binding, the request is signed using the sender's private key and the response is signed using recipient's private key. In other words, Asymmetric provides a guarantee on non-repudiation while Symmetric binding does not.

Let's see how Asymmetric binding works in Web Services security..

1. Initiator/client generates a key
2. Encrypts the message with the generated key
3. Signs message with it's own private key
4. Encrypts the generated key with the public key of the recipient
5. Builds an encrypted key element with the output from [4] and associates an ID with that element. Do NOT store it in-memory as in the case of Symmetric.

1. Recipient gets the message
2. Finds the private key corresponding to the fingerprint value in EncryptedKey element
3. Decrypts the encrypted generated key
4. Decrypts the message using the key from [3]
5. Verifies the signature of the message using the public key of the sender
6. Generates the response
7. Generates a new key
8. Encrypts the message with the generated new key
9. Signs message with it's own private key
10. Encrypts the generated key with the public key of the initiator [the client]
11. EncryptedKey element will be built with the encrypted generated key and included in the Security header of the SOAP message going from the service to the client.
12. Sends the response to the client

Once the client gets the message, it will perform following validations...

1. Client gets the message
2. Finds the private key corresponding to the fingerprint value in EncryptedKey element
3. Decrypts the encrypted generated key with the key from [2]
4. Decrypts the message using the key from [3]
5. Verifies the signature of the message using the public key of the service

One of my colleagues recently asked me how to find whether someone is using Asymmetric or Symmetric binding just by looking at the SOAP messages...

This is not possible by looking at the SOAP request - but looking at the SOAP response we can figure it out. When using Asymmetric binding SOAP response will have the EncryptedKey element inside the Security header - but not in the case of Symmetric binding.

Let's summarize the differences in the behavior between Asymmetric and Symmetric bindings.

Saturday, December 3, 2011

APIs have become an essential and key success factor for any business. Businesses do not operate as silos anymore, therefore each business depends on B2B communications. In technical terms, different systems/applications need to communicate with each other to fulfill various business requirements. Publishing rich business APIs is the answer to the above requirements.

Architects and developers who implement APIs prefer to use REST as the standard by looking at the simplicity and flexibility it provides for the end-users of the API and use lightweight message formats like JSON and POX. Most enterprises struggle to expose RESTful APIs due to various technical limitations and spend more time to architect and implement the same.
This half-day workshop focuses on how to expose your heterogeneous back-end services as a RESTful API in a quick and easy but architecturally accurate way, using the WSO2 Platform - and presented by Asanka Abeysinghe, Director, Solutions Architecture, WSO2.

Thursday, December 1, 2011

The first ever DZone meetup in Sri Lanka happens on 15th December at WSO2 #58 office..

Sri Lanka has the highest number of Apace Committers out side USA and further in Google Summer of Code, University of Moratuwa - Sri Lanka was ranked as the top university Worldwide in terms of the number of awards received by students for the five year period from its inception in 2005. So, the interest among Sri Lankan tech community towards DZone Community Meetup undoubtedly expect to be very high...

We have picked "Big Data" as the theme for this meetup and have invited the experts in this area to share their thoughts.. All topics related to Big Data has lots of traction these days and DZone NoSQL Zone has a good collection of resources..

We would like to invite any of the DZone members/users around Colombo to join this meetup to share your thoughts..

WSO2 is happily sponsoring the event and we would expect DZone to send us some RefCards to share between the attendees..