Thursday, December 20, 2012

I've just read a great book, "Automate This." Its topics range from automated trading on Wall Street, to a dark fiber line from Chicago to New York cutting through mountains, to Facebook tracking what its billion users do, to Cloudera and the story of Jeff Hammerbacher, to automated legal review, and then to the US educational system.

On the last subject, the author suggests giving a programming course to every high school student, in order to find the 5% who are born developers (that is because, in the mind of the author, Christopher Steiner, everything can be automated: doctors, lawyers, and traders, everything except the automators themselves).

I say more: we should teach Hadoop in high school. Going through basic Java or C++ is booooring. Rather, tell the kids that you will teach them to program just like Facebook and Google people do, then by the way teach them Java. Don't start from int and float, but teach them to read the code, then to understand it, then to write it. Just like learning a human language the right way.

Do it in the IDE (NetBeans, of course:) with all the modern conveniences. And watch for the amazing results.

Sunday, December 9, 2012

Introduction

With the proliferation and easy availability of cloud computing resources, it is inevitable that lawyers and legal IT departments will eventually start using clouds. It is also inevitable that they will ask many questions about cloud security. This paper deals with general cloud security, as well as with the specific security requirements that lawyers and corporate legal/IT departments usually deal with.

We specifically discuss Amazon Web Services (AWS) and SHMcloud (an open source solution for eDiscovery) from SHMsoft. At the end we provide a Q&A section.

What is a “cloud?”

We define cloud computing as elastic computing resources. A perfect example of this definition is the Elastic Compute Cloud (EC2) provided by Amazon Web Services (AWS). Its characteristics are:

Easy availability of computing resources: you can provision any amount of storage and any number of computing instances within a matter of minutes;

Self-service: you have the ability to manipulate your computing resources through the browser in your AWS account, or programmatically, without the need for a formal provisioning process, phone calls, or human involvement;

Pay-as-you-go: you pay only for the computing resources you use, while you use them.

Since AWS is one of the most prevalent providers, and since SHMcloud is primarily implemented on EC2, we will limit our discussion to these two services. However, the general approach, conclusions, and recommendations are applicable to other cloud services.

Cloud security today

The IT departments of corporations and law firms used to be in complete control of their web-accessible applications, or at least the responsibility lay with them. Now the situation is changing. With the availability of cloud resources, and with multiple reasons driving cloud adoption, these same people often have to rely on the security furnished by the cloud provider. A recent Forbes article, Data Privacy And The Cloud: Fact Versus Fiction, highlights the growing understanding and adoption of the cloud in general.

Amazon’s EC2, the one we concentrate on here, sets very high standards for security protection. It also recommends, and in some cases mandates, the best security practices. However, it is in large measure up to the users to make sure that they use the EC2-provided security correctly. In the case of applications built on top of EC2, the users also need to verify that the application designers have implemented Amazon’s guidelines. If any additional security measures have been used, they need to be documented, and in some cases subjected to an audit.

Components of an EC2 application

There are three components in an EC2 application: the Amazon Machine Image (AMI) that you run, in one or multiple copies; security key pairs; and security groups.

AMI

The AMI is what you run, and what is important here is the source of your AMI. If you use a community-provided AMI, it falls on you to check for backdoors that may have been left by the creators. You would need to delete credentials and remove certificate and key material. About 30-40% of all community AMIs have some form of backdoor left behind, most likely by oversight, but it does not matter how this happened: the AMI that you run must be clean.

If you are using a Marketplace AMI, then you can be sure that the Amazon Marketplace team has taken these steps before placing the AMI into general availability through the Marketplace. Thus, for example, all of the AMIs used by SHMcloud are guaranteed by Amazon to be secure in this sense, and free of backdoors.

Key Pair

A key pair is a pair of private/public keys. The public part is stored on Amazon, and it allows you to access your AMIs, provided that it matches the private key that is stored on your machine. The recommended practice here is for every person to generate and use his/her own key pair. This generation is easy with Amazon, and doing it right has multiple advantages: better security, better accountability, and easier reassignment of rights when a person leaves the company.

Security group

A security group can be viewed as a firewall for the application. It lists the user-defined (that is, defined by you) access rules for ingress and egress. A security group includes its name, the protocol (TCP, UDP), the "to" and "from" ports, and the source (where the traffic is coming from).

There can be many security groups (up to 500), and in addition there are some known problems with them, such as the use of a memcache server. The best way to deal with this is to have a person responsible for security group setup, and an auditor, and to use automation to check for the common problems with security groups, for example with the Scout tool, https://github.com/iSECPartners/scout. Whatever the method, you need to highlight potentially dangerous security groups, and compare what each one should be to what is really there.

Simple Storage Service - S3

S3 is the storage part of AWS in general, and of SHMcloud in particular. It has the following three security mechanisms:

ACL, or Access Control List;

Bucket policy; and

IAM (Identity and Access Management) policy.

Let’s look at each one separately.

ACL

ACLs work together with bucket and IAM security.

Bucket security

With bucket-level permissions, one can have fine-grained permissions, on the level of specific objects. One can also use more granular bucket policies, such as specific actions and conditions. For example, one can enforce permissions based on object size.

IAM

IAM, or Identity and Access Management, is the way to have multiple users for the same Amazon AWS account. The best practice for using it is outlined below:

Create IAM principal, attach IAM policies to it;

Create departments and buckets, set policies;

Attach users to departments.
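The steps above can be sketched as follows. This is a minimal illustration, not the way SHMcloud configures IAM: the department names, bucket names, and user names are all hypothetical, and the code only builds policy documents in the standard IAM JSON format without calling AWS.

```python
import json


def bucket_read_only_policy(bucket: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket itself (for listing)
                    f"arn:aws:s3:::{bucket}/*",    # the objects inside it
                ],
            }
        ],
    }


# Hypothetical departments, each with its own bucket and policy.
departments = {
    "litigation": bucket_read_only_policy("example-litigation-data"),
    "compliance": bucket_read_only_policy("example-compliance-data"),
}

# Hypothetical users attached to departments.
memberships = {"alice": "litigation", "bob": "compliance"}


def policy_for(user: str) -> dict:
    """Look up the policy a user inherits from his or her department."""
    return departments[memberships[user]]


print(json.dumps(policy_for("alice"), indent=2))
```

Keeping permissions on the department rather than on each user means that reassigning a person is a one-line change in the membership table.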

Since IAM includes identity federation, it is a convenient and powerful tool for user and group permissions and controls.

Access logs allow you to verify how the data is being accessed. They can be used for security auditing.

IAM allows for coarse-grained permissions: read, write, etc. The grantee can be a human user, or a system user (software agent).

Encryption

Encryption is another layer of security protection for your data. The Amazon AMI images are already encrypted, to guarantee that Amazon’s employees do not have access to your data. However, a targeted hacking attempt or a human glitch could lead to data exposure. To mitigate this risk, sensitive data should additionally be encrypted.

There are two ways of encrypting data: client side and server side.

With server-side encryption, Amazon manages the keys and encrypts the data with AES-256. This is more convenient to implement. In this scheme, objects are encrypted, not buckets. Furthermore, there is no need to manage keys, and the risk is transferred to AWS services.

With client-side encryption, you manage the keys, using the AWS SDK. There is additional implementation load, but one has even greater flexibility. Also, the chain of custody includes only the known elements and excludes the third party, AWS.

With encryption, as with all other levels of security, it is a recommended practice to use automated tools (such as Scout) to verify and enforce encryption.

Questions often asked by lawyers

As we have seen, the AWS platform has all the necessary elements to implement the best practices of security in web-based applications. SHMcloud, an eDiscovery system based on AWS, implements all of these best practices.

Of course, SHMcloud can be deployed internally, in a hosted or internal computing center, and then it will carry over all of its security practices to this implementation.

In addition, below is a list of questions that law practitioners usually ask about cloud-based eDiscovery, as it relates to specific areas of legal responsibility and various geographical jurisdictions. We have also included some questions related to price/benefit analysis.

Q. How well is my data protected against accidental loss?

A. S3 stores all its data with a replication factor of 3. Currently S3 stores one billion new objects daily. Yet, since the beginning of its operation in 2002, AWS does not have a documented case of customer data loss. Given the public nature of all outages, this is a remarkable record.

Q. Some lawsuits require storing data for years. Can S3 accommodate this?

A. S3 has no time limit on data storage. In addition, you can implement selective backup for the important information.

Q. The price of 20 cents/month/GB of data can be quite high. Is there a way to mitigate this cost?

A. Amazon Glacier stores the data at 1/20 of this cost, at 1 cent/month/GB. You can think of it as inexpensive long-term backup.

Q. What about human errors, such as deletions?

A. There is no complete protection against human errors, but best practices help mitigate this risk. These include storing multiple copies of the important information, some of them with read-only permissions for all but the project administrator.

Q. How can I make sure that my data is indeed deleted after the necessary retention period is finished?

A. There are multiple measures that you can take:

Delete the data from S3 and shut down your EC2 instances. This has the effect of erasing the data from your hard drive, only more so. In the case of a local hard drive, there is undelete and forensic restore. By contrast, in the case of S3 and EC2, the data is encrypted by Amazon in the first place, so now it is essentially gone;

If on top of that you used your own encryption, the data is unrecoverable;

You can overwrite your data with other, bogus data. This is not needed, but some people like to run a PC Eraser-type program a number of times, for their comfort, and this has about the same effect here.
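The second measure relies on the fact that ciphertext without its key is useless. The toy sketch below illustrates that principle with a one-time pad built from the Python standard library. The one-time pad is an assumption made purely for illustration: it stands in for real client-side encryption, and it is not what Amazon or SHMcloud use.

```python
import secrets
from typing import Tuple


def encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    """Encrypt with a fresh random key as long as the message (one-time pad)."""
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return key, ciphertext


def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR with the same key restores the original bytes."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))


document = b"privileged memo"
key, stored = encrypt(document)   # only `stored` would be uploaded to the cloud

# While the key exists, the data can be read back.
assert decrypt(key, stored) == document

# After the retention period: discard the key. Whatever copies of `stored`
# remain anywhere can no longer be turned back into the document.
del key
```

In practice the key would live in a key store under your control, and destroying it there is what makes the remaining ciphertext unreadable.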

Q. Some jurisdictions, such as the European Union, impose a data locality requirement, for example that the data should never leave the European region. Can this be accommodated by AWS, and by extension, by SHMcloud?

A. Amazon AWS provides “Regions” for this exact purpose. Data deployed in one region (such as Ireland for Europe) is guaranteed to never leave the particular computing center. In addition to satisfying the legal requirements, regions provide better application latency and responsiveness.

SHMcloud takes full advantage of Amazon Regions, and it has AMI instances that can be deployed in any Amazon region. Below is a screenshot showing SHMcloud instances and their regions.

Summary

SHMcloud (TM) is an open source solution for eDiscovery. It is based on the Hadoop framework and other modern Big Data tools. It has been extensively tested on large volumes of data. Its current capabilities include metadata and text extraction, culling, exception handling, and deduplication. It also allows searching from within the processed results. Its output consists of archives with native files, text files, and exceptions, as well as a separate load file which can be loaded into a review platform such as Concordance or Summation.

The SHMcloud project initially started as FreeEed. After a year of constant development, with new functionality going into the closed-source, enterprise version of the software, we decided to open source all of it, to provide more benefits to the users and to simplify the development. The most recently added capabilities include OCR, imaging (PDF), and search with Lucene and Solr, as well as speed enhancements and load-balancing. SHMsoft provides rigorous testing and quality assurance, and offers responsive commercial support.

History

The FreeEed software was created by developer Mark Kerzner, and published on GitHub in March of 2011. This was Mark’s third eDiscovery project, with the first two being early attempts at distributed computing. Thus, FreeEed was the result of years of experience and deep knowledge of eDiscovery software. It was built for Big Data from the start, using such technologies as Hadoop, Lucene, Tika, and Hive.

The project received its initial publicity through an article by LTN reporter Evan Koblentz, “Open Source Could Change the Future of E-Discovery”. Since that time Mark has presented the project at meetings such as Women in eDiscovery in Houston, and a meeting of the Houston Association Of Litigation Support Managers (HALSM), which took place in Houston. As Mark continued developing the project, he brought it to the Amazon AWS Cloud as the quickest route to adoption. He lined up his software consulting company, SHMsoft, to offer support.

Having started as a software consulting company, SHMsoft soon evolved into a developer and promoter of FreeEed, offering commercial support and adding open source and closed source enhancements. It became necessary to have a separate enterprise version of FreeEed, which is now offered as SHMcloud. The company was accepted as a client of the Houston Technology Center, a technology incubator. SHMsoft received initial angel funding, and became noticed as one of the first Big Data companies in Houston, TX. It was selected as a finalist in the prestigious Goradia Startup Competition. SHMsoft moved forward by hiring technical and marketing personnel, formed an advisory board, and now has a headcount of around twenty people.

Then we open sourced the additional capabilities under the name of SHMcloud, which is also found on GitHub. SHMsoft plans to stand behind the SHMcloud project. In fact, it is in the process of forming a separate non-profit foundation to promote FreeEed and other open source software for eDiscovery. The name of the foundation is “EddFoundation”. SHMsoft is currently working with eDiscovery processing bureaus as well as with enterprises, not only in Houston but also nationwide, to offer support and facilitate the use and acceptance of the SHMcloud software.

Architecture and software development processing

Processing
is organized on the Hadoop framework. The input data is combined (“staged”) into zip archives for processing and chain-of-custody purposes. During processing, each file is read from the archive and assigned a unique ID. The data is then processed with Tika, which extracts text and metadata. Metadata, text, and the file itself are delivered as processed results.

The current and future building blocks of the system are HDFS, Hadoop, HBase, Tika, Lucene, Solr, Mahout, Hive, and Pig. A proprietary enhancement used for quick searches and review will include DataStax technologies.

Indexing and searching

Culling is accomplished through the use of an open source search engine called Lucene. An efficient in-memory index is created for each document, and all of the project’s keywords are run against this index. If the index contains any of the keyword combinations, the document is considered responsive and is sent for further processing.

A feature that is currently being tested is the capability to store each search index for each document in a complete Lucene index. This allows for additional searching and culling to be performed once the project processing is completed.

This process is made even more efficient and flexible because each node on a Hadoop cluster is creating its own Lucene index. The indexes can then be used for searches, where the software queries all of them in a combined query. For the sake of efficiency, the indexes get merged into the project’s search index during the next step of processing.

Output

Metadata results are output as a CSV file, while the native files and the extracted text are stored in one or more zip files. The end results can be used for culling and producing native files for legal review.

Supported file formats

MS Office formats

PST processing

PDF

Images

Speed of processing

On
regular commodity servers, SHMcloud processes about 2 GB of data per hour per server. The speed increases linearly with the number of servers in the Hadoop cluster. Thus, at a recent demo for HALSM using 50 computers on the Amazon EC2 cloud, SHMcloud processed 100 GB of Enron data in 1 hour.

Testing

SHMsoft has a full-time tester dedicated to testing the stand-alone and cloud-based versions of FreeEed/SHMcloud. The testing is done using standard data sets, in particular the Enron set. The results of the complete Enron data processing can be found at FreeEed.org, or by navigating to http://freeeed.org/index.php/documentation/testing-with-enron-data.

Controlling the software

The SHMcloud software is controlled through a desktop application called a “Player”. The Player allows the user to set up and organize projects, add data to the project, set and update processing parameters, stage the data (copy it to archive files for deployment on the Hadoop cluster or on the Amazon AWS cloud), and then start and control processing.

A web browser-based GUI is under development, first for search and culling, and later to replace the Player.

The back-end processing, residing on an internal Hadoop cluster or in the private Amazon AWS cloud, is referred to as SHMcloud. It consists of the same SHMcloud software deployed to every cluster node. The Player organizes the cluster processing. This is illustrated in the diagram below.

In the near future, SHMcloud processing will have the following enhancements:

Browser interface, instead of a desktop application

Optimized data harvesting

Added proprietary data sources and databases

Allow searches and first-pass review directly on the cluster

Allow additional culling, based on previous results

The enhanced near-term processing is illustrated below:

The next enhancements for SHMcloud will include:

Advanced analytics

Review built on Big Data

Comparison of FreeEed / SHMcloud editions

| Edition/Feature | FreeEed | SHMcloud for Amazon AWS | SHMcloud for Hadoop cluster - support |
|---|---|---|---|
| License | Apache | Proprietary | Proprietary |
| Player (desktop application) for local one-workstation processing | Free | Free | Free |
| Player app to control cluster processing | It works, but you do it yourself | Enterprise support | Enterprise support |
| Levels of support | Email, community | Training, implementation, and support, 8 through 5, or 24x7 | Training, implementation, and support, 8 through 5, or 24x7 |
| Pricing | Free | $0/month + $1 / server instance hour + AWS charges | Yearly: $2,500 per node on Hadoop cluster |
| Text and metadata extraction, culling, load file | Yes | Yes | Yes |
| OCR | No | Yes | Yes |
| Imaging | No | Yes | Yes |
| Deduplication | No on Windows, yes on Linux | Yes | Yes |
| Speed | 2 GB / hour | 2 GB / hour * number of machines in the cluster (which is limited only by your AWS account) | 2 GB / hour * number of machines in the cluster |
| Formats: MS Office, PST, PDF, images | Yes | Yes | Yes |
| Custom formats | No | Optional | Optional |
| Databases | No | Optional | Optional |
| Integration support | No | Available | Available |
| Training | No | Available | Available |
| Scalability | Limited by one workstation, or by your own support on the cluster | Basically, unlimited - you only need to increase your maximum number of instances assigned by Amazon | |
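The Speed and Pricing figures in the comparison above can be combined into a rough estimate of processing time and SHMcloud fee on AWS. The sketch below is only that: it assumes perfectly linear scaling at 2 GB per hour per machine, charges partial hours proportionally (an assumption, since the billing granularity is not stated here), and leaves out the AWS charges themselves. The function name is hypothetical.

```python
GB_PER_HOUR_PER_MACHINE = 2.0   # "Speed" figure quoted above
FEE_PER_INSTANCE_HOUR = 1.0     # dollars, SHMcloud for Amazon AWS; AWS charges not included


def estimate(data_gb: float, machines: int):
    """Return (hours, shmcloud_fee_dollars) for processing data_gb on a cluster."""
    if machines < 1:
        raise ValueError("at least one machine is required")
    hours = data_gb / (GB_PER_HOUR_PER_MACHINE * machines)
    # Assumption: the fee is proportional to machine-hours, with no rounding.
    fee = hours * machines * FEE_PER_INSTANCE_HOUR
    return hours, fee


hours, fee = estimate(100, 50)   # the HALSM demo: 100 GB on 50 machines
print(f"{hours:.1f} hour(s), ${fee:.2f} in SHMcloud fees plus AWS charges")
```

Under these assumptions the fee depends only on the data volume (about 50 cents per GB), so adding machines shortens the wall-clock time without raising the SHMcloud part of the bill.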

Monday, December 3, 2012

We started receiving feedback from our beta testers and advisors on the user interface for eDiscovery processing and culling, and we have already started implementing the changes. It will be in the browser, simple, and powerful, and it will run on a single workstation, an internal Hadoop cluster, or EC2 - take your pick.

We are deepening our Hadoop training expertise with administrators’ and operations needs, and with advanced MapReduce and HBase programming techniques.

We are expecting a great week to come working with law firms, hosting providers etc. - stay tuned.