Jun 26, 2016

Apache Zeppelin provides a web UI where you can iteratively build Spark scripts in Scala, Python, etc. (with autocomplete support), run Spark SQL queries against Hive or other stores, and visualize the results of queries or Spark dataframes. This is somewhat akin to what IPython notebooks do for Python. Spark developers know that building, testing, and fixing errors in Spark scripts can be a lengthy process (and a dull one, because it is not interactive). With Apache Zeppelin, you can iteratively build and test portions of your script, which enhances your productivity significantly.

Installing and Configuring Apache Zeppelin

Ensure the following prerequisites are installed:

Java 8: su -c 'yum install java-1.8.0-openjdk-devel'

Maven 3.1.x+: sudo yum install apache-maven, and then link it: sudo ln -s /usr/share/apache-maven/bin/mvn /usr/bin/mvn. If this does not work for you, you can install it the following way.

This command works for version 5.5 of the Cloudera distribution; make sure your versions of Hadoop and Spark are correct. In addition to installing support for Spark, this command configures Zeppelin with support for PySpark as well.

To configure access to the Hive metastore, copy hive-site.xml to the conf directory under the Zeppelin installation.

In the conf folder, copy the files zeppelin-env.sh.template and zeppelin-site.xml.template to zeppelin-env.sh and zeppelin-site.xml respectively.

If you would like to change the port for Zeppelin, change the following property in zeppelin-site.xml.
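For example, to run the server on port 8999 (the port used later in this post), the zeppelin.server.port property in zeppelin-site.xml would look like this:

```xml
<property>
  <name>zeppelin.server.port</name>
  <value>8999</value>
  <description>Server port.</description>
</property>
```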

To start Zeppelin, use the command ./zeppelin-daemon.sh start. You can then access the Zeppelin UI at http://localhost:8999[1]

To stop Zeppelin, use the command ./zeppelin-daemon.sh stop.

Running Spark SQL Queries against Hive and Visualizing Results

In a cell in Zeppelin, type %hive to activate the interpreter with Hive QL support. You can then run the query, and visualization support is automatically activated in the output. To execute the cell, press Shift+Enter.
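For example, a cell like the following (the table name here is illustrative) runs a Hive query whose results can then be rendered as a table or chart:

```sql
%hive
select payment_type, count(*) as trips
from nyc_taxi_data
group by payment_type
```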

Building Scala Scripts and Plotting Model Outputs

You can also code in Scala or Python by activating the corresponding interpreter. The Spark (Scala) interpreter is activated by default for a cell.

To visualize a Spark dataframe, just use the z.show(df) command.

Writing documentation

Activate markdown support in a cell by using %md. You can then add documentation along with your code. Unfortunately, LaTeX support is not there yet, but it should arrive in future releases.

What's missing?

Unlike IPython notebooks, there is no option to export to HTML or PDF (via LaTeX). Support for embedding LaTeX expressions is also missing, but these features should be added in future releases.

Conclusion

Although certain features are missing, Apache Zeppelin surely helps increase your productivity by shortening the build, test, and fix cycle. It also provides nice visualization capabilities for your queries and dataframes.

Apache Spark has an advanced DAG execution engine and supports in-memory computation. In-memory computation combined with DAG execution leads to far better performance than running MapReduce jobs. In this post, I will show an example of using linear regression with Apache Spark. The dataset is the NYC Yellow Taxi dataset for a particular month in 2015, filtered to extract records for a single day.

This example uses HiveContext[1] which is an instance of Spark SQL execution engine that integrates with Hive data store. The dataset has the following features.

Feature Name | Feature Data Type
trip_distance | Double
duration (journey_end_time - journey_start_time) | Double
store_and_fwd_flag (categorical, requires conversion) | String ("Y"/"N")
ratecodeid (categorical, requires conversion) | Int
start_hour | Int
start_minute | Int
start_second | Int
fare_amount (target variable) | Double

We want to predict fare_amount given the set of features. As fare is a continuous variable, the task of predicting it requires a regression model.

Things to consider:

To obtain the data into a dataframe, we must first query the Hive store using the hiveCtxt.sql() method. We can drop invalid records using na.drop()[2] on the obtained dataframe and then cache it using the cache() method for later use.

The two categorical variables need to be converted to a vector representation. This is done using StringIndexer and OneHotEncoder; look at the preprocessFeatures() method in the code below.

Models can be saved by serializing them with sc.parallelize(Seq(model), 1).saveAsObjectFile("nycyellow.model") and reused later by deserializing them with sc.objectFile[CrossValidatorModel]("nycyellow.model").first(). The newer Spark API supports out-of-the-box methods for doing this, and using those methods is recommended.

Data can be split into training and testing sets using the randomSplit() method on the DataFrame. If you are using cross-validation, however, it is recommended to train the model on the entire sample dataset.

The features in the dataframe must be transformed using VectorAssembler into a vector representation, and the resulting column should be named features. The target variable should be renamed label; you can use the withColumnRenamed() method to do so.

Cross-validation can be performed using CrossValidator (fitting it produces a CrossValidatorModel), and the estimator can be set via setEstimator().

The evaluator chosen depends on whether you are doing classification or regression. In this case, we use RegressionEvaluator.

You can specify different values for parameters such as the regularization parameter and the number of iterations, and CrossValidator will use them to come up with the best set of parameters for your model.

After this, you can fit the model to the dataset and evaluate its performance. Since we are testing regression model accuracy here, we can use RegressionMetrics to compare predicted_fare vs actual_fare. The measures that can be used are R-squared (r2) and mean absolute error.

For new predictions, the saved model can be reused. The new data needs to be transformed into the same format that was used to train the model. To do so, we must first create a dataframe using StructType to specify its structure, and then preprocess the features the same way by invoking the preprocessFeatures() method.
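The steps above can be sketched end to end in Scala. This is a minimal sketch against the Spark ML API of the era; the grid values, intermediate column names, and the exact SQL query are illustrative, and hiveCtxt is assumed to be an initialized HiveContext:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Query the Hive store, drop invalid records, and cache the result
val raw = hiveCtxt.sql(
    "select trip_distance, duration, store_and_fwd_flag, ratecodeid, " +
    "start_hour, start_minute, start_second, fare_amount from nyc_taxi_data")
  .na.drop()
  .cache()

// preprocessFeatures(): index and one-hot encode the two categorical columns
val flagIdx = new StringIndexer().setInputCol("store_and_fwd_flag").setOutputCol("flag_idx")
val flagEnc = new OneHotEncoder().setInputCol("flag_idx").setOutputCol("flag_vec")
val rateIdx = new StringIndexer().setInputCol("ratecodeid").setOutputCol("rate_idx")
val rateEnc = new OneHotEncoder().setInputCol("rate_idx").setOutputCol("rate_vec")

// Assemble everything into a single "features" column, rename the target to "label"
val assembler = new VectorAssembler()
  .setInputCols(Array("trip_distance", "duration", "flag_vec", "rate_vec",
                      "start_hour", "start_minute", "start_second"))
  .setOutputCol("features")

val prep = new Pipeline().setStages(Array(flagIdx, flagEnc, rateIdx, rateEnc, assembler))
val data = prep.fit(raw).transform(raw).withColumnRenamed("fare_amount", "label")

// Cross-validate a linear regression over a small parameter grid
val lr = new LinearRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 50))
  .build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
val model = cv.fit(data)
```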

Google recently deprecated Google+ Sign-In and the process of obtaining OAuth access tokens via the GoogleAuthUtil.getToken API. They now recommend a single entry point via the new Google Sign-In API. The major reasons for doing so are that (1) it enhances the user experience and (2) it improves security; more here. Also, starting with Android 6.0, the GET_ACCOUNTS permission has to be requested at runtime, and implementing this API eliminates the need for that permission.

The really exciting feature is the new silentSignIn API, which allows cross-device silent sign-in (essentially, if a user has signed into your application on another platform, he won't be shown the sign-in prompt) provided the requested scopes are the same; this improves the user experience. In addition, you don't have to use the GoogleAuthUtil.getToken API to obtain the tokens, as they are granted on the initial sign-in.

So if you have an Android application in which you had previously implemented Google+ Sign-In, used other Google+ features, and now want to migrate to the new Google Sign-In implementation, this post explains how to do so. The code differs slightly depending on whether you automate the lifecycle of the GoogleApiClient (use enableAutoManage; this approach is recommended as it avoids boilerplate code) or manage the lifecycle yourself by implementing the ConnectionCallbacks interface. As the latter approach requires a bit more code, I will explain the process using it.

What needs to be changed

Replace mGoogleApiClient.connect() with mGoogleApiClient.connect(GoogleApiClient.SIGN_IN_MODE_OPTIONAL). This is required to allow the client to transition between authenticated and unauthenticated states, and for use with GoogleSignInApi.

Build a GoogleSignInOptions instance. While building the instance, request the additional scopes via the requestScopes method (this is where you can request scopes such as SCOPE_PLUS_LOGIN and SCOPE_PLUS_PROFILE). Also, if you need to authenticate the user with your backend and want to obtain an authorization token to access the APIs from your backend, use the requestIdToken(serverToken) and requestServerAuthCode(serverToken) methods. Here, unlike Google+ Sign-In, serverToken is just the client ID of the web application.

Build the GoogleApiClient instance, and use the addApi method to add Auth.GOOGLE_SIGN_IN_API and Plus.API.

In the onStart method, connect the client using mGoogleApiClient.connect(GoogleApiClient.SIGN_IN_MODE_OPTIONAL), and in the onStop method, disconnect it. (You may also do this in the onResume and onPause methods.)

After the client is connected, first attempt silentSignIn; if it fails with code SIGN_IN_REQUIRED, attempt a fresh sign-in for the user.

After the sign-in is completed, you can invoke Plus.PeopleApi with the user's account ID to obtain the user's Google profile information.

To sign out the user, use the Auth.GoogleSignInApi.signOut method, and to revoke access, use the Auth.GoogleSignInApi.revokeAccess method.
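Put together, the client setup described in the steps above looks roughly like the following. This is a sketch against the Play Services APIs of the time, written inside an Activity; serverClientId and the callback wiring are placeholders you would supply yourself:

```java
// Sign-in configuration: Google+ scopes plus backend tokens
// (serverClientId is the OAuth client ID of your web application)
GoogleSignInOptions gso = new GoogleSignInOptions.Builder(GoogleSignInOptions.DEFAULT_SIGN_IN)
        .requestScopes(new Scope(Scopes.PLUS_LOGIN))
        .requestIdToken(serverClientId)
        .requestServerAuthCode(serverClientId)
        .build();

GoogleApiClient mGoogleApiClient = new GoogleApiClient.Builder(this)
        .addApi(Auth.GOOGLE_SIGN_IN_API, gso)
        .addApi(Plus.API)
        .addConnectionCallbacks(this)   // manual lifecycle management
        .build();

// In onStart(): allow transitions between authenticated and unauthenticated states
mGoogleApiClient.connect(GoogleApiClient.SIGN_IN_MODE_OPTIONAL);

// Once connected: try the silent sign-in first, fall back to the interactive flow
OptionalPendingResult<GoogleSignInResult> pending =
        Auth.GoogleSignInApi.silentSignIn(mGoogleApiClient);
```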

Hive or Impala?

Hive and Impala both support SQL operations, but the performance of Impala is far superior to that of Hive. Although with the Spark SQL engine and the use of HiveContext, Hive queries are now also significantly faster, Impala still performs better. The reason is that Impala already has daemons running on the worker nodes, and thus avoids the overhead incurred during the creation of map and reduce jobs.

The query I mention later ran almost 10X faster on Impala than on Hive (61 seconds vs. around 600 seconds); Impala is known to give even better performance.

Schema on Read vs. Schema on Write

Schema on read differs from schema on write in that data is not validated until it is read. Although schema on read offers the flexibility of defining multiple schemas for the same data, it can cause nasty runtime errors. As an example, Hive and Impala are very particular about the timestamp format they recognize and support. One workaround to avoid such bad records is, rather than specifying the data type as timestamp, to specify it as String and then use the cast operator to transform the records to the timestamp format. This way bad records are skipped and the query does not error out.

cast(field_name as timestamp)

Window Functions, Top-N Queries, PL/SQL

Hive and Impala do not support update queries, but they do support insert into ... select ... from operations. Hive and Impala also support window functions, which makes life easier because neither supports PL/SQL procedures.

In the example below, I am using the NYC Yellow Taxi dataset from January 2015. The query below filters out invalid timestamp records and selects the first 500 records per hour for January 1, 2015.

/** Top-N subquery: selects the first 500 records per hour for a day */
insert into nyc_taxi_data_limited
select VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance,
       pickup_longitude, pickup_latitude, RateCodeID, store_and_fwd_flag, dropoff_longitude,
       dropoff_latitude, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
       improvement_surcharge, total_amount
from (
  select *,
         row_number() over (partition by trunc(cast(tpep_pickup_datetime as timestamp), 'HH')
                            order by trunc(cast(tpep_pickup_datetime as timestamp), 'HH') desc)
           as rownumb
  from nyc_taxi_data
  where cast(tpep_pickup_datetime as timestamp)
        between cast('2015-01-01 00:00:00' as timestamp)
            and cast('2015-01-01 23:59:59' as timestamp)
) as q
where rownumb <= 500;

Note the use of the window function row_number partitioned and ordered by the truncated timestamp, and of the cast operator to skip invalid records.

What's the catch?

Given the benefits of Impala, why would one ever use Hive? The answer lies in the fact that Impala queries are not fault tolerant: if a node fails mid-query, the whole query fails and must be re-run, whereas Hive's underlying MapReduce jobs can recover from task failures.

Conclusion

Although Impala and Hive do not offer the entire repertoire of functionality supported by traditional RDBMSs, they are the closest, with respect to that functionality, in the world of distributed systems, and they offer scalable, large-scale data analysis capability.

The Idea

Java 8 introduced functional programming support, a powerful feature that was missing from earlier versions. One benefit of functional programming is that it can be used to implement the decorator pattern easily. A common requirement is to implement some kind of rate limiting for web services, and ideally you want separation of concerns between the actual business logic and the rate-limiting logic. With Java 8, we can use method references to achieve this separation of concerns and implement the decorator pattern.

The code

The code fragment below shows the implementation of the pattern, as an example of integration with the Lyft API. The full source code is available here.
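To keep this post self-contained, here is a minimal sketch of the same idea. The original code rate-limits Lyft API calls using RxJava; in this sketch a simple token counter stands in for RxJava, and fetchRides is a hypothetical stand-in for the real API call:

```java
import java.util.function.Function;

public class RateLimitDecorator {
    private int tokens;

    public RateLimitDecorator(int tokens) { this.tokens = tokens; }

    // Decorator: adds rate-limiting preprocessing, then invokes the supplied
    // function via apply(). The business logic never knows it is rate limited.
    public <T, R> R invokeWithRateLimit(Function<T, R> call, T arg) {
        if (tokens <= 0) {
            throw new IllegalStateException("rate limit exceeded");
        }
        tokens--;
        return call.apply(arg);
    }

    // Plain pass-through variant, no preprocessing
    public <T, R> R invokeWithoutRateLimit(Function<T, R> call, T arg) {
        return call.apply(arg);
    }

    // Hypothetical business logic standing in for the Lyft API call
    static String fetchRides(String city) {
        return "rides near " + city;
    }

    public static void main(String[] args) {
        RateLimitDecorator d = new RateLimitDecorator(1);
        // Pass the business-logic method as a method reference; the decorator
        // wraps it without touching fetchRides itself.
        System.out.println(d.invokeWithRateLimit(RateLimitDecorator::fetchRides, "NYC"));
        try {
            d.invokeWithRateLimit(RateLimitDecorator::fetchRides, "NYC");
        } catch (IllegalStateException e) {
            System.out.println("limited: " + e.getMessage());
        }
    }
}
```

Because the decorator takes a Function, any method with a compatible signature can be wrapped the same way, with no inheritance hierarchy involved.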

The code above shows how to pass a method reference to the methods invokeWithRateLimit() and invokeWithoutRateLimit(); each of these methods adds some custom preprocessing logic (such as rate limiting using RxJava) and then invokes the supplied method via apply(). This implementation of the decorator pattern is much easier to grasp than going the inheritance route.

You can use the following link to view the entire code in the GitHub repository.

You might run into a scenario that requires conditional authentication with Retrofit 2.0.

This post provides an example of integration with the Lyft API. For the Lyft API, we first need to authenticate with and query the oauth/token endpoint to obtain the OAuth token, and then use this accessToken in other service calls. Such access tokens also have an expiry time (1 hour), so ideally there should be a mechanism to handle that scenario.

One lazy (and, as it turns out, perfect) solution is to use interceptors and check the HTTP response code from the service to see whether it is 401. If the code is 401, you can assume the token has either expired or was never obtained initially; either way, you need to re-authenticate and query the endpoint to obtain the accessToken.

The code block below shows how this is done. To access the entire source code you can visit Lyft-Client on Github.

As can be seen in the above code example, we build two OkHttpClient objects. The clientNormal object is configured to use HTTP basic authentication and is used by the retrofit object to query the getAccessToken endpoint to obtain the access token; this accessToken is required by the other Lyft service endpoints. The clientAuthenticated object uses an interceptor to set the Authorization: Bearer header with the value of the accessToken, which is required for all other service endpoints.

In the initializeRetrofitClients method, it can be seen that initially we just invoke the service endpoint with the current value of accessToken (by calling chain.proceed), and if the response code is 401, we invoke getAuthenticationToken followed by another call to chain.proceed with the new value of accessToken. For subsequent calls, the interceptor uses the stored value of accessToken. This lazy approach to obtaining the access token is better because the logic for deciding when to obtain the accessToken is not hardcoded; it also keeps the code simple by avoiding unnecessary checks.
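The retry-on-401 logic can be distilled into plain Java. In this sketch, the Chain interface and the names are illustrative stand-ins for OkHttp's Interceptor.Chain and the post's getAuthenticationToken; only the control flow matches the real interceptor:

```java
import java.util.function.Supplier;

public class TokenRefresh {
    // Stand-in for OkHttp's Interceptor.Chain: proceed() returns an HTTP status code
    interface Chain { int proceed(String accessToken); }

    private String accessToken;                             // cached token; may be null or expired
    private final Supplier<String> getAuthenticationToken;  // queries the oauth/token endpoint

    public TokenRefresh(String initialToken, Supplier<String> tokenEndpoint) {
        this.accessToken = initialToken;
        this.getAuthenticationToken = tokenEndpoint;
    }

    // Lazily (re)authenticate: only hit the token endpoint when the service says 401
    public int intercept(Chain chain) {
        int code = chain.proceed(accessToken);
        if (code == 401) {                                  // expired, or never obtained
            accessToken = getAuthenticationToken.get();
            code = chain.proceed(accessToken);              // retry once with the fresh token
        }
        return code;
    }

    public static void main(String[] args) {
        // Fake service: accepts only the fresh token
        TokenRefresh t = new TokenRefresh(null, () -> "fresh-token");
        Chain service = tok -> "fresh-token".equals(tok) ? 200 : 401;
        System.out.println(t.intercept(service)); // first call triggers a refresh, then succeeds
    }
}
```

Subsequent calls reuse the cached accessToken and never touch the token endpoint until the service rejects it again.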

Hope this post was helpful in clarifying the use of interceptors for conditional authentication.

You can use the following link to view the entire code in the GitHub repository.

Mar 11, 2016

Blogger does not have support for LaTeX, and Windows Live Writer is being redeveloped. So, in the meanwhile, I have written a few posts on the Pelican blog and thought I might as well link to them here.

Aug 5, 2015

Earlier, I had covered an example here which showed how to dynamically create users and map application roles to enterprise groups. In this post, the sample application is extended to show how you can query application roles from the application stripe (application-specific policies).
To query the application-specific roles, you need to access the application's policy from the policy store; then you can directly invoke either searchRoles(String roleName) or searchRoles(String attributeToSearchRolesBy, String attributeValue, String equalityOrInequalityFlag). The response from the method is a List<AppRoleEntry>. The snippet below shows the code for doing so. Note that although there is another, much more flexible method to search across application stripes, policyStore.getAppRoles(StoreAppRoleSearchQuery obj), it is not implemented for the embedded policy store and throws UnsupportedOperationException.
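A minimal sketch of that lookup, using only the methods named above (the stripe name is a placeholder for your application's stripe, and the wildcard search expression is illustrative):

```java
// Obtain this application's policy from the policy store and search its roles
ApplicationPolicy appPolicy = policyStore.getApplicationPolicy("MyAppStripe");

// Search by role name; "*" matches all roles in the stripe
List<AppRoleEntry> roles = appPolicy.searchRoles("*");

for (AppRoleEntry role : roles) {
    System.out.println(role.getName());
}
```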

You can also perform various other operations on the policy store and alter the application-specific policies once you have access to those operations. A key thing to note is that these operations require specific PolicyStoreAccessPermissions to be granted in jazn-data.xml. The steps to do so are mentioned below.

Define a resource type with the permission class PolicyStoreAccessPermission and the necessary actions you want to grant access to (in this example, I am granting access to all operations, signified by *). The snippet is shown below:

Next, create the resources that are to be granted permissions. In this case, I have created two of them: the first is the superset that allows access to all application stripes, and the second grants access to only this application's stripe. The snippet is shown below:

In the last step, assign these resources to the application roles or groups.

The enterprise identity store provider being used here is the embedded WebLogic LDAP. To run the application properly, you will need to configure a password for it in WebLogic and set the password in jps-config.xml as shown in the screenshot below.

To run the application, the username/password combination is john/oracle123. To view the search roles screen, either run SearchRoles.jspx or click the Search Roles link in the left navigation bar. The link to download the application is below:

About Me

I am Ramandeep Singh Nanda. I have a bachelor's degree in computer science engineering and have worked for Oracle in the Fusion Middleware and IDM domains in the past. I am a self-starter who likes to explore new technologies and blog about them. I am currently pursuing a master's degree in Information Systems at New York University.