August 5, 2019

There is a fantastic blog entry on how to automate DR on Amazon RDS for Oracle Databases. One of my customers had an on-premises database of 1 TB with a peak transactional activity of 50 GB of archive logs per hour, and had RPO and RTO requirements of 2 hours and 4 hours respectively. Before migrating to RDS, they wanted to know whether the solution could fulfill their RPO and RTO requirements, so I carried out a few tests.

Demo environment:

I built a database, orcl, in us-east-1 and generated load using Swingbench. The first/baseline snapshot took 1 hour 25 minutes to create, 50 minutes to copy from us-east-1 to us-west-2, and 19 minutes to restore. First/baseline snapshot numbers can be large depending upon the size of your database. I tweaked Swingbench so that it generates 50 GB of archive logs per hour. The incremental snapshot at T2 took 13 minutes to create, 11 minutes to copy from us-east-1 to us-west-2, and 18 minutes to restore. I performed a number of incremental updates, and all the snapshot creation, copy, and restore times were in the same ballpark.
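The copy and restore steps of the DR automation can be scripted with boto3. Below is a minimal sketch under assumed identifiers (snapshot name, account, KMS key, and instance class are placeholders), not the automation from the referenced blog entry.

```python
# Minimal sketch of the cross-region copy/restore steps, assuming boto3 and
# placeholder identifiers (snapshot name, account, KMS key, instance class).
import boto3

dst = boto3.client("rds", region_name="us-west-2")

# Copy the latest snapshot from us-east-1 into the DR region (us-west-2).
dst.copy_db_snapshot(
    SourceDBSnapshotIdentifier="arn:aws:rds:us-east-1:123456789012:snapshot:orcl-2019-08-05",
    TargetDBSnapshotIdentifier="orcl-dr-copy",
    KmsKeyId="alias/aws/rds",   # needed when the source snapshot is encrypted
    SourceRegion="us-east-1",
)

# Wait until the copy is available, then restore it in the DR region.
dst.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier="orcl-dr-copy")
dst.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="orcl-dr",
    DBSnapshotIdentifier="orcl-dr-copy",
    DBInstanceClass="db.m5.2xlarge",
)
```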

Assuming a snapshot is scheduled every hour, if there is an event at 1:14 pm (worst-case scenario), the point-in-time snapshot available will be from 12:00 pm. In that case 1 hour 14 minutes of data will be lost and the snapshot will be restored by 1:45 pm, so the RPO was 1 hour 14 minutes and the RTO was 31 minutes. Another box demonstrates a failure after snapshot completion but before the snapshot copy is complete.

As you can see from these tests, the RPO and RTO requirements for this customer are achievable, and RDS for Oracle is a great choice as a first step in migrating to the AWS cloud, with a possible move to Aurora later.

Don'ts

Copy/restore times do not extrapolate linearly, so carry out testing based on your own requirements. Do not assume that a 20 TB database is going to take 20x as long to copy and restore.

May 8, 2019

Overview

In the previous blog, I demonstrated how to create a web-based data science environment using JupyterHub on Elastic Container Service. That architecture requires a minimum of one running task, so even if the environment is not used continuously the customer still has to pay for one task per level of authorization they need. There are different ML stages, as shown in the diagram below, and the customer wants to use stage/tool patterns 1 and 3 for their use cases. It is not economical to run GPU-based instances persistently in the web-based architecture, so the customer wants a hybrid of the web-based and on-demand environments. In this blog, I will demonstrate how data scientists can request such an on-demand environment using Service Catalog.

2a to 2i – Follow similar steps as above, mapping to the nonPII IAM role.

1h – Users can launch a product from the products assigned to them.

1j – Once the product is launched, users can access the JupyterHub environment.

Implementation

0a. Create AMI

Launch any t2 EC2 instance using Ubuntu 18.04, log in to the instance, and run the following commands. Once the packages are installed, reboot the instance and create an AMI image from it. After the image is created, terminate the EC2 instance.
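The package-installation commands themselves are not reproduced here, but the AMI step can also be scripted. The following is a rough boto3 sketch; the instance ID and image name are placeholders.

```python
# Rough sketch: create an AMI from the prepared instance, wait for it to be
# available, then terminate the instance. Instance ID and name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="jupyterhub-ubuntu1804",
    Description="Ubuntu 18.04 with JupyterHub packages pre-installed",
)
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])
```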

1h, 2h. Log in as a data scientist IAM user and launch the product.

Once the product is launched, it will show the JupyterHub URL as a key-value pair.

Launch JupyterHub from browser.


Login using username and password.

1k, 2k Create notebook and test access.

Create a notebook; the notebook will be stored on the EFS mount point.

As expected, the user can access data from the PII folder.

The user does not have access to non-PII data.

Once the user completes their data science or machine learning tasks, the product can be terminated by clicking Actions and then Terminate. In the future, the user can launch the product again and the notebooks will be preserved, as they are stored on a persistent EFS mount.

Additional Considerations

Spot Instances

I have created the products using CloudFormation templates that use On-Demand Instances. If there is no urgency to complete data exploration or machine learning training, consider creating products that use Spot Instances, which can significantly reduce cost.

Security

Use certificates, and consider launching the EC2 products in a private subnet for security reasons and accessing them through a bastion host.

Active Directory Integration

SageMaker

The product offering can be extended using SageMaker, and data scientists can have the flexibility to use JupyterHub or SageMaker depending upon their requirements.

Cost and Reporting

If users don't terminate the product, the EC2 instance will keep incurring cost. A Lambda function can be scheduled to terminate idle instances, or a CloudWatch alarm can be created so that EC2 instances that are idle for more than a certain period of time are terminated.
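As one possible variant, a CloudWatch alarm with the built-in EC2 terminate action can be attached to each launched instance. The sketch below is illustrative only; the threshold, periods, and instance ID are assumptions.

```python
# Illustrative sketch: terminate an instance if average CPU stays below 5%
# for 12 consecutive 5-minute periods (one hour). Instance ID is a placeholder.
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="idle-jupyterhub-instance",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=12,
    Threshold=5.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:terminate"],
)
```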

As users have control over what type of EC2 instances they can launch, additional reporting can be created using Service Catalog, EC2, and cost metadata.

April 24, 2019

Overview

The customer has built a multi-tenant data lake on S3 and has started ingesting different types of data. Now they want to build a data science environment for data exploration using JupyterHub. Below are the requirements.

Environment must be low cost.

Environment must scale with number of data scientists.

Environment should support authentication and authorization (S3 data lake).

Notebooks must be stored in a centralized location and should be sharable.

Environment must support installing custom packages and libraries such as R, pandas, etc.

The customer's preference is to use JupyterHub, and they do not want to use EMR due to the additional cost.

Architecture

1a. Create IAM policies and roles with access to specific S3 folders. For simplicity, let's assume the S3 bucket has two keys/folders called PII and non-PII. Create a policy and role with access to PII and another with access to non-PII.
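A hedged sketch of what the PII policy might look like is shown below; the bucket name is a placeholder, and a mirror-image policy would cover the non-PII prefix.

```python
# Sketch of an IAM policy scoped to the PII prefix of a hypothetical bucket.
import json
import boto3

iam = boto3.client("iam")

pii_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::datalake-demo-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["PII/*"]}},
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::datalake-demo-bucket/PII/*",
        },
    ],
}

iam.create_policy(PolicyName="s3-pii-access", PolicyDocument=json.dumps(pii_policy))
```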

2a. Create two Dockerfiles for authorization purposes. Each Dockerfile will have separate users for authentication, and later, while creating the ECS task definitions, each image will be attached to a different role for authorization. Store the Dockerfiles in CodeCommit.

2b. CodeBuild will trigger on commit.

2c. CodeBuild will build the images using the Dockerfiles.

2d. CodeBuild will push the images to Elastic Container Registry (ECR).

Web Based Environment

A single task can be shared by multiple users, and tasks can scale based on a scaling policy. A minimum of one task must be running, so the customer has to pay for at least one task per task group. CPU and memory limits per task are 4 vCPU and 30 GB of memory.

3a. Create ECS cluster

3b. Create one task definition using the role that has access to the PII folder in S3 and the image that contains the users who need access to PII data, and another task definition for non-PII.
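A minimal boto3 sketch of the PII task definition follows; all ARNs, the image URI, and the CPU/memory values are placeholders. The non-PII definition would be identical except for the task role and image tag.

```python
# Sketch: Fargate task definition that pairs the PII image with the PII task role.
# All ARNs, the image URI, and CPU/memory values are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="jupyterhub-pii",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="2048",
    memory="8192",
    taskRoleArn="arn:aws:iam::123456789012:role/jupyterhub-pii-role",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "jupyterhub",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/jupyterhub:pii",
            "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
            "essential": True,
        }
    ],
)
```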

On-Demand Environment

EC2 instances can be provisioned using Service Catalog: one EC2 instance per user, although users can also share an instance. The customer only pays for what they use, and a wide variety of EC2 options is available with much higher CPU and memory than ECS. Recommended for ad hoc and very large data processing use cases.

In this blog, I will cover the implementation of the web-based environment; the on-demand environment will be covered in part 2.

Walkthrough Dockerfile

Get the base image, update Ubuntu, and install jupyter, s3contents, and awscli. s3contents is required to store notebooks on S3.

Install S3ContentsManager to store notebooks in a centralized S3 location. Although the GitHub documentation says it should work with an IAM role, I got some errors, so for now I'm using an access_key_id and secret_access_key that have read/write access to the S3 bucket.
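The relevant lines of the Jupyter config look roughly like this; the bucket, prefix, and keys are placeholders.

```python
# jupyter_notebook_config.py -- rough sketch; bucket, prefix, and keys are placeholders.
from s3contents import S3ContentsManager

c = get_config()  # provided by Jupyter when it loads this config file
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "datalake-demo-notebooks"
c.S3ContentsManager.prefix = "notebooks"
c.S3ContentsManager.access_key_id = "<ACCESS_KEY_ID>"
c.S3ContentsManager.secret_access_key = "<SECRET_ACCESS_KEY>"
```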

2c, 2d. Build Docker images and push them to ECR.

For the sake of brevity, I will skip CodeCommit and CodeBuild and show the commands CodeBuild has to run. There will be two images: one will have the users that need access to PII and the other will have the users for non-PII. Instead of two repositories, you can also create a single repository with two images under different tags.

After both services are created, wait a few minutes until there is one task running for each service.

3f. Create A-records.

Create A-records in Route53.

Test it

Launch jhpocpii.vishalcloud.club:8000. Log in as user1 and notice that user1 can only access PII data. Trying to log in as user3 or user4 will result in an authentication error.

Launch jhpocnonpii.vishalcloud.club:8000. Log in as user4 and notice that user4 can only access non-PII data. Trying to log in as user1 or user2 will result in an authentication error.

Test R Program.

Important Additional Considerations

Cost

ECS tasks can be launched using Fargate or EC2. The matrix below shows a cost comparison of similar CPU/memory configurations between Fargate and EC2. To save cost, depending upon the usage pattern of the environment, use EC2 for persistent usage or Fargate for ad hoc usage.

Security

Use certificates, and consider launching ECS tasks in a private subnet for security reasons.

Active Directory Integration

You can use the LDAP authenticator to authenticate users through AD. Create separate images with different AD groups to control authorization.
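A rough jupyterhub_config.py sketch using the ldapauthenticator package is shown below; the AD host, bind template, and group DN are placeholders.

```python
# jupyterhub_config.py -- rough sketch; AD host, bind template, and group DN are placeholders.
c = get_config()  # provided by JupyterHub when it loads this config file
c.JupyterHub.authenticator_class = "ldapauthenticator.LDAPAuthenticator"
c.LDAPAuthenticator.server_address = "ad.example.corp"
c.LDAPAuthenticator.bind_dn_template = ["uid={username},ou=users,dc=example,dc=corp"]
# Only members of this AD group can log in to the PII image; the non-PII image
# would reference a different group.
c.LDAPAuthenticator.allowed_groups = ["cn=pii-analysts,ou=groups,dc=example,dc=corp"]
```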

January 30, 2019

Customer Requirement:

Customer XYZ is planning to host a multi-tenant Elasticsearch domain to provide a log analytics service to multiple clients. Customer XYZ will receive logs from different clients into their data lake and selectively push relevant logs to client-specific indices. Clients should be able to log in and authenticate to the Kibana portal and should only be authorized for their client-specific indices. A separate set of indices will be created for each client, starting with a standard prefix such as the client ID or client name.

Solution:

In the current solution, I will demonstrate how to integrate an Elasticsearch domain with Cognito. Using Cognito, the customer can create users for different clients in a user pool. The customer can also federate using SAML, provided users are available or can be created in a hosted/cloud AD. Each user, whether from the user pool or federated, will map to one of the groups in Cognito, and each group will be associated with an IAM role, which in turn will provide authorization to a set of client-specific indices.

Note: The solution will work only for Dev Tools in Kibana, not for Discover, Dashboards, etc.

The domain status will change to Processing. Once the processing is complete, the status will change to Active and the Cognito configuration will be updated for the Elasticsearch domain.

Step 6: Users, Roles and policies

Create IAM policies, roles, Cognito user pool users, and groups, and map the groups to IAM roles. Policy documents can be found in Appendix A. The IAM policy documents are the key to controlling authorization at the index level.
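For illustration only, a policy for a hypothetical client "clienta" might constrain the Elasticsearch HTTP actions to indices with that prefix; the domain ARN and prefix below are placeholders, and the actual documents are in Appendix A.

```python
# Illustrative index-level policy for a hypothetical client "clienta".
# The domain ARN and index prefix are placeholders; see Appendix A for the
# actual policy documents.
clienta_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["es:ESHttpGet", "es:ESHttpPost", "es:ESHttpHead"],
            "Resource": [
                "arn:aws:es:us-east-1:123456789012:domain/loganalytics/clienta*"
            ],
        }
    ],
}
```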

SAML integration:

Further, if the customer wants to provide federation for AD users, an identity provider can be configured for SAML using the metadata XML file obtained from ADFS.

Enable the identity provider in the App client Settings.

The Kibana login page will now look as follows:

Federated identities may not map to any groups initially, so they can still get an access denied message. Cognito pre/post authentication triggers can be configured using Lambda to add external identities to Cognito groups depending upon SAML attribute values, or they can be added after some verification/approval process.
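A minimal Lambda sketch for a Cognito post-authentication trigger that adds federated users to a group based on a SAML attribute is shown below; the attribute name ("custom:clientid") and the group naming convention are assumptions.

```python
# Minimal sketch of a Cognito post-authentication trigger. The SAML attribute
# name ("custom:clientid") and group naming convention are placeholders.
import boto3

idp = boto3.client("cognito-idp")

def lambda_handler(event, context):
    user_pool_id = event["userPoolId"]
    username = event["userName"]
    client_id = event["request"]["userAttributes"].get("custom:clientid")

    if client_id:
        # Map the federated identity to a Cognito group, which is tied to an IAM role.
        idp.admin_add_user_to_group(
            UserPoolId=user_pool_id,
            Username=username,
            GroupName=f"{client_id}-group",
        )

    # Cognito triggers must return the event object back to Cognito.
    return event
```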

To determine the maximum CPU threads utilized, we can look at Top Activity in Grid Control, but the Top Activity view is limited to 24 hours. What if we want to look at weekly or monthly trends (example Figure 1 below)? It would be a tedious and boring task to view Top Activity one day at a time and then determine the maximum CPU threads utilized.

Figure 1

The "CPU usage per sec" metric from dba_hist_sysmetric_summary can be converted to seconds and joined with other views to get a Grid Control-style visual. Figure 2 is from Grid Control and Figure 3 is created using the maxthreadused alias from the attached script.

Figure 2 – Grid control view

Figure 3 – dba_hist_sysmetric_summary view

There are additional columns in the script, such as the average and maximum CPU threads utilized and the standard deviation (a lower value means higher confidence in the average compared to the maximum). The "sum of the squared deviations from the mean" is generally used for comparison; I have not added that metric to the script, but it can easily be modified to add that column and used to compare two periods if necessary. If you want to look at trends for multiple instances, you can quickly add a pivot/decode and plot in Excel.
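The attached script is not reproduced here, but a rough sketch of the kind of query it runs (via python-oracledb; connection details are placeholders) looks like this. "CPU Usage Per Sec" is reported in centiseconds per second, so dividing by 100 gives CPU threads.

```python
# Rough sketch, not the attached script: daily max/average CPU threads from AWR.
# "CPU Usage Per Sec" is in centiseconds per second, so /100 gives CPU threads.
# Connection details are placeholders.
import oracledb

QUERY = """
select trunc(s.begin_interval_time)              as snap_day,
       round(max(m.maxval) / 100, 1)             as maxthreadused,
       round(avg(m.average) / 100, 1)            as avgthreadused,
       round(avg(m.standard_deviation) / 100, 1) as stddev_threads
from   dba_hist_sysmetric_summary m
join   dba_hist_snapshot s
  on   s.snap_id = m.snap_id
 and   s.instance_number = m.instance_number
 and   s.dbid = m.dbid
where  m.metric_name = 'CPU Usage Per Sec'
group  by trunc(s.begin_interval_time)
order  by 1"""

with oracledb.connect(user="perfstat", password="***", dsn="dbhost/orcl") as conn:
    for snap_day, max_threads, avg_threads, stddev in conn.cursor().execute(QUERY):
        print(snap_day.date(), max_threads, avg_threads, stddev)
```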

The application team wanted to implement VPD using roles such that if a certain role is granted, the user can see the column data, and if the role is not granted, the user cannot. Below is a quick demonstration.

PS: I wrote this back in 2013, so it is a little outdated compared to some 12c features.

August 26, 2016

We had a batch system that starts concurrent sessions, all with parallel(8). With every release more concurrent sessions are added to that job, and at times a few sessions get serialized, which in turn impacts the runtime of the batch. Downgraded SQL that used to take a few minutes now runs for a couple of hours and does not complete. A lot of the concurrent sessions complete within 1-5 minutes, so it would be ideal to queue sessions until all requested parallel slave processes are available, instead of downgrading and missing the SLA. We did not want all the features of automatic parallel tuning, so I thought of quickly testing only a subset of the feature called parallel statement queuing.

The operational warehouse system is on a 5-node RAC (non-Exadata), the batch is tied to a service running on a single node, and parallel_force_local was set to TRUE for known reasons.

Test 1: There are two instances with a total of 16 slave processes and parallel_force_local set to TRUE. When we submit 4 concurrent sessions with parallel(4), there is a downgrade and no queuing. When we submit 4 concurrent sessions with parallel(8), one session is queued, two are downgraded, and one gets all 8 slaves.

Test 2: Now I turn off parallel_force_local, and we don't get queuing with 4 concurrent sessions and parallel(4), as there are 16 slaves available across the nodes. When we submit 6 concurrent sessions with parallel(4), we do get queuing; eventually all sessions get the requested number of slaves, and there were no downgrades.

Test 3: If I turn on parallel_force_local (same as Test 1) we get downgrades.

Test 4: Now I left parallel_force_local turned on but set parallel_min_servers, parallel_max_servers, and parallel_servers_target to 0 on the second instance, so there are only 8 slaves to serve requests across the cluster. Now I get queuing for 4 concurrent sessions with parallel(4) and parallel(6).

Test 5: Now I tried parameter instead of hint and got same behavior.

Based on the above observations, I think the statement queuing logic does not take into consideration that parallel_force_local is set to TRUE, or it always assumes it is FALSE. When we submit concurrent requests, Oracle thinks there are slaves available (across the cluster), but when it actually tries to execute the SQL, parallel_force_local set to TRUE means it has to get all slaves from the same node; there are not enough slaves, so the statement gets downgraded.

I will check with Oracle Support to find out whether it is a bug or a design limitation and update this blog.

Update from Oracle support:

Bug 21438369 is fixed in 12.2

If the system is under a high load of PFL queries, then you may see that parallel_servers_target is not conformed to by Resource Manager, or it may even result in downgrades.

Parallel statement queuing did not conform to parallel_servers_target when a lot of parallel_force_local (PFL) queries are run at the same time. This was happening due to the stale published load value: kxfpinaslv.

It looks like there are some ongoing problems with these settings.

Instead of parallel_force_local = true setting, dev suggests:
If they want to restrict certain jobs to certain nodes, they should do the following:
– Create a service for each job type.
– Enable the service on the nodes where it should run.
– Create a consumer group for each service and map the service to the consumer group.

June 1, 2016

In a shared environment with instance caging turned on, I had to perform maintenance during a downtime window to reorganize a few tables that are populated by AQ. I wanted to load balance the sub-tasks, use all available resources, and complete the task quickly. Here is a little snippet of code I wrote to load balance tasks across multiple nodes in a RAC cluster.

Create a job table and populate it with sub-tasks as shown below. I didn't have huge variation in table sizes, so order was not that important. But if there is a huge outlier table in the list, put it first; otherwise you may end up with one job that starts towards the end and runs for a long time.

I used a similar approach for other long-running tasks such as index rebuilds and gathering statistics on tables and indexes. Other post-tasks that do not take long, such as converting table and index degree back to 1, were run from a regular SQL*Plus script.