Tuesday, September 26, 2017

The article, Achieve enterprise integration with AWS, depicts the orchestration of Lambdas using Amazon Simple Workflow (SWF) with outstanding results. As stated there, SWF requires a standalone application running in order to process the flows, and this time we wanted to migrate the application to a 100% serverless solution. The article also mentions a new service that looks very promising in the serverless scenario: Step Functions. Here, we want to show you how we took the previous approach and transformed it into a Step Functions-led approach.

AWS Step Functions is a service that helps you create a flow based on several units of work, often implemented as AWS Lambdas. This service is basically a state machine: given an input, an initial state computes what's required by the underlying implementation and generates an output. This output serves as the input for the next state, whose output might be used as the input for another step, and so on until the flow is completed and the last state is executed. Each state, or node in the visual editor of the AWS Step Functions console, is implemented with a Lambda, and the flow of the state machine is orchestrated by the logic specified in the transitions' definitions.

AWS Step Functions provides the following functionality:

Create a message channel between your Lambdas.

Monitor the Lambdas by reporting the status of each step.

Automatically trigger each step.
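To make the state machine concept concrete, here is a minimal sketch of a two-step definition in the Amazon States Language, built as a Python dict so it can be serialized to the JSON the service expects. The state names and Lambda ARNs are hypothetical placeholders, not taken from a real account:

```python
import json

# A minimal two-step state machine definition in Amazon States Language.
# The state names and Lambda ARNs below are hypothetical placeholders.
definition = {
    "Comment": "Read a CSV file and paginate it",
    "StartAt": "FileIngestion",
    "States": {
        "FileIngestion": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FileIngestion",
            "Next": "Paginator",
        },
        "Paginator": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Paginator",
            "End": True,
        },
    },
}

# This JSON string is what you would paste into the Step Functions console.
print(json.dumps(definition, indent=2))
```

The output of each Task state becomes the input of the next, which is why the transitions alone are enough to describe the whole flow.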

The Scenario

At IO Connect Services we wanted to test this new AWS service with an enterprise integration use case based on the scenario described in the SWF implementation. We modified the file size according to the AWS Step Functions free tier for testing purposes:

Reduced the CSV (comma-separated values) file stored in AWS S3 from 800K+ to 100K+ records, with 40 columns per record. We wanted to be sure the number of state transitions would not surpass the 4,000 included in the free tier. Reducing the file to 100K+ records yields approximately 400+ pages to be created; in the case of the "Parallel Branches" approach (explained below) this consumes 1,200+ transitions, giving us at least 3 runs before passing the free tier limit. The original file would have produced 3,200+ pages and consumed approximately 9,200+ transitions, generating a cost of $0.14 USD for the first execution and $0.23 USD per execution after that.

Create pages of the file according to the specified batch size. SQS has a limit of 256 KB per message, and using the UTF-8 charset with 250 records per page/message gives us approximately 230 KB.

Store the pages in individual files in a Storage Service like AWS S3.
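The paging arithmetic behind the scenario can be sanity-checked quickly; this sketch assumes the round figures quoted above (100K records, 250 records per page):

```python
import math

records = 100_000          # reduced record count of the CSV file
records_per_page = 250     # keeps each SQS message under the 256 KB limit
pages = math.ceil(records / records_per_page)
print(pages)  # 400 pages, in line with the "400+ pages" estimate
```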

For this approach, the main idea is to use AWS Step Functions as an orchestrator in order to exercise all the features it provides to support enterprise integration - logs, visual tools, easy tracking, etc. The actual units of work are implemented using AWS Lambda. Because of the AWS Lambda limits, the units of work are kept very small to avoid reaching those limits; hence a good flow written in AWS Step Functions requires a series of steps perfectly orchestrated.

What we can do with AWS Step Functions.

This was a completely new tool for us, so we did some due diligence to investigate what it can do, what it can't, and other useful information.

Can.

Use a simple JSON format text to create the State Machine.

Use states to call AWS Lambda Functions.

Run a defined number of branches that execute steps in parallel.

Use a different language for each Lambda in the same State Machine.

Send serializable objects like POJOs in the message channel.

Create, Run and Delete State Machines using the API.

Use the logs and other visual tools in order to see execution messages.

Can not.

Edit an already created State Machine. You'll have to remove it and then create a new one.

Launch the state machine from a trigger event like the creation of a file in S3. Instead, you'll need to write a Lambda to trigger it.

Create a dynamic number of branches of states to be run in parallel. It's always a pre-defined set of parallel tasks.

Use visual tools (like drag and drop) to create the state machine. All the implementation must be done by writing JSON. The console only shows you a graph representing the state changes; you cannot use it to create the state machine, only to visualize it.

Send non-serializable objects in the message channel. This is a big point as you must be sure the objects you return in your Lambda are serializable.

Resume the execution if one of the steps fails. Either it runs completely or fails completely.

Consider this.

The free tier allows you 4,000 step transitions free per month.

All the communication between steps is made using JSON objects.

A state machine name must be 1-80 characters long.

The maximum length of the JSON used as input or result for a state is 32,768 characters.
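Given that 32,768-character cap on state input and output, it's worth guarding any payload a Lambda returns before handing it to the next state. A minimal sketch of such a check (the function name and limit constant are our own):

```python
import json

MAX_STATE_PAYLOAD_CHARS = 32768  # Step Functions input/output limit

def fits_in_state_payload(obj):
    # Serialize the object the same way Step Functions will carry it
    # between states, then verify it stays under the limit.
    return len(json.dumps(obj)) <= MAX_STATE_PAYLOAD_CHARS

print(fits_in_state_payload({"pages": list(range(100))}))  # True
```

Returning a failure early from the Lambda beats letting the state machine abort mid-flow with a size error.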

Approach 1: Lambda Orchestration.

For the first approach, we wanted to test how Step Functions works. For this purpose, we set up only two steps in order to see what we could examine using the Step Functions logs and graph.

The StateMachine JSON

As mentioned before, a state machine cannot be triggered by an S3 event or similar, so we used a Lambda to trigger the state machine when an S3 object with a ".csv" extension is created in a certain bucket. This Lambda then starts the state machine and passes the S3 object details as an input parameter.
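A trigger Lambda along those lines might look like the following sketch. The event parsing follows the standard S3 notification shape; the state machine ARN is a placeholder and `build_execution_input` is a helper name we made up for illustration:

```python
import json

def build_execution_input(event):
    # Pull the bucket and key out of the S3 "ObjectCreated" notification.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    if not key.endswith(".csv"):
        return None  # only .csv objects start the state machine
    return json.dumps({"bucket": bucket, "key": key})

def handler(event, context):
    payload = build_execution_input(event)
    if payload is None:
        return "skipped"
    # boto3 is imported lazily so the module loads without AWS credentials.
    import boto3
    sfn = boto3.client("stepfunctions")
    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:Paginator",
        input=payload,
    )
    return "started"
```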

The FileIngestion step calls a Lambda that reads the information provided by the trigger event to locate and read the file created in S3, calculates the number of pages to create, and returns this number and the file location as output.

The Paginator step calls a Lambda that reads the lines of one single page, stores them in a variable, then calls another Lambda asynchronously to write a file with the page content. This process is repeated until the original file is completely read.

In this approach, the Lambdas have more control over the flow than the state machine, because one Lambda calls another and orchestrates the asynchronous executions. Also, if the Lambda that writes the pages fails, you cannot see it in the graph; you need to check the Lambda executions and manually identify which Lambda failed and why.

The Metrics.

The total execution time averages 4 minutes to process 100K+ records.

Approach 2: Linear Processing.

Taking the previous implementation into account, we wanted to create a state machine that has more control over the flow execution. As a first step, we decided to implement a linear execution with no parallelization.

The FileAnalizer step calls a Lambda function that consumes the .csv file and creates a POJO with the start and end byte of each page to be created; these bytes are calculated from the page size parameter specified in the Lambda. You can see it this way: FileAnalizer creates an index of the start and end bytes for each page.

FileChecker is a choice step that verifies a boolean variable determining whether all the pages were completed. This information is stored in an SQS queue.

PageCreator calls a Lambda that reads the start and end bytes of each page in the received POJO, reads only that portion of the S3 file, and creates an SQS message with the page content.

QueueChecker is similar to FileChecker, but in this case it waits until no messages are left in the SQS queue.

ReadSQStoS3 is a resource step that calls a Lambda function; it reads the messages in the SQS queue that represent pages of the .csv file and stores them in an S3 folder.

SuccessState ends the state machine execution.

For this approach, the message channel always contains the POJO with the start and end bytes of each page.
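The ranged read that PageCreator performs maps directly onto the HTTP `Range` header supported by S3 `GetObject`. A sketch of that step, with the AWS clients passed in and all names hypothetical:

```python
def byte_range_header(start, end):
    # S3 honors standard HTTP range requests; both ends are inclusive.
    return f"bytes={start}-{end}"

def create_page_message(s3, sqs, bucket, key, queue_url, start, end):
    # Read only the slice of the CSV file that belongs to this page...
    body = s3.get_object(
        Bucket=bucket, Key=key, Range=byte_range_header(start, end)
    )["Body"].read()
    # ...and publish it as one SQS message (must stay under 256 KB).
    sqs.send_message(QueueUrl=queue_url, MessageBody=body.decode("utf-8"))

print(byte_range_header(0, 1023))  # "bytes=0-1023"
```

Reading by byte range means each invocation touches only its own page, never the whole file.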

The Metrics.

The total execution time averages 15 minutes to process 100K+ records.

Approach 3: Batch Writing.

We took the same state machine as in the linear processing, but the Lambda resource in step ReadSQStoS3 was modified with the intention of reducing the execution time of the previous approach. We added long polling behavior in the Lambda with a maximum of 10 messages. With this, the Lambda waits for up to 10 messages in SQS (if 20 seconds pass and 10 messages are not visible, it takes whatever is available at that moment), receives them, and calls another Lambda asynchronously to write these 10 messages.

The Metrics.

The total execution time averages 10 minutes to process 100K+ records.
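The long polling added in this approach boils down to two `ReceiveMessage` parameters; here is a sketch of the receive side (the queue URL is hypothetical and the client is passed in):

```python
def receive_kwargs(queue_url):
    # Long polling: wait up to 20 seconds for up to 10 messages,
    # then take whatever is visible at that moment.
    return {
        "QueueUrl": queue_url,
        "MaxNumberOfMessages": 10,
        "WaitTimeSeconds": 20,
    }

def drain_batch(sqs, queue_url):
    # One long-poll receive; returns the (possibly smaller) batch,
    # which is then handed to the writer Lambda asynchronously.
    response = sqs.receive_message(**receive_kwargs(queue_url))
    return response.get("Messages", [])

print(receive_kwargs("https://sqs.example/queue")["WaitTimeSeconds"])  # 20
```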

Approach 4: Parallel Branches.

For this implementation, we added a series of 5 branches in order to read the start and end bytes of each page and send a message to SQS with the page content in parallel.

Here we faced two problems:

We confirmed the Step Functions limitation that you cannot create a dynamic number of parallel branches. This means you have to define a fixed set of parallel jobs from the beginning.

At the end of the parallel execution - meaning the 5 tasks depicted below - Step Functions aggregates the results of all tasks and passes a single message with all results in it. This is a problem with big JSON structures, which can turn into an even bigger JSON at the end of the parallel execution. If this JSON is bigger than 32,768 characters, an error is thrown and the execution fails.

FileAnalizer is a resource step in which a Lambda is called to read the .csv file stored in AWS S3, create the index of the start and end bytes of each page, and store these indexes in an external service like a database, file, cache service, etc.

FileChecker is a choice step. It reads a boolean variable that indicates whether all the pages were already stored as SQS messages.

SetBatchIndexX is a transition step. It sets a variable used by "PageCreator" to know which index is next to be read.

PageCreatorX, depending on the integer value passed by "SetBatchIndex", extracts the page's start and end bytes, uses them to read only the portion of the CSV file defined by these indexes, and sends an SQS message with the page content. It returns the information it reads in order to know in the next step which pages are left.

DeleteRead receives the output payload of the parallel process, determines which pages were already written to SQS, and deletes the information related to these pages from the external service (step 1).

QueueChecker, ReadSQStoS3 and SuccessState work the same as in the "Batch Writing" approach.
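Because the branches must be fixed up front, the Parallel state can at least be generated programmatically instead of written by hand. A sketch that builds the five-branch definition; the state names follow the ones above and the Lambda ARNs are placeholders:

```python
def make_branch(i):
    # Each branch runs its own SetBatchIndex -> PageCreator pair.
    return {
        "StartAt": f"SetBatchIndex{i}",
        "States": {
            f"SetBatchIndex{i}": {
                "Type": "Pass",
                "Result": i,
                "ResultPath": "$.batchIndex",
                "Next": f"PageCreator{i}",
            },
            f"PageCreator{i}": {
                "Type": "Task",
                "Resource": f"arn:aws:lambda:us-east-1:123456789012:function:PageCreator{i}",
                "End": True,
            },
        },
    }

# Five branches, fixed at definition time - the limitation discussed above.
parallel_state = {
    "Type": "Parallel",
    "Branches": [make_branch(i) for i in range(1, 6)],
    "Next": "DeleteRead",
}
print(len(parallel_state["Branches"]))  # 5
```

Generating the branches this way at least keeps the duplication out of your hands, even though the count itself stays hard-coded.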

The Metrics.

The total execution time averages 6 minutes to process 100K+ records.

Conclusion.

AWS Step Functions is a tool that allows you to create and manage orchestration flows based on small units of work. The simplicity of the language makes it perfect for quick implementations, as long as you have already identified the units of work.

Unfortunately, as this is a fairly new service in the AWS ecosystem, functionality is severely limited. Proof of this is the fact that you need to maintain a fixed number of parallel steps, and if you end up having less work than parallel steps you must add control logic to avoid unexpected errors.

Moreover, given the limits found in AWS Lambda and Step Functions, computing high workloads of information can be very difficult if you don't give good thought to how your design decomposes the processing. We highly recommend you read our blog post Microflows to understand what this means.

On the plus side, if you want to transport small portions of data or compute small processes in a serverless fashion, Step Functions is a good tool for it.

In the future, we will evaluate combining other new AWS services like AWS Glue and AWS Batch with Step Functions to achieve outstanding big data processing and enterprise integration.

Thanks for taking the time to read this post. I hope this is helpful when you decide to use Step Functions, and do not hesitate to drop a comment if you have any questions.

Tuesday, August 22, 2017

In recent days, we were asked to build an ETL flow using Amazon Web Services. Because we excel in enterprise integration, we had a particular design in mind to make it happen. The job was pretty simple:

The trigger was a file placed in a particular S3 bucket.

Take the S3 object metadata of the file as the input of the job.

Read the file and package the records in pages; each page is sent asynchronously as a message. This technique increases parallelism in the job processing, since the files contain one million records on average.

Consume all pages asynchronously and upload them as micro-batches of records into a third-party system via a RESTful API.

Other tasks to complete the use case like recording the completion of the job in a database.

On top of these basic requirements, we had to make sure the system was robust, resilient and as fast as possible while keeping the costs of the different systems low.

We chose to use different services from Amazon Web Services for this: S3, Simple Workflow (SWF), Simple Queue Service (SQS) and Lambda.

Here is a diagram of the solution (click on the image to see it bigger).

Solution diagram

Why Simple Workflow (SWF)?

As you can see in the diagram, every task is executed by a Lambda function, so why involve Simple Workflow? The answer is simple: We wanted to create an environment where the sequence of task executions was orchestrated by a single entity, and also be able to share with the different tasks the context of the execution.

If you think of this, we wanted to have something similar to a Flow in a Mule app (MuleSoft Anypoint platform).

It is important to highlight that AWS has some specific limits for executing Lambdas, such as the fact that one Lambda function can only run for a maximum of 5 minutes. Due to these limits, we had to break the tasks into small but cohesive units of work while having a master orchestrator that could run longer than that. Here's where the shared context comes in handy.

Note: There's another service that plays very well on the serverless paradigm as opposed to SWF, Step Functions, but at the time We were working on this task it was still in Beta, hence not suitable for production. There is a follow-up post about full Serverless integration that will include Step Functions.

Challenges and recommendations

While working with SWF and Lambdas, we learned some things that helped us a lot to complete this assignment. Here I'll show you the situations and solutions that worked for us.

Invoke Lambdas from activities, not workflow workers

One thing you should know about working with SWF is that every output of an activity returns as a Promise to the workflow worker - very similar to a Promise in JavaScript. This Promise returns the output as a serialized object that you need to deserialize if you want to use it as an input for a Lambda function executed directly from the workflow worker. This overhead can be very cumbersome if you use it frequently. In your Lambdas you're supposed to work with objects directly, not serialized forms.

Here is my first piece of advice: even though you can invoke a Lambda function from within a workflow worker, don't do it; use an Activity worker instead. This way each workflow worker implements a unit of work that calls an Activity worker, which in turn calls a Lambda function internally. Why? Because in the Activity worker you will be able to use a proper object to pass to the Lambda as an input parameter. This technique requires you to deal with some extra plumbing in your SWF code, since you'll need one Activity per Lambda, but in the end it provides a very flexible and robust mechanism to exchange information between SWF and Lambdas.

See this sequence diagram to understand it.

Workflow, activity and lambda sequence diagram.

Wrap your payload in a Message object

All in all, we are talking about Enterprise Integration, and one of its central pieces is the message. In order to uniformly share information between the workflow and the different Lambdas, it's better to standardize this practice by using a custom Message object. This Message must contain the workflow context you want to share and the payload. When the Lambda functions are called, they receive this Message object and extract from it the information required to perform the task fully with no external dependency.
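Our implementation is in Java, but the Message wrapper pattern is language-agnostic. Here is a Python sketch of the same idea, with field names of our own choosing:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    # Workflow context shared with every Lambda: job id, bucket, progress, etc.
    context: dict
    # The actual data the task operates on.
    payload: object = None
    # Optional metadata, kept separate from both context and payload.
    headers: dict = field(default_factory=dict)

msg = Message(
    context={"jobId": "job-42", "bucket": "input-bucket"},
    payload=["row1", "row2"],
)
print(msg.context["jobId"])  # job-42
```

Each Lambda receives one of these and has everything it needs, with no lookup to an external store.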

Decompose large loads of data into small pieces

As mentioned before, Lambdas are supposed to run small tasks quickly and independently, therefore they have limits that you should be aware of, such as execution time, memory allocation, ephemeral disk capacity, and the number of threads among others. These are serious constraints when working with big amounts of data and long running processes.

In order to overcome these problems, we recommend decomposing the entire file content into small pieces to increase task parallelism and improve performance in a safe manner - actually, this was one of the main reasons to use Lambdas, since they auto-scale nicely as parallel processing increases. For this, we divided the file content into packages of records as pages, where each page can contain hundreds or thousands of records. Each page was placed as a message in an SQS queue. The size of the page must respect the limit of 256 KB per message in SQS.

Keep long running processes in Activities, not Lambdas

As you see in the diagram above, there's a poller that is constantly looking for new messages in the SQS queue. This can be a long-running process if you expect tens of thousands of pages. For cases like this, having activities in your flow is very convenient, as an activity can run for up to one year; this contrasts sharply with the 5-minute execution limit of a Lambda function.

Beware of concurrency limits

Consider the scenario where you have an Activity whose purpose is to read the queue and delegate the upload of the micro-batches to an external system. Commonly, to speed up the execution you make use of threads - note I'm talking about Java but other languages have similar concepts. In this Activity, you may use a loop to create a thread per micro-batch to upload.

A Lambda execution has a limit of 1,024 concurrent threads, so if you plan to create a lot of threads to speed up your execution, like uploading micro-batches to the external system mentioned above, first and most importantly, use a thread pool to control the number of threads. We recommend not creating instances of Thread or Runnable directly; instead, create Java lambda functions for each asynchronous task you want to execute. Make sure you use the AWSLambdaAsyncClientBuilder interface to invoke the Lambdas - the ones in AWS - asynchronously.
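The same advice translates to other runtimes. In Python, for example, a bounded `concurrent.futures.ThreadPoolExecutor` plays the role of the thread pool; the upload function here is a stand-in for the real external-system call:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_micro_batch(batch):
    # Stand-in for the real call that pushes a micro-batch to the external API.
    return len(batch)

micro_batches = [["r1", "r2"], ["r3"], ["r4", "r5", "r6"]]

# A fixed-size pool keeps the thread count well under the runtime's limit,
# no matter how many micro-batches arrive.
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(upload_micro_batch, micro_batches))

print(sizes)  # [2, 1, 3]
```

The key design point is the same in any language: the pool size, not the work size, bounds concurrency.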

Conclusion

This approach was particularly successful for a situation where we were not allowed to use an integration platform like Mule. It is also a very nice solution if you just need to integrate AWS services and move lots of data among them.

AWS Simple Workflow and Lambda work pretty well together although they have different goals. Keep in mind that an SWF application needs to be deployed on a machine, like a standalone program, either in your own data center or maybe an EC2 instance, or another IaaS.

This combo will help you orchestrate and share different contexts, either automated through Activities or manually by using signals. But if what you need is isolated execution and chaining is not relevant to you, then you could use Lambdas only. Note that a chained execution will not truly isolate the Lambdas from each other, and the starting Lambda may time out before the Lambda functions triggered later in the chain finish their execution.

Moreover, every time you work with resources with limitations like those of AWS Lambda, always bear in mind the restrictions they come with and design your solution based on these constraints - hopefully, as Microflows. Have a read of the Microflows post by Javier Navarro-Machuca, Chief Architect at IO Connect Services.

To increase parallelism we highly recommend using information exchange systems such as queues, transient databases or files. In AWS you can make use of S3, SQS, RDS or DynamoDB (although our preference is SQS for this task).

Stay tuned, as we're working on a solution that uses Step Functions with Lambdas rather than Simple Workflow for a fully serverless integration.

Thursday, August 3, 2017

In the previous post Benchmarking Mule Batch Approaches, written by my friend and colleague Victor Sosa, we demonstrated different approaches for processing big files (1-10 GB) in a batch fashion. The manual pagination strategy proved to be the fastest algorithm, but with one important drawback: it is not resilient. This means that after restarting the server or the application, all the processing progress is lost. Some of the post's commenters highlighted that this capability was needed to evaluate the approach fairly against the Mule Batch components, since the Mule Batch components provide resiliency by default.

In this post, I show how to enhance the manual pagination approach by making it resilient. For testing this approach I used a Linux virtual machine with the following hardware configuration:

Intel Core i5 7200U @ 2.5 GHz (2 cores)

8 GB RAM

100 GB SSD

Using the following software:

MySQL Community Server 5.7.19

AnyPoint Studio Enterprise Edition 6.2.5

Mule Runtime Enterprise Edition 3.8.4

To process a comma-separated value (.csv) file that contains 821000+ records with 40 columns each, the steps are as follows:

The Approach.

We based this on the manual pagination approach from the aforementioned article and created a Mule app that processes .csv files. This time we added a VM connector to decouple the file read and page creation from the bulk page-based database upsert. We configured the VM connector to use a persistent queue so that messages are stored on the hard disk.

Description of the processing flow:

1. The file (.csv) is read and the number of pages is calculated according to the number of records configured per batch; in this case, 800 records are set per page.

2. The file is put in the payload as a stream so that it can be read forward to create the pages in each ForEach loop.

3. Each page is sent to a persistent VM connector to store the pages in the DB in a different flow. Making the VM connector persistent means that pages are written into files on disk, hence the inbound VM connector can resume the consumption of the messages as files after an application reboot so that the records in those messages can be upserted into the database.
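The page count from step 1 is easy to sanity-check; this sketch uses the approximate record count quoted earlier:

```python
import math

records = 821_000   # approximate record count of the .csv file
batch_size = 800    # records configured per page
pages = math.ceil(records / batch_size)
print(pages)  # 1027 pages flow through the persistent VM queue
```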

Metrics.

I took the metrics used in the previous Mule Batch article as a baseline to compare the efficiency of this new approach. I recreated very similar flows to test in my environment and I obtained the following results:

Out-of-the-box Mule batch jobs and batch commit components.

The total execution time averages 7 minutes.

Memory usage averages 1.34 GB.

Custom pagination.

The total execution time averages 6 minutes.

Memory usage averages 1.2 GB.

Custom Pagination with VM connector (This Approach).

At first, I obtained good results with this approach, but they were 30 seconds slower than the "Custom Pagination" approach:

The total execution time (without stopping the server) averages 6 minutes and 30 seconds.

Memory usage averages 1.2 GB.

After increasing the number of threads from 25 to 30 in the Async connector configuration, these are the results:

The total execution time (without stopping the server) averages 6 minutes.

Memory usage averages 1.2 GB.

Conclusions.

When designing an enterprise system, many factors come into play and we have to make sure it will work even through disastrous events. Resiliency is a must-have in every system. For us, the VM connector brings this resiliency while keeping the execution costs within the desired parameters. Also, be aware that some performance tuning may be needed to obtain resiliency without compromising performance.

Friday, July 7, 2017

In this post, I want to share my experience of how I successfully passed the MCD - Integration and API Associate certification exam. This is one of the entry-level certifications for the Mule platform.

Working for IO Connect Services as an Integration Engineer, I focus on enterprise integration in my daily tasks, and for this purpose we use the Mule integration platform by MuleSoft.

To prepare for the MCD certification, you can find plenty of documentation and tips on many sites and developer forums like StackOverflow and MuleSoft.U - where courses and tutorials are free.

Introduction to the exam.

This is not a complicated exam, but you must have a good software development background in order to understand the topics.

You can use AnyPoint Studio Enterprise Edition free for 30 days; this is a very useful tool and it is where you will do your practice. AnyPoint Studio is an Eclipse-based IDE that contains visual tools that are very intuitive. After all, Mule uses Java, so if you are familiar with this language and have some experience using the Spring framework, you are more than ready to start.

I highly recommend that in all of the practices, you use the debugger tool to see how the variables change between the application components. This will help you to identify why and how these changes happen, and this type of knowledge is fundamental to pass the exam.

The exam contains the following topics:

Introducing API-Led Connectivity

Designing APIs

Building APIs

Deploying and Managing APIs

Accessing and Modifying Mule Messages

Structuring Mule Applications

Consuming Web Services

Handling Errors

Controlling Message Flow

Writing DataWeave Transformations

Connecting to Additional Resources

Processing Records

Exam preparation.

MuleSoft offers two courses for training: the instructor-led Anypoint Platform Development Fundamentals course - onsite and online delivery - and the self-paced MuleSoft.U Development Fundamentals course. Both options cover all the topics of the exam; the difference is that in the first one you attend regular classes with an instructor for five days, eight hours a day, while in the second you are given the training material and you study and practice on your own. The official MuleSoft.U site says the self-paced training may take you up to eight weeks if you study the material 3 hours per week, but if you are very dedicated, you can prepare in a shorter time. Both training options are great; you can decide which one to take based on your available time, the way you feel more comfortable and, last but not least... the price. At the time of writing this post, the instructor-led training has a price tag of around $2,500 USD.

In my case, with the support of my employer and manager, I decided to take the self-learning option since I could study around six to eight hours every day during weekdays. I felt confident enough to take the exam in less than 2 weeks, and you know the result… I passed! In the end, it all depends on your available time and dedication.

Extra documentation.

Unfortunately, for this certification there aren't practice exams out there as in other certification programs, but the course material provided by MuleSoft is very complete and easy to follow. If you feel that you need more information, you can use these other training resources:

MuleSoft User Guide.

The MuleSoft official site provides a complete guide to its products; there you can find more detailed information about the tech specifications and code examples of all Mule modules and components.

Ask the Experts.

If you know someone who has obtained the certification before, ask him/her; if not, you can always go to the MuleSoft forums looking for answers. Try to search before posting - it is highly possible that someone has asked the same question before.

Tips for the exam.

Here are some tips that I recommend that you take into consideration before taking the exam:

You Have Opportunities.

If you don't pass the exam, don't worry; you have another two chances to take it in the same modality. If this happens to you, check the results page to see which modules you need to improve, go back to the material, study and take notes on those parts, and bring them with you for the next attempt.

Bring Notes.

This is an online, unproctored, open-book exam. That means you can bring books and notes, and even search the internet. I highly recommend that you bring notes on the modules you feel you don't completely understand. You don't want to waste time searching for answers you already know just to "be sure"; if you have time at the end, go back to the questions you are not sure about and compare them with your notes.

Use the Debugger.

I mentioned it before, but I want to say it again. It is very important that you take time to use the AnyPoint Studio debugger in most of the practices; this way you get to see how the variables change their values and the way the flow components interact with each other.

A big thing to highlight here: the debugger is a tool only available in AnyPoint Studio Enterprise Edition - it is not available in the Community Edition. Thankfully, you have 30 days to use the EE version for free, so take advantage of it. If for some reason you can't launch the application in debug mode, check your target Mule server edition.

Read all in the exam.

Be careful when answering the questions; read the whole description of the question and all the possible solutions. Some answers can be tricky.

Learn about Java.

You use visual components and XML documents to develop Mule applications, but everything is based on Java. Also, in the final training modules, you learn how to create Java-based components for your applications. It is important that you know this programming language in advance to understand how Mule applications run. You have to use a JVM-based language like Java if you want to build custom Mule extensions.

Learn about Enterprise Integration Patterns.

Many Mule components are concrete implementations of the Enterprise Integration Patterns. This is not a must for the MCD exam, but they are very handy if you want to build robust, reliable, extensible, and fault-tolerant Mule applications. I recommend that you take a look at http://www.enterpriseintegrationpatterns.com

Conclusion.

I hope this post can be helpful in your preparation for the MCD - Integration and API Associate certification exam. No matter which option you choose, instructor-led or self-paced study, be constant, study and prepare your notes. This is not a complicated exam. Don't worry if you fail the first time; you have other opportunities (the certificate doesn't show how many attempts it took you to pass the exam).

Thank you for taking the time to read this article; I really hope my shared experience is helpful to you. If you have a comment to complement this post, please share it with us! We would love to hear from you.

Tuesday, June 27, 2017

I may not include a lot of information, but I wanted to share my experience of how I passed the AWS Certified Developer Associate certification with a score of 94/100. I posted this initially on my personal blog, but I thought it is a good read for a more serious blog site such as the IO Connect Services blog, so I decided to re-post it. You can find the original here: https://victorsosasw.blogspot.mx/2017/06/aws-certified-developer-associate-tips.html

A week ago, I took and passed the AWS Certified Developer - Associate exam. I found it difficult even though it is an associate-level exam, but with due dedication one can pass it.

I'd like to share my experience with you.

I got into AWS as part of my work at IO Connect Services with one of our customers. It's exciting, as it's my first time doing Serverless and Cloud computing with AWS Lambda and other AWS technologies. Because of this, and other plans on my list, I decided to prepare for the AWS CDA exam. One thing I wasn't aware of is that there's a lot you have to learn for this. In my opinion, this certification is not associate level, and I'll tell you why.

First of all, the topics you have to learn, and mostly memorize are:

AWS Cloud computing fundamentals.

Identity and Access Management (IAM).

Elastic Cloud Computing (EC2).

Elastic Block Store (EBS).

Simple Storage Service (S3).

Virtual Private Cloud (VPC).

Elastic Load Balancer (ELB).

DynamoDB.

Simple Workflow Service (SWF).

CloudFormation.

Simple Queue Service (SQS).

Simple Notification Service (SNS).

Elastic Beanstalk.

All these services are spread in the following 4 categories in the exam:

AWS Fundamentals.

Designing and Developing.

Deployment and Security.

Debugging.

The exam consists of multiple-choice questions, and several times you have to select all that apply, which increases the difficulty of choosing the right combination of answers.

In my experience, from these topics you have to go into detail with IAM, VPC, EC2, DynamoDB, SQS, and S3. The exam is packed with in-depth questions about these services, so you had better get familiar with them and make sure you can build and deploy an application without many supporting references.

Also, I haven't mentioned the SDKs yet. They are not covered in depth; as long as you can identify the supported SDKs, you'll be mostly fine. This means that you will not be questioned about a particular API or routine from the SDKs. Bear in mind that you will be questioned about how to interact with the services' REST APIs though, mostly the common aspects like authentication, token management, and HTTP response codes, among others.

Like I said, it's difficult but not impossible. I used the resources listed below to prepare myself.

Udemy

I took the AWS Certified Developer Associate courses on Udemy. They are video tutorials, and the instructor does a really good job of explaining each of the topics and scenarios covered in the exam. Also, the practice exams give you many chances to improve, as you can see explanations of the answers. You can retake the exams to reinforce your knowledge.

Test King

Here you will find questions very similar to the ones in the exam. It is a really useful resource. I encourage you to go through all the questions and read the comments; some of the listed answers are wrong, but the people in the comments give you really good hints about which answers are correct.

AWS FAQ

Last but not least, make sure you read the FAQ, particularly for S3, DynamoDB, EC2, and VPC. A lot of the questions are about things that do not come up often in normal scenarios of development like limits, region support, and corner cases.

Is it an online exam?

Yes, but it's proctored. Keep in mind that this exam can only be administered by authorized proctors, so you or your employer will probably have to allocate budget for your travel, as in my case. When you schedule your exam, you will see the authorized centers close to you.

Thursday, April 27, 2017

Note: This blog post demonstrates that when fine-tuning a Mule application, really big volumes of data can be processed in a matter of minutes. A follow-up blog post, written by Irving Casillas, shows you exactly how to do this while adding resiliency. See http://blog.ioconnectservices.com/2017/08/mule-batch-adding-resiliency-to-manual.html.

Bulk processing is one of the most overlooked use cases in enterprise systems, even though it is very useful when handling big loads of data. In this document, I will showcase different scenarios of bulk upsert to a database using Mule batch components to evaluate several aspects of performance. The objective is to establish a best practice for configuring Mule batch components and jobs that process big loads of data. Our goal is to optimize execution time without compromising computer resources like memory and network connections.

The computer used for this proof is a commodity machine with the following configuration:

Lenovo ThinkPad T460

Intel Core i7 2.8 GHz

8 GB RAM Memory

100 GB SSD

The software installed is:

MySQL Server v8.0.

Mule v3.8.2.

Anypoint Studio v6.2.2.

JProfiler v9.2.1

The evaluation consists of processing a comma-separated values (.CSV) file that contains 821,000+ records with 40 columns per record. First, we consume the file content, then we transform the data, and finally we store it in a database table. The file is 741 MB uncompressed. To ensure that each record holds the latest information, the database queries must implement the upsert statement pattern.
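The post does not show the exact query, but in MySQL (the database used in this proof) the upsert pattern is typically implemented with INSERT ... ON DUPLICATE KEY UPDATE. Below is a minimal sketch in Java of the parameterized statement shape; the table and column names are hypothetical, made up for illustration only:

```java
// Sketch of a parameterized MySQL upsert statement. The table "customers"
// and its columns are hypothetical examples, not the real 40-column schema.
class UpsertBuilder {
    static String upsertSql() {
        return "INSERT INTO customers (id, name, email) VALUES (?, ?, ?) "
             + "ON DUPLICATE KEY UPDATE name = VALUES(name), email = VALUES(email)";
    }

    public static void main(String[] args) {
        // The placeholders are later bound by the Database connector.
        System.out.println(upsertSql());
    }
}
```

Because the statement is parameterized, the record values are bound as parameters rather than concatenated into the SQL text, which is what keeps the queries sanitized.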

Three approaches are shown here:

Out-of-the-box Mule batch jobs and batch commit components.

Using a custom pagination algorithm.

A hybrid solution.

The first two approaches use the bulk mode of the Insert operation in the Database connector. The third approach uses the Bulk operation in the Database connector.

Some background about the Mule Batch connector: you can configure the block size and the maximum number of threads of the connector. Moreover, in the Batch Commit component, you can also configure the commit size of the batch. This gives a lot of flexibility in terms of performance and memory tuning.

This flexibility comes with a price: you must calculate the amount of memory the computer will dedicate to this process alone for each block. This can be easily calculated with the following formula:

Maximum memory = Size of the record * block size * number of maximum threads

For instance, in our test, the size of the record -which is a SQL upsert statement- is 3.1 KB. The settings for the Batch component are a block size of 200 records and 25 running threads. This requires a total of 15.13 MB per block. In this case, blocks will be processed approximately 4,105 times (remember the 821,000 records?). Also, you must verify that your computer host has enough CPU and memory available for garbage collection too.
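To make the arithmetic concrete, here is a small sketch (in Java, since Mule runs on the JVM) of the formula applied to our numbers; the method names are invented for illustration:

```java
// Sketch of the block-memory formula: record size * block size * max threads.
class BatchMemory {
    // Approximate memory required per block, in MB (1 MB = 1024 KB).
    static double maxMemoryMb(double recordSizeKb, int blockSize, int maxThreads) {
        return recordSizeKb * blockSize * maxThreads / 1024.0;
    }

    // Number of blocks needed to cover all records.
    static int blockCount(int totalRecords, int blockSize) {
        return (int) Math.ceil((double) totalRecords / blockSize);
    }

    public static void main(String[] args) {
        // 3.1 KB per upsert statement, 200 records per block, 25 threads -> ~15.1 MB.
        System.out.println(maxMemoryMb(3.1, 200, 25) + " MB per block");
        // 821,000 records / 200 per block -> 4105 blocks.
        System.out.println(blockCount(821000, 200) + " blocks");
    }
}
```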

The flow

Figure 1. The batch flow.

The batch job is configured to use a maximum of 25 threads and a block size of 200 records.

The file content is transformed into an iterator which is then passed to the batch step.

The commit size of the Batch Commit is set to 200. This matches the block size of the batch job, meaning the full block will be committed.

The Database connector is using an Insert operation in bulk mode and it’s parameterized.

This is a simple flow: all the pagination and bulk construction is done by Mule, so we only need to worry about the SQL statement and performance.

The metrics

As explained before, batch jobs are designed to run as fast as possible by executing multiple processes in parallel threads.

The total time of execution is ~7 minutes.

4105 round trips are made to insert the records into the database.

The maximum memory used during the execution of the batch is 1.29 GB.

Approach 2: Custom pagination

The overall idea here is to read the file, then to transform the content into a list and iterate through the list to create a page of records that allows us to construct a series of SQL queries. Later, the queries are sent in bulk fashion to a Database connector with the bulk mode flag enabled.

The flow

Figure 2. Custom pagination flow.

The number of pages is determined based on the page size. In this case, the page size is set to 200 like in the first approach.

The For Each scope component takes the number of pages as the collection input.

The CSV File Reader consumes the file and builds the map that will be used as the payload, and then it maps the CSV fields to columns in a database record.

The created queries are passed to the Database asynchronous scope which executes the bulk statements with a maximum of 25 threads like in approach 1.
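The pagination logic itself is straightforward. A minimal sketch of it in plain Java (outside of Mule, with hypothetical names) could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the custom pagination: split the record list into fixed-size
// pages so each page can be turned into one bulk SQL statement.
class Paginator {
    static <T> List<List<T>> paginate(List<T> records, int pageSize) {
        List<List<T>> pages = new ArrayList<>();
        for (int i = 0; i < records.size(); i += pageSize) {
            pages.add(records.subList(i, Math.min(i + pageSize, records.size())));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 1001; i++) records.add(i);
        // 1001 records with a page size of 200 produce 6 pages (the last has 1).
        System.out.println(paginate(records, 200).size()); // prints 6
    }
}
```

With our page size of 200, the 821,000+ records yield the roughly 4,100 pages reflected in the round-trip counts below.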

The metrics

The total time of execution was 5 minutes on average. The processing of the SQL bulk is done in a single thread, but the upsert execution is done asynchronously.

The total of round trips to the database was 4109.

The memory consumption was at a maximum of 1.42 GB.

The extra approach 3: Hybrid

This approach was also tested; the results were not as satisfactory as the two above in terms of execution time, but it showed the lowest memory consumption. The results of the testing are presented next.

The SQL bulk query is constructed manually but the pagination is now handled by the batch job.

The flow

Figure 3. Batch manually building the SQL bulk statement.

The CSV content is transformed into an iterator to be passed to the batch process step.

The batch process handles the block size and pagination.

In the batch process, each record is used to construct the SQL statement for that particular record. The SQL query is added to a collection that is then used to create one single SQL statement with all the queries appended, to be processed in bulk.

The Batch Commit component uses the same commit size as the block size of the batch job.

The Database connector uses the Bulk Execute operation to insert the records into the database.
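Conceptually, the per-record queries are concatenated into one statement for the Bulk Execute operation, something like this sketch (a hypothetical helper, not the actual flow code):

```java
import java.util.List;

// Sketch of the hybrid approach's bulk construction: the per-record upsert
// queries collected in the batch step are joined into one statement that the
// Database connector's Bulk Execute operation runs in a single round trip.
class BulkStatement {
    static String join(List<String> queries) {
        return String.join(";\n", queries) + ";";
    }

    public static void main(String[] args) {
        List<String> queries = List.of(
            "INSERT INTO t (id) VALUES (1) ON DUPLICATE KEY UPDATE id = VALUES(id)",
            "INSERT INTO t (id) VALUES (2) ON DUPLICATE KEY UPDATE id = VALUES(id)");
        System.out.println(join(queries));
    }
}
```

Note that, unlike the parameterized Insert in bulk mode, a statement built this way is one literal SQL string, which is part of why I prefer the parameterized mode in the conclusion.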

The metrics

The total time of execution, when completed successfully, is 18 minutes on average.

The number of connections matches the maximum of 25 threads running in the batch job. This gives a total close to 4,105 round trips to the database.

The maximum memory used during the execution of the batch is 922 MB.

Conclusion

Many times, a thoughtful design is more helpful than the out-of-the-box features a platform may offer. In this scenario, the custom pagination approach upserts the records into the database faster than the batch approach. However, there are a couple of things to consider as the outcome of this proof of concept:

The custom pagination approach is more flexible when treating data that can’t be split into records so easily.

For scenarios where you have a source with millions of records coming from separate systems, it’s generally a good practice to consume the content as a stream so you don’t exhaust memory or the network.

It’s easier to maintain the batch job flow than the custom pagination flow.

Using Mule’s batch jobs gives you better facilities for batch result reporting, as it reports the total, succeeded, and failed record counts.

If memory management is the most important factor to honor in your solution, then a hybrid algorithm approach is better as it shows the best numbers in memory.

As a side experiment, I also observed that the Bulk Execute operation in the Database connector is slower than the Insert operation in bulk mode. Moreover, the parameterized mode allows you to take data from any source -trusted or untrusted- and still have the queries sanitized.