Tuesday, 21 June 2011

I’m terribly late with this article, initially scheduled for January 2011 … sorry. Maybe it is a bit outdated now; I am publishing it anyway …
Let’s talk about EC2 cloud computing, Talend, Postgresql and JasperServer. Basic setup.
You already know all the pros and cons of cloud computing, so I won’t talk about that. As for me, I love cloud computing and use it every day because of its advantages.

Cloud computing is still new, and it is not surprising to discover software that is not ready for it or not fully “cloud compliant”. I recently faced such an issue when implementing Postgresql, Talend and Jaspersoft, which remain my preferred open source BI tools.

First issue

Let’s imagine we have a single server hosting Postgresql. No big deal as long as we use this instance in a simple way: I can start my instance, host the data on a persistent EBS volume, connect to it and stop it whenever I want. By using Elastic IPs, I can assign a “fixed” IP address to this server and easily set up a connection string. Note on 16/12/2010: Amazon is now offering a DNS service (Route 53).
Now let’s imagine we need a typical three-tier BI architecture: an ETL server (Talend or Pentaho, of course!), a Postgresql database in the middle and Jaspersoft for reporting.
That’s a bit more complex, because we need our Postgresql server to allow connections from the ETL and from the reporting tool. On top of that, we want to fully leverage the cloud computing features: stop the servers when they are not used, boot them when the service is needed, maybe change their network properties … and eventually we want all of this fully automated, working without any human action such as changing connection strings or starting/stopping the servers.
Let’s have a look at a little schema now. As you can see, our architecture is now up and running. We are also using Elastic IPs for each server, which is mandatory for the following demonstration. The IPs are fake.

How to read Public DNS, Private DNS and Elastic IPs on AWS EC2?
Imagine we have an instance running. Its Elastic IP is 46.52.186.25 and its private IP address is 11.235.33.6.
The private DNS name is: ip-11-235-33-6.eu-west-1.compute.internal
The public DNS name is: ec2-46-52-186-25.eu-west-1.compute.amazonaws.com
You see the relationship?

Ok, now, how do you think we will configure the Postgresql server to allow connections from the ETL server and from the reporting server? Easy, here is one answer:

By making the ETL server and the reporting server point to Postgresql. For that, we will use the nice little Elastic IP we previously set up for the Postgresql server, because it’s soooo easy to do it that way …

By writing the ETL server’s Elastic IP and the reporting server’s Elastic IP into Postgresql’s pg_hba.conf, of course … because, here again, it is soooo easy and natural to do so.

And then we write the Elastic IPs into the pg_hba.conf file like this, in order to allow the Talend server and JasperServer to connect to the Postgresql database. This is a basic pg_hba.conf; I encourage you to add stronger authentication.
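A sketch of such a pg_hba.conf (the Elastic IPs are fake ones in the style of the schema above, and the trailing labels are my own convention, which becomes handy later for scripting):

```
# TYPE  DATABASE  USER  ADDRESS           METHOD
host    all       all   46.52.186.25/32   md5    # ETL (Talend)
host    all       all   46.52.186.26/32   md5    # JASPER (reporting)
```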
We are done. Don’t forget to adjust the security groups like this:

Talend server: allow 8080, allow 22

Postgresql server: allow 5432, allow 22

JasperServer: allow 80 (or 443 for https), allow 22
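If you prefer the command line, the same rules can be sketched with the classic EC2 API tools of that era (the group names are illustrative, and in real life you would tighten -s to the actual client ranges instead of 0.0.0.0/0):

```
ec2-authorize talend-sg -P tcp -p 8080 -s 0.0.0.0/0
ec2-authorize talend-sg -P tcp -p 22   -s 0.0.0.0/0
ec2-authorize pgsql-sg  -P tcp -p 5432 -s 0.0.0.0/0
ec2-authorize pgsql-sg  -P tcp -p 22   -s 0.0.0.0/0
ec2-authorize jasper-sg -P tcp -p 80   -s 0.0.0.0/0
ec2-authorize jasper-sg -P tcp -p 22   -s 0.0.0.0/0
```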

Okay, this setup is fully working; you can test it.
But wait … that’s not the right way to do it! By using the Elastic IPs to set up communication between the servers/nodes, we just created a weird monster that makes the traffic go OUT of the cloud and come BACK INTO the cloud. Don’t forget you are paying for that. Look at this schema.

First solution

The best practice is to avoid using Elastic IPs to set up network traffic between servers hosted inside the EC2 cloud. Instead, use the EC2 internal addresses.
Ok, but … wait a minute.

How do I retrieve the internal address from inside EC2?

The solution relies on a poorly documented EC2 feature: when you query an EC2 public DNS name from inside EC2, you are given back the corresponding internal IP address. Just what we need!
For instance, if you query your ETL server from your Postgresql server, using the famous host command, you will get:
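Something like this (an illustration using the fake addresses above, not the original screenshot):

```
$ host ec2-46-52-186-25.eu-west-1.compute.amazonaws.com
ec2-46-52-186-25.eu-west-1.compute.amazonaws.com has address 11.235.33.6
```

The same query run from outside EC2 would return the Elastic IP, 46.52.186.25.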

You see what you have to do? Replace all the Elastic IPs, except the one used by your Talend client, with internal IPs. That way, your internal data won’t leave the cloud, as shown below.

After switching to the internal addressing, the connection screens will look like this. JasperServer connection screen: Postgresql database <===> JasperServer
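For instance, the JDBC URL of the JasperServer data source could look like the line below. The database name dwh is illustrative, and I assume ec2-46-52-186-27… is the Postgresql server’s public name; one convenient option is to use that public DNS name, since from inside EC2 it resolves to the internal IP:

```
jdbc:postgresql://ec2-46-52-186-27.eu-west-1.compute.amazonaws.com:5432/dwh
```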

Second issue

Well, ok, we solved our first issue: using internal addresses between the ETL server and the Postgresql server. But I can see two other issues:

Postgresql still does not accept DNS names in pg_hba.conf! Only IP addresses are allowed, so we can’t ask Postgresql and pg_hba.conf to resolve the DNS names for us.

What if I decide to reboot the ETL server or the reporting server? These internal addresses are nice, but they change each time I stop and restart a server in EC2. So, how do I keep my Postgresql pg_hba.conf updated with these frequently changing addresses?


Second solution

No, there is still no support for DNS entries in pg_hba.conf. I know this is a long-awaited feature, at least by me. But, unless I’m wrong (tell me), writing a DNS name into pg_hba.conf won’t work and the server won’t start.
We need to find a way to update pg_hba.conf with the current EC2 internal addresses of the ETL server and the reporting server. Easy: we will use a bit of shell here. The script will retrieve the internal IP address of each server (ETL and JasperServer) using the host command, then update that address in pg_hba.conf with some sed or awk. Finally, with a SIGHUP, the Postgresql server will apply the new address configuration.
Nothing complex, but success relies on good timing.
Note: I created an ORCHESTRATOR, a specialized EC2 instance that monitors all my servers. This orchestrator runs this kind of script as soon as it detects any change in the internal addressing scheme. The ORCHESTRATOR will be detailed in a future article (I have given several public presentations, and a lot of people seem interested …).
And now the shell script. It asks for the internal address, then updates the corresponding line. For that, you must keep the file tidy: labels are needed.
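Here is a minimal sketch of that idea (the function names, paths and labels are my own illustrations, not the original script): one function resolves a server’s public DNS name from inside EC2 to obtain its internal IP, the other rewrites the matching labelled line in pg_hba.conf.

```shell
#!/bin/sh
# Sketch only. Assumption: every rule line in pg_hba.conf ends with a
# label comment such as "# ETL" or "# JASPER", so sed can find which
# line to rewrite.

resolve_internal_ip() {
    # $1 = public DNS name of an instance.
    # From inside EC2, resolving the *public* name returns the
    # current *internal* IP address.
    host "$1" | awk '/has address/ {print $4; exit}'
}

update_hba_entry() {
    # $1 = path to pg_hba.conf, $2 = label, $3 = new internal IP.
    # Replace the IP on the line carrying the matching label comment,
    # keeping the rest of the line (mask, method, label) intact.
    sed -i "s|[0-9.][0-9.]*\(/32.*# $2\)|$3\1|" "$1"
}
```

A driver (the ORCHESTRATOR, for instance) would then call, for each server, something like `update_hba_entry /var/lib/pgsql/data/pg_hba.conf ETL "$(resolve_internal_ip ec2-46-52-186-25.eu-west-1.compute.amazonaws.com)"`, and finish with `pg_ctl reload` (or a `kill -HUP` on the postmaster), so Postgresql re-reads pg_hba.conf without a restart.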

The end

Having a small (or even big) BI architecture up and running in EC2 is not a big deal. Having it properly set up – in order not to pay extra fees – is something different, and it needs some basic thinking beforehand. The addressing issue, while technically simple, can have a negative impact on your project if you don’t manage it from the start.

I recommend that any AWS / EC2 user (BI or not) create their own admin tools and scripts, based on the various available APIs, in order to:

reduce reaction time,

be fully independent,

save time (graphical tools are nice, but they need clicks, clicks and more clicks …)


Who am I?

Datawarehousing & BI / Cloud Computing and Data processing.
I work for several clients, from banking/insurance to call centers, entertainment and tourism.
Regularly CTO or CDO for startups or marketing companies dealing with data. Technical consulting for startups around Big Data, Analytics and Cloud Computing. Currently working for a public / government organization.