Trouble Automating Snapshots & Restoration (EC2+RightScale)

06-13-2012, 06:46 PM

Now that I have Percona XtraDB up and working with CentOS 5.6, I've moved on to the stage of automating the cluster configuration within our RightScale environment on top of EC2. I need to be able to restore a node from a snapshot and have it rejoin the cluster.

PROBLEM: If I restore from an EBS snapshot, having set wsrep_cluster_address to the address of the other node (in a two-node cluster) before starting mysql, mysql always fails to start and gives the following error:

To restore the "sentry1" node after I launch a new instance to replace it, I replace the wsrep_cluster_address with "gcomm://sentry2.ourdomain.com" in the mysql configuration before starting the server. This is where I run into problems, probably because I lack a deep understanding of Galera clustering.

I should also note that through the relaunch process, "sentry1" will obtain a new IP address. When shutting down "sentry1" to relaunch it, nothing more than a "service mysqld stop" is executed; nothing is done to leave the cluster in an orderly fashion.
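For what it's worth, the config swap on the relaunched node can be scripted. This is only a sketch under the assumptions that the config lives in /etc/my.cnf and that the peer node is already running; `set_cluster_address` is a helper name made up for illustration:

```shell
#!/bin/sh
# set_cluster_address FILE PEER
# Rewrites the wsrep_cluster_address line in FILE to point at PEER,
# so the restored node joins the existing cluster instead of
# bootstrapping a new one.
set_cluster_address() {
    sed -i "s|^wsrep_cluster_address.*|wsrep_cluster_address=gcomm://$2|" "$1"
}

# On the relaunched "sentry1" (assumption: config at /etc/my.cnf):
#   set_cluster_address /etc/my.cnf sentry2.ourdomain.com
#   service mysqld start
```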

-Erik

Comment

I think you have configured everything right (except wsrep_node_address, which is the same on both nodes), and the error you're getting is clearly unrelated to configuration. Do you have a firewall configured on either of the nodes? Does sentry2 correctly resolve sentry1's IP after sentry1 is restarted?

BTW, "service mysqld stop" IS leaving the cluster in an orderly fashion.

I have a 3-node cluster and the configuration is done correctly. Currently only the primary node is working properly. On the other two, when I try to start mysqld, it gives the error displayed above. Initially I got a lot of "conflicts" errors while installing Percona XtraDB Cluster on these two nodes, and I also had to change some file permissions; after that, all three nodes started and replicated once. But once I rebooted one of the machines (to check whether a database created on another node would replicate), the mysqld daemon gives the error displayed above.

Comment

1) There are no primary/secondary nodes in the cluster; all nodes are equal peers.
2) Once you restart your "primary" node, wsrep_cluster_address=gcomm:// makes it create a new cluster once again instead of connecting to the other nodes. Most likely this is the cause of the trouble. Never leave wsrep_cluster_address=gcomm:// in my.cnf after you have started the node; set it to point to another node.
3) Apparently nobody is listening at 10.10.20.62:4567, or rather you have a firewall dropping the packets (hence a connection timeout instead of connection refused). Check that you can connect to this address from the "other nodes" with telnet.
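A quick way to verify point 3 from another node; this is a sketch using bash's /dev/tcp redirection (the address is the one from the thread, so substitute your own):

```shell
#!/bin/bash
# check_port HOST PORT — prints "open" if a TCP connection succeeds
# within 3 seconds, "closed" otherwise. A firewall DROP rule shows up
# as a timeout (reported "closed" here) rather than an immediate
# "connection refused".
check_port() {
    if timeout 3 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# From one of the "other nodes":
#   check_port 10.10.20.62 4567   # Galera group communication
#   check_port 10.10.20.62 3306   # MySQL client traffic
```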

Comment

1. There will be one node in the cluster to which the others connect, so that becomes the primary node, right? Unless this node is started, mysqld on the other nodes won't start and won't join the cluster.

2. I tested restarting all the nodes (primary/secondary) and the issue seems to have been resolved. I didn't change wsrep_cluster_address=gcomm:// on the primary node, 10.10.20.62, but it gave the following log after restart:

Can you provide the ideal my.cnf configuration for three cluster nodes?

3. This was resolved. I checked with telnet, and ports 3306 and 4567 are open on all three nodes. Can you tell me whether both 3306 and 4567 are mandatory on all nodes, or will one do? I think 4567 is used for Galera API communication within the cluster and 3306 is for writes to the database, so both are required.

Comment

1. There will be one node in the cluster to which the others connect, so that becomes the primary node, right?

In the cluster each node is connected to every other node.

Quote:

Unless this node is started, mysqld on the other nodes won't start and won't join the cluster.

Of course some node has to be the first to start, when there are no other nodes running. That's why for this node you have to set

wsrep_cluster_address=gcomm://
since you have no other nodes to connect to yet. That makes it the first node in a cluster, but it does not make it any different from the other cluster members.
And that's why, when you restart it with the same setting, it won't connect to its old mates: you are explicitly telling it not to connect to anybody.
If you have 10 nodes in the cluster, you can connect to ANY of them.
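To make the distinction concrete, here is a hedged sketch of the two settings (the host name is the one from the thread; adjust to your environment):

```ini
# Bootstrap ONLY when starting the very first node of a brand-new cluster:
wsrep_cluster_address=gcomm://

# Every subsequent start or restart should point at a running peer:
wsrep_cluster_address=gcomm://sentry2.ourdomain.com
```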

Quote:

2. I tested restarting all the nodes (primary/secondary) and the issue seems to have been resolved. I didn't change wsrep_cluster_address=gcomm:// on the primary node, 10.10.20.62, but it gave the following log after restart:

But when I set this and restart the primary node, mysqld gets terminated.

Yes, you have a misconfiguration somewhere and the donor can't send a state snapshot to that node. You have to look in the logs on the donor node for clues. Which node was the donor is stated in the log, just above the snippet you posted below.

Can you provide the ideal my.cnf configuration for three cluster nodes?

If there existed an "ideal configuration", it would have been hardcoded into the software and no external configuration would be necessary.
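That said, the wsrep settings most three-node setups share can be sketched. Everything below is an assumption to adapt, not a tuned configuration: the addresses (10.10.20.61-63), the provider path, and the SST method are all placeholders for your environment:

```ini
[mysqld]
# Galera requires row-based replication and InnoDB
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

# Path is an assumption; check where your package installed the library
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_name=my_cluster

# On node 10.10.20.61; each node points at a running peer,
# different on each node (never gcomm:// after bootstrap)
wsrep_cluster_address=gcomm://10.10.20.62
wsrep_node_address=10.10.20.61
wsrep_sst_method=rsync
```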

Quote:

3. This was resolved. I checked with telnet and port 3306 and 4567 is open all three nodes. Can you tell me if both 3306 and 4567 is mandatory on all nodes or only one will do ? I think 4567 is used for galera API commnuication for cluster and 3306 is to write to the db . So both are required.

Yes, both are required.

If you're still confused, I'd advise you to attend the Percona Live NY or Percona Live London events, where there will be hands-on training for Percona XtraDB Cluster.