Troubleshooting in OpenStack Cloud Computing

Introduction

OpenStack is a complex suite of software that can make tracking down issues and faults quite daunting to beginners and experienced system administrators alike. While there is no single approach to troubleshooting systems, understanding where OpenStack logs vital information or what tools are available to help track down bugs will help resolve issues we may encounter.

Checking OpenStack Compute Services

OpenStack provides tools to check various parts of Compute Services, and we'll use common system commands to check whether our environment is running as expected.

Getting ready

To check our OpenStack Compute host we must log in to that server, so do this now before following the given steps.

How to do it...

To check that Nova is running the required services, we invoke the nova-manage tool and ask it various questions of the environment as follows:

To check the OpenStack Compute hosts are running OK:

sudo nova-manage service list

You will see the following output. The :-) icons are indicative that everything is fine.

If Nova has a problem: If you see XXX where the :-) icon should be, then you have a problem.

Troubleshooting is covered at the end of the book, but if you do see XXX then the answer will be in the logs at /var/log/nova/.

If you get intermittent XXX and :-) icons for a service, first check if the clocks are in sync.
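The check for XXX entries described above can be scripted. The following is a minimal sketch, assuming only that a failed service shows the literal text XXX as one of the whitespace-separated columns of the nova-manage service list output; the function name failed_services is hypothetical and not part of OpenStack.

```shell
#!/bin/sh
# failed_services (hypothetical helper): read `nova-manage service list`
# output on stdin and print only the rows that contain an XXX column.
# Exits 0 when no failed service is found and 1 otherwise.
failed_services() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "XXX") { print; bad = 1; break }
            }
        }
        END { exit bad + 0 }
    '
}

# Typical use on a Compute host:
#   sudo nova-manage service list | failed_services
```

Because the function sets its exit status, it can be used directly in a condition or in a monitoring check.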

Checking Glance: Glance doesn't have a tool to check, so we can use some system commands instead.

ps -ef | grep glance
netstat -ant | grep '9292.*LISTEN'

These should return process information for Glance, showing that it is running, and confirm that port 9292 (the default) is open in the LISTEN state on your server, ready for use.
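A plain grep for 9292 also matches ports such as 19292. A slightly stricter check can be sketched as follows, assuming the usual Linux netstat -ant column layout (local address in column 4, state in column 6); the function name port_listening is hypothetical.

```shell
#!/bin/sh
# port_listening PORT (hypothetical helper): read `netstat -ant` output
# on stdin and succeed if a socket bound to exactly PORT is in LISTEN.
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # The port is whatever follows the last ":" of the local address,
            # which also covers IPv6 entries such as :::9292.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Typical use for the Glance port mentioned above:
#   netstat -ant | port_listening 9292 && echo "port 9292 is listening"
```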

Other services that you should check:

rabbitmq:

sudo rabbitmqctl status

The following is an example output from rabbitmqctl when everything is running OK:

ntp (Network Time Protocol, for keeping nodes in sync):

ntpq -p

It should return output regarding contacting NTP servers, for example:
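In ntpq -p output, the peer that ntpd has currently selected for synchronization is marked with a leading asterisk. A minimal sketch of a check built on that convention follows; the function name ntp_synced is hypothetical, and the sketch does not look at offset or jitter values.

```shell
#!/bin/sh
# ntp_synced (hypothetical helper): read `ntpq -p` output on stdin and
# succeed if at least one peer line starts with "*", the tally code
# ntpq uses for the selected system peer.
ntp_synced() {
    grep -q '^\*'
}

# Typical use:
#   ntpq -p | ntp_synced || echo "no NTP peer selected; check clock sync"
```

This ties in with the earlier advice: if a service flips between XXX and :-), a host without a selected peer is a good first suspect.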

MySQL Database Server:

MYSQL_PASS=openstack
mysqladmin -uroot -p$MYSQL_PASS status

This will return some statistics about MySQL, if it is running:

How it works...

We have used some basic commands that communicate with OpenStack Compute and other services to show they are running. This elementary level of troubleshooting ensures you have the system running as expected.

Understanding logging

Logging is important in all computer systems, but the more complex the system, the more you rely on logs to spot problems and cut down on troubleshooting time. Understanding logging in OpenStack is important to ensure your environment is healthy and that you are able to submit relevant log entries back to the community to help fix bugs.

Getting ready

Log in as the root user onto the appropriate servers where the OpenStack services are installed.

How to do it...

OpenStack produces a large number of logs that help troubleshoot our OpenStack installations. The following details outline where these services write their logs.

OpenStack Compute Services Logs

Logs for the OpenStack Compute services are written to /var/log/nova/, which is owned by the nova user, by default. To read these, log in as the root user. The following is a list of services and their corresponding logs:

nova-compute: /var/log/nova/nova-compute.log Log entries regarding the spinning up and running of the instances

OpenStack Dashboard logs

OpenStack Dashboard (Horizon) is a web application that runs through Apache by default, so any errors and access details will be in the Apache logs. These can be found in /var/log/apache2/*.log, which will help you understand who is accessing the service and will report any errors seen with the service.

OpenStack Storage logs

OpenStack Storage (Swift) writes logs to syslog by default. On an Ubuntu system, these can be viewed in /var/log/syslog. On other systems, these might be available at /var/log/messages.

Logging can be adjusted to allow for these messages to be filtered in syslog using the log_level, log_facility, and log_message options. Each service allows you to set the following:

If you change any of these options, you will need to restart that service to pick up the change.

Log-level settings in OpenStack Compute services

Many OpenStack services allow you to control the chatter in the logs by setting different log output settings. Some services, though, tend to produce a lot of DEBUG noise by default.

This is controlled within the configuration files for that service. For example, the Glance Registry service has the following settings in its configuration files:

Moreover, many services are adopting this facility. In production, you would set debug to False and optionally keep a fairly high level of INFO requests being produced, which may help with the general health reports of your OpenStack environment.
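To find services that still have debug output switched on, a check along the following lines may help. It is a minimal sketch that assumes INI-style configuration files with a line of the form debug = True; the function name debug_enabled is hypothetical, and the /etc/glance path in the usage comment is the usual package location rather than something taken from this text.

```shell
#!/bin/sh
# debug_enabled FILE (hypothetical helper): succeed if FILE contains an
# uncommented "debug = True" line (case-insensitive, spacing ignored).
debug_enabled() {
    grep -Eiq '^[[:space:]]*debug[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Typical use (path is an assumption):
#   for f in /etc/glance/*.conf; do
#       debug_enabled "$f" && echo "$f: debug is on"
#   done
```

Remember that a changed setting only takes effect after the service has been restarted.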

How it works...

Logging is an important activity in any software, and OpenStack is no different. It allows an administrator to track down problematic activity that can be used in conjunction with the community to help provide a solution. Understanding where the services log, and managing those logs to allow someone to identify problems quickly and easily, are important.

Troubleshooting OpenStack Compute Services

OpenStack Compute services are complex, and being able to diagnose faults is an essential part of ensuring the smooth running of the services. Fortunately, OpenStack Compute provides some tools to help with this process, along with tools provided by Ubuntu to help identify issues.

How to do it...

Troubleshooting OpenStack Compute services can be a complex issue, but working through problems methodically and logically will help you reach a satisfactory outcome. Carry out the following steps when encountering the different problems presented.

Cannot ping or SSH to an instance

When launching instances, we specify a security group. If none is specified, a security group named default is used. These mandatory security groups ensure security is enabled by default in our cloud environment, and as such, we must explicitly state that we require the ability to ping our instances and SSH to them. For such a basic activity, it is common to add these abilities to the default security group.

Network issues may prevent us from accessing our cloud instances. First, check that the compute instances are able to forward packets from the public interface to the bridged interface.

sysctl -A | grep ip_forward

net.ipv4.ip_forward should be set to 1. If it isn't, check that /etc/sysctl.conf has the following option uncommented:

net.ipv4.ip_forward=1

Then, run the following, to pick up the change:

sudo sysctl -p
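The configuration check above can be sketched as a small function. It assumes the option is written on its own line as net.ipv4.ip_forward=1 (spaces around the equals sign are tolerated); the function name ip_forward_configured is hypothetical.

```shell
#!/bin/sh
# ip_forward_configured [FILE] (hypothetical helper): succeed if FILE
# (default /etc/sysctl.conf) has an uncommented net.ipv4.ip_forward=1 line.
ip_forward_configured() {
    conf="${1:-/etc/sysctl.conf}"
    grep -Eq '^[[:space:]]*net\.ipv4\.ip_forward[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$conf"
}

# Typical use:
#   ip_forward_configured || echo "enable net.ipv4.ip_forward in /etc/sysctl.conf"
# The running value can be read separately with:
#   sysctl -n net.ipv4.ip_forward
```

The file check and the running value can differ until sysctl -p has been run, so it is worth looking at both.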

Other network issues could be routing problems. Check that we can communicate with the OpenStack Compute nodes from our client and that any routing to get to these instances has the correct entries.

We may have a conflict with IPv6, if IPv6 isn't required. If this is the case, try adding --use_ipv6=false to your /etc/nova/nova.conf file, and restart the nova-compute and nova-network services. We may also need to disable IPv6 in the operating system, which can be achieved using something like the following line in /etc/modprobe.d/ipv6.conf:

install ipv6 /bin/true

Reboot your host.

Viewing the Instance Console log

You can view the console information for an instance using a number of methods:

You will be taken to an Overview screen. Along the top of the Overview screen is a Log tab. This is the console log for the instance.

When viewing the logs directly on a nova-compute host, look for the file /var/lib/nova/instances/<instance_id>/console.log. The console logs are owned by root, so only an administrator can read them.

Instance fails to download meta information

If an instance fails to download the extra information supplied through the instance metadata, we can end up in a situation where the instance is up but we are unable to log in, because the SSH key information is injected using this method.

Viewing the console log will show output like in the following screenshot:

Ensure the following:

nova-api is running on the host (in a multi_host environment, ensure there's a nova-api and a nova-network node running on the nova-compute host).

Perform the following iptables check on the nova-network node that is running nova-compute:

sudo iptables -L -n -t nat

We should see a line in the output like in the following screenshot:

If not, restart your nova-network services and check again.

Sometimes there are multiple copies of dnsmasq running, which can cause this issue. Ensure that there is only one instance of dnsmasq running:

ps -ef | grep dnsmasq

This will bring back two process entries, the parent dnsmasq process and a spawned child (verify by the PIDs). If there are any other instances of dnsmasq running, kill the dnsmasq processes. When killed, restart nova-network, which will spawn dnsmasq again without any conflicting processes.
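Counting the dnsmasq entries by eye is easy to get wrong, because the grep command itself shows up in the listing. The following minimal sketch counts only real dnsmasq processes, assuming the standard ps -ef layout in which the eighth column is the executable; the function name dnsmasq_count is hypothetical.

```shell
#!/bin/sh
# dnsmasq_count (hypothetical helper): read `ps -ef` output on stdin and
# print how many processes have dnsmasq as their executable.
dnsmasq_count() {
    awk '
        {
            # Column 8 is the command; drop any leading directory.
            n = split($8, path, "/")
            if (path[n] == "dnsmasq") count++
        }
        END { print count + 0 }
    '
}

# Typical use on the nova-network host:
#   ps -ef | dnsmasq_count
```

A result of 2 matches the parent and child pair described above; anything higher suggests stray processes.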

Instance launches, stuck at "Booting" or "Pending"

Sometimes, a little patience is needed before assuming the instance has not booted, because the image is copied across the network to a node that has not seen the image before. At other times though, if the instance has been stuck in booting or a similar state for longer than normal, it indicates a problem. The first place to look will be for errors in the logs. A quick way of doing this is from the controller server and by issuing the following command:

sudo nova-manage logs error

A common error that is usually present relates to AMQP being unreachable. Such entries can be ignored unless they are still appearing, that is, unless their timestamps are recent.

This command brings back any log line with the ERROR as log level, but you will need to view the logs in more detail to get a clearer picture.

A key log file, when troubleshooting instances that are not booting properly, is /var/log/nova/nova-compute.log. Look here for entries matching the time you launched the instance and the instance ID.

Check /var/log/nova/nova-network.log for any reason why instances aren't being assigned IP addresses. It could be issues around DHCP preventing address allocation.
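Narrowing a log down to the error lines for one instance can be sketched as below. This assumes each log line carries its level as a separate word (ERROR or CRITICAL) and mentions the instance ID somewhere on the same line; the function name errors_for and the sample ID in the usage comment are hypothetical.

```shell
#!/bin/sh
# errors_for TEXT (hypothetical helper): read a log on stdin and print
# the ERROR or CRITICAL lines that also contain TEXT, for example an
# instance ID. TEXT is matched as a fixed string, not a pattern.
errors_for() {
    grep -F -- "$1" | grep -E '(^|[[:space:]])(ERROR|CRITICAL)([[:space:]]|$)'
}

# Typical use (run as root, as the logs are owned by the nova user):
#   errors_for "<instance_id>" < /var/log/nova/nova-compute.log
```

Multi-line tracebacks that follow an error usually do not repeat the instance ID, so open the log around the reported timestamps for the full picture.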

Error codes such as 401, 403, 500

The majority of the OpenStack services are web services, meaning the responses from the services are well defined.

40X refers to a service that is up but responding to an event that is produced by some user error. For example, a 401 is an authentication failure, so check the credentials used when accessing the service.

50X errors mean that a connecting service is unavailable, or that it returned an error the calling service could not handle, causing the request to fail. Common problems here are services that have not started properly, so check for running services.
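The rule of thumb above can be written down as a small lookup. This is only an illustrative sketch of the classification described in this section; the function name explain_status and the wording of the hints are hypothetical.

```shell
#!/bin/sh
# explain_status CODE (hypothetical helper): print a first troubleshooting
# hint for an HTTP status code, following the 40X/50X split above.
explain_status() {
    case "$1" in
        401)
            echo "authentication failure: check the credentials used" ;;
        4[0-9][0-9])
            echo "client error: the service is up, check the request" ;;
        5[0-9][0-9])
            echo "server error: check that the required services are running" ;;
        *)
            echo "not an error code covered here" ;;
    esac
}

# Typical use:
#   explain_status 500
```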

If all avenues have been exhausted when troubleshooting your environment, reach out to the community, using the mailing list or IRC, where there is a raft of people willing to offer their time and assistance.

Listing all instances across all hosts

From the OpenStack controller node, you can execute the following command to get a list of the running instances in the environment:

sudo nova-manage vm list

This is useful in identifying any failed instances and the hosts on which they are running. You can then investigate further.

How it works...

Troubleshooting OpenStack Compute problems can be quite complex, but looking in the right places can help solve some of the more common problems. Unfortunately, like troubleshooting any computer system, there isn't a single command that can help identify all the problems that you may encounter, but OpenStack provides some tools to help you identify some problems. Having an understanding of managing servers and networks will help troubleshoot a distributed cloud environment such as OpenStack.

There's more than one place where you can go to identify the issues, as they can stem from the environment to the instances themselves. Methodically working your way through the problems though will help lead you to a resolution.

Troubleshooting OpenStack Storage Service

OpenStack Storage Service (Swift) is built for highly available storage, but there will be times where something will go wrong, from authentication issues to failing hardware.

How to do it...

Carry out the following steps when encountering the problems presented.

Authentication issues

Authentication issues in Swift occur when a user or a system has been configured with the wrong credentials. A Swift system that has been supported by OpenStack Authentication Service (Keystone) will require you to perform authentication steps against Keystone manually as well as view logs during the transactions. Check the Keystone logs for evidence of user authentication issues for Swift.

The user will see the following message with authentication issues:

If Swift is working correctly but Keystone isn't, skip to the Troubleshooting OpenStack Authentication recipe.

Swift can add complexity to authentication issues when ACLs have been applied to containers. For example, a user might not have been placed in an appropriate group that is allowed to perform that function on that container. To view a container's ACL, issue the following command on a client that has the Swift tool installed:

Handling drive failure

When a drive fails in an OpenStack Storage environment, you must first ensure the drive is unmounted so Swift isn't attempting to write data to it. Replace the drive and rebalance the rings.

Handling server failure and reboots

The OpenStack Storage service is very resilient. If a server is out of action for a couple of hours, Swift can happily work around this server being missing from the ring. Any longer than a couple of hours though, and the server will need removing from the ring.

How it works...

The OpenStack Storage service, Swift, is a robust object storage environment, and as such, handles a relatively large number of failures within this environment. Troubleshooting Swift involves running client tests, viewing logs, and in the event of failure, identifying what the best course of action is.

Troubleshooting OpenStack Authentication

OpenStack Authentication Service (Keystone) is a complex service, as it has to deal with underpinning the authentication and authorization for the complete cloud environment. Common problems include misconfigured endpoints, incorrect parameters being stored, and general user authentication issues, which involve resetting passwords or providing further details to the end user.

Getting ready

Administrator access is required to troubleshoot Keystone, so we first configure our environment, so that we can simply execute the relevant Keystone commands.

How to do it...

Carry out the following steps when encountering the problems presented.

Misconfigured endpoints

Keystone is the central service that directs authenticated users to the correct service, so it's vital that the users be sent to the correct location. Symptoms include HTTP 500 error messages in various logs regarding the services that are being accessed, and clients timing out trying to connect to network services that don't exist. To verify your endpoints in each region, perform the following command:

keystone endpoint-list

We can drill down into specific service types. For example, the following command shows the adminURL for the compute service type in all regions:

keystone endpoint-get --service compute --endpoint_type adminURL

An alternative to listing the endpoints in this format is to list the catalog, which outputs the details in a more human-readable way:

keystone catalog

This provides a convenient way of seeing the endpoints configured.

Authentication issues

From time to time, users will have trouble authenticating against Keystone due to forgotten or expired details or unexpected failure within the authentication system. Being able to identify such issues will allow you to restore the service or allow the user to continue using the environment.

The first place to look will be the relevant logs. This includes the /var/log/nova logs, the /var/log/glance logs (if related to images), as well as the /var/log/keystone logs.

Troubleshooting accounts might include missing accounts, so view the users on the system using the following command:

keystone user-list

After displaying the user list to ensure an account exists for the user, we can get further information on a particular user by issuing, for example, the following command, after retrieving the user ID of a particular user:

keystone user-get 68ba544e500c40668435aa6201e557e4

This will display output similar to the following screenshot:

This allows us to verify that the user has a valid account in a particular tenant.

If a user's password needs resetting, we can execute the following command after getting the user ID, to set a user's password to (for example) openstack:

If it turns out a user has been set to disabled, we can simply re-enable the account with the following command:

keystone user-update --enabled true 68ba544e500c40668435aa6201e557e4

There could be times when the account is working but problems exist on the client side. Before looking at Keystone for the issue, ensure your environment is set up correctly, in other words, set the following environment variables:
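A quick client-side check can be sketched as follows. The variable names OS_USERNAME, OS_PASSWORD, OS_TENANT_NAME, and OS_AUTH_URL are an assumption here: they are the ones conventionally read by the OpenStack command-line clients of this generation, so adjust the list if your environment differs. The function name check_os_env is hypothetical.

```shell
#!/bin/sh
# check_os_env (hypothetical helper): report which of the commonly used
# client environment variables are unset or empty. The variable list is
# an assumption; edit it to match your own client setup.
check_os_env() {
    missing=""
    for name in OS_USERNAME OS_PASSWORD OS_TENANT_NAME OS_AUTH_URL; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "unset:$missing"
        return 1
    fi
    return 0
}

# Typical use before running any keystone command:
#   check_os_env || echo "fix the environment before blaming Keystone"
```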

How it works...

User authentication issues can be client- or server-side, and when some basic troubleshooting has been performed on the client, we can use Keystone commands to find out why someone's user journey has been interrupted. With this, we are able to view and update user details, set passwords, set them into the appropriate tenants, and disable or enable them, as required.
