How to use the CloudBees Jenkins Enterprise command line tools

Log Files

This section provides guidance to help CloudBees Jenkins Enterprise administrators locate the logs of each relevant component in the architecture.

Tenant Logs

Command line access to the tenant logs is greatly simplified in the latest versions of CloudBees Jenkins Enterprise.
Be sure that your cluster is up to date,
since the latest releases include additional tools
that simplify supporting the product.

Tenant Logs from the Mesos UI

Unfortunately, this approach only works for clusters not running through HTTPS. The Mesos version bundled in the product does not have this capability for HTTPS clusters. However, in a proof of concept or an initial installation without HTTPS, this method can still be useful.

The Mesos UI can be discovered by running cje run display-outputs, and the username and password by executing cat .dna/secrets at the CloudBees Jenkins Enterprise project level.

The <TASK_ID> can also be obtained from the Mesos UI.
In the image from the previous section, if the <TASK_ID> were masters_master-1.62ceb825-13f5-11e8-80c9-3a2e409cf189,
we could get the tenant logs from this specific master.

Access to the Tenant Access Logs for CloudBees Jenkins Enterprise Version Older than 1.10

Diagnostic Sequence

The following table shows the most important steps that happen when a Managed Master is provisioned, specifying whether the step is usually a source of issues, the component where the event occurs, the location, and the log paths.

The image below represents the diagnostic troubleshooting sequence from a graphical point of view.
The components marked with a star represent the most problematic ones in the Managed Master provisioning process.

Simple Diagnosis Analysis

The points below describe the general procedure to diagnose Managed Master provisioning failures.
Solutions are included for those cases that can be handled by a CloudBees Jenkins Enterprise administrator.

It is very important to understand whether you are facing a CloudBees Jenkins Enterprise issue or a Jenkins issue.
Notice that all Managed Masters have an Advanced section in the configuration,
where you can modify the default values for the Health checks that Marathon performs
to decide whether Jenkins is in a healthy state or needs to be re-provisioned.

Grace Period (seconds). Health check failures are ignored within this number of seconds of the task being started
or until the task becomes healthy for the first time.

Interval (seconds). Number of seconds to wait between health checks.

Tip

The Grace Period value is directly influenced by the length of time it takes for Castle to provision the storage. The default settings for the various components are generous enough for typical use cases. However, in some situations, such as when the volume is extremely large (e.g., 1 TB), this value needs to be adjusted to reflect reality. The Health Check endpoint will not be accessible until the volume for CJOC/MM becomes available, so it is important to set this value correctly.

For troubleshooting purposes, it is very important to increase the Grace Period
so that, in case of a Jenkins performance issue, we can check whether the Jenkins UI is accessible, even if it is not responsive.

Use these steps to troubleshoot Managed Master provisioning issues in a simple way:

Stop the Managed Master provisioning from the Manage section in the Operations Center UI.

In the Health check section, under the Advanced configuration section in the Operations Center UI,
ensure that the Grace Period (seconds) is at least 1200 seconds
to prevent the Managed Master from being restarted every 2 minutes
when Jenkins does not respond to the Health checks from Marathon.

Start the Managed Master provisioning from the Manage section in the Operations Center UI.
At this point, if the UI is accessible, even if it is not responsive,
then it means Jenkins is suffering a performance issue.
However, if the UI is not even accessible, then it might be a CloudBees Jenkins Enterprise issue.

Check the worker where the Managed Master got provisioned and connect to it.

Cluster Logs

Chronological List of all the Deployment Tenant Logs

When an MM has been re-provisioned several times,
there are many tenant logs and it is difficult to track which is the latest one.
For this, we can use the find command below, which lists all the MM provisioned tenants by deployment order.
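A minimal sketch of such a command, assuming the default Mesos work directory under /var/lib/mesos and a master named master-1 (both assumptions; adjust to your cluster):

# List the tenant executor directories for master-1, oldest deployment first
sudo find /var/lib/mesos/slaves/*/frameworks/*/executors/masters_master-1* \
  -maxdepth 0 -printf '%T@ %p\n' | sort -n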

To solve this memory issue, increase the overall container memory in Jenkins Master Memory in MB
and/or increase the JVM Max heap ratio for Jenkins, under the configuration section at the Managed Master level.

The JVM Max heap ratio must be a decimal between 0 and 1.
Values over 0.7 are not recommended and can cause master restarts when running out of memory.

Jenkins Master Memory in MB is the amount of RAM given to the container, expressed in megabytes.
The heap given to the Master JVM is a ratio of this memory.
The minimum recommended value for non test/demo instances is 4096 MB.
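For example, with Jenkins Master Memory in MB set to 4096 and a JVM Max heap ratio of 0.5, the master JVM gets a heap of roughly 2048 MB; the remaining container memory is left for metaspace, thread stacks and other off-heap usage.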

Cluster Resources

Before generating more Managed Masters in a CloudBees Jenkins Enterprise cluster,
you should ensure that the cluster has enough CPU and Memory resources to allow new Docker containers to be created.

The memory and CPU assigned to a Managed Master are listed with the Advanced button
in the Provisioning section of Jenkins›Master Name›Operations center›Configure›Advanced
for each Managed Master.

To check the resources currently available in the cluster, use cje run list-resources. It tells you how much RAM is available on each of the master workers. If there is not at least one worker with enough RAM to accommodate the MM container's required RAM (referred to as "Jenkins Master Memory in MB" in the diagram above), then the Managed Master will not be provisioned.

In the example below, we can see that for worker-2, which is of type master,
3.4 units of CPU (calculated as 4.0 - 0.6) and 8623 MB of memory (calculated as 15023 - 6400) are still available.

Disk Space

A common issue for a Managed Master not being provisioned is that one of the infrastructure elements involved
runs out of space. There are two main elements that might run out of space:

Worker machine

Masters volume mount

On AWS, worker machines have /dev/xvda1 as a device, while master volume mounts start with /mnt/.

To check whether there is enough disk space on the worker machine and the master volume mount,
connect to the worker where the Managed Master is provisioned
and use the df command.
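For example (the mount points below are assumptions and depend on your setup):

# Check free space on the worker root device and the master volume mounts
df -h /dev/xvda1 /mnt/*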

Additionally, Operations Center has built-in Health Checks for workers, with a hard-coded 90% threshold for disk usage on the worker. Alternatively, you can use the Operations Center raw metrics, available from /metrics/currentUser/metrics.

Volume Provisioning

The health of the Castle system can be checked from the Marathon UI in the jce folder of the Castle application.
We should see a Castle application for each worker of type master.

Once we know the worker in which the Managed Master is getting provisioned,
we can access the Castle logs and check if the volume is provisioned correctly.
The Castle logs are located in the mesos logs under
/var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/jce_castle.<APP_ID>/runs/latest/stderr.
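For example, to follow them on the worker (substituting the IDs from your cluster):

sudo tail -f /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/jce_castle.<APP_ID>/runs/latest/stderr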

MM Docker Image Cannot be Downloaded

The tenant logs do not tell you what is wrong; you must look at the Mesos agent logs,
since the image is downloaded by this process, or check with the docker images command that the image is actually on disk.

One cause of failure is invalid Docker registry credentials, in which case you will need to update the Docker Registry credentials.

Performance Issues

Performance issues will cause the MM to be reprovisioned over and over again.
The difference between a performance issue and the rest of MM provisioning issues
is that the UI is reachable for a few seconds/minutes,
but, after a few Marathon attempts to re-provision the instance,
it does not respond to the Marathon Health checks.

The first step to cope with this issue is to increase the grace period,
which prevents Marathon from attempting to re-provision the instance over and over again.
To do this, go to the MM configuration and, under the Health check section,
ensure that the Grace Period is at least 1200 seconds and the interval 60 seconds.
This gives us enough time to take thread dumps.
You can increase the value further if you want to stop Marathon from reprovisioning this instance so frequently.

After increasing the Grace Period, run:

# Only available for CJE-CLI version >= 1.7.1
# For versions below 1.7.1, do this manually using jstack or the kill -3 command
$ cje run support-performance master-1 60 15
...
Downloading worker-2:/tmp/20170711092408.worker-2.master-1.performance.tgz to /Users/dvilladiego/workspaces/support/support-cluster-cje/.dna/support/performance/20170711092408.worker-2.master-1.performance.tgz
Warning: Permanently added 'ec2-54-242-218-225.compute-1.amazonaws.com,54.242.218.225' (ECDSA) to the list of known hosts.
20170711092408.worker-2.master-1.performance.tgz

Operations Center Provisioning

Operations Center provisioning follows almost the same process as Managed Master provisioning, from both an architectural and a troubleshooting point of view. This means that for an Operations Center provisioning issue, the procedure explained in the Managed Master provisioning section applies to this section as well.

The main difference is the way we configure Operations Center. Whereas for Managed Masters we do it through the Operations Center UI, under the Advanced button of the Managed Master item, for Operations Center we perform this action through the CloudBees Jenkins Enterprise CLI.

The CJE CLI command cje prepare cjoc-update is used to:

Configure the Application memory in MB

Add JVM options

Specify the CPU reservation/allocation

Set the Disk size in GB

Define the Custom docker image tag

$ cje prepare cjoc-update
cjoc-update is staged - review cjoc-update.config and edit as needed - then run 'cje apply' to perform the operation.

The image below represents the diagnostic troubleshooting sequence. The components marked with a star represent the most problematic ones in the provisioning process.

Simple Diagnosis Analysis

The first thing to do is to match the build which is not working with the corresponding Palace task. Search the Build Console Output for a line similar to Agent 3ef2350e is provisioned from template Operations Center Shared Templates » maven-jdk-8.

The CJE Agent Provisioning section in the bottom left on the main Managed Master dashboard shows the corresponding Palace Task.

Click the link to see the Palace tasks. Find the task which contains <MASTER_ID>.<AGENT>; it is master-1.3ef2350e in our case.

The Error output section provides hints about the failure.

Tip

If you prefer to gather this information from the command line, find the build worker for the build agent and read the tenant logs for <MASTER_ID>.<AGENT>. The path is usually /var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S0/frameworks/400c68cb-827b-4c9d-a0f0-6bdad1e94053-0000/executors/master-1.7c7d2b1e:b5092e83-9537-243-9679-33a0513db9d4/runs/c19722fd-60c6-49c2-8f1b-9a54608d2aa5. See how to get the tenant logs for more information.

Warning

If the last line of the Error output section is I0222 06:58:55.485720 8159 fetcher.cpp:456] Fetched 'file:///root/docker.tar.gz' to '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S0/docker/links/903884d6-7fc3-45c1-b087-977ef64f6c2e/docker.tar.gz', then the issue is clearly that the image cannot be retrieved, whether from the Internet, a private Docker registry, or whichever source is used.

Advanced Diagnosis Analysis

The advanced diagnosis procedure checks each element in the diagnosis sequence to ensure that everything has worked as expected.

Confirm that the Mesos task exists for the Build Agent provisioning so we can skip most of the health checks we need to do at log level.

Most Common Issues

In this section we review the most common issues that occur during Build Agent provisioning.

Lack of Enough Memory for the Build to Work

This problem usually happens because the memory assigned to the Agent is insufficient for the build.
This might happen if you assign insufficient memory to the container or if the build requires an unexpected amount of memory.

If this happens, we should see a stacktrace like the one below in the syslogs of the corresponding build worker, where we can see the kernel killing the Docker container (oom_kill_process).

Build Agent Docker Image

If the failed build is happening with a customized build agent that has never worked before, the issue is likely to be that the image was not correctly created. As a first step, always check that new customized images work on simple FreeStyle or Pipeline jobs. To troubleshoot this problem, follow the simple diagnosis section, which should expose the problem.

These traces are commonly seen when the Palace service cannot reach the expected Mesos endpoint /metrics/snapshot. When the Mesos endpoint cannot be reached, the service will not start. This issue is commonly related to security policy changes in your load balancer. This check is performed when the Palace service is starting, so any changes to your security settings will affect Palace on its next restart.

To get additional details on the error specifics, enable the following startup parameter in the Palace service: javax.net.debug=all.

Upgrades

The following section provides some guidance to troubleshoot CloudBees Jenkins Enterprise upgrades. The most common issues are:

Reinitialization of the CloudBees Jenkins Enterprise workspace with the new version of the CLI

Execution of upgrade scripts

Terraform Refresh

Upgrade of the Marathon applications (Operations Center and commonly Castle)

Most issues happen in the last phase, when Marathon re-deploys the applications that need to be upgraded. Operations Center is almost always restarted, since the release of the CLI is in line with the Operations Center release. Other applications that may be upgraded are Palace, Castle and Elasticsearch.

If the upgrade reaches the point where there is a Marathon deployment and a Mesos task for Operations Center and this Mesos task shows that the Jenkins process is started, then this is a Jenkins issue. Otherwise, this is a CloudBees Jenkins Enterprise issue.

Note

The CloudBees Jenkins Enterprise upgrade often fails due to problems that were already present in the cluster but were either not detected or not dealt with: for example, Operations Center credentials that were not updated, a faulty ZooKeeper node, or a Castle container that is no longer running on one of the master workers. While these problems have little or no direct impact on Jenkins operations, they are a source of issues for some CLI operations such as an upgrade.

Diagnostic Sequence

The following table lists the steps of a CloudBees Jenkins Enterprise Upgrade, specifying if the step is usually a source of issues, the component where the event occurs, the location and the log paths.

Frequent? | Event | Component | Location | Logs path | Access to the logs
 | Upgrade Scripts | Worker/Controller | Bastion | .dna/logs/*-upgrade/* | If the upgrade scripts were correctly executed, the troubleshooting can start from here

There could be different reasons why Castle is not running on a specific worker (lack of resources, disk space, etc.). Check the Castle logs on the worker to see what caused Castle to stop. The logs are located in the Mesos task logs under /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/jce_castle.<APP_ID>/runs/latest/stderr:

This issue often happens when upgrading for the first time. The cluster initializes with a default admin user whose credentials are stored in the CloudBees Jenkins Enterprise workspace ($CJE_PROJECT_HOME/.dna/secrets).

After setting a Security Realm in Operations Center, you must update the local configuration with the new credentials. If the Operations Center credentials stored in the CloudBees Jenkins Enterprise workspace are wrong, the upgrade fails because the CLI does not have administrative access to the CJOC.

The solution is to update the Operations Center credentials in the file $CJE_PROJECT_HOME/.dna/secrets with the credentials of a Jenkins administrator. The API Token can be used as the password.

Cluster Resources

A lack of resources prevents Marathon from deploying applications, commonly Operations Center and Castle. If a Marathon task is never deployed, this could be a resource problem. In such a scenario, Marathon tries to deploy an application over and over, but the task is never accepted or processed by Mesos.

Following is an example of upgrade logs when Operations Center cannot be deployed:

Check that there is a Marathon deployment for Operations Center. If there is no Mesos task associated (you see "No Tasks Running" even after a while), you need to check the available cluster resources.

Check the resources with cje run list-resources. We can see in the following output that the master workers are both at full capacity, leaving no space for Operations Center:

Free resources by decreasing the memory allocated to some components (commonly Managed Masters)

Add resources with the command cje prepare worker-add

Zookeeper Failures

If a Zookeeper node is faulty, Marathon and Mesos may not operate correctly or even be unresponsive. This leads to upgrade failures.

For example, the upgrade logs might show that Marathon is unreachable:

Waiting for Marathon at http://marathon.cje-aburdajewicz-01-controlle-327711499.ap-southeast-2.elb.amazonaws.com.elb.cloudbees.net
Timeout waiting for http://marathon.cje-aburdajewicz-01-controlle-327711499.ap-southeast-2.elb.amazonaws.com.elb.cloudbees.net
There were one or more errors

Check the health check alerts in Operations Center. If ZooKeeper is down on one controller, you should see an alert:

From one worker, check that ZooKeeper (port 2181) is reachable for each controller:
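For example (a hedged sketch; substitute each controller's address):

# Test ZooKeeper reachability on a controller from a worker
nc -zv <controller-ip> 2181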

EBS Volume Provisioning

In a Multi Availability Zone cluster, EBS provisioning can fail due to a race condition. EBS snapshots and volumes are not visible across all Availability Zones. When deploying Operations Center to a different availability zone, EBS snapshots must be copied before the EBS volume can be provisioned, which can take time. If the provisioning is not done in time, the deployment of Operations Center fails with a timeout.

What can be seen in such cases is a repeated restart of the Operations Center task in Mesos with a short duration.

A workaround is to increase the startup timeout (also known as grace period) for Operations Center.

Locate the following line in $CJE_PROJECT_HOME/.dna/project.config:

[cjoc]
marathon_timeout = 900

Change the marathon_timeout to a larger value, for example 3600 (1 hour). Then launch the upgrade again:

cje upgrade

Note

Since CloudBees Jenkins Enterprise 1.11.3, Marathon has full control over how to quit a Castle task. This makes EBS provisioning more robust in a Multi Availability Zone cluster.

Red Hat Enterprise Linux Subscription

Clusters that are configured with rhel images require that the worker and controller instances be registered using the Red Hat Subscription Manager. This is required to install packages with yum on a Red Hat Enterprise Linux instance.

If rhel instances are not registered, the upgrade could fail when executing upgrade scripts on workers and controllers.

Following is an example of an issue when upgrading from 1.6.2 to a more recent version:

Note: The CloudBees Jenkins Enterprise recover in repair mode is an attempt to recover a cluster when it is in a bad state. It is recommended to first destroy the cluster and then recover it as a destroyed cluster, as explained in Restore a CloudBees Jenkins Enterprise Cluster.

Cluster Recovery Version Support

If a cluster has been created with a version of CloudBees Jenkins Enterprise lower than 1.6.3, the cluster-recover operation fails with the following:

Remnants of Cluster Delete

If a cluster is not fully destroyed, because resources created by the cluster still exist or data (such as S3) has been preserved, then a full recovery fails with a Terraform message explaining that the resource already exists. Here is an example with an existing S3 bucket:

Test the Connection Between Elasticsearch and Operations Center

Notice that, if you add the Elasticsearch hostname to No Proxy Host in Operations Center, a restart is needed to apply the change.

Elasticsearch Compatibility

Review the Elasticsearch logs for errors.
If the logs contain parse errors, the Elasticsearch cluster could be broken or you could be using an incorrect version of Elasticsearch.
Refer to the Analytics documentation for Elasticsearch version compatibility.
Check the Elasticsearch health report to confirm Elasticsearch is functioning as expected.

This exception is shown in Elasticsearch when you try to use an Elasticsearch version later than 1.7.X:

java.lang.IllegalArgumentException: Limit of total fields [1000] in index [metrics-20170419] has been exceeded
at org.elasticsearch.index.mapper.MapperService.checkTotalFieldsLimit(MapperService.java:593) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:418) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:334) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:266) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:311) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:679) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:658) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:617) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.3.0.jar:5.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]

In the Operations Center embedded Kibana, you could see this exception if you try to connect to an Elasticsearch version later than 1.7.X:

Error: Unknown error while connecting to Elasticsearch
Error: Authorization Exception
at respond (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:85289:15)
at checkRespForFailure (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:85257:7)
at http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:83895:7
at wrappedErrback (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
at wrappedErrback (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
at wrappedErrback (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
at http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:21035:76
at Scope.$eval (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:22022:28)
at Scope.$digest (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:21834:31)
at Scope.$apply (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:22126:24)

Operations Center Accessing the Internet Through a Proxy

This is by far the most common issue when you have a proxy set up in Operations Center under Manage Jenkins›Manage Plugins›Advanced. In such cases, you must add the Elasticsearch hostname to the No Proxy Host section, e.g., domain.example.com/elasticsearch. Notice that a restart is needed each time you modify the No Proxy Host section for Analytics to pick up the changes.

Instead of using No Proxy Host, you could use the -Dhttp.nonProxyHosts Java argument, e.g. -Dhttp.nonProxyHosts=domain.example.com/elasticsearch. Just as with No Proxy Host, a restart is needed for the change to take effect after the Java argument is added to the Operations Center.

To test the connectivity between Elasticsearch and the Operations Center you can use:
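For example, a hedged sketch reusing the ES_USR, ES_PASSWD and DOMAIN variables defined in the snapshot section below:

# Query the Elasticsearch cluster health endpoint to verify connectivity
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_cluster/health?pretty"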

Restart Operations Center After the Initial Configuration

After first configuring Analytics, the Operations Center must be restarted to create the index and dashboards.

Elasticsearch Snapshots/Backups

Analytics should be configured to take snapshots/backups of your Elasticsearch data once or twice per day, and no more often, since this is a heavy-load process that may take hours to complete.
Furthermore, keep seven to fourteen snapshots/backups so that data from up to a week back can be restored.
Backup Interval in Minutes should be set to 1440 for one snapshot per day or 720 for two snapshots per day.
Number of Backup Snapshots to Keep for Elasticsearch should be set to 7 if you make one snapshot per day,
to keep a week of snapshots, or 14 if you make two snapshots per day.

Kibana Dashboards are Not Created

If you see the following error, it means that the dashboards required by Analytics have not been created.

To resolve this issue, you need to restart the Operations Center, which will recreate the default indices and dashboards.

Elasticsearch Default Index Does Not Exist

If the default index is not selected, you will see the following page.

You only have to select the time-field and click Create; then you can go to another Analytics tab to check that the data is displayed.

Elasticsearch is Not Accessible

When it is not possible to connect to the Elasticsearch service, you will see the following type of error.

To identify the source of this problem, check the following:

the connectivity between Operations Center and the Elasticsearch service,

the health of the Elasticsearch cluster, and

the Jenkins proxy settings.

Recommended Cluster Size

The Elasticsearch cluster should have at least three nodes,
which provides fault tolerance if a node crashes while the cluster keeps its quorum.
Each node should have 16-32 GB of RAM and 50-200 GB of disk space, depending on your environment size.
If you have more than 10 Masters or more than 10000 jobs, you will need a larger Elasticsearch environment to support your load.

Elasticsearch Cluster Health

Did you restart your Operations Center after first configuring Analytics?
This is necessary to create the index and dashboards.
If you did that, you can move on to checking the state of your cluster.
Assuming you retrieved the critical information as described above,
you can execute the following commands to obtain basic information
about the health of the Elasticsearch cluster:

Mismatch between Elasticsearch workers and applications

In some situations, such as a failure of an Elasticsearch worker, there can be a mismatch between the list-workers output and the list-applications output.
If the two numbers returned by the commands below do not match, an Elasticsearch worker is likely broken:
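A minimal sketch, assuming both outputs label Elasticsearch entries with the string "elasticsearch" (an assumption about the output format):

# Compare the number of Elasticsearch workers against running Elasticsearch applications
cje run list-workers | grep -ci elasticsearch
cje run list-applications | grep -ci elasticsearch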

Unassigned Shards

If you see unassigned shards in your Cluster Health information and you do not have a node that is restarting,
you must assign all shards in order to bring your cluster to status "green". If you have a node that is restarting, you should wait until that node is up and running and the pending tasks returned by the health check stabilize.

This script is designed to assign shards on an Elasticsearch cluster with 3 nodes;
you must set the environment variables ES_USR (user to access ES), ES_PASSWD (password) and DOMAIN (URL to access ES).
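A minimal sketch of such a script, assuming Elasticsearch 1.x and nodes named node-0 through node-2 (the node names are placeholders; adapt them to your cluster):

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
node=0
# Find every unassigned shard and allocate it to a node, round-robin
curl -su "$ES_USR:$ES_PASSWD" "$DOMAIN/_cat/shards" | grep UNASSIGNED |
while read index shard prirep rest; do
  curl -su "$ES_USR:$ES_PASSWD" -XPOST "$DOMAIN/_cluster/reroute" -d "{
    \"commands\": [ { \"allocate\": {
      \"index\": \"$index\", \"shard\": $shard,
      \"node\": \"node-$node\", \"allow_primary\": true } } ] }"
  node=$(( (node + 1) % 3 ))
done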

Get Pending Tasks on the Elasticsearch Cluster

Sometimes when you execute the health commands and check the pending tasks, you may see that there are too many tasks, or some index in initializing status. To obtain the details about these tasks,
use the following commands to get the pending tasks of the cluster;
you may then be able to determine the cause of the problems.
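For example, reusing the ES_USR, ES_PASSWD and DOMAIN variables from the snapshot section below:

# List the tasks the cluster has queued but not yet executed
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_cluster/pending_tasks?pretty"
# Overall cluster health, including initializing and unassigned shard counts
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_cluster/health?pretty"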

Before deleting an index, it is advisable to create a backup of its current state.

Manage Elasticsearch Snapshots

An Elasticsearch Snapshot is a backup of the current status of indices in the Elasticsearch Cluster. A snapshot is stored in a snapshot repository that should exist on disk and should be configured in Elasticsearch.

Make a Snapshot of Indices

Sometimes, before performing an operation on the cluster, you need to make a snapshot of its data. To do this, you can create a new snapshot repository in which to create the new snapshot.
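A minimal sketch, assuming a filesystem repository (the location path is a placeholder and must be allowed in the Elasticsearch configuration):

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
# Register a filesystem snapshot repository named "backup"
curl -u $ES_USR:$ES_PASSWD -XPUT "$DOMAIN/_snapshot/backup" -d '{
  "type": "fs", "settings": { "location": "/mnt/es-backup" } }'
# Create a snapshot of all indices in that repository
curl -u $ES_USR:$ES_PASSWD -XPUT "$DOMAIN/_snapshot/backup/snapshot_1"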

In some cases, this process could take more than an hour; you can check the status of your snapshot with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"
export SNAPSHOT="snapshot_1"
# Status of any snapshot currently in progress.
# When the snapshot is finished, this returns:
# {
# "snapshots": []
# }
#
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/_status?pretty"
# Status of snapshot_1; check the `state` field until it is SUCCESS or FAILED
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/$SNAPSHOT?pretty"

Errors Trying to Create a Snapshot: snapshot is already running

If you execute your snapshot command and see the following error, it is because another snapshot is still running. Either wait until that snapshot finishes, or cancel it.

Next, examine the cloudbees-analytics-snapshots.json file and check which snapshots and indices you want to restore. Once this is done, edit the following script by adding a new line restore "SNAPSHOT_NAME" "INDEX_NAME" for each index you want to restore. The following script creates a file for each snapshot with the results of the restore operation:
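A hedged sketch of such a restore helper, assuming the REPO variable from the status-check section above and that results are written to one file per snapshot:

# Restore a single index from a snapshot and record the result
restore() {
  local snapshot="$1" index="$2"
  curl -u "$ES_USR:$ES_PASSWD" -XPOST \
    "$DOMAIN/_snapshot/$REPO/$snapshot/_restore?wait_for_completion=true" \
    -d "{ \"indices\": \"$index\" }" > "restore-$snapshot.json"
}
restore "SNAPSHOT_NAME" "INDEX_NAME"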

Delete Snapshots

If you want to keep only an exact number of snapshots in a repository, you can use the following script
(assuming you have installed the JSON parser jq). It lists all snapshots, keeps only the last 30 and deletes the remainder.
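A minimal sketch of such a script, assuming GNU head, that snapshots are listed oldest first, and that the repository name is in $REPO:

KEEP=30
# List all snapshots, drop the newest $KEEP from the list, delete the rest
curl -su "$ES_USR:$ES_PASSWD" "$DOMAIN/_snapshot/$REPO/_all" |
  jq -r '.snapshots[].snapshot' | head -n -$KEEP |
while read snap; do
  curl -su "$ES_USR:$ES_PASSWD" -XDELETE "$DOMAIN/_snapshot/$REPO/$snap"
done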

Mesos and Marathon Troubleshooting

Background

When focusing on the elements that control task scheduling and provisioning in the cluster, pay special attention to the following subsystems: Mesos, Marathon and Zookeeper. These three subsystems are critical to CloudBees Jenkins Enterprise and any problem inside or affecting them can cause the cluster to enter an unhealthy and erratic state.

Being critical to the life of the cluster, these subsystems work in HA (high availability).

Architecture

In terms of CloudBees Jenkins Enterprise architecture, these subsystems "live" inside the cluster’s controllers. For testing purposes, one might think in terms of a single controller architecture, whereas for production environments, it is critical to have more than one controller. This leads to the first important concept: the number of working controllers in the cluster must be odd. Since these subsystems work in HA, the election mechanism must have an odd number of elements so that the leader (the active subsystem) can be elected.

To understand the symptoms, it is key to know that the Load Balancer sends requests to different controllers, depending on the load, using a round-robin algorithm.

Typically, these subsystems will experience problems when they somehow get out of sync. What does this mean? We say that one of these subsystems in HA is out of sync when there is no agreement on which node is the leader.

Symptoms of an Existing Problem

There are different symptoms of subsystem failures:

Intermittent behavior of the cluster.

Execution of several cje run commands returning no information or inconsistent information:

cje run list-applications

cje run list-resources

cje status

Diagnose

The following approaches can help determine if your cluster is affected by these kinds of problems:

CloudBees Jenkins Enterprise Bundle Analysis

The CloudBees Jenkins Enterprise Support Bundle includes the information for the cluster controllers. Review the contents of the folder pse/logs/controller-x/router/config.d for each controller; if the contents are not the same, one or more subsystems are out of sync.

Manual Controller Data Review

Access the controller terminal. From your Bastion Host, run the following command: dna connect controller-x.

Once there, list the Docker processes with sudo docker ps and look for the cloudbees/pse-router image.

Once you locate the image, access its container with sudo docker exec -it "container_id" /bin/sh

Then finally review the contents of the folder /etc/nginx/conf.d.
These steps will help you determine whether or not Mesos and Marathon are out of sync. Also, marathon.conf and mesos.conf contain information about which controller node is the leader for each of these services.

As an alternative method, you can get the same information by using curl to determine the leader for Marathon and Mesos. Once logged in to each controller, run curl localhost:5050/state | jq . | grep leader to get the Mesos leader information, and curl -u [marathon_username]:[marathon_pwd] localhost:8080/v2/leader for Marathon. All the controllers should agree on the elected leader.

Additionally, for ZooKeeper, you can also perform the following operation to check whether the number of leaders is correct or not:

Connect to every controller using dna connect controller-x

Get the information from the ZooKeeper endpoint with echo "srvr" | nc localhost 2181 (if nc is not already installed on your controller, you might need to install it). Alternatively, you can get the leader information by running cje run support-mesos ms-state | jq . | grep leader

This information should show 1 leader and C-1 followers, where C is the number of Controllers in the cluster.
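Each node reports its role in the srvr output's Mode line, for example:

echo "srvr" | nc localhost 2181 | grep Mode
# Expect "Mode: leader" on exactly one controller and "Mode: follower" on the rest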

Mesos UI Review

You can get the information provided by Mesos in terms of running tasks. The presence of any duplicate task in this JSON implies that more than one Mesos Master service considers itself the leader. To do that, hit the Mesos URL shown when you run the cje run display-outputs command. You can get the data by invoking different API endpoints, in this case http://controller-x-ip:5050/tasks.

In addition, if the number of active frameworks in $Mesos_URL/frameworks differs from 2 (more or fewer), then something is wrong with one of these subsystems and requires action on our side. You can also run cje run support-mesos ms-state | jq '[.frameworks]|.[][]|[.name, .id]' to get the running frameworks.
Example: in the case of a Marathon leader-election problem, there may be 2 or more Marathon leaders, all of which register themselves with Mesos as frameworks; in that case, the Mesos UI shows more than one framework named "marathon".

Another possible problem that you can detect by inspecting the Mesos UI is the existence of "orphan tasks", that is, tasks that only exist in the Mesos UI. You can verify that you have found an orphan task by running the cje run list-applications command in the Bastion Host console. This command’s output will show an application running on a specific worker, but when you connect to the worker (dna connect worker-x) and list the containers running on it (sudo docker ps), you will not find any container corresponding to that application.

Support Commands

CJE includes several diagnostic commands that can help us locate inconsistencies in the cluster:

Marathon Support Commands

cje run support-marathon mt-running-tasks : List all running tasks.

cje run support-marathon mt-search-duplicate-taks <tasks_json_file> : Check for duplicate tasks in the output of mt-running-tasks. This will help you determine whether or not there is a conflict in your Marathon subsystem.

cje run support-marathon mt-ping : Ping the Marathon service.

cje run support-marathon mt-info : Get info about the Marathon instance.

cje run support-marathon mt-metrics : Get metrics data from this Marathon instance.

cje run support-marathon mt-get-jce-info : Get the application with id jce. The response includes some status information besides the current configuration of the app. You can specify optional embed arguments to get more embedded information.

cje run support-marathon mt-get-masters-info : Get the application with id masters. The response includes some status information besides the current configuration of the app. You can specify optional embed arguments to get more embedded information.

cje run support-marathon mt-running-apps : Get the list of running applications. Several filters can be applied via query parameters.

cje run support-marathon mt-get-app-info <palace|castle|cjoc|elasticsearch> : Get the application with id jce/<palace|castle|cjoc|elasticsearch>. The response includes some status information besides the current configuration of the app. You can specify optional embed arguments to get more embedded information.

If you determine that the ZooKeeper subsystem is misbehaving, remove or rename the database /var/lib/zookeeper/version-2 (e.g. mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/old) before restarting the service. By cleaning the database this way, the system might relaunch tasks that are actually already running, so run cje run list-applications to verify that no duplicate tasks are running.

Mesos Orphan Tasks

In most cases, this problem is due to the mesos-slave service not working properly on the corresponding worker. The recommended way to restore the service is to restart the worker itself (dna stop worker-x and then dna start worker-x). This way, all the services running on that worker are restarted and the worker is brought back to a healthy status.

Important

Before stopping and restarting any worker, be sure that your cluster has enough resources to provision the applications running on that worker when it restarts.

Consult the Knowledge Base

The Knowledge Base can be very helpful in troubleshooting problems with CloudBees Jenkins Enterprise and can be accessed on the CloudBees Support
site.

Be aware that the name for CloudBees Jenkins Enterprise changed with version 1.6.0;
it was previously known as "Private SaaS Edition" (PSE).
When searching, search for both "CJE" and "PSE".

Examine the Logs

If a cje command installation fails or you encounter other problems, the first place
to look is the logs. Start with the cje log, which is the output produced on your
computer when you run cje.

The logs for each cje operation are stored in the operations subdirectory of your cje-project directory:

Expected failures

Not all errors or failures you see in the logs are significant. Installing and starting
services takes time and the cje command spends some of its time waiting and retrying
services to see if they have started. For example, these types of message are typical and
not a sign of a problem:

Unexpected Failures

The cje tool will only wait and retry an operation a limited number of times.
When the retry limit is reached, cje emits an error message and exits.
You will see something like this at the end of the log:

11:23:46 [cjoc] 11:23:46 Failed in the last attempt (curl -k -fsSL http://cjoc..../health/check)
11:23:46 An error occurred during cjoc initialization (22) - see .../.dna/logs/20160229T104328Z-cluster-init/cjoc
[cjoc] Failed in the last attempt

By examining the log files you might be able to determine which part of CJE
is failing. In the above example you can see that the problem is "cjoc" which
is the Operations Center. This typically means that there is some misconfiguration in the CloudBees Jenkins Enterprise
config file.

Apache Mesos Logs

Another place to look for troubleshooting information is the Apache Mesos console.
By looking at the Mesos and Marathon consoles, you can see which processes are
running, which are not and you can view logs for each process. That might give
you a clue about what caused some processes to fail to come up.

You can get the URLs of the Mesos console via the cje run display-outputs
command. If you are running on AWS and your cluster name is "cluster1", you will see
something like this:

To access Mesos, open your browser and go to the Mesos console at the URL above.
Log in with the Mesos credentials. You can get these credentials via the echo-secrets
command. The two commands below will give you the username and password:
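If you cannot recall the exact echo-secrets arguments, a hedged alternative, consistent with the tenant logs section above, is to read the secrets file at the project level (the grep pattern is an assumption about how the entries are named):

grep -i mesos $CJE_PROJECT_HOME/.dna/secrets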

Once you login to Mesos, you will see what processes are running and which ones
have failed. In the console, look for the completed tasks section.

If Operations Center failed to start you should see something like this:

If you click on the sandbox link, you will be able to view the Operations Center stdout and stderr
logs which should provide some insight into why Operations Center failed to start.

Operations Center depends on the CloudBees Jenkins Enterprise Castle and Palace components, so you should also
examine the logs for those Mesos tasks.

If you see that Operations Center failed on a specific host in the previous screenshot,
then look at the logs for castle.cje running on the same host. As you did
with Operations Center, click on the Sandbox link and examine the stdout and stderr logs.

Next Steps

Installation problems are typically resolved by changing property settings in your
cluster-init.config and cluster-init.secrets files, and then rebuilding your cluster.
Before you can do that, you need to destroy your failed cluster. See
destroying the CloudBees Jenkins Enterprise cluster
for information on how to destroy and then start the cluster again from scratch.

Shell Access to Servers

In some cases, it is necessary to access servers and their filesystems on the
running Operations Center or Master servers.

First, run cje run list-applications to find out in which worker host the container
is running. In the following example it would be worker-2 for Operations Center:

Running your own scripts on workers

If you have a shell script you would like to run on various workers, you can create a new script under .dna/scripts; it is critical that you use a unique script name that is not already there. You can then run that script on any worker, as sketched below.
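A hedged sketch of the invocation (passing the script name to cje run is an assumption based on the other cje run commands in this guide; my-script and worker-2 are placeholders):

cje run my-script worker-2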

Reviewing workers

Removing multiple workers

If you want to remove multiple workers, or want to script the removal, you can pass the worker to be removed as an argument to the operation, instead of having to add it to the staged worker-remove.config, as sketched below:
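A hedged sketch (the positional-argument syntax is an assumption; review the staged configuration before running cje apply):

cje prepare worker-remove worker-3
cje apply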

Accessing application data from each worker

If you want to inspect the data from an application, for example the JENKINS_HOME data for the CJOC, connect to the worker with dna connect worker-N; the data is mounted at /mnt/$application/$containerID. For example, to access the CJOC data:
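A minimal sketch, assuming CJOC runs on worker-2 and that its application directory under /mnt is named cjoc (both assumptions):

dna connect worker-2
ls /mnt/cjoc/*/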

Create a Support Request

You can call on CloudBees to help resolve your problems by submitting a support request
at the CloudBees Zendesk site.
In your request, state the problem, include any steps to reproduce, and attach a Support Bundle.
A Support Bundle is an archive of logs for all CloudBees Jenkins Enterprise applications and access logs for
your CloudBees Jenkins Enterprise cluster. You create a Support Bundle and send it to CloudBees support, who can
use the information within to help you diagnose the problem you are encountering.

Also, be aware that the command cje was known as bees-pse before CloudBees Jenkins Enterprise version 1.6.0
was released. You can still use the bees-pse command, but it is really just a link
to cje. In future releases bees-pse may be removed.

Accessing a support bundle when a Managed Master’s web interface is not working

If a managed master is running but the web interface is inaccessible, you can get the support bundle directly from the worker that is running that master, with the commands sketched below:
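A hedged sketch, assuming the /mnt/$application/$containerID mount layout described earlier and that the Support Core plugin writes bundles under the master's JENKINS_HOME (worker and application names are placeholders):

dna connect worker-2
ls /mnt/master-1/*/support/   # bundles written by the Support Core plugin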

Support Bundle Anonymization

The Support Core Plugin collects diagnostic information about a running Jenkins instance.
These data can contain sensitive information, but this can be automatically filtered by enabling support bundle anonymization.
Anonymization is applied to agent names, agent computer names, agent labels, view names, job names, usernames, and IP addresses.
These strings are mapped to randomly generated anonymous counterparts which mask their real values.
If you need to determine the real value for an anonymized one, you can look that up in the support bundle anonymization web page.

Configuration

When anonymization is disabled, a warning message is shown on the Support web page.

Click the link to Manage Jenkins›Configure System and enable support bundle anonymization.

Viewing Anonymized Mappings

When submitting an anonymized support bundle to your support organization, they may need to ask for further details about items with anonymized names.
To translate those names, navigate to Manage Jenkins›Support Bundle Anonymization.

This page contains a table of mappings between original names and their corresponding anonymized versions.
This also contains a list of stop words that are ignored when anonymization generates anonymized counterparts.
These are common terms in Jenkins that by themselves convey no personal meaning.
For example, an agent named "Jenkins" will not be anonymized because "jenkins" is a stop word.

Limitations

Anonymization filters apply only to text files.
They cannot handle non-Jenkins URLs, custom proprietary Jenkins plugin names, or exceptions quoting invalid Groovy code in a Jenkins pipeline.
The active plugins, disabled plugins, failed plugins, and Dockerfile reports are not anonymized due to several Jenkins plugins and other Java libraries using version numbers that are indistinguishable from IP addresses.
These reports are in the files plugins/active.txt, plugins/disabled.txt, plugins/failed.txt, and docker/Dockerfile.
These files should all be manually reviewed if you do not wish to disclose the names of custom proprietary plugins.
