Welcome to Splunk Answers, a Q&A forum for users to find answers to questions about deploying, managing, and using Splunk products. Contributors of all backgrounds and levels of expertise come here to find solutions to their issues, and to help other users in the Splunk community with their own questions.

If I use the wrong password when adding a search head to a cluster, it doesn't let me add it in the first place, full stop. I also mentioned it works fine for a while after I restart Splunk. That wouldn't happen if the secret key were different, would it?

In which files, and in which stanzas, am I supposed to check whether the secret keys are the same?
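In a stock setup the shared secrets live in server.conf on each instance. A sketch of the stanzas to compare (values here are placeholders, not real secrets):

```ini
# $SPLUNK_HOME/etc/system/local/server.conf

[general]
# used for general inter-Splunk authentication, e.g. distributed search
pass4SymmKey = <shared secret>

[clustering]
# must match the [clustering] pass4SymmKey on the cluster manager
pass4SymmKey = <cluster secret>
```

Note that Splunk encrypts these values on restart (they start with $7$ or $1$), so to compare them across hosts it's easiest to re-enter the secret in plain text on both sides and let Splunk re-encrypt it.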

As you can see, BOTH search heads' hostname is splsearch01, so these problems arose because they kept overwriting each other's pem files, meaning only one search head could be authenticated at any given time. None of this was logged as an error or warning by Splunk, so it was effectively invisible.

In other words, this was all caused by the COINCIDENCE that both search heads we had just installed shared the same name, because Splunk didn't go and get an FQDN (fully qualified domain name) by default when it was installed. It just used whatever the output of the command "hostname" was and left it at that. That's right: it didn't use a unique identifier. It assumed the output of "hostname" would be unique across all Splunk instances... why, Splunk, why?

I went and manually added the FQDN to server.conf on both search heads so Splunk now identifies them by their UNIQUE FQDN, and I've verified that the pem files are now stored in DIFFERENT directories, so no pem files will be overwritten.
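For anyone hitting the same thing, the change is a one-liner per search head (the FQDN below is an example). Once serverName is unique, each search head's key lands in its own distServerKeys/<serverName>/ directory on the peers:

```ini
# server.conf on each search head -- set serverName to the FQDN
[general]
serverName = splsearch01.aqx.example.com
```

Restart Splunk on the search head after the change, then re-add it to the peers so the key is stored under the new name.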

Going forward, we'll have to take this into account and rename each host with the FQDN (hostname bla.xxx.cequintecid.com) before installing Splunk, because Splunk won't do this for us... it could easily have done so using facter. This should prevent similar situations from ever arising again.

Wow... your answer made me realize something. We added 2 search heads almost simultaneously, named sh1.aqx.com and sh1.oly.com. Splunk in its wisdom only used the first part to identify them in server.conf; both were named sh1.

So my guess is that each time I added the cluster to one of them, its pem overwrote the existing pem under /distServerKeys/sh1/, which obviously caused re-authentication to fail for the one whose pem had been overwritten.

This could all have been avoided if Splunk used a unique identifier (such as an IP or an FQDN) when deciding what to call itself, instead of just assuming the output of the command "hostname" would be unique across the entire environment.
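The failure mode is easy to reproduce outside Splunk. A toy Python sketch of it (the distServerKeys/<serverName>/trusted.pem layout is the only Splunk-specific assumption here): when two search heads both identify as "sh1", the peer stores their keys in the same directory and the second registration silently clobbers the first.

```python
import os
import tempfile

def register_search_head(auth_root, server_name, pem_bytes):
    """Mimics how a search peer stores a search head's public key:
    the file lives in a directory named after serverName ONLY, so
    two hosts with the same serverName share one slot."""
    key_dir = os.path.join(auth_root, "distServerKeys", server_name)
    os.makedirs(key_dir, exist_ok=True)
    with open(os.path.join(key_dir, "trusted.pem"), "wb") as f:
        f.write(pem_bytes)

root = tempfile.mkdtemp()
# Both hosts default their serverName to the short hostname "sh1":
register_search_head(root, "sh1", b"KEY-FROM-sh1.aqx.com")
register_search_head(root, "sh1", b"KEY-FROM-sh1.oly.com")  # same dir!

with open(os.path.join(root, "distServerKeys", "sh1", "trusted.pem"), "rb") as f:
    print(f.read())  # only the second key survives; sh1.aqx.com now fails auth
```

Running it prints `b'KEY-FROM-sh1.oly.com'`: whichever search head registered last "wins", and the other one fails authentication until it re-registers, which is exactly the intermittent breakage described above.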

It looks like the peers you are talking about are indexers that are part of a cluster, and I understand you are modifying the peer list on the SH. You should never add or delete them on the search head: the CM gives the list to the search head in a way that ensures replicated buckets are only searched once. To fix it, I would:

- make a backup
- remove the SH from the CM (from the SH, by commenting out the reference to the CM in server.conf, then restarting Splunk on the SH)
- remove all search peers (indexers) from the SH (via the GUI)
- uncomment the CM line in server.conf and restart
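The "comment out the reference to the CM" step refers to the search head's own server.conf. A sketch (stanza and setting names are from the server.conf spec; the URI is a placeholder):

```ini
# server.conf on the search head
[clustering]
mode = searchhead
# Comment out the manager URI, restart, remove the static peers in the
# GUI, then uncomment it and restart again:
# master_uri = https://cm.example.com:8089
```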

Then test that you can:

- see the SH in the CM's list
- see peers under the SH (but don't modify them)
- run a search and confirm you don't see data twice (duplicates would mean peers are still defined statically)

If you are still losing connectivity after a while:

- check that everything is NTP-synchronized
- check the load on the indexers; if you are in a virtualized test environment where something else is loading the hosts, increase the HTTP timeout on the SH and the indexers (if an indexer is too slow, the SH will think the auth failed)
- if you are sending more simultaneous searches to the indexers than their capacity allows, reduce the number of parallel searches on the SH
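A sketch of where those knobs live (setting names are from the Splunk .conf specs; the values are illustrative examples, not recommendations):

```ini
# distsearch.conf on the search head -- raise timeouts for slow peers
[distributedSearch]
connectionTimeout = 30
sendTimeout = 60
receiveTimeout = 60

# limits.conf on the search head -- cap concurrent searches
[search]
base_max_searches = 6
max_searches_per_cpu = 1
```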

I've already performed the steps you describe to clean the config and re-add the cluster masters. When I say I re-added the peers, I mean I did it by adding the CM, not manually. That didn't help. NTP is enabled, and this is not a load issue; the indexers have light workloads. Any other ideas?

Ugh! I figured out how I broke this. I have a large Linux host, and I stacked 3 search heads on it in different directories with unique ports: one for ITSI, one for plain vanilla Splunk, and one as a prod copy/staging. Basically 3 test environments. The trouble was that I set up the second two quickly to help a colleague troubleshoot another issue, and I didn't change the Splunk server name, which defaults to the short host name of the server. Long story short, all 3 instances were identifying themselves to the same search peers as the same Splunk server, so authentication worked for a little while and then broke randomly. I imagine this could also happen to someone who clones a search head to another host without making the Splunk server name unique and then connects it up to the same search peers. I hope this saves someone else some frustration. In my case, there were no search head clusters or index replication clusters. Perhaps a future update to Splunk could warn that the name is already in use by another search head?

So it really was the exact same problem the OP had. When you run multiple instances on a single server, it is IMPERATIVE that you set unique serverName values in server.conf for each instance to avoid this (and other) problems. If you had mentioned a multiple-instance server, I would have immediately known that this was the problem.
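For the stacked-instances case, that means something like the following (the install paths and names are examples; pick whatever naming scheme distinguishes the instances):

```ini
# /opt/splunk-itsi/etc/system/local/server.conf
[general]
serverName = bighost-itsi

# /opt/splunk-core/etc/system/local/server.conf
[general]
serverName = bighost-core

# /opt/splunk-staging/etc/system/local/server.conf
[general]
serverName = bighost-staging
```

Each instance still also needs its own management and web ports, as described above; the unique serverName is the extra piece that keeps the peers' distServerKeys directories from colliding.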