Remote (in band) configuration tips

Global Networks

The great thing about working on a global network is that it sounds super when people ask what you have been doing this week and you say, “I brought the Australia office live on Tuesday night, did some upgrades in Hong Kong on Wednesday morning, then did some design work for London the rest of the week.” Usually there is a look of astonishment on people’s faces as they wonder how you can get about so much. However, when you work on a global network, and especially when you make changes in-band without remote console access or remote power control (if you do have such infrastructure, I am insanely jealous), you need to have a few outs to keep you out of trouble. I thought I would share some useful tips that help minimise risk when doing remote changes.

99% of the equipment I work on is Cisco, so all the tips are Cisco-centric.

“Reload In”

This is a belts and braces command, but rest assured: if you make a configuration change which kills your connection, locks you out because of an authentication mistake, or blocks your access through an access list change, then this will reload the system and put it back to the saved startup configuration, i.e. the configuration prior to your changes, provided you have not saved them.

Now, for anyone jumping from lab environments to real-world environments: this is not an end-device-friendly command, particularly with switches in mind. If I am making a change in a remote location where I feel it is necessary to use this as a precautionary measure, I make sure the Change Control Process documentation highlights the risk that the entire switch may need to be rebooted, effectively powering down phones, killing server connections and dropping everything else connected to it.

Remember to “Reload Cancel” to stop the scheduled reload after completing your work!
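
A minimal sketch of the sequence, assuming a 15-minute window (the value is only an example) and that nothing is saved until the change has been verified:

```
! Schedule a reload 15 minutes from now, before touching anything risky
reload in 15
! Confirm that the reload is scheduled and when it will fire
show reload
!
! ... make and test the change here; do NOT save the config yet ...
!
! Change verified and access still works: cancel the reload, then save
reload cancel
copy running-config startup-config
```

If the device asks whether to save a modified configuration when you schedule the reload, think before answering yes: the safety net only works while the startup configuration is still the known-good one.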

AAA new-model and TACACS

You don’t need to paste the whole AAA command set in to verify that your device has synced up OK with the TACACS server, which is where you could easily lock yourself out. Here are some basic steps to prove that device-to-TACACS communication is set up OK:

Define a local username and password

Setup VTY login access (username and password)

Connect to the device over SSH or Telnet and test the username and password

Enable AAA New model

Setup the TACACS server

Switch on Debugging if required

Switch on Terminal Monitoring

Enable TACACS authentication for login, and default back to local

Start a separate session and test

Once login has been confirmed via TACACS you can then proceed with the other TACACS AAA commands

Now start a new session and verify that you can log on via TACACS. If the TACACS server cannot be reached, the login falls back to the local username, and either way you still have EXEC access in your current session.

If successful, you can now enter your other AAA commands with confidence.
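
The steps above can be sketched roughly as follows. This is an illustration, not a full AAA template: the username netadmin, the server address 192.0.2.10 and the placeholders in angle brackets are assumptions, and the tacacs-server host form is the older IOS syntax (newer releases use a named tacacs server block instead).

```
! Local account and VTY login; test this over SSH or Telnet before going further
configure terminal
username netadmin secret <local-password>
line vty 0 4
 login local
end
!
! Enable AAA and define the TACACS+ server
configure terminal
aaa new-model
tacacs-server host 192.0.2.10 key <shared-key>
end
!
! Optional: watch the exchange from your current session
debug tacacs
debug aaa authentication
terminal monitor
!
! TACACS+ first, local account if the server does not respond
configure terminal
aaa authentication login default group tacacs+ local
end
!
! Keep this session open, test the login from a second session, then tidy up
undebug all
```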

Transport output and vty access-list

Sometimes you can kill access through the primary path but still have access via another device on the site. However, you may still not be able to Telnet or SSH from that device. You try to Telnet and get “% telnet connections not permitted from this terminal”. To allow Telnet or SSH from the device, you need to have transport output configured. The device you are accessing may also have security preventing access through a different interface, so remember to check whether an access-class is specified against the VTY or an access-group against the interface. Before starting the change, make sure your alternative path is open.

As always, this should be documented as part of any change process, and security should be reapplied after the change.

line vty 0 4
transport output all
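
A rough sketch of checking and opening the alternative path before the change. The ACL name VTY-MGMT, the hop device address 192.0.2.1 and the sequence number 5 are hypothetical, and sequence-numbered entries in a standard ACL assume a reasonably recent IOS:

```
! On the device you will hop from: allow outbound sessions from its VTY lines
configure terminal
line vty 0 4
 transport output telnet ssh
end
!
! On the device being changed: see which ACL, if any, protects the VTY lines
show running-config | begin line vty
show ip access-lists VTY-MGMT
!
! Temporarily permit the hop device (remove this entry after the change)
configure terminal
ip access-list standard VTY-MGMT
 5 permit host 192.0.2.1
end
```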

Security Access-list changes

I have been burned before with ACL changes and personally always use “reload in”; having said that, I don’t think I have ever had to reload since being burned. I try not to modify an active ACL. Instead, I start with a new copy, make the changes, apply the new ACL, then delete the old ACL from the config. Alternatively, I remove the ACL from the interface while I edit it, but this is not always possible due to security policies.
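
As an illustration of the new-copy approach, here is a sketch in which the ACL names EDGE-IN and EDGE-IN-V2, the interface and all addresses are made up for the example:

```
! The existing ACL EDGE-IN stays applied while the new copy is built
configure terminal
ip access-list extended EDGE-IN-V2
 permit tcp 192.0.2.0 0.0.0.255 any eq 22
 permit tcp any host 198.51.100.10 eq 443
 deny   ip any any log
!
! Swap the interface over to the new ACL in a single step
interface GigabitEthernet0/1
 ip access-group EDGE-IN-V2 in
end
!
! Once access is confirmed, remove the old ACL
configure terminal
no ip access-list extended EDGE-IN
end
```

Because the interface is never left without an ACL, this also fits sites where security policy does not allow removing the filter during the change.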

Out-Of-Hours

There are lots of hardware features that allow hot swap and fail-over of equipment. However, it wouldn’t be the first time for one of these to throw a flakie and kill the system, or for an engineer to knock out a power cable or network connection. Software changes also carry a risk: recently I have seen a SPAN port command take out a core switch “IN THE MIDDLE OF THE WORKING DAY”.

The point is that making changes during working hours increases the impact if something goes wrong. Yes, we could get into a long discussion about 24/7 operations, but you need to understand the environment you are working on and the impact to the business if things don’t go to plan.

Now, I would never have expected setting a SPAN port to reload a switch, but the impact this had in the middle of the day was huge. If it had happened in the evening for this customer, it would still have been an issue, but the impact would have been much less.

Use the working day to plan the changes, and I mean plan: each command that needs to be entered should be prepared beforehand, not just high-level stuff like “create vlan xxx on switch Y”.

Here is part of a typical change I had written for adding a new VLAN (customer-specific info removed). I also had the back-out commands documented (not included here). When implementing a change I do not want to be “thinking” about what I need to do; I want to follow a script. For more complicated changes I would look to rehearse the script in a test lab.

# Log on to 6509_1
# Create the VLAN
config t
vlan 136
 name Printers
spanning-tree vlan 136 root
interface vlan 136
 description Printers
 ip address 172.22.136.2
 standby 1 ip 172.22.136.1
 standby 1 priority 110
 standby 1 preempt

# Log on to 6509_2
# Confirm 136 is present over VTP
show vlan
config t
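
The back-out commands from the original change are not reproduced here. As a rough idea of what a back-out for the 6509_1 steps could look like, here is a sketch that simply reverses the commands shown above; it is an assumption for illustration, not the documented back-out plan:

```
# Back out on 6509_1
config t
no interface vlan 136
no spanning-tree vlan 136 root
no vlan 136
end
```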

Follow Process

If you are working on a large network then there should be some sort of change control process. Don’t be afraid to highlight the potential for things going wrong and let the business decide whether the change should go ahead: if it goes wrong you are covered, and if it goes fine you are also covered. Last year I had the fun job of upgrading Cisco 4006 switches to an SSH-enabled version of CatOS 7.6. I had highlighted that with any software upgrade there is a risk of the device not coming back online after a reboot, and that it may be prudent to have an engineer attend site. The business decided to take the risk (versus the cost of an engineer), and for 33 of 36 upgrades all was good. On 3 occasions the switches didn’t come back very well: 2 needed a power off/on, which was done with a local site contact, and one needed an engineer called out to re-apply the config, which did disrupt the next business day for that site.

This was frustrating, but the process was followed and the business had taken the risk on board, so there was no backlash.

Software upgrade

I hate performing remote software upgrades; however, the following minor tips may help a little.

Do not use TFTP to transfer the image; use a TCP-based protocol like FTP.

Use the /verify option:

copy /verify ftp: flash:

or (newer 3750)

archive download-sw /safe

Then fingers crossed on the reboot!
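
Put together, a remote upgrade might look roughly like the sketch below. The FTP server 192.0.2.20, the account ftpuser and the file name new-image.bin are hypothetical, and the exact boot system syntax varies between platforms:

```
! FTP credentials used by the copy command
configure terminal
ip ftp username ftpuser
ip ftp password <ftp-password>
end
!
! Transfer over FTP and verify the image once the copy completes
copy /verify ftp://192.0.2.20/new-image.bin flash:
!
! Optional second check: compare this hash with the one published for the image
verify /md5 flash:new-image.bin
!
! Point the device at the new image and save before reloading
configure terminal
boot system flash:new-image.bin
end
copy running-config startup-config
```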

Summary

When you are used to being within touching distance of the equipment, it can be easy not to consider what happens if you lose connectivity during a change, because you are so used to having this local access. Jumping from this local environment to a more geographically dispersed environment means you now have to consider what happens if you lose connectivity. Hopefully these simple tips will help you reduce the risk of cutting off access to the device you are configuring.

Comments

Personally, with ACLs I always use the “ip access-list extended ACL_NAME/NR” command and the sequence numbers of the ACL.
An example with ACL 101 with sequence number 510 that needs change would be like this:
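
The commenter's own example is not shown here. A minimal sketch of that kind of edit, in which the replacement entry is invented purely for illustration:

```
! Look at the current entries and their sequence numbers
show access-lists 101
!
! Replace only entry 510; the rest of the ACL stays in place and applied
configure terminal
ip access-list extended 101
 no 510
 510 permit tcp 192.0.2.0 0.0.0.255 any eq 443
end
!
! Confirm that the new entry sits where the old one was
show access-lists 101
```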

Thanks for the tip. I see more people have offered advice, so once the comments stop coming in I shall collate all the tips and add them to the blog post.
Line-numbered ACLs are the right choice of course; I sometimes forget since I have been working with older IOS versions.

After copying the image, I run verify /md5 flash:/image and compare the MD5 sum against its own embedded sum and/or against the sums on the Cisco download site. The show boot command is also helpful after setting boot system in config mode.

The standard config templates I use for my work have SNMP enabled before TACACS is applied. SNMP needs to be functional anyway, so before turning AAA on, it’s always a good habit to verify that the R/W community is working (snmpwalk). In case of lockout, it’s just a matter of an snmpset and one small file containing “no aaa new-model”.

Anyway, something I really like when working with Junipers is “commit confirmed”. Even if the change cuts off your access, link or routing, there is still a high chance everything else is working, and when the router reverts the change it’s much less intrusive than reloading the whole device. Recently I read somewhere that newer IOS XRs or SRs have a similar rollback feature.

I’d be interested in some more “horror stories” where something unrelated triggered an unexpected bug or outage, just like the SPAN command reloading a switch mentioned here. How can one defend against that? You cannot; you can only keep in mind that “there is no such thing as a non-intrusive change” and act accordingly.

The “newer IOS XRs and SRs” are in fact the normal versions since (IMHO) 12.3(7)T. IOS has the “configuration replace” feature, which is built on the capabilities of the archive. With that you can not only roll back to a known past version, but also to the last version if you don’t confirm your changes within a given timeframe.
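
A sketch of how a timed rollback could look on an IOS release that supports it. The archive path and the 10-minute timer are example values, and the availability of the revert timer option depends on the IOS version, so check your release before relying on it:

```
! One-time setup: tell the archive where to keep rollback copies
configure terminal
archive
 path flash:rollback
end
!
! Start a change that reverts automatically unless confirmed within 10 minutes
configure terminal revert timer 10
! ... enter the risky commands here ...
end
!
! Still have access? Confirm the change to stop the rollback
configure confirm
```

Unlike “reload in”, a rollback like this does not reboot the device, so anything unaffected by the change keeps running.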

Great summary of in-band management tips! But I’ll add one more thing: use TFTP to upload part of the configuration to a device. The whole script is uploaded in full and applied only after that. It can be a real life saver if you occasionally lock yourself out with some command in the middle of the script.
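
For example, assuming the prepared commands sit in a file called change.txt on a TFTP server at 192.0.2.20 (both hypothetical), the merge into the running configuration would be something like:

```
! Fetch the prepared snippet and merge it into the running configuration
copy tftp://192.0.2.20/change.txt running-config
```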

PS And yes, as has been said, Juniper’s CLI is much more feature-full for in-band management.

I had to go and look this up, I had no idea 🙂

Thanks for the post, I’d never really considered TFTP to be a problem but makes sense, I’m adding the verify option to my routine!

I recently had a SPAN experience of my own. I added SPAN/RSPAN across my entire network. Most of my switches are newer, but I’ve got a couple of 2950s, and after configuring and testing RSPAN source and destination on one, I removed the destination line to find that the switch stopped switching. Luckily that switch was down the stairs from me, so I heard people start complaining and was able to put the config back while still logged in.

I used to use the verify command religiously when upgrading software images but then I found that, at least for 2950/60s after a specific software version, it stopped working! Every verify failed. I contacted Cisco thinking that I’d done something wrong and they advised it wasn’t supported in the [newer] version I was using. Why they left it there is beyond me. I’m not sure if there is some other command that replaces it (aside from the MD5 comparison mentioned earlier) but just thought I’d mention it.

Good stuff Greg, especially the AAA testing methodology. Something I’ve found useful over time is to regularly refresh the documentation for remote sites (say quarterly), including site inspections (say annually). This is a great safety net, especially when changes are performed by several individuals and/or teams. I’m sure we can all name colleagues who are less than keen on documentation management! One or more contact persons for local hands & eyes is also handy, as is detailed layer one documentation plus annotated photographs (not always easy to arrange at secure sites but can be a life-saver in a pinch). Slightly off topic I know, but for remote sites with OOB over POTS, regular testing of the lines is a must.

I’m with Pim. I find that most people who create additional access lists haven’t yet worked out how to edit them live properly using the sequence numbers. It bugs me to log on to a router and find multiple versions of the same ACL with just one line of difference. Meanwhile, any standard is out the window because you have to increment the ACL number each time you want to make a change: ACL 104, 105, 106, 107, etc.
