On the Job with a Network Manager

Alexander Clemm presents a number of scenarios to give an impression of the types of activities that are performed by people who run networks for a living. He also provides an overview of some of the tools network managers have at their disposal to help them do their jobs.

This chapter is from the book

This chapter presents a number of scenarios to give an impression of the types of activities that are performed by people who run networks for a living. We refer to them collectively as network managers, although they perform a wide variety of functions that have more specialized job titles. In fact, strangely enough, the term network manager is rarely used for the people involved in managing networks. Instead, terms such as network operator, network administrator, network planner, craft technician, and help desk representative are much more common. Each of those terms refers to a more specialized function that is only one aspect of network management.

The chapter also provides an overview of some of the tools network managers have at their disposal to help them do their jobs. The intention is to give you a taste of the kinds of tasks and challenges that network managers face and how network management tools support their work.

Ultimately, the network management technology introduced in this book exists in an operational context. Although this idea might seem self-evident, it must be understood and emphasized, particularly for people who are not themselves users but providers of network management technology: application providers, equipment vendors, and systems integrators. Network management involves not just technology but also a human dimension: how people use management tools and management technology to achieve a given purpose, and how best to support the people who perform management functions and are ultimately responsible for keeping networks and networking services running smoothly. In addition, the organizational dimension must be considered: how tasks and workflows are organized, how the people involved in managing a network work together, and what procedures they have in place and must follow to collectively get the job done.

Reading this chapter will help you understand the following:

The types of tasks that people involved in the day-to-day operations of networks face

How network management technology supports network operators in those tasks

The different types of management tools that are available to help people running a network do their job

A Day in the Life of a Network Manager

Let us consider some typical scenarios people face as they run networks. No single scenario is representative by itself. Scenarios differ widely depending on a number of factors. One factor is the type of organization that runs the network. We refer to this organization as the network provider. The IT department of a small business, for example, runs its network quite differently than the IT department of a global enterprise or, for that matter, a global telecommunications service provider. Another factor is the particular function that the network manager plays within the organization. An administrator in an IT department, for example, has different responsibilities than a field technician or a customer-facing service representative. To cover the diversity of possible scenarios, this chapter examines the roles of several network managers.

The examples in this chapter are intended to be illustrative. Therefore, they are by no means comprehensive. The examples contain simplifications, and, in reality, the details described differ widely among network providers. Even people who have the same job description might perform their job functions in different ways. Ultimately, how they manage their networks differentiates network providers from one another; hence, the scenarios presented here should not be expected to apply universally. Finally, don't worry if you are not familiar with all the networking details that are contained in the examples; they constitute merely the backdrop against which the storylines play out.

Pat: A Network Operator for a Global Service Provider

Meet Pat. Pat works as a network operator at the Network Operations Center (NOC) of a global service provider that we shall call GSP. She and her group are responsible for monitoring both the global backbone network and the access network, which, in essence, constitutes the customer on-ramp to GSP's network. This is a big responsibility. Several terabytes of data move over GSP's backbone daily, connecting several million end customers as well as a significant percentage of global Fortune 500 companies. Even with the recent crisis in the telecommunications industry, GSP is a multibillion-dollar business whose reputation rests in no small part on its capability to provide services on a large scale and global basis with 99.999% (often referred to as "five nines") service availability. Any disruption to this service could have huge economic implications, leading to revenue losses of millions of dollars, exposing GSP to penalties and liability claims, and putting jobs in jeopardy.
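To put the "five nines" figure in perspective, the downtime that an availability target permits can be worked out directly. The following is a minimal sketch; the function name and the choice of a 365-day year are assumptions made for illustration and are not taken from GSP's actual service definitions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget implied by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

At 99.999%, the budget comes to a little over 5 minutes per year, so a single prolonged outage can exhaust it.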

Pat works directly in command central in a large room with big maps of the world on screens in front, showing the main sites of the network. Figure 2-1 depicts such a command central.

In addition to the big maps, several screens display various pieces of information. For example, they show statistics on network utilization, information about current delays and service levels experienced by the network's users, and the number of problems that have been reported in different geographic areas. This gives everybody in the room a good overall sense of what is currently going on: whether things are in crisis mode or whether everything is running smoothly.

Normally, everything on the map appears green. This means that everything is operational and that utilization on the network is such that even if an outage in part of the network were to occur, network traffic could be rerouted instantly without anyone experiencing a service outage. The network is designed to withstand outages and disruptions in any one part of the network. However, Pat still remembers the anxiety that set in on a couple of occasions when suddenly links or even entire nodes on the map turned yellow or red. Once, for example, a construction crew dug through one of the main fiber lines that connect two of GSP's main hubs. And who could forget 9/11, when suddenly millions of people wanted to call into New York at the same time, while seemingly every news organization in the world requested additional capacity for their video feeds?

On Pat's desk is an additional, smaller screen that shows a list of problems that have been reported about the network. Pat has been assigned to monitor a region of the southeastern United States for any problems and impending signs of trouble. Pat sees on her screen a list of so-called trouble tickets, which represent currently known problems in the network and are used to track their resolution.

Those trouble tickets have two sources: problems that customers have reported and problems in the network itself. Let's start with customer-reported problems.

For every call that is received from a customer about a network problem, one of the customer service representatives at the help desk in building 7 opens a trouble ticket. The rep provides what GSP refers to as "tier 1 support." Those service reps have their own procedures. The person who first answers the call records a description of the problem, according to the customer, and asks the customer a series of questions, depending on the type of problem reported. If the service rep cannot help the customer right away, the customer is transferred to someone who is more experienced in troubleshooting the problem. That person is part of the second support tier. If this more experienced rep cannot solve the problem, or if it takes him or her too long to do so, the ticket is assigned to the people in Pat's group and shows up on Pat's screen. Pat's group provides the third tier of support.

The tickets contain a description of the problem, who is affected, and contact information. At least, this is what they are supposed to contain; sometimes Pat's group gets tickets with little or no information. In those cases, someone from Pat's group must call the service rep who first entered the ticket and find out more, which is always painful for everyone involved. It can be embarrassing when, in the worst case, Pat's co-workers need to call the customer back and the customer realizes that GSP is only starting to follow up on a serious problem hours after it was reported.

The second source of tickets is the network itself. These tickets are reported by systems that monitor alarm messages sent from equipment in the network. The problem with alarm messages is that they rarely indicate the root cause of the problem; in most cases, they merely reflect a symptom that could be caused by any number of things. Pat doesn't see every single alarm in the network—that would be far too many. For this reason, the alarm monitoring system tries to precorrelate and group alarm messages that seem to point to the same underlying problem. For each unique problem that alarm messages seem to point to, the alarm monitoring system automatically opens a ticket and attaches the various alarm messages to it, along with an automated diagnosis and even a recommended repair action. Ideally, the underlying problem can be corrected and the ticket closed before customers notice service degradation and corresponding customer-reported trouble tickets are opened.
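The grouping step can be illustrated with a deliberately simple rule. The sketch below assumes that alarms from the same device arriving within a short time window point to the same underlying problem; that rule, the 300-second window, and all class and function names are assumptions for illustration. Real correlation rules, such as the ones GSP developed, encode far more expertise.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alarm:
    device: str
    timestamp: float  # seconds since some reference point
    message: str


@dataclass
class Ticket:
    device: str
    alarms: List[Alarm] = field(default_factory=list)


def correlate(alarms, window_seconds=300):
    """Group alarms into tickets: one ticket per device per burst of alarms."""
    tickets = []
    latest_ticket_by_device = {}
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        ticket = latest_ticket_by_device.get(alarm.device)
        # Attach to the device's latest ticket if the previous alarm is recent enough.
        if ticket is not None and alarm.timestamp - ticket.alarms[-1].timestamp <= window_seconds:
            ticket.alarms.append(alarm)
        else:
            ticket = Ticket(device=alarm.device, alarms=[alarm])
            tickets.append(ticket)
            latest_ticket_by_device[alarm.device] = ticket
    return tickets
```

With this rule, three alarms from one switch within two minutes produce a single ticket with three attachments rather than three separate tickets.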

Seeing messages grouped in this way is much more practical than having to deal with every single alarm individually. The sheer volume of alarms would quickly overwhelm Pat and her group. Also, tickets that are system generated are typically issued against the particular piece of equipment in the network that seems to be in distress. This makes system-generated tickets a little easier to deal with than customer-generated tickets, which often leave Pat's group feeling puzzled over where to start.

Pat remembers that tickets generated by alarm applications were problematic in the past. Often many more trouble tickets were generated than there were actual problems, so Pat sometimes saw 20 tickets that all related to the same problem. However, GSP has made significant progress in recent years—system-generated trouble tickets have become pretty accurate, with redundant tickets generated only in a small portion of cases. GSP's investment in developing better correlation rules for their systems paid off. Although Pat is an operator, not a developer, she knows that she was an important part of the development process because she provided much of the expertise that was encoded into those correlation rules. She still remembers being interviewed by a group of consultants for that purpose. During numerous sessions over the span of several months, they asked about how she determined whether problems that were reported separately were related.

Of course, despite all the progress made, many tickets still relate to the same underlying root cause. Many of those are tickets that were not automatically generated but instead were opened by customers. Perhaps a particular component in the access network through which customers were all connected to the network has failed, causing all of them to report a problem.

When clicking on a trouble ticket, Pat can see all the information associated with it. Pat must first acknowledge that she has read each ticket that comes in. If she does not acknowledge the ticket, it is automatically escalated to her supervisor. In busy times, this feels almost like a video game: Whenever a new ticket appears on the screen, she effectively "shoots it down" to stop it from flashing. Of course, acknowledging is only the first step. Next, Pat must analyze the ticket information. For the most part, her tasks are fairly routine. First she checks whether there are other tickets that might relate to the same problem. If there are, she attaches a note to the ticket that points to the other ticket(s) already being worked on. The system is intelligent enough to update the information in the other ticket to cross-reference the new one, thereby providing additional information that could prove useful in resolving it. This effectively leads to a hierarchy of tickets in which the original ticket constitutes a master ticket and the new ticket becomes a subordinate to the master. Pat then tables the resolution of the subordinate ticket until the master ticket that is already being worked on is resolved. At that point, she revisits the ticket to see whether the problem still exists or whether it can be closed also.
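The master and subordinate relationship described above can be modeled as a small data structure. This is a sketch under stated assumptions: the field names, the status values, and both helper functions are hypothetical and do not describe any particular trouble ticket product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TroubleTicket:
    ticket_id: str
    status: str = "open"  # one of "open", "tabled", "closed"
    master_id: Optional[str] = None
    subordinate_ids: List[str] = field(default_factory=list)


def attach_to_master(new_ticket, master):
    """Cross-reference both tickets and table the new one."""
    new_ticket.master_id = master.ticket_id
    new_ticket.status = "tabled"
    master.subordinate_ids.append(new_ticket.ticket_id)


def close_master(master, tickets_by_id):
    """Close the master and return the subordinates that must be revisited."""
    master.status = "closed"
    to_revisit = []
    for ticket_id in master.subordinate_ids:
        ticket = tickets_by_id[ticket_id]
        ticket.status = "open"  # reopened for a check, not closed automatically
        to_revisit.append(ticket)
    return to_revisit
```

Subordinate tickets are reopened rather than closed automatically, mirroring the step in which Pat revisits each one to see whether the problem still exists.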

If she does not identify an existing ticket that might be related, she starts diagnosing the root cause of the problem. Let us assume that, in this case, the ticket was opened by a customer. Pat brings up the service inventory system to check which pieces of equipment were specifically configured to help provide service for that customer. With this knowledge, she brings up the monitoring application for the portion of the network that is affected to see for herself what is going on. This application offers her a view with the graphical representation of the device from which she can see the current state of the device, how its parameter settings have been configured, and the current communications activity at the device. She begins troubleshooting, starting with verifying the symptoms that are reported in the network.

In some cases, Pat eventually decides that a piece of equipment needs to be replaced, such as a card in a switch. In those cases, she brings up another tool, a work order system. She creates a new work order and specifies which card needs to be replaced. She enters the identifier of the trouble ticket as related information. This automatically populates the fields in the work order that identify the network element and where it is located. Pat considers this to be a particularly nice feature. In the old days, she had to manually retype this information and also look up the precise location of the network element in the network inventory system. Now all those back-office systems are interconnected. She enters additional comments and submits the work order, and off it goes. This is all that she has to do for now.

It is not Pat's responsibility to dispatch a field technician or to check the inventory for spare parts; this is the job of her colleagues in the group that processes and follows up on equipment work orders. Actually, there are several groups, depending on where the equipment is located. Sometimes the equipment is in such a remote location that people have to physically get out there—"roll a truck," they call it. This is often the case for equipment in the access network. As mentioned earlier, the access network is the portion of the network that funnels network traffic from the customer sites to GSP's core network. In other cases—specifically, when the core network is affected—the equipment is at the NOC, in an adjacent building. Pat was once able to peek inside a room with all the equipment—many rows of rack-mounted equipment, similar to Figure 2-2.

Pat's friends tell her that the NOC equipment is more compact than it used to be, but Pat still finds it very impressive, especially the cables (cables are shown in Figure 2-3). Literally hundreds, if not thousands, of cables exist; taken together, they would surely stretch across many miles. You would never want to lose track of what each cable connects to. Although it all looks surprisingly neat, Pat can only imagine what a challenge it must be to move the NOC to a different location if that ever becomes necessary.

Pat knows that the groups that do equipment work orders operate in similar fashion to her own group. The workflows are all predefined, and their work order system takes them through the necessary steps, autoescalates things when necessary, and generally makes sure that nothing can fall through the cracks—for example, it ensures that a work order does not sit unattended for days. It's impressive how integrated some of the procedures have become. For example, Pat has heard that when the technicians exchange a part, they scan it using a bar-code scanner that automatically updates the central inventory system. The system then warns them right away if they are scanning a different component than the one they are supposed to enter with the work order. In the past, occasional mismatches occurred between the equipment that was deployed and the equipment that was supposed to be there. This could lead to all kinds of problems—for example, equipment might be preconfigured in a certain way that would then no longer work as planned, or the installed equipment had different properties than expected. Those were rare but nasty scenarios to track and resolve.

Pat notes in the trouble ticket what she did and enters the identifier of the work order and when resolution is expected. For now, she is finished.

When the work order is fulfilled, Pat will find in her in-box a notification from the work order system identifying the trouble ticket that was linked to the work order and that should now be resolved. When she receives this notification, she does a quick sanity check to see if everything is up and running, and then closes the ticket for good.

When Pat first started her job, she was sometimes tempted to close the tickets right away without doing the check. Her department kept precise statistics on the number of tickets that she processed, the number of tickets that she had outstanding or was currently working on, the average duration of resolution for a ticket, and the number of tickets that had to be escalated. Of course, Pat wanted those numbers to look good because they were an indication of her productivity. Therefore, it was seemingly rewarding to take some shortcuts. It appeared that even in the unexpected case that a problem had not been resolved, someone would simply open a new ticket and no harm would be done. However, Pat soon learned that any such procedure violation would be taken extremely seriously. She now understands that procedures are essential for GSP to control quality of the services it provides. Doing things the proper way has therefore become second nature to her.

Chris: Network Administrator for a Medium-Size Business

Meet Chris. Together with a colleague who is currently on vacation, Chris is responsible for the computer and networking infrastructure of a retail chain, RC Stores, with a headquarters and 40 branch locations. RC Stores' network (see Figure 2-4) contains close to 100 routers: typically, an access router and a wireless router in the branch locations, and additional networking infrastructure in the headquarters and at the warehouse.

The company has turned to a managed service provider (MSP) to interconnect the various locations of its network. To this end, the MSP has set up a Virtual Private Network (VPN) with tunnels between the access routers at each site that connects all the branch locations and the headquarters. This means that the entire company's network can be managed as one network. Although the MSP worries about the interconnectivity among the branch offices, Chris and his colleagues are their points of contact. Also, the contract with the MSP does not cover how the network is being used within the company. This is the responsibility of Chris and his colleagues.

Chris has a workstation at his desk that runs a management platform. This is a general-purpose management application used to monitor the network. At the core of the application is a graphical view of the network that displays the network topology. Each router is represented as an icon on the screen that is green, yellow, orange, or red, depending on its alarm state. This color coding allows Chris to see at first glance whether everything is up and running.

Even though the network is of only moderate size, displaying the entire topology at the same time would leave the screen pretty cluttered. Chris has therefore built a small topology map in which multiple routers are grouped into "clusters" that are represented by another icon. Each cluster encompasses several locations. In addition, there is a cluster each for the headquarters and the warehouse. This configuration enables Chris to display only the clusters and thereby view the whole network at once. Chris can also expand ("zoom into") individual clusters when needed to see what each consists of. As with the icons of the routers, the icons for the clusters are colored corresponding to the most severe alarm state of what is contained within. This way, Chris does not miss a router problem, even though the router might be hidden deep inside a cluster on the map. As long as the cluster is green, Chris knows that everything within it is, too. Figure 2-5 shows an example of a typical screen for such a management application.
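The way a cluster icon takes on the most severe alarm state of its contents can be sketched as a recursive roll-up. The severity ordering (green, then yellow, orange, and red) and the dictionary layout are assumptions for illustration; a real management platform would keep this state in its own topology model.

```python
# Assumed ordering from least to most severe.
SEVERITY_ORDER = ["green", "yellow", "orange", "red"]


def cluster_color(name, clusters, router_colors):
    """Return the color for a router, or the worst color found inside a cluster.

    clusters maps a cluster name to its members (routers or nested clusters);
    router_colors maps a router name to its current alarm color.
    """
    if name in router_colors:
        return router_colors[name]
    member_colors = [cluster_color(m, clusters, router_colors) for m in clusters[name]]
    # An empty cluster has nothing in alarm, so it defaults to green.
    return max(member_colors, key=SEVERITY_ORDER.index, default="green")
```

Because the roll-up always takes the maximum, a single orange router deep inside a nested cluster turns every enclosing cluster orange, which is what keeps a hidden problem visible on the top-level map.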

Mike calls from upstairs. Someone new is starting a job in finance tomorrow and will need a phone. Chris notes this in his to-do list. He will take care of this later. First, he is trying to get to the bottom of another problem.

Chris received some complaints from the folks at the Richmond branch that the performance of their network is a little sluggish. They have been experiencing this problem for a while now; they first complained about it ten days ago when access to the servers was slow. At the time, Chris wondered whether this was really a problem with the network or with the server. For an end user, there was really no way to tell the difference. Eventually, the problem went away by itself and Chris thought it might have been just a glitch. Then three days ago, the same thing happened, and it happened again this morning. This time Chris tried accessing the server himself with the Richmond people on the call but did not notice anything unusual.

Chris thinks that perhaps it really is a problem with the network. He wonders whether the MSP really gives them the network performance that they have promised. The MSP sold Chris's company a service with 2 Mbps bandwidth from the branch locations and "three nines" (99.9%) availability from 6 am until 10 pm during weekdays, 98% during off hours. The people from the MSP did not contact Chris to indicate that there was a problem on the MSP's side, but maybe they don't know—and besides, why would they worry if they didn't get caught? Chris wonders whether he should have signed up for MSP's optional service that would have allowed him to view the current service statistics, as seen from the MSP's perspective, in near-real time over the web. Although Chris doesn't think the MSP can be entirely trusted, this would have provided an interesting additional data point.

From his management platform, Chris launches the device view for the router at the edge of the affected branch by clicking the icon of the topology map. The device view pops up in a window and contains a graphical representation of the device from which the current state, traffic statistics, and configuration parameter settings can be accessed. Currently, not much traffic appears to be going across the interface. From another window, Chris "pings" the router, checking the roundtrip time of IP packets to the router. Everything looks fine.

Chris decides that this problem requires observation over a longer period of time, so he pulls up a tool that enables him to take periodic performance snapshots. He specifies that a snapshot should be taken every 5 minutes of the traffic statistics of the outgoing port. Chris also wants to periodically measure the network delay and jitter to the access router at company headquarters and to the main server. The tool logs the results into a file that he can import into a spreadsheet. Spreadsheets can be very useful because they can plot charts, which makes it easy to discover trends or aberrations in the plotted curves. (Of course, sometimes management applications support some statistical views as well, as shown in Figure 2-6.)
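A log of this kind might look like the output of the following sketch, which summarizes a batch of delay measurements per snapshot and writes rows in CSV form for import into a spreadsheet. The column names are hypothetical, and jitter is computed here as the mean absolute difference between consecutive delay samples, which is one simple definition among several; the source does not say how Chris's tool calculates it.

```python
import csv
from statistics import mean


def summarize_delays(delays_ms):
    """Return (mean delay, jitter) for a list of delay samples in milliseconds."""
    if len(delays_ms) < 2:
        return mean(delays_ms), 0.0
    # Assumed jitter definition: average change between consecutive samples.
    jitter = mean(abs(b - a) for a, b in zip(delays_ms, delays_ms[1:]))
    return mean(delays_ms), jitter


def write_snapshots(snapshots, out):
    """Write one CSV row per snapshot: (timestamp, outgoing octets, delay samples)."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "out_octets", "mean_delay_ms", "jitter_ms"])
    for timestamp, out_octets, delays_ms in snapshots:
        mean_delay, jitter = summarize_delays(delays_ms)
        writer.writerow([timestamp, out_octets, round(mean_delay, 2), round(jitter, 2)])
```

Passing an open file as `out` produces a file that a spreadsheet can chart directly, with one row for each 5-minute snapshot.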

For now, that seems all that he can do. Chris takes a look at his to-do list and decides to take care of the request for the new phone. He doesn't know whether they have spare phones, so he goes to the storage room to check. One is left, good. He will have to remember to stock up and order a few more. He then peeks at the cheat sheet that he has printed and pinned in his cubicle, which has the instructions on what to do when connecting a new user. Most phones in RC Stores' branch locations are assigned not to individual users, but to a location, such as a cashier location, so changes do not need to be made very often.

RC Stores recently replaced its old analog private branch exchange (PBX) system with a new Voice over IP (VoIP) PBX. This enables the company to make internal phone calls over its data network. It also has a gateway at headquarters that enables employees to make calls to the outside world over a classical phone network, when needed. Chris remembers that, to make phone calls, the old PBX worked just fine, but programming the phone numbers could be a pain. Phone numbers were tied to the PBX ports, so he had to remember which port of the PBX the phone outlet was connected to so he could program the right phone number. Because RC Stores had never bothered documenting the cabling plan in the building, there were sometimes unwelcome surprises. Connecting one new user wasn't that bad, but Chris would never forget when they were moving to a new building and he and his colleague spent all weekend getting the PBX network set up to ensure that everyone could keep their extensions.

Now it is simpler. Chris jots down the MAC address from a little sticker on the back of the IP phone and brings up the IP PBX device manager application. He also gets his sheet on which he notes the phone numbers that are in use. His method to assign phone numbers is nothing fancy. He has printed a table with all the available extensions. Jotted on the table in pencil is the information on whether a phone number is in use. Chris selects a number that is free, crosses it out, and notes the name of the new person who is assigned the number, along with the MAC address of the phone.

Chris then goes into the IP PBX device manager screen to add a new user. The menu walks him through what he needs to do: He enters the MAC address and the phone extension, along with the privileges for the phone. In this case, the user is allowed to place calls to the outside. Now all that remains to be done is to add voice mail for the user. He starts another program, the configuration tool for the user's voice-mail server. RC Stores decided to go with a different vendor for voice mail than for the IP PBX. Chris often moans over that decision. Although having different vendors resulted in an attractive price and a few additional features, he now has to administer two separate systems. Not only does he need to retype some of the same information that he just entered, such as username and phone number, but he also needs to worry about things such as making separate system backups. Chris leaves the capacity of the voice mail box at 20 minutes, as the application suggested for the default; it is the company's policy that everyone gets 20 minutes capacity except department heads and secretaries, who get an hour.

The phone extension is now tied to the phone itself, regardless of where on the network it is physically plugged in. Chris walks over to the Human Resources (HR) person upstairs and asks where the new employee will sit. He carries over the phone right away, plugs it into the outlet, and makes sure that it works. He must remember to send a note to HR to let them know the number he assigned so they can update the company directory. Chris has been intending for some time to write a script that provisions new phones and automatically updates the company directory at the same time. Unfortunately, he has not gotten around to it yet. Maybe tomorrow.
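The script Chris has in mind might start out along the following lines. This is only a sketch: the in-memory dictionaries stand in for his paper extension table and the company directory, and nothing here calls a real IP PBX or voice-mail API, since the source does not name the products involved. The voice-mail capacities follow the company policy described earlier.

```python
DEFAULT_VOICEMAIL_MINUTES = 20
EXTENDED_VOICEMAIL_MINUTES = 60
EXTENDED_ROLES = {"department head", "secretary"}


def provision_phone(name, mac_address, role, extensions, directory):
    """Assign the lowest free extension and record it in the directory.

    extensions maps each extension to None (free) or to its current assignment;
    directory maps an employee name to that person's phone details.
    """
    free = sorted(ext for ext, owner in extensions.items() if owner is None)
    if not free:
        raise RuntimeError("no free extension available")
    extension = free[0]
    extensions[extension] = {"name": name, "mac": mac_address}
    minutes = EXTENDED_VOICEMAIL_MINUTES if role in EXTENDED_ROLES else DEFAULT_VOICEMAIL_MINUTES
    directory[name] = {"extension": extension, "voicemail_minutes": minutes}
    return extension
```

Updating the extension table and the directory in one step removes the need to remember a separate note to HR, which is the main point of automating the task.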

Chris goes back to his desk and checks on the performance data that is still being collected. Things look okay; he will just let it run until the problem occurs again so that he has the data when it is needed. In addition, he decides that he wants to be notified right away when sluggish network performance is experienced. He goes again into his management platform and launches a function that lets him set up an alert that is sent when the measured response time between any two given points in the network exceeds a certain amount of time. He configures it to automatically check response time once per minute and to send him an alert to his pager when the response time exceeds 5 seconds. He hopes that this will give him a chance to look at things while the problem is actually occurring, not after the fact.
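The alerting rule Chris configures amounts to a periodic threshold check. The sketch below assumes that a `measure` callable returns the response time in seconds and that a `notify` callable delivers the alert; both are placeholders, because the source does not describe how the management platform measures response time or reaches the pager.

```python
import time


def check_response_time(measure, notify, threshold_seconds=5.0):
    """Take one measurement and send a notification if it exceeds the threshold."""
    elapsed = measure()
    if elapsed > threshold_seconds:
        notify(f"Response time {elapsed:.1f} s exceeded {threshold_seconds:.1f} s")
        return True
    return False


def monitor(measure, notify, checks, interval_seconds=60.0, sleep=time.sleep):
    """Run a fixed number of checks, one per interval, and count the alerts sent."""
    alerts = 0
    for _ in range(checks):
        if check_response_time(measure, notify):
            alerts += 1
        sleep(interval_seconds)
    return alerts
```

The `sleep` parameter is injectable so that the loop can be exercised without waiting a full minute between checks.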

Chris realizes that the response time is needed for two purposes—once for the statistics collection function, once for the alerting function. Currently, there is no way to tie the two functions together. Therefore, the response times will simply be measured twice. Although this is not the most efficient method, there is no reason for Chris to worry about it.

Thinking about it, Chris suspects that the problem is related to someone initiating large file transfers. Perhaps an employee is using the company's network to download movies from the Internet. If this is the case, it would be a clear violation of company policy. Not only does it represent an abuse of company resources, but, more important, it also introduces security risks. For example, someone could download a program containing a Trojan horse from the outside and then let it run on the company network. Of course, Chris has set up the infrastructure to regularly push updates of the company's security protection software to the servers, but this alone does not protect against all possible scenarios. All the efforts to secure the network against attacks from the outside do not help if someone potentially compromises network security from the inside. Chris thinks that this hypothesis makes sense. The gateway that connects the company to the Internet is located at headquarters, and from the remote branch someone would have to go first via the company's VPN to that gateway to go outside. The additional traffic on the link between the remote branch and headquarters might be enough to negatively affect other connected applications. So maybe the problem resides with RC Stores after all, not with the MSP.

In any event, Chris knows that when the symptom occurs again, he will be able to find out what is going on by using his traffic analyzer, another management tool. He will be able to pull up the traffic analyzer from his management station to check what type of data traffic is currently flowing over a particular router—the gateway to the Internet, in that case—and where it originates.
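At its simplest, finding out where traffic originates means totaling bytes per source and ranking the results. The sketch below assumes flow records given as (source, destination, byte count) tuples; that record shape and the function name are illustrative assumptions and do not reflect the data format of any particular traffic analyzer.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Return the n sources that sent the most bytes, largest first."""
    totals = Counter()
    for source, _destination, byte_count in flows:
        totals[source] += byte_count
    return totals.most_common(n)
```

A single host that dominates this ranking while the sluggishness is occurring would support the hypothesis of a large file transfer.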

Before Chris leaves in the evening, he forwards his phone extension to his mobile, in case something comes up. Also, he brings up the function in the alarm management portion of his management platform application and programs it to send him a page if an alarm of critical severity occurs, such as the failure of an access router that causes a loss in connectivity between a branch and headquarters. Chris has remote access to the VPN from home and can log into his management application remotely, if required.

Sandy: Administrator and Planner in an Internet Data Center

Meet Sandy. Sandy works in the Internet Data Center for a global Fortune 500 company, F500, Inc. The data center is at the center of the company's intranet, extranet, and Internet presence: It hosts the company's external website, which provides company and product information and connects customers to the online ordering system. More important, it is host to all the company's crucial business data: its product documents and specifications, its customer data, and its supplier data. In addition, the data center hosts the company's internal website through which most of this data can be accessed, given the proper access privileges.

F500, Inc.'s core business is not related to networking or high technology; it is a global consumer goods company. However, F500, Inc. decided that the functions provided by the Internet Data Center are so crucial to its business that it should not be outsourced. In the end, F500, Inc. differentiates itself from other companies not just through its products, but by the way the company organizes and manages its processes and value supply chains—functions for which the Internet Data Center is an essential component.

Sandy has been tasked with developing a plan for how to accommodate a new partner supplier. This will involve setting up the server and storage infrastructure for storing and sharing data that is critical for the business relationship. Also, an extranet over which the shared data can be accessed must be carved out. The extranet constitutes essentially its own Virtual Private Network that will be set up specifically for that purpose.

Sandy has a list of the databases that need to be shared; storage and network capacity must be assessed. Her plan is to set up a global directory structure for the file system in such a way that all data that pertains to the extranet is stored in a single directory subtree—perhaps a few, at most. She certainly does not want the data scattered across the board. Having it more consolidated will make many tasks easier. For example, she will need to define a strategy for automatic data backup and restoration. Of course, Sandy does not conduct backups manually; the software does that. Nevertheless, the backups need to be planned: where to back up to, when to back up, and how to redirect requests to access data to a redundant storage system while the backup is in progress.

Sandy's main concern, however, is with security. Having data conceptually reside in a common directory subtree makes it much easier to build a security cocoon around it. Security is a big consideration—after all, F500, Inc. has several partners, and none of them should see each other's data. A major part of the plan involves updating security policies—clearly defining who should be able to access what data. Those policies must be translated into configurations at several levels that involve the databases and hosts for the data, as well as the network components through which clients connect.

Several layers of security must be configured. As the first layer, Sandy needs to set up a new, separate virtual LAN (VLAN) that will be dedicated to this extranet. A VLAN shares the same networking infrastructure as the rest of the data center network but defines a set of dedicated interfaces that will be used only by the VLAN; it allows the effective separation of traffic on the extranet from other network traffic. This way, extranet traffic cannot intentionally or unintentionally spill over to portions of the data center network that it is not intended for. The servers hosting the common directory subtree with the shared data will be connected to that VLAN. Sandy checks the network topology and identifies the network equipment that will need to be configured accordingly.
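The separation that a VLAN provides can be illustrated with a deliberately simplified model in which each switch port belongs to exactly one VLAN and traffic is forwarded only between ports of the same VLAN. The port names and VLAN numbers below are hypothetical, and the model ignores trunk ports and routing between VLANs.

```python
# Hypothetical port-to-VLAN assignment: VLAN 10 is the general data center
# network, VLAN 30 is the VLAN dedicated to the extranet.
PORT_VLAN = {"Gi0/1": 10, "Gi0/2": 10, "Gi0/3": 30, "Gi0/4": 30}

def can_forward(ingress, egress, port_vlan=PORT_VLAN):
    """Traffic is forwarded only between ports assigned to the same VLAN."""
    return port_vlan[ingress] == port_vlan[egress]

print(can_forward("Gi0/3", "Gi0/4"))  # both ports are on the extranet VLAN
print(can_forward("Gi0/3", "Gi0/1"))  # extranet port to general network port
```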

Figure 2-7 shows a typical screen from which networks can be configured. This particular screen allows the user to enter configuration parameters for a particular type of networking port.

In addition, access control lists (ACLs) on the routers need to be set up and updated to reflect the new security policy that should be in effect for this particular extranet. ACLs define rules that specify which type of network traffic is allowed between which locations, and which traffic should be blocked; in effect, they are used to build firewalls around the data. This creates the second layer of security.
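As a rough illustration of how such rules work, the sketch below evaluates a small rule list in order and applies the first rule that matches, denying anything left unmatched. The rule layout, the address ranges, and the function name `evaluate` are assumptions chosen for the example; actual router ACL syntax and semantics vary by vendor.

```python
from ipaddress import ip_address, ip_network

# Hypothetical rules: (action, source network, destination network, port).
# A port of None matches any port. Addresses are documentation ranges.
ACL = [
    ("permit", ip_network("198.51.100.0/24"), ip_network("10.20.0.0/16"), 443),
    ("deny", ip_network("0.0.0.0/0"), ip_network("10.20.0.0/16"), None),
]

def evaluate(acl, src, dst, port):
    """Return the action of the first matching rule; deny if none match."""
    src_ip, dst_ip = ip_address(src), ip_address(dst)
    for action, src_net, dst_net, rule_port in acl:
        if src_ip in src_net and dst_ip in dst_net and rule_port in (None, port):
            return action
    return "deny"

print(evaluate(ACL, "198.51.100.7", "10.20.1.5", 443))  # partner, HTTPS
print(evaluate(ACL, "192.0.2.44", "10.20.1.5", 443))    # any other source
```

Because the first match wins in this sketch, the order of the rules matters: the specific permit must come before the general deny.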

Finally, authentication, authorization, and accounting (AAA) servers need to be configured. AAA servers contain the privileges of individual users; even when a client has connectivity to the server, access privileges are still enforced at the user and application levels. Any access to the data is logged. This way, it is possible to trace who accessed what information, in case it is ever required, such as for suspected security break-ins.
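The combination of per-user authorization and access logging might look roughly like the following sketch. The user names, directory paths, and the in-memory `audit_log` are hypothetical, and recording denied attempts alongside granted ones is an assumption of this example; a real AAA deployment would rely on a dedicated server and protocol rather than a dictionary.

```python
from datetime import datetime, timezone

# Hypothetical privileges: user -> directory subtrees the user may read.
PRIVILEGES = {
    "alice@partner-a": {"/extranet/partner-a"},
    "bob@partner-b": {"/extranet/partner-b"},
}

audit_log = []  # (timestamp, user, path, allowed) records

def authorize(user, path):
    """Check the path against the user's subtrees and log the attempt."""
    allowed = any(
        path == root or path.startswith(root + "/")
        for root in PRIVILEGES.get(user, ())
    )
    audit_log.append((datetime.now(timezone.utc), user, path, allowed))
    return allowed
```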

However, before she can proceed with any of that, Sandy needs to assess where the data will be hosted and any impact that could have on the internal data center topology. After all, without knowing what servers should be connected, it is premature to configure anything else. When the partner comes online, demand for the affected data is sure to increase.

Sandy pulls up the performance-analysis application. She is not interested in the current status of the Internet Data Center because operations personnel are looking after that. She is looking for the historical trends in performance and load. Sandy worries about the potential for bottlenecks, given that additional demand for data traffic and new traffic patterns can be expected. She takes a look at the past month's performance statistics for the servers that are currently hosting the data. It seems they are fairly well utilized already. Also, disk space usage has been continuously increasing. At the current pace, disk space will run out in only a few more months. Of course, some of the data that is hosted on the servers is of no relevance to the partnership; that data can be migrated and rehosted elsewhere, which should provide some relief. Still, it seems that, at a minimum, additional disks will be needed. Given the current system load, it might be necessary to bring a new server with additional capacity online and integrate it into the overall directory structure. Sandy might as well do this now. This way, she will not need to schedule an additional maintenance window later and can thus avoid a scheduled disruption of services in the data center.
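An estimate such as "disk space will run out in a few more months" can be obtained by extrapolating the historical trend. The sketch below assumes evenly spaced monthly samples and linear growth, which is the simplest possible model; the sample values and capacity are invented for illustration.

```python
def months_until_full(usage_gb, capacity_gb):
    """Estimate months until the disk is full, assuming linear growth.

    usage_gb is a list of evenly spaced monthly usage samples, oldest first.
    Returns None if usage is flat or shrinking.
    """
    if len(usage_gb) < 2:
        raise ValueError("need at least two samples to estimate a trend")
    growth_per_month = (usage_gb[-1] - usage_gb[0]) / (len(usage_gb) - 1)
    if growth_per_month <= 0:
        return None
    return (capacity_gb - usage_gb[-1]) / growth_per_month

# Hypothetical samples: usage grows by 40 GB per month on a 1000 GB volume.
samples = [700, 740, 780, 820]
print(months_until_full(samples, 1000))  # 4.5
```

A planner would treat such a figure as a rough lower bound, since a new partner coming online is likely to increase the growth rate.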

Of course, the fact that data is kept redundantly in multiple places will be transparent (that is, invisible) to applications. All data is to be addressed using a common uniform resource identifier (URI). The data center uses a set of content switches that inspect the URI in a request for data and determine which particular server to route the request to. The content switch can serve as a load balancer in case the same data and same URI are hosted redundantly on multiple servers. The content switch is another component that must be configured so it knows about the new servers that are coming online and the data they contain. Sandy makes a mental note that she will need to incorporate this aspect into her plan.
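The behavior of a content switch can be sketched as a lookup from URI to a pool of servers, with requests rotated across the pool when the same data is hosted redundantly. The class below is an illustrative model only: the longest-prefix matching, the round-robin policy, and the server names are assumptions, not a description of any particular device.

```python
from itertools import cycle

class ContentSwitch:
    """Toy model: route a request by URI prefix, round-robin within a pool."""

    def __init__(self):
        self._routes = {}  # URI prefix -> endless iterator over its servers

    def register(self, prefix, servers):
        """Tell the switch which servers host the data under a URI prefix."""
        self._routes[prefix] = cycle(servers)

    def route(self, uri):
        """Pick the next server from the pool with the longest matching prefix."""
        matches = [p for p in self._routes if uri.startswith(p)]
        if not matches:
            raise LookupError(f"no server registered for {uri}")
        return next(self._routes[max(matches, key=len)])

switch = ContentSwitch()
switch.register("/extranet/partner-a", ["server-1", "server-2"])
switch.register("/", ["web-1"])
print(switch.route("/extranet/partner-a/specs.pdf"))
```

Bringing a new server online then amounts to another `register` call, which mirrors the configuration step Sandy notes for her plan.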

Observations

This should suffice for now as an impression of the professional lives of Pat, Chris, Sandy, and many other people involved in running networks. At this point, a few observations are key:

Pat, Chris, and Sandy handle their jobs in different ways. For example, in Pat's case, there are many specialized groups, each dealing with one specific task that represents just a small portion of running the network. On the other hand, Chris more or less needs to do it all. Sandy is less involved in the actual operations but more involved in the planning and setup of the infrastructure. This work includes not just network equipment, but computing infrastructure as well. There is no "one size fits all" in the way that networks are run.

Pat, Chris, and Sandy all have different tools at their disposal to carry out their management tasks. We take a look at some of the management tools in the next section. Not all tools that they use are management systems; in Chris's case, we saw how a spreadsheet and a piece of paper can be effective management tools.

A major aspect of Pat's job is determined by guidelines, procedures, and the way the work is organized. Systems that manage operational procedure and workflows are as much part of network management as systems that communicate with the equipment and services that are being managed. Their importance increases with the size and complexity of the network (and network infrastructure) that needs to be managed.

Some tasks are carried out manually; some are automated. There is no one ideal method of network management, but there are alternative ways of doing things. Of course, some are more efficient than others.

Management tasks involve different levels of abstraction and, in many cases, must be broken down into lower-level tasks. Chris and Sandy were both concerned, at one level, with a service (a voice service in one case, an extranet in the other), yet they had to translate that concern into what it meant for individual network elements. Sandy had to worry about how business-level security policies, which specify the parties that are allowed to share particular data, could be transformed into a working network configuration that involved a multitude of components.

Many functions are involved in running a network—monitoring current network operations, diagnosing failures, configuring the network to provide a service, analyzing historical data, planning for future use of the network, setting up security mechanisms, managing the operations workforce, and much more.

Integration between tools affects operator productivity. In the examples, we saw how Pat's productivity increased when she was supported by integrated applications, which, in that case, included a trouble ticket, a work order, and network monitoring systems. Chris, on the other hand, had to struggle with some steps that were not as integrated, such as needing to keep track of phone numbers in four different places (the company directory, the number inventory, the IP PBX configuration, and the voice-mail configuration).

Later chapters will pick up on many of the themes that were encountered here, after discussing the technical underpinnings of the systems that enable Pat, Chris, and Sandy to do their jobs. Before we conclude, however, let us take a look at some of the tools that help network providers manage networks.