Ask the Expert: Troubleshooting Unified Contact Center Enterprise

Welcome to the Cisco Support Community Ask the Expert conversation. This is an opportunity to learn and ask questions about integrating Unified Contact Center Enterprise into your environment and troubleshooting the many features that are available with the Unified Contact Center Enterprise solution.

Goran Selthofer is a team lead for the Cisco TAC EMEAR Contact Center team based in Brussels. He has supported UCCE, UCCX, and CVP applications for the past seven years within the Cisco TAC. He has more than 13 years of overall experience in the industry, with broad experience in Cisco Unified Communications infrastructure solutions, having also worked for a Cisco Gold Partner prior to joining the Cisco TAC. Goran also provides internal training to TAC engineers on Contact Center topics. He graduated with a master's degree from the Technical Military Academy, Belgrade University. He also holds a CCIE certification (number 27211) in voice, as well as VMware Certified Professional certifications.

Remember to use the rating system to let Goran know if you have received an adequate response.

Goran might not be able to answer every question due to the volume expected during this event. Remember that you can continue the conversation in the Contact Center discussion forum of the Collaboration, Voice and Video community shortly after the event. This event lasts through February 14, 2014. Visit this forum often to view responses to your questions and the questions of other community members.

Knowing your call flow in detail will reveal all the nodes and processes whose logs you should, or can, troubleshoot.

Now, basically, there are different types of nodes: Central Controller, Peripheral Gateways, the various peripherals, and CTI services (server and desktops). Each of those has its own specifics in setting and collecting traces.

Therefore, we have published the following Tech Note to help partners/customers with setting and collecting logs:

OK, so the usual confusion on that topic comes from the fact that users often think this should be similar to Microsoft SQL replication. Thus, users expect to see something like a GUI or a visual presentation of that replication.

However, MSSQL replication is not used here. Therefore, we need to understand the architecture before we can think of 'monitoring' it. Also, to be very clear from the beginning: there is no 'easy way' of 'monitoring' it, as there is no 'tool' for that.

Now, first, we need to separate the Router from the Logger, because they get their data in different ways and hence sync it differently, and that is why they are to be observed separately.

Routers have MDS (the Message Delivery Service process). Loggers do not have that process; instead, Loggers use the MDS of the same-side Router. A Logger on one side will never talk to the Router on the other side.

MDS is a sync zone, meaning every bit of data which comes to the Router on one side is replicated through MDS to the Router on the other side. The UCCE architecture utilizes two types of networks, and MDS uses the PRIVATE network for this communication. It is a very active process, since the Routers sync their MEMORY; therefore, that needs to be a perfect sync.

Then, the data which the Router gets, it commits to the local DB - and since the Router doesn't have a DB, that means it commits it to the same-side Logger's DB. That is how the Logger gets its data. So, a data bit which came from PGA to the RTR (how is not relevant at this point) ends up first in MDS on the Router A side (assuming PGA has an active link with RTRA), which replicates it to Router B, so it ends up in both Routers' memory. Now, EACH Router commits it to its own respective Logger.

Bottom line here is the following:

Routers sync their memory, and that cannot be easily 'monitored', but the RTR processes are designed in such a way that if there is a difference they will for sure complain - even up to the point that one side's process will not be able to start, or will restart if it is not able to get in sync. So:

MDS, though, would be a good starting point to check if something goes wrong, as it will report process or peer disconnects.

Also, the RTTEST tool can be used to check whether any failover happened, when it happened, and from which side the sync was done.

Loggers do get their data from their respective Routers, but Loggers also have the possibility to 'sync directly'. This kind of sync is done via a socket connection by the RECOVERY (RCV) process, and it can be monitored via the RCV logs (in a basic, logical fashion: are there any errors or unusual behavior, or not). So:

Check the RCV process logs to see whether everything is healthy on that side.

ICMDBA tool to quickly see if replication of new data is happening (Space Used Summary option from Data menu when Logger DB is selected) by monitoring Max date.

Maybe not what you hoped for, but I have tried to give an overall perspective for other users reading this later as well…

Category 1: Scheduled Purge
--------------------------------
Data is retained based on these RETENTION parameters:
HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM\\Distributor\RealTimeDistributor\CurrentVersion\Recovery\CurrentVersion\Purge\Retain

Tables are usually purged at 00:30 every day, controlled by this parameter:
HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM\\Distributor\RealTimeDistributor\CurrentVersion\Recovery\CurrentVersion\Purge\Schedule\Schedule

Category 2: Emergency Purge
--------------------------------
There are 2 parameters to control this, under this path:
HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM\cim\Distributor\RealTimeDistributor\CurrentVersion\Recovery\CurrentVersion\Configuration\Purge\Automatic

1. AdjustmentPercentage - purge on 80%
2. PercentFull - on 90%

Both are set to purge 1% when the DB reaches 80% or 90%, respectively.

WARNING: ABOVE REGISTRY KEYS SHOULD NOT BE CHANGED!!! DOING THIS WILL MAKE YOUR SYSTEM UNSUPPORTED!

The reason for this is very simple: ICM processes are in charge of filling data into the DB, hence ICM needs to keep the DB under 80% in order to compensate for data bursts, while at the same time ensuring proper performance at the level of processes interacting with the DB.

Now, how the PURGE is done is very simple: purge the oldest data first, but start from tables beginning with the letter A. So, usually it will be the oldest data in the Agent tables that gets purged first.

Here is the drawback of that approach: since it is set to purge 1% of data - just enough to bring DB usage under 80%, to 79% - the purge will stop as soon as enough rows have been removed from the Agent table to drop DB usage to 79%. However, if data is still coming into the DB and pushes it back to 80%, the PURGE will be triggered again with the same logic: start from A and purge 1%. So, if your DB is bouncing around the 80% mark, it can easily happen that your Agent_ tables are purged completely, making any reporting which depends on those tables impossible.

The 90% purge works the same way; however, when the DB reaches 90%, no new data will be allowed into the DB.
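To make that drawback concrete, here is a toy simulation of the emergency-purge logic described above. This is purely an illustration, not the actual ICM code; the table names, row counts, and capacity figure are invented:

```python
# Toy model of the emergency purge: when DB usage hits the threshold,
# delete rows (oldest first in the real system - only counts are
# modeled here), walking tables in alphabetical order, until 1% of
# capacity has been freed. All names and numbers are made up.
CAPACITY = 100_000  # stand-in for the DB's total row capacity

def emergency_purge(tables, capacity, threshold=0.80, adjustment=0.01):
    """tables: dict of table name -> row count. Returns rows purged
    per table; mutates `tables` in place like the purge would."""
    used = sum(tables.values())
    purged = {}
    if used / capacity < threshold:
        return purged                      # below threshold, nothing to do
    to_free = int(capacity * adjustment)   # purge 1% of capacity
    for name in sorted(tables):            # alphabetical: A tables first
        take = min(tables[name], to_free)
        tables[name] -= take
        if take:
            purged[name] = take
        to_free -= take
        if to_free == 0:
            break
    return purged

tables = {"Agent_Event_Detail": 900, "Termination_Call_Detail": 80_000}
print(emergency_purge(tables, CAPACITY))
# The small Agent table is emptied completely before the purge even
# dents TCD - the reporting risk described above.
```

Run this a few times with fresh data bouncing around the threshold and you can watch the A-named tables drain first on every pass.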

So, you can argue with this but you have to keep in mind one SIMPLE RULE:

This is an EMERGENCY action.

Your task as a system admin or system architect is to design the system in such a way as to AVOID reaching an 80% full DB at any time.

So, the answer to your a) question is above. Keep in mind that it is not the ICMDBA tool doing that, but the code itself.

The answer to your b) question is also above (the keys for retention). Of course, you should use the ICMDBA tool here - the option to Estimate your DB size based on the required retention periods - and then ensure you already have that disk space available before increasing retention times.

The answer for c) is NO. Definitely NO!

However, PLEASE do not take that as 'a primary line of success' - meaning 'I will just configure what I think is good for now; we can expand it later anyway.' That decision might cost your customer some data loss, because by the time you are engaged again to expand it, almost certainly there has already been a problem and data has started to drop.

In fact, we probably have at least one TAC case opened daily asking 'where is my data?'. This happens because an improper estimation of retention periods versus DB size was made during deployment. So, the customer wanted to retain 3 years of data, and the retention periods were set according to that WISH - and it is nothing more than a WISH. For it to become reality, the DB size also needs to follow that WISH. Well, the DB size was left at 40 GB, and then 'suddenly' everyone is wondering 'why am I losing data when I have configured a retention period of 3 years?'

I hope I have given you a clue as to why that is.

Also, if reporting is so important to customer, we do recommend HDS on both sides and regular backups and DB maintenance.

Also, as described in one of the above posts, you can use the checke tool to see the peripheral error mapping - code to description:

Peripheral Error Code Descriptions

A quick way to obtain the description for a UCCE Peripheral Error Code is to log onto a UCCE system, open a command prompt, navigate to the C:\icm\bin directory, and run "checke <error code>", where <error code> is the peripheral error code that you have identified. In this example we would use: c:\icm\bin>checke 12005

Now, although most of the processes are not complete from a serviceability point of view (documenting/listing all possible errors), the BU's intention is to write as much detail directly into the logs as possible, to give more clues about what is happening.

Examples of Error messages in logs:

Failed to update the database.

The Update succeeded at the controller but was not propagated back to the Distributor.

Check the status of UpdateAW on the Distributor.

Or:

Failed to update the database.

Another user has changed the configuration data. Re-retrieve the data and try save again.

If the problem persists, you need to reload your local database. You can do this using

Unfortunately, the Finesse version is not shared, and neither are further details - e.g. phonebook changes are not reflected, but is that consistent on both servers (if this is a duplex server deployment) or only on one? All changes since a certain time, all the time, or intermittently? ...etc.

In the above post I shared a Problem Solving process, with lots of questions which might help to isolate the issue.

This is probably not a popular thing to be asked to answer, as some think it is a waste of time, but this is how TAC resolves more than 65% of cases, believe it or not.

Those questions actually come from the well-known Kepner-Tregoe Problem Analysis methodology and are used in troubleshooting different issues, not only in IT. Every Cisco TAC engineer is required to pass KT training so as to be able to use it.

OK, so back to the issue. I will assume you are not on the 10.0 release, hence it might be that you are hitting a known issue:

CSCul20619 CCE and CCX: PhoneBook update not shown on desktop after DB restart

There are currently some issues in viewing this defect from outside, but it is marked as external so it will be visible in the future. Anyway, the workaround is to restart Cisco Tomcat.

Please check if that resolves it for you and let me know. (Note: restart Cisco Tomcat out of production hours).

Now, if you want to troubleshoot Finesse for that issue, here is what you usually do for logs:

6) The agent attempts to make a call, and the options window opens showing the available phone books.

7) Collect the Web Services logs from the time Tomcat is restarted until just after the attempt to make a call, when the missing or incomplete phone books are observed.

Be careful: this is service impacting, so do it after hours. Also note, a Tomcat restart might resolve the issue, as mentioned above, so you might not be able to reproduce it.

How to collect the Error and Desktop logs for review:

1. When the agent sees the issue on the desktop, have the agent hit "Send Error Report" on the desktop. This will send the client-side logs to the Finesse server.
2. Use the CLI command to collect all Finesse logs: file get activelog desktop recurs compress
3. Collect the CTI Server logs from the time of the Finesse Tomcat restart to the time the agent sees the issue on the desktop. (Healthcheck)

Well, honestly, there is no 'tool' which is used by TAC to read UCCE traces. Far from it. We use text editors with coloring schemes when reading logs, and that is a long, manual process.

One exception is a basic Call Flow tool which is distributed with the CVP software. That one is promising, but it is still not widely used, as it requires pre-created log templates and currently there are only a few. However, it works very well for CVP SIP tracing.

However, back to UCCE logs: we are not using any special tools, and that is why TAC will generally ask you to provide as much info as possible (like ANI, time stamps, Agent ID, Extension, etc.) in order to analyze the logs - it is not enough just to send logs to TAC. So, unfortunately, there are no magic buttons or crystal balls (YET!) which TAC, partners, or customers can use when reading logs.

With that being said, foundation for reading logs is to really understand processes and tasks, to know the exact call flow and expected behavior and to gather as much info as possible about BAD but also GOOD examples.

Now, related to the second part of your question - with the intention of making it more current by bringing Finesse into the same story, and as I promised David in the post above - I invite you to read a great example written by my good friend and colleague Linda, who created the following:

The information in this document is based on Cisco Finesse Version 9.1(1).

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.

Finesse Error-Desktop-webservices Log File

First, locate any CTI Server errors in the Error-Desktop-webservices log. This log will help you identify whether the Finesse server is receiving error messages from the CTI Server. In this log we see the following message at 14:43:50.838.

The error message indicates ENTITY_ID=2005 or agentID 2005 encountered an error when trying to login.

Finesse Desktop-webservices Log File

Open the Desktop-webservices log for the same timestamp that the error was found in the Error-Desktop-webservices log.

Search the log file to locate the error message. Search on the timestamp of the error (14:43:50) or the failure code, i.e. failureCode=70, to locate the error message. The webservices log will provide the "peripheralErrorCode"; in this example, the peripheral error code received from the backend CTI Server is 12005.

Open the CTI Server log for the same timestamp that the error was found in the Error-Desktop-webservices log and locate the peripheral error code 12005 with approximately the same time stamp. In this example search on 14:43.

The CTI server log will show that CTI Server received a SET_AGENT_STATE_REQ for agent 2005 using AgentInstrument 2005. This message was sent to CTI server from Session 5 or the Finesse Server. CTI server responded to the request with a PeripheralErrorCode:12005.

To determine which session is your Finesse server, you can use procmon. Using procmon, we can verify that Session 5 is 10.10.10.38 with a client ID of Finesse, where 10.10.10.38 is the IP address of our primary Finesse server.


Goran, thanks for the great detailed answer! One more question for you. If my CIM (Cisco Interaction Manager) is integrated with UCCE, what is the best point to start troubleshooting activity routing issues? Thanks again!

- In case CIM is integrated with ICM, CIM will extend routing decisions to ICM. That is why the service on the CIM side which makes this possible is called EAAS - External Agent Assignment Service.

- So, EAAS talks with the PIM on the MR PG side of ICM. This makes life much easier, as they use MRI (the MR Interface), hence they have standardized behavior.

- Now, exactly that PIM on your MRPG is THE Border Line to start troubleshooting routing issues.

- AND it is VERY SIMPLE - here is how you do it:

* First, enable MR tracing on that MR PIM. Let's take an example where the ICM instance name is ACME and the MR PG node is PG2A, where PIM1 is the MR PIM talking to CIM.

So, open cmd line on MRPG box, procmon to that pim and enable all MR tracing:

> procmon acme pg2a pim1

>>>>ltrace (this will list traces currently enabled)

>>>>trace *.* /off (first you want to disable everything which is enabled currently to avoid noise in logs)

>>>>trace *heart* /off (this DISABLES HEARTBEATS, as they would be too noisy in the logs)

....mr_heartbeat_messages 3 Off....

>>>>trace mr* /on (this enables all the MR trace flags)

So, in a few seconds with the above commands, you have enabled MR tracing.

To make long story short:

- When a new email comes in, RX will pull it into CIM, and then, after going via the DB to generate an ActivityID, it will eventually hit the CIM Workflow, where it will reach the integrated queue. That will make EAAS send the activity to ICM for routing.

What exactly will happen is that EAAS will send NEW_TASK message to ICM via MRPIM:

So, bottom line: start from the MR logs to see whether NEW_TASK and DO_THIS_WITH_TASK are present for the same activity. If yes, then ICM's job is done and the issue is on the CIM side. If DO_THIS_WITH_TASK is not there, then it fails on the ICM side. This will point you further to which side you need to investigate more.
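If you want to scan an MR log for that NEW_TASK / DO_THIS_WITH_TASK pairing, a rough sketch like the following can help. The log line format here is an assumption on my part (a "TaskID=<n>" field on each message line); adjust the patterns to whatever your actual MR PIM logs show:

```python
import re

# Hypothetical log format: each MR message line carries "TaskID=<n>".
NEW_TASK_RE = re.compile(r"NEW_TASK.*?TaskID=(\d+)")
ANSWER_RE = re.compile(r"DO_THIS_WITH_TASK.*?TaskID=(\d+)")

def unanswered_tasks(log_lines):
    """Return task IDs that got a NEW_TASK but no DO_THIS_WITH_TASK,
    i.e. the activities where ICM never returned a routing answer."""
    new_tasks, answered = set(), set()
    for line in log_lines:
        m = NEW_TASK_RE.search(line)
        if m:
            new_tasks.add(m.group(1))
        m = ANSWER_RE.search(line)
        if m:
            answered.add(m.group(1))
    return sorted(new_tasks - answered)

sample = [
    "10:01:02 Trace: NEW_TASK received, TaskID=101",
    "10:01:03 Trace: DO_THIS_WITH_TASK sent, TaskID=101",
    "10:02:10 Trace: NEW_TASK received, TaskID=102",
]
print(unanswered_tasks(sample))  # task 102 never got a routing answer
```

Any task ID this reports is one where, per the logic above, the failure is on the ICM side.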

I have a question about redundancy and failures... how does UCCE handle it if the connection between side A and side B gets disconnected (both the public and private WAN links)? And how can one recover the data after they reconnect?

also, how is the "RecoveryKey" calculated, and is it possible to reset it manually?

However, I also invite you to read all other scenarios there as it really describes what happens when only Private network fails, or only Public network fails... or when only one Logger fails, or PG...

So, it will tell you that the PG can buffer some data; it also tells you that a Logger can be down for 12 hours, and that if it is down longer you will need to do a MANUAL sync of the DBs... etc.

I am sure you will get very useful information from that document!

Now, your question about RecoveryKey calculation. I will use the explanation I learned from the BU while working on a case:

RecoveryKeys are numbers automatically generated by the CallRouter; once the data is replicated to the HDS, they are no longer used. The system automatically generates the Recovery Key in all of the ICM historical tables as they are written from the "temp" tables into the real ICM tables by the Recovery process. This process generates the key based on the date/time of the Logger (down to the nanosecond): it figures out the number of seconds that have passed since the starting date of 1/1/1995. This gives us a Julian date in seconds from our starting point, to be able to compare dates/times.

This seconds value is multiplied by 1,000 to create the actual key - based on the fact that we don't expect to be able to write more than 1,000 records per second into any one table. Each time a table is written to, the key is incremented for the next record within the same table. This key is set when the HistLogger process starts on the Logger - it even spits out a message telling you that this is the new recovery key for all historical data.

The recovery keys are kept unique by this concept on a table-by-table basis -- every table starts with the same key value -- then it is incremented in that table by one on every record written in that table.

When the data is then replicated from the Logger to the HDS, it is not written based on the recovery key, it is written based on the table itself--in alphabetical order taking each new item in their recovery key order within the table and writing to the HDS.
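As an illustration of that arithmetic (this is a sketch of the scheme as described above, not the actual ICM code), the base key works out like this:

```python
from datetime import datetime

# Sketch of the described scheme: seconds elapsed since 1/1/1995,
# multiplied by 1,000, then incremented by one per record written
# to a given table. Dates here are arbitrary examples.
EPOCH = datetime(1995, 1, 1)

def base_recovery_key(logger_time):
    seconds_since_1995 = int((logger_time - EPOCH).total_seconds())
    return seconds_since_1995 * 1000  # room for ~1,000 records/second

# Every historical table starts from the same base key when the
# HistLogger process starts, then increments independently per table.
start = base_recovery_key(datetime(2014, 2, 14, 12, 0, 0))
agent_keys = [start + i for i in range(3)]  # 3 records into one table
print(agent_keys[1] - agent_keys[0])  # consecutive records differ by 1
```

This also shows why keys from two independent systems are comparable in magnitude (both count from 1995) yet never meaningfully related to each other.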

So, answer to your question - can you manually reset those keys - is ABSOLUTELY NO.

I've been trying to get Finesse up and running, but I'm getting the invalid user ID/password error when I try to log in. I've entered valid parameters in the Finesse admin page, and have also verified the user mapping of the AWDB in SQL Server Management Studio (I'm using Windows authentication).

I had my colleague Zaid Salama preparing this while still being very busy with his own work so I want to thank him for that!

Ok, so here we go:

Regarding this question, I believe we will need more information on the Finesse version, UCCE version, etc. However, from a first look at the description, I would expect that the issue is related to the AW. If you are sure that the username and password are correct, the user has the correct privileges, and the connectivity between both sides is good, it might be that you are using NTLMv2 authentication on the AW.

The documentation defect "CSCuj95347 Document Finesse JDBC driver cannot authenticate using NTLMv2" confirms that NTLMv2 is not supported by Finesse: Finesse uses a third-party JDBC driver, and that driver doesn't support NTLMv2. The resolution is to disable NTLMv2 on the AW, which can be done as follows:

1) Disable NTLMv2 on the AW server hosting the AWDB and reboot the AW server.

I've always wondered... why doesn't TCD, or any other table within an ICM database, hold the information as to who disconnected the call? I find it cumbersome to have to pull CDR logs and try to correlate them to TCD records just to find out who disconnected the call. I'm sure there's a good reason for this and I'd certainly like to hear your input on it.

Additionally, is there an easier way to correlate TCD and CDR records? Can RTMT play a role here?

Yeah, the "who disconnected the call" investigation was always part of the call-control side of things... historically, ICM worked with ACDs and never had to worry about call control, hence that part was always left out - from ICM's point of view it was not something to be concerned with, as it is irrelevant for ICM calculations.

Well, times are changing, so if there is real demand for this, I believe the pressure will - and should - come from the customer side, by contacting local Cisco Account Managers and asking them to open a Product Enhancement Request (PER) for it. Of course, the BU will need them to provide a business case / justification, as it is less a technical issue than a capacity and priority one. Keep in mind that it would require developer hours and possible protocol changes. That is why there is no action on that side: the risks are higher than the gain (considering the call-control side already has that info).

OK, but now to your real question: how to correlate TCD and CDR.

Not sure if you have seen this, but I already answered that question more than 3 years ago, here:

These IDs are not unique because the same PeripheralCallKey and CallID are re-used in redirect, transfer and conference scenarios.

Also, this only works within a single cluster. So, in a multiple-cluster environment, you need to map cluster CDRs to a specific PeripheralID.

Cradle to Grave Call Tracking in ICM

----------------------------------------------------------

The RouterCallKeyDay, RouterCallKey, and RouterCallKeySequenceNumber will track a call from its first route until its final call leg.

The RouterCallKeyDay and RouterCallKey combine to provide common attribute across the calls.

The RouterCallKeySequenceNumber gives you some sense of order of when calls were created. (gselthof: so note, 'some sense' is not guaranteed order!!!)

In a multi-peripheral environment, this requires routing between peripherals. This means calls to the IVR need to be translation routed, and calls to other agent clusters need to be routed as well.
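As a toy illustration of how these keys tie call legs together (the records and values below are invented stand-ins, not real schema rows), the legs of a call can be grouped by the RouterCallKeyDay/RouterCallKey pair and ordered by the sequence number:

```python
from itertools import groupby

# Stand-in TCD leg records: (RouterCallKeyDay, RouterCallKey) is the
# common attribute across a call's legs; the sequence number gives
# some sense of creation order (not a guaranteed order).
legs = [
    {"RouterCallKeyDay": 151234, "RouterCallKey": 42,
     "RouterCallKeySequenceNumber": 1, "leg": "CVP"},
    {"RouterCallKeyDay": 151234, "RouterCallKey": 42,
     "RouterCallKeySequenceNumber": 2, "leg": "Agent"},
    {"RouterCallKeyDay": 151234, "RouterCallKey": 43,
     "RouterCallKeySequenceNumber": 1, "leg": "CVP"},
]

def call_id(leg):
    return (leg["RouterCallKeyDay"], leg["RouterCallKey"])

calls = {}
for key, group in groupby(sorted(legs, key=call_id), key=call_id):
    ordered = sorted(group, key=lambda l: l["RouterCallKeySequenceNumber"])
    calls[key] = [l["leg"] for l in ordered]

print(calls[(151234, 42)])  # legs of the first call, in sequence order
```

The same grouping applied to real TCD rows gives you the cradle-to-grave view, provided the calls were routed between peripherals as noted above.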

Identifying Routed Agent TCDs

----------------------------------------------------------

You will want to filter out the TCDs created for the CVP call legs, as well as those generated for internal agent-to-agent calls.

Use the AgentSkillTargetID to identify the agent, SkillGroupSkillTargetID to identify the SkillGroup, and CallTypeID to identify the Call Type / program.

If all three of these values are filled in, you know you got a call that was routed to an agent.

Sometimes more than one TCD will meet these three criteria for the same PeripheralCallKey. In those cases, the one with the lowest RouterCallKeySequenceNumber will identify the first call answered by the agent.
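The two filtering rules above can be sketched like this (stand-in records as plain dicts, not a real schema query; the ID values are invented):

```python
# Sketch of the filtering described above: keep TCDs where agent,
# skill group, and call type are all filled in, then keep the lowest
# RouterCallKeySequenceNumber per PeripheralCallKey.
def routed_agent_tcds(tcds):
    routed = [t for t in tcds
              if t["AgentSkillTargetID"] is not None
              and t["SkillGroupSkillTargetID"] is not None
              and t["CallTypeID"] is not None]
    first = {}
    for t in routed:
        key = t["PeripheralCallKey"]
        if (key not in first or
                t["RouterCallKeySequenceNumber"]
                < first[key]["RouterCallKeySequenceNumber"]):
            first[key] = t
    return list(first.values())

sample = [
    # CVP call leg: no agent/skill group, so it is filtered out
    {"PeripheralCallKey": 1001, "RouterCallKeySequenceNumber": 1,
     "AgentSkillTargetID": None, "SkillGroupSkillTargetID": None,
     "CallTypeID": 7001},
    # The agent leg that answered the routed call
    {"PeripheralCallKey": 1001, "RouterCallKeySequenceNumber": 2,
     "AgentSkillTargetID": 5001, "SkillGroupSkillTargetID": 6001,
     "CallTypeID": 7001},
]
print(len(routed_agent_tcds(sample)))  # only the agent leg survives
```

The same predicate translates directly into a WHERE clause against Termination_Call_Detail if you prefer to do the filtering in SQL.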

CallDisposition

-----------------------------

The CallDispositionFlag is the best indicator to find out if a call was handled or not. There are a bunch of CallDispositions. The CallDispositionFlag distills the results down to 7 categories.

You can find details on what the CallDispositionFlags are in the schema help or schema guide.

Thanks Goran for your clarification and links for failover and RecoveryKey

A couple of things related here:

- From my understanding about your explanation of how RecoveryKey is generated, does this mean that the RecoveryKey of ANY UCCE system is similar? As it depends on time, this means that today's RecoveryKey is always bigger than yesterday's, even if those were 2 different clusters at 2 different customers?

--> If that's the case, then why, in a technology refresh, can't I migrate the HDS tables AFTER the Logger? This is based on the UCCE upgrade guide, where all upgrade scenarios have the HDS being upgraded first (or at the same time as the Logger).

- In case side A and side B historical data (from loggers and also hds) are not equal due to some failure happening at some point (we checked icmdba and number of records is different), what is the best way to fix that?

- Yes, but they are independent between systems. One system will not create exactly the same keys as another. However, it is true that the RecoveryKey can only increase, measured from its own initial base on that system. RecoveryKeys between different customers should not even be discussed, as that is not something which can or should be compared anyway.

- Install/Upgrade Guide:

If you complete the upgrade of the main Administration & Data Server within the Logger purge window (usually 14 days), you can replace the temporary Administration & Data Servers with the upgraded Administration & Data Servers for reporting. The data replication process fills in any missing data.

So that means that, since you would probably need your AW for some tasks during the upgrade, you migrate it before, or at the same time. However, if you don't want to, you can set up a NEW TEMPORARY one for that and then migrate the real AW/HDS later, as said above.

- The recommendation is that you don't bother with data holes - that is why you have two HDSs, one on each side. Since a data hole can happen, you can simply point to, and take reports from, the side which has the data. That action is anyhow limited to the time you need reports for that particular missing period: you collect the reports and you are done; you don't need to bother with it anymore. I am not sure why you would need to keep them in total sync, as they are there precisely to compensate for those data holes. If you do want to keep them in total sync:

1. Stop ICM services on HDS1 and do a full backup of HDS1, then start services on HDS1.
2. Copy that backup file over to HDS2.
3. Stop services on HDS2, delete the HDS DB on HDS2, and recreate the DATA and LOG parts with the same sizes as on HDS1, so that you can restore the HDS1 backup on the HDS2 box (ICMDBA has a 32 GB limit, so once you create the DB with ICMDBA, use SQL Studio to expand the file parts to match the HDS1 settings).
4. Once you restore, truncate the Recovery table on HDS2 and start services on HDS2.

Now you have both with the same data.

Thank you for initiating the session and for the good explanation of the solutions.

I have a query on CAD.

Frequently, the agent log statistics in Supervisor Desktop don't display anything; they show blank. Could you please let us know how we can troubleshoot this issue? We frequently restart the Cisco Enterprise service and the Recording and Statistics service, and if the issue is not resolved we go for the Chat service and Sync service.

But sometimes the issue doesn't get resolved, so we kindly request that you explain how it works and what could be responsible for this.

OK, so CAD is one of the components which is integrated with CTI/CTIOS levels on ICM side. However, CAD is managed on its own with separate NT services and tools (like PostInstall).

Now, you didn't send your CAD version (there have been some changes in replication), but in general, the Recording and Statistics Service is responsible.

Also, I am not sure whether you are using Flat Files or SQL replication for RASCAL, as those work in totally different ways. Also, people often mix up LDAP and RASCAL replication.

Be Aware! LDAP and RASCAL are separate and independent databases.

In current versions, RASCAL uses XML files (flat files) or SQL on the UCCE PG as the datastore (Informix in UCCX). It stores data in three tables: FCRasRecordLog, FCRasCallLogWeek and FCRasStateLogToday. Flat files are in \Program Files\Cisco\Desktop\databaseTeamName folder on both the Primary and Secondary CAD Servers.

Sync Service uses LDAP (OpenLDAP), which syncs with the ICM AWDB to pull in agent, team and skill config information for the CAD Logical Call Center (LCC). Additionally, workflow group and phonebook customization is also stored within LDAP.

I imagine that you might have issues with RASCAL replication in your environment, and that with the restarts you are just triggering a switch back to the working side. Due to the limitations of this ask/answer session, I would invite you to open a case with Cisco TAC when you experience such an issue so that it can be troubleshot. In brief, Flat Files are NOT a guaranteed mechanism for retaining statistics.

As first aid, I can offer an often-used procedure to re-establish a broken or corrupted replica:

As far as ‘send to originator’ option goes, this is what needs to be known:

Three types of DNs work with Send To Originator: VRU label returned from ICM, Agent label returned from ICM, and Ringtone label.

Send To Originator does not work for the error message DN because the inbound error message is played by survivability and the post-route error message is a SIP REFER. (Send To Originator does not work for REFER transfers).

Note: For Send To Originator to work properly, the call must be TDM originated and have survivability configured on the pots dial peer.

On the top, there has been lots of discussions around this in the past on cisco forums. One of the best contributors there is Geoff and here you can find some example steps coming from his kitchen:

We upgraded a couple of clients onto our HCS environment. Since then, we have had a couple of outages where the A side loses connection with the B side. Normally this is related to some network interruption, and it appears that way in the logs. However, when I look in the system Event Viewer on the Call Server, I see the following:

The Advanced QoS Setting for inbound TCP throughput level successfully refreshed. Setting value is not specified by any QoS policy. Local computer default will be applied.

I can match these up before every outage. As you know, after 8.5 Cisco switched from Packet Scheduler-based QoS to Group Policy-based QoS. So I'm wondering if anyone else has seen this in 9.0. The first time, I thought maybe it was a coincidence, but I have since seen it on other outages on completely separate instances. I wonder whether this is just an effect of an outage, but I see it before the server loses connection to its duplexed partner, so I believe it may actually be the cause. Any info you could provide on these messages would be appreciated, because this is the first time we are seeing it, with the upgrade to UCCE 9.0.

Event ID 16501 — QoS Policy Update

Quality of Service (QoS) policies are applied to a user or computer account by using Group Policy. The QoS policies are applied to a Group Policy object (GPO), which is then linked to an Active Directory container, such as a domain, site, or organizational unit (OU), that contains the user or computer account.

Event ID 16504 — Advanced QoS Settings

"Advanced Quality of Service (QoS) settings provide additional controls for IT administrators to manage computer network use and DSCP markings. Advanced QoS settings apply only at the computer level, whereas QoS policies can be applied at both the computer and user levels."

OK, honestly, I cannot say I have seen this in 9.0 yet, but I did see it in 8.5. Maybe 3-4 times so far.

One was related to the following known issue but that got resolved with 8.5.4 and 9.0.1:

"Note: Account policies are overwritten by the domain policy by default. Applying the Cisco Unified ICM Security Template does not take effect. These settings are only significant when the machine is not a member of a domain. Cisco Recommends that you set the Default Domain Group Policy with these settings."

It is not directly related, as the above is about account policy, but it does involve applying the Default Domain Group Policy, which in turn can also carry the QoS features above.

Thanks Goran. Some really good info in your post. And we've seen lots of problems when VM settings are incorrect, so that is always the first thing we visit when approaching an issue like this.

I have to think it's some type of defect. Interesting, the defect in the previous release, although we don't see that message. Just one that no one has seen yet. And it is tough to catch because the problem masks itself as a network failure.

We are going to try to re-run setup and disable all the policies, and see if that takes care of the issue. If it does, then we know 100% this is the cause, and we'll have to work with TAC on the defect.

But thank you for taking a look and for the docs. Really appreciate the Netgen. Have never used that.

I understand that MS SQL Server will consume as much memory as possible since it wants to cache as much as possible into memory; however, I'm having a difficult time interpreting the Serviceability Best Practices Guide. I have a customer that's concerned about their memory utilization on their AW-HDS and Rogger, but I need to be able to decipher the formula in the guide to really dig into this more.

Is memory utilization a good indicator for monitoring strenuous events the server might be experiencing? I say no, but I'd like to hear the expert's opinion.

Referencing the image below:

Let's take for example a Rogger that has 6 GB of memory (6442450944 bytes). I will use the numbers displayed on the Rogger as of this writing:

Committed: 6442450944 bytes

Utilized: 6123421696 bytes

Free: 319029248 bytes

According to the image, anything less than 20% is crossing the threshold for available memory on the server.

(6442450944 - 6123421696) = 319029248; 319029248 / 6442450944 * 100 ≈ 4.95, or about 5%, which matches exactly what's in Task Manager. So the server is significantly below the 20% threshold, but again, how can this be a real indicator if SQL Server will take as much as it can? Am I misinterpreting all this? What exactly would be a good indicator, from an infrastructure perspective, that would tell me that a server is in fact healthy? I've attempted to leverage the counters that are collected by the node manager and saved to c:\icm\log, but again, I'm not sure what exactly is a good indicator to prove a healthy system.
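The available-memory check described above can be sketched in a few lines (the byte counts are the ones quoted in this post):

```python
# Available-memory check using the Rogger figures quoted above.
committed_bytes = 6_442_450_944   # total physical memory on the Rogger (6 GB)
utilized_bytes  = 6_123_421_696   # memory in use per Task Manager
free_bytes = committed_bytes - utilized_bytes          # 319029248 bytes free
available_pct = free_bytes / committed_bytes * 100     # percentage available
print(f"Available memory: {available_pct:.2f}%")       # ~4.95%, under the 20% threshold
```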

OK, I am not sure where you got the figures for that calculation, as I was trying to use the same approach in my lab.

I will paste a screenshot here so that it is clear which figures I used.

Now, one very important thing there. When something is mentioned like "Measurement Counter: Memory – Committed Bytes", that means a Windows Performance Monitor counter.

Also, check page 104 in that document; under "8.1 Platform Health Monitoring Counters" there is a table for health monitoring. It lists Performance Objects, so those are Windows Perf counters as well. You can use those for a health check, but with the note below about SQL.

Now, calculation. Here is my snapshot:

MEM physical = 6143 MB and that is 6441402368 bytes

MEM Sat = 80% of MEM physical, which is 4914 MB or 5153121894 bytes.

My MEM 95% is Committed Bytes = 4833980416 bytes.

So Mem p = 4833980416 / 5153121894 * 100 = 93.8
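The saturation arithmetic above, as a quick sketch with the same lab figures:

```python
# Memory saturation calculation from the lab snapshot above.
mem_physical = 6143 * 1024 * 1024    # 6441402368 bytes of physical memory
mem_sat = mem_physical * 0.80        # 80% saturation level, ~5153121894 bytes
committed = 4_833_980_416            # 'Memory - Committed Bytes' counter value
mem_p = committed / mem_sat * 100    # committed as a percentage of the saturation level
print(f"Mem p = {mem_p:.1f}")        # 93.8
```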

If you want to check Indicator Counter: Memory - Available Bytes with 20% threshold then my calculation is:

"When you start Microsoft SQL Server, SQL Server memory usage may continue to steadily increase and not decrease, even when activity on the server is low. Additionally, the Task Manager and the Performance Monitor may show that the physical memory that is available on the computer steadily decreases until the available memory is between 4 MB and 10 MB.

This behavior alone does not indicate a memory leak. This behavior is typical and is an intended behavior of the SQL Server buffer pool.

By default, SQL Server dynamically grows and shrinks the size of its buffer pool (cache), depending on the physical memory load that the operating system reports. As long as sufficient memory (between 4 MB and 10 MB) is available to prevent paging, the SQL Server buffer pool will continue to grow. As other processes on the same computer as SQL Server allocate memory, the SQL Server buffer manager will release memory as needed. SQL Server can free and obtain several megabytes of memory each second. This allows SQL Server to quickly adjust to memory allocation changes."

On top of that, if there is a legitimate concern about low memory and paging happens often, then adding extra memory to the VM is totally fine.

My question is about the AGPG failover mechanism. We faced a problem in one of our recent implementations where AGPG side B would not take over if we shut down the services on side A; finally we had to involve TAC to sort it out.

He changed a set of registry settings on both sides and it started working.

I wanted to know which registry settings I would have to look into in such situations.

Well, what I mean to say is that there are no requirements to tweak registry settings in 'green field' deployments to make PGs work duplex.

This is done by running PG Setup.

I am not sure which exact case you are referring to and what exact registry settings were changed, but I can assume the following was the problem:

- the ports used by the MDS process were not matching.

Now, this can happen if you try to deploy more than 2 PGs on the same box (yeah, I know... customers will always say "No, we did not do it").

But I am just giving one possible example. In that case, since only 2 are supported per box, MDS ports might get reused. Then, if you go back and forth through the installer, changing PG numbers and/or sides, you can again end up in a similar situation because the previous setup was not yet finished.

Again, there should normally be no registry tweaks needed to make duplex work, but you can ping me the case number and I will take a look.

Huh, honestly, that would be very much welcomed by TAC as well, but unfortunately there is no 'perfect' solution so far for monitoring UCCE as such.

However, there are already lots of tools out there that enable you to monitor some parts of the system or solution, such as:

- OPS Console in CVP

- Router Log Viewer on AW

- Script Editor in monitoring mode.

- Monitoring connections in CCMP

- Monitors and tools within CIM (Cisco Interaction Manager)

- CUIC reporting

...mentioning, of course, only the tools available within the product that are capable of 'real-time' monitoring or data capturing/reporting, if you like...

All of those will give you some info which might be of interest to you, but they are certainly not what you want as the final word in your NOC.

Now, we do supply SNMP MIBs, and so far that has been the best bet for monitoring UCCE. I will leave it to other customers/partners here to comment on whether and how valuable they find it. But certainly, embedding the UCCE SNMP agents within an already deployed SNMP monitoring infrastructure is going to help.

For all above reasons, the last initiative is to try to align with Cisco Prime: