Thursday, April 25, 2013

So what's there to further squeeze the performance of Policy Agent during high-load? -- This is a typical question from customers.

There is this section on Policy Agent Configuration reload and cleanup interval. The default are set to 60 mins and 30 mins respectively.

This is what I verified by tailing the agent debug log (of course, you need to enable log level to MESSAGE at least). The timing is correct - once every 60 mins and 30 mins.

The following is what you'll see in detail when the reload takes place. (It's longer than the following)

So, should we change this value? My personal view (I already say personal view; do not point a gun at me :> ) is NO for Configuration Reload Interval, even for a very highly loaded server with policy agent installed.

PS: Since these are "Hot-swap : No" variables, do remember to restart the web container that has Policy Agent installed.

Now, one thing to note is if you enable notification, the cache will be flushed as and when an update comes from OpenAM servers. This might happen even when the polling interval has not reached.

In this mode, cache entry expiration still applies through use of the polling mechanism. In addition, the web agent gets notified by the OpenSSO Enterprise service about session changes through use of a notification mechanism. Session changes include events such as session logout or a session timeout. When notified of a session or a policy change, the web agent updates the corresponding entry in the cache. Apart from session updates, web agents can also receive policy change updates. Policy changes include events such as updating, deleting, and creating policies.

While writing this, I realized I have posted an article on this topic before. Still applicable today.

2. Mobile apps can send requests to these services to do the following operations:
■ Log in a user using Basic (username and password) authentication.
■ Verify the status of a SiteMinder session.
■ Log out a user.

Now, what happens when the Primary node (am1.cdemo.sg:7676) is restarted?

My observation is:

1. The OpenAM servers will continue to communicate with the last active node, which is Secondary node (am2.cdemo.sg:7777)
2. They will only be communicate with the Primary node (am1.cdemo.sg:7676) again upon OpenAM restart.

=> This follows the Database URL : am1.cdemo.sg:7676,am2.cdemo.sg:7777 which was initially configured during Site setup

Tuesday, April 16, 2013

Sun GlassFish(tm) Message Queue 4.4 is quite a black box to me. I find it hard to debug when things go wrong.

Now, at least I learnt a few sanity checks to ensure the broker cluster is running OK.

1. Ensure Message Broker is started properly

- Once AMSFO is configured, run ./amsfo start
- Tail the log file at ../jmq/imq/var/instances/ambroker/log/log.txt
- Make sure there is this line "Broker "ambroker@xxx.xxx.xx:YYYY" ready."
-

2. Ensure Session Failover is enabled

- Once AMSFO is configured, the 1st thing I'll do is to turn the debug log of each OpenAM server to MESSAGE level.
- Restart each OpenAM server
- Tail the Session debug log
- Make sure there is this line "Session Failover Enabled = true"

It's very surprising to me that customers are demanding software to be fully IPv6 compliant these days, especially here in Singapore. IDA (Infocomm Development Authority of Singapore) has been pushing this directive real hard to all ministries for the past 1-2 years.

Another interesting feature in OpenAM 11.1 is the Monitoring Dashboard. Pretty curious what kind of dashboard will that be. This is usually a requirement from customers when the SSO infrastructure goes LIVE. We have been writing custom scripts and application to handle this out of OpenAM core.

Saturday, April 13, 2013

I have customers who are using Microsoft Active Directory as the authoritative authentication store, as well as user data store. So do we internally (of course, ours is meant for a live POC/showcase).

We know that the connection to MS AD is via the LDAP Authentication Module. It has been there for years. However, it has its own shortcomings which I blogged here 3 years ago.

Starting in OpenAM 10.0, the LDAP Authentication Module has been revamped to add support for the latest LDAP Behera password policy standards (See OPENAM-613). The shortcomings mentioned in my blog previously was resolved in this revamp. However, it brings with it another issue - somehow, the connection pool wasn't working as expected (See OPENAM-590, OPENAM-627, OPENAM-1787).

A fix (in fact, a few iterations in total) was released. The latest being 3.0.0-OPENAMp3. (See maven repository here) (I'm not too sure about 3.0.0-Xpress1. This must be something new which I have not tested yet.)

A customer of ours in Thailand initially encountered connection timeout issue whenever the OpenAM is left "idle" for a period of time. e.g. the 1st user who attempted to log-in in the morning. It was an easy fix for this customer by replacing with opendj-ldap-sdk-3.0.0-OPENAMp3.jar. The system has been running superbly well since last December.

No good news for our internal SSO infrastructure though. :( In one of our test environment which we simulate the setup in Thailand, everything was running fine. No connection time out issue at all - for days, for weeks.

However, in our own production system, no luck. On and off, we'll get connection time out issue. It usually happens to the 1st staff who came into the office in the morning. So we started to zoom in and I must say it's super time-consuming.

When the problem occurred, we run jstack to determine the potential root cause. Pretty lucky here!

"http-apr-192.168.0.88-8080-exec-10" daemon prio=10 tid=0x00007f8b88010800 nid=0x1841 waiting on condition [0x00007f8b03e73000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007d92a3430> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
at org.forgerock.opendj.ldif.ConnectionEntryReader.getNextResponse(ConnectionEntryReader.java:437)
at org.forgerock.opendj.ldif.ConnectionEntryReader.hasNext(ConnectionEntryReader.java:264)
at com.sun.identity.authentication.modules.ldap.LDAPAuthUtils.searchForUser(LDAPAuthUtils.java:883)
at com.sun.identity.authentication.modules.ldap.LDAPAuthUtils.authenticateUser(LDAPAuthUtils.java:466)
at com.sun.identity.authentication.modules.ldap.LDAP.process(LDAP.java:526)
at com.sun.identity.authentication.spi.AMLoginModule.wrapProcess(AMLoginModule.java:1000)
at com.sun.identity.authentication.spi.AMLoginModule.login(AMLoginModule.java:1170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.identity.authentication.jaas.LoginContext.invoke(LoginContext.java:208)
at com.sun.identity.authentication.jaas.LoginContext.login(LoginContext.java:124)
at com.sun.identity.authentication.service.AMLoginContext.runLogin(AMLoginContext.java:557)
at com.sun.identity.authentication.server.AuthContextLocal.submitRequirements(AuthContextLocal.java:696)
at com.sun.identity.authentication.AuthContext.submitRequirements(AuthContext.java:1244)
at com.sun.identity.authentication.AuthContext.submitRequirements(AuthContext.java:1230)
at org.apache.jsp.getServerInfo_jsp._jspService(getServerInfo_jsp.java:131)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:432)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:390)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:334)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.forgerock.openam.validation.ResponseValidationFilter.doFilter(ResponseValidationFilter.java:44)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at com.sun.identity.setup.AMSetupFilter.doFilter(AMSetupFilter.java:95)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222) 0x00000007d92a3430>
:
:

We zoomed into LDAPAuthUtils.java, in particular the searchForUser method. This will attempt to query a user from the Active Directory via a LDAP connection from the Admin Connection Pool - adminPool.

We suspected the connection object is stale within the Admin Connection Pool. Somehow, the fix in 3.0.0-OPENAMp3 resolves the stale connections if the backend is a OpenDJ. It was not tested rigorously against a MS AD backend. ( Our guess here. Do not point a gun on us. :> ) Well, it's named opendj-ldap-sdk-3.0.0-OPENAMp3.jar for a reason right?

So, how do we resolve this issue? Firstly, we do not want to touch opendj-ldap-sdk-3.0.0-OPENAMp3.jar for sure. This to us is external to OpenAM core. The safest approach is to add a 1-liner fix to LDAPUtils.java.

The 1-liner fix attempts to perform a heartbeat on each connection object every 10 seconds. ( Of course, 10 seconds is debatable. You can choose your own magic number )

Our production OpenAM has been running smoothly since. I think about a month already.

PS: Is this a bug? I do not think so. That's why I never raise a bugster. Strange thing is against some MS AD, no problem. Against another set of MS AD, problem occurs. This has been haunting me for a while since last year after OpenAM 10.0 was released.

Wednesday, April 3, 2013

I have blogged quite a lot on this topic (see here and here) and I have also find out the Configurator is only expected to work for a particular setup (see here).

It will fail if customers happen to arrange the Organization and Administrator Authentication configurations in a way that was described in my previous posting.

Today, I am again setting up a pair of OpenAM for internal testing for a customer of mine. This time round, there is no customization at all. Just a plain vanilla setup.

And of course, the Configurator works like a charm as long as the First instance of OpenAM server is alive and running.

The Configurator will "contact" the 1st instance and grab all the configured parameters over. What's more, it will suggest the next best port numbers to use (since the default port numbers have already been used by the 1st instance).

Nice right? Only if the Organization and Administrator Authentication configurations are arranged in a way the Configurator likes.