Monday, August 8, 2011

As we know, Oracle recently released a product called OEG (Oracle Enterprise Gateway), which is meant to replace the OWSM gateway from the 10g stack. I was comparing it with my 10g eyes.

1. The entire 11g stack is deployed on WebLogic; at installation time you choose a user (which is pretty much always weblogic) and your preferred password. OEG is a separate installation, and it doesn't ask for a username and password during install. (It is actually admin/changeme, which you can find buried in some documents.)

Ah, this reminds me of the same thing in 10g: the whole stack uses oc4jadmin, but OWSM has a separate user "admin" with the Oracle-chosen password "oracle".

2. The OEG password is changeme, but when I logged in to OEG using http://localhost:8090/runtime/management/ManagementAgent, I could not find a way to change it. It was not very hard, though; I later figured out it is in Policy Studio, as below:

I don't have to tell you how hard it is to change the "admin/oracle" password in 10g, or that it is almost impossible to add a new user in OWSM.

3. BPEL compatibility :
I remember (circa 2005, during the first OWSM release) that when you registered a BPEL service with OWSM it wouldn't work; you had to copy the schemas manually. Well, I could deal with that for a while, and later patches fixed those problems.
With the latest release of OEG (11.1.1.5.0), Java web service registration went fine, but when I tried to register a BPEL composite (deployed on SOA 11.1.1.5.0), it didn't register at all. Of course there was nothing in any of the log files. I later realized that it only likes web services that comply with the WS-I standard (http://www.ws-i.org/). After downloading the WS-I tools, I found that 11g composites (at least the exposed BPEL web services) are not WS-I compliant.
I probably have to make it work through Policy Studio somehow, but it would be nice if the gateway could simply accept my BPEL WSDL URL.

We were using XrefUtility and realized that it allows us to run a select clause against the source and target databases, so we can manipulate the data before comparing it with the Xref table values. In our case, we also had to manipulate the Xref values before comparing them with the source and target databases.

For example, it is pretty common for BRM values in the Xref table to carry a revision number. Those revision numbers change in BRM without the AIA database being updated. To solve that problem, the revision number has to be truncated in the AIA Xref and target queries. I modified XrefUtility to achieve this and enhanced the UI as below:
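To make the truncation concrete, here is a sketch of the idea in SQL. It assumes the revision is the final space-separated token of the BRM value and uses AIA's XREF_DATA table; the column value 'BRM_01' is a placeholder for your actual Xref column name.

```sql
-- Sketch: strip a trailing space-separated revision token before comparing.
-- Assumes e.g. '0.0.0.1 /account 12345 7' should compare as '0.0.0.1 /account 12345'.
SELECT value AS raw_value,
       CASE
         WHEN INSTR(value, ' ', -1) > 0
           THEN SUBSTR(value, 1, INSTR(value, ' ', -1) - 1)
         ELSE value
       END AS truncated_value
FROM   xref_data
WHERE  xref_column_name = 'BRM_01';  -- placeholder column name
```

The same CASE expression can be applied on the target-query side so that both halves of the comparison are revision-free.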

Another thing we found is that going through eight or so reports is cumbersome. If you look at my blog, you can see each report is a combination of different cases. It would be nice to have all discrepancies in one or two reports, and to have more control over the queries. I ended up just writing my own utility in SQL, as below:

Note that we can create materialized views so the queries don't affect performance, and refresh the views only when required. It then becomes pretty simple to generate all the reports, with both counts and data, in one place:
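Here is a sketch of how the three views behind the queries below could be built: sv and tv over the source and target tables, and xv pivoted out of AIA's XREF_DATA. The object and column names are assumptions from my test schema, so adjust them to yours.

```sql
-- sv/tv: ids as the source and target systems see them (hypothetical names)
CREATE MATERIALIZED VIEW sv AS
  SELECT id AS sv_id FROM source_table;

CREATE MATERIALIZED VIEW tv AS
  SELECT id AS tv_id FROM target_table;

-- xv: one row per cross-reference, pivoted from AIA's XREF_DATA
CREATE MATERIALIZED VIEW xv AS
  SELECT src.value AS sv_id, tgt.value AS tv_id
  FROM  (SELECT row_number, value FROM xref_data
         WHERE xref_table_name  = 'SOURCE_TARGET_XREF'
           AND xref_column_name = 'SOURCE') src
  FULL OUTER JOIN
        (SELECT row_number, value FROM xref_data
         WHERE xref_table_name  = 'SOURCE_TARGET_XREF'
           AND xref_column_name = 'TARGET') tgt
  ON src.row_number = tgt.row_number;
```

Refreshing with DBMS_MVIEW.REFRESH only when a reconciliation run is due keeps the load off the transactional tables.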

count for sv_id null in xv - shows all discrepancies where the source id is missing in the AIA Xref database
select count(*) from xv where sv_id is null
select * from xv where sv_id is null

count for tv_id null in xv - shows all discrepancies where the target id is missing in the AIA Xref database
select count(*) from xv where tv_id is null
select * from xv where tv_id is null

count for rows in sv not in xv - shows all discrepancies where data is in the source database but missing in the AIA Xref
select count(*) from sv where not exists ( select * from xv where xv.sv_id = sv.sv_id )
select * from sv where not exists ( select * from xv where xv.sv_id = sv.sv_id )

count for rows in tv not in xv - shows all discrepancies where data is in the target database but missing in the AIA Xref
select count(*) from tv where not exists ( select * from xv where xv.tv_id = tv.tv_id )
select * from tv where not exists ( select * from xv where xv.tv_id = tv.tv_id )
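Since all four checks run against the same views, they can also be collapsed into a single summary report:

```sql
-- One consolidated discrepancy report built from the same four checks
SELECT 'source id missing in xref' AS issue, COUNT(*) AS cnt
  FROM xv WHERE sv_id IS NULL
UNION ALL
SELECT 'target id missing in xref', COUNT(*)
  FROM xv WHERE tv_id IS NULL
UNION ALL
SELECT 'in source but not in xref', COUNT(*)
  FROM sv WHERE NOT EXISTS (SELECT 1 FROM xv WHERE xv.sv_id = sv.sv_id)
UNION ALL
SELECT 'in target but not in xref', COUNT(*)
  FROM tv WHERE NOT EXISTS (SELECT 1 FROM xv WHERE xv.tv_id = tv.tv_id);
```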

Tuesday, May 3, 2011

Recently I had a chance to work on XrefUtility. It is designed for the COMM PIPs to check for Xref data anomalies, but it can be used for any Xref (as long as the common column is named COMMON). It generates multiple reports showing the different anomalies in the Xref database. Basically, it provides two distinct and valuable features:
- Generates multiple reports for Xref anomalies.
- Provides an interface to fix them.

Installation: You can download XRefUtility from Oracle Support (9326510). An installation note is included, but I ran into multiple jar file issues during installation. Nothing major; after adding the missing jars or fixing the paths, the installation goes smoothly.

It basically installs a FixXref BPEL process and a web application called XrefUtility. FixXref should be installed in the default domain. I tried installing it in a different domain, which worked fine during installation, but at run time it expects to be in the default domain.

Usage: usage is well documented in the guide, but for my own purposes I created my own Xref table and added my own data to create every possible scenario in which an Xref can have issues.

I created a source table, a target table, and an Xref called Source_Target_Xref. I also created the metadata Xref, which is not required for XrefUtility but is definitely required if you are planning to use FixXref.

It can generate multiple reports. As mentioned earlier, I created all possible combinations of Xref anomalies, and the reports caught pretty much all of them. The following diagram shows which report catches which type of anomaly.

Reports

Along with the reports, it also generates a text file that can be used as input to the FixXref interface. I did find some issues with FixXref, but mostly tried to avoid it, since it is better to fix the code or process that causes such anomalies. I will probably blog about FixXref in the future.

Friday, March 25, 2011

We started getting this error message in the log file all of a sudden, and we were not sure how dequeue had been disabled for the queue.

[Linked-exception]
java.sql.SQLException: ORA-25226: dequeue failed, queue JMSUSER.AIA_CUSTOMERJMSQUEUE is not enabled for dequeue
ORA-06512: at "SYS.DBMS_AQIN", line 571
ORA-06512: at line 1
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:138)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:316)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:282)
at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:639)
at oracle.jdbc.driver.T4CCallableStatement.doOall8(T4CCallableStatement.java:184)
at oracle.jdbc.driver.T4CCallableStatement.execute_for_rows(T4CCallableStatement.java:873)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1161)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3001)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3043)
at oracle.jms.AQjmsConsumer.dequeue(AQjmsConsumer.java:1601)
at oracle.jms.AQjmsConsumer.receiveFromAQ(AQjmsConsumer.java:916)
at oracle.jms.AQjmsConsumer.receiveFromAQ(AQjmsConsumer.java:835)
at oracle.jms.AQjmsConsumer.receive(AQjmsConsumer.java:776)
at oracle.tip.adapter.jms.JMS.JMSMessageConsumer.consumeBlockingWithTimeout(JMSMessageConsumer.java:405)
at oracle.tip.adapter.jms.inbound.JmsConsumer.run(JmsConsumer.java:330)
at oracle.j2ee.connector.work.WorkWrapper.runTargetWork(WorkWrapper.java:242)
at oracle.j2ee.connector.work.WorkWrapper.doWork(WorkWrapper.java:215)
at oracle.j2ee.connector.work.WorkWrapper.run(WorkWrapper.java:190)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java:830)
at java.lang.Thread.run(Thread.java:595)

It was quite easy to fix. The following query shows the queues that have dequeue or enqueue disabled:

select * from all_queues where trim(enqueue_enabled) = 'NO' or trim(dequeue_enabled) = 'NO'

Once we identified which queue has the issue, we can run the following to enable dequeue and enqueue for that particular queue:
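For the queue from the stack trace above, DBMS_AQADM.START_QUEUE enables both operations (run it as a user with AQ administrator privileges):

```sql
BEGIN
  DBMS_AQADM.START_QUEUE(
    queue_name => 'JMSUSER.AIA_CUSTOMERJMSQUEUE',
    enqueue    => TRUE,   -- enable enqueue
    dequeue    => TRUE);  -- enable dequeue
END;
/
```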

Wednesday, March 23, 2011

In BPEL, if we don't specify the version during partner link invocation, it quite conveniently uses the default one. But in ESB, especially when ESB calls BPEL via native binding, the version of the BPEL process is hard-coded inside the routing rule. This was quite annoying, because this pattern is heavily used in AIA, and redeploying a newer version of a BPEL process requires changing the routing rule in ESB.

In 10.1.3.5, Oracle released a new feature that solves this problem. After the 10.1.3.5 upgrade, when you deploy any BPEL process, it creates a separate version in BPELSystem called "default", which is visible in the ESB Console as well as in JDeveloper.

If we select "default" in the routing rule, then the default version of the BPEL process is called from ESB during native invocation. Therefore, if we change the BPEL default version or redeploy the process with a higher version, no change is required in the ESB routing rules.

Nothing really new, but I have had my eye on this promiscuous mode for a while and never had time to sit down and put it to the test. It is set at the server level, accessible via collaxa-config.xml or the BPELAdmin console.

So in 10.1.3.3, we had a simple property:
productionServer = true/false (false by default)
When true, deploying the same version of a BPEL process is not allowed, and you get the following error message:

is being re-deployed to a Production Server with same revision number. Please modify the revision for the process.

In 10.1.3.5, productionServer is deprecated and replaced with the serverMode property. Per the documentation:

serverMode = production/development/promiscuous
Identifies the server mode. Currently supported server modes are:
* production - re-deployment of process with same revision is not allowed.
* development - re-deployment of process with same revision marks the existing instance as stale.
* promiscuous - re-deployment of a process with the same revision does not mark instances stale; work items are migrated.
The default value is "development".

I thought promiscuous mode is something everybody would want, so I decided to experiment with it. First I changed serverMode to promiscuous and restarted the server.

Test case 1 (Synchronous) :

I deployed a synchronous process A with a couple of activities and fired some instances of it. Then I completely changed all the activity names in process A, added and deleted some activities, and redeployed process A with the same version. I then fired some more instances of process A.

Results: pretty good and as expected. None of the instances went stale. Old instances were rendered against the old code, and new instances against the new BPEL code.

Now that synchronous processes had held firm, it was time to experiment with an asynchronous flow.

Test case 2 (asynchronous ) :

I deployed an asynchronous process B with a few activities. Process B calls another asynchronous process C and then performs a receive activity; this ensures B reaches a dehydration point so I can get some in-flight instances. I ran some transactions so that process B had a few instances in flight.

Then I drastically changed process B, adding and deleting activities (but keeping the call to C and the receive from C), and deployed the new code with the same version. I created some new transactions and tried to complete some old ones.

Results: quite fantastic. Nothing went stale. In-flight instances of process B on the older code were migrated to the newer code after the dehydration point and completed successfully with the newer version. This seems to be a very powerful but delicate feature, and whoever does the deployment should be very cautious about using promiscuous mode with in-flight workflows.

Test case 3 (asynchronous human workflow) :

I created a simple process D with a Human Task and created some in-flight workflows for it. I made some minor changes in process D and redeployed the new code with the same version.

Result: all BPEL instances remained in flight, but all the human tasks went stale, or at least didn't show up in the WLA. This definitely didn't pass the expected results.

Conclusion
It is definitely a very strong feature and avoids some of the hassles with stale activities. In general, though, it would be great to have it at the process level instead of the server level; that way, we could use promiscuous mode only for synchronous processes and simply version the asynchronous ones. I could not find such a feature in 11g, so I guess I need to look around to see whether composites have something similar.

Wednesday, March 2, 2011

If I had to come up with a list of the top 5 things I hate most, Oracle ESB (10g only) would definitely make it. I have been working with it for almost 5 years, on pretty much every single project, from complex business processing in ESB to millions of transactions per hour using just ESB. Each time, ESB never fails to disappoint me further. Every new release since 10.1.3.1 fixes hundreds of bugs and introduces another hundred. Take "Unable to build the instance relationship": someone should get a Nobel Prize for solving it, as it has been an unsolved mystery for 5 years and seems more complex than E=mc². I have personally heard the statement "ESB is an unreliable product" from internal Oracle SOA gurus, and I completely disagree with it; I think that calling ESB a "product" at all fundamentally breaks the laws of software ethics.

Anyway, we recently did the 10.1.3.5.2 upgrade and saw that the ESB connection pool (ESBPool) was growing by 100 connections per hour. After multiple sleep-deprived nights, we found that it was another ESB bug. There were constant error messages like "Unclosed connection detected : 'oracle.oc4j.sql.spi.ConnectionFinalizer@" in log.xml. After creating several different test cases, I found that it occurred with a very specific pattern:

AQ -> ESB Consumer -> ESB Async Routing Rule -> Async BPEL

The async routing rule was the culprit, and such flows exist in the OOTB AIA code. Once we changed it from async to sync, the pool never went above 2 connections, and life was back to normal.

Wednesday, February 23, 2011

Well, not exactly cutting-edge stuff, but it did cause us a lot of problems during the 10.1.3.5 upgrade. It seems that 10.1.3.5, and especially 10.1.3.5.2, performs a lot of XSD validations, and a BPEL process can fail to load if any of the validations fail.

BPELConsole doesn't allow the user to click on such a process. The appropriate action would be to undeploy or redeploy those processes, but without clicking on the process you cannot undeploy it.

1) One way to undeploy such a process is to clean up the database. We tried the following:
a) clean up the database tables
PROCESS
PROCESS_DEFAULT
PROCESS_DESCRIPTOR
PROCESS_LOG
SUITCASE_BIN
b) clean up the BPEL process under tmp directory (bpel/domains/domain-name/tmp)
c) clean up ESB in BPELSystem
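For step (a), the deletes would look something like the sketch below. The key column process_id is a guess, not verified against the orabpel schema, and SUITCASE_BIN is keyed differently; inspect the tables (DESC process, etc.) and take a backup before deleting anything.

```sql
-- HYPOTHETICAL sketch only: key columns vary by release; back up the schema first.
DELETE FROM process_log        WHERE process_id = 'BrokenProcess';
DELETE FROM process_descriptor WHERE process_id = 'BrokenProcess';
DELETE FROM process_default    WHERE process_id = 'BrokenProcess';
DELETE FROM process            WHERE process_id = 'BrokenProcess';
-- Locate the matching SUITCASE_BIN row via the PROCESS entry before removing it.
COMMIT;
```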

Thursday, February 10, 2011

I think we all know that transaction=participate in 10.1.3.3 was replaced by handleTopLevelFault in 10.1.3.5. I never realized the impact it had on our processes, or how Oracle keeps changing its mind about what this value means and how the server behaves.

First of all, in 10.1.3.3, BPEL processes DON'T participate in a global transaction by default. This means that if you have process A -> process B -> process C, and

In this case, upon the failure of the invoke of C, the entries in tables 1 and 2 will never roll back, even if the data sources are configured for XA transactions. If you do put transaction=participate in your process configuration, then this process and its operations join the global transaction, and the throw activity will roll back the inserts into tables 1 and 2.

[Note: if you don't have a throw, it acts as if there were no error, and nothing ever rolls back in any scenario, which is quite expected.]

Now, in 10.1.3.4, they replaced transaction=participate with handleTopLevelFault (where false corresponds to participating). The default value in the code [com.collaxa.cube.engine.core.map.BPELProcessBlock] was true, so the behavior was exactly the same as in 10.1.3.3: nothing participates in the transaction unless you explicitly specify it. You cannot define this property at the domain level; it is only supported at the process level.

In 10.1.3.5, the default value in [com.collaxa.cube.engine.core.map.BPELProcessBlock] changed to "false", which caused us numerous problems. Following AIA best practice, we always have our own try/catch block; we do some processing in the catch block (e.g., update a database or a queue) and then rethrow the error, so the caller gets the details and the instance shows as faulted. We had this pattern all over the place. In 10.1.3.3 we could see the data inserted from the catch block, but in 10.1.3.5 everything just rolled back, because processes started participating in the transaction by default.

I agree it is not a very good coding practice, but it worked in 10.1.3.3, and nobody realized it could cause transaction issues during the upgrade. The ideal solution would be to run the catch block in a local transaction, so that a rollback doesn't undo your work in the catch activity.

Anyway, we had 200+ processes and couldn't afford to change the code for each one, so I came up with a neat solution that worked without any issues: I wrote a BPELClient API utility to insert handleTopLevelFault=true into the majority of the BPEL processes, and that's it. No restart, no code change, and life was back to normal. The only flaw is that if somebody redeploys the code and overwrites the descriptor, the property is lost; but we can just run the utility again in that case.
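The descriptor change the utility makes boils down to a bpel.xml fragment along these lines (a sketch; the process id and src values are placeholders):

```xml
<BPELSuitcase>
  <BPELProcess id="MyProcess" src="MyProcess.bpel">
    <configurations>
      <!-- keep the 10.1.3.3 behavior: do not join the caller's transaction -->
      <property name="handleTopLevelFault">true</property>
    </configurations>
  </BPELProcess>
</BPELSuitcase>
```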

If your process has handleTopLevelFault = true, faulted instances show in both the Flow view and the Audit view in BPELConsole.

If your process has handleTopLevelFault = false (or doesn't have the property at all in 10.1.3.5), faulted instances do not show in the Flow view, only in the Audit view. That was a new enhancement added in 10.1.3.5.

Thursday, February 3, 2011

In one project, we were trying to push a lot of data to BAM (trace data along with the data required for reports). We needed a solid approach where we could control the amount of data going to BAM at run time, just like a logging level (info/debug). I agree that for infrastructure and instance tracing, CAMM, AmberPoint, or Grid Control would be a better solution, but not all clients have the luxury of investing in those tools. Sometimes EM and BAM can help with basic troubleshooting.

The only out-of-the-box approach I found was to use EM to disable monitoring at the composite level, which achieves pretty much nothing if you are using the BAM adapter (e.g., in a Mediator). We can also use the following at the composite level, but it requires redeployment:

<property name="disableProcessSensors" many="false">true</property>

The following is the fairly intuitive approach we found to control the data going to BAM; it lets us stop, start, and throttle BAM publishing at run time. A BAM sensor action has a feature called Filter, from which we can call any XSLT expression. So we wrote a custom XSLT function that reads some configuration from the JVM, and we control whether the sensor publishes the event by returning true or false from this function.

We used a similar approach in Mediator, which was quite simple since it also provides a filter expression.

Tuesday, January 18, 2011

Recently we upgraded to 10.1.3.5 and found that faulted instances in BPELConsole no longer show the flow view. You can still see the details from the Audit tab, but the Flow tab shows the following (JavaScript) error:

- First of all, faulted instances are the ones you want to look at in BPELConsole first
- The Audit tab is very busy; the Flow view is graphical and very compact

Upon following up with Oracle, we found that this is expected behavior! (http://download.oracle.com/docs/cd/E14101_01/doc.1013/e15342/bpelrn.htm#BABCGIBB). Since when did a JavaScript error on a web page become expected behavior?

Anyway, it was time to get back into the dirty details of the JSPs and JavaScript, and I was finally able to fix the issue just by changing the flowviewer.js file. Here are the steps, which should be applicable to any version:

1) Open $ORACLE_HOME/j2ee/oc4j_soa/applications/orabpel/console/lib/flowviewer.js
2) You can use an online beautifier, as this JS is very compact. I used:
http://jsbeautifier.org/
3) Search for open.faulted; you should see something like:

This is a known issue: during a fresh install there is no gateway, so there is no component called C0003001. If you create a gateway component, it will get the id C0003001 and this error should go away.

In our environment, we had component C0003001 as the gateway, but we decided to deactivate it and create a new one, and the error showed up again. Deleting components or recreating the gateway doesn't guarantee the C0003001 id. On the gateway, we started getting the error: Gateway is not ready to process requests.

To resolve the issue, we cleaned up the entire OWSM repository and reset the sequence as below: