Resources for the Check Point Community, by the Check Point Community.

Tim Hall has done it again! He has just released the 2nd edition of "Max Power". Rather than get into details here, I urge you to check out this announcement post. It's a massive upgrade, and well worth checking out. -E


Checkpoint 5400 100% CPU usage

Hi,

At my place of work we have two Check Point 5400 appliances running in a clustered configuration. We're struggling badly with them at the minute, as CPU usage seems to be maxed out most of the time.

We have acceleration templates, drop templates, etc. all enabled. I have tried to enable Hyper-Threading, but it looks like either the 5400 doesn't support it or it's disabled in the BIOS (one for the support contractors to resolve).

When running cpview I have noticed that one connection stands out when CPU usage is extremely high (90%-100%): an iSCSI TCP connection. I'm pretty sure we shouldn't have iSCSI traffic running through the firewall, and that's something I'll look into resolving when I'm back in the office. However, this traffic is only about 7,500 pps and a measly 65 Mbps or so of throughput; everything else is barely hitting three figures in pps and doesn't even register in the Mbps column.

The 5400 should be capable of 15,000 pps. What gives? Why is our appliance struggling so badly? Is it because it's iSCSI traffic, or is there something else I've missed? (Highly likely, as I'm very new to Check Point.)
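As a rough sanity check on the figures above, the average packet size follows directly from throughput and packet rate. A minimal Python sketch, assuming the cpview numbers are averages over the same interval and that Mbps means 10^6 bits per second:

```python
def avg_packet_size_bytes(throughput_mbps: float, pps: float) -> float:
    """Average packet size implied by a throughput figure and a packet rate."""
    bytes_per_second = throughput_mbps * 1_000_000 / 8  # assumes 1 Mbps = 10^6 bits/s
    return bytes_per_second / pps


# Figures from the post above: roughly 65 Mbps at roughly 7,500 pps
print(f"{avg_packet_size_bytes(65, 7_500):.0f} bytes per packet")  # 1083 bytes per packet
```

An average of roughly 1,083 bytes per packet means fairly large packets, which is consistent with bulk storage traffic; the packet rate itself is modest.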

Re: Checkpoint 5400 100% CPU usage

The comparison chart you linked actually says these boxes should be able to handle 150,000 new connections per second (under ideal testing conditions, of course). Setting up new connections is computationally expensive. They should be able to handle far more than that in terms of packets per second on existing connections.

Where are you seeing CPU usage maxed out? Different tools report different levels of usage as "100%". Some report an average of all cores (so 100% on one core would be reported as 25%, and 100% of four cores would be reported as 100%), while others add the cores together (so 100% on one core would be reported as 100%, but 100% of four cores would be reported as 400%).
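The two reporting conventions described above can be shown side by side. A small Python sketch taking hypothetical per-core busy percentages:

```python
def cpu_usage_views(per_core_busy):
    """Return (averaged, summed) CPU usage for a list of per-core busy percentages."""
    summed = sum(per_core_busy)
    averaged = summed / len(per_core_busy)
    return averaged, summed


print(cpu_usage_views([100, 0, 0, 0]))        # (25.0, 100): one core pegged on a 4-core box
print(cpu_usage_views([100, 100, 100, 100]))  # (100.0, 400): all four cores pegged
```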

What cluster mode are you running? You can check this with 'cphaprob state'.

Do you have a separate SmartCenter, or are these firewalls also management servers? To check this, run 'fwm ver'.

On the active member, what does your RAM usage look like? Check this with the 'free -m' command.

Depending on what features you have enabled, the boxes may be running low on RAM, which causes them to swap data out to the disk. Swapping data out to disk, then swapping other data back into RAM is a synchronous operation. The time spent doing that gets booked as consumed processor time, even though it isn't really the processor doing any work.
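To check for swapping, the line to look at in `free -m` output is the one starting with "Swap:", whose columns are total, used and free. A minimal Python sketch that pulls out the used figure; the sample output is invented for illustration, not taken from the poster's firewall:

```python
def swap_used_mb(free_output: str) -> int:
    """Return the 'used' swap figure (MB) from `free -m` output."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Swap:":
            # Columns on this line: Swap: total used free
            return int(fields[2])
    raise ValueError("no Swap: line found")


# Hypothetical sample output
sample = """\
             total       used       free     shared    buffers     cached
Mem:          7744       5210       2534          0        212       1480
-/+ buffers/cache:       3518       4226
Swap:         8189          0       8189
"""
print(swap_used_mb(sample))  # 0
```

Anything consistently above zero here on the active member would support the swapping theory.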

Re: Checkpoint 5400 100% CPU usage

Originally Posted by Bob_Zimmerman

Where are you seeing CPU usage maxed out? Different tools report different levels of usage as "100%". Some report an average of all cores (so 100% on one core would be reported as 25%, and 100% of four cores would be reported as 100%), while others add the cores together (so 100% on one core would be reported as 100%, but 100% of four cores would be reported as 400%).

We use SolarWinds to monitor all our kit, but we also SSH to each node in the cluster and run cpview; both report similar numbers.

Originally Posted by Bob_Zimmerman

What cluster mode are you running? You can check this with 'cphaprob state'.

Cluster Mode: High Availability (Active Up) with IGMP Membership

Number     Unique Address    Assigned Load  State
1 (local)  XXX.XXX.XXX.XXX   100%           Active
2                            0%             Standby

Originally Posted by Bob_Zimmerman

Do you have a separate SmartCenter, or are these firewalls also management servers? To check this, run 'fwm ver'.

We have a separate management appliance. Running that command gives the following output:

This is not a Security Management Server station

Originally Posted by Bob_Zimmerman

On the active member, what does your RAM usage look like? Check this with the 'free -m' command.

Depending on what features you have enabled, the boxes may be running low on RAM, which causes them to swap data out to the disk. Swapping data out to disk, then swapping other data back into RAM is a synchronous operation. The time spent doing that gets booked as consumed processor time, even though it isn't really the processor doing any work.

To be honest, in an effort to reduce the CPU usage we've basically turned 90% of the features off; only IPS is really running at the minute.

Typically it's now only consuming between 9% and 40% CPU. I'll grab the relevant outputs again when it's maxed out.

Distributed configuration (good). The CPU obviously wasn't too busy when these commands were run.

High CPU *might* be caused by an overloaded sync network between cluster members, in which case you will need to consider selective synchronization of services. To determine whether that is the case, please provide the output of the following as well:

fw ctl pstat

Edit: Using cpview -t, go back in time to a known period of high CPU utilization and please report the numbers displayed for Bits/sec, Packets/sec, Connections/sec and Concurrent connections on the Overview screen.

Since you suspect iSCSI traffic may be the culprit, make sure that traffic is not being dragged into the PXL/F2F path by appi (ensure you are not using Any as a destination in your APCL/URLF policy) or by IPS (which can be disabled immediately for new connections with the ips off command for testing). fwaccel conns can be used to verify which path the iSCSI traffic is being processed in.
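Because fwaccel conns can produce far more output than a terminal buffer holds, one option is to save it to a file (for example with a shell redirect) and filter it on a workstation. A crude Python sketch, assuming iSCSI is on its default port 3260, that the output is whitespace-separated columns and that ports are printed in decimal; it is a token match, not a real parser of the command's format:

```python
ISCSI_PORT = "3260"  # IANA-registered iSCSI port; assumes the targets use the default


def iscsi_lines(lines):
    """Yield lines of saved `fwaccel conns` output that contain 3260 as a column.

    Assumption: whitespace-separated columns with decimal port numbers,
    which may not hold on every software version.
    """
    for line in lines:
        if ISCSI_PORT in line.split():
            yield line.rstrip("\n")
```

Usage: pass it an open file, e.g. `with open("conns.txt") as f: for match in iscsi_lines(f): print(match)`, where the file name is hypothetical.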

Re: Checkpoint 5400 100% CPU usage

Sorry, what I meant by that was that our support contractors have passed this issue on to Check Point, and they suggested turning Hyper-Threading on! This issue has been going on far too long; I've been trying to resolve it myself and research as much as I can, which has led me to this forum.

Originally Posted by ShadowPeak.com

fw ctl pstat

Edit: Using cpview -t, go back in time to a known period of high CPU utilization and please report the numbers displayed for Bits/sec, Packets/sec, Connections/sec and Concurrent connections on the Overview screen.

Since you suspect iSCSI traffic may be the culprit, make sure that traffic is not being dragged into the PXL/F2F path by appi (ensure you are not using Any as a destination in your APCL/URLF policy) or by IPS (which can be disabled immediately for new connections with the ips off command for testing). fwaccel conns can be used to verify which path the iSCSI traffic is being processed in.

This image is a snip of the cpview Overview screen. I couldn't copy and paste that screen, and I don't think it would have come out in a nice format anyway.

This is the network tab:

I've run the fwaccel conns command as you suggested, but I'm not really sure how to decipher the output. I get an awful lot of output, more than SecureCRT can hold in its view buffer anyway! Can you explain what the following means: "PXL/F2F path by appi"? Apologies if this is a very simple question; I am very new to Check Point firewalls!

Re: Checkpoint 5400 100% CPU usage

Originally Posted by RichardPriest

Sorry, what I meant by that was that our support contractors have passed this issue on to Check Point, and they suggested turning Hyper-Threading on! This issue has been going on far too long; I've been trying to resolve it myself and research as much as I can, which has led me to this forum.

This image is a snip of the cpview Overview screen. I couldn't copy and paste that screen, and I don't think it would have come out in a nice format anyway.

This is the network tab:

CPU 2 is slammed at 100%, mostly in kernel/system space, while CPU 1 is 78% idle, so technically the overall firewall CPU load is 59%. Enabling the Dynamic Dispatcher is likely to help in this situation, as it will balance the traffic load more evenly between the two available cores. Enabling the DD is definitely your first course of action, and it may largely solve your problem.
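Why dynamic assignment helps can be illustrated with a toy model. This is not Check Point's actual algorithm: the "static" rule (core fixed by connection ID), the per-connection "cost" numbers and the least-loaded rule are all simplifying assumptions made for illustration:

```python
def static_dispatch(conns, cores):
    """Toy static model: each connection's core is fixed by its ID, regardless of load."""
    load = [0] * cores
    for conn_id, cost in conns:
        load[conn_id % cores] += cost
    return load


def dynamic_dispatch(conns, cores):
    """Toy dynamic model: each new connection goes to the currently least-loaded core."""
    load = [0] * cores
    for _conn_id, cost in conns:
        target = load.index(min(load))
        load[target] += cost
    return load


# Hypothetical (connection ID, CPU cost) pairs; the cost-60 entry stands in for one heavy flow
conns = [(0, 60), (2, 30), (4, 20), (1, 10)]
print(static_dispatch(conns, 2))   # [110, 10]
print(dynamic_dispatch(conns, 2))  # [60, 60]
```

Note that in both models a single connection lands entirely on one core, so a dispatcher that balances connections can spread many flows but cannot split one very heavy flow.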

Beyond that, there are three different paths that packets can take through the firewall, in order of increasing CPU overhead:

- Accelerated/SecureXL path: fastest, minimal CPU
- Medium Path (PXL): slower, more CPU
- Firewall Path (F2F): slowest, most CPU

You can see the percentages for each path with fwaccel stats -s. The statistics you provided earlier, while the firewall was not slammed, showed 79% of traffic in the Accelerated/SecureXL path. Now that the firewall is under heavy load, however, almost all of your traffic is in the Medium Path (PXL) based on the second screenshot. That is not unusual in itself, but I strongly suspect that high-speed LAN traffic between two internal networks (or between an internal network and a DMZ) is being pulled into the Medium Path, where there is much more CPU overhead.

The only blades you currently have enabled that can cause this effect are IPS and APCL. To figure out which one, try disabling IPS by unchecking its box on the firewall object and reinstalling policy. If you are still seeing CPU spikes, also disable Application Control and install policy. Once you determine which blade is causing the heavy PXL usage for LAN-speed traffic, we can troubleshoot further.
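fwaccel stats -s reports these shares directly; the arithmetic behind them is just each path's count over the total. A small Python sketch with hypothetical packet counts, chosen only so the accelerated share matches the 79% figure mentioned above:

```python
def path_percentages(accelerated, pxl, f2f):
    """Share of packets handled in each path, as percentages of the total."""
    total = accelerated + pxl + f2f
    return {
        "Accelerated (SXL)": round(100 * accelerated / total, 1),
        "Medium (PXL)": round(100 * pxl / total, 1),
        "Firewall (F2F)": round(100 * f2f / total, 1),
    }


# Hypothetical counts; the PXL/F2F split is invented for illustration
print(path_percentages(7_900, 1_900, 200))
```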

I've run the fwaccel conns command as you suggested, but I'm not really sure how to decipher the output. I get an awful lot of output, more than SecureCRT can hold in its view buffer anyway! Can you explain what the following means: "PXL/F2F path by appi"? Apologies if this is a very simple question; I am very new to Check Point firewalls!

We need to figure out which blade is the culprit before dealing with this command's output.

Re: Checkpoint 5400 100% CPU usage

Originally Posted by RichardPriest

I thought this was odd: why is the CPU usage so high, but with no interrupts?

Interrupts in this context mostly refer to the emptying of the NIC ring buffers via the SoftIRQ process. When a SND/IRQ core becomes much more heavily utilized than the others, SecureXL automatic interface affinity shifts the SoftIRQ processing away from the slammed CPU to the more lightly loaded SND/IRQ core(s). This helps ensure timely emptying of the interface ring buffers and avoids RX-DRPs of packets (visible with netstat -ni).
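RX-DRP is easiest to judge as a fraction of RX-OK per interface. A minimal Python sketch that computes that ratio from saved `netstat -ni` output; the sample output is invented, and the 0.1% threshold is a commonly quoted rule of thumb treated here as an assumption:

```python
RX_DRP_LIMIT = 0.001  # 0.1% of RX-OK: commonly quoted rule of thumb (assumption)


def rx_drop_ratios(netstat_output):
    """Map each interface to RX-DRP / RX-OK, parsed from `netstat -ni` output.

    Columns are located by header name, so the optional Met column does not matter.
    """
    lines = netstat_output.strip().splitlines()
    header_idx = next(i for i, line in enumerate(lines) if line.split()[:1] == ["Iface"])
    header = lines[header_idx].split()
    ok_col, drp_col = header.index("RX-OK"), header.index("RX-DRP")
    ratios = {}
    for line in lines[header_idx + 1:]:
        fields = line.split()
        rx_ok, rx_drp = int(fields[ok_col]), int(fields[drp_col])
        ratios[fields[0]] = rx_drp / rx_ok if rx_ok else 0.0
    return ratios


# Hypothetical output, not from the poster's firewall
sample = """\
Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0  1000000      0    500      0   900000      0      0      0 BMRU
eth2       1500   0   400000      0   2000      0   380000      0      0      0 BMRU
"""
for iface, ratio in rx_drop_ratios(sample).items():
    flag = "investigate" if ratio > RX_DRP_LIMIT else "ok"
    print(f"{iface}: {ratio:.4%} {flag}")
```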

Re: Checkpoint 5400 100% CPU usage

Originally Posted by ShadowPeak.com

Sync network & memory look fine.

CPU 2 is slammed at 100%, mostly in kernel/system space, while CPU 1 is 78% idle, so technically the overall firewall CPU load is 59%. Enabling the Dynamic Dispatcher is likely to help in this situation, as it will balance the traffic load more evenly between the two available cores. Enabling the DD is definitely your first course of action, and it may largely solve your problem.

Many thanks for that. I've actually just reloaded both firewalls, so the Dynamic Dispatcher should now be active on both units.

Originally Posted by ShadowPeak.com

Beyond that, there are three different paths that packets can take through the firewall, in order of increasing CPU overhead:

- Accelerated/SecureXL path: fastest, minimal CPU
- Medium Path (PXL): slower, more CPU
- Firewall Path (F2F): slowest, most CPU

You can see the percentages for each path with fwaccel stats -s. The statistics you provided earlier, while the firewall was not slammed, showed 79% of traffic in the Accelerated/SecureXL path. Now that the firewall is under heavy load, however, almost all of your traffic is in the Medium Path (PXL) based on the second screenshot. That is not unusual in itself, but I strongly suspect that high-speed LAN traffic between two internal networks (or between an internal network and a DMZ) is being pulled into the Medium Path, where there is much more CPU overhead.

The only blades you currently have enabled that can cause this effect are IPS and APCL. To figure out which one, try disabling IPS by unchecking its box on the firewall object and reinstalling policy. If you are still seeing CPU spikes, also disable Application Control and install policy. Once you determine which blade is causing the heavy PXL usage for LAN-speed traffic, we can troubleshoot further.

We need to figure out which blade is the culprit before dealing with this command's output.

Fantastic, I'll try disabling IPS when the load is particularly heavy, see if it improves matters, and then report back.

Re: Checkpoint 5400 100% CPU usage

Many thanks for that. I've actually just reloaded both firewalls, so the Dynamic Dispatcher should now be active on both units.

Fantastic, I'll try disabling IPS when the load is particularly heavy, see if it improves matters, and then report back.

Really appreciate the help, many thanks

A question and a few comments:

1- How do you know if DD is enabled on the firewalls? Can you provide the output of the command "fw ctl multik get_mode"?

- Enabling DD might make the issue worse in other ways. I had an issue where enabling DD caused traffic to be processed by one CPU core on the inbound side and a different CPU core on the outbound side, and as a result the traffic got dropped. It is a KNOWN issue with DD; I had a TAC case open with Check Point about it.

- Is it possible that the traffic you see as iSCSI is actually Microsoft DFS or Oracle RMAN traffic? Are you running Microsoft DFS or Oracle applications in your environment? You might want to investigate that. It is also a known issue with Check Point; I had a TAC case open with Check Point about that too.

- Just because you disable IPS does not mean that IPS is actually disabled. IPS is so integrated with the Check Point firewall that you can't simply uncheck the box and expect IPS to be completely OFF. It does not work that way. I learned a painful lesson on that as well.

Re: Checkpoint 5400 100% CPU usage

Is there any chance the iSCSI traffic is fragmenting? That might explain the high CPU usage, as fragments basically suck. You would need a packet capture to tell, since the firewall is going to reassemble the fragments before allowing or denying the traffic.
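To show how quickly fragment counts grow, here is a small Python sketch of standard IPv4 fragmentation arithmetic: each fragment carries at most MTU minus the IP header, rounded down to a multiple of 8 bytes. The scenario (9000-byte jumbo packets crossing a 1500-byte MTU link) is hypothetical; iSCSI runs over TCP, which normally sizes segments to avoid fragmentation, and packets with DF set would be dropped rather than fragmented:

```python
import math

IP_HEADER = 20  # bytes; assumes an IPv4 header without options


def ipv4_fragment_count(packet_len, mtu):
    """Number of IPv4 fragments needed to carry one packet over a link with this MTU."""
    if packet_len <= mtu:
        return 1
    # Fragment payloads (except the last) must be a multiple of 8 bytes
    per_frag = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((packet_len - IP_HEADER) / per_frag)


print(ipv4_fragment_count(9000, 1500))  # 7
```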

Also, have you checked whether the iSCSI traffic is really IPX encapsulated over a layer 2 GRE tunnel that is streaming Highlander 2?

Re: Checkpoint 5400 100% CPU usage

OK, this is the result of fwaccel stats -s after disabling the IPS blade on the cluster in SmartDashboard. Now that the Dynamic Dispatcher is enabled, the CPU usage has never gone as high, but when the iSCSI traffic is present everything has a definite slowness to it (RDP sessions occasionally time out, lots of egg timers when using applications, etc.).

Re: Checkpoint 5400 100% CPU usage

Originally Posted by RichardPriest

OK, this is the result of fwaccel stats -s after disabling the IPS blade on the cluster in SmartDashboard. Now that the Dynamic Dispatcher is enabled, the CPU usage has never gone as high, but when the iSCSI traffic is present everything has a definite slowness to it (RDP sessions occasionally time out, lots of egg timers when using applications, etc.).

Does that output mean that none of the packets are now going through the medium path?

That looks pretty good: 75% of traffic is now accelerated even while passing iSCSI traffic, and 23% is in the Medium Path. I'm surprised things still feel slow for you with statistics like those. Try disabling APCL now as well and see if the "slowness" improves; once APCL is disabled, the amount of fully accelerated traffic should be >90%. Your APCL policy may need some tuning.

It is unlikely that the iSCSI traffic is fragmented, as another poster suggested earlier: fragmented traffic would go F2F (not PXL), and fw ctl pstat did not show any excessive fragmentation statistics.

To respond to a different poster: if the Dynamic Dispatcher is causing problems with certain types of traffic, it can be disabled on the fly for specific port numbers. This procedure is not documented, but it did get a mention in my book. It should not be needed for this iSCSI issue:

Enabling the Dynamic Dispatcher on R77.30 after loading the latest GA Jumbo HFA is about as close to a no-brainer as it gets, and I have not personally witnessed any situation where enabling the Dynamic Dispatcher caused problems with the firewall or the applications traversing it. But interestingly enough, there appears to be a real-time mechanism to partially disable the Dynamic Dispatcher on a per-port basis with these kernel variables that can be set or queried via the fw ctl set/get commands:

dynamic_dispatcher_bypass_add_port
dynamic_dispatcher_bypass_ports_number
dynamic_dispatcher_bypass_remove_port
dynamic_dispatcher_bypass_show_ports

These kernel variables did not exist in the initial R77.30 code release but seem to have been added in one of the R77.30 Jumbo HFAs; they also exist in the R80.10 firewall code. Be warned however that these variables are undocumented and tampering with them is most definitely not supported. But if certain applications are proven to be incompatible with the Dynamic Dispatcher for some reason, it is worth a call to the Check Point TAC to inquire about this hidden feature rather than disabling the Dynamic Dispatcher completely.
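The idea of a per-port bypass can be illustrated with a toy model. It does not use or show how to set the undocumented kernel variables above; it simply assumes that connections to a bypassed port fall back to a fixed, ID-based core choice while everything else goes to the least-loaded core. All data is hypothetical:

```python
def dispatch(conns, cores, bypass_ports):
    """Toy model of a per-port Dynamic Dispatcher bypass.

    Connections to a bypassed port use a fixed, ID-based core (an assumed
    stand-in for static dispatch); all others go to the least-loaded core.
    """
    load = [0] * cores
    placement = {}
    for conn_id, dport, cost in conns:
        if dport in bypass_ports:
            core = conn_id % cores
        else:
            core = load.index(min(load))
        load[core] += cost
        placement[conn_id] = core
    return placement, load


# Hypothetical (connection ID, destination port, CPU cost) tuples; port 5000 plays the
# role of an application proven to be incompatible with the Dynamic Dispatcher
conns = [(0, 5000, 50), (2, 443, 30), (3, 443, 20)]
print(dispatch(conns, 2, bypass_ports={5000}))
```

The bypassed connection keeps a predictable core, while the remaining connections are still balanced around it.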

Re: Checkpoint 5400 100% CPU usage

Originally Posted by ShadowPeak.com

That looks pretty good: 75% of traffic is now accelerated even while passing iSCSI traffic, and 23% is in the Medium Path. I'm surprised things still feel slow for you with statistics like those. Try disabling APCL now as well and see if the "slowness" improves; once APCL is disabled, the amount of fully accelerated traffic should be >90%. Your APCL policy may need some tuning.

Good to know, thanks!
I've just gone to re-enable IPS and disable APCL via SmartDashboard. It looks like APCL is already disabled!

Re: Checkpoint 5400 100% CPU usage

IPS is now re-enabled and everyone is back at work, so the traffic flowing through the firewall is much greater now.

With the Dynamic Dispatcher enabled, both CPU cores are currently hovering around 40-45% each, so in real terms it would have been 90%+ before it was enabled. The iSCSI traffic has now been removed too, so that's helping an awful lot.

Is there anything I can do to reduce the CPU usage further?

EDIT:

Actually, looking in cpview (see attached screenshot), there are considerably more PXL connections than SecureXL connections. Is there a way I can find out why these are hitting the Medium Path and not the accelerated path?

Originally Posted by RichardPriest

With the Dynamic Dispatcher enabled, both CPU cores are currently hovering around 40-45% each, so in real terms it would have been 90%+ before it was enabled. The iSCSI traffic has now been removed too, so that's helping an awful lot.

Is there anything I can do to reduce the CPU usage further?

In my book the stated goal is to have about 50% average utilization on the CPUs during the firewall's busiest period, thus allowing enough "headroom" for the firewall to potentially burst at double that speed. This is a realistic goal in most environments and it sounds like you are there!
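The 50% goal can be checked with simple arithmetic. A Python sketch that averages per-core figures and estimates a rough burst multiple, assuming (a simplification) that CPU load scales linearly with traffic:

```python
TARGET_AVG_PCT = 50.0  # busiest-period goal quoted above


def headroom(per_core_busy_pct):
    """Return (average busy %, meets the 50% goal?, rough burst multiple before saturation)."""
    avg = sum(per_core_busy_pct) / len(per_core_busy_pct)
    burst = float("inf") if avg == 0 else 100.0 / avg
    return avg, avg <= TARGET_AVG_PCT, burst


# Per-core figures reported in the thread: roughly 40-45% on each of two cores
avg, ok, burst = headroom([40, 45])
print(f"average {avg:.1f}%, goal met: {ok}, roughly {burst:.2f}x burst headroom")
```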

Originally Posted by RichardPriest

EDIT:

Actually, looking in cpview (see attached screenshot), there are considerably more PXL connections than SecureXL connections. Is there a way I can find out why these are hitting the Medium Path and not the accelerated path?

EDIT again:

This is 30 minutes after turning IPS off via the CLI.

Having most traffic in PXL is normal on most firewalls; F2F is what you want to avoid. If the iSCSI traffic were still present and/or you hadn't reached the 50% utilization goal, the next steps, given the blades you have enabled, would be:

1) Disable any IPS signatures in the current IPS profile with a "Performance Impact" of Critical or High. This potentially allows more traffic to be processed in the SXL/accelerated path.
2) Tune the APCL/URLF policy: get rid of the "Any Any Any Recognized Accept" cleanup rule at the bottom and make sure that "Any" is not used in the source or destination of any APCL/URLF rule. This exempts high-speed LAN-LAN or LAN-DMZ traffic from APCL/URLF processing in PXL and potentially makes more traffic eligible for the SXL/accelerated path.
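The two steps above amount to a checklist that can be expressed as simple filters. A minimal Python sketch over hand-entered, hypothetical data; the `AppRule` type and the dictionary of protections are illustrative stand-ins, not an export format of any Check Point tool, and in practice this review is done in SmartDashboard:

```python
from dataclasses import dataclass

HIGH_IMPACT = {"Critical", "High"}


@dataclass
class AppRule:
    name: str
    source: str
    destination: str


def high_impact_protections(protections):
    """Step 1 candidates: protections whose Performance Impact is Critical or High."""
    return [name for name, impact in protections.items() if impact in HIGH_IMPACT]


def rules_using_any(rules):
    """Step 2 candidates: APCL/URLF rules with Any as source or destination."""
    return [r.name for r in rules if "Any" in (r.source, r.destination)]


# Hypothetical data for illustration only
protections = {"Protection A": "High", "Protection B": "Low", "Protection C": "Critical"}
rules = [
    AppRule("Block social media", "LAN_Users", "Internet"),
    AppRule("Cleanup", "Any", "Any"),
]
print(high_impact_protections(protections))  # ['Protection A', 'Protection C']
print(rules_using_any(rules))                 # ['Cleanup']
```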

Re: Checkpoint 5400 100% CPU usage

Originally Posted by ShadowPeak.com

In my book the stated goal is to have about 50% average utilization on the CPUs during the firewall's busiest period, thus allowing enough "headroom" for the firewall to potentially burst at double that speed. This is a realistic goal in most environments and it sounds like you are there!

Having most traffic in PXL is normal on most firewalls; F2F is what you want to avoid. If the iSCSI traffic were still present and/or you hadn't reached the 50% utilization goal, the next steps, given the blades you have enabled, would be:

1) Disable any IPS signatures in the current IPS profile with a "Performance Impact" of Critical or High. This potentially allows more traffic to be processed in the SXL/accelerated path.
2) Tune the APCL/URLF policy: get rid of the "Any Any Any Recognized Accept" cleanup rule at the bottom and make sure that "Any" is not used in the source or destination of any APCL/URLF rule. This exempts high-speed LAN-LAN or LAN-DMZ traffic from APCL/URLF processing in PXL and potentially makes more traffic eligible for the SXL/accelerated path.

All this is covered step by step in my book.

Let's say steps #1 and #2 are done as you suggested and the CPU is still high. What is the next step?