Already some success .. the unix team entered a static route (hope that's the correct terminology) so that backup traffic actually routes through the backup (eth3) interface. We went from 1 MB/s on the interface to over 30 MB/s.

Just wanted to add something to this thread that I just found out about: the SECTION SIZE parameter to the RMAN backup command. This helps keep multi-channel backups somewhat balanced when dealing with the large datafiles often found in an Exadata environment. In my case, I have had 1 channel chewing away for 4 days now on the last datafile, an 8 TB datafile. It would have been nice to have the whole power of the Exadata pushing that datafile to tape.

Daryl, from what I read I assume that you have a very big database with very big datafiles (datafiles of more than 1 TB).
I had the opposite problem with my DWH database. It had 3500 datafiles (many of them very small). I reorganized the datafiles
to have only 8 datafiles for every 1 TB (128 GB per datafile). From that change alone, the whole database now has only about 100 datafiles.
Without changing anything else, the backup went from 7h30m to 5h30m.

Please accept my apologies if I cover some things again.
h1. First – Verify the bus and device speeds from the DB Nodes through to the Tape Drives.
1. DB Nodes send out on eth3 at a rate of 1Gbit/sec – there are 8 DB Nodes – so maximum transfer rate across all 8 DB Nodes should be 8Gbit/sec
2. The Media Servers are receiving data in on a 10GigE card that is installed into a PCI Express 1.0 (X4) slot so should be able to receive at 8Gbit/sec. If the 10GigE card is installed into a PCI Express 1.0 (x8) or PCI Express 2.0 (x4) slot then this increases to 16Gbit/sec
3. The Media Server sends data to the FC SAN using a single 4Gbaud Fiber Channel card so the maximum transfer rate to the tape library is 3400Mbit/sec. The 4Gbaud Fiber Channel is installed into a PCI Express 1.0 (x4) slot which is 8Gbit/sec. If the Fiber Channel card is a dual ported 4GBaud FC card – then the maximum transfer rate to the SAN is 6800Mbit/sec.
4. Tape drives are typically specified in MB/sec (not Mbit/sec), so our 6800Mbit/sec becomes 850MB/sec.
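To sanity-check the arithmetic in the steps above, here is the same link budget recomputed (all figures are taken from the list itself; nothing new is assumed):

```shell
# 8 DB Nodes x 1 Gbit/s on eth3:
echo "DB Node egress: $((8 * 1)) Gbit/sec"

# Dual-ported 4Gbaud FC at ~3400 Mbit/sec usable per port:
fc_mbit=$((2 * 3400))
echo "FC to tape library: ${fc_mbit} Mbit/sec"

# Convert to the MB/sec units tape drives are quoted in (divide by 8):
echo "FC to tape library: $((fc_mbit / 8)) MB/sec"
```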

If the data being backed up is EHCC compressed – then you are unlikely to get any more compression on the tape drives, but if the data is either uncompressed or compressed using BASIC or OLTP compression, then the data will be hardware compressible – and therefore you might need fewer tape drives to back up the 850MB/sec that the 2 x 4GFC cards can send.

h2. Second – verify the transfer rate into the Media Server.
Unfortunately – I do not have a 1GigE (eth3) on the DB Nodes going to 10GigE on the media servers in my lab – so I cannot show you real numbers – but the same steps below can be used to verify the transfer rate between the DB Nodes and the Media Servers. I am using the IB link between the DB Nodes and a single Media Server on an X2-2 DB Machine. The best practice paper has two Media Servers.

NOTE – If you are using InfiniBand – ensure Connected Mode is setup and the MTU size is increased to 65520 as per the best practice paper.
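For reference, on Linux the IPoIB settings called out in the NOTE above are typically applied along these lines (a sketch only – the interface name ib0 is an assumption; follow the best practice paper for your release):

```shell
# Switch the IPoIB interface to Connected Mode (datagram is the default)
echo connected > /sys/class/net/ib0/mode
# Raise the MTU to the Connected Mode value used in the paper
ifconfig ib0 mtu 65520
# Verify both settings took effect
cat /sys/class/net/ib0/mode
ifconfig ib0 | grep -i mtu
```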
NOTE – If you are able to utilize Jumbo Frames, especially on the 10GigE links – ensure that is setup correctly.

This procedure does utilize the dcli utility running across the root accounts of the DB Nodes.

Now on the first node – where SSH is established – we can run the following command
# dcli -g dbs_group -l root "node=\`hostname | cut -c 8-9\`; qperf 192.168.75.129 -lp 40\${node} -uu -m 65520 -t 30 tcp_bw quit"

Note that we are assuming the hostname is 9 characters long – with the last two characters being the node number – 01 through 08. If your hostname is a different length – then adjust the cut range appropriately. We are then running a 30 second (-t 30) qperf session against the IP address 192.168.75.129 – with each DB Node sending to a different listener port (-lp): db01 to 4001, db02 to 4002, etc. The -m 65520 sets the message size to match the MTU of 65520 (you can get this from the ifconfig output). The final two options are tcp_bw – which is the test you want to run, “TCP streaming one way bandwidth” – and quit – which simply tells the server-side qperf sessions to terminate when the 30 second run is over.
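To see what the embedded shell fragment inside the dcli command evaluates to, here is the node/port derivation on its own (exadmdb01 is a made-up 9-character hostname – substitute your own):

```shell
host=exadmdb01                     # hypothetical DB Node hostname, 9 chars
node=$(echo "$host" | cut -c 8-9)  # last two characters -> node number, "01"
port="40${node}"                   # db01 -> 4001, db02 -> 4002, ...
# This is the per-node command dcli ends up running:
echo "qperf 192.168.75.129 -lp ${port} -uu -m 65520 -t 30 tcp_bw quit"
```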

On a single 10GigE Link – this is probably going to be in the order of 1100MB/sec I believe – please let me know what you find. If you aggregate multiple 10GigE Links with 802.3ad (or LACP / etherchannel) then depending upon the hashing algorithm used by LACP – you might get a higher rate.
Obviously – if you only have 2 DB Nodes configured with the Media Server software – then run the dcli command using just the 2 DB Nodes.

h1. Third – reading the data off the Exadata Storage Cells

People commented about the _backup_ksfq_bufcnt and _backup_ksfq_bufsz parameters. During the Exadata 11.2.1.2.x timeline – we made a change to the DBCA template used by the ACS engineers during the DB installation to set these parameters away from their defaults.

Before going any further though – these parameters do consume memory on the DB Nodes – and are not recommended for an Oracle/HP Database Machine (V1) unless you know you have the memory available on the system. On an Oracle Database Machine (V2, X2-2 and X2-8) the ability to use these parameters are a little easier – although we recommend that the parameters are set and the resulting runs verified.

So – what about these parameters: _backup_ksfq_bufsz should be set to 4194304 and _backup_ksfq_bufcnt should be set to 64.

What does this do to the memory used by RMAN? Each RMAN Channel that is run – so one RMAN Channel per tape drive – will obtain 64 input and 64 output buffers (128 buffers in total) of 4MB each – for a total memory consumption of 512MB per channel. This will result in 1GB of memory being consumed if you have 2 channels running on each DB Node. This is the reason why these parameters are not recommended on an Oracle/HP Database Machine (V1) system – where memory is relatively small. Similarly – if the Database Nodes in a V2, X2-2 or X2-8 system fully utilize the memory in the system – be careful about setting these parameters.
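The memory figures in the paragraph above can be reproduced directly from the two parameter values:

```shell
bufcnt=64                            # _backup_ksfq_bufcnt
bufsz=4194304                        # _backup_ksfq_bufsz (4 MB)
# Each channel gets bufcnt input buffers plus bufcnt output buffers:
per_channel=$((2 * bufcnt * bufsz))
echo "Per channel: $((per_channel / 1048576)) MB"               # 512 MB
echo "2 channels per node: $((2 * per_channel / 1048576)) MB"   # 1024 MB
```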

Also – with Oracle 11gR2 Patch Set 1 (11.2.0.2), installed by default on X2-2 and X2-8 systems – these parameters do NOT need to be set explicitly, as they are automatically set to the correct values on a Database Machine.

During the running of an RMAN Backup – you can check the values of these parameters by querying
SQL> select buffer_count,buffer_size from v$backup_async_io where type = 'INPUT';

BUFFER_COUNT BUFFER_SIZE
------------ -----------
          64     4194304

h1. Fourth – Verify the correct network is being used for the backup
This might sound obvious – but it is always worth validating. The default route out of the DB Nodes in the Database Machine is the Client Access Network – which is typically running at 1GigE – and is also supporting other traffic.

Using sar -n DEV 1 100 – verify that the packets are going out of the DB Nodes on the correct interface and being received by the media server on the correct interface. For example, sar will show a backup that is incorrectly running into a Media Server via the default 1GigE interface topping out at a rate of approx 120MB/sec (wire speed).
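As a sketch of what to look for, the receive rate can be pulled out of a sar -n DEV line like this (the sample line and its column layout are assumptions – check the header your sysstat version prints; rxkB/s is column 5 here):

```shell
# Hypothetical captured line: time IFACE rxpck/s txpck/s rxkB/s txkB/s ...
line="17:10:01 eth0 84562.00 98543.00 122880.00 2048.00 0.00 0.00 0.00"
# 122880 kB/s / 1024 = 120 MB/s, i.e. roughly 1GigE wire speed
echo "$line" | awk '{printf "rx on %s: %.0f MB/sec\n", $2, $5/1024}'
```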

Oracle Secure Backup controls this via the concept of a Preferred Network Interface. I know Symantec NetBackup recommends that this is specified during installation time. I am not sure about other products.

These are the primary setup verification checks that I go through whenever I get involved in a backup exercise.

h1. Fifth – Handling BigFile Tablespaces
If you have a number of BIGFILE Tablespaces you will want the tablespace to be broken up into little parts and run in parallel. The SECTION SIZE parameter can be used in RMAN for a Level 0 backup and will do this for you, but the SECTION SIZE parameter is not applicable for a Level 1 backup. However, if you have enabled the Block Change Tracking file – the use of the Block Change Tracking file will generally alleviate the need to break the backup into multiple sections depending upon how much data is changed in a given day.

You probably want to set the section size such that your largest tablespace is evenly distributed between the available tape drives (within reason). For instance – if you have a 2TB BIGFILE Tablespace – and you have access to 8 tape drives – then set the SECTION SIZE to 256GB. If you have other BIGFILE Tablespaces that are only 1TB in size – then you might want to set SECTION SIZE to 128GB. The goal is to have all tape drives spinning until the last moments of the backup. RMAN will back up the BIGFILEs first – and then work towards the smaller files such as SYSTEM and SYSAUX.
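As an illustration of the sizing above, a Level 0 backup of a hypothetical 2 TB BIGFILE tablespace (BIG_TS is a made-up name) across 8 tape channels might look like this sketch:

```
RUN {
  # 2 TB / 8 drives = 256 GB per section; one channel per tape drive
  BACKUP INCREMENTAL LEVEL 0
    SECTION SIZE 256G
    TABLESPACE big_ts;
}
```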

Sorry to resurrect an old thread but I have come back to revisit this. It seems we can't push our 1Gb backup interface hard enough to reach its ~100MB/s capacity with a single channel. If I run 3-4 channels through there it can get close. Is that "normal" – should 1 channel be able to fill the backup interface connection?
From my sar outputs I see only about 15-20MB/s with 1 channel running (of course I have the other 7 DB nodes also pushing that same 15-20MB/s). Can I somehow boost that one connection?

Your previous answer was very informative and helpful – but the situation is still not ideal. Full backups are taking days and are prone to failure, with the media management layer having issues with tapes and such on these long-running jobs.

I don't think there is anything too unusual here – but if you see any improvements, let me know. Thanks!

If I double up the number of channels I seem to push the backup interface eth3 a bit closer to the theoretical max – ~50MB/s (it would be nice to get ~100MB/s).
Is there a better way than hard-coding all those ALLOCATE CHANNEL commands? Using a service doesn't seem to load balance very well at all.
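One way to avoid hard-coding a stack of ALLOCATE CHANNEL commands is RMAN's persistent configuration – set the SBT parallelism (and any media-manager PARMS) once, and RMAN allocates that many channels automatically on every run. A sketch; the NB_ORA_CLASS value is a made-up example of NetBackup media-manager settings:

```
CONFIGURE DEVICE TYPE SBT PARALLELISM 8;
CONFIGURE CHANNEL DEVICE TYPE SBT
  PARMS 'ENV=(NB_ORA_CLASS=exadata_backup)';
CONFIGURE DEFAULT DEVICE TYPE TO SBT;
```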

Symantec has a backup paper called "Protecting an Exadata Database Machine with NetBackup for Oracle" which talks about how to test backup rates to a disk pool (to verify that the network portion of RMAN & NBU is correct) and then goes into the tape portion of the job. It also covers certain NetBackup tuning parameters. You should be able to get a copy from Symantec and this might help.

I do not see any issues with the "RMAN" portions of the script below, but I am not sure what the different SEND commands are doing in Symantec.

I am wondering why you are limiting your backup pieces with FILESPERSET=3? You have parallelism set to 8 and maxsetsize set to unlimited; the only reason I can think of why you would do this is to minimize the amount of time spent during a restore looking for the correct piece. Otherwise I would just remove that parameter and make way bigger pieces.