I have performed a parallel run with 2048 cores (in future more than 2048) using an HPC with lustre file system. So far the simulation ran fine, results were good but there occured a problem I had not expected.
After some runs of my case accompanied by big IO traffic which was produced by the other users on the cluster, the storage system (lustre) of the cluster was messed up. A detailed analysis by SGI revealed that a massive simultaneous parallel access to the storage system was responsible for the damage.
For this reason, the admins of the cluster have passed some rules concerning the usage of the HPC. From now not more than ~600-800 processes (or tasks/threads/files) should be read or written simultaneously (especially not more than 6000 files should be written simultaneously by all users).

The admins asked me, if it would be possible to serialize OF for read/write access when using lustre file systems and more than (let's say) 1500 cores.
They suggested to read/write the first block of 128 or 256 processes/files/threads and then the next one and so on until all data is loaded or written within a time step.
Time steps without IO traffic would be not affected by this restriction.

600-800 threads is actually kinda small for Lustre, large sites routinely run >100k threads (see http://www.nccs.gov/jaguar/ for example)

If your backend storage cannot keep up with the volume of Lustre IO requests, there are various ways to tune the Lustre clients to reduce IO load.
You can reduce the number of RPCs in flight, reduce the amount of dirty memory cached per client, etc. Client tuning is quite easy - certainly simpler than forcing serialized IO.

See lustre.org or whamcloud.com for the Lustre manual, which has the tuning information. Also see the lustre-discuss email list