I am working on a multithreaded application (a Forex trading app built in C#) and had the client upgrade from a 12-core 3.0 GHz Intel machine to a 32-core 2.2 GHz AMD machine. The PassMark benchmark results were significantly higher for multi-core integer, floating-point and other calculations, while the single-core score was a bit slower than the pack (other machines with a similar config to the 12-core one). It also comes with 64 GB of RAM (4 times as much as the other machine) and a much faster SSD.

So after configuring and running the application on that machine, not only did it not perform as well, it was significantly slower. We're talking 30 seconds to 1 minute slower on an app that usually completes processing within 5-20 seconds. The application uses the TPL's MaxDegreeOfParallelism, which I've tried setting to the number of cores and also to half of that. I've also tried running single-threaded and without setting any limit on parallelism at all.
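For reference, a TPL cap like the one described is usually set through `ParallelOptions`; here is a minimal sketch of that configuration, where the loop body is a hypothetical stand-in for the app's real per-item work:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class MaxDopSketch
{
    static void Main()
    {
        // Hypothetical stand-in for the app's real work items.
        int[] items = Enumerable.Range(1, 1000).ToArray();
        long total = 0;

        var options = new ParallelOptions
        {
            // Cap the TPL at the core count (or half of it, as tried above).
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(items, options, item =>
        {
            // Interlocked keeps the shared aggregate correct without a lock.
            Interlocked.Add(ref total, item);
        });

        Console.WriteLine(total); // prints 500500 (sum of 1..1000)
    }
}
```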

While the hardware may have issues, I am wondering whether the CPU clock speed is the problem. I can overclock to 3.0 GHz, but is that even a good idea?

EDIT
I see a lot of useful information, so I want to modify the question slightly. Forget the Intel processor for now: what can be done with the AMD system to get more out of it? We're working on profiling. We've had a DBA look into indexing, fragmentation and other parameters like I/O usage; there seem to be a lot more reads and writes than on the Intel-based machine. I saw an answer about AMD-specific optimization. Is there a way to do this other than using OpenCL? How about overclocking - would that cook the CPU?
In terms of owning up - I see people are kind of pissed off at me! The PC was on sale, and my boss and I discussed whether the resources available (4 times more RAM, almost 3 times as many cores and a much faster SSD) would help us gain a lot of performance. We're always looking to tune it from the software end, except it hasn't (I won't say didn't) turned out to be the magical bang for the buck we were hoping for. I feel every bit miserable about this - thus the lengthy post.

More Edit
I just wish some AMD rep would say "this is bull**, you're doing it the wrong way! You've overlooked this and haven't used this feature." To make matters worse, I read that AMD has made huge losses this year and is waiting on a bailout. :(

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.

Please provide the actual CPU model numbers, operating system and version. We can't help you without that information.
–
ewwhite Dec 19 '12 at 3:16

4 Answers

Let me get this straight. You upgraded the client based on a hunch and a single benchmark?

That's a mistake. Benchmarks are artificial and don't reflect how real-world programs will perform; at best they provide an indication of potential performance.

Firstly, there is a lot more to getting apps to perform well on multiple cores, and to using all the available memory effectively, than simply having the hardware there.

Many apps are not written with large concurrency in mind and not all problem domains lend themselves to concurrent solutions. The bottleneck on your app may be locks around shared memory.

For example, I've seen graphs of concurrent apps that seem to scale really well up to, say, 4 threads, but then for no apparent reason performance drops off linearly as the number of threads is increased. This is an indication of starvation of a resource. Locks are really expensive. Consider using lock-free structures, or minimise the amount of shared state and interaction between threads.
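As an illustration of the difference (not taken from the app in question), compare funnelling every increment through one shared lock against a lock-free atomic; with many cores the locked version serializes on the gate while the `Interlocked` version does not:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ContentionSketch
{
    static void Main()
    {
        const int N = 100000;
        long lockedCount = 0, lockFreeCount = 0;
        object gate = new object();

        // Every thread funnels through one lock: with many cores this adds
        // contention rather than throughput (the "drops off past 4 threads" shape).
        Parallel.For(0, N, _ => { lock (gate) lockedCount++; });

        // Lock-free alternative: a single atomic add, no shared gate to fight over.
        Parallel.For(0, N, _ => Interlocked.Increment(ref lockFreeCount));

        Console.WriteLine(lockedCount + " " + lockFreeCount); // prints 100000 100000
    }
}
```

Both versions produce the same answer; the point is that only the second one keeps scaling as cores are added.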

Another slowdown can come from the caches. A really interesting example is the LZ4 compressor: early versions were very fast, yet a more complex compressor (Snappy) gave similar performance, and the difference came down to how the caches were used.
Don't underestimate this. If you know what you're doing, you can speed up some algorithms and data structures by many multiples, which is exactly what the author of LZ4 did.

The first thing I'd do, though, is run your code on the 32-core system and profile it to get an idea of where it's spending its time - probably on locks. Also, try reducing the number of threads and benchmarking again. You may find performance increases; in fact I'd say that's likely.
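A quick way to run that experiment is to sweep MaxDegreeOfParallelism and time each setting. A rough sketch, where `Work` is a hypothetical stand-in for the app's real processing:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class DopSweep
{
    // Hypothetical stand-in for one unit of the app's processing.
    static double Work(int n)
    {
        double x = 0;
        for (int i = 1; i < 5000; i++) x += Math.Sqrt(n + i);
        return x;
    }

    static void Main()
    {
        int runs = 0;
        foreach (int dop in new[] { 1, 2, 4, 8, 16, 32 })
        {
            var options = new ParallelOptions { MaxDegreeOfParallelism = dop };
            var sw = Stopwatch.StartNew();
            Parallel.For(0, 10000, options, i => Work(i));
            sw.Stop();
            runs++;
            Console.WriteLine("DOP " + dop + ": " + sw.ElapsedMilliseconds + " ms");
        }
    }
}
```

If the timings stop improving (or get worse) past some setting, that is the point where contention or memory bandwidth, not core count, has become the limit.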

When they help pick servers they display the benchmarks to back that up - think of the auto industry or any other. If benchmarking results are faked to sell something... Also, the other system had to be either returned or purchased. It does, however, sound like a mistake to have purchased this - that's why the question, right?
–
Mukus Dec 19 '12 at 4:26

Benchmark results were off! passmark.com/forum/… Now your answer is making a lot more sense to me. Upvote.
–
Mukus Dec 19 '12 at 4:56

It also sounds like you have lots of memory available. Is your app able to utilise this more to speed things up? As I know nothing about what your software does it's all guesswork.
–
Matt Dec 19 '12 at 10:41

One way to think about this: you went from 12 cores x 2 threads per core (HT enabled) x 3.0 GHz = 72.0, to a system with 32 x 1 x 2.2 = 70.4.

Edit: Based on your updated info, the 3930k as described in the ARK has a 6x2 arch = 12 threads, not a 12x2 arch as I suggested. (http://ark.intel.com/products/63697/Intel-Core-i7-3930K-Processor-12M-Cache-up-to-3_80-GHz)

Oversimplified view of the system aside - Intel has more efficient physical cores while the "virtual" (HT) cores are less efficient, and there are many other variables to consider - triple-channel memory controller etc.

But one thing possibly stands out: thread blocking. If there are threads that block / prevent other threads from executing, the faster clock rates + more efficient architectures are going to win out over having simply more thread capability. That is more of a software optimization problem.

Another thing to look at: are you using an AMD-optimized compiler for the C# app, or are you still using the Intel-optimized version? Edit: Visual Studio and most other compilers have options that allow you to target specific CPU architectures, i.e. 32-bit vs 64-bit, ARM, specific instruction sets (SSE2/SSE3/SSE4 etc). I wonder aloud if that could be a factor at play?
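For a .NET app the closest analogue to "compiler targeting" is the platform target, which decides which JIT (x86 vs x64) compiles the IL on the AMD box. A hypothetical project-file excerpt (the property names are standard MSBuild; whether changing them helps in this particular case is unverified):

```xml
<!-- Hypothetical excerpt from the app's .csproj -->
<PropertyGroup>
  <!-- x64 lets the 64-bit JIT use the full register set and heaps beyond 4 GB -->
  <PlatformTarget>x64</PlatformTarget>
  <!-- release-style optimizations; Debug builds disable JIT optimization -->
  <Optimize>true</Optimize>
</PropertyGroup>
```

Note that the JIT already generates code for the host CPU at run time, so unlike native C++ there is no separate "Intel-optimized" binary to swap out; verifying that the deployed build is an optimized x64 Release build is about the extent of it.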

Can you shed some light on your last paragraph? I haven't heard of that before. Looks quite useful.
–
Mukus Dec 19 '12 at 3:46

Intel HT only gives an approximately 15-30% performance boost, not 100% because it's not a fully independent core. So the weird comparison you make in the first paragraph is fairly meaningless.
–
Matt Dec 19 '12 at 4:24

Is the SSD the only drive in the system? If not, is the SSD being used only for the operating system? Are you employing RAID for the application, and if so, does it connect to other servers running databases on RAID? RAID has been found to hurt some aspects of database data retrieval.

Regarding the CPU, you really do need the chip model number to know that you are comparing apples to apples. The model number will tell you the chip's cache size, number of cores and threads, processor speed, bus type, as well as the inter-core pipeline speed in gigatransfers per second. For example, one Intel CPU may have 8.00 GT/s of bandwidth and another 6.5 GT/s, and between cores that is very important. If data is stuck on a CPU core after it has done its work, it can effectively stall the entire system, hardware and software.

Have you checked how large the data set is, and how large the application is when running in RAM? How fast is the RAM in the two systems being compared, and does the chip you purchased support the speed of the RAM you purchased? It is well known that motherboards support many different speeds of RAM, but the CPU you ordered the system with may not. So you may order a system whose motherboard supports 1300 MHz and, due to the chip you ordered, get less than 1000 MHz.

If this system has so many cores, why does it only have 64 GB of RAM for a new system? I have a Dell T410 home system that I purchased around 2009; it maxes out at 64 GB with 8 cores (2 quad-cores), and the newer model supports 128 GB of RAM with 12 cores (2 x 6). If you reorder the system, consider more RAM if you need it - heck, I use 32 GB for an 8-core home system running VMware 5.0.

Methinks, based on how you wrote your post and the type of inquiry being made, that you did not bone up on the hardware aspects before ordering. If you look at the small print, you may be able to return it for another system. Just tell the boss that the performance is not as expected for the application it is running, and do not delay, because the return window may be a week or two, and after that YOU OWN IT.

Do not be ashamed; just own up to it and let management know that the numbers you are getting back from initial testing are not in the ballpark of what you believed you would get for the outlay of cash, and that you need to exchange it for another system.

As others have already noted, benchmarks are not always good guidance for which processor to choose. PassMark in particular is definitely not something you would want to look at for non-general-purpose applications.

If you have some idea about which resources your software is using and where it is going to be bottlenecked, you might want to look at "raw" performance data like memory latency and memory throughput, and maybe also the individual tests of the SPEC benchmark suite in the CINT (Intel 3960, AMD 6274) and CFP (Intel 3990, AMD 6274) disciplines.

Keep in mind that results (and the perceived or measured application performance) may vary significantly depending on the compiler options or compiler version used to produce a particular binary. Things are somewhat different for .NET, where the compiler only produces intermediate code that is translated to architecture-specific machine code by the JIT at runtime - but even there you can specify optimization parameters for a particular architecture. Your specific OS patch level might be significant as well: Microsoft has released patches to fix underperformance on certain AMD CPUs.