Posted by timothy on Friday September 10, 2010 @03:55AM from the and-hammer dept.

unts writes "UK chip designer ARM [Note: check out this short history of ARM chips in mobile devices contributed by an anonymous reader] today released the first details of its latest project, codenamed 'Eagle.' It has branded the new design Cortex-A15, which ARM reckons demonstrates the jump in performance from its predecessors, the A8 and A9. ARM's new chip design can scale to 16 cores, clock up to 2.5GHz, and, the company claims, deliver a 5x performance increase over the A8: 'It's like taking a desktop and putting it in your pocket,' said [VP of processor marketing — Eric Schorn], and it was clear that he considers this new design to be a pretty major shot across the bows of Intel and AMD. In case we were in any doubt, he turned the knife further: 'The exciting place for software developer graduates to go and hunt for work is no longer the desktop.'"

According to ARM's web site [arm.com], there are 'Large Physical Address Extensions (LPAE)' that allow addressing 1 TiB (40 bits). The marketing schematics for the processor mention a "Virtual 40b PA" for each CPU.

Unfortunately, the detailed A15 documentation is not available yet, so we're left to speculate over what this means. At the same time, the supported architecture remains ARMv7 and there is no hint of any major change on the instruction side. An easy implementation would use an MMU with 40-bit physical addresses to map this amount of memory, while each process's address space would remain at 4 GiB to avoid any drastic change to the programming model.

I don't know the heat dissipation figures, but I can safely say I have never yet seen an ARM processor with a heatsink.
As for power consumption, a quick Google search suggests that an 800MHz OMAP3 draws around 750mW at full load. The new A15 core is supposedly going to have similar figures.

Combined with the virtualization support, I suspect one could allocate different cores to different OS images and use the address space to slice up the RAM as needed. Consider having a rack of these in a web hotel, with each core running its own server instance. Hell, given that one can fit an ARM SoC on a DIMM, such a rack could be made very easily expandable with the right mother/logic-board.

No, nothing at all to do with DRM. Snooping refers to checking the contents of other caches for cache coherency. Cache comes from the French, meaning hidden - it is memory that the programmer doesn't see directly, so the CPU has to act in exactly the same (programmer-visible) way as if it were not there. This is pretty simple when you have just one core, but when you have more than one it becomes difficult.

If you have two threads, on different cores, both accessing the same memory, then each will pull it into its own cache. This is fine as long as both are only reading it. When one writes to it, the copy in the other core's cache must be updated, or the two threads will have an inconsistent view of main memory. This is called cache coherency. The snoop control unit is responsible for all of the cache-to-cache communication. Because ARM cores typically live on a die with other units that share the same RAM, it is also responsible for ensuring that the caches remain consistent with modifications to RAM by the other coprocessors.

It will come down to something like the old Intel addressing modes with segments: you have so-called segments of at most 4 GB that you have to juggle around. At the assembly level this scheme was quite evil, because you had to shuffle segments around for code, data, stack, and whatnot.

On the plus side, it offered another layer of code-injection protection. But for complexity reasons it was very unpopular, and once the segment spaces became big enough, most compilers just rolled everything into one huge segment and placed code and data there.

For a processor designer this approach is nevertheless very elegant, because the memory range can be increased ad infinitum while keeping the register size, and thus backwards compatibility, the same.

From a programmer's point of view, segments are hell, because you never know when you'll run into the boundary set by the segment, and then the shuffling begins. Also, if you have data bigger than a segment, you have to squeeze it into multiple ones.

I am not sure I like the way ARM is going there just to keep backwards compatibility. At some point they will have to break it to keep power consumption low (Intel just kept piling the next layer of fluff on top of everything), and I guess, given their current success in the mobile phone area, they shy away a little from rolling out the next break in backwards compatibility like the ones they have done in the past.

64-bit architecture is 20 years old on the desktop, but right now nobody is using it anyway.

They're certainly using more memory than is practically addressable on 32-bit. Ordinary people do need that memory. They do work with large images. They do handle lots of data. They do have many things open at once. They do run large games. Not everyone needs it for everything, but being stuck with only 4GB of address space would really suck. (Luckily ARM isn't limited this way; the Cortex-A15 can address 1TB of memory directly, which is rather more than anyone currently puts in a single machine.)

If I can get a notebook with an ARM chip that can run OpenOffice, email, Firefox and maybe Flash, for half the price and with a battery life of 8 hours or more, I really don't care what architecture it has.

The apps are what people care about, yes. But many apps like to have lots of memory because they work with lots of data. (Funny, that...)

The 4 GB barrier was overcome a long time ago on 32-bit systems. The reason people still think it's a problem is that Microsoft decided that you, as a customer, shouldn't be able to use more than 4 GB of memory on 32-bit Windows, ever since Windows 2000. The limitation is purely artificial on 32-bit Windows today, but Linux gladly handles any memory you toss at it.

According to this [slashgear.com], a typical Cortex-A9 core draws about 250mW. As the A15 has a very similar architecture (still ARMv7), it should be in a similar region, maybe more, as they boosted the frequency. So I guess a 16-core version will draw something like 4W+, maybe more. Nonetheless, this is still an incredibly good figure for a web-server-type processor, though a little heatsink might appear.

I'm only guessing here though, based on previous figures. There is no practical data so far on the exact figures.

XOR calculations, you say? Well, how does having 16 cores help with that? One core is enough, given sufficient memory bandwidth. So you say, increase the number of pins to add bandwidth? Well, then packaging issues arise.

It's a common misconception that RAID cards have powerful processors. They do not: a server can XOR 5000-20000 megabytes per second on the CPU accessing main memory, while most RAID controllers manage anywhere between 200 and, say, 2000, for RAID5 or RAID6. RAID6 is more work in theory, but in reality takes only slightly longer to compute, because it's a bus-limited problem, and in practice your XOR data set will remain in CPU cache.

Of course, with combinations of RAID 0 and 1 you don't need any XORs at all. You might need additional checksum computation, unless your hardware does it for you for "free" (and many controllers just don't bother).

There's a high likelihood that data just transferred from disk will very soon be needed by the kernel or a user process anyway. Conversely, for writing, the data was probably already in CPU cache for exactly the same reason. In those cases the CPU XOR rate is actually significantly higher than in the isolated case above.

When the server does the calculations, it can also reliably verify the integrity of data after a read DMA over the bus, and you get this nearly 'for free' thanks to the CPU cache.

RAID controllers simply do not require much processing power. Of course, SSDs will change the equation a bit, but then you just need more bandwidth; you don't need other cores competing for the same bandwidth.

The only good reason to have such a controller in the first place is the battery-backed cache it contains, which improves database insert/transaction rates.

I think it's about time to make controllers simple 'dumb' devices and have a separate battery-backed cache as a module on a special connector on the server motherboard or in a PCI Express slot.

Well, the Mac has been 68xxx series, PPC, and Intel Xeon. OS X has worked on both PPC and Intel with AMD's 64-bit extensions. I wouldn't be terribly surprised if they changed platforms again someday, if it were evident they could get a good deal and be competitive. They're already using ARM in several products and hosting the development environments for those on OS X.

Windows has actually been on IA32, Alpha, MIPS, PowerPC, IA64, and AMD64. The Alpha, MIPS, and PowerPC versions were short-lived. The IA-64 version is being phased out in favor of the AMD64 version. Microsoft also has experience with ARM, MIPS, SH3, SH4, OMAP, and more, though, for CE/PocketPC/Windows Mobile/Windows Phone. The Xbox 360 is PPC, too. If Microsoft thinks they can make enough money off of it, they'll put Windows on it. They just need to see really big money.

Linux already runs on lots of ARM hardware, too. Not too many desktops are built around the combination yet, but there should be once someone builds a cheap desktop or laptop motherboard for this chip.

I'm not sure why there's all this talk on Slashdot about how many ARM chips get shipped vs. Intel and AMD anyway. Intel ships millions of ARM chips themselves. XScale is one of the brands of chips out there that uses an ARM core, and StrongARM is another (both Intel). Intel also has other CPUs and microcontrollers besides the IA32, IA64, ARM, and EM64T chips. That's all beyond what your post was about, but it saves me another reply just for a rant. ;-)

Surprisingly, no. The Archimedes actually used an initial version of the ARM architecture with 26-bit addressing. The high bits of the program counter register were used to store the CPU status and condition flags, giving an easy way to save/restore those flags across function calls. A clever trick, but unfortunately 64MB of code address space wasn't enough for everyone, so ARM moved to the fully 32-bit architecture in current use. For a transitional period, ARM CPUs supported both architectures, but that time is long gone now.

Sadly, this means that modern ARMs can only run Archimedes software through software emulation. I understand that a newer version of RISC OS does exist for the 32-bit architecture, but it's not compatible with older binaries. Programs have to be recompiled for it, and if written in assembly, partially rewritten! So, no "Sibelius 7" or "Lander"...

That means going back to segmentation. That isn't a killer problem, but it is significant. For how that works in modern computers, look at Windows systems on Intel processors with PAE. Basically the OS gets access to all the memory in the system, but it has to be divided up to be used. In the Windows implementation, the kernel can get only 2GB and each application can get only 2GB. You can have multiple 2GB apps running, but each can't have more.

For an app to get more, it has to implement memory management internally. Basically it talks to Windows to set up a range of memory that will be paged, then gets more RAM allocated and specifies how to page through it. This is called AWE and is used by a couple of apps, like MS SQL Server. Of course it adds complexity to the app and would be problematic with multiple such apps running.

It also makes task switching hit the system harder overall, because of the segmentation.

So, I mean, it works, don't get me wrong; I have seen servers doing it. However, 64-bit is a much, much cleaner solution, both OS-wise and software-wise. It really is a hack when you get down to it.

I like current desktop CPUs, which have larger virtual address spaces than physical. You are right, 40 bits is fine for now. As far as I know, the top-end Intel CPUs currently implement only 48 bits of virtual address space. There's no reason to implement all 64 bits; you wouldn't use them. However, having a flat virtual memory space is extremely useful. There's a reason everyone wanted to move to it with 32-bit CPUs as soon as it became feasible. We don't really want to go back to segmentation.

Because there were no XP 64-bit drivers for a lot of things that were important to end users (sound cards and motherboard components requiring custom drivers come to mind).

Also, end-user 64-bit systems were still relatively new when XP was released, and a lot of applications supposedly wouldn't run on XP64 (which I think didn't have the compatibility modes of Vista; not entirely sure, it's been a while), so the amount of software that would run on it was rather limited.

And of course, last but definitely not least, 64-bit systems weren't marketed as much as they were in the run-up to Vista's release.

This is why you should realize that "you" != "most people".
Stop making sweeping statements about what you think the rest of the world is doing when you are basing them solely on yourself.

This is what you should have done instead, namely actually doing some research before talking out of your ass:

Here's an old article [downloadsquad.com] that discusses how, in Q4 2008, 25% of Vista sales were 64-bit.
Also note that "Windows 7 is expected to be Microsoft's last native 32-bit version - Server 2008 R2 has already moved to 64-bit only".

Also, here we have stats [windowsteamblog.com] indicating that 46% of Windows 7 PCs are 64-bit.

Without getting into too much detail, both are design concepts/operations that are critical components of any system requiring atomic operations: for example, implementing semaphores/mutexes, which are in turn critical components of most symmetric multiprocessing systems such as the Linux kernel (when so configured) or Windows. While these operations are most critical on multi-core systems, single-core systems also have a large need for them.

Because these are such critical operations in modern operating systems, processors have specific instructions to handle them; for instance, CAS is implemented by the CMPXCHG instruction on x86. On ARMv6 and above, atomic operations are built using LDREX/STREX.

I'm guessing he's saying that LDREX/STREX aren't capable, are slow, or something; I've never really looked into the issue.