Things that work under a program

Hi,
I have been studying C++ for a while, and I am curious about a few points.
When we fire up an executable, how does the computer handle it? I keep hearing buzzwords that I do not understand at all: SIMD, cache memory, FPU, ALU, APU, ISA, x-stage pipeline, bootstrap, FPGA, southbridge, northbridge, CU, etc. How do they affect my program?
I know how to write a simple C++ program, but I do not know how things work at the hardware level. Can anyone help me, please?

A book I can recommend on this subject is Structured Computer Organization by Tanenbaum. Below I give a brief summary of the subjects you enumerated.

SIMD (single instruction, multiple data): a general term for a kind of instruction set designed for data parallelism. In short, programs that use such instruction sets are able to add, multiply, etc. several values (typically 4) with a single instruction. This is useful primarily in programs that do linear algebra calculations. Programs that don't explicitly use SIMD instructions aren't affected by whether the CPU has them or not. Programs that do use SIMD instructions can't run on CPUs that don't support them, unless they use separate code paths to target more than one CPU architecture with a single executable.
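The "separate code paths" idea can be sketched roughly as follows. This is a minimal illustration, not how any particular library does it: `add_scalar`, `add_simd_standin`, `cpu_has_sse2` and `pick_add` are hypothetical names, the "SIMD" path is only a stand-in written in plain C++, and the feature check relies on the GCC/Clang builtin `__builtin_cpu_supports`, which is assumed to be available only on x86 builds with those compilers.

```cpp
#include <cstddef>

// Plain code path: works on every CPU.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Stand-in for a SIMD code path (hypothetical). A real one would use SIMD
// instructions; this one just handles four elements per loop iteration.
void add_simd_standin(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover elements
}

// Hypothetical helper: asks the CPU at run time whether it supports SSE2.
// Falls back to "no" on compilers/architectures without the builtin.
bool cpu_has_sse2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Choose the code path once; callers then use the returned function pointer.
AddFn pick_add() {
    return cpu_has_sse2() ? add_simd_standin : add_scalar;
}
```

A caller would do something like `AddFn add = pick_add();` once at startup and then call `add(a, b, out, n)`. Either path produces the same result; only the speed differs.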

Cache memory: In modern x86 architectures, there are several levels of cache. In order of increasing size and decreasing speed: L1, L2, and L3. When you buy a new CPU, the advertised cache size is usually that of the largest level (L2 on older CPUs, L3 on most current ones). RAM speed hasn't advanced at the same pace as CPU processing speed, so without a cache, the processor would spend most of its time waiting for RAM to load or store a value (typical latencies are on the order of hundreds of clock cycles). The point of the cache is to hold recently used data and intermediate computation values closer to the CPU, so that reads and writes to RAM can be avoided or delayed. In certain application domains, effective use of the cache can make or break the program's performance. Unfortunately, this is too large a topic for me to cover here.
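To give a feel for what "effective use of the cache" can look like in C++, here is a small sketch. It assumes a matrix stored row by row in one contiguous `std::vector`; the function names are made up for the example. Both functions compute the same sum, but the first visits memory in the order it is laid out, which tends to be friendlier to the cache on large matrices. How big the difference is depends on the machine and the matrix size, so measure before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Walks the matrix in storage order: consecutive accesses are adjacent in memory.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Walks the matrix column by column: consecutive accesses are `cols` elements
// apart, so on large matrices each access may land in a different cache line.
long long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Timing both on a matrix much larger than your cache (for instance with `std::chrono`) is a simple way to see the effect on your own hardware.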

Pipeline: Used to implement instruction-level parallelism inside the CPU. Basically, the CPU breaks down the processing of each instruction into smaller steps (stages) and is capable of working on several instructions at once by overlapping those steps: while one instruction is executing, the next one is being decoded and the one after that is being fetched. An "x-stage pipeline" is one with x such steps. Wikipedia has a detailed description of the mechanism.
As with the cache, the pipeline has a strong effect on the performance of a program, but it's also a lengthy subject.
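One place where the pipeline becomes visible from C++ is conditional branches: to keep the pipeline full, the CPU has to guess which way an `if` will go, and a wrong guess throws away the work already in flight. The sketch below shows the same count written with and without a branch in the loop body. The function names are made up for the example, and a compiler may well generate identical machine code for both, so treat it as an illustration of the idea rather than a guaranteed optimization.

```cpp
#include <cstddef>
#include <vector>

// Version with a branch in the loop body. With unpredictable data the CPU's
// guesses about the `if` fail more often.
std::size_t count_at_least_branchy(const std::vector<int>& v, int threshold) {
    std::size_t count = 0;
    for (int x : v) {
        if (x >= threshold) ++count;
    }
    return count;
}

// Version without a branch in the loop body: the comparison result (0 or 1)
// is simply added to the counter.
std::size_t count_at_least_branchless(const std::vector<int>& v, int threshold) {
    std::size_t count = 0;
    for (int x : v) {
        count += static_cast<std::size_t>(x >= threshold);
    }
    return count;
}
```

Both functions return the same value for the same input; any difference is in how well the loop keeps the pipeline busy.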

FPU: The part of the processor that performs floating-point operations. You don't need to be aware of its existence when you code any more than you need to be aware that your car has spark plugs when you're driving.

ALU: The part of the processor that performs integer and boolean operations. Same deal as with the FPU.

CU: The part of the processor that fetches instructions from memory, decodes them, sends them to the appropriate units for execution, and stores the results. The pipeline is part of it. Same deal as with the FPU and ALU.

APU: It's a rather broad term that covers many different devices.

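ISA: In the company of terms like ALU and pipeline, this most likely stands for instruction set architecture: the set of instructions, registers, and data types a CPU family exposes to software (x86 and ARM are examples). Your compiler targets it for you, so it only matters directly if you write Assembly. The same acronym also names Industry Standard Architecture, an obsolete hardware interface (expansion bus), which has no bearing on how programs are written.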

Bootstrap: Another broad term. One definition of bootstrapping (or booting, for short) refers to running a short program stored in a special section of a storage device that loads into memory the bare minimum of an OS, which can then proceed to load its own components (like pulling yourself out of a hole by your own bootstraps). This has nothing to do with how normal programs are written.
Another definition refers to a compiler capable of compiling its own source code, or an interpreter capable of interpreting its own source code. For example, the first Lisp compiler was written in Lisp: a Lisp interpreter written in Assembly already existed, so someone wrote a compiler in Lisp, ran the compiler inside the interpreter, fed the compiler its own source code, and machine code for a compiler came out.
A more fictional example: a power generator that converted 1 Joule of energy into 2 Joules of energy would be capable of bootstrapping, because once you had it running, you could feed part of its output back to itself and you could generate free energy indefinitely.
In general, a process bootstraps once it's able to sustain itself with no outside interference.

Northbridge: Where high bandwidth devices are connected for mutual communication. In modern systems, it connects the CPU/s, the RAM, and the GPU/s. The southbridge also connects to it. Has no bearing on how programs are written.

Southbridge: Where low bandwidth I/O devices (storage, network, other) are connected. Has no bearing on how programs are written, other than that if you know it exists, you probably also know that devices connected here have a huge latency and shouldn't be used for normal computation save for exceptional circumstances (e.g. maybe the problem is so complex that its working data doesn't fit in main memory).

There are many abstractions between hardware and applications. They exist so that you can "write a simple C++ program" without knowing how all those elements work together. If you want to learn how it all works, you might consider learning assembly (a low-level language, basically processor instructions) first. Then learn how your OS works, what happens when you launch an application, and why applications don't interfere with each other. Then you might think about how multitasking works when a processor can only process sequential commands. Then you learn about OS and hardware interaction... There is no easy way to learn all that, just as there is no easy way to become a Ph.D.

Thanks, guys.
SIMD: Suppose we do something like this:
A=a+b+c+d or A=a*b*c*d
On my CPU (Core 2 Duo, 2.67 GHz), can I do this kind of calculation (2.67 x 1000 x 1000 x 1000) times per second if I use SIMD? So I can use SIMD for optimization, right? You said “typically 4”. What does that mean?
Cache: So cache is basically RAM with tremendous speed but very little capacity. Why are there several levels of cache? Is it umm, let’s say for this reason: level one cannot hold (they can’t manufacture efficiently for consumer) more space than xKB/xMB. So we create another memory space adjacent to the CPU that can hold more, perhaps x MB. Again, this level two has a space limitation, so we manufacture another memory space called level 3 with a certain x MB in it. As we go on, we increase capacity but move further away from the CPU, so we lose speed. That is why, for speed, we say “lv1 > lv2 > lv3”. But at what speed do lv1, lv2, lv3, etc. operate? In this analogy, can we say that RAM is a kind of level-X cache?
As for the “speed” of these memory devices (lv1, lv2, lv3, RAM, etc.), I think of it this way. Lets consider 600MHz ram. So it can feed a 600M number of “chunk of data(buffer or similar?)” to CPU at one second. Am I right? I do not know the size of each chunk of data, though. 2 MB? 6 MB? Or is it enough to know that it is “less than or equal to the cache size”? Let me know whether my thinking on this is OK.
Is there any way in C++ to use SIMD and the cache efficiently?

Like I said, these instruction sets are often used in linear algebra applications. 4D vectors are some of the most common.
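To make the "4" concrete: assuming a 128-bit SIMD register (as in SSE) and 32-bit floats, one register holds 128 / 32 = 4 values, which is a natural fit for a 4D vector. The sketch below spells that out in plain C++; `Vec4`, `kRegisterBits`, `kLanes` and `add` are names made up for the example, and the loop is ordinary scalar code that only mimics what a single SIMD add would do to all four lanes at once.

```cpp
#include <array>
#include <cstddef>

// Assumption for this sketch: a 128-bit SIMD register, as in SSE.
constexpr std::size_t kRegisterBits = 128;
// Number of floats that fit in one such register.
constexpr std::size_t kLanes = kRegisterBits / (8 * sizeof(float));
static_assert(kLanes == 4, "this sketch assumes 32-bit floats");

using Vec4 = std::array<float, kLanes>;

// Element-wise addition of two 4D vectors. Written as a scalar loop here;
// a SIMD unit would perform the four additions with one instruction.
Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < kLanes; ++i) out[i] = a[i] + b[i];
    return out;
}
```

Note that this is four independent additions side by side, not a chain like a+b+c+d, where each step depends on the previous one.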

Is it umm, let’s say for this reason: level one cannot hold (they can’t manufacture efficiently for consumer) more space than xKB/xMB.

Sort of. The real reason is that fast memory is expensive. We could in principle manufacture RAM consisting entirely of flip flops (the kind of memory used for CPU registers), but it would be far too expensive.

Whether something is a cache or not depends on how it's used. In principle, RAM is not a cache because it doesn't need to be used as one, unlike CPU caches which have no other functionality (they're not programmatically accessible).
Most OSs cache disk accesses, and web browsers often keep caches of previously visited pages or content. Anything can be a cache if it can hold state.
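The same idea is easy to sketch in software. Below is a minimal cache in front of a slow operation, assuming the result for a given input never changes; `slow_square` is a hypothetical stand-in for anything expensive (a disk read, a network request, a long computation), and `SquareCache` is a name made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for an expensive operation.
std::uint64_t slow_square(std::uint64_t x) {
    return x * x;
}

// Remembers previous results so repeated requests skip the slow operation.
class SquareCache {
public:
    std::uint64_t get(std::uint64_t x) {
        auto it = store_.find(x);
        if (it != store_.end()) {  // hit: answer comes from the cache
            ++hits_;
            return it->second;
        }
        ++misses_;                 // miss: do the slow work and remember it
        std::uint64_t value = slow_square(x);
        store_.emplace(x, value);
        return value;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> store_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

The first `get(7)` is a miss and does the slow work; a second `get(7)` is a hit and returns the stored value. CPU caches play the same role for RAM, except that the hardware manages them for you.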

Lets consider 600MHz ram. So it can feed a 600M number of “chunk of data(buffer or similar?)” to CPU at one second.

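More or less. What gets sent in each transfer is a small fixed-size unit on the order of a processor word, the natural-sized datum that a CPU works with. For example, a 32-bit processor has a 32-bit word. The exact size depends on the width of the memory bus rather than on the cache size (64 bits is common), and in practice the CPU usually requests a whole cache line (often 64 bytes) as a burst of several transfers.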
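As a rough back-of-the-envelope calculation, peak throughput is the number of transfers per second times the size of each transfer. The sketch below assumes one transfer per clock tick, which is a simplification (DDR memory, for example, transfers twice per clock); the function name is made up for the example.

```cpp
#include <cstdint>

// Peak bytes per second = transfers per second * bytes per transfer.
// Assumption: every transfer carries a full unit and none are wasted.
constexpr std::uint64_t peak_bytes_per_second(std::uint64_t transfers_per_second,
                                              std::uint64_t bytes_per_transfer) {
    return transfers_per_second * bytes_per_transfer;
}
```

With 600 million transfers per second and a 32-bit (4-byte) word, this gives 2,400,000,000 bytes per second, about 2.4 GB/s; with a 64-bit (8-byte) transfer it gives about 4.8 GB/s. Real-world figures are lower because of latency and other overheads.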

Is there any way in C++ to use SIMD

Some compilers can automatically generate code that uses SIMD instruction sets (auto-vectorization), with varying degrees of performance gain. Some compilers also have extensions known as "intrinsics": functions that translate directly to SIMD instructions.
Generally speaking, though, when you need full control over the generated code, the only real way is to manually code the relevant functions in Assembly.
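For a taste of the intrinsics route, here is a minimal sketch that adds two float arrays four elements at a time using the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` from `<xmmintrin.h>`. It assumes an x86 compiler that defines `__SSE__` (GCC, Clang) or `_M_X64` (MSVC); on anything else it falls back to a plain loop. The function name `add_arrays` and the macro `HAVE_SSE_SKETCH` are made up for the example.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics
#define HAVE_SSE_SKETCH 1
#endif

// out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE_SKETCH
    // Four floats per iteration: load, add, store.
    // The "u" variants do not require 16-byte aligned pointers.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Leftover elements (or everything, when SSE is not available).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

For a loop this simple an optimizing compiler will often produce similar code on its own, so it is worth checking the compiler's output before writing intrinsics by hand.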