glTexSubImage2D can be fast, but you need to set up your texture carefully because some combinations of parameters are less than optimal.

People commonly choose GL_RGB for the texture format, but this is one of the slowest choices possible. The driver very likely doesn't represent its texture data in RGB order at all (and it almost certainly doesn't store 24-bit textures natively), so it will need to do some conversion before it can upload the new data. That's the single most likely cause of your performance loss.

I've benchmarked this extensively, and on all hardware the fastest choices are:

In your initial glTexImage2D call:

internalFormat: GL_RGBA8
format: doesn't matter
type: doesn't matter

For subsequent glTexSubImage2D calls:

format: GL_BGRA
type: GL_UNSIGNED_INT_8_8_8_8_REV

If you only care about NVIDIA hardware you can get away with GL_UNSIGNED_BYTE in the last case, but if you need to run well on AMD or Intel too, you absolutely must use these parameters. These will allow the driver to stream in the texture directly and without needing to go through any intermediate conversion steps, which in one benchmark ran 30 times faster. Yes, you read that right: 30.

Of course you need your incoming data to be 32-bit 4-component too. You can write your own up-conversion routine if you wish, and that may be faster than the driver's (just remember to only allocate the memory you write to once instead of doing a separate allocate/free per-upload).
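
As a rough sketch of such an up-conversion routine: the code below assumes tightly packed 24-bit RGB input (no row padding) and packs each pixel into one 32-bit word in the layout that GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV expects (blue in the lowest 8 bits, alpha in the highest). The names rgb_to_bgra32, src, dst and pixel_count are illustrative only, and the destination buffer is meant to be allocated once by the caller and reused for every upload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed 24-bit RGB pixels into 32-bit words suitable for
 * format = GL_BGRA, type = GL_UNSIGNED_INT_8_8_8_8_REV.
 * Assumption: src has no padding between rows or pixels.
 * dst must hold pixel_count words; allocate it once and reuse it. */
void rgb_to_bgra32(const uint8_t *src, uint32_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        uint32_t r = src[3 * i + 0];
        uint32_t g = src[3 * i + 1];
        uint32_t b = src[3 * i + 2];
        /* Alpha is forced to fully opaque. */
        dst[i] = 0xFF000000u | (r << 16) | (g << 8) | b;
    }
}
```

The resulting dst buffer could then be passed as the data pointer of the glTexSubImage2D call described above.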

With GL4.4 you could probably do something with a persistently-mapped PBO but I haven't benchmarked or even tested this and I'd advise that you get the basics right first anyway.

Thanks for your professional advice. I have taken it and successfully compiled my program a moment ago.

When I run my program with a rendering function RenderScene() that does nothing, its CPU usage is 40%. When I call glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, in->width, in->height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, in->imageData) in the rendering function, CPU usage is 45%~48%. The initial call is glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, in->width, in->height, 0, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, in->imageData), where "in" is an OpenCV IplImage* that holds the image.

This performance is almost the same as that of the program I modified before.

By the way, in the rendering function I set one pixel of the image to zero per call and then call glTexSubImage2D as you advised, instead of switching between different images.

I'm wondering whether there is any API that lets the GPU copy the data from CPU memory automatically.

The transfer can be hidden if you use a PBO and have something else to do on the CPU while the texture is transferring, but you cannot draw and transfer simultaneously.

If you have a post-Fermi NVIDIA GPU, moving the transfer to another thread and using a separate context can trigger the dual copy engine, which enables rendering and transfer in parallel.

But I think you have some other problem, since it is very unlikely that the transfer itself can "eat up" half of the processor time per frame. How do you measure CPU usage? Can you measure the time needed to transfer a single texture (and what's the size of that texture)? Which GPU are you using? If it is some very low-budget card in a laptop, I could understand why it behaves that way.
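
One possible way to time a single transfer, as a minimal sketch: wrap the upload in a small timing helper. The names time_call_ms, timed_fn and dummy_upload are hypothetical, and dummy_upload is only a stand-in workload so the example runs without a GL context. In a real program the callback would issue the glTexSubImage2D call and then glFinish(), since GL commands may otherwise return before the work has actually completed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type: the work to be timed, e.g. one texture upload. */
typedef void (*timed_fn)(void *user);

/* Returns the wall-clock time, in milliseconds, taken by fn(user). */
double time_call_ms(timed_fn fn, void *user)
{
    struct timespec t0, t1;
    assert(fn != NULL);
    timespec_get(&t0, TIME_UTC);
    fn(user);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Stand-in workload (assumption, not real GL code). In the real program
 * this would call glTexSubImage2D(...) followed by glFinish(). */
void dummy_upload(void *user)
{
    volatile uint64_t *counter = user;
    for (uint64_t i = 0; i < 1000000u; ++i)
        *counter += i;
}
```

Calling time_call_ms(dummy_upload, &counter) with a uint64_t counter returns the elapsed milliseconds; averaging over many frames should give a steadier figure than a single measurement.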

// update data directly on the mapped buffer
updatePixels(ptr, DATA_SIZE);
glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB); // release pointer to mapping buffer

With this solution, I can copy the next image's data to "ptr" and then see an animation on my screen. But I can't use any memory copy function, because it may lead to huge CPU usage.

And I have tried glDrawPixels(). This API works, and it doesn't add much CPU usage. With it, CPU usage stays at 47%~50% with a 100px * 100px image, and it also stays at 47%~50% with a 16000px * 10000px image. If the rendering function does nothing but glClear() and glutSwapBuffers(), CPU usage stays at 16%~25%. If the rendering function does nothing at all, not even glClear() and glutSwapBuffers(), CPU usage is about 45% on average.
I compared these situations as follows:
#1: CPU usage 47%~50%, with very high fps
void Renderring()
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
    GetAnotherImage(); // 100px * 100px
    glDrawPixels(in->width, in->height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, in->imageData);
    glutSwapBuffers();
}

Have you ever had a basic education in computer science?
I'm sorry I'm so rude, but your requirements are insane.

Originally Posted by sainimu78

I mean, for example, there are 100 different images (10000px * 10000px) that I want to display on screen consecutively, just like a video player.

On which screen do you plan to display a 10000x10000 pixel image? On your laptop? Just a single uncompressed image of that size occupies 382MB. 100x382MB=38.2GB! On the other side, you have to transfer those images to the graphics card. A transfer of a single image (since two cannot fit in the memory) requires 2 seconds on the PCI-E 1.x bus (at peak).
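
The size arithmetic can be checked with a short sketch. It assumes 4 bytes per pixel (32-bit BGRA, as recommended earlier in the thread); the helper names image_bytes and bytes_to_mib are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for one uncompressed image.
 * Assumption: rows are tightly packed, no padding. */
uint64_t image_bytes(uint64_t width, uint64_t height, uint64_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes). */
double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

For a 10000x10000 image, image_bytes(10000, 10000, 4) gives 400,000,000 bytes, roughly 381.5 MiB, in line with the figure quoted above; a hundred such images come to 40,000,000,000 bytes.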

Originally Posted by sainimu78

I believe the data of the image displayed on screen can be transferred to the GPU without any help from the CPU. The question is which APIs I can use to make the GPU do this.

No, it cannot! The CPU must copy the image to the driver's address space and initiate DMA, in the best case.

Originally Posted by sainimu78

"ATI Mobility Radeon HD 4330" is my card. It is out of date, so you can guess how slow my laptop is.

And your graphics card has 512MB of RAM (concluded from the previous posts), and you are lucky that AMD supports rendering textures directly from CPU-side RAM; otherwise you wouldn't be able to upload a 16000x10000 pixel texture (in fact it is not uploaded at all). NVIDIA cards refuse even to think about rendering if the object does not fit in dedicated memory.

All in all, your approach is the only problem here. Your card's driver swaps objects all the time and that's probably the cause of high CPU utilization. A single core (if you have a multicore CPU) is stuck at 100% utilization.

Yes, the 100 different images can easily be obtained by modifying several regions of one image.

I get it: before I call glTex(Sub)Image(), I need to scale the 16000px * 10000px image down so that it fits the window of my program.
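
A minimal sketch of working out that target size, assuming the aspect ratio should be preserved and the image should never be enlarged. The function name fit_to_window and its parameters are hypothetical, and the actual resampling of the pixels (for example with an image library) is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the largest size that fits inside the window while keeping the
 * source aspect ratio. Images that already fit are left at their own size. */
void fit_to_window(int src_w, int src_h, int win_w, int win_h,
                   int *out_w, int *out_h)
{
    assert(src_w > 0 && src_h > 0 && win_w > 0 && win_h > 0);
    assert(out_w != NULL && out_h != NULL);

    if (src_w <= win_w && src_h <= win_h) {
        *out_w = src_w;
        *out_h = src_h;
        return;
    }
    /* Integer comparison of win_w/src_w against win_h/src_h. */
    if ((long long)win_w * src_h <= (long long)win_h * src_w) {
        /* Width is the limiting dimension. */
        *out_w = win_w;
        *out_h = (int)((long long)src_h * win_w / src_w);
    } else {
        /* Height is the limiting dimension. */
        *out_h = win_h;
        *out_w = (int)((long long)src_w * win_h / src_h);
    }
    if (*out_w < 1) *out_w = 1;
    if (*out_h < 1) *out_h = 1;
}
```

The scaled-down image, rather than the full-size source, would then be the one handed to glTexImage2D or glTexSubImage2D.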

But I know a Windows app that can show images captured by a camera (zoomed out from source images of 10000+px * 10000+px) with about 0% CPU usage. How can it do that? I think it must be done by the graphics card and DMA automatically, after some initial settings.

So, are glTexImage() and glTexSubImage() all the APIs I need? If this approach does not work, are there truly no APIs that will work?