
>> (try that on a table with a million entries, then come and tell me it was a good idea).

Try telling any data guru that having a table with a million entries in your OLTP is a good idea.

You should never really have to worry about extremely large data sets in your OLTP, because you should never really have extremely large data sets in your OLTP. Your OLTP should really only contain what is necessary for the online system. Everything else should be offloaded, and maybe even de-normalized, to a data warehouse.

But you're really going about it the wrong way if you find yourself with millions of rows in your online system.

Everybody seems to talk about "enterprise programming" but then completely ignore aspects of "enterprise databasing" (I know that's not a real word, but you know what I mean).

Maybe 10 or 20 years ago that was true, but I have, right now, a single database with 26 tables (due to file size limitations more than anything else) each containing ~ 500,000 - 1 million records serving content from a CMS. I can assure you that the DB handles this quite well, without any issue. Putting those in a separate system would make things quite slow, cumbersome, and difficult to manage.

Our user database is fast approaching a million users who have been active in the last 30 days as well.

The query that I gave as an example was from a real-world situation. We want to show 10 random articles from the database (the 26 content tables are managed with a MERGE table). The only solution that gave good, truly random results was to do this:

1.) Count the number of records.

2.) use mt_rand(0,$RecordCount);

3.) Grab the records using the randomly chosen ID.

Like I said -- ORDER BY RAND() LIMIT 10 would never be more efficient (even if you were dealing with a small data set, such as a few thousand rows).
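As a sketch, the three steps above might look like this in PHP. The `articles` table, its auto-increment `id` column, and the PDO connection details are hypothetical, and the id sequence is assumed to be mostly gap-free:

```php
<?php
// Sketch of the count-then-pick approach described above, using PDO.
// Table name `articles` and column `id` are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=cms', 'user', 'pass');

// 1.) Count the records once.
$count = (int) $pdo->query('SELECT COUNT(*) FROM articles')->fetchColumn();

// 2.) + 3.) Pick random ids and fetch those rows directly by primary key.
$ids = array();
while (count($ids) < 10) {
    $ids[mt_rand(1, $count)] = true;   // using ids as array keys de-dupes them
}
$in = implode(',', array_keys($ids));
$articles = $pdo->query("SELECT * FROM articles WHERE id IN ($in)")
                ->fetchAll(PDO::FETCH_ASSOC);
```

Each lookup here is a primary-key hit, so the cost stays flat as the table grows, whereas ORDER BY RAND() has to generate a random value for, and then sort, every row in the table.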

Of course, not every application is (or should be) designed the same. Some applications need the database as lean as possible; others perform better when all the data is accessable all the time. Blanket statements like "never" or "always" are the first signs of bad design.

Oh, and I'd never use any of the existing open source CMS systems for the sites that I run. Have you ever tried to move 20,000 articles from one section of your site to another in those things? Yikes!

Any PHP application is inherently scalable (unless it has been very very very poorly coded). It is not PHP's job to scale the system; it will naturally scale with your system.

Even a few simple mistakes can choke scalability, actually. The two most obvious examples are vBulletin and MediaWiki. Both are very popular, very well-regarded, and very poorly written pieces of software that scale extremely poorly. Yes, you can just throw more web servers at the problem (as most people do), but writing a more efficient code base from the ground up would have saved thousands upon thousands (or millions, in the case of MediaWiki) of dollars in hardware and maintenance costs.

Obviously there's a certain point where limitations beyond php's control start to hit you (once you get over 10,000 simultaneous connections or so, it's cheaper to just buy a second web server most of the time, since apache can't really do much better than that on any reasonably-priced current hardware; I use a dual opteron w/ 4GB of RAM as my benchmark of "reasonable").

However -- most people don't get those kinds of numbers. I'm still baffled as to why it's more or less impossible to run vBulletin on a single machine once you have more than 200-300 concurrent connections going on (note: vBulletin's "current users online" is not a measure of concurrency, it's a measure of users online in the last 10 or 15 minutes, and there's a world of difference). Of course, most sites can fall back to using things like Tux, which can easily toss out 25k+ pages per second without batting an eye.

Of course, there are also situations where it's ridiculous to even think about running an application on only a handful of servers; it's clear you are going to need a whole park of servers.

Rarely, and it depends on what you're actually doing.

After all, it doesn't really matter much whether you are going to use 60 or 150 machines.

If it makes the difference between buying 100 servers and buying 110, it matters significantly to most every company out there. Servers are cheap, individually, but the numbers quickly add up, especially for the 95%+ of web companies out there who are not multi-billion dollar operations.

an example: queries should follow the format db_name.table_name for replication of individual databases.

That doesn't help you one bit for replication of individual databases on most platforms.

I generally say it's best to leave the scaling to the db server (mysql, postgresql, oracle, and microsoft sql server all do this naturally. I'm quite positive that the majority of other db platforms out there do as well, those are just the ones I've used personally). Attempting to implement a custom clustering system in your code is, at best, messy, and, at worst, dangerous.

And scaling databases is easy otherwise. Simple round-robin setup for a read-only database:
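A minimal sketch of what such a round-robin read setup might look like (hostnames, credentials, and the database name are made up for illustration):

```php
<?php
// One possible round-robin setup over read-only replicas.
// Hostnames and credentials here are purely illustrative.
$readHosts = array(
    'db-read1.example.com',
    'db-read2.example.com',
    'db-read3.example.com',
);

// Picking a replica at random per request spreads load roughly evenly,
// which across many independent requests behaves like round robin.
$host = $readHosts[array_rand($readHosts)];
$read = new PDO("mysql:host=$host;dbname=cms", 'user', 'pass');

// All writes still go to the single master.
$write = new PDO('mysql:host=db-master.example.com;dbname=cms', 'user', 'pass');
```

The replication itself is left to the database server, as the post above recommends; the application only chooses which replica to read from.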

>> Obviously there's a certain point where limitations beyond php's control start to hit you (once you get over 10,000 simultaneous connections or so)

Most limitations hit before PHP gives out. The network is probably the biggest bottleneck. If we take a page of 20k, a server on a 10Mbit line will become saturated at around 50 requests/second, around 500/second on a 100Mbit line, and around 5,000/second on a 1Gbit line.

Most servers probably are on 100Mbit lines, and would have one hell of a time serving more than 500 requests/second due to network limitations.
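The arithmetic behind those figures is easy to check (assuming a 20 KB page and ignoring TCP/HTTP overhead):

```php
<?php
// Back-of-the-envelope check of the saturation figures above.
// Assumes a 20 KB page and ignores all protocol overhead.
$pageBytes = 20 * 1024;

foreach (array('10Mbit' => 10e6, '100Mbit' => 100e6, '1Gbit' => 1e9) as $line => $bitsPerSec) {
    $reqPerSec = ($bitsPerSec / 8) / $pageBytes;   // bits -> bytes -> pages
    printf("%-8s ~%d requests/second\n", $line, $reqPerSec);
}
// Comes out to roughly 61, 610, and 6,100 req/s -- the same order of
// magnitude as the 50 / 500 / 5,000 figures quoted above.
```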

Edman, I hear you and am in a similar situation. I found this great resource (hudzilla.org) a few months back, and it mentions pretty much everything you'll need to know about optimising PHP code and database structure.

Good luck, and apologies if someone has already mentioned this resource!

>> Obviously there's a certain point where limitations beyond php's control start to hit you (once you get over 10,000 simultaneous connections or so)

>> Most limitations hit before PHP gives out. The network is probably the biggest bottleneck. If we take a page of 20k, a server on a 10Mbit line will become saturated at around 50 requests/second, around 500/second on a 100Mbit line, and around 5,000/second on a 1Gbit line.

>> Most servers probably are on 100Mbit lines, and would have one hell of a time serving more than 500 requests/second due to network limitations.

20k? That's a pretty big example :P
The average size of a gzipped page is more likely to be 5-10KB, unless the HTML is really bloated.

Speed & scalability in mind...
If you find my reply helpful, feel free to give me a point

That was my point. Depending on what you are doing (and how large you have to "scale") different things will matter with respect to what's important for "scale".

>> If it makes the difference between buying 100 servers and buying 110, it matters significantly to most every company out there. Servers are cheap, individually, but the numbers quickly add up, especially for the 95%+ of web companies out there who are not multi-billion dollar operations.

If you have software that's running on 50 servers, the cost of hardware will be smaller with respect to all the other costs involved in the venture - most of the time. Hence why I said the number of servers (beyond a certain threshold) is not the most important thing. Decisions will be made on economics - sometimes it will be cheaper to have a few devs profile and improve the application (there's often some low-hanging fruit in the beginning), and sometimes (especially once you've picked all the low-hanging fruit) it will be cheaper to add hardware.

Otherwise I agree with a lot of what you wrote, especially with respect to popular software. A lot of popular OS projects have stock sentences about "flexibility" and "performance" in their blurbs - but they are often just that: blurbs intended to make people feel good; rarely was there serious effort made in that respect (let alone any honest comparisons to alternatives).

>> Obviously there's a certain point where limitations beyond php's control start to hit you (once you get over 10,000 simultaneous connections or so)

>> Most limitations hit before PHP gives out. The network is probably the biggest bottleneck. If we take a page of 20k, a server on a 10Mbit line will become saturated at around 50 requests/second, around 500/second on a 100Mbit line, and around 5,000/second on a 1Gbit line.

>> Most servers probably are on 100Mbit lines, and would have one hell of a time serving more than 500 requests/second due to network limitations.

Maybe we're looking at "limitations" as two different things. If I see a request not even being served in under 100ms, I see that as being a problem; it makes the site appear sluggish, and that eventually turns users away. No matter how many servers you throw at the problem, the only way to make the page get served faster is to write better code. Ultimately, that's the final word in application performance -- user experience. If the users get their pages quickly, the application is performing well. If they are not, it isn't. If you ignore the execution speed of the script itself and only focus on the overall performance on a massive scale, the site will appear sluggish to users, and that's always a bad thing. There's no reason why it should take me 5 seconds to download a 25k page when I'm on cable or DSL -- period.

Writing better code also enables you to do much more complicated work in real time, which is going to become increasingly important as stuff like AJAX gets more and more popular. Sure, your site may be able to handle 10,000 requests per second (or whatever), but it's still performing extremely poorly if the user isn't getting near-instant responses from your application (they may as well just do things with traditional "click and wait for response" type of stuff).

>> Sure, your site may be able to handle 10,000 requests per second (or whatever)

As I recall you were the one talking about some magical server you have that can do 10,000 PHP requests per second or something like that (which I don't buy for 1 second anyways)... I was just trying to say that unless your server has some kind of uber uplink to the net, you're not even going to be able to get close to that due to network limitations.

You also seemed baffled that so many servers could not handle more than about 300 concurrent connections with vBulletin or something, but that doesn't really matter: for most single servers, around 300 is probably their limit due to network bottlenecks. Most people probably only have 10Mbit uplinks (which would limit them to far fewer) or 100Mbit, for which 300 is probably about right. Once you saturate the line, game over. Trying to pump out a few more cycles isn't going to solve anything if the problem is that the line is saturated.

>> If you ignore the execution speed of the script itself and only focus on the overall performance on a massive scale

If you focus on a massive scale, the network will be a far bigger factor than cpu cycles will be, assuming your script can handle what the network can (which I assume most can, since networks saturate easily).

I'm not trying to say you shouldn't optimize your code to use fewer CPU cycles, but let's be realistic: in nearly any web app, the network will most likely be the biggest bottleneck.

>> If the users get their pages quickly, the application is performing well. If they are not, it isn't.

I think most of us are talking about web apps, and there is far more to the puzzle than just the application. If a user is not getting pages quickly it could be any one of the following (or more, as this is just off the top of my head):

- network line is saturated (reached limit)
- web server (apache) not tuned/setup correctly, or reached its limit.
- CPU has reached its limit
- other processes on server causing too much overhead
- SQL server has reached its limit

If the users don't get their pages quickly, the server is not performing well. Not necessarily the application. There are a myriad of things that could be the cause or contributing to the cause.

Coming in a bit late to the discussion, but I wanted to jump in on the 'session' statements I saw.

I didn't really understand someone's earlier comment about "I got rid of sessions - they'll just have to use cookies". Was that meant as storing information IN a cookie? Not a good idea from a performance standpoint, as generally that cookie information is sent back in every HTTP request (15 images on a page means that cookie information is sent 15 times over the network for that page request).

I'm getting offtopic a bit, but I'll bring it back to sessions. Generally I don't use the PHP built in session handling. My earlier experience with the LogiCreate framework was that the PHP session handling code wasn't all that hot. Now granted, this was early days of PHP4, and there was no $_SESSION or other improvements. BUT, the thing I've noticed is that it *always* writes out the full session to disk even if there have been no changes to the session data. On large sites that's wasteful - sometimes very much so.

You can bypass this somewhat by using session_set_save_handler() to write your own save/write routine, but you still need a way to know if anything's been changed. It's probably worth it to simply write your own session system which would check for a 'dirty' flag if the session's been modified, and only write out when things have been changed.
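A minimal sketch of that dirty-check idea using session_set_save_handler(). The file-based storage, function names, and $GLOBALS bookkeeping here are illustrative only:

```php
<?php
// Custom session save handler that only writes to disk when the
// session data actually changed since it was read ("dirty" check).
function sess_open($savePath, $name) { $GLOBALS['sess_path'] = $savePath; return true; }
function sess_close() { return true; }

function sess_read($id) {
    $file = $GLOBALS['sess_path'] . "/sess_$id";
    $data = is_file($file) ? file_get_contents($file) : '';
    $GLOBALS['sess_orig'] = $data;        // remember what we handed out
    return $data;
}

function sess_write($id, $data) {
    if ($data === $GLOBALS['sess_orig']) {
        return true;                      // unchanged: skip the disk write
    }
    return file_put_contents($GLOBALS['sess_path'] . "/sess_$id", $data) !== false;
}

function sess_destroy($id) { @unlink($GLOBALS['sess_path'] . "/sess_$id"); return true; }
function sess_gc($maxlifetime) { return true; }

session_set_save_handler('sess_open', 'sess_close', 'sess_read',
                         'sess_write', 'sess_destroy', 'sess_gc');
session_start();
```

On a read-heavy site where most requests never touch the session, the write callback becomes a no-op for the vast majority of requests.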

Just thought I'd throw that out. Excepting the LogiCreate system I'd started years ago, I don't think I've ever seen a PHP framework deal with this issue. drupal doesn't (just mentioning it because it was mentioned as 'use drupal if scalability is important' or something similar).

>> If the users get their pages quickly, the application is performing well. If they are not, it isn't.

>> I think most of us are talking about web apps, and there is far more to the puzzle than just the application. If a user is not getting pages quickly it could be any one of the following (or more, as this is just off the top of my head):

>> - network line is saturated (reached limit)
>> - web server (apache) not tuned/setup correctly, or reached its limit
>> - CPU has reached its limit
>> - other processes on server causing too much overhead
>> - SQL server has reached its limit

>> If the users don't get their pages quickly, the server is not performing well. Not necessarily the application. There are a myriad of things that could be the cause or contributing to the cause.

We're talking about the same thing. I was using the term "application" as a generic term for the user's interaction with you. That includes server, network, whatever. It's everything. And if any element is slow -- it's bad. You can NOT let users wait 5 or 10 seconds to receive feedback and claim that your application performs well because you're able to handle thousands of simultaneous users. This is where code performance matters.

>> As I recall you were the one talking about some magical server you have that can do 10,000 PHP requests per second or something like that (which I don't buy for 1 second anyways)...

I don't believe in magic, nor was I the one who made that claim. I used that number because it was provided previously, although it is quite possible to serve large numbers of requests when you're dealing with web requests and not actually returning much (or, in many cases, not returning anything at all, just processing input). You assume that every request being made is going to produce a full page of content. That's simply not the case for modern web apps. I can show you plenty of examples of sites taking on thousands of hits, processing / updating databases, and then returning absolutely nothing (i.e. a 304). Amazon and NetFlix come to mind immediately here, as they both do this quite effectively, and I'm sure save a whole lot of bandwidth because of it.

However, like I said, there is no reason why a single, moderate piece of hardware should not be able to handle a few hundred requests per second (my original benchmark for vbulletin and wikipedia). Of course, these applications (which I use as examples because they're typical of most PHP code) don't really perform all that well even when there is no load.

You're correct that the average user is heavily bandwidth-limited. Typically, users who are on those types of connections (< 10Mb upstream) aren't encountering these types of scalability issues in the first place, though, so it's irrelevant.

I've downloaded the php_apc.dll and put it in my Apache root directory, where the other dlls exist for my installation. But, since APC is only for PHP4.x, I'm having a problem with this extension

Downloaded phpts4.dll (PHP4.4 package) and put that into C:/Apache2/bin/ but still no result, so can you (or someone else) tell me how to use this extension with PHP5.0.x? I might look also at memcache since I've downloaded that extension as well if this isn't solved.