benwills has asked for the wisdom of the Perl Monks concerning the following question:

In the most sincere and blunt sense of the word, I've been a hack programmer since I started about 20 years ago. I never went deep into learning the art of programming, but could usually Frankenstein together what I needed. I'm stating this so you may consider the source (that would be me) of this question about why the use of threads is discouraged...

I've spent the last two-plus weeks learning(ish) Perl to write the leanest and fastest threaded/asynchronous/parallel/forked code to perform a pretty basic task millions of times a day: downloading web pages. In the process, I tried every forking/parallel/asynchronous/threaded solution I could hack together. I tried every HTTP client I could find. I tested all of them in terms of speed, accuracy, and resource usage. (If it's important: in the end, I went with a pure-Perl socket connection (not IO::Socket, but Socket) with some fine-tuning of my own.)

But, more to the point of the question, I found that absolutely no solution competed with threads in any way, shape, or form. Every non-thread solution was much heavier than threads, functioned much slower, and, for whatever reason (additional layers of code?), produced less accurate results and required more "management" in the code.

Yes, figuring out the right threads solution took longer. But for the best solution (if threads is the best solution), I'll spend an extra few days on it to get it right.

I've seen the heated debates about thread usage. I've read just about every single piece of threaded code BrowserUK has posted (and couldn't have written what I wrote without his help in the forums). And I've tested it all for my own use. And the answer is clear: threads wins, hands down.

So: why such severe discouragement? Because it's a little more confusing and not as straightforward? Is there something I'm missing in terms of performance? Is my code situation unique to where threads are outperforming the alternatives, and this is uncommon?

I found absolutely zero public data on performance comparisons, but lots of assertions about performance that contradicted my own tests.

So, I'm just confused and, if I'm missing something, would love to know how to look at this differently.

But, if I'm not confused, then why are threads so actively and severely discouraged? I'm really just trying to understand this.

And if this isn't the place for this question, let me know where a more appropriate forum would be.

Thanks for any help/pointers/thoughts you have that could help me understand this better.

You cannot blindly create a thread in the middle of your program, because doing so duplicates all the data structures in the program's memory. If your script is using 1 GB of memory and you create a thread, it immediately goes to 2 GB; create another thread and it goes to 3 GB, and so on. Replicating those data structures may also be quite expensive in terms of CPU usage.

For instance, imagine you want to query several HTTP servers in parallel. Threads look like a good match for that, so you build a module using threads, you test it and it runs fine. But then, when you use it from some data processing script that holds big datasets in memory everything goes nuts.

With Perl's current thread support, you can only use threads at a high level, designing your whole program around them.
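One way to design around the cloning cost is to spawn workers while the interpreter is still small and load the large data afterwards. The following is only a minimal sketch, assuming a perl built with ithreads and a Thread::Queue recent enough to provide end(); the queue names and the toy "record" data are made up for illustration.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical job and result queues; Thread::Queue objects are shared
# between threads automatically.
my $work = Thread::Queue->new;
my $done = Thread::Queue->new;

# Spawn the worker first. threads->create clones whatever data exists
# at this moment, which is almost nothing yet.
my $worker = threads->create(sub {
    while (defined(my $job = $work->dequeue)) {
        $done->enqueue(length $job);    # stand-in for real work
    }
});

# Only now build the large structure. The worker never receives a copy.
my @big = map { "record $_" } 1 .. 100_000;

$work->enqueue($big[0]);    # hand over just the piece the worker needs
$work->end;                 # no more jobs: dequeue returns undef when empty
$worker->join;

my $result = $done->dequeue;
print "worker returned $result\n";
```

The same module called from a script that has already loaded its datasets would pay the full cloning cost, which is the situation described above.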

Actually, I would prefer something that would allow me to start an empty interpreter, run some initialization code inside and then, clone it on-demand and run arbitrary code on the clones (I toyed a couple of times in the past with the idea of doing something like that myself... but never got the tuits).

Other than looking up the relevant keywords on perldoc, a history of lurking and reading on perlmonks, and some basic knowledge of how computers work from university courses, it didn't take much research either.

For example, setting up a game with multiple threads, one for comms, one for physics, one for AIs - it worked well on the first try. All I did was start the threads at the beginning and share a couple variables. Keep data isolated to the thread that needs to know. Load the multimeg gamestate only in the physics thread, keep the player connection info in only the comms thread, and away it goes.
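The "start the threads at the beginning and share a couple of variables" pattern can be sketched roughly like this. It is not the game code described above, just an illustration assuming a perl built with ithreads; the variable names are invented.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# The one variable every thread needs to see is marked :shared.
my $ticks :shared = 0;

# Everything else stays private: each thread gets its own clone, and
# changes made in one thread are invisible to the others.
my %private_state = (note => 'only meaningful to the thread that uses it');

my @workers = map {
    threads->create(sub {
        for (1 .. 1000) {
            lock($ticks);    # held until the end of this loop iteration
            $ticks++;
        }
    });
} 1 .. 3;

$_->join for @workers;
print "ticks = $ticks\n";
```

Without the lock() call the increments from different threads could interleave and some updates would be lost.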

I'd also seen posts warning of the troubles of mixing threads and Tk, but that wasn't really a problem either. There are only a couple of options: load and/or init the Tk stuff before and/or after spawning the second thread, and start the MainLoop() in one of the two threads. I know I had to try a couple of ways before finding one that worked, but it wasn't painful enough of a process to remember off the top of my head. Looking at the code, it turns out I initialized the GUI, then spawned the processing thread and detached it, then kicked off MainLoop(); in the parent thread.

Using threads takes a fair bit of computer sense to get right. My suggestion is simply to start small and minimalistic, understand what's going on, and then expand from there. You've still got to realize the basics, such as why a thousand threads is bad, why multiple threads writing to disk goes wrong, that starting threads is expensive, and that it copies memory when the thread starts...

So. I would certainly discourage the use of threads by beginners. On the other hand, for those who understand the fundamentals of How Things Work, threads shouldn't be much trouble at all. A blanket discouragement is not sensible to me. Threads are simply an advanced concept, and a fairly sharp blade in the tool chest.

The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that make them easy to misuse. Few people know how to use them correctly or will be able to provide help. The use of interpreter-based threads in perl is officially discouraged.

And it seems like most of the posts or suggestions to use threads (that I've come across) are often countered saying that threads shouldn't be used, are heavier, more confusing, etc. If you'd like references, I can go pull some up from links I've saved.

The use of interpreter-based threads in perl is officially discouraged

I seem to recall that I've seen others making reference to that statement, too.
I think that you (or at least someone who cares about threads) should challenge that remark with p5p by filing a perlbug report that requests a "please explain the reason that remark is there".
You should point out that, IYO, threads is a useful and important part of perl and that discouraging people from using it is stupid, counter-productive and plain wrong ... or whatever words you choose to make your point :-)

At worst they can only decide to reject your bug report and close it without taking any action.

Just as with any other tool, threads are good only when used in the appropriate place. If your application puts tasks that mostly run independently into separate threads, then you win. If you try to use separate threads for tasks that interact heavily, then most likely you'll lose.

Use the right tool in the right place and you'll never have to think about what others say :)

I can only speak for myself but I really dislike Perl threads for two reasons:

Data is NOT shared between threads unless you go to extreme lengths to share them. For large complex data structures you'll save yourself a lot of time if you just go with fork() and some sort of inter-process communication instead because it'll be far easier to debug. For me, this goes against the whole point of threads. Compare and contrast with Java threads, about the only thing in Java that actually works exactly as you would expect. You even get a hash class that automagically takes care of locking issues for you, how neat is that?

Too many of the really useful modules on CPAN will just blow up in your face if you try to use them with threads. To work around those issues, threaded scripts have to jump through so many hoops that you end up losing sight of whatever advantage threads were supposed to give you in the first place.
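To illustrate the first point, here is a small sketch of what "not shared unless you ask" looks like in practice, assuming a perl built with ithreads. The variable names are made up; shared_clone() comes from threads::shared and is used for the nested structure.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my @plain  = (1, 2, 3);             # ordinary array: cloned into each thread
my @shared :shared = (1, 2, 3);     # explicitly shared array

# Nested structures must be made shared all the way down;
# shared_clone() does that recursively.
my $nested = shared_clone({ hosts => ['a', 'b'] });

threads->create(sub {
    push @plain, 4;                 # changes only this thread's private copy

    {
        lock(@shared);
        push @shared, 4;            # visible to the parent
    }

    my $hosts = $nested->{hosts};
    lock(@$hosts);
    push @$hosts, 'c';              # also visible to the parent
})->join;

printf "plain=%d shared=%d hosts=%d\n",
    scalar(@plain), scalar(@shared), scalar(@{ $nested->{hosts} });
```

The parent still sees three elements in @plain, because the thread only ever modified its own clone.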

Talking about threads in Perl makes me sad because it's the one thing I really don't like about my favorite language.

I find that threads are often misunderstood, and therefore misused. People often suppose that threads somehow multiply the CPU resource, when in fact they just divide it. They also often discount the additional load that might be placed upon the I/O subsystem, which, in spite of good cache buffering, can still only accomplish so many reads/writes per second. Probably the most common abuse is what I’ve dubbed the “flaming arrows approach.” Each time another request comes in, you shoot another flaming arrow into the air and then just hope for the best. Of course, this can lead to an unthrottled number of threads, all competing for the same resource ... “thrashing.”

The classic “thrash curve” is elbow-shaped, and the point at which the elbow goes straight up is called “hitting the wall.” The sweet spot is just at the point where the elbow starts to curve up, but you have to have some kind of governor mechanism to hold it there. The simplest way to accomplish this is with the Unix xargs command and its -P maxprocs option, using it to run a limited number of processes, none of which needs to be thread-aware. The key concept is that the number of worker threads is not identical to the number of units of work that the pool of workers is given to process, and that the number of threads can be regulated.
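Inside a single Perl program, the same governor idea can be sketched with a fixed pool of workers pulling from a queue. This is only an illustration, assuming a perl built with ithreads and a Thread::Queue that provides end(); $MAX_WORKERS, the example.com URLs, and the fetch() stand-in are all hypothetical, and no real network I/O is done.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $MAX_WORKERS = 4;    # the governor: never more than this many threads
my @urls = map { "http://example.com/page$_" } 1 .. 20;    # units of work

my $jobs    = Thread::Queue->new;
my $results = Thread::Queue->new;

# Stand-in for a real download; it just reports the URL length.
sub fetch {
    my ($url) = @_;
    return length $url;
}

# A fixed pool of workers, regardless of how many URLs there are.
my @pool = map {
    threads->create(sub {
        while (defined(my $url = $jobs->dequeue)) {
            $results->enqueue("$url " . fetch($url));
        }
    });
} 1 .. $MAX_WORKERS;

$jobs->enqueue(@urls);
$jobs->end;             # workers exit once the queue is drained
$_->join for @pool;

my $finished = $results->pending;
print "$finished pages handled by $MAX_WORKERS workers\n";
```

Twenty units of work are processed, but the thread count stays at four, so the load can be held near the sweet spot by tuning a single number.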

Your particular application is an ideal application for threading, if you can properly control the number of threads vs. the number of web pages that need to be scraped. You know that each thread is communicating (probably) with a different server, along a different Internet network route, so there will be a rich “mix” of completion times for each request, with a moderate amount of local disk I/O needed to file away each request and a negligible RAM footprint. Because of the random completion times, serious competition for disk-drive time is unlikely. A pool of threads will be able to take advantage of that naturally self-regulating workload ... especially if you distribute the workload with some consideration as to which URLs are being pinged. (A “pseudo-random pick” from a moderately sized pool of URLs to be scraped would be a simple strategy to use here.) Like a field of hitters swatting balls toward the outfield, you will naturally have many balls in the air at once. Because some are pop flies and others are line drives, the outfielders can catch them all easily.

You have proved time and time again that you don't understand threading; indeed, everything you've ever posted on the subject -- which has never included a single line of code -- has been proven wrong.

So, just stop talking; before you make your already totally tattered reputation even worse.

And such a vitriolic comment profited this discussion how, exactly? Honestly, just cast your obligatory down-vote, as you customarily do, and leave it at that. Please keep your personal opinions to yourself. (On the other hand, I happen to strongly agree with the positive compliment that you were very rightly paid in the OP.)

Threads are a misunderstood and thus often-misused feature, no matter what language is being talked-about. The OP hit upon a textbook example of where threading is particularly well-suited, and obtained great results. All of his program’s requests were being served by another well-designed application of threading ... Apache. But how many times have we seen, even right here, situations where people fired off “one thread per request, regardless of transaction volume,” and wondered (publicly) why their server was being brought to its knees thereby? A good design in a suitable situation works consistently well, whereas one that is permitted to “hit the wall” is disastrously-bad (and negatively impacts the system as a whole). (In some cases it is literally a “fork bomb.”)