Answers

If it's a short query, you want approximately 1. If it's a long query and you only want to run one at a time, you want a number comparable to the number of cores. But if you don't know what kind of query you'll get, the default will be just fine.
If you're throwing a mixed workload in, try cascading resource pools. That way queries start with minimal parallelism, but it increases as the query runs longer.
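A cascading setup along those lines might look like this. This is only a sketch: the pool names and the specific RUNTIMECAP, PRIORITY, and EXECUTIONPARALLELISM values are illustrative, not from this thread.

```sql
-- Hypothetical example: queries land in fast_pool with minimal
-- parallelism; if they outlive the pool's RUNTIMECAP they cascade
-- to medium_pool instead of erroring out.
CREATE RESOURCE POOL medium_pool
    EXECUTIONPARALLELISM 8
    PRIORITY 0
    RUNTIMECAP '1 minute';

CREATE RESOURCE POOL fast_pool
    EXECUTIONPARALLELISM 1
    PRIORITY 10
    RUNTIMECAP '2 seconds'
    CASCADE TO medium_pool;
```

Note that in SQL, PRIORITY is an integer; the "High"/"Low" wording below maps to higher/lower numbers.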

Some basic rules I follow (works for most of my use cases):
Max value: # of CPU cores per node
- If the resource pool typically handles nothing but small, fast, low-resource queries (less than 2 seconds is my rule), set that pool's Priority to High and EXECUTIONPARALLELISM to 1.
- For typically larger, more intense, higher-resource queries: set Priority to Low and leave EXECUTIONPARALLELISM at AUTO, or set it explicitly to the number of cores.
- Somewhere in between: I tend to increment by 8 depending on the resource pool's workload. My EC2 instances have 32 cores, so my EXECUTIONPARALLELISM will be 1, 8, 16, 24, or 32 based on relative workload intensity across my cluster. Pools with 8 or 16 usually get Priority Medium; those with 24 or 32, Low.

Again, this works for my use cases; you may have to experiment a bit.
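Applied to a 32-core node, those rules of thumb could be expressed as something like the following. The pool names are hypothetical and the PRIORITY integers are one possible mapping of High/Medium/Low.

```sql
-- Illustrative only: pool names are made up; values follow the
-- rules of thumb above for a 32-core node.
ALTER RESOURCE POOL small_fast_pool EXECUTIONPARALLELISM 1  PRIORITY 10;  -- sub-2-second queries
ALTER RESOURCE POOL reporting_pool  EXECUTIONPARALLELISM 16 PRIORITY 0;   -- mid-weight workload
ALTER RESOURCE POOL heavy_etl_pool  EXECUTIONPARALLELISM 32 PRIORITY -10; -- big queries, full core count

-- Or let Vertica pick (usually the core count):
-- ALTER RESOURCE POOL heavy_etl_pool EXECUTIONPARALLELISM AUTO;
```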

I am benchmarking queries on multiple clusters, one on-prem and one in the cloud. I am finding that on some queries too many threads are counterproductive. The query I am looking at is a UNION ALL of multiple tables with predicates and a GROUP BY on top. What could be causing this behavior?
6 ParallelUnion clock time (us) 12 102.89 => 12 threads only
6 ParallelUnion clock time (us) 24 118.4 => 24 threads
Depending on the query, having fewer threads (via something other than AUTO) could provide faster assembly of the various pieces from the multithreaded operator.
The on-prem cluster above is not dramatically faster with fewer threads on the same query, but the cloud cluster cuts the time in half:
6 ParallelUnion clock time (us) 12 139.43 => 2 min 2 sec to assemble data
6 ParallelUnion clock time (us) 28 275.76 => 4 min 35 sec
6 ParallelUnion execution time (us) 28 35.67 => exec time much smaller than clock
This is also where I see a large discrepancy between execution time and clock time. I would guess this indicates heavy context switching on the host: threads cannot run, but clock time continues to count.
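One way to quantify that gap per operator is to compare the two counters side by side in EXECUTION_ENGINE_PROFILES. A sketch, assuming you substitute the transaction and statement IDs of the profiled query (the `:txn_id`/`:stmt_id` placeholders are mine):

```sql
-- Per-operator clock vs execution time for one profiled statement.
-- A clock_us much larger than exec_us suggests threads were waiting
-- to be scheduled rather than doing work.
SELECT operator_name,
       path_id,
       SUM(CASE WHEN counter_name = 'clock time (us)'     THEN counter_value END) AS clock_us,
       SUM(CASE WHEN counter_name = 'execution time (us)' THEN counter_value END) AS exec_us
FROM v_monitor.execution_engine_profiles
WHERE transaction_id = :txn_id
  AND statement_id  = :stmt_id
GROUP BY operator_name, path_id
ORDER BY clock_us DESC;
```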
So I am going to ask: on systems that have a lot of cores (64 CPUs), like the cloud system, is it more productive to change AUTO across the board to something smaller, like 12 or 24? Some queries would take a hit, but overall would we see more work completed due to less context switching? Keep in mind that the cluster is almost always running workloads, so the use case of running only one long query as fast as possible is not common in real life.
Regarding cascading pools, will that not mean restarting the execution from scratch with a different threading parameter, wasting all the cycles spent so far?

@colin_loghin How are you my friend? Interesting case study, I've been playing around with it a bit myself. I do tend to agree that in many (but not all) cases, changing AUTO to a lower number is a good thing to do. Still messing around with it though to find that "sweet spot" for a few of our resource pools.
As to your final question, no the execution should not restart from scratch. From the docs: "After Vertica finds an appropriate pool on which to run the query, it continues to execute that query uninterrupted."