I am currently writing a program where I have a graph and want to find the minimal sub-graph which satisfies my requirements. The graph is directed with four types of nodes. It has many cycles but no 1 cycles. There are hundreds to low thousands of nodes. Nodes generally connected to around ten other nodes, but that can vary a lot.The types of node and how they are connected:Root nodes which may point to any number of Process nodesLeaf nodes which may be pointed to by any number of Process nodesProcess nodes which may be pointed to by any number of Root or Product nodes and may point to any number of Product or Leaf nodes.Product nodes which may be pointed to by any number of Process nodes and which may point to any number of Process nodes.

I wish to find the sub-graph which satisfies the following requirements:Any Root or Leaf which is adjacent to a node in the sub-graph must be in the sub-graphFor every Process node in the sub-graph, all Product nodes which point to it must be included in the sub-graphFor every Product node in the sub-graph, at least one Process node which points to that Product must be present in the sub-graphMedium priority: Minimize Root, Product, and Process nodes included in sub-graphLow priority: Maximize Leaf nodes

Per your description, the smallest subgraph satisfying your requirements is the empty graph. I assume you're starting with either a product or leaf, and you're looking for the smallest subgraph telling you how to build that product with the least amount of processing steps?

And if I understand your description, the only choices you have is to pick just one of the available parents of each product node?

I think the first step would be to estimate the amount of possible subgraphs that could satisfy your query. If that number is small, you can evaluate all of them and pick the best one. If that number is large, you need to employ heuristics (which will end up with a lot of trial and error).

You could implement a recursive marking algorithm: start with your product node, mark all its parents, mark all their parents etc. until you've reached the roots. For every product node you've marked, note how many parents it has, and multiply those numbers. For example, if that graph contains products A, B and C, which can be produced by 3, 4 and 5 different processes respectively, then you have an upper bound of 3*4*5 = 60 different subgraphs that satisfy your requirements.

Actually I start with a process node and my goal is to find a cycle of processes that can produces all its inputs. I'm starting with all the process / resource definitions from the Mars Simulation Project and my goal is to use it as a starting point for finding possible seed factory configurations.

So it looks like if you map vertices onto products and sets onto processes, (you need to have at least one parent process per product, and you need at least one set to cover each vertex), you can reduce the problem of minimizing non-leaf nodes into set cover, which is NP-hard and can't be approximated better than order log n.

In fact since you can have multiple layers of product and process and product and process nodes, you can even build some sort of hierarchical set-covering problem into the graph, where the set of points which must be covered on layer i+1 depends on the sets selected on layer i of the problem.

Here's a rough sketch of a algorithm which might get reasonable answers:

Start with a single process node, P. Now we are forced to add in all parent product nodes and parent root nodes.Each parent root node needs select at least one parent process node, (if any exist). Do so at random with a fixed seed. Now we just have process nodes left to satisfy, so repeat the process of selecting all parent product nodes and root nodes, etc.

When we're done, we'll have picked an alternating sequence of process nodes, product nodes, process nodes, product nodes, etc. So far we've taken O(n) time. Now in the first set of selected product nodes, begin trying out greedy alterations of which parent process to select. If we do each product node one by one separately, and each product node has at most O(n) parent nodes, then there are O(n^3) possible choices here. For each of these choices, throw out the rest of the solution and rerun the algorithm using the new choices of parent process, taking the solution with the best score (lowest # nodes) as the final choice. Now proceed to the second layer of process nodes, and repeat. Continue through all layers of process nodes. This should taken O(n^4) time in total.

Finally, at the end, include all the leaf children of all the process nodes.