I'm wondering how to design a fairly simple class whose properties are complex to compute. Also, the properties depend upon each other for computation.

An example using graphs and graph processing (think nodes and edges, not charts or scatterplots) to motivate the problem:

Class CommunityGraphSet is a collection of Community instances. CommunityGraphSet is initialized with an input graph (some large social network graph), along with some basic parameters describing what is known about the graph. A Community is a subgraph of the input graph, along with some structural descriptors.

After initialization, the CommunityGraphSet contains no Community subgraphs. After several major processing steps, it contains a set of Community subgraphs. Each major processing step (i.e. function call) triggers a GUI for a human to tune the parameters of the algorithm being run in that step. The human sets parameters, the processing results are visualized, and they continue to refine the parameters until the results are acceptable. After they "accept the parameters" with a button click, the results they saw are returned by the method that triggered the GUI.

input_graph = read("graph_file.csv")
number_communities = 3
sparseness_thresh = (0.05, 0.10)
# MAJOR PROCESSING STEP
# get parameters for key community member identification based on overall graph structure
key_ident_params = compute_key_identification_params(input_graph)
# initialize the graph set
graph_set = CommunityGraphSet(input_graph, number_communities, sparseness_thresh)
# MAJOR PROCESSING STEP
# mark where community subgraphs definitely are not.
partition_mask = isolate_potential_communities(graph_set)
# use mask to extract subgraphs for refinement by different algorithms.
potential_community_subgraphs = graph_set.extract_subgraphs(partition_mask)
for community in potential_community_subgraphs:
    # MAJOR PROCESSING STEP
    # key community member extraction
    key_members = extract_key_members(community, key_ident_params)
    # MAJOR PROCESSING STEP
    # use identified key members and community sparseness to trim community to final subgraph
    final_community = refine_community(community, key_members, sparseness_thresh)
    # put the refined community in the final set
    graph_set.add_community(final_community)
save(graph_set.serialize(), "communities.json")

My problems with this design:

CommunityGraphSet is stateful. It contains Community subgraphs only after the right function calls in the right order.

The processing pipeline is monolithic. As the processing pipeline grows and changes, maintaining it will become unwieldy, because I'll be faced with a growing number of variables used in increasingly many places.

Are these indeed problems? Or are they inherent in data processing pipelines?

If they are problems, how can I solve them?

Notes:

CommunityGraphSet is not coupled to GUI logic. Instead, it is coupled to an IterativeOptimizer interface for each step. Either the described GUI or a clever optimization algorithm could implement that interface.
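
To make that decoupling concrete, here is a minimal sketch of what such an interface might look like. The method name `optimize`, its signature, and the `GridSearchOptimizer` class are assumptions made for illustration; the actual IterativeOptimizer interface is not shown in this question.

```python
from typing import Any, Callable, Iterable, Protocol, Tuple


class IterativeOptimizer(Protocol):
    """Hypothetical interface: tune a step's parameters until the result is acceptable."""

    def optimize(
        self, run_step: Callable[[Any], Any], initial_params: Any
    ) -> Tuple[Any, Any]:
        """Return (accepted_params, result_for_those_params)."""
        ...


class GridSearchOptimizer:
    """Non-interactive implementation: try each candidate and keep the best score.

    A GUI-backed implementation would expose the same optimize() method, but
    would loop until a human clicks "accept" instead of comparing scores.
    """

    def __init__(self, candidates: Iterable[Any], score: Callable[[Any], float]):
        self.candidates = list(candidates)
        self.score = score

    def optimize(self, run_step, initial_params):
        best_params, best_result = initial_params, run_step(initial_params)
        for params in self.candidates:
            result = run_step(params)
            if self.score(result) > self.score(best_result):
                best_params, best_result = params, result
        return best_params, best_result
```

A processing step would then accept any object with an `optimize` method, so the pipeline code never has to know whether a person or an algorithm chose the parameters.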

1 Answer

I think that the two problems you listed arise not from your design but rather out of the problem you are solving.

Depending on how generic you want to make your design, you could have some meta-data that defines each major processing step, and then some methods that you call to "push" the pipeline into the next step. Of course, since it sounds like each step requires its own specific parameters from the user, you would have to define the parameters generically as well, using more meta-data.

I don't think that what I just outlined is a particularly good design based on your problem description. It adds a lot of complexity and meta-data overhead for no real benefit. Sometimes a problem is so domain-specific that the code you design to solve it is going to be equally specialized. That's not necessarily a bad thing. You can mitigate it by writing code that is as clear and simple and well-documented as possible :-)
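
For reference, the metadata-driven idea could be sketched roughly as follows. Every name here (`StepSpec`, `MetadataPipeline`, `advance`) is hypothetical, and the steps are toy functions rather than real graph algorithms. It mostly illustrates how much scaffolding the generic approach needs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class StepSpec:
    """Metadata describing one major processing step (hypothetical)."""
    name: str
    param_names: Tuple[str, ...]  # parameters the user must supply
    run: Callable[[Dict[str, Any], Dict[str, Any]], Any]  # (state, params) -> result


class MetadataPipeline:
    """Runs steps strictly in the order given by their metadata."""

    def __init__(self, steps: List[StepSpec]):
        self._steps = steps
        self._index = 0
        self.state: Dict[str, Any] = {}  # results of completed steps, by name

    @property
    def done(self) -> bool:
        return self._index >= len(self._steps)

    def advance(self, **params: Any) -> Any:
        """Push the pipeline into its next step using user-supplied params."""
        if self.done:
            raise RuntimeError("pipeline already finished")
        step = self._steps[self._index]
        missing = [n for n in step.param_names if n not in params]
        if missing:
            raise ValueError(f"step {step.name!r} is missing params: {missing}")
        result = step.run(self.state, params)
        self.state[step.name] = result
        self._index += 1
        return result


# Toy usage: the first step filters values, the second counts what is left.
steps = [
    StepSpec("mask", ("threshold",),
             lambda state, p: [x for x in [1, 5, 9] if x > p["threshold"]]),
    StepSpec("count", (), lambda state, p: len(state["mask"])),
]
pipeline = MetadataPipeline(steps)
mask = pipeline.advance(threshold=4)
count = pipeline.advance()
```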

Update: As to your specific question of whether it's "OK to have a class that's designed to be incomplete on initialization, and completed by necessary processing steps," you may be thinking of the more general advice that is mentioned in some OO design guidelines:

A properly written constructor leaves the resulting object in a
valid state.

Remember, it's up to you to decide what "valid" actually means in context. In your example, I would argue that the CommunityGraphSet is in a valid state even at the beginning because its purpose is not to serve as a static immutable representation of the graphs, but as a processing pipeline to generate the graphs based on successive stages of user input. In other words, it's not really "incomplete" because the requirements dictate that the processing requires several stages of user input. Perhaps a better name would be CommunityGraphGenerator or CommunityGraphProcessor. Further, you could define a separate class CommunityGraphSet that would hold the results of the entire process once the pipeline is finished. Even if this class is just contained within the larger one, it might make the design more understandable to an OO purist.
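
A minimal sketch of that split, under some assumptions: `finish()` is a hypothetical method name, `Community` is reduced to a set of member names, the graph is a plain adjacency dict, and I'm treating `number_communities` as the exact count required before the result can be produced.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Community:
    """Stand-in for a real subgraph: just the member node names."""
    members: FrozenSet[str]


@dataclass(frozen=True)
class CommunityGraphSet:
    """Finished, immutable result. Always complete once it exists."""
    communities: Tuple[Community, ...]


class CommunityGraphGenerator:
    """The stateful pipeline. Being 'empty' at the start is its valid initial state."""

    def __init__(self, input_graph: Dict[str, List[str]], number_communities: int):
        self.input_graph = input_graph
        self.number_communities = number_communities
        self._communities: List[Community] = []

    def add_community(self, community: Community) -> None:
        self._communities.append(community)

    def finish(self) -> CommunityGraphSet:
        """Hand over the result only when the expected count has been reached."""
        if len(self._communities) != self.number_communities:
            raise ValueError(
                f"expected {self.number_communities} communities, "
                f"have {len(self._communities)}"
            )
        return CommunityGraphSet(tuple(self._communities))


generator = CommunityGraphGenerator({"a": ["b"], "b": ["a"]}, number_communities=1)
generator.add_community(Community(frozenset({"a", "b"})))
graph_set = generator.finish()
```

With this arrangement, code that receives a `CommunityGraphSet` never has to ask whether processing has finished, because only the generator can be in an intermediate state.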

Thank you, that's helpful. So it doesn't seem like a horrible code smell to have a class that's designed to be incomplete on initialization, and completed by necessary processing steps? (meaningful default values for most of the attributes are really hard to even imagine in this case)
– kdbanman May 13 '15 at 0:51

No, I wouldn't consider it a code smell, but I'm usually on the liberal/pragmatic side of these arguments. To make it more maintainable, make sure your design is well documented, and consider throwing exceptions when a processing step (method) is called outside of the correct sequence. Kind of like a wizard UI. This implies that you maintain some kind of pointer to the next step (a very minimal meta-data structure.)
– Jordan Rieger May 13 '15 at 1:50
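
One way the "pointer to the next step" idea might look in practice is sketched below. The stage names loosely follow the pipeline in the question, but `Stage`, `OutOfSequenceError`, and the method bodies are hypothetical placeholders, not the real processing code.

```python
from enum import Enum, auto


class Stage(Enum):
    ISOLATE = auto()
    EXTRACT = auto()
    REFINE = auto()
    DONE = auto()


class OutOfSequenceError(RuntimeError):
    """Raised when a processing step is called before its turn."""


class CommunityGraphProcessor:
    def __init__(self):
        # The minimal metadata: which step is allowed to run next.
        self.next_stage = Stage.ISOLATE

    def _expect(self, stage: Stage) -> None:
        if self.next_stage is not stage:
            raise OutOfSequenceError(
                f"{stage.name} called, but the next expected step is "
                f"{self.next_stage.name}"
            )

    def isolate_potential_communities(self) -> None:
        self._expect(Stage.ISOLATE)
        # ... real work would happen here ...
        self.next_stage = Stage.EXTRACT

    def extract_subgraphs(self) -> None:
        self._expect(Stage.EXTRACT)
        # ... real work would happen here ...
        self.next_stage = Stage.REFINE

    def refine_communities(self) -> None:
        self._expect(Stage.REFINE)
        # ... real work would happen here ...
        self.next_stage = Stage.DONE
```

Calling `extract_subgraphs()` on a fresh processor raises `OutOfSequenceError` with a message naming the step that should have run first, which is much easier to debug than a missing attribute further down the pipeline.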

That's encouraging. I was having a lot of difficulty fitting this problem into what I was taught in school. Thanks very much for the help.
– kdbanman May 13 '15 at 14:53

@kdbanman My pleasure. I just updated my answer with some more discussion of why it may be OK to design a class to be initialized in a seemingly "incomplete" state.
– Jordan Rieger May 13 '15 at 17:33