Sign up to like post

Share

Last time on Programming Feedback for Advanced Beginners, we analyzed a data processing script written by PFAB reader Frankie Frankleberry. Frankie's program helps his company scrutinize its product ranges to see if any items are mispriced. The program loads a big CSV of product data and flags any products meeting certain criteria, such as those with particularly low sales prices or profit margins. The company presumably uses the program's output to try to charge more money for the same stuff.

Last time we looked at how Frankie could use first-class functions to avoid having to use the evil eval function. This week we're going to look at some of his functions that have become overworked and responsible for too many different tasks. We'll see how we can take some weight off of them by migrating them into classes, and by the end of the episode Frankie's program will be looking positively cromulent.

Describing filters

Frankie's program uses "filters" to select a subset of his input data matching a criteria. For example, a filter could select all products in the luxury category with a price below $50. In order to document his filters for his users, Frankie wants to associate each filter function with a plain-English description of what the function does. This might be used as follows:

$ python3 filter.py list-filters
######################
## 7 active filters ##
######################
1. low_price_products: Find all products that cost less than $2
2. low_selling_items: Find all items that have not sold any
copies in the last week
3. ...etc...

In his code, Frankie wants each filter's description to be directly attached the function containing its logic. He doesn't want the functions to live in one part of his code and the descriptions in another. He doesn't want to have to write code like the following snippet, in which one function called run_filter is responsible for executing filters, and another completely separate function called get_description is responsible for keeping track of their descriptions:

This code is unpleasant for at least two reasons. First, future programmers have to be careful to keep the two if/elif blocks in run_filter and get_description exactly in sync. This is not being checked or enforced, and it will be easy to forget to update a branch if a filter name changes or a new one is added.

Second, it's difficult for a programmer working on the code to see which description is associated with which function, because they have to hop back and forth between run_function and get_description. It would be clearer and more robust if the description and logic for a function lived right next to each other and didn't need to be matched up by strings and parallel if-statements.

Uniting a filter's description and its logic is a worthy aim. But as we'll see, the way in which Frankie achieved this goal created new problems for him. Let's look at Frankie's solution, the problems it created, and how we can fix them.

Functions that do too much

Frankie bound his logic and descriptions together by requiring each filter function to accept a description_only boolean flag argument. If the function is called with description_only=False, the function filters the data as normal and returns the results as a dataset of some sort. But if it is called with description_only=True, it instead returns the description of the filter as a string. For example:

This approach tightly links logic and description, as desired, but at a high cost. The filter functions (like low_price_products) have become overworked. You should aim to have each component of your code be responsible for a single thing. Frankie's functions either return a string of the description, or the output dataset. A function that returns different types of data depending on the input it is given is by definition responsible for too much.

Functions should always return the same datatype

A rule of thumb:

Functions should always return the same data type (list, dictionary, integer, string, custom class, etc.), regardless of the arguments that they are given.

As with all things in life and software, there are exceptions. A function that takes a Twitter username and searches for their profile information could reasonably return either:

A dictionary of information if the username exists

Or nil if it the username doesn't exist

Anything fancier than this is likely a bad idea, because it means that your function is doing too much.

An aside on dynamically- and statically-typed languages

How easy or hard it is to write functions that return different data types depends on the programming language you are using. Dynamically-typed languages like Python and Ruby make it easy (arguably too easy). A loose but still helpful definition of a dynamically-typed language is one in which you don't specify in your code the data types of variables and function inputs and outputs. For example, in Python you write code that looks like:

# We don't have to specify the type of `a` and `b`,
# or the type that `multiply` returns.
#
# New versions of Python do allow you to specify
# *type-hints*, but we can safely ignore type-hints
# for the sake of this discussion.
def multiply(a, b):
return a * b
x = 3
y = 4
z = multiply(x, y)

By contrast, statically-typed languages like Go and Java require you to specify data types in your code (again, this is not the strict, textbook definition, but it's good enough). For example, you might rewrite the above Python code in Go as follows:

// We specify that `a` and `b` are integers, and that
// the return value of `multiply` is an integer too.
func multiply(a int, b int) int {
return a * b
}
// Sometimes the language's compiler can infer
// the type of a variable on its own.
x := 3
y := 4
z := multiply(x, y)

If we tried to write a filter function in Go that returned either a string or a dataset, the Go compiler would get upset when we tried to run our program:

// XXX: we have to specify in advance the data type that
// `lowPriceItems` will return. This means that we
// can't have it return either a string or a Dataset -
// we have to choose one!
func lowPriceItems(input Dataset, descriptionOnly string) ????? {
if descriptionOnly {
return "This filter returns low priced items"
} else {
// ... do some filtering of `input` ...
return outputDataset
}
}

There are ways around this restriction if we were determined enough, but they are unlikely to be a good idea. For example, we could make a type called FilterOutput with two fields called Description and Dataset. We could tell the Go compiler that the lowPriceItems function will return a FilterOutput object and set only the field corresponding to the type of data that we want to return:

To migrate this code to use this our Filter class, we'll convert the list of functions to a list of Filters; add descriptions; and have the code pass our input dataset through each filter in turn:

filters = [
Filter(
filter_f=low_price_items,
description="Find all products costing less than $2"
),
Filter(
filter_f=low_selling_items,
description="Find all items that have not sold any copies in the last week"
),
# ...etc...
]
input_data = load_data()
for f in filters:
print(f.description)
print(f.apply(input_data))

This approach has several advantages. First, it splits up a filter's description and logic so that they are no longer squashed into the same function. filter_f always returns a dataset; description is always a string. Second, it gives us a framework on which we can hang additional properties of filters in the future, such as:

The name of the filter

The name of the team who owns the filter

Who to email if the filter breaks

Where the filter should load its input data from

Eventually we may find ourselves with so many extra properties on Filter that we split them off into smaller classes like DataSource and FailureNotifier. But that's a story for another day.

In summary

Don't write functions that return drastically different data types. If you need to tightly couple several concepts or pieces of data together, consider hanging them off of a class instead.