Main menu

Post navigation

SQL for the Fantasy Football Knapsack Problem

Someone (FilippeSoaresRoza) asked a question 21 June 2013 on OTN about finding the best fantasy football team in SQL, Processing Cost - How to catch a soccer team with the highest combined score?. I saw that this was another knapsack problem, of the single container type. I had solved that problem on the forum before, and here, A Simple SQL Solution for the Knapsack Problem (SKP-1), so I decided to adapt the solution for this case. This is in fact a more general form of the problem, wherein the items now have categories, with constraints on the numbers in each category, and on the overall number of items. The first solution I posted provided an exact solution, as in the above article, and performed well enough on the simple sample data, returning in a few seconds. However, the poster reported that the query was still running on his full data set after a couple of hours. I therefore decided to look for a mechanism to reduce the work done by the query on what is a hard combinatorial problem, and to return 'good' solutions in a practical amount of time, but without guaranteeing optimality (I recently provided solutions like this for a related problem, SQL for the Balanced Number Partitioning Problem).

This article provides the SQL that does this, and also a PL/SQL package containing a pipelined function that applies a slightly different algorithm; the latter is also practical, although it proved less efficient on my test problems.

Note that this article is a re-write of the article I published 22 June, which had the exact solution approach, and also an earlier data model, closer to the poster's own.

Test Problems

I used two test problems.

Test Problem 1: Brazilian League
The first problem was supplied by the OTN poster and appears to be based on a Brazilian league. It has 114 players, in seven positions (one being coach), with twelve players forming a team. The problem is to find the team with maximum total player points within a given maximum price, and matching the positional constraints:

Note that I dropped the poster's formations based data model in favour of the above, more general one. I used AL as a code for team size, and chose the maximum price arbitrarily (but having an influence on results). I also multiplied the points and prices by a factor of 100 to allow me to work in integers.

Test Problem 2: English Premier League
The second problem is be based on English Premier League and I got the data from a 'scraping' web-site, https://scraperwiki.com/scrapers/fantasy_premier_league_player_stats/. There are some data quality issues with the data, but it is good enough for technical testing. I summed the players' points over the last season and took their values at the last week as prices.

After excluding zero-point players, there remained 576 players, of five positions, with eleven players forming a team, and the problem is the same, with the positional constraints:

The solution approach is based on the method used to provide exact solutions for knapsack problems in my earlier article, but with a number of extensions to cater for the new category constraints, and to reduce searching to manageable proportions.

players_ranked subquery: Computes a unique rank for the items, ordered by category, then profit descending

rsf subquery: A recursive subquery that returns a set of item sets in the form of strings of the concatenated item ids

rsf anchor branch: Initialises the recursion with a single record

rsf recursive branch: Items are joined having strictly higher rank, and such that the constraints are not violated, both at the current position and with any possible extrapolations

Row_Number is used to rank the records by overall profit, and the where clause excludes records from the previous iteration that have rank below an input figure;
this exclusion is what makes the computation practical; the ranking is partitioned by the category path, which is important to avoid closing off solution paths too early

Item category minima are treated differently from the maxima; once a category is in a position, the subsequent positions are required to be of the same category until the minimum number is reached

paths_ranked subquery: Excludes records that are not of full length,, and ranks those that are by profit

top_ten_paths subquery: Selects the top ten paths and cross-joins them with a row-generator to provide an indexed set of records with set size cardinality for each path

top_ten_teams subquery: Builds the item records for each of the best sets by extracting the item id from the paths according to index

The solution approach uses a modified depth-first recursion, following a similar idea to the SQL method, of adding items in strictly increasing order of category and profit ranking. Treatment of constraints uses similar ideas to the SQL solution.

The package is completely generic, with the items and categories being specified by means of input cursors

Depth-first is modified by a ranking of the next sets of feasible items, partitioned by category, in order to limit the number progressed

Hashes (associative arrays in Oracle) are used for ranking

A function, Dedup_Hash is used to allow for duplicate hash keys; it works by storing as key the actual key multiplied by the ranking set cardinality, then adding one iteratively until no duplication occurs

The recursion is truncated if the number of recursive calls exceeds an input limit

The input cursors are read into arrays and all subsequent processing is in memory; this is not a scalability problem because only the current best solutions are retained; also, I reset hashes after a given number of updates

Results

Test Problem 1: Brazilian League

The pipelined function solved this in 5 seconds, while the SQL solution solved it in 21 seconds. The solutions were identical, as follows:

I used a maximum price of 900, and a keep parameter of 40, meaning retain the best 40 records by partition during recursion for the SQL and a value of 10 for the pipelined function. The keep parameter operates differently in the two cases so does not need to be the same value.

The SQL solution took 98 seconds, while the pipelined function took 290 seconds. Both methods got the same best solution, but the tenth best was marginally better for the SQL, at 1965, compared with 1962 for pipelined function (which truncated after 100000 recursive calls).

[For followers of Manchester United, and David Moyes, it may be of interest to note that all of the best solutions included both Leighton Baines and Patrice Evra, and the best also had Marouane Fellaini. Will these players be united also in the real world next season?]

My idea for using recursive subquery factoring to solve combinatorial optimisation problems, such as knapsack problems, described in other articles on my blog, was previously only practical for small problems. The extensions described here render it a practical proposition even for larger problems. It is also relatively simple compared with procedural approaches.

2 thoughts on “SQL for the Fantasy Football Knapsack Problem”

Hey! Thanks for posting this, I'm trying to figure out how to modify this to work for american football in SQL server, instead of Oracle. I'm stuck where at the RSF portion of your query, I can't quite figure out how to replicate it. Do you have any suggestions? Thanks!!