ruaok: we were using RDDs for lookup. To use RDDs effectively we have to parallelize them, which I still need to understand better. RDDs are also slow. So we switched to DataFrames for lookups (a DataFrame defaults to 200 partitions) and they are fast.
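(For context, a minimal PySpark sketch of the two lookup styles described above; the sample data and column names are made up. The "200 partitions" figure is Spark's default `spark.sql.shuffle.partitions` setting, which controls how many partitions a DataFrame ends up with after a shuffle such as a join or aggregation.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lookup-example").getOrCreate()

# Hypothetical listen data; the real schema may differ.
listens = [("user_a", "rec-1"), ("user_b", "rec-2"), ("user_a", "rec-3")]

# RDD lookup: the collection must be parallelized by hand, then filtered.
listens_rdd = spark.sparkContext.parallelize(listens)
rdd_result = listens_rdd.filter(lambda row: row[0] == "user_a").collect()

# DataFrame lookup: Spark plans and optimizes the query (Catalyst/Tungsten),
# which is why it tends to be faster than equivalent RDD operations.
listens_df = spark.createDataFrame(listens, ["user_name", "recording_mbid"])
df_result = listens_df.filter(listens_df.user_name == "user_a").collect()

# The default number of post-shuffle partitions:
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" by default
```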

alastairp: Related: https://dependabot.com/blog/gemnasium/ : "Finally, for us, Gemnasium's blog post is a warning of what can happen to businesses in a platform ecosystem. We believe Dependabot adds a lot of value over GitHub's dependency graph, and over Gemnasium, but if GitHub were to replicate our functionality they would likely crush us. We don't believe that's in their interest, but are staying as close to them as possible." :)

alastairp: I did try to import the data dump, but I haven't been successful. I've used `./develop.sh run --rm webserver python2 manage.py import_data path_to_the_archive` and `./develop.sh run --rm webserver python2 manage.py init_db --force path_to_the_archive` before to import smaller archives that I made myself, but those were .tar.xz files and the .sql.bz2 dump doesn't work with them. Is that correct?

aidanlw17: I found something else while reading last week too: from https://github.com/spotify/annoy - "another feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share index across processes."

so we should definitely see whether this tradeoff applies to us: either sharing a static index across processes is what we want, or it might be more important for us to be able to update the index easily. I still think it's OK to start with Annoy, but if our requirements change we might want to look at this again. (See the sketch below.)
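(A minimal sketch of the static-file feature being discussed; the dimensions, filename, and data are made up. An Annoy index is built once, saved to disk, and then memory-mapped read-only by any number of processes, which is exactly why updating it means rebuilding and re-saving the whole file.)

```python
from annoy import AnnoyIndex
import random

DIMENSIONS = 40  # arbitrary vector size for this sketch

# Build the index in one process and write it to a static file.
index = AnnoyIndex(DIMENSIONS, "angular")
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(DIMENSIONS)])
index.build(10)           # 10 trees; more trees = better recall, bigger file
index.save("items.ann")   # hypothetical filename

# Any other process can mmap the same file and query it concurrently.
# A saved index is immutable: adding items requires rebuilding it,
# which is the update tradeoff mentioned above.
reader = AnnoyIndex(DIMENSIONS, "angular")
reader.load("items.ann")
print(reader.get_nns_by_item(0, 10))  # 10 nearest neighbours of item 0
```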

alastairp: So it could be helpful for us in terms of using the index from multiple processes at the same time, but if our update mechanism doesn't work out, we could look for something that supports updates rather than allowing multiple processes?