Statistical significance testing plays an important role in drawing
conclusions from experimental results in NLP papers. In particular, it is a
valuable tool for establishing the superiority of one algorithm over
another. This appendix complements the guide to statistical significance
testing in NLP presented in \cite{dror2018hitchhiker} by proposing valid
statistical tests for the common tasks and evaluation measures in the field.
