Abstract: Generating up-to-date, well-labeled datasets for machine learning (ML)
security models is a unique engineering challenge, as large data volumes,
the complexity of labeling, and constant concept drift make it difficult to
generate effective training datasets. Here we describe a simple, resilient
cloud infrastructure for generating ML training and testing datasets that has
increased the speed at which our team can research, and maintain in production,
a multitude of security ML models.