Data Engineering

We build scalable platforms for the collection, management, and analysis of data.

Systems and Tools

Probabilistic Record Linkage

Our team is developing scalable Apache Spark-based systems to disambiguate hundreds of millions of records that refer to the same entities across disparate internal and external data sources. By partitioning the data at scale and applying machine learning models trained to identify relations between records from different systems, we aim to unify many different datasets into a single queryable analytics resource that can return data describing individuals within MassMutual's information systems and beyond.
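The core idea, partitioning records into small candidate blocks so that expensive pairwise comparison only happens within each block, can be sketched in plain Python. The field names (`name`, `zip`), blocking key, and similarity threshold below are hypothetical stand-ins; the production system runs on Apache Spark with trained models rather than a string-similarity heuristic.

```python
# Illustrative blocking-based record linkage; field names and the
# threshold are hypothetical, not the production configuration.
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(record):
    # A cheap partitioning key: first letter of name + ZIP prefix.
    # Records can only match if they land in the same block.
    return (record["name"][:1].lower(), record["zip"][:3])

def similarity(a, b):
    # Stand-in pairwise scorer; a real system would use a trained model.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def link_records(records, threshold=0.85):
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    matches = []
    for group in blocks.values():
        # Pairwise comparison is quadratic, but only within each block.
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if similarity(group[i], group[j]) >= threshold:
                    matches.append((group[i]["id"], group[j]["id"]))
    return matches

records = [
    {"id": 1, "name": "Jane Doe", "zip": "01001"},
    {"id": 2, "name": "Jane  Doe", "zip": "01001"},
    {"id": 3, "name": "John Smith", "zip": "02101"},
]
print(link_records(records))  # [(1, 2)]
```

On Spark, the blocking key plays the role of a partitioning key, so candidate pairs for hundreds of millions of records can be generated in parallel without comparing every record to every other.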

Data Ingestion Pipelines

Our analytics platform ingests data from sources ranging from relational databases and mainframe extracts to log files, images, and tweets. We build and deploy systems based on Apache Spark and other tools from the Hadoop ecosystem. We use Jenkins for job scheduling and continuous integration/delivery, Git for source control, Ansible for configuration management, virtualenv for Python environments, and Docker for deployment and testing.
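A common first step in such pipelines is normalizing heterogeneous source formats into a shared record schema before loading. The sketch below is a minimal, hypothetical illustration in plain Python; the source names, field names, and schema are assumptions, and the production pipelines described above do this work with Apache Spark.

```python
# Illustrative normalization of heterogeneous inputs into a common
# schema; source and field names here are hypothetical.
import csv
import io
import json

def from_csv(source_name, text):
    # Relational and mainframe extracts often arrive as delimited files.
    for row in csv.DictReader(io.StringIO(text)):
        yield {"source": source_name, "record_id": row["id"], "payload": row}

def from_json_lines(source_name, text):
    # Log files and tweets are commonly newline-delimited JSON.
    for line in text.splitlines():
        obj = json.loads(line)
        yield {"source": source_name, "record_id": str(obj["id"]), "payload": obj}

csv_extract = "id,name\n42,Jane Doe\n"
log_lines = '{"id": 7, "event": "login"}\n'

unified = list(from_csv("mainframe", csv_extract)) + \
          list(from_json_lines("applog", log_lines))
print([r["record_id"] for r in unified])  # ['42', '7']
```

Once every source emits the same record shape, downstream jobs (scheduled by Jenkins in our case) can be written once against the common schema rather than per source.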