I'm pretty new to machine learning, so I'm wondering whether someone can check my thinking or point me in the right direction!

I need to build a classifier that predicts an outcome for a person based on attributes of that person plus multiple time series containing activity data for each person.

After a lot of research, I have concluded that the best approach is to reduce the dimensionality of each time series. I have chosen Symbolic Aggregate approXimation (SAX) to generate a symbolic encoding of each time series for each person (e.g. "abfaadda").
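To make the encoding step concrete, here is a minimal sketch of SAX as I understand it: z-normalise the series, average it into equal-width segments (PAA), then map each segment mean to a letter using breakpoints that split a standard normal distribution into equally likely regions. The function name `sax_encode` and its parameters are my own for illustration, not from any particular library, and it assumes the series has at least `n_segments` points.

```python
import numpy as np
from statistics import NormalDist


def sax_encode(series, n_segments=8, alphabet_size=4):
    """Illustrative SAX encoder (hypothetical helper, not a library API)."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    # z-normalise; a constant series is mapped to all zeros
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)
    # PAA: mean of each of n_segments roughly equal chunks
    paa = np.array([chunk.mean() for chunk in np.array_split(z, n_segments)])
    # breakpoints that cut N(0, 1) into alphabet_size equiprobable regions
    breakpoints = [
        NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)
    ]
    # index of the region each segment mean falls into (0 -> 'a', 1 -> 'b', ...)
    indices = np.searchsorted(breakpoints, paa)
    return "".join(chr(ord("a") + int(i)) for i in indices)


word = sax_encode(list(range(16)), n_segments=4, alphabet_size=4)
```

A steadily increasing series like the one above should come out as letters in ascending order, and the word length always equals `n_segments`.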

This means the input to my classifier will look as follows, where each case is a person:

Gender | Ethnic group | Age | ... | Time series 1 SAX encoding | Time series 2 SAX encoding | ... | Class attribute

SAX defines a distance measure on the encoded representation of a time series, which can be used to cluster time series across multiple cases or to classify a single time series.
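For reference, a rough sketch of that distance (usually called MINDIST) as I understand it from the SAX literature: letters that are equal or adjacent contribute zero, and otherwise the gap between the relevant breakpoints is used. The function name `sax_mindist` is hypothetical, and the sketch assumes both words were produced with the same alphabet size from series of the same original length `series_length`.

```python
import math
from statistics import NormalDist


def sax_mindist(word_a, word_b, series_length, alphabet_size=4):
    """Illustrative MINDIST between two SAX words (hypothetical helper)."""
    if len(word_a) != len(word_b):
        raise ValueError("SAX words must have the same length")
    # same equiprobable N(0, 1) breakpoints used when encoding
    breakpoints = [
        NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)
    ]
    total = 0.0
    for ca, cb in zip(word_a, word_b):
        lo, hi = sorted((ord(ca) - ord("a"), ord(cb) - ord("a")))
        # equal or neighbouring symbols are treated as distance zero
        if hi - lo > 1:
            total += (breakpoints[hi - 1] - breakpoints[lo]) ** 2
    # scale by sqrt(n / w): original length over word length
    return math.sqrt(series_length / len(word_a)) * math.sqrt(total)


d = sax_mindist("abfa"[:2] + "dd", "ddaa", series_length=16, alphabet_size=4)
```

Because the result is a pairwise distance between strings rather than a feature value, it fits naturally into distance-based methods (clustering, nearest neighbour) but not directly into a classifier that expects one column per attribute.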

The issue is that I have multiple encoded time series as inputs to the classifier. The classifier needs to take into account the similarity between encoded time series, rather than just splitting on the encoded string as if it were an arbitrary category. It also needs to take the other attributes into account.

The solution I have come up with is to cluster each SAX-encoded time series and then manually label each cluster (e.g. Low Activity, High Activity). The cluster label would then be used as the input to the classifier, i.e.