Remove Duplicate Examples

I'm working on some genetics data. I have 6151 examples and 157 attributes. My attributes are patient IDs and my examples are gene names. My goal is to transpose the matrix table. Here is a sample of my data set:

My problem now is that I can't use the "Transpose" operator because there are duplicate row/example names. To transpose the table, the example names need to be unique, because they become the attribute names. I would like to find all the examples that share the same name and edit their names. I was thinking about using a loop, but I don't really know where to start or which operators to use to change the row names. Can somebody give me some advice on how to achieve this?

Nice challenge, but honestly, I don't see any way to do this automatically with RapidMiner's native operators. However, there is a (relatively) simple solution using a Python script. Basically, the script appends a number to the name of each duplicate, and this number is incremented with each further occurrence of that name. Concretely, the output example set looks like this:

After executing this process, all the values of the "gene_name" attribute are unique, and thus you can transpose your example set.

To execute this process, you need to:
- Install Python on your computer
- Install the Python Scripting extension in RapidMiner (from the Marketplace)
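For readers who can't open the attached process, here is a minimal sketch of what such a script might look like. It is not the original script from the process. The helper name `make_names_unique` and the "_" separator are my own assumptions, and the sketch assumes the script runs in the Execute Python operator, which passes the example set to `rm_main` as a pandas DataFrame.

```python
from collections import Counter


def make_names_unique(names, sep="_"):
    """Append a running number to every name that occurs more than once.

    Hypothetical helper for illustration; names that occur only once
    are left unchanged.
    """
    names = [str(n) for n in names]
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # how often each name has been seen so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}{sep}{seen[name]}")
        else:
            result.append(name)
    return result


def rm_main(data):
    # Assumption: 'data' is the pandas DataFrame handed over by the
    # Execute Python operator and it has a "gene_name" column.
    data["gene_name"] = make_names_unique(data["gene_name"].tolist())
    return data
```

For example, `make_names_unique(["A", "B", "A", "C", "A"])` returns `["A_1", "B", "A_2", "C", "A_3"]`. Note that the sketch doesn't check whether a generated name such as "A_1" already exists in the data, so pick a separator that doesn't occur in your gene names.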

If you don't care about extra characters in the new id, then you could also simply use Generate ID to create a numerical index, and then use Generate Attributes to concatenate that with the existing Gene Name into a Gene Name / ID hybrid, and then use Transpose with that new field serving as the id. This is similar to yy's solution.
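To illustrate what the resulting hybrid names would look like, here is a small Python sketch. It only mimics the result of Generate ID followed by a concatenation and is not RapidMiner expression syntax. The function name `hybrid_names`, the "_" separator, the numbering starting at 1, and the sample gene names are assumptions made up for the illustration.

```python
def hybrid_names(names, sep="_", start=1):
    """Combine each name with a running row index.

    Every name gets a suffix, whether or not it is a duplicate,
    so the result is always unique.
    """
    return [f"{name}{sep}{i}" for i, name in enumerate(names, start=start)]


# Made-up sample names, only to show the shape of the output
print(hybrid_names(["geneA", "geneB", "geneA"]))
# prints ['geneA_1', 'geneB_2', 'geneA_3']
```

The trade-off compared with the Python script is that every gene name is changed, not only the duplicated ones.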