Code snippet

As the code above shows, it’s trivial to generate your own 1/0 columns of data instead of relying on Factors. There are two things to keep in mind when creating your own dummy variables:

The problem you are trying to solve

How much RAM you have available

While it may make sense to generate dummy variables for Customer State (~50 for the United States), if you were to use the code above on City Name, you’d likely either run out of RAM or find out that there are too many levels to be useful.

Of course, with any qualitative statement such as “too many levels to be useful”, oftentimes the only way to definitively know is to try it! Just make sure you save your work before running this code, just in case you run out of RAM. Or, use someone else’s computer for testing

Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:

@randyzwitch If you're running out of RAM with dummy variables, you probably want to use a sparse matrix instead of a data.frame.