Gabi over at The Stata Things posts a neat Stata trick using a profile.do file, which is a file of commands that Stata runs each time it starts. You place some commands in a file, name it profile.do, and save it in a location where Stata knows to look for it. When Stata starts, it scans the following locations in order:

Directory where Stata was installed;

Current directory;

along your Unix (or Mac) PATH or in your HOME directory on Windows;

along the adopath.

I have my profile.do saved in ~/ado/personal/, which is as good a place as any. The only task I keep in my profile.do file is to boost the memory up to a couple of gigs, since I have plenty of memory on my computer and I don’t like dealing with out-of-memory errors. This is simply:

set mem 2048m

Gabi uses his profile.do file to log all commands to a file named with the current date so that he can review his interactive commands and copy-paste them into a do-file or ado program later, which is a good plan. I used to log all commands and output to a file from my own profile.do as well.
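As a rough sketch of that kind of logging (the log directory and file naming here are assumptions for illustration, not a record of any particular setup), a date-stamped log could be opened from profile.do like this:

```stata
* Sketch for profile.do: append all commands and output to a dated log.
* The log directory is a hypothetical location and must already exist.
local logdir "~/ado/personal/logs"
* c(current_date) looks like "14 Jun 2011"; strip the spaces for a file name
local today = subinstr("`c(current_date)'", " ", "", .)
log using "`logdir'/stata_`today'.log", append text
```

Using -cmdlog using- in place of -log using- would record only the commands typed, which keeps the files much smaller.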

Ultimately, I decided to drop this from my profile.do file because the log files were getting huge and I had to keep going in to clean things up. I also never found myself using the logs.

Another trick that I used for a time was to set a bunch of global macros every time Stata starts and use them to hold the directory locations I commonly used. For a while I was putting all my Stata work into one directory, which was further broken down into sub-folders for raw data, logs, do-files, output, graphs, etc., with one global per sub-folder.
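A minimal sketch of such a setup follows. The root path and all global names other than RAW are made up for illustration; the trailing slashes matter because the globals are pasted directly in front of file names.

```stata
* Hypothetical directory layout; adjust the root path to your own machine
global STATA  "~/stata/"
global RAW    "${STATA}raw/"
global LOGS   "${STATA}logs/"
global DO     "${STATA}do/"
global OUTPUT "${STATA}output/"
global GRAPHS "${STATA}graphs/"
```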

Then anytime I wanted to load a raw data file called data.dta I’d type in:

use "${RAW}data.dta", clear

I ended up dropping this bit from my profile.do file as well, since the number of projects I worked on exploded and the files in each of these folders got out of hand. Now I have a projects folder with subdirectories by project.

I’m sure there are all kinds of other profile.do hacks out there, so if you have one, please share in the comments or by e-mail.

A recent post on the eKonometrics blog outlines some tests of the performance of the R programming language against the same statistical models run in Stata. The author, noting that he didn’t attempt any optimization of the R routines, found that Stata is around 5 to 8 times faster than R in the models he tested (multinomial logit, ordered logit, and generalized logit).

Although I use R quite a bit, this is one of the reasons that I prefer to use Stata for many tasks. R is free, but my time isn’t, so I’d much rather spend some $ to save some time than wait for my computer to catch up to me.

A test

The results linked to above are interesting, but many people don’t use multinomial or ordered logit models, so I feel that the question of speed is better answered with a task that everyone encounters, such as loading data. Below I post the results from my tests of loading three data files with 100, 10,000, and 1,000,000 observations of three numerical variables on my dual-core Intel laptop running Ubuntu 11.04.

Observations    Stata load time    R load time
100             1.6s               0.02s
10,000          0.1s               0.1s
1,000,000       7.2s               25.4s

I was surprised to see these results because R handily beats Stata in loading 100 observations, ties with 10,000 observations, and does much, much worse with the 1,000,000-observation data file. From these results (taken with a large grain of salt), Stata could improve on small data files and some work could be put into R when it comes to large data sets.
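For anyone who wants to try something similar on the Stata side, here is one way a load could be timed with Stata’s built-in -timer- command. This is only a sketch: it builds a throwaway dataset of three random variables rather than using my test files, and the variable names are made up.

```stata
* Sketch only: time how long it takes to load a 10,000-observation file
clear
set obs 10000
generate x1 = runiform()
generate x2 = runiform()
generate x3 = runiform()
tempfile testdata
save "`testdata'"

timer clear 1
timer on 1
use "`testdata'", clear
timer off 1
timer list 1    // elapsed seconds for timer 1
```

Changing the number in -set obs- gives the other file sizes; averaging over several runs would make the numbers less noisy.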

One last comment
My analysis here is very basic and I hope it doesn’t sway anyone’s decision to use one programming language over the other, as I feel that there is a really important component missing from the usual programming language head-to-head. That is, even if execution time varies greatly between languages, I feel that this is only a minor component of development time.

When I work on a project, the problem-solving, program writing and debugging time are the most important to get under control because for the most part, my output is not used in a real-time product. Rather, the analysis is generally used as background to a report or decision making process. In this light, execution time is such a small component of getting me from start to finish that I tend not to be too concerned about it. Anything that can get me from raw data to results will always get the bulk of my attention. That is why, although I love R and the amazing community behind it, when it’s crunch time, Stata is the most trusty trick up my sleeve.

I recently wrote up a quick little bot to download Bitcoin (BTC) trade data from mtgox.com using their API. The data was very easy to parse using Python and I wrote the data to a tab-delimited text file for analysis using Stata as I’m interested in the bitcoin market and considering purchasing some BTC. I’ve uploaded the dataset to Buzzdata if you are interested in Bitcoin as well or in following along with this post.

A quick -twoway- plot shows an interesting trend, but the x-axis label is nonsensical to (most) humans. Below I’ll clean up the x-axis label by converting the Unix timestamp to something Stata can use and then formatting the x-axis.

I read a little of an excellent post on the Stata blog about dates and times from other software, but this didn’t address unix timestamps explicitly. The approach is similar to that used with the SAS conversion as mentioned in that post but with the difference that unix timestamps are seconds since January 01, 1970. So to convert a unix timestamp to a Stata clock (%tc) formatted variable one could use something like:
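Here is a sketch of that conversion, run on a single made-up timestamp so that it works on its own. The variable name datetime is taken from the description of the data; statatime is a name chosen here for illustration. There are 3,653 days between January 1, 1960 and January 1, 1970 (ten years, three of them leap years).

```stata
* Sketch with one made-up observation; in the real data, datetime is
* assumed to be the variable holding the Unix timestamp in seconds
clear
set obs 1
generate double datetime = 1307000000
generate double statatime = datetime*1000 + msofhours(24*3653) - msofhours(5)
format statatime %tc
list
```

The new variable has to be stored as a double, since millisecond clock values are too large to be held exactly in a float.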

The important bit is the line that generates the new variable. Multiplying datetime by 1000 accounts for Stata’s measurement of time in milliseconds, the -msofhours()- part accounts for the number of days between January 1, 1960 and January 1, 1970, and the last part subtracts five hours from the time to express it in Eastern Standard Time (EST), which is GMT – 5:00.

This still isn’t quite what I want because now the x-axis of my graph will show the times and get really cluttered, so I create one more date variable which is formatted daily (%td) and use that in my plot below.
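A small sketch of that last step, using three made-up clock values 12 hours apart (the variable names statatime and date are assumptions for illustration):

```stata
* Sketch: collapse a %tc clock value down to a %td daily date
clear
set obs 3
generate double statatime = clock("01jun2011 09:30:00", "DMYhms") + (_n - 1)*msofhours(12)
format statatime %tc
generate date = dofc(statatime)    // dofc() converts a clock value to a daily date
format date %td
list
```

The daily variable can then go on the x-axis of the -twoway- plot in place of the clock variable.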

One last thing to mention: there is a big (or not so big, depending on how specific you need to be) difference between Stata’s %tc and %tC formats. I use the first, while the second accounts for leap seconds. For more details, you can read more on the Stata blog or in the Stata help files on dates and times.

I just replied to an older, but unanswered, Stata post on Stackoverflow that dealt with reading .sql files in Stata. The poster wanted to take a dataset in the form of a .sql file and import it into Stata. A .sql file is just a collection of SQL commands stored in a text file. These files can be used to back up a database and can be run in bulk to re-create a database.

To my knowledge, it’s not possible to import from a .sql file directly, so I suggested that the poster use Stata to import the .sql file into a database and then load it into Stata using -odbc-. The commands for doing this would be:
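A sketch of the two steps, using the placeholder names dataset.sql, DataSourceName, and MyTable (it assumes the ODBC data source has already been set up and that the database user is allowed to run the statements in the file):

```stata
* Step 1: run every SQL statement in the file against the database
odbc sqlfile("dataset.sql"), dsn("DataSourceName")

* Step 2: pull a table from that database into Stata's memory
odbc load, exec("SELECT * FROM MyTable") dsn("DataSourceName") clear
```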

where dataset.sql is the name of the file containing your set of SQL commands, DataSourceName is the name of an ODBC connection set up using your ODBC manager of choice for your operating system, and MyTable is a table in the database from which you’d like to load data.

Although this may be slow to import the data, especially if the database is large, using a database has one significant advantage over loading data from a text file in that the database can be queried. So, instead of loading an entire dataset into Stata’s memory, one can load a subset using the SQL language, which I won’t get into at this time.

So, am I right that Stata cannot read a .sql file directly? If so, would you approach this problem in the same way or do something different?

In Stata, both interactively as well as in do-files, I use the -if- statement as a form of subsetting a dataset while keeping the full dataset in memory. An example of this would be using the auto dataset. You may want to list all datapoints with mpg above 30 in the terminal for inspection. You can do this using:

list * if mpg > 30

Now suppose you would like to subset your data further by using values of foreign and rep78, you could end up with something like:

list * if mpg > 30 & foreign == 1 & rep78 == 5

The same strategy using -if- can also be used to apply a function to a subset of your data. For example, regressing mpg on weight for foreign-made vehicles with a 1978 repair record of three or higher:

reg mpg weight if foreign == 1 & rep78 >= 3

Where using regular expressions in Stata can help here is, for example, if one wanted to regress mpg on weight for all Datsuns, Pontiacs and Toyotas. Using the -if- statement strategy above one would have to type:
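One way to write that long form is sketched below. It tests the start of make with -substr()- once per manufacturer (Pontiac is abbreviated to "Pont." in the auto dataset); listing every make by its full name would be another option and longer still.

```stata
sysuse auto, clear
* Long form: one prefix test per manufacturer
* (/// continues a command onto the next line in a do-file)
regress mpg weight if substr(make, 1, 6) == "Datsun" ///
    | substr(make, 1, 4) == "Pont" ///
    | substr(make, 1, 6) == "Toyota"
```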

You can see it starts to get quite long. One could use -regexs- and -regexm- to create two new variables, manufacturer and model, out of make, but then you’d still have to choose the three manufacturers in the if statement.

The regex hack I propose here is to use -regexm- instead of chaining multiple conditions in the if statement. Instead of the code above, one could write:

reg mpg weight if regexm(make, "(Datsun|Pont|Toyota)")

which will match any observation whose make contains one of the three terms inside the parentheses, which are separated by the | symbol.

I have been using -regexm- quite a bit lately and I’ve found that besides saving some typing time, do-files are shorter and easier to follow.

A recent discussion (closed group/registration required) on the Stata users LinkedIn group highlights the use of recode to create 5 year periods in a panel dataset. The question asks how to take yearly data and create a variable that contains the 5-year average of some data.

The first step is to recode your year data into 5-year periods. One does this by running:

recode year (2001/2005 = 2001) (2006/2010 = 2006), gen(year2)

Then, to take care of creating averages of X by period one uses the -collapse- command by running:

collapse (mean) X, by(year2 id)

where id is a unique identifier for each cross-sectional entity in the dataset. Good luck!
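To see the two steps work end to end, here is a sketch on a tiny made-up panel: two entities observed yearly from 2001 to 2010, with X simply set to the row number. Each period is labelled by its first year, which is a choice made for this example.

```stata
* Made-up panel: 2 entities x 10 years, X = 1..20
clear
set obs 20
generate id = ceil(_n/10)
generate year = 2001 + mod(_n - 1, 10)
generate X = _n

* Step 1: map each year to the first year of its 5-year period
recode year (2001/2005 = 2001) (2006/2010 = 2006), gen(year2)

* Step 2: one row per entity and period, holding the period mean of X
collapse (mean) X, by(year2 id)
list
```

Note that -collapse- replaces the data in memory, so save the yearly dataset first if you still need it.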

I have been a Stata user for about 5 years now and no matter what other programming language I try, I keep coming back. It’s very easy to learn and gets the job done quickly. There is also a great deal of information on the web if one is having trouble with some topic. I hope that this blog can contribute to the body of information on the web about using Stata. Grab my RSS feed if you want to see some Stata tips and tricks in your reader roughly once a week.