tag:blogger.com,1999:blog-43530815428899002112018-03-07T15:51:21.861-08:00R in the AntipodesMichellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comBlogger9125tag:blogger.com,1999:blog-4353081542889900211.post-88838553443590137002015-03-26T14:43:00.004-07:002015-03-26T14:43:41.504-07:00Getting variable labels in R, from SPSS<span style="font-family: Arial,Helvetica,sans-serif;">When using R, we may need find our data has been saved in a different statistics package. While there are some export functions in other statistical software that will export to a different filetype, or we may simply use a .csv file, R can import some datasets from their native filetype.</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">SPSS is one of those filetypes. SPSS datafiles have a .sav extension, and we can import these into R using the <span style="font-family: &quot;Courier New&quot;,Courier,monospace;">foreign </span>package. This package is installed by default as part of the R core installation.</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">Ensure the foreign library is attached:</span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">library(foreign)</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">There is a nifty trick to getting the filepath for the SPSS datafile you wish to import, use:</span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">file.choose()</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">Copy and paste the filepath into this code:</span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">dataset = read.spss("[filepath including filename goes here]", to.data.frame=TRUE)</span><br /><span style="font-family: Arial,Helvetica,sans-serif;">The option at the end creates the R file as a dataframe, which is the type of data object I want in R.</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">Note: I am using</span> <span style="font-family: &quot;Courier New&quot;,Courier,monospace;">dataset </span><span style="font-family: Arial,Helvetica,sans-serif;">as my dataset name in this example. Use whatever name is best for you, and remember to change all instances of</span> <span style="font-family: &quot;Courier New&quot;,Courier,monospace;">dataset </span><span style="font-family: Arial,Helvetica,sans-serif;">to your actual dataset name in later code. </span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">Unfortunately, if your SPSS datafile had variable labels (e.g. "Sex of respondent"), these aren't shown in the R dataframe, only the variable names are shown (e.g. Sex). While the name is often clear for variables such as sex, you may find that the names are less clear for other options (e.g. for a survey containing multiple "select all that apply" type questions/responses). It is therefore very useful to have the list of variable names and their associated labels.</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">You can simply print the concordance to the console by using:</span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">attr(dataset, "variable.labels")</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">I didn't find this helpful, for two reasons:</span><br /><ul><li><span style="font-family: Arial,Helvetica,sans-serif;">I have a lot of variables, so it takes up a sizeable amount of console space</span></li><li><span style="font-family: Arial,Helvetica,sans-serif;">I am going to keep referring to the labels when I need to do analyses, and retaining the information in the console is not helpful if I have to keep scrolling back, or reissuing the command</span></li></ul><span style="font-family: Arial,Helvetica,sans-serif;">I found it much more efficient to output the variable names and their labels to a separate dataframe that I can use:</span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">dataset.labels &lt;- as.data.frame(attr(dataset, "variable.labels")) </span><br /><span style="font-family: Arial,Helvetica,sans-serif;"><br /></span><span style="font-family: Arial,Helvetica,sans-serif;">Voila, 5 lines of code to get my SPSS data and variable labels into R.</span><br />Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-76572813255530660372013-05-14T23:13:00.002-07:002013-05-14T23:13:30.339-07:00Easier confidence interval estimation with matrices and similar arrays in RWhen dealing with survey data in particular, social scientists are often wanting to produce proportions from the data, and associated confidence intervals. The prop.test command in R can be used to generate the desired results. When dealing with small tables, such as 2x2, prop.test is easily applied to the output of an xtabs command, e.g.<br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;">load(MASS)</span><br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;">test &lt;- xtabs(~low+smoke, data=birthwt)</span><br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;">test</span><br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;">#get 95% CI estimate of the proportion of low birth weight babies who had mothers who smoked during pregnancy</span><br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;">prop.test(test[4], test[2]+test[4], 0.05)</span><br /><br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;"><span style="font-family: Arial, Helvetica, sans-serif;">We can see that the xtab output can be referenced in the prop.test command by putting in the cell reference information for the data. The cell references are column-by-column, so go down each row in the first column, before moving onto the second column. We can also see that this type of prop.test command is quite short when the xtabs output is relatively small.</span></span><br /><br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;"><span style="font-family: Arial, Helvetica, sans-serif;">Issues arise when there are multiple rows or columns in the xtabs output. For example, if we wanted to examine number of physician's visits in the first trimester by mother's age:</span></span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">test2 &lt;- xtabs(~age+ftv, data=birthwt)&nbsp;</span><br /><span style="font-family: &quot;Courier New&quot;, Courier, monospace;">test2&nbsp;</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">Now if we want to know the proportion and associated CIs for each age of the mother, for each number of physician visits, there are 24 cells to sum for each physician visit count. This is way too long to enter into our prop.test; while prop.test can handle summing all these numbers, the R syntax will be hard to understand.</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">This is where colSums is a handy command. The column sums for test2 are simply calculated using:</span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">colSums(test2)</span><br /><br /><span style="font-family: Arial,Helvetica,sans-serif;">&nbsp;This command can be easily used in prop.test. For example, to get the proportion of mothers who saw no physical, who were 19 years of age, the prop.test command is:</span><br /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">prop.test(test2[6], colSums(test2)[1], 0.05)</span><br /><span style="font-family: Arial,Helvetica,sans-serif;"><br /></span><span style="font-family: Arial,Helvetica,sans-serif;">Because the count we want is cell number 6 in test2, and the colSum we want is for the first column. We can instruct R to conduct the colSums test within prop.test, or alternatively we could assign the colSums result to an object, and then reference the cell in the object.</span><br /><span style="font-family: Arial,Helvetica,sans-serif;">&nbsp;</span><br /><span style="font-family: Arial,Helvetica,sans-serif;">&nbsp;</span>Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-69158892630547398972013-01-04T13:14:00.000-08:002013-01-04T13:14:15.834-08:00Shortening code in R: "with" and "within" are your friendsIf you're anything like me, you use words to name both data frames and variables. I use metadata documentation to keep track of new variables I have constructed, and the logic I have used to construct them (and why I bothered!) but I find many variable names (such as var1) or using cutdown phrases (such as &nbsp;F1519Inc for an income variable for females aged 15 to 19 years) to describe a variable become relatively noninformative and confusing as time passes since I constructed my variables.<br /><br />The main issue with using longer dataframe and variable names is that they take up so much space in the code. You can end up spending more time trying to lay out your code so it is readable, and less time thinking and doing analyses. But R comes with two handy tools for shortening code:<br /><br /><ul><li>with</li><li>within</li></ul><div>I had been using R for a few years before I found out about these two handy commands, and now I use them all the time.</div><div><br /></div><div>Let's work through these with an example. Because R comes with the package MASS present, I'll use data supplied with MASS. We're going to use "minn38" which is data relating to &nbsp;Minnesota high school graduates of 1938. I'm using social science data as the example as this is the type of data I use most often.</div><div><br /></div><div>First we need to get MASS into our workspace. Once we load the package in, the MASS datasets become available for us to use:</div><div><span style="font-family: 'Courier New', Courier, monospace;">library("MASS")</span></div><div><br /></div><div>and we check we can see the <span style="font-family: Courier New, Courier, monospace;">minn38 </span>package:</div><div><span style="font-family: 'Courier New', Courier, monospace;">str(minn38)</span></div><span style="font-family: Courier New, Courier, monospace;"><br /></span>Now let's construct a new data frame off this one with longer, more informative names:<br /><span style="font-family: 'Courier New', Courier, monospace;">names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="hs"]&lt;-"High.School.Rank"</span><br /><br /><span style="font-family: Courier New, Courier, monospace;">names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="phs"]&lt;-"Post.High.School.Status"</span><br /><span style="font-family: Courier New, Courier, monospace;">names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="fol"]&lt;-"Father.Occup.Level"</span><br /><span style="font-family: Courier New, Courier, monospace;">names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="f"]&lt;-"Count"</span><br /><br />if you run the <span style="font-family: Courier New, Courier, monospace; font-size: x-small;">str </span>command on the new dataframe, you will see that it contains the updated variable names.<br /><br />Now, just for fun, let's create a new variable that interacts <span style="font-family: Courier New, Courier, monospace;">Sex </span>and <span style="font-family: Courier New, Courier, monospace;">High School Rank</span>. We can use the <span style="font-family: Courier New, Courier, monospace;">ifelse </span>command. Because the names are so long in this example, I've only coded for two levels and then cheated by setting every other level to "Other".<br /><span style="font-family: Courier New, Courier, monospace;">Minn.High.School.Grad.1938$Rank.by.Sex &lt;- ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" &amp;&nbsp;</span><span style="font-family: 'Courier New', Courier, monospace;">Minn.High.School.Grad.1938$sex=="F", "Low Rank Female",</span><br /><br /><span style="font-family: Courier New, Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" &amp;</span><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;Minn.High.School.Grad.1938$sex=="M", "Low Rank Male",</span><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;"Other"))</span><br /><br />That's some loooooooooooong code lines. So, the pro of using long data frame and variable names is that you can easily see what data frame and what variable you should use. The con is that it makes your code so much harder to lay out.<br /><br />Let's shorten the code by using <span style="font-family: Courier New, Courier, monospace;">with</span>. The R description of <span style="font-family: Courier New, Courier, monospace;">with </span>is available&nbsp;<a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/with.html" target="_blank">here</a>. We can remove the repeated call to the data frame name in our <span style="font-family: Courier New, Courier, monospace;">ifelse </span>statements, because the <span style="font-family: Courier New, Courier, monospace;">with </span>command is telling R that we are using that single data frame for all variable calls:<br /><span style="font-family: 'Courier New', Courier, monospace;">Minn.High.School.Grad.1938$Rank.by.Sex2 &lt;- with(Minn.High.School.Grad.1938, {ifelse(High.School.Rank=="L" &amp;&nbsp;</span><br /><br /><span style="font-family: Courier New, Courier, monospace;">&nbsp; sex=="F", "Low Rank Female",</span><br /><span style="font-family: Courier New, Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ifelse(High.School.Rank=="L" &amp; sex=="M", "Low Rank Male",</span><br /><span style="font-family: Courier New, Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; "Other"))})</span><br /><br />Alternatively, the within command can be used to the same end:<br /><br /><div><span style="font-family: 'Courier New', Courier, monospace;">Minn.High.School.Grad.1938 &lt;- within (Minn.High.School.Grad.1938, {</span></div><div><div><span style="font-family: Courier New, Courier, monospace;">&nbsp; Rank.by.Sex2 &lt;- ifelse(High.School.Rank=="L" &amp; sex=="F", "Low Rank Female",</span></div><div><span style="font-family: Courier New, Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ifelse(High.School.Rank=="L" &amp; sex=="M", "Low Rank Male", "Other"))})</span></div></div><div><br /></div><div>The <span style="font-family: Courier New, Courier, monospace;">within </span>method is preferred if you have a lot of variables to recode simultaneously, as you're only specifying the data frame at the start. All the other variables can be inserted ahead of the closing curly brackets.</div><div><br /></div><div>Key points to note:</div><div><ul><li>there must be a comma following the data frame name</li><li>both commands uses curly brackets, so remember to change your bracket type</li><li>the <span style="font-family: Courier New, Courier, monospace;">with </span>command can be used in situations other than data preparation, e.g. in the formula for a regression, see the R examples</li><li>the two commands are only functionally equivalent in this example because the <span style="font-family: Courier New, Courier, monospace;">with </span>command is being used to construct a permanent variable, there are examples (e.g. in regressions) where <span style="font-family: Courier New, Courier, monospace;">with </span>has a transient effect&nbsp;</li></ul></div><br /><br /><br /><br /><div></div>Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-41554205555294092362012-03-27T15:22:00.003-07:002012-03-27T15:22:59.042-07:00R: ifelse statements with multiple variables and NAsifelse statements in R are the bread and butter of recoding variables. Normally these are pretty easy to do, particularly when we are recoding off one variable, and that variable contains no missing values. There are lots of examples on how to do this simple coding already available, so I will simply redirect you to the posts&nbsp;<a href="http://www.statmethods.net/management/variables.html" target="_blank">here</a>&nbsp;and&nbsp;<a href="http://rwiki.sciviews.org/doku.php?id=tips:programming:ifelse" target="_blank">here</a>, which are excellent summaries of how to do simple ifelse coding.<br /><br />Things get more complicated when:<br /><br /><ul><li>the recoding is defined off more than one variable, and</li><li>the variables contain missing values (NA in R speak)</li></ul><div>I've been working on this type of coding for the past day, and I thought I would share my tips with you as it took me quite a while to figure out how to do this, with numerous internet searches. Hopefully I can save someone else the time!</div><div><br /></div><div>My data frame is a survey of the general population, looking at consumption of a particular type of food (we'll call it Widgets), because there was insufficient information about Widget-eaters to be able to cut a better sample frame. The end result is that I have a data frame where only around 10% of respondents say they have eaten Widgets. This means that I have lots of information about the relevant 10% of the population, and a much smaller amount of information about the other 90% (age, location, and sex only). Currently I am holding, and working on, all the data in one data frame without subsetting, although I may subset later. The upshot is that I have a data frame where 90% of the data is NA.</div><div><br /></div><div>The current problem I am faced with is that there were a number of open-ended response options to questions because we had "Other, please specify" type options. These other responses were selected by only a subset of the Widget eaters (who are 10% of my respondents). These "Other" response options, which are answers to questions like "Where did you last buy Widgets", "Where did you obtain information about the benefits of eating Widgets" were coded into updated categories by subject matter experts and provided to me as Excel files. My role here is to merge the updated coding into the existing response option coding for each question where there is an open-ended response.</div><div><br /></div><div>Normally, that's been pretty easy. Most of the questions have required a "pick one only" type response so it's just been a matter of updating an existing variable, from a response of "Other" to a better coded response. That's simply replacement coding, where an existing variable level is overwritten by a new one.</div><div><br /></div><div>However, I have just been recoding response options to a "choose all that apply" question. And this is where my ifelse has smacked up against the NA problem. Here, I am recoding "Other" responses to the question&nbsp;&nbsp;"Where did you obtain information about the benefits of eating Widgets?"&nbsp;</div><div><br /></div><div>There were two recodes I had to make:</div><div><ul><li>update an existing variable to reflect that the open-ended response should have been coded to an existing response option (e.g. because of post-hoc decision to expand a response option, e.g. so that the response option "TV" now also includes "radio")</li><li>to construct new response option categories because the existing ones are not sufficient to capture the "other" response and there are a sufficiently large number of similar response options to justify creating a new response option.</li></ul><div>There are 5001 respondents in the data frame.&nbsp;Of the 460 Widget eaters, 73 provided "other" responses to the question.&nbsp;The data came to me with 21 response options precoded, and the 73 responses to the "other" option were split off into Excel for recoding by a subject matter expert. I received back the recodes in the Excel file. I saved a copy of the Excel file to csv format, read that into R, and created a new variable "Other.Source" to capture all of the numeric recodes. &nbsp;(I have created a data dictionary that shows what each numeric code stands for.) Inside the master R data frame ("foo"), the 20 non-"Other" responses are coded as separate variables (because this is an "all that apply" question), from C1a to C1t, with C1r as the "Other" category. The codes are:</div></div><div><ul><li>1 if the Widget eater gave that response</li><li>0 if the Widget eater did not give that response</li><li>NA if the respondent is not a Widget eater</li></ul><div>The Other.Source value equals the response option number for question C1 in the survey. For example, where the response should have been coded to response option 1 (i.e. variable C1a in foo), the value in Other.Source has been set to 1.</div><div><br /></div><div>I have 4 recodes to apply to existing response option categories, and 4 new response option categories that need to be constructed as new variables in the data frame (taking the range C1u to C1x to keep with the extant metadata framework). I have had to use a slightly different approach to recode, depending on whether I am updating an existing variable, or creating a new one. I show both approaches below.</div></div><div><br /></div><div>Other.Source has been merged into foo, ahead of the work below. This means that all respondents who did not have an "other" response for the question, regardless of whether or not they are Widget eaters, are NA for Other.Source in foo.</div><div><br /></div><div>First, updating an existing variable, based on values to two variables. This shows how I updated the 7th response option, which was coded to variable C1g in foo. I had been constructing a temporary variable to take the recodes (foo$temp) because I was having so many issues getting my code correct)</div><div><div><span style="font-family: 'Courier New', Courier, monospace;">table(foo$C1g)</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">#code below is complicated because a missing value screws up the counts, so need to exclude missing values</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">#explicitly, hence all the !is.na() syntax = ensuring that the value is NOT missing</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">foo$temp &lt;- ifelse(!is.na(foo$C1g) &amp; (foo$C1g==1) | (!is.na(foo$Other.Source) &amp; foo$Other.Source==7),1,</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ifelse(foo$C1g==0 &amp; !is.na(foo$Other.Source) &amp; foo$Other.Source!=7,0,</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ifelse(foo$C1g==0 &amp; is.na(foo$Other.Source),0,NA)))</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">foo$C1g &lt;- foo$temp</span></div></div><div><br /></div><div>Walking through this code:</div><div><ul><li>I produce a table of the original responses coded to C1g so I can make sure that the update means that I have at least the same number of &nbsp;"1" responses after recoding. If I have less, something has gone wrong.</li><li>Some comments to remind myself of why I have such complicated code: it's all down to the NAs!</li><li>The first <span style="font-family: 'Courier New', Courier, monospace;">ifelse </span>is giving "1" codes to temp where C1g contains 1 or where Other.Source contains 7 (the value indicating the "other" response should have been coded to the 7th response option, i.e. C1g). Because both C1g and Other.Source contain NAs, <span style="font-family: 'Courier New', Courier, monospace;">!is.na()</span> needs to be added as an extra condition for both variable checks, otherwise NA values interfere with getting the correct test result. Note that this contains an OR (|) condition: if either C1g OR Other.Source indicates that the 7th response option is valid for the respondent, set temp to "1".&nbsp;</li><li>The second <span style="font-family: 'Courier New', Courier, monospace;">ifelse </span>puts temp to 0 if C1g is 0 (i.e. the respondent is a Widget eater and they were not coded to the 7th response option originally), and that they gave an "other" response that was not related to the 7th response option. Note how the <span style="font-family: 'Courier New', Courier, monospace;">!is.na()</span> condition has been included for Other.Source, again it is included because only using&nbsp; <span style="font-family: 'Courier New', Courier, monospace;">foo$Other.Source!=7 </span><span style="font-family: inherit;">without the&nbsp;</span><span style="font-family: 'Courier New', Courier, monospace;">!is.na </span><span style="font-family: inherit;">would give incorrect results.</span></li><li>The third <span style="font-family: 'Courier New', Courier, monospace;">ifelse </span>is there because only a subset of Widget eaters gave "other" responses, so most Widget eaters will be NA for Other.Source, and we want to set them to "0" (did not give the 7th response option) and not NA (were not Widget eaters).</li><li>The NA result will therefore only be assigned where the respondent is not a Widget eater.</li><li>The last line replaces the original (and incorrect) C1g variable with the updated (and correct) results stored in the temp variable.</li></ul><div>What about creating a new variable? Well, as I mentioned above I had to do this four times. The code is a little different because we don't have an extant response option variable to match off. What I do have is a UseMentioned variable that indicates whether the respondent is a Widget eater (value="Yes") or not (value="No"). So there are no NAs in the UseMentioned variable, which is part of foo. The code to do the new variable construction is below. We are constructing the 24th variable, which is named C1x*:</div></div><div><div><span style="font-family: 'Courier New', Courier, monospace;">foo$temp &lt;- ifelse(!is.na(foo$Other.Source) &amp; foo$Other.Source==25,1,</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ifelse((is.na(foo$Other.Source) | foo$Other.Source!=25) &amp; foo$UseMentioned=="Yes",0,NA))</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">foo$C1x &lt;- Aussie$temp</span></div></div><div><br /></div><div>Walking through this code:</div><div><ul><li>The first <span style="font-family: 'Courier New', Courier, monospace;">ifelse </span>gives the value 1 to the new variable if the respondent picked this response option, with the<span style="font-family: 'Courier New', Courier, monospace;"> !is.na()</span> preventing problems arising from NA values in Other.Source.</li><li>The second <span style="font-family: 'Courier New', Courier, monospace;">ifelse </span>gives the value 0 to the new variable if the respondent is a Widget eater (UseMentioned=="Yes") and the respondent did not provide this response option. Note the use of <span style="font-family: 'Courier New', Courier, monospace;">is.na()</span> and <b>not </b>the <span style="font-family: 'Courier New', Courier, monospace;">!is.na()</span>that was all that was used in the previous recoding example. While it would appear that all that is required is the&nbsp; UseMentioned=="Yes" criterion, the presence of NAs in the Other.Source variable causes issues so such a simple criterion cannot be used.</li></ul></div><div>* The recode value is 25 and not 24, because of some complexity in the metadata where I have one variable that takes different values depending on which of two countries the respondent lives in, but the letter of the alphabet used for the variable has been made the same. So the variable letters are one out.</div><div><br /></div><div>I hope you find this useful.</div><div><br /></div><div><br /></div>Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-78777259722632813472012-01-14T17:00:00.000-08:002012-01-14T17:06:39.868-08:00R: ggplot2 for social scientistsI've been tied up with other work, so things have been a bit slow on the mixed methods development. I hope to do another post on the analysis in about a week, but today I would like to do an introduction to&nbsp;<a href="http://cran.r-project.org/web/packages/ggplot2/index.html" target="_blank">ggplot2</a>&nbsp;and demonstrate some of its usefulness to social scientists when one is dealing with grouped data, e.g. in a regression.<br /><br />If you remember from the last post, I got as far as fitting a linear mixed effects model to the data using the&nbsp;<a href="http://cran.r-project.org/web/packages/lme4/index.html" target="_blank">lme4</a>&nbsp;package. That gave a mixed effects model, that was linear in both the fixed (age and day of intake) and random (subject) effects.<br /><br />This is a multivariate model, so one method to examine the model appropriateness, given that it is fitted in multidimensional space, is to look at a plot of the residuals against the fitted values. I showed the plot in the previous post, but for ease of comparison, it is reproduced here:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-sfuQVR6Ntko/TxIPyg_N20I/AAAAAAAAAAY/tZ_WDUEKXnQ/s1600/Male+energy+residuals+lmer.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://4.bp.blogspot.com/-sfuQVR6Ntko/TxIPyg_N20I/AAAAAAAAAAY/tZ_WDUEKXnQ/s320/Male+energy+residuals+lmer.jpeg" width="320" /></a></div><br />We look at residuals plots to see if the model assumptions are not met. The types of assumptions that can be tested using a residuals plot are:<br /><br /><ol><li>The residuals should have a mean of 0. This is demonstrated by equal scatter of residuals above and below the horizontal line at y=0.</li><li>The residuals should be independent, e.g. there should be no autocorrelation. This can be an issue with time series data, where there is (for example) a repeating seasonal effect.</li><li>The residuals should show constant variance over the fitted values. The graph does not appear to show a &nbsp;"fanning out" of the residuals, which would suggest heteroscedasticity (the technical name for non-constant variance).</li><li>The residuals should be normally distributed. This does appear to be an issue from the graph, as the left hand, middle, and right hand sides of the residuals graph do not appear to be similarly distributed.</li></ol><div>ggplot2 can help with our examination of the residuals plot. This sets up the data that I use in ggplot2 and creates the first graph:</div><div><div><span style="font-family: 'Courier New', Courier, monospace;">library(ggplot2)</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.Graph &lt;- as.data.frame(fitted(Male.lme1))</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">names(Male.Graph)[names(Male.Graph)=="fitted(Male.lme1)"] &lt;- "fits"</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.Graph$resids &lt;- c(residuals(Male.lme1))</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.Graph$AgeFactor &lt;- Male.Data$AgeFactor</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.Plot &lt;- qplot(fits,resids, data=Male.Graph, geom=c("point","smooth"))</span></div></div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.Plot</span></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-bPC0yd-nWyM/TxIVGC41z7I/AAAAAAAAAAg/-mx4bxg935I/s1600/Qplot+of+male+data+with+scaling.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="http://3.bp.blogspot.com/-bPC0yd-nWyM/TxIVGC41z7I/AAAAAAAAAAg/-mx4bxg935I/s320/Qplot+of+male+data+with+scaling.jpeg" width="320" /></a></div><div>This is a much more attractive version of the first graph above, and the fitted line clearly shows something not quite right with the residuals. A normal probability plot is needed, and this can be done in base R. The output confirms the suspicion that the residuals are not normally distributed. The second line of code below plots the line so that it is easier to see where deviations from normality occur.</div><div><div><span style="font-family: 'Courier New', Courier, monospace;">qqnorm(Male.Graph$resids)</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">qqline(Male.Graph$resids)</span></div></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Py-ys9-9DD4/TxIXrvpZqPI/AAAAAAAAAAo/NGJ2S8L6sXc/s1600/Males+normal+qq+plot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="http://3.bp.blogspot.com/-Py-ys9-9DD4/TxIXrvpZqPI/AAAAAAAAAAo/NGJ2S8L6sXc/s320/Males+normal+qq+plot.jpeg" width="320" /></a></div><div>Now we can go back to having some fun with ggplot2. There are four different age groups of males in the data, how do the groups affect the residuals? ggplot2 allows for&nbsp;easy&nbsp;group identification in plots:</div><div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.Plot2 &lt;- qplot(fits,resids, data=Male.Graph, colour=AgeFactor)</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.Plot2</span></div></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-fOENFCCY_Wo/TxIYS4p9adI/AAAAAAAAAAw/Mpbn6gCH258/s1600/Male+residuals+coloured+by+age+band.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="http://1.bp.blogspot.com/-fOENFCCY_Wo/TxIYS4p9adI/AAAAAAAAAAw/Mpbn6gCH258/s320/Male+residuals+coloured+by+age+band.jpeg" width="320" /></a></div><div><br /></div><div><br /></div><div>That looks interesting, and the values are as expected: as the children get older, generally their energy intake increases. But are the trend lines for the residuals for each group the same? Again, ggplot2 can help, and I have set quite a wide trend bar and shrunk the size of the points so the the bar is more evident:</div><div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.PlotBasic &lt;- ggplot(Male.Graph, aes(fits, resids)) +</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">&nbsp; &nbsp; geom_point(size=0.01) +</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">&nbsp; &nbsp; geom_smooth(size=1.5, se=F)</span></div><div><span style="font-family: 'Courier New', Courier, monospace;">Male.PlotBasic + aes(colour=factor(AgeFactor))</span></div></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-PjyvlcyMXLg/TxIjBcqlUTI/AAAAAAAAAA4/OFK4I5VnHxk/s1600/Male+group+data+with+lines.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="http://2.bp.blogspot.com/-PjyvlcyMXLg/TxIjBcqlUTI/AAAAAAAAAA4/OFK4I5VnHxk/s320/Male+group+data+with+lines.jpeg" width="320" /></a></div><div><br /></div><div>The lines are clearly not parallel, so there are differences in model fit between the age groups.&nbsp;</div><div><br /></div><div>There are a lot more options in the ggplot2 package, which makes these types of plots much easier and faster to produce compared to using base R.</div><div><br /></div><div>A small plug for the author: I have used the book&nbsp;<a href="http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403/ref=sr_1_1?ie=UTF8&amp;qid=1326588876&amp;sr=8-1" target="_blank">ggplot2: Elegant Graphics for Data Analysis</a>&nbsp;to skill myself up on this package. I purchased the Kindle version as I am trying to get my technical books into e-book format and&nbsp;<a href="http://had.co.nz/" target="_blank">Hadley Wickham</a>,&nbsp;who is the author of the package as well as the author of the book, was very helpful answering my questions about the book before I purchased it.&nbsp;</div><div><br /></div>Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-10449907253209441702011-12-29T12:38:00.000-08:002012-01-14T17:00:45.337-08:00R: Nutrient intake data, mixed methods analysis for malesTo recap:<br /><ul><li>we've cleaned the data (see earlier posts)</li><li>we're up to constructing the output data that will be used to estimate the population distribution values for the particular nutrient we are examining (energy)</li></ul>The code below is my implementation of the&nbsp;<a href="http://riskfactor.cancer.gov/diet/usualintakes/mixtran_macro_v1.1.sas" target="_blank">mixtran sas macro</a> for amount, where the nutrient is consumed daily. This means that a bunch of the macro code is irrelevant for the current analysis, as we don't need to estimate daily consumption probabilities, or apply them in the analysis, because probability of consumption = 1. <br /><br />A boxcox analysis is performed to identify the best transformation to normality prior to undertaking the mixed methods analysis. The data contains age as single year, so I am interested in the effect of analysing age as both<br /><ol><li>age bands used for nutrient reference values - there are standard age bands used for the reference values, so it makes sense to create age factors using these bands and then analyse age as a factor based on these bands, and</li><li>a continuous variable, although because the method below is linear it assumes a linear effect of age on nutrient consumption and this is probably an unrealistic assumption</li></ol>The R code is below. Code lines mentioned relate to the line in the mixtran macro linked above. Note that this method is using a linear mixed effects model on the data, the SAS macro uses a nonlinear mixed effects model, but I am currently unsuccessful in duplicating that exact model into R. The data that I have is much less complex from a covariates perspective compared to the data that was analysed in SAS, so at this point I am not too worried about the change in this part of the method.<br /><span style="font-size: small;"><span style="font-family: 'Courier New', Courier, monospace;">#The data equivalency with the SAS macro data:</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#SAS macro variable&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; SAS&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; R</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#subject&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; seqn&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; RespondentID</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#response&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; add_sug&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; IntakeAmt</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#repeat&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; drddaycd&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; IntakeDay</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#covars_amt&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; {Age groups}&nbsp;&nbsp;&nbsp; AgeFactor&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Note: no dummies as AgeFactor is an ordered factor</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#replicate_var&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; rndw1&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; SampleWeight&nbsp;&nbsp;&nbsp; Note: used as a frequency variable ("replicate" in SAS), used as weight in R</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#seq&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; seq2&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; IntakeDay&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Note: no dummy as IntakeDay is an ordered factor</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; indwt&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; SampleWeight</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#ANALYSIS OF MALE DATA - STEP 1 OF 2</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#MIXTRAN </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#Delete any unused RespondentID levels that are related to the female subjects</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Data$RespondentID &lt;- Male.Data$RespondentID[,drop=TRUE]</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#row 419 reduce the data to one record per person to calculate weights</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Subset &lt;- subset(Male.Data,!duplicated(RespondentID), select=c(RespondentID, SampleWeight, IntakeAmt, AgeFactor))</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#row 446 add the total weights&nbsp; to the data frame</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Subset$TotWts &lt;- sum(Male.Subset$SampleWeight)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#row 455 create the subgroup weights, these are all by age only</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Grpwt &lt;- aggregate(Male.Subset$SampleWeight,by= data.frame(Male.Subset$AgeFactor),sum)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">names(Male.Grpwt)[names(Male.Grpwt)=="Male.Subset.AgeFactor"] &lt;- "AgeFactor"</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">names(Male.Grpwt)[names(Male.Grpwt)=="x"] &lt;- "GrpWts"</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Subset &lt;- merge(Male.Subset,Male.Grpwt, by=c("AgeFactor"))</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#row 472 get the number of persons in total and by subgroup if given</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#the SAS macro produces a dataset of 1 row per agegroup, with agegrp, num subjects in age group, total num subjects output</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">count &lt;- function(x) { </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp; length(na.omit(x)) </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">}</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Numsub &lt;- aggregate(Male.Subset$RespondentID,by=data.frame(Male.Subset$AgeFactor),count) </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">names(Numsub)[names(Numsub)=="Male.Subset.AgeFactor"] &lt;- "AgeFactor"</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">names(Numsub)[names(Numsub)=="x"] &lt;- "NumSubjects"</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Numsub$TotSubjects &lt;- sum(Numsub$NumSubjects)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#row 492 merge numsub to _persons</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Subset &lt;- merge(Male.Subset,Numsub, by=c("AgeFactor"))</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Subset &lt;- Male.Subset[order(Male.Subset$RespondentID),]</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#END OF GENERAL SET UP</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#convert proc transreg, from row 803</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#NOTE: CODE ONLY USED TO IDENTIFY THE LAMBDA VALUE BETWEEN 0.05 AND 1 ASSOCIATED WITH THE MAXIMUM LOG LIKELIHOOD</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#fitting a linear model with a BoxCox transformation, on the data that includes replicates</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">library(MASS)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Boxcox &lt;- as.data.frame(with(Male.Data, {</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; boxcox(IntakeAmt ~ AgeFactor + IntakeDay,</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; lambda= seq(0.05, 1, 0.05), plotit=FALSE)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; }))&nbsp;&nbsp;&nbsp; </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#locate the row containing the largest log likelihood</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">which.max(Male.Boxcox$y)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Lambda.Value &lt;- Male.Boxcox[Male.Boxcox$y == max(Male.Boxcox$y),1 ]</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#row 818 add boxcox values to Male.Data</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Data$BoxCoxXY &lt;- (Male.Data$IntakeAmt^Lambda.Value-1)/Lambda.Value</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#clean up the R workspace by deleting data frames no longer required</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">objects()</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">rm(Imported.Data, Long.Data, Male.Boxcox, Male.Grpwt, Numsub)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#row 840 through nlmixed starting at row 996, do all the data preparation in one hit</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#working through nlmixed lines now, prework on variable construction</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#not sure why subject is fit as a nonlinear effect, fit using a linear method</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#use a model with an intercept for fixed effects</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">library(lme4)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.lme1 &lt;- lmer(BoxCoxXY ~ AgeFactor + IntakeDay + (1|RespondentID),</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; data = Male.Data, </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; weights = SampleWeight)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">summary(Male.lme1)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">plot(fitted(Male.lme1), residuals(Male.lme1))</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Data$lmefits &lt;- fitted(Male.lme1)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#Compare with using age as a continuous variable</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Boxcox.Age &lt;- as.data.frame(with(Male.Data, {</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; boxcox(IntakeAmt ~ Age + IntakeDay,</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; lambda= seq(0.05, 1, 0.05), plotit=FALSE)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; }))&nbsp;&nbsp;&nbsp; </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">which.max(Male.Boxcox.Age$y)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Lambda.Value.Age &lt;- Male.Boxcox.Age[Male.Boxcox.Age$y == max(Male.Boxcox.Age$y),1 ]</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Data.Age$BoxCoxXY &lt;- (Male.Data.Age$IntakeAmt^Lambda.Value.Age-1)/Lambda.Value.Age</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">#Age as continuous fixed effect</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.lme2 &lt;- lmer(BoxCoxXY ~ Age + IntakeDay + (1|RespondentID),</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; data = Male.Data, </span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; weights = SampleWeight)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">summary(Male.lme2)</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">plot(fitted(Male.lme2), residuals(Male.lme2))</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: 'Courier New', Courier, monospace;">Male.Data$Agefits &lt;- fitted(Male.lme2)</span></span><br /><br />Some highlights for me:<br /><ol><li>The same transformation was identified as the best for both analyses. </li><li>An ANOVA comparison of the two models showed no significant improvement from using age as a continuous variable, so the age band method is retained for subsequent analyses:</li></ol><span style="font-family: 'Courier New', Courier, monospace;">anova(Male.lme1,Male.lme2)</span><br /><span style="font-family: 'Courier New', Courier, monospace;">Data: Male.Data</span><br /><span style="font-family: 'Courier New', Courier, monospace;">Models:</span><br /><span style="font-family: 'Courier New', Courier, monospace;">Male.lme2: BoxCoxXY ~ Age + IntakeDay + (1 | RespondentID)</span><br /><span style="font-family: 'Courier New', Courier, monospace;">Male.lme1: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)</span><br /><span style="font-family: 'Courier New', Courier, monospace;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Df&nbsp;&nbsp;&nbsp; AIC&nbsp;&nbsp;&nbsp;&nbsp; BIC&nbsp; logLik Chisq Chi Df Pr(&gt;Chisq)</span><br /><span style="font-family: 'Courier New', Courier, monospace;">Male.lme2&nbsp; 5 9894.4&nbsp; 9926.4 -4942.2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><br /><span style="font-family: 'Courier New', Courier, monospace;">Male.lme1&nbsp; 7 9966.2 10011.1 -4976.1&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</span><br /><br />&nbsp;The residuals plot for the age band analysis is reasonably nicely behaved given there are &gt;4000 data points, although there could be a slight upwards trend:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-tJyc83SoLsM/TvzOZ3Kt02I/AAAAAAAAAAQ/gN4-v3BWGvo/s1600/Male+energy+residuals+lmer.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://2.bp.blogspot.com/-tJyc83SoLsM/TvzOZ3Kt02I/AAAAAAAAAAQ/gN4-v3BWGvo/s320/Male+energy+residuals+lmer.jpeg" width="320" /></a></div>The output from the age band lmer analysis is:<br /><br /><br /><div class="MsoNormal"><span style="font-family: 'Courier New'; font-size: 9pt; line-height: 115%;"><span style="font-size: small;">L</span><span style="font-size: small;">inear mixed model fit by REML </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">Formula: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID) </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp;&nbsp; Data: Male.Data </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp; AIC&nbsp;&nbsp; BIC logLik deviance REMLdev</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp;9994 10039&nbsp; -4990&nbsp;&nbsp;&nbsp;&nbsp; 9952&nbsp;&nbsp;&nbsp; 9980</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">Random effects:</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp;Groups&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Name&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Variance Std.Dev.</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp;RespondentID (Intercept) 0.19408&nbsp; 0.44055 </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp;Residual&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.37491&nbsp; 0.61230 </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">Number of obs: 4498, groups: RespondentID, 2249</span></span></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">Fixed effects:</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Estimate Std. Error t value</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">(Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 13.98016&nbsp;&nbsp;&nbsp; 0.03405&nbsp;&nbsp; 410.6</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">AgeFactor4to8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.50572&nbsp;&nbsp;&nbsp; 0.04084&nbsp;&nbsp;&nbsp; 12.4</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">AgeFactor9to13&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.94329&nbsp;&nbsp;&nbsp; 0.04159&nbsp;&nbsp;&nbsp; 22.7</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">AgeFactor14to18&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.30654&nbsp;&nbsp;&nbsp; 0.04312&nbsp;&nbsp;&nbsp; 30.3</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">IntakeDayDay2Intake -0.13871&nbsp;&nbsp;&nbsp; 0.01809&nbsp;&nbsp;&nbsp; -7.7</span></span></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">Correlation of Fixed Effects:</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (Intr) AgFc48 AgF913 AF1418</span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">AgeFactr4t8 -0.775&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">AgeFctr9t13 -0.761&nbsp; 0.634&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">AgFctr14t18 -0.734&nbsp; 0.612&nbsp; 0.601&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></div><div class="MsoNormal"><span style="font-size: small;"><span style="font-family: 'Courier New'; line-height: 115%;">IntkDyDy2In -0.266&nbsp; 0.000&nbsp; 0.000&nbsp; 0.000</span></span></div>Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-46571882653100981182011-12-28T08:35:00.000-08:002011-12-28T17:41:57.828-08:00Nutrient intake data, finalising the data in RI run plain R in the normal gui under Windows 7, which means no bells and whistles. This means that I find the R gui somewhat awkward to program in. Thanks to advice I received a number of years ago, I use <a href="http://notepad-plus-plus.org/" target="_blank">Notepad ++</a> as my programming environment. It has line numbering, and when you use Language &gt; R through the menu to set the programming language, you get colour coded syntax. It also has the nice feature of emphasizing the current bracket set that you are using, which makes it very easy to see whether you have remembered to close all your brackets - it counts backward from the last open bracket.<br /><br />We're finally in R. :) The code below sets up the data sets for nutrient intake analysis, which will be the subject of my next posts. If you're following along in the SAS macro, the code is the R substitute for the data preparation in the starting macro called "example1_amount_mixtran_distrib.sas" from <a href="http://riskfactor.cancer.gov/diet/usualintakes/example1_amount_mixtran_distrib.zip" target="_blank">the Example 1 zip file</a> which is downloadable from <a href="http://riskfactor.cancer.gov/diet/usualintakes/macros_single.html" target="_blank">this webpage</a> if you don't want to download the zip immediately.<br /><br />SAS syntax files, which are identifiable by the .sas as a file extension, can be viewed with any text reader, and I use Notepad++ for that as well. If you open that SAS syntax file, the code below prepares the data for analysis in the mixtran macro, basically down to line 114.<br /><br />You'll notice that I comment my code a lot, probably more than most. That is because I have had numerous experiences of coming back to code I wrote 6 months, or a couple of years earlier, and needing to revise it. I have found that what is obvious at the time of programming may not be so obvious as time passes and other programming projects have been completed.<br /><br />You'll see the use of the <a href="http://cran.r-project.org/web/packages/reshape2/" target="_blank">reshape2</a> package. The data is basically a repeated measures design, as there are two 24-hour recall periods for nutrient intake per person. The data coming in from the .csv file constructed earlier has one row per person, with the nutrient intakes as two variables. For repeated measures, the data analysis later requires one row per intake (i.e. two rows per person). As this is the main data preparation stage, it makes sense to reform the data frame now.<br /><br />While I cannot supply the data at this point, I will post the header() result before the melt so you can see the type of data in the data frame. <br /><br />#This section of code duplicates the SAS code from example1_amount_mixtran_distrib.sas from line 1 through line 114<br />#Read in the Australian energy data<br />Imported.Data &lt;- read.csv("foo.csv",header=T)<br />#check that headers have imported fine<br />names(Imported.Data)<br />length(Imported.Data)<br />nrow(Imported.Data)<br />#sort the data frame by subject<br />Imported.Data &lt;- Imported.Data[order(Imported.Data$RespondentID),]#check sort worked, look at first few observations<br />head(Imported.Data)<br />#melt data frame so that each repeated measure (intake) is one row, and<br />#create factor to indicate whether it's a day1 or day2 intake.<br />#remember that reshape2 package must be installed at this point<br />library(reshape2)<br />Long.Data &lt;- melt(Imported.Data, id=1:6, variable="IntakeDay",<br />measured=c("Day1Intake", "Day2Intake"))<br />names(Long.Data)[names(Long.Data)=="value"]&lt;-"IntakeAmt"<br />#construct age group factors, lowest age group number = youngest age group<br />#age groups for analysis are set here (latest edition): http://www.nhmrc.gov.au/guidelines/publications/n35-n36-n37<br />#ASSUMPTION: no children &lt;1 year old in data<br />#construct one variable that contains all the age factors<br />#evaluate from lowest to highest age, evaluation stops when condition is first met<br />#evaluate from lowest to highest age, evaluation stops when condition is first met<br />Long.Data$AgeF &lt;-ifelse(Long.Data$Age&lt;=3,1, ifelse(Long.Data$Age&lt;=8,2, ifelse(Long.Data$Age&lt;=13,3, <br />&nbsp;&nbsp; &nbsp;ifelse(Long.Data$Age&lt;=18,4, ifelse(Long.Data$Age&lt;=30,5, ifelse(Long.Data$Age&lt;=50,6,<br />&nbsp;&nbsp; &nbsp;ifelse(Long.Data$Age&lt;=70,7, ifelse(Long.Data$Age&gt;70,8,""))))))))<br />Long.Data$AgeFactor &lt;- as.factor(Long.Data$AgeF)<br />levels(Long.Data$AgeFactor) &lt;- c("1to3","4to8","9to13","14to18","19to30","31to50","51to70","71Plus")<br />table(Long.Data$AgeF, Long.Data$AgeFactor)#Delete AgeF and any unused AgeFactor levels<br />Long.Data$AgeF &lt;- NULL<br />Long.Data$AgeFactor &lt;- Long.Data$AgeFactor[,drop=TRUE]<br />#Make RespondentID into a factor, it should not be treated as numeric<br />Long.Data$RespondentID &lt;- as.factor(Long.Data$RespondentID) <br />#males and females are analysed separately, do not need to be specified as factors,<br />#construct different data frames for each - the code will duplicate the analysis for the second gender<br />#ASSUMPTION: males = 1 and females = 2<br />Male.Data &lt;- subset(Long.Data, Gender==1)<br />Female.Data &lt;- subset(Long.Data, Gender==2)<br /><br />The result from head(Long.Data) is:<br /><span style="font-size: small;"><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">&nbsp; NutrientID RespondentID Gender Age BodyWeight SampleWeight Day1Intake Day2Intake</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 267&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 100013&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp; 15&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 59.4&nbsp;&nbsp;&nbsp; 0.3335521&nbsp;&nbsp; 8591.535&nbsp;&nbsp; 8747.908</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 267&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 100020&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp; 12&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 51.6&nbsp;&nbsp;&nbsp; 0.4952835&nbsp; 12145.852&nbsp; 13495.798</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 267&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 100050&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp; 15&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 62.1&nbsp;&nbsp;&nbsp; 0.3335521&nbsp; 14202.496&nbsp; 13724.582</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 267&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 100100&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 18.5&nbsp;&nbsp;&nbsp; 0.3563699&nbsp;&nbsp; 8621.690&nbsp;&nbsp; 6218.391</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 267&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 100128&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 13.2&nbsp;&nbsp;&nbsp; 0.1666111&nbsp;&nbsp; 5140.690&nbsp;&nbsp; 6427.673</span><br style="font-family: &quot;Courier New&quot;,Courier,monospace;" /><span style="font-family: &quot;Courier New&quot;,Courier,monospace;">6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 267&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 100370&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp; 7&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 24.9&nbsp;&nbsp;&nbsp; 0.3563699&nbsp;&nbsp; 7418.029&nbsp; 13620.542</span></span>Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-16597089146124472752011-12-28T03:54:00.000-08:002011-12-28T03:55:30.289-08:00Nutrients consumed daily, R analysis story boardWith any relatively complicated programming task, I prefer to first create a story board that provides the steps I will take to complete the overall task. This may end up being a heavily edited post if I find that my plans change part way through.<br /><br />The point of this analysis is to produce a cleaned distribution of nutrient intakes for the Australian population, using two 24-hour intake recall periods. The general method being followed is from <a href="http://riskfactor.cancer.gov/diet/usualintakes/macros_single.html" target="_blank">here</a> where the method has been implemented for SAS. I was a SAS programmer for around 12 years, but never had much to do with macros as all my programming tasks were one-offs. This has meant that understanding the SAS code has taken me a reasonable period of time.<br /><br />The approach will use the data sets constructed in the previous blog post, which I cleaned in Excel using VBA.<br /><br />The main steps are:<br /><ol><li>Identify a test data set (in this case, energy (KJ) intake),</li><li>Reprogram the SAS code into R, hard coding in the variable names from the cleaned data set as these will be standard for the data analysis of the various nutrients,</li><li>Compare the output distribution to that obtained by the previous method of analysis, then if all goes well </li><li>Output the cleaned distribution to a .csv file for subsequent use by the client, and then finally</li><li>Automate the process for the client, e.g. include GUI features for the client to select the input .csv data set, so the user does not need to change any of the R code. </li></ol>&nbsp;The method will use a number of R packages as well as the standard installation. For example,<br /><ul><li>the&nbsp;<a href="http://cran.r-project.org/web/packages/MASS/index.html" target="_blank">MASS</a> package is used to Box Cox transform the data to normality,&nbsp;</li><li>the&nbsp;<a href="http://cran.r-project.org/web/packages/reshape2/index.html" target="_blank">reshape2</a> package is required to melt the data so that repeated measures are separate observations, not separate variables - this will double the length of the data set as all observations have two 24-hour recall periods, and</li><li>mixed effects analysis is being undertaken using <a href="http://cran.r-project.org/web/packages/lme4/index.html" target="_blank">lme4</a>.</li></ul>I have been working with two of the SAS macros for the past couple of months, and the R code will be dramatically shorter compared to the SAS code as there will be no interim data sets output. Because this process is only addressing nutrients consumed daily (rather than those consumed episodically, e.g. alcohol, or for foods rather than nutrients), the SAS code is simplified into R by not having to generate probability distributions for intake. Once this R analysis has been implemented, I will rework it for the episodic case. To ensure that the correct R program is used, I will incorporate a "missing value check" to ensure that the correct program is used. For the "consumed daily" nutrients, the data set contains only observations with consumption so by definition there is no missing data - but it is a good idea to check all assumptions.<br /><br />Along the way, I will be doing some additional testing. The NCI method linked above uses a number of covariates, such as age group. The client has been analysing the data (not using the NCI method) separately by age group. I will be testing the effect on the overall intake distribution by:<br /><ol><li>analysing separately for age group (3 age groups), versus</li><li>using age group as fixed effect covariates, versus</li><li>using age-in-years as a continuous fixed effect variable instead of age group covariates (there are &gt;4000 observations so there are multiple observations per age-in-years)</li></ol>Michellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.comtag:blogger.com,1999:blog-4353081542889900211.post-52367817555077893102011-12-27T18:11:00.000-08:002012-01-14T17:00:57.571-08:00Data preparation for analysis, use VBAData on daily nutrient intake information is stored within a database, and the users are extracting the relevant data on a nutrient-by-nutrient basis through an export procedure. The exported data is being stored in Excel 2010 data sets, one for each nutrient. There is a set of variables being saved in each export, as well as an optional variable that does not need passing to R, but - when it is present - is situated between subsets of relevant variables. The names of the variables are not exactly the same between the Excel data sets, nor is the starting row for the variable names, and a couple of variable names are saved across two rows instead of one. This creates a bit of a headache for creating a .csv file to import into R.<br /><br />For one file, the starting rows look like:<br /><br /><table border="0" cellpadding="0" cellspacing="0" style="width: 702px;"><colgroup><col style="mso-width-alt: 3730; mso-width-source: userset; width: 77pt;" width="102"></col> <col style="mso-width-alt: 3620; mso-width-source: userset; width: 74pt;" width="99"></col> <col span="2" style="width: 48pt;" width="64"></col> <col span="2" style="mso-width-alt: 3072; mso-width-source: userset; width: 63pt;" width="84"></col> <col span="2" style="mso-width-alt: 3035; mso-width-source: userset; width: 62pt;" width="83"></col> <col style="mso-width-alt: 1426; mso-width-source: userset; width: 29pt;" width="39"></col> </colgroup><tbody><tr height="20" style="height: 15.0pt;"> <td colspan="5" height="20" style="height: 15.0pt; mso-ignore: colspan; width: 310pt;" width="413">Energy, including dietary fibre. 2007 ANCNPAS. Including supps.</td> <td style="width: 63pt;" width="84"><br /></td> <td style="width: 62pt;" width="83"><br /></td> <td style="width: 62pt;" width="83"><br /></td> <td style="width: 29pt;" width="39"><br /></td> </tr><tr height="20" style="height: 15.0pt;"> <td height="20" style="height: 15.0pt;"><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> </tr><tr height="20" style="height: 15.0pt;"> <td height="20" style="height: 15.0pt;">nutrient_code</td> <td>Respondent ID</td> <td>Gender</td> <td>Age</td> <td>Body weight</td> <td>samplewt</td> <td>Intake- day1</td> <td>Intake- day2</td> <td>Units</td> </tr><tr height="20" style="height: 15.0pt;"> <td align="right" height="20" style="height: 15.0pt;">267</td> <td align="right">100013</td> <td align="right">2</td> <td align="right">15</td> <td align="right">59.4</td> <td align="right">0.333552084</td> <td align="right">8591.5354</td> <td align="right">8747.9084</td> <td>KJ</td> </tr><tr height="20" style="height: 15.0pt;"> <td align="right" height="20" style="height: 15.0pt;">267</td> <td align="right">100020</td> <td align="right">1</td> <td align="right">12</td> <td align="right">51.6</td> <td align="right">0.495283471</td> <td align="right">12145.8524</td> <td align="right">13495.798</td> <td>KJ</td> </tr></tbody></table><br />For another file, the starting rows look like:<br /><br /><table border="0" cellpadding="0" cellspacing="0" style="width: 883px;"><colgroup><col span="4" style="width: 48pt;" width="64"></col> <col style="mso-width-alt: 3181; mso-width-source: userset; width: 65pt;" width="87"></col> <col style="width: 48pt;" width="64"></col> <col style="mso-width-alt: 2742; mso-width-source: userset; width: 56pt;" width="75"></col> <col style="mso-width-alt: 3364; mso-width-source: userset; width: 69pt;" width="92"></col> <col style="mso-width-alt: 3913; mso-width-source: userset; width: 80pt;" width="107"></col> <col style="width: 48pt;" width="64"></col> <col style="mso-width-alt: 5046; mso-width-source: userset; width: 104pt;" width="138"></col> </colgroup><tbody><tr height="20" style="height: 15.0pt;"> <td height="20" style="height: 15.0pt; width: 48pt;" width="64">iodine</td> <td style="width: 48pt;" width="64"><br /></td> <td style="width: 48pt;" width="64"><br /></td> <td style="width: 48pt;" width="64"><br /></td> <td style="width: 65pt;" width="87"><br /></td> <td style="width: 48pt;" width="64"><br /></td> <td style="width: 56pt;" width="75"><br /></td> <td style="width: 69pt;" width="92"><br /></td> <td style="width: 80pt;" width="107"><br /></td> <td style="width: 48pt;" width="64"><br /></td> <td style="width: 104pt;" width="138"><br /></td> </tr><tr height="20" style="height: 15.0pt;"> <td colspan="2" height="20" style="height: 15.0pt; mso-ignore: colspan;">with supplements</td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> </tr><tr height="20" style="height: 15.0pt;"> <td height="20" style="height: 15.0pt;"><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td><br /></td> <td class="xl63">day1</td> <td class="xl64">day 2</td> <td><br /></td> <td><br /></td> </tr><tr height="20" style="height: 15.0pt;"> <td height="20" style="height: 15.0pt;">nut_code</td> <td>id</td> <td>sex</td> <td>age</td> <td>Body weight</td> <td>seifa</td> <td>samplewt</td> <td class="xl63">Intake</td> <td class="xl64">Intake</td> <td>Units</td> <td>Upper safe level (UL)</td> </tr><tr height="20" style="height: 15.0pt;"> <td align="right" height="20" style="height: 15.0pt;">315</td> <td align="right">100013</td> <td align="right">2</td> <td align="right">15</td> <td align="right">59.4</td> <td></td> <td align="right">0.33355208</td> <td align="right">103.7881</td> <td align="right">128.07576</td> <td>ug</td> <td align="right">900</td> </tr><tr height="20" style="height: 15.0pt;"> <td align="right" height="20" style="height: 15.0pt;">315</td> <td align="right">100020</td> <td align="right">1</td> <td align="right">12</td> <td align="right">51.6</td> <td></td> <td align="right">0.49528347</td> <td align="right">140.46202</td> <td align="right">218.31528</td> <td>ug</td> <td align="right">600</td> </tr><tr height="20" style="height: 15.0pt;"> <td align="right" height="20" style="height: 15.0pt;">315</td> <td align="right">100050</td> <td align="right">2</td> <td align="right">15</td> <td align="right">62.1</td> <td></td> <td align="right">0.33355208</td> <td align="right">118.21546</td> <td align="right">184.33722</td> <td>ug</td> <td align="right">900</td> </tr></tbody></table><br />Because there are multiple data sets that will need conversion to R, the best method is to construct some VBA code in an Excel workbook to handle the data cleaning and preparation. The saving grace is that the relative order of the variables appears to be the same across the different data sets.<br /><br />The code below cleans both files using a button click event in Excel 2010. The cleaned data is saved as a .csv file, using the same file name as the input .xlsx file and into the same directory. The Respondent IDs are integer, but must be dimmed as Long because they exceed the maximum value that can be stored in an Integer type. The variables are defined as arrays, and then redimmed to the correct length just before they have values allocated, so the method will work with data sets of varying numbers of observations. For those unfamiliar with VBA, the ' is used to comment code.<br /><br />The Open file has been restricted to only Excel 2010 workbooks - this should save the user from scrolling through irrelevant non-Excel files. Because we have constructed the .csv filename to have the correct file extension, we just need to put in the file name at the end but we do need to amend the type.<br /><br />Sub Button1_Click()<br />&nbsp;&nbsp;&nbsp; Dim FilePath As String<br />&nbsp;&nbsp;&nbsp; Dim FileName As String<br />&nbsp;&nbsp;&nbsp; Dim CSVFile As String<br />&nbsp;&nbsp;&nbsp; Dim WB As Workbook<br />&nbsp;&nbsp;&nbsp; Dim StartRow As Range<br />&nbsp;&nbsp;&nbsp; Dim Count As Integer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'Used to store the size of the array for each variable, can only be a whole number<br />&nbsp;&nbsp;&nbsp; Dim i As Integer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'Count in the values to each array<br />&nbsp;&nbsp;&nbsp; Dim NutrientCode() As Integer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'Numeric code for nutrient, same value for all rows in the same sheet<br />&nbsp;&nbsp;&nbsp; Dim Respondent() As Long&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'Respondent IDs are larger than 32K so need to make these long rather than integer<br />&nbsp;&nbsp;&nbsp; Dim Gender() As Integer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'Sex, codes are 1 or 2<br />&nbsp;&nbsp;&nbsp; Dim Age() As Integer<br />&nbsp;&nbsp;&nbsp; Dim BodyWeight() As Double<br />&nbsp;&nbsp;&nbsp; Dim SampleWeight() As Double<br />&nbsp;&nbsp;&nbsp; Dim Day1Intake() As Double<br />&nbsp;&nbsp;&nbsp; Dim Day2Intake() As Double<br />&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; On Error GoTo Err_Clr<br /><br />&nbsp;&nbsp;&nbsp; 'From http://msdn.microsoft.com/en-us/library/ff834966.aspx<br />&nbsp;&nbsp;&nbsp; 'Get workbook name<br />&nbsp;&nbsp;&nbsp; FilePath = Application _<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .GetOpenFilename("Excel Files (*.xlsx), *.xlsx")<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; If FilePath &lt;&gt; False Then<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'MsgBox "Open " &amp; FilePath<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; End If<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; 'Save filename and path for output to csv<br />&nbsp;&nbsp;&nbsp; FileName = Dir(FilePath)<br />&nbsp;&nbsp;&nbsp; 'MsgBox "The filename is " &amp; FileName<br />&nbsp;&nbsp;&nbsp; CSVFile = Left(FileName, Len(FileName) - 4) &amp; "csv"<br />&nbsp;&nbsp;&nbsp; 'MsgBox "The filename is " &amp; CSVFile<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; 'Open workbook<br />&nbsp;&nbsp;&nbsp; Set WB = Workbooks.Open(FilePath)<br />&nbsp;&nbsp;&nbsp; WB.Activate<br />&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; 'Locate variable name row<br />&nbsp;&nbsp;&nbsp; Range("A1").Select<br />&nbsp;&nbsp;&nbsp; Do Until (ActiveCell.Value = "nutrient_code") Or (ActiveCell.Value = "nut_code")<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Loop<br />&nbsp;&nbsp;&nbsp; Set StartRow = ActiveCell<br />&nbsp;&nbsp;&nbsp; 'MsgBox "Startrow is with cell " &amp; StartRow<br />&nbsp;&nbsp;&nbsp; 'Get Count value<br />&nbsp;&nbsp;&nbsp; Count = 0<br />&nbsp;&nbsp;&nbsp; Do While ActiveCell.Value &lt;&gt; ""<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Count = Count + 1<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Loop<br />&nbsp;&nbsp;&nbsp; 'Count is 1 too high due to looping to empty row, reduce by 1<br />&nbsp;&nbsp;&nbsp; Count = Count - 1<br />&nbsp;&nbsp;&nbsp; 'MsgBox "The number of observations is " &amp; Count<br />&nbsp;&nbsp;&nbsp; 'Go back to variable name row<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; 'Nutrient intakes<br />&nbsp;&nbsp;&nbsp; ReDim NutrientCode(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; NutrientCode(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Respondent IDs<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 1).Select<br />&nbsp;&nbsp;&nbsp; ReDim Respondent(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Respondent(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Gender<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 2).Select<br />&nbsp;&nbsp;&nbsp; ReDim Gender(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Gender(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Age<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 3).Select<br />&nbsp;&nbsp;&nbsp; ReDim Age(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Age(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Body Weight<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 4).Select<br />&nbsp;&nbsp;&nbsp; ReDim BodyWeight(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; BodyWeight(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'need to check for seifa column at this point<br />&nbsp;&nbsp;&nbsp; 'Sample Weight<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 5).Select<br />&nbsp;&nbsp;&nbsp; If ActiveCell.Value = "seifa" Then ActiveCell.Offset(0, 1).Select<br />&nbsp;&nbsp;&nbsp; ReDim SampleWeight(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SampleWeight(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Day 1 intake<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 5).Select<br />&nbsp;&nbsp;&nbsp; If ActiveCell.Value = "seifa" Then<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 2).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Else<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 1).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; End If<br />&nbsp;&nbsp;&nbsp; ReDim Day1Intake(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Day1Intake(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Day 2 intake<br />&nbsp;&nbsp;&nbsp; StartRow.Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 5).Select<br />&nbsp;&nbsp;&nbsp; If ActiveCell.Value = "seifa" Then<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 3).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Else<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(0, 2).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; End If<br />&nbsp;&nbsp;&nbsp; ReDim Day2Intake(Count)<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Day2Intake(i) = ActiveCell.Value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; 'Save all the array values to a new workbook<br />&nbsp;&nbsp;&nbsp; Workbooks.Add<br />&nbsp;&nbsp;&nbsp; 'Output Nutrient ID<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "NutrientID"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = NutrientCode(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Output Respondent<br />&nbsp;&nbsp;&nbsp; Range("B1").Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "RespondentID"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = Respondent(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Output Gender<br />&nbsp;&nbsp;&nbsp; Range("C1").Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "Gender"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = Gender(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Output Age<br />&nbsp;&nbsp;&nbsp; Range("D1").Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "Age"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = Age(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp; 'Output Body Weight<br />&nbsp;&nbsp;&nbsp; Range("E1").Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "BodyWeight"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = BodyWeight(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Output SampleWeight<br />&nbsp;&nbsp;&nbsp; Range("F1").Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "SampleWeight"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = SampleWeight(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Output Day 1 Intake<br />&nbsp;&nbsp;&nbsp; Range("G1").Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "Day1Intake"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = Day1Intake(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp; 'Output Day 2 Intake<br />&nbsp;&nbsp;&nbsp; Range("H1").Select<br />&nbsp;&nbsp;&nbsp; ActiveCell.Value = "Day2Intake"<br />&nbsp;&nbsp;&nbsp; For i = 1 To Count<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Offset(1, 0).Select<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ActiveCell.Value = Day2Intake(i)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; 'Save the worksheet as a csv file for R<br />&nbsp;&nbsp;&nbsp; ActiveWorkbook.SaveAs FileName:=CSVFile, FileFormat:=xlCSV<br /><br /><br />Err_Clr:<br />&nbsp;&nbsp;&nbsp; If Err &lt;&gt; 0 Then<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Err.Clear<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Resume Next<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; End If<br />&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; <br />End SubMichellehttp://www.blogger.com/profile/17580800643996840530noreply@blogger.com