It is such a pleasure to be involved in summer research at W&M. This is also my first formal, systematic academic research experience in college. Thanks to Professor Frazier, I learned how to conduct academic research professionally from the very beginning, from searching for a research area all the way to summarising the work and highlighting its crucial and creative points. This has been a great treasure for me. The research mainly uses random forests and mathematical models, including a hierarchical Bayesian model, to generate a synthetic population description for Liberia; future work would apply the method to other regions and countries, and to problems in various fields such as public health and disease tracking.

As the end of the summer session on campus approaches, we keep making progress on the research, but we are also facing more and more difficulties. Processing the HBM in R is one of them. Since hierarchical Bayesian modelling is still an actively developing area, it is hard and time-consuming to find specific functions that fit our model exactly, and there are many arguments and functions in the package to explore. While working through this and talking more with the professor about it, I have also started learning about the validation part of the random forest.

I still remember feeling half nervous, half excited while waiting for the “predict” function to finish, and thrilled when I saw the map with the predicted values. Although we still have to add more variables to improve the accuracy of the data and the population description map, this first step, the template-making process, is crucial. After discussing with Professor Frazier, we plan to explore some mathematical models next and use the sample data from the DHS website to create a more continuous and detailed population dataset. One of the most promising methods for this purpose is the Hierarchical Bayesian Model (HBM). We have figured out a general outline of our later work: since we already have all the layers and have run the data through the random forest to get clan-level population predictions, we are now going to collect more specific household data from DHS, such as household size and household locations, and put that data into an HBM, which may involve some multivariable logistic regression models. These models take discrete point data as input and generate predicted continuous data as output. This step will be noteworthy for the whole research. Also, we are not going to use individual data from DHS but household data, which is a noteworthy part of our study. Some earlier studies investigated individual information and mapped it directly. However, many aspects of life are much more strongly related to household factors; for example, where people choose to live depends heavily on where the family's workers are employed. The household, in other words the family, is highly important for mapping population and population description. Only a few studies take household size and related information into account, so our research could be a breakthrough.
However, there are also some technical issues waiting to be addressed, and most of them will require more advanced mathematical models. These are mostly what we have been discussing and researching during this period. We are studying the promising HBM approach in more depth and keep encountering new problems.
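To make the HBM idea above concrete, here is a minimal sketch of the kind of hierarchical model we have in mind, written with the `rstanarm` package as one possible choice. The data frame name `dhs_households`, the covariate names, and the grouping variable `clan` are illustrative assumptions, not our actual pipeline; a Poisson model of household size is shown, while the logistic-regression variant mentioned in the text would swap in `family = binomial()`.

```r
# Hedged sketch: a hierarchical (multilevel) model of household size,
# with clan-level random intercepts. All object and column names here
# are illustrative assumptions.
library(rstanarm)

fit <- stan_glmer(
  household_size ~ urban + nighttime_light + (1 | clan),  # clan-level intercepts
  data   = dhs_households,   # discrete point data (DHS household records)
  family = poisson(link = "log")
)

# Generate predictions at new grid locations, turning the discrete
# point input into a continuous predicted surface
preds <- posterior_predict(fit, newdata = grid_points)
```

The key idea is the `(1 | clan)` term: household observations are pooled partially within each clan, which is what makes the model hierarchical rather than a single flat regression.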

After generating the random forest model, which is our first milestone, we are going to use it as a tool to predict the population distribution. First, we replace all the missing values with zero (using “replace_na”) to speed up the later process, and then drop layers that do not help the predicting model. These are necessary steps to reorganise the data and make the prediction much faster and more efficient. We also found that the “raster” package is very powerful for aggregating and tidying up the data; many useful, common functions either come from it or build on it. Since the dataset is quite big and the model takes some time to process it, I learned many skills from the professor and also talked to some experts in the college's IT office. For this particular process, we used “beginCluster” and “endCluster”, which speed up the computation by running it in parallel. In other cases, we might use the college's shared computers to create nodes and files instead of using our own machines. The HPC approach makes the process much faster, but it takes some effort to inspect the plots and the results of individual lines of code, since the code has to run all together at one time.
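The preprocessing and parallel-prediction steps above can be sketched roughly as follows. Object names (`covariates`, `rf_model`) and the dropped layer name are placeholders, and the NA-to-zero replacement is written with raster indexing rather than `replace_na`, which operates on data frames; this is a sketch of the workflow, not our exact code.

```r
# Hedged sketch of raster preprocessing and parallel random forest
# prediction. `covariates` is assumed to be a RasterStack of
# predictor layers and `rf_model` a fitted randomForest object.
library(raster)
library(randomForest)

# Replace missing cells with zero so prediction does not stall on NAs
covariates[is.na(covariates)] <- 0

# Drop a layer that contributes nothing to the model (name illustrative)
covariates <- dropLayer(covariates, "snow")

# Parallelise prediction across cores with the raster cluster helpers
beginCluster(4)
pred_map <- clusterR(covariates, raster::predict,
                     args = list(model = rf_model))
endCluster()
```

`beginCluster`/`clusterR`/`endCluster` split the raster into chunks and predict them on separate worker processes, which is why intermediate plots are harder to inspect: the work runs as one batch rather than line by line.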

Time flies; though we are a little behind schedule for now, we already have a clear plan ahead. For now, we are going to create a template and analyse the data we have already downloaded. I have to keep reading papers on random forests to see how the model works underneath, and I have read more about the bootstrapping method, the tree-building process, how many trees to create, and so on. Before setting up the random forest model, we check whether all the sub-variables of each variable (land cover, nighttime light, and settlement) matter for the population distribution we care about. We use RMSE to make an importance plot of the variables and delete the sub-variable data that has zero importance. For land cover, the “snow” variable has been deleted from the dataset, since it has no effect on the result and would only add to the model's running time. Then, after sorting through all the downloaded data, we put it into the “randomForest” model to create a large randomForest object. From there, I also create a variable importance plot to see which variable is the most crucial for our project; the result is that “urban” is the most important one and “water.permanent” the least important.
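The model-fitting and importance steps just described look roughly like the sketch below. The training data frame `train_df` and the exact covariate list are illustrative assumptions; the `randomForest` and `varImpPlot` calls are the standard ones from the randomForest package.

```r
# Hedged sketch of fitting the random forest and inspecting variable
# importance. `train_df` and its columns are illustrative placeholders.
library(randomForest)

rf_model <- randomForest(
  population ~ urban + nighttime_light + settlement + water.permanent,
  data       = train_df,
  ntree      = 500,        # number of trees to grow
  importance = TRUE        # record importance measures during fitting
)

# Importance plot: sub-variables with near-zero importance (like the
# "snow" land-cover class in our case) can be dropped before re-fitting
varImpPlot(rf_model)
importance(rf_model)   # numeric importance table (e.g. %IncMSE)
```

With `importance = TRUE`, the forest records how much the prediction error grows when each variable is permuted, which is what the importance plot ranks; in our run, “urban” came out on top and “water.permanent” at the bottom.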