Fragile Family Scale Construction
http://beehive.cs.princeton.edu/fragile-family-scale-construction/
Mon, 06 Mar 2017

The workflow for creating scale variables for the Fragile Family data is broken into four parts.
Here, we describe the generation of the Social Skills Self-Control subscale.
I highly recommend opening the scales summary document at /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv with some spreadsheet viewing software (e.g. Excel), along with one or more of the scales documents for years 1, 3, 5, and 9: http://www.fragilefamilies.princeton.edu/documentation
First, SSH into the Della server and cd into the Fragile Families restricted-use directory:

cd /tigress/BEE/projects/rufragfam/data

Step 1: create the scale variables file. Relevant script: sp1_processing_scales.ipynb or sp1_processing_scales.py. This Python script first obtains the prefix descriptors for the individual categories. That is, in the scales documentation, each group of questions is labeled as being asked of the mother, father, child, teacher, etc., and each of these categories has an abbreviation. The raw scale variables file can be accessed with

less /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv

It is useful to have this file open in some spreadsheet or tab-delimited viewing software to get an idea of how the data is structured. Next, the script creates a map between each prefix descriptor and the corresponding Fragile Families raw data file. It then, through a combination of automated and manual work, attempts to match all variables defined in the scale documentation against the raw data files. After this curation, 1514 of the scale variables defined in the PDFs could be found in the data, and 46 could not.

This step only needs to be run when additional scale documents become available, for instance when the year 15 data is released. In that case, the year 15 scale variables need to be added to the fragile_families_scales_noquote.tsv file before running this step.

Step 2: create the per-scale data tables. Relevant script: sp2_creating_scales_data.ipynb or sp2_creating_scales_data.py. This script takes the scale variables computed in step 1 and converts them into data tables, one per scale. The output is stored in tab-delimited files:

ls -al /tigress/BEE/projects/rufragfam/data/raw-scales/

The output of this step still contains the uncleaned survey responses from the raw data. For any scale, there are a large number of inconsistencies and errors in the raw data, and these need to be cleaned before we can do any imputation or scale conversion. As with step 1, this step only needs to be done when new scale documentation is released, and only after updating fragile_families_scales_noquote.tsv.

Step 3: clean the data and convert it from the Fragile Families format into a format that can actually be run through imputation software.

Relevant script: sp3_clean_scales.ipynb or sp3_clean_scales.py.

All unique responses to questions for a scale, e.g. Mental_Health_Scale_for_Depression, can be computed from the raw scale table produced in step 2.
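For example, a minimal pandas sketch along these lines (the exact file name in raw-scales/ is an assumption based on the naming pattern of the other scale tables; the snippet actually used in sp3_clean_scales is not reproduced here):

import pandas as pd

# load the raw (uncleaned) table for one scale; path and file name are assumed
df = pd.read_csv('/tigress/BEE/projects/rufragfam/data/raw-scales/'
                 'Mental_Health_Scale_for_Depression.tsv', sep='\t')

# print the unique raw response values observed for each survey variable
for col in df.columns:
    print(col, df[col].dropna().unique())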

Unfortunately, there doesn't seem to be an automated way to decide what each raw response means and how it should be recoded, so I recommend going through the scale documents and the question/answer formats.

The FF scale variables and the set of all response values they can take can be found in the file:
/tigress/BEE/projects/rufragfam/data/scale_variables_and_responses.tsv

The FF variable identifiers and labels (survey questions) can be found in the file:
/tigress/BEE/projects/rufragfam/data/all_ff_variables.tsv

To add support for a new scale, the replaceScaleSpecificQuantities function needs to be updated to encode the raw response values as something meaningful. For instance, for the Social Skills Self-Control subscale, we process Social_skills__Selfcontrol_subscale.tsv, replace values we wish to impute with float('nan'), and recode the rest of the values according to the ff_scales9.pdf documentation. The cleaned scales will be generated in the directory /tigress/BEE/projects/rufragfam/data/clean-scales/
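As a rough illustration of the kind of update involved (the missing-value codes and the recoding below are placeholders, not the real ff_scales9.pdf codings; the actual logic belongs inside replaceScaleSpecificQuantities in sp3_clean_scales.py):

import pandas as pd

def clean_self_control(df):
    # placeholder codes to be imputed later; the real codes come from ff_scales9.pdf
    missing_codes = [-9, -8, -7, -6, -5, -3, -2, -1]
    # placeholder recoding of the remaining values, e.g. shifting a 1-4 Likert scale to 0-3
    recode = {1: 0, 2: 1, 3: 2, 4: 3}
    # note: in the real script, identifier columns should be excluded before recoding
    cleaned = df.replace(missing_codes, float('nan'))
    return cleaned.replace(recode)

df = pd.read_csv('Social_skills__Selfcontrol_subscale.tsv', sep='\t')
clean_self_control(df).to_csv('clean-scales/Social_skills__Selfcontrol_subscale.tsv',
                              sep='\t', index=False)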

Step 4: compute the scale values

Relevant script: sp4_computing_scales.ipynb or sp4_computing_scales.py. From the cleaned data and the procedures defined in the FF scales PDFs, we can reconstruct scale scores. To add support for your scale, add a branch for it to the if scale_file statement block. For example, the Social_skills__Selfcontrol_subscale.tsv scale is processed by first imputing the data and then summing up the individual counts across survey questions for each wave. The final output file with all the scale data will be stored in /tigress/BEE/projects/rufragfam/data/ff_scales.tsv.
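For illustration, a sketch of that impute-then-sum step (the identifier column name, the wave prefixes, and the use of mean imputation are all assumptions; the authoritative procedure is in sp4_computing_scales.py and the FF scales PDFs):

import pandas as pd

df = pd.read_csv('clean-scales/Social_skills__Selfcontrol_subscale.tsv', sep='\t')

scores = pd.DataFrame({'idnum': df['idnum']})      # respondent identifier (column name assumed)
for wave in ['year9_child', 'year9_teacher']:      # hypothetical wave/respondent prefixes
    items = [c for c in df.columns if c.startswith(wave)]
    # mean-impute missing item responses, then sum the items for this wave
    imputed = df[items].apply(lambda col: col.fillna(col.mean()))
    scores['selfcontrol_' + wave] = imputed.sum(axis=1)

scores.to_csv('ff_scales_selfcontrol.tsv', sep='\t', index=False)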

After adding in your scale in Steps 3 and 4, you can use the ff_scales.tsv file for data modeling. This is where it gets interesting!

Installing IPython notebook on Della – the conda way (featuring Python 3 and IPython 4.0, among other things)
http://beehive.cs.princeton.edu/installing-ipython-notebook-on-della-the-conda-way-featuring-python-3-and-ipython-4-0-among-other-things/
Sun, 20 Dec 2015

Thanks to Ian's previous post, I was able to set up IPython notebook on Della, and I've been working extensively with it. However, when I tried to sync notebooks between my local machine and Della, I found out that the version of IPython on Della is the old 2.3, and that IPython is not backward compatible. So any IPython notebook that I create and work on locally simply will not work on Della, which is quite annoying.

Also, I think there is a lot of benefit to setting up and using Anaconda in my Della directory. It sets up a lot of packages (including Python 3, instead of the archaic 2.6 that Della runs; you have to module load python as Ian does in his post in order to load 2.7) and manages them seamlessly, without having to worry about what package is in what directory.

Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.

Note: I initially tried using the easy_install way of installing conda, only to run into the following error:
Error: This installation of conda is not initialized. Use 'conda create -n
envname' to create a conda environment and 'source activate envname' to
activate it.

# Note that pip installing conda is not the recommended way for setting up your
# system. The recommended way for setting up a conda system is by installing
# Miniconda, see: http://repo.continuum.io/miniconda/index.html

It is indeed preferable to follow their instructions. Then run:
sh Miniconda3-latest-Linux-x86_64.sh
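
If the installer is not already on Della, it can be fetched first; the URL below simply follows the Miniconda repository linked in the note above and may change over time:

wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh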

And follow its instructions. Conda will install a set of basic packages (including python 3.5, conda, openssl, pip, and setuptools, to name only a few useful ones) under the directory you specify, or the default directory:
/home/my.user.name/miniconda3

It also modifies the PATH for you so that you don't have to worry about that yourself. How nice of them. (But sometimes you might need to specify the default versions of programs that are on Della, especially when distributing jobs to other users, etc. Don't forget to specify them when needed. You should be set for most use cases, though.)
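To double-check that the shell now picks up the conda-managed Python rather than the system one, something like this works:

which python       # should now point into ~/miniconda3/bin
python --version   # should report the conda-installed Python (3.5 at the time of writing)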

Now, since we are using the copy of pip that ships with conda, simply running
pip install ipython
pip install jupyter

or

conda install ipython
conda install jupyter

will integrate these packages into your environment. Neat.

That’s it! You can double check what packages you have by running:
conda list

After this, the steps for serving the notebook to your local browser are identical to the previous post. Namely:

ipython notebook --ip=127.0.0.1 --profile=profilename --port <Your Port #>
# note that if you are trying to access Della
# from outside the Princeton CS department, you
# may have to forward the same port from your home computer
# to some princeton server, then again to Della

# secret_gist_string is the string already associated with a particular file on Github
# To obtain it, the first time you upload a file to Github (e.g. my_notebook.ipynb) go to
# https://github.com/ | Gist | Add file [upload file] | Create secret Gist, which will
# return a secret_gist_string on the panel at right (labeled “HTTPS”)

Here is an example ipython notebook that I shared through gist and is available for viewing:

This post contains a quick overview of useful command line tools for research found in standard bash shells. For the sake of brevity, I will only include essential information here, and link to other informative pages where available.

awk

awk (also implemented in the newer versions nawk and gawk) is a utility for manipulating text files. Its syntax is much simpler than perl or python but still enables the fast writing of scripts that process files by lines and columns.

examples

For each line in the file input.txt, print the first and third column separated by a tab. Store the result in output.txt

awk '{ print $1 "\t" $3 }' input.txt > output.txt

Replace each column in file input.txt with its absolute value and print out each line to output.txt.
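
One way to do this, assuming every column is numeric (note that awk re-joins the fields with its output field separator, a space by default):

awk '{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }' input.txt > output.txt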

links

screen

screen is an invaluable tool for creating virtual consoles that let you (1) keep sessions active through network failures, e.g. when using secure shells, (2) connect to your session from different locations, and (3) run a long process without maintaining an active shell. Also see the alias command for a useful screen alias.

examples

Open a new screen console, list all available screen sessions, and reattach to screen with a particular ID
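
For example (the session ID below is just illustrative; screen -ls shows the real IDs):

screen            # open a new screen console
screen -ls        # list all available screen sessions
screen -r 12345   # reattach to the screen session with ID 12345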

links

find

A tool for finding files. Can be chained with other tools for powerful pipelines. Useful options are -name to find files by filename, -wholename to find files by filename and path, and -maxdepth to descend at most a given number of directory levels. The -exec option is VERY useful.

examples
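
A few illustrative examples (the file patterns are placeholders):

find . -name "*.txt"                     # find all .txt files under the current directory
find . -maxdepth 2 -wholename "*data*"   # match on the full path, descending at most two levels
find . -name "*.tmp" -exec rm {} \;      # run a command (here rm) on every match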

links

paste/join/cat

Tools for combining files.
paste merges files line by line.
join merges two files on a common field.
cat concatenates a number of files one after the other. Can also be used to print a file to standard output.

examples

paste the lines of input1.txt and input2.txt together separating them with a space

paste -d " " input1.txt input2.txt

join two files input1.txt and input2.txt by the first field of both files
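
For example (join expects both inputs to be sorted on the join field, so sort them first):

sort -k1,1 input1.txt > sorted1.txt
sort -k1,1 input2.txt > sorted2.txt
join -1 1 -2 1 sorted1.txt sorted2.txt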

uname -a: get information about the machine you are currently logged into. Related: echo $0 prints your current interpreter/shell.

md5sum: checksum files. Verify that some important files you downloaded are genuine.
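
For example (the file name is a placeholder):

uname -a                           # kernel, hostname, and architecture of the current machine
echo $0                            # name of the current interpreter/shell
md5sum important_download.tar.gz   # compare against the checksum published by the source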

A worked-through exercise

Now that we’ve learned the basics of these commands, let’s put them all together.
You will have to use what you have learned to work out solutions to the tasks in bold. Note that there are many ways to solve each problem.

The scenario

Due to the recent discovery that the Brontosaurus may indeed be a genus of dinosaur, the NSF has reallocated all of its funding to dinosaur research. A collaboration between leading archaeologist Dr. Li-Fang (aka the iron fist of Taiwan) and crazed molecular biologist Dr. Bianca resulted in the extraction of DNA samples from several Brontosauri fossils. After characterizing the set of variants in the sample, a variant call format (VCF) file was generated and uploaded to the cloud. During upload, a deranged hacker and former world-class sprinter named Greg, who has a personal vendetta against the Brontosaurus, corrupted the text file. We must clean this file using the tools described in this tutorial.

First we grep the double-# lines of the header and store them in a temporary file. We then grep the non-header lines, paste them with the new dinosaur sequence using a space delimiter, and concatenate the result with our header temp file.
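
A sketch of that pipeline (the file names corrupted.vcf and dino_seq.txt are assumptions standing in for the exercise files):

grep "^##" corrupted.vcf > header.tmp             # keep the double-# header lines
grep -v "^##" corrupted.vcf > body.tmp            # keep the non-header lines
paste -d " " body.tmp dino_seq.txt > merged.tmp   # paste in the new dinosaur sequence, space-delimited
cat header.tmp merged.tmp > cleaned.vcf           # concatenate header and cleaned body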


Running NBP-Iso framework on BEERS simulated data
http://beehive.cs.princeton.edu/running-nbp-iso-framework-on-beers-simulated-data/
Fri, 23 Jan 2015

This tutorial will explain how to run the full NBP-iso framework.
The major steps are:
1) Simulate reads sampled from novel splice forms using BEERS simulator.
2) Simulate multiple individuals using the output of the BEERS simulator.
3) Run the NBP-iso model.
4) Visualization and analysis of the results.

ipython notebook --ip=127.0.0.1 --pylab=inline --profile=profilename --port <Your Port #>
# note that if you are trying to access Della
# from outside the Princeton CS department, you
# may have to forward the same port from your home computer
# to some princeton server, then again to Della