
The data was made available to use on 20151224 and took two days to download.

The full list of samples (and the individual samples/libraries/indexes) submitted to Genewiz for this project by Katherine Silliman & me can be seen here (Google Sheet): White_BS1511196_R2_barcodes

Genewiz supplied all of the Illumina output files in addition to demultiplexed FASTQ files. We're currently not entirely sure where/how we want to store the Illumina output, but we'll probably want to use it for attempting our own demultiplexing, since there were a significant number of reads that Genewiz was unable to demultiplex. The FASTQ files were buried in inconvenient locations, and there are over 300 of them, so I used the power of the command line to find them and copy them to a single location: http://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/

Location of the files I wanted to search through. The path looks a little crazy because I was working remotely and had the server share mounted.

-name '*.fastq.*'

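The -name argument tells the find command to look for filenames that have “.fastq.” in them (e.g. names ending in .fastq.gz). The single quotes keep the shell from expanding the wildcards before find sees them.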

-exec cp -n '{}'

The -exec option tells the find command to execute a subsequent action upon finding a match; the '{}' is a placeholder that find replaces with the path of each matched file. In this case, I’m using the copy command (cp) and telling the program not to overwrite (clobber, -n) any duplicate files.
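Here's a minimal, self-contained sketch of how the find pieces described above fit together. The scratch directories and file names (src_dir, dest_dir, sample_A.fastq.gz, etc.) are hypothetical stand-ins created just for this illustration; they are not the actual server paths or the exact command I ran.

```shell
#!/usr/bin/env bash

# Hypothetical scratch directories standing in for the mounted share
# (source) and the single destination folder.
src_dir=$(mktemp -d)
dest_dir=$(mktemp -d)

# Fake a nested layout: two FASTQ files and one unrelated file.
mkdir -p "$src_dir/run1/laneA" "$src_dir/run1/laneB"
echo "reads A" > "$src_dir/run1/laneA/sample_A.fastq.gz"
echo "reads B" > "$src_dir/run1/laneB/sample_B.fastq.gz"
echo "notes"   > "$src_dir/run1/laneB/notes.txt"

# A file with the same name already sits in the destination;
# cp -n should leave it untouched.
echo "existing" > "$dest_dir/sample_A.fastq.gz"

# Find every regular file whose name contains ".fastq." and copy it
# to the destination without overwriting anything already there.
# The escaped semicolon ends the -exec command (one cp per file).
find "$src_dir" -type f -name '*.fastq.*' -exec cp -n '{}' "$dest_dir" \;

ls "$dest_dir"
```

Some versions of cp print a warning when -n skips a file; the existing file is left alone either way.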

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));

Same for loop as above that calculates the number of reads in each FASTQ file.

printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;

This formats the printed output. The “%s\t%s\n\n” portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between the entries. See the readme file linked above to see how the output looks.

>> readme.md; done

This appends the result from each loop to the readme.md file and ends the for loop (done).
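To see the whole loop in one piece, here's a small sketch that can be run anywhere. It assumes standard four-line FASTQ records, and it builds two tiny gzipped files in a hypothetical scratch directory (work_dir and the sample_*.fastq.gz names are made up for the example) rather than touching the real data.

```shell
#!/usr/bin/env bash

# Hypothetical scratch directory so the example doesn't touch real files.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# Two reads (8 lines) in one file, one read (4 lines) in the other.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > sample_A.fastq.gz
printf '@r1\nGGCC\n+\nIIII\n' | gzip > sample_B.fastq.gz

for i in *.gz; do
  # Count lines in the decompressed stream; each FASTQ record spans 4 lines.
  linecount=$(gunzip -c "$i" | wc -l)
  readcount=$((linecount / 4))
  # File name, tab, read count, then an empty line between entries.
  printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md
done

cat readme.md
```

The only difference from the one-liner is the use of $( ) instead of backticks for the command substitution, which does the same thing but is easier to read and nest.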