I'm running an RNA-seq analysis using FASTQ data from an Illumina HiSeq Rapid V2 machine (single reads at 50 bp). I don't have experience with UNIX coding, so I am using Galaxy, specifically the Tuxedo applications, to align/map my reads and then perform differential expression analysis (probably with Cuffdiff).

I have 3 conditions with 6 biological replicates in each. Each biological replicate was run on 2 lanes, so I have 2 technical replicates per sample. On top of that, the data for each technical replicate arrived as 2 separate FASTQ files. The technician mentioned that the machine automatically starts a new file when the current one reaches roughly 200 MB.

So I'm wondering about the proper method of combining these files. First, how do I combine the 2 FASTQ files for each technical replicate? I imagine this has to be done early, prior to mapping with Tophat. Second, at which step should I combine the technical replicates? At the differential expression analysis step I should only be comparing biological replicates; including separate technical replicates would be pseudoreplication. So I imagine combining my technical replicates happens prior to the Cuffdiff step, but I'm not sure whether it happens before or after mapping.

To merge multiple FASTQ files that represent a single sample, use the tool Concatenate datasets tail-to-head. It is fine to run QC on the individual files first (this helps narrow down where lab issues may have occurred), but merge before doing anything else. Is the data paired-end? If so, merge the forward datasets together and the reverse datasets together, then run each sample's pair through a mapping tool such as Tophat. For single-end data, map each sample's single merged file instead.
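For anyone who ends up doing this outside Galaxy, here is a minimal Python sketch of the same tail-to-head concatenation, with a read count as a sanity check. It is only an illustration, not what the Galaxy tool runs internally: the function names and the file names in the usage comment are made up, and it assumes standard four-line FASTQ records in plain or gzip-compressed files.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path, mode):
    """Open a plain or gzip-compressed file in binary mode (chosen by the .gz suffix)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def concatenate_fastq(parts, merged):
    """Write every input file to `merged`, one after another, in the order given."""
    with open_maybe_gzip(merged, "wb") as out:
        for part in parts:
            last_line = b""
            with open_maybe_gzip(part, "rb") as src:
                for line in src:
                    out.write(line)
                    last_line = line
            # Guard against a part that lacks a final newline, which would
            # otherwise glue its last record onto the next file's first header.
            if last_line and not last_line.endswith(b"\n"):
                out.write(b"\n")


def count_reads(path):
    """Count reads, assuming the usual four lines per FASTQ record."""
    with open_maybe_gzip(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


# Hypothetical usage for one sample split across two lanes and two files per lane:
#   parts = ["s1_L001_001.fastq.gz", "s1_L001_002.fastq.gz",
#            "s1_L002_001.fastq.gz", "s1_L002_002.fastq.gz"]
#   concatenate_fastq(parts, "s1_merged.fastq.gz")
#   assert count_reads("s1_merged.fastq.gz") == sum(count_reads(p) for p in parts)
```

The check at the end of the usage comment is the useful part: the merged file should contain exactly the sum of the reads in its parts, whichever tool did the merging.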

I clarified with the technician, and she would not treat the sequences from 2 lanes for one sample as technical replicates (because data was not collected twice from the same prep). Rather, she said to treat the data from the 2 lanes as subsets of the same sequence. So they should be combined cumulatively: if each lane produced 5 million reads, the combined dataset would contain 10 million reads. What tool would be appropriate for this type of merge? Would it be the "concatenate two datasets into one dataset" option under "operate on genomic interval"?