I am performing differential gene expression analysis using the Tophat-Cuffdiff protocol. I have run fastQC on 12 fastq files, and realized that all 12 files have "N" as base call in the middle of each read(total read length = 51bp, N appears at 28bp), due to a fluidics issue while running the sequencer. All the other base calls are of high quality and there are 20 million reads for each sample.

How can I go about mapping these reads, while ignoring the N at 28bp of each read? Should I be setting the "final read mismatches" value from 2(default) to 3?

N indicates an ambiguous base call and it may interfere with optimal alignments. Perhaps try running the mapping job first and examining the results to see if more manipulation is needed (since this is easier). Then you can decide if you want to remove the base or not. Manipulating the global matches parameters will impact all of the sequence - not just this base - and will probably not produce the desired results. But this could be tested and the results compared (overall mapping statistics and a visualization of a few regions in Trackster, or at UCSC, or the browser of your choice).

To remove the base, the workflow will go something like this: Convert the fastq file(s) to tabular format, split at the N base, remove that base, merge the sequence end back together, and convert the format back to fastq (fastqsanger). I do not believe that using the tools FASTQ splitter/joiner will work in this case since the tool was not designed for this type of manipulation (the ends are not the same length). Creating a custom workflow is likely the only option unless you want to correct the file locally using unix tools line-command ('sed' is one option).

Tools can be found in the groups Text Manipulation, Convert Formats, and NGS: QC and manipulation. Most that you will use will be analogous to common line-command operations.

Thanks for the response. I have always appreciated your fast responses and helpful suggestions. I just want to clarify a couple things:

1) How can visualization of the mapped reads lead to identifying which method is the 'correct' or 'preferred' manipulation method? In my case, there is no 'answer' to how the reads align onto a genome, since RNA-seq for my study is used as one of the "screening" methods, instead of verification purposes.

2) If I were to remove the base and merge the ends of the reads back together, wouldn't that lead to all bases beyond 28bp moving up by one base pair, since N is removed. Would this not cause problems while mapping the reads to the reference genome?

If the N base is artifact, then it can and should be removed. If it is poor base call, then leave it in. I wasn't sure which you meant originally so gave both options.

The idea behind visualizing the mapping results is that you can compare to see which runs produce the best and most complete spliced alignments. "Best" is a judgment call, for any mapping parameters/tools. Including known transcripts (GTF, GFF, BED) that cover some of the regions can help with this, if available for your genome.

One more question: Would contatenating two fastq increase the sequencing depth? one fastq with the N in the middle, and one without?
I ask this question, as my samples were run across two lanes- one which worked fine, and one with the fluidity issue(causing the ambiguous base call in the middle of every read)- and I would like to stack these reads to get a greater depth.