I am preparing to conduct an RRBS study of a conifer (with an estimated ~30 Gb genome), and I would like to first complete a pilot study with very few samples on a MiSeq (due to the lower cost of a MiSeq run) to plan how many experimental samples I can eventually assess in one run of a HiSeq4000 or NextSeq500. I am targeting 103,000 loci of interest across the genome, and would like to obtain 60x coverage.

I have heard mixed opinions on the advisability of the cross-platform planning approach I am considering, so I am interested to know if others have found that type of approach helpful, and/or can provide reasoning to help guide this decision.

First of all, standard RRBS using MspI may not be a good approach for plants, as RRBS interrogates the methylation status of C in the CpG context, which is most relevant for mammals.

In a pilot RRBS you will be interested in knowing the number of restriction fragments in a sample library in order to estimate the sequencing requirement. This can be done on any platform. The choice of sequencer for the large-scale experiment will depend on sample number and cost. I don't know about the iSeq, but other Illumina platforms can sequence bisulfite-converted DNA equally well.

What I gathered from your response is that the pilot test I propose using a MiSeq will be useful mainly for providing an estimate of the number of restriction fragments I am actually attempting to sequence, given that I currently only have theoretical estimates via in-silico digests. In that case, it makes sense that there should be no problem in using a MiSeq to assess the number of restriction fragments I achieve through my double digest. Knowing fragment number will then allow me to calculate the number of samples to assess in parallel on a higher throughput Illumina machine. Thanks for that clarification.

Regarding the first portion of your reply:

While plants do maintain a number of methylated sequence contexts, CpG appears to remain an important sequence context for studies of differential methylation in plants.

I should clarify that by fragment numbers I meant fragments that are flanked by both REs and fall within the size-selection window. With a 4-cutter plus 6-cutter combination in a 30 Gb genome, I guess there will be a lot more restriction fragments than 100k. Thanks for the reference as well.

Assuming bisulfite conversion comes after adapter ligation, I wonder how you get around the high cost of methylated adapters.

My in-silico digest (simRAD in R) returned only ~100k fragments that meet a size-selection criterion of ~150-550 bp, which I'll target for my sequencing runs. That criterion is surely why there were only 100k fragments instead of millions.

To answer your question, I do not know of an alternative to using methylated adapters for this kind of work, so that is precisely what I've chosen to use...for now.

I'd like to raise another aspect of my initial question on this thread: I have inexpensive and easy access to a NextSeq500, but have the sense that the HiSeq4000 might be the best platform for my full sequencing effort once this pilot study is complete. Given that the number of reads obtained from one NextSeq500 run (~400M) is roughly comparable to that from one lane of a HiSeq4000 (~300M), what other factors should be considered when deciding which platform would be best for a ddRADseq approach?
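For reference, the multiplexing arithmetic behind that comparison can be sketched in a few lines of Python. This is only a rough estimate using the figures quoted in this thread (103,000 loci, 60x coverage, ~400M and ~300M reads). It assumes one read per locus per unit of coverage and perfectly even coverage across loci and samples. The `usable_fraction` parameter is a hypothetical knob for reads lost to PhiX spike-in, adapter dimers, or failed mapping; its value would have to come from the pilot run.

```python
import math

def samples_per_run(reads_per_run, n_loci, target_coverage, usable_fraction=1.0):
    """Samples that fit in one run at the target per-locus coverage.

    Assumes perfectly even coverage; real libraries will be less uniform.
    """
    reads_per_sample = n_loci * target_coverage
    usable_reads = reads_per_run * usable_fraction
    return math.floor(usable_reads / reads_per_sample)

N_LOCI = 103_000   # loci targeted, from the thread
COVERAGE = 60      # desired coverage per locus, from the thread

# Approximate yields quoted in the thread.
nextseq_samples = samples_per_run(400_000_000, N_LOCI, COVERAGE)
hiseq_lane_samples = samples_per_run(300_000_000, N_LOCI, COVERAGE)

print("NextSeq500 run:", nextseq_samples, "samples")
print("HiSeq4000 lane:", hiseq_lane_samples, "samples")
```

If the pilot shows that the real fragment count is far above 103,000, the same function shows how quickly the number of samples per run shrinks.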

I am very surprised that you see just 100k fragments for a 6-cutter plus 4-cutter and a size selection of 150-550 bp. The 4-cutter will cut primarily every 150-350 bp, so your size selection will include many if not most of the 6-cutter sites, as it is likely they will have a 4-cutter site in the 150-550 bp flanking sequence. A 6-cutter will cut about every 4 kb on average, so I would expect >5M sites. Even with a skewed GC composition in the genome compared to the cut sites, it is hard to imagine going from 5M sites to 100k.
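A back-of-envelope version of that expectation, as a minimal Python sketch. It assumes a uniform random genome (each base equally likely, sites independent), which a real conifer genome will not satisfy. It also ignores adapter length in the size window and any methylation sensitivity of the enzymes. The function names are just for illustration.

```python
GENOME_BP = 30e9       # ~30 Gb genome, from the thread
WINDOW = (150, 550)    # size-selection window in bp, from the thread

def expected_spacing(site_len):
    """Mean distance (bp) between sites of a given length, uniform model."""
    return 4 ** site_len

def prob_next_site_in_window(site_len, lo, hi):
    """P(lo <= distance to next site <= hi), with a geometric distance model."""
    p = 1 / expected_spacing(site_len)
    return (1 - p) ** (lo - 1) - (1 - p) ** hi

six_cutter_sites = GENOME_BP / expected_spacing(6)
p_window = prob_next_site_in_window(4, *WINDOW)

# Each 6-cutter site has two flanks, and each flank can yield one
# 6-cutter/4-cutter fragment if the nearest 4-cutter site is in the window.
approx_fragments = 2 * six_cutter_sites * p_window

print(f"Expected 6-cutter sites: {six_cutter_sites:,.0f}")
print(f"P(nearest 4-cutter site in window): {p_window:.2f}")
print(f"Rough fragment estimate: {approx_fragments:,.0f}")
```

Under these assumptions the estimate lands in the millions, which is why a simulated count of ~100k looks low unless the recognition sites are strongly depleted in the genome or the simulation was run on a subsample.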

We like the HiSeq4000 compared to the NextSeq because of the better-quality data. For an RRBS study, though, will you just do short reads to tag the loci and see if they are present in the methylation-sensitive library? In that case the error doesn't matter much. But the HiSeq is cheaper per nucleotide where we are (https://gc3f.uoregon.edu/illumina-sequencing): compare $1,700 for a HiSeq4000 lane of 150 bp reads with $2,800 for a comparable NextSeq lane. They run a ton of RADSeq and ddRAD libraries as well.
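To put rough numbers on the price difference, here is a small Python sketch. It combines the prices above with the approximate read yields quoted earlier in the thread (~300M reads per HiSeq4000 lane, ~400M per NextSeq500 run). Pairing those yields with these prices is an assumption, as is equal read length on both platforms. The per-sample figure uses the 103,000 loci and 60x target and covers sequencing only.

```python
def cost_per_million_reads(price_usd, reads_millions):
    """Sequencing price divided by yield, in USD per million reads."""
    return price_usd / reads_millions

# Prices from this post; yields are the approximate figures quoted earlier.
hiseq_cost = cost_per_million_reads(1700, 300)
nextseq_cost = cost_per_million_reads(2800, 400)

# Reads needed per sample for 103,000 loci at 60x, in millions.
reads_per_sample_m = 103_000 * 60 / 1e6

print(f"HiSeq4000 lane: ${hiseq_cost:.2f} per million reads")
print(f"NextSeq500 run: ${nextseq_cost:.2f} per million reads")
print(f"Per sample: ${hiseq_cost * reads_per_sample_m:.2f} vs "
      f"${nextseq_cost * reads_per_sample_m:.2f}")
```

The comparison changes if one of the machines is available at reagent cost only, so the prices are best treated as inputs to swap out.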

I've actually been in touch with U of O core center personnel regarding this project, and if I go HiSeq, that's where I'll send my samples.

Your question is pertinent to my decision regarding which platform to use: "will you just do short reads to tag the loci and see if they are present in the methylation-sensitive library?"

I'd like to obtain sequence data of high enough quality, coverage, and length to accomplish the following objectives:
1) detect variation in methylation status among loci
2) search for CG/CHH/CHG etc. sequence contexts surrounding or underlying differentially methylated loci
3) call SNPs and BLAST differentially methylated regions against the closest reference genomes available for my non-model study organism (which, for now, will have to be another species of white pine) to infer rates of differential methylation in different functional genomic regions

I've found published studies that achieved the above goals using 100 cycle single-end HiSeq2000 runs, and I'm honestly unclear as to whether paired-end sequencing and/or longer reads would better enable me to achieve my objectives. The only reason I am considering NextSeq is that I have access to that machine for only the price of the reagents, making it a bit more convenient than HiSeq. So, if the machine is appropriate for my needs, I would opt for NextSeq...but not at the cost of foregoing any of my research objectives.

We got good results with both NextSeq and HiSeq4000 runs, so I think you are safe either way. Given the size of the genome, I'd want as much sequence information as possible at each locus to help improve the mapping accuracy. I can't quite remember how the NextSeq does these days with invariant nucleotides (the cut site). It used to be an issue and required lots of PhiX. I think they fixed it to some extent, but if you are running the lane yourself I'd be careful and look into it.

Will you pair the methylation-sensitive and insensitive libraries for every sample? Given ddRAD's propensity for locus dropout (from size-selection variation and from SNPs in the cut sites and locus sequence), you will need a control library for each sample, and it should be in the same size-selection run.

Can you share details on how you got the 100k sites from the simulation? That one still has me wondering.