I have the expression of one gene for 273 glioma patients, as well as their clinical data. I want to do a survival analysis and generate a Kaplan-Meier plot of the patients' survival based on the expression of the gene: "high" or "low". I saw this tutorial on Biostars (Tutorial: Survival analysis with gene expression), and the author takes the Z-score of the expression data to stratify expression as high or low. However, the Z-score is based on the expression of all genes per patient (i.e. taking the average expression and standard deviation of all genes for every patient). Since I don't have the expression of other genes, is it appropriate to take the Z-score for the expression of this gene across all patients (i.e. use the average expression and standard deviation of this gene for all the patients) and stratify high or low expression based on that? Or does survival analysis with gene expression have to be based on the expression of genes per patient, rather than one gene across all patients? I hope this makes sense, please let me know if I need to clarify more.

If you just have one gene, why not use tertiles, quartiles, quintiles, sextiles, etc. instead? Also, if you are just testing one gene, then you don't have to use RegParallel, as it's designed for quickly testing hundreds or thousands of genes independently.
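To make the quantile idea concrete, here is a minimal sketch in Python (standard library only) that assigns each patient to a quartile group for a single gene, using cut points computed across all patients. The expression values and the helper name `quantile_groups` are made up for illustration; swap in your own vector of 273 values.

```python
from statistics import quantiles

def quantile_groups(expression, n_groups=4):
    """Return a group index (1 = lowest, n_groups = highest) per patient.

    Cut points are computed from this one gene across all patients.
    """
    # n_groups - 1 cut points, e.g. 3 cut points for quartiles
    cuts = quantiles(expression, n=n_groups, method="inclusive")
    # A patient's group is 1 plus the number of cut points below their value
    return [1 + sum(value > c for c in cuts) for value in expression]

# Hypothetical expression values for 8 patients (illustration only)
expr = [2.1, 5.4, 3.3, 8.0, 6.7, 1.2, 4.8, 7.5]
groups = quantile_groups(expr, n_groups=4)
print(groups)
```

Setting `n_groups=3` gives tertiles instead. The resulting group labels can then be used as the stratifying variable for the Kaplan-Meier curves.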

For a survival model, you can test each gene independently, or create a multivariate model whereby the values of multiple genes are used. One can even include clinical parameters as additional covariates.
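As a rough illustration of what such a model looks like, a Cox proportional hazards model with one gene plus clinical covariates can be written as below. The covariates age and grade are only hypothetical examples of clinical parameters; use whatever is available in your clinical data.

\[
h(t \mid x) = h_0(t)\,\exp\left(\beta_1 x_{\text{gene}} + \beta_2 x_{\text{age}} + \beta_3 x_{\text{grade}}\right)
\]

Here \(h_0(t)\) is the baseline hazard and \(\exp(\beta_1)\) is the hazard ratio associated with the gene (per unit of expression if it is continuous, or relative to the reference group if it is categorical), adjusted for the other covariates.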

Hi Kevin, thank you for responding and for your suggestion (and for writing such a great tutorial). Doing quartiles or something similar is easier indeed, but I want to make sure it's appropriate; I will be dividing into quartiles for the expression of this one gene from all the samples, i.e. 'high' will mean high expression relative to other patients, rather than relative to other genes. Is this a conventional way to do survival analysis?

There is no right or wrong way, really. In my tutorial, I first transform the expression data to Z-scores by row (gene), and then perform the 1st pass analysis using the gene Z-scores on the continuous scale. I then identify key genes from this 1st pass and put those into a new Cox model, but encoded this time as low|mid|high. So, indeed, a gene with a high Z-score has high expression relative to all other genes.
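For a single gene, the two encodings described above (continuous Z-score, then low|mid|high) can be sketched in Python as follows. This is only an illustration, not the tutorial's code: the cut-offs of -1 and +1 and the helper names `zscores` and `encode` are assumptions, and the expression values are invented.

```python
from statistics import mean, stdev

def zscores(values):
    """Z-score one gene's expression values across patients."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sd for v in values]

def encode(z, low=-1.0, high=1.0):
    """Map a Z-score to low|mid|high; the cut-offs are assumed, not prescribed."""
    if z <= low:
        return "low"
    if z >= high:
        return "high"
    return "mid"

# Hypothetical expression values for 8 patients (illustration only)
expr = [2.1, 5.4, 3.3, 8.0, 6.7, 1.2, 4.8, 7.5]
z = zscores(expr)                    # continuous scale, for a first-pass model
labels = [encode(v) for v in z]      # categorical scale, for the follow-up model
print(labels)
```

With fixed Z-score cut-offs the three groups are usually unequal in size, whereas quantile-based groups are balanced by construction.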

In your case, using quartiles, you can just refer to the upper, middle, and lower quartiles and avoid the words 'high' and 'low', if that helps. Indeed, it would not be high relative to the other genes (well, it may be, but we don't know).