How to Data-Mine the Slush Pile

Let’s be realistic: it is not possible to give as much attention to unsolicited manuscripts as they may warrant. There simply isn’t the time. Furthermore, even if we did assiduously examine each manuscript submitted, it is very difficult to know what the quality of writing actually is. The odd stylistic aberration can be irritating to the traditionally trained editor but the average reader may not respond in the same way. The odd typographical error can also spell doom for a manuscript, notwithstanding that such errors can be found in even the most august publications. And let us not forget that there are those writers who were effectively dyslexic – F. Scott Fitzgerald comes to mind. Finally, we must consider that the judgement of the editor may not align with that of the reading public. How many times was J.K. Rowling’s initial manuscript rejected? And, drawing from experience in another genre, wasn’t it the case that Decca Records turned down the Beatles?

There has been a lot of discussion of whether or not it is possible to use algorithms to aid in the search for profitable manuscripts. The purpose of this article is to explain in non-technical terms how a publisher can very quickly sort through the slush pile and efficiently avoid turning down a potentially valuable manuscript without investing huge amounts of time and effort. Furthermore, the method described can reduce (but not entirely eliminate) the phenomenon of publishing an unsuccessful book. We know that a large proportion of books that are published are eventually pulped because the reading public did not feel the same way about them that the publisher did. In order to survive, the publisher therefore needs to have a number of successes to cover their losses from the failures. The method described below, while not guaranteeing success, can potentially shift the mix of publications towards the successful and away from the unsuccessful.

Before I begin I should point out that much of what follows will be couched in economic terms. Unfortunately, the era of the ‘gentleman publisher’ has passed and even those publishers with altruistic aims need to appeal to the market. This can take the form of having two imprints, such that one imprint is aimed at a mass market, the profits from which are used to subsidise a loss-making quality imprint. If this sounds a little instrumental, consider that writers have done this since the beginning of time. Robert Graves, for example, says that he wrote the first two Claudius books to make money (Graves Kersnowski 1989: 99). As he often stated, poetry was his real muse.

I am also acutely aware that discussions of economics can have a negative emotional impact on some segments of an audience. I agree that some things should be above thoughts of economics. But this is because I have the luxury of being able to hold such ideas – I am not running a publishing firm that is finding it hard to make ends meet. However, I am aware that publishers are finding things difficult and I want publishers to survive. One way to increase the probability of publishers surviving is to demonstrate the means of doing so. (Keep in mind that many publishers are already doing what I am about to discuss.)

Please note that in the following discussion we are dealing with probabilities, not absolutes. When I say that one manuscript has greater potential than another, I am not asserting that it has some absolute quality that will guarantee success. I am merely saying that it has more characteristics in common with manuscripts that have been successful than it has with manuscripts that have been unsuccessful.

The first step is to identify those publications that have generated a profit or at least broken even. If you can identify 10 that’s good. If you can identify 100 that’s great. The next step is to identify an equivalent number of manuscripts that have not broken even. In technical language, the first group is called the study group and the second group is called the control group. The identification of these two groups is the most important part of the procedure because this is what lies behind the ability to determine the difference between a loss-making book and a profit-making book.
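As a minimal sketch of this first step, assume each past title is recorded with a net profit figure. The field names `title` and `net_profit`, and the numbers themselves, are invented for illustration. The two groups could then be assembled and trimmed to the same size along these lines:

```python
import random

def build_groups(titles, seed=0):
    """Split past titles into a study group (broke even or better) and a
    control group (lost money), trimmed to the same size."""
    study = [t for t in titles if t["net_profit"] >= 0]
    control = [t for t in titles if t["net_profit"] < 0]
    n = min(len(study), len(control))
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(study, n), rng.sample(control, n)

# Hypothetical back-list records; real figures would come from your accounts.
titles = [
    {"title": "A", "net_profit": 12000},
    {"title": "B", "net_profit": -3000},
    {"title": "C", "net_profit": 0},
    {"title": "D", "net_profit": -800},
    {"title": "E", "net_profit": -150},
]
study, control = build_groups(titles)
```

Treating "broke even" as a net profit of zero or more is an assumption; a publisher might prefer a different cut-off.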

The next step in the procedure is to try to determine what the difference is between these two groups of books. However, there is an important caveat here: you cannot rely on the finished product. You have to go on the same material that you will be using to assess the unsolicited manuscripts in the slush pile. It may be that the successful books were marketed in a particular way while the unsuccessful ones were not, and this is something that a publisher needs to be aware of. But our focus in this article is on the slush pile and how to get the most out of it. None of the manuscripts in the slush pile has been marketed, edited or altered in any way by your firm. If you are going to come up with valid decisions about which manuscripts in the slush pile to publish and which to reject, you must therefore base your decisions on the raw materials. Thus, for your study group of successful books and your control group of unsuccessful books, you need to get hold of the manuscripts as they were first presented to you.

The next step is to analytically examine the differences between the unsuccessful and the successful manuscripts. The best starting point is to concentrate on the first chapter of each manuscript. The question is, how does one critically and objectively determine the differences between two sets of chapters? To put this a different way, let us assume that you have 50 first chapters from the initially submitted manuscripts of books that were eventually successful and 50 first chapters from the initially submitted manuscripts of books that were not successful. How do you trawl through this huge amount of text looking for patterns that distinguish between the successful manuscripts and the unsuccessful manuscripts? The answer is, you don’t – you get a computer to do it for you. This is where we call in the services of software specifically designed to look for patterns in text.
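To give a flavour of what "looking for patterns in text" can mean, here is a small sketch that reduces each first chapter to a few surface measurements and compares the group averages. The three features are illustrative assumptions only; the article does not say which features any particular tool uses, and real software would examine many more.

```python
import re
from statistics import mean

def chapter_features(text):
    """Reduce a chapter to a few simple, countable properties."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("chapter contains no words")
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_length": mean(len(w) for w in words),
        "avg_sentence_length": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),  # vocabulary variety
    }

def compare_groups(study_texts, control_texts):
    """Difference in mean feature value: study group minus control group."""
    study = [chapter_features(t) for t in study_texts]
    control = [chapter_features(t) for t in control_texts]
    return {
        name: mean(f[name] for f in study) - mean(f[name] for f in control)
        for name in study[0]
    }

# Toy stand-ins for first chapters; real input would be full chapter text.
example = chapter_features("The cat sat. The dog ran!")
gaps = compare_groups(
    ["Rain fell. Nobody spoke.", "He left at dawn."],
    ["It was, all things considered, a perfectly ordinary and unremarkable day."],
)
```

A large gap on a feature suggests it may help separate the two groups, though with only a handful of manuscripts such gaps can easily arise by chance.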

Computational linguistics is the term used for the procedure I am referring to. If you look at my article titled The Conservative/Liberal Spectrum in Australia’s Federal Parliament you will see an example of this technique. That particular article looks at the maiden speeches of 485 Australian federal parliamentarians in order to determine how left/right ideology is reflected in the language of the cohort. Importantly, the algorithm is able to correctly classify 74% of the cohort as being members of the ALP (left wing) or the Liberal National Coalition (right wing) parties purely by looking at the use of language. I can assure you that the computer knows nothing about politics. However, it is very good at picking up on the different ways that left wingers, as opposed to right wingers, use language.

A sceptic might say that this cannot work for literature because of the subjectivity involved in literary taste. Well, the examples of successful application of this technology to the field of literature are legion. To see some recent examples, go to Google Scholar and enter “machine learning literature” or some similar combination of terms. For the purposes of this article I shall focus on one of my contributions to the field.

In ‘Ranking Contemporary American Poems’ (Dalvean 2015) I demonstrated a method for determining whether a poem was written by an established poet, defined as being represented in Contemporary American Poetry, edited by Poulin and Waters (2006), or by an amateur poet, defined as having a poem on the site www.amateurwriting.com. The algorithm is over 80% accurate: it can distinguish between a poem written by a ‘professional’ and one written by an ‘amateur’ in over 80% of cases. You can try out the algorithm yourself at www.poetryassessor.com.

The situation with successful and unsuccessful book manuscripts is analogous to the situation with professional vs amateur poems. As the Poetry Assessor is calibrated on professional and amateur poems I shall discuss the method of use in these terms. The only real difference between the Poetry Assessor and the book publisher’s algorithm is simply that the algorithm for the latter would be calibrated on the first chapters of unsuccessful and successful books.

Let us assume that you are a publisher of a general interest literary magazine and every week you receive 100 unsolicited poems for publication. You simply do not have the time or resources to critically consider each poem. Furthermore, you know from bitter experience that only 10 out of the 100 poems are likely to be of any merit and of these 10, only 1 will be outstanding. How can you find the time to go through 100 poems in the hope of finding one good one? My solution is as follows. You run the 100 poems through the Poetry Assessor. This algorithm is specifically designed to place poems on a spectrum that represents having the linguistic characteristics of an ‘amateur’ poem, at one end of the spectrum, and, at the other end of the spectrum, having the characteristics of a ‘professional’ poem. A score above zero shows that the poem shares more linguistic characteristics with a professional poem than it does with an amateur poem, while a score below zero shows the opposite.

The result of this procedure is that your 100 submissions are scored based on a comparison between amateur and professional poets. Let us say that you have time to critically examine only 10 poems. The scores enable you to pick the top ten scoring poems, which are the poems that most closely resemble the linguistic patterns of professional poets. Obviously, the first poem you should look at is the top scoring poem. If this is to your liking then problem solved. However, it is possible that it may not be what you have in mind. So you move to the next poem and so on down the list. You continue looking until you find a poem that you think has merit.
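The ranking step itself is simple enough to sketch in a few lines. The poem names and scores below are made up, and `shortlist` is a hypothetical helper rather than anything the Poetry Assessor provides:

```python
def shortlist(scores, k=10):
    """Return the k highest-scoring submissions, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical scores; above zero leans 'professional', below zero 'amateur'.
scores = {"poem_a": 1.4, "poem_b": -0.7, "poem_c": 0.3, "poem_d": 2.1}
top_two = shortlist(scores, k=2)

for name, value in top_two:
    print(f"{name}: {value:+.1f}")
```

The editor then reads down this list in order, stopping at the first poem that has merit.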

The likelihood is that, over time, you will be able to publish some high quality submissions that otherwise would not have seen the light of day, and you will have done so by expending only 10% of the time and effort it would have taken to critically assess all the submissions.

It is worthwhile keeping in mind that this algorithm is calibrated on contemporary poems so it does not work on any other type of text. Thus, it will not generate a valid score for songs or stories or random text. Certainly you will get a result from these types of input but it will not be a readily interpretable score.

Now, let us return to the book publisher’s problem. The solution is exactly the same as that for the magazine editor except that in the case of the Poetry Assessor the algorithm has been calibrated on amateur and professional poems while for the book publisher, the algorithm is calibrated on a selection of chapters from successful and unsuccessful manuscripts. The algorithm will have been set up so that it has ‘learnt’ the linguistic differences between those manuscripts that go on to be successful and those that do not. When you get a new manuscript you run it through the algorithm and, if it scores above zero it has more linguistic characteristics of successful manuscripts than of unsuccessful manuscripts and therefore the probability is that it can be made into a successful book. If it scores low then you say to the author “thank-you but no-thank-you”.
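To make the idea of a score centred on zero concrete, here is one simple way such a scorer could be built: a word-level log-odds comparison with add-one smoothing, in the spirit of a naive Bayes classifier. This is a stand-in chosen for brevity, not the model behind the Poetry Assessor, and the tiny training texts are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(successful, unsuccessful):
    """Learn a weight per word: positive if the word is relatively more
    common in successful chapters, negative if more common in unsuccessful."""
    pos = Counter(w for t in successful for w in tokenize(t))
    neg = Counter(w for t in unsuccessful for w in tokenize(t))
    vocab = set(pos) | set(neg)
    # Add-one smoothing so that unseen words never produce log(0).
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }

def score(weights, text):
    """Average weight of the known words; 0.0 if no word is recognised."""
    words = [w for w in tokenize(text) if w in weights]
    if not words:
        return 0.0
    return sum(weights[w] for w in words) / len(words)

# Invented miniature 'chapters'; a real model needs full first chapters.
weights = train(
    successful=["the storm broke over the harbour", "she ran toward the storm"],
    unsuccessful=["it was a nice day", "it was a very nice day indeed"],
)
new_score = score(weights, "the storm broke")
```

A new manuscript scoring above zero leans towards the successful group and one scoring below zero leans towards the unsuccessful group. With only a few training texts the scores are not meaningful, which is why the size of the study and control groups matters.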

There are a few issues that may emerge which complicate the above scenario but most problems are amenable to careful design. The main problem is not the technology or the methods but the resistance of people within organisations to using such methods. Some people get a bit indignant at the idea that a mere computer could possibly know more than an experienced literary editor. The correct response to this is to point out that the computer knows nothing. However, it is far better than any human at sifting through huge amounts of data.

In Michael Lewis’ (2004) Moneyball: the art of winning an unfair game, the old-fashioned baseball selectors were reluctant to use the data-driven methods of selecting baseball players, arguing that their subjective judgements could not be surpassed by an algorithm. Again, this was shown to be incorrect and now most professional baseball teams use some variation of the method discussed by Lewis in his book.

So there you have it. In less than 2500 words I have explained how to streamline your procedures. I should reiterate at this point that there is nothing absolute about this kind of procedure: it works on probabilities. The idea is that using a method like this will enable you to 1) sort through a very large volume of manuscripts very quickly so that you can find the ones that warrant further attention, 2) shift the ratio of loss-making to profit-making books towards the latter, and 3) potentially find a bestseller that otherwise might have languished in the slush pile.