In Part One I presented the simplest version of the algorithm, which I think is easy to understand, but slow. In Think Bayes I present some ways to optimize it. In Part Two I apply the algorithm to real data and generate predictive distributions. Now in Part Three, as promised, I publish the predictions the algorithm generates. In Part Four I will compare the predictions to actual data.

Background: Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (http://bbdata.yourwildlife.org).

Transparent science

In an effort to explore the limits of transparent science, I have started publishing my research in this blog as I go along. This past summer I wrote a series of articles exploring the relationship between Internet use and religious disaffiliation. This "publish as you go" model should help keep researchers honest. Among other things, it might mitigate publication bias due to the "file drawer effect." And if the data and code are published along with the results, that should help make experiments more reproducible.

Toward that end, I will now subject myself to public humiliation by generating a set of predictions using my almost-entirely-unvalidated solution to the Unseen Species problem. In the next installment I will publish the correct answers and score my predictions. Here are the details:

1) I am working with data from the Belly Button Biodiversity project; this data was used in a paper published in PLOS ONE and made available on the web pages of the journal and the researchers. The data consists of rDNA "reads" from 60 subjects. In order to facilitate comparisons between subjects, the researchers chose subjects with at least 400 reads, and for each subject they chose a random subset of 400 reads. The data for the other reads was not published.

2) For each subject, I know the results of the 400 selected reads, and the total number of reads. I will use my algorithm to generate a "prediction" for each subject, which is the number of additional species in the complete dataset.

3) Specifically, for each subject I will generate 9 posterior credible intervals (CIs) for the number of additional species: a 10% CI, a 20% CI, and so on up to a 90% CI.

4) To validate my predictions, I will count the number of CIs that contain the actual, correct value. Ideally, 10% of the correct values should fall in the 10% CIs, 20% should fall in the 20% CIs, and so on. Since the predictions and actual values are integers, a value that hits one end of a predicted CI counts as a half-hit.

Predictions

And now, without further ado, here are my predictions. The columns labelled 10, 20, etc. are 10% credible intervals, 20% CIs, and so on.

Code

# reads

# species

10

20

30

40

50

60

70

80

90

B1234

1392

48

(4, 4)

(4, 5)

(3, 5)

(3, 5)

(3, 6)

(2, 6)

(2, 7)

(1, 7)

(1, 9)

B1235

2452

69

(11, 12)

(10, 12)

(10, 13)

(9, 13)

(8, 14)

(8, 15)

(7, 16)

(6, 17)

(5, 19)

B1236

2964

45

(4, 5)

(4, 5)

(4, 6)

(4, 6)

(3, 7)

(3, 7)

(3, 8)

(2, 9)

(1, 10)

B1237

3090

62

(9, 10)

(9, 11)

(8, 11)

(8, 11)

(7, 12)

(7, 12)

(6, 13)

(5, 14)

(4, 16)

B1242

3056

61

(9, 9)

(8, 10)

(8, 10)

(7, 11)

(7, 11)

(6, 12)

(6, 14)

(5, 15)

(5, 16)

B1243

1518

71

(10, 11)

(10, 12)

(9, 12)

(8, 13)

(8, 13)

(8, 14)

(7, 15)

(6, 16)

(5, 17)

B1246

4230

91

(23, 24)

(22, 25)

(21, 26)

(21, 27)

(19, 28)

(18, 29)

(17, 30)

(16, 33)

(14, 35)

B1253

1928

86

(16, 17)

(15, 18)

(14, 18)

(14, 20)

(13, 20)

(13, 21)

(12, 23)

(11, 24)

(10, 26)

B1254

918

58

(5, 5)

(4, 6)

(4, 6)

(3, 6)

(3, 7)

(3, 7)

(2, 8)

(2, 9)

(1, 10)

B1258

1350

87

(15, 16)

(14, 17)

(14, 17)

(13, 18)

(12, 19)

(11, 19)

(11, 20)

(10, 21)

(8, 24)

B1259

1002

80

(10, 11)

(10, 12)

(10, 12)

(9, 13)

(9, 14)

(8, 14)

(7, 15)

(6, 16)

(6, 18)

B1260

1944

96

(22, 23)

(21, 24)

(20, 25)

(19, 25)

(19, 26)

(18, 27)

(17, 29)

(15, 30)

(14, 32)

B1264

1122

83

(12, 13)

(12, 14)

(11, 14)

(10, 15)

(10, 15)

(9, 16)

(8, 17)

(7, 18)

(6, 20)

B1265

2928

85

(18, 19)

(17, 20)

(16, 21)

(16, 22)

(15, 23)

(14, 24)

(13, 25)

(12, 26)

(11, 28)

B1273

2980

61

(9, 9)

(8, 10)

(8, 10)

(7, 11)

(7, 12)

(6, 12)

(6, 13)

(5, 14)

(4, 16)

B1275

1672

85

(16, 17)

(15, 18)

(15, 19)

(14, 19)

(13, 20)

(13, 21)

(12, 22)

(11, 24)

(9, 25)

B1278

1242

47

(4, 4)

(3, 4)

(3, 5)

(3, 5)

(2, 6)

(2, 6)

(2, 6)

(2, 7)

(1, 8)

B1280

1772

46

(4, 4)

(4, 5)

(3, 5)

(3, 5)

(3, 6)

(2, 6)

(2, 7)

(2, 8)

(1, 9)

B1282

1132

67

(8, 9)

(7, 9)

(7, 10)

(6, 10)

(6, 11)

(6, 11)

(5, 12)

(5, 13)

(3, 15)

B1283

1414

67

(8, 9)

(8, 10)

(7, 10)

(7, 11)

(7, 11)

(6, 12)

(5, 13)

(4, 14)

(3, 16)

B1284

1158

91

(15, 16)

(14, 17)

(14, 17)

(13, 18)

(13, 19)

(12, 19)

(12, 20)

(10, 21)

(9, 23)

B1285

2340

55

(7, 7)

(6, 8)

(6, 8)

(5, 8)

(5, 9)

(4, 9)

(4, 10)

(3, 12)

(2, 13)

B1286

1728

66

(9, 10)

(9, 11)

(8, 11)

(8, 12)

(8, 12)

(7, 13)

(6, 14)

(6, 14)

(4, 16)

B1288

1280

107

(23, 24)

(22, 25)

(21, 25)

(21, 26)

(20, 27)

(19, 27)

(18, 29)

(17, 31)

(15, 32)

B1289

2054

103

(26, 27)

(25, 28)

(24, 29)

(23, 30)

(23, 30)

(22, 32)

(21, 33)

(20, 34)

(17, 36)

B1291

1248

94

(17, 18)

(16, 19)

(16, 20)

(15, 20)

(15, 21)

(13, 22)

(13, 23)

(12, 25)

(10, 27)

B1292

1864

82

(15, 16)

(14, 16)

(13, 17)

(13, 18)

(13, 19)

(12, 20)

(11, 21)

(10, 22)

(9, 24)

B1293

1904

76

(13, 14)

(12, 14)

(12, 15)

(11, 16)

(11, 16)

(10, 17)

(9, 18)

(8, 19)

(7, 22)

B1294

1784

78

(14, 15)

(13, 16)

(12, 16)

(12, 17)

(11, 18)

(11, 19)

(10, 19)

(9, 20)

(8, 23)

B1295

1408

70

(10, 10)

(9, 11)

(9, 12)

(8, 12)

(8, 12)

(7, 13)

(7, 14)

(6, 15)

(4, 17)

B1296

2034

55

(7, 7)

(6, 8)

(6, 8)

(6, 8)

(5, 9)

(4, 9)

(4, 10)

(4, 11)

(3, 12)

B1298

1478

72

(10, 11)

(9, 12)

(9, 12)

(9, 13)

(8, 13)

(8, 14)

(7, 15)

(6, 16)

(5, 18)

B1308

1160

58

(6, 6)

(5, 7)

(5, 7)

(5, 7)

(4, 8)

(4, 8)

(3, 9)

(3, 10)

(2, 11)

B1310

1066

80

(11, 12)

(11, 13)

(10, 13)

(9, 14)

(9, 15)

(8, 15)

(7, 16)

(7, 17)

(5, 19)

B1374

2364

48

(5, 5)

(4, 6)

(4, 6)

(4, 6)

(3, 7)

(3, 7)

(3, 8)

(2, 9)

(2, 10)

B940

2874

93

(22, 24)

(21, 25)

(21, 25)

(20, 26)

(19, 27)

(19, 28)

(18, 30)

(16, 32)

(14, 33)

B941

2154

48

(5, 6)

(5, 6)

(4, 6)

(4, 7)

(4, 7)

(3, 7)

(3, 8)

(2, 9)

(2, 11)

B944

954

52

(4, 4)

(4, 5)

(3, 5)

(3, 5)

(3, 6)

(2, 6)

(2, 6)

(2, 7)

(1, 9)

B945

2390

67

(10, 11)

(10, 12)

(9, 12)

(9, 13)

(8, 13)

(8, 14)

(7, 15)

(7, 16)

(5, 17)

B946

5012

85

(20, 21)

(19, 22)

(19, 23)

(18, 24)

(18, 24)

(17, 26)

(16, 27)

(15, 28)

(12, 31)

B947

3356

62

(10, 11)

(9, 11)

(9, 12)

(8, 12)

(7, 13)

(7, 14)

(6, 14)

(5, 15)

(5, 17)

B948

2384

80

(16, 17)

(15, 18)

(14, 18)

(14, 19)

(13, 20)

(12, 21)

(11, 22)

(10, 23)

(9, 25)

B950

1560

63

(8, 9)

(8, 10)

(8, 10)

(7, 10)

(7, 11)

(6, 11)

(5, 12)

(5, 13)

(4, 15)

B952

1648

57

(7, 7)

(6, 8)

(6, 8)

(6, 8)

(5, 9)

(5, 9)

(4, 10)

(3, 11)

(3, 12)

B953

1452

32

(2, 2)

(1, 2)

(1, 2)

(1, 3)

(1, 3)

(1, 3)

(0, 3)

(0, 4)

(0, 5)

B954

1996

29

(2, 2)

(1, 2)

(1, 2)

(1, 2)

(1, 3)

(1, 3)

(0, 3)

(0, 4)

(0, 4)

B955

1474

65

(8, 9)

(8, 9)

(7, 9)

(7, 10)

(7, 10)

(6, 11)

(5, 12)

(5, 13)

(4, 14)

B956

1482

71

(10, 11)

(10, 12)

(10, 12)

(9, 13)

(8, 14)

(8, 14)

(7, 15)

(6, 16)

(5, 18)

B957

2604

36

(3, 3)

(3, 3)

(2, 4)

(2, 4)

(2, 5)

(1, 5)

(1, 6)

(1, 6)

(1, 7)

B958

2840

29

(2, 2)

(2, 2)

(1, 2)

(1, 3)

(1, 3)

(1, 3)

(1, 4)

(0, 4)

(0, 5)

B961

1214

36

(2, 3)

(2, 3)

(2, 3)

(2, 4)

(1, 4)

(1, 4)

(1, 5)

(1, 5)

(0, 6)

B962

1138

41

(3, 3)

(2, 3)

(2, 3)

(2, 4)

(2, 4)

(1, 4)

(1, 5)

(1, 6)

(0, 7)

B963

1600

71

(10, 12)

(10, 12)

(9, 12)

(9, 13)

(8, 14)

(8, 14)

(7, 15)

(5, 16)

(4, 19)

B966

1950

80

(15, 16)

(14, 16)

(14, 17)

(13, 17)

(12, 18)

(11, 18)

(11, 20)

(10, 22)

(9, 23)

B967

1108

47

(3, 4)

(3, 4)

(3, 4)

(2, 5)

(2, 5)

(2, 5)

(2, 6)

(1, 7)

(1, 7)

B968

2432

52

(6, 7)

(6, 7)

(6, 7)

(5, 8)

(5, 8)

(4, 9)

(3, 9)

(3, 10)

(2, 11)

B971

1462

49

(4, 5)

(4, 5)

(4, 5)

(3, 6)

(3, 6)

(3, 7)

(2, 7)

(2, 8)

(1, 9)

B972

1438

75

(11, 12)

(11, 13)

(10, 13)

(10, 14)

(9, 14)

(8, 15)

(8, 16)

(7, 17)

(6, 18)

B974

5072

54

(7, 8)

(7, 9)

(6, 9)

(6, 9)

(5, 10)

(5, 10)

(4, 11)

(4, 12)

(2, 14)

B975

1542

63

(7, 8)

(7, 9)

(6, 9)

(6, 9)

(6, 10)

(5, 11)

(5, 11)

(4, 12)

(3, 14)

Reading the last row, subject B975 yielded 1542 reads; in a random subset of 400 reads, there were 63 different species (or more precisely, OTUs). My algorithm predicts that if we look at all 1542 reads, the number of additional species we'll find is between 3 and 14, with 90% confidence.

I have to say that this table fills me with dread. The intervals seem quite small, which is to say that the algorithm is more confident than I am. The 90% CIs seem especially narrow to me; it is hard for me to believe that 90% of them will contain the correct values. Well, I guess that's why Karl Popper called them "bold hypotheses". We'll find out soon whether they are bold, or just reckless.

I want to thank Rob Dunn at BBB2 for his help with this project. The code and data I used to generate these results are available from this SVN repository.

EDIT 2-22-13: I ran the predictions again with more simulations. The results are not substantially different. I still haven't looked at the answers.