Introduction

When I get a chance to do a CRAN update (a major project, as I have to thoroughly revise the documentation etc.), it will be available there as well. For now, you should source() the full collection of update R files in the example script (which you should do anyway, after each time you invoke library(BioGeoBEARS)).

Citation of Biogeographical Stochastic Mapping

Note 2016: The best paper to cite regarding Biogeographical Stochastic Mapping is Dupin et al. (2016, //Journal of Biogeography//), where we first describe the basics of the method, and its utility. Eventually there may be a more specialist application note on the method, but this is behind several other projects.

Clarifications about Biogeographic Stochastic Mapping (BSM)

Because of the programs SIMMAP and phytools, some people think of stochastic mapping as an alternate model to ancestral range inference, or they think that Biogeographic Stochastic Mapping is the same thing as what SIMMAP and phytools are doing. In reality:

I think it is useful to distinguish models, methods, and programs. The model consists of the transition matrices, rates, etc. SIMMAP and phytools are (typically, at least) working with unordered characters (they may do other things also). If used for geography, this assumes (a) nothing can ever live in more than one area, (b) the only kind of dispersal is "range-switching" (e.g., sudden jump from A->B along a branch), (c) nothing happens at cladogenesis (or, rather, the range is copied exactly to both descendants: A->A,A).

This (or any other) model can then be used to (a) calculate the log-likelihood of the data (for ML or Bayesian parameter inference); (b) calculate the probability of ancestral states under the model (ancestral area reconstruction, or, better, ancestral range estimation); (c) stochastic mapping (for event counts, residence times, etc.) These are different methods that make use of one or more models.

The handy thing about Biogeographic Stochastic Mapping (BSM) in BioGeoBEARS is that it is programmed to do the stochastic mapping method (and ML ancestral states inference methods) on any of the programmed biogeographical models.

All of the programs can do the simpler models, but only BioGeoBEARS can do the complex biogeographic models. If you do want to run a simple standard character in BioGeoBEARS, this is totally possible. I call it the "BAYAREALIKE+a-d-e" model, with max_range_size=1 (I use this to do validation of BioGeoBEARS here: BioGeoBEARS Validation). So, BioGeoBEARS, SIMMAP, and phytools are three programs that can all run the stochastic mapping method on the standard unordered character model.

Biggest caveats

The biggest caveats with Biogeographic Stochastic Mapping are those common to all available biogeographic models, namely:

Even the best model might be wrong. In fact, it almost certainly is. "All models are wrong, but some models are useful." The event counts represent estimates of the number of events conditional on the model, the estimated model parameters, the phylogeny, and the data.

The biggest worrisome assumption of all of these biogeographic models is that they treat the observed tree as the true tree, and thus they leave out extinction (and unsampled species, if those exist). This is a problem with almost all phylogenetic models (SSE models attempt to model extinction, but one can argue if they do it well), but might be worse for biogeographic models, since cladogenetic range-change processes are being postulated, so missing/extinct species mean missing cladogenetic events in the tree. The main thing missing-extinction-events means for event counts is that event counts are probably underestimates. Mostly what we are hoping is that we are at least getting the minimum number of events, and that these counts will correlate with the true events.

If you have hugely biased extinction or sampling, then of course this will drastically effect results. This is common to all phylogenetic methods (a phylogeny of only large mammals would estimate that the common ancestor of mammals was large).

Note: Only *single*-letter codes for areas!

Note: The stochastic mapping code assumes that all of the areas have single-letter codes — without that assumption, there is no efficient way to store/list ranges of multiple areas. So: *always* use single-letter codes for areas, if you ever plan to do stochastic mapping.

Example Biogeographical Stochastic Mapping (BSM) graphics

Most probable states/ranges (time-stratified DEC and DEC+J)

Figure Caption: Plot of the single-most probable ancestral state/range for a time-stratified DEC model on Hawaiian Psychotria. Note that the single-most probable range at a particular node could still have very low probability; pie charts (below) should be consulted to get a sense of this.

Figure Caption: Plot of the single-most probable ancestral state/range for a time-stratified DEC+J model on Hawaiian Psychotria. Note that the single-most probable range at a particular node could still have very low probability; pie charts (below) should be consulted to get a sense of this.

Single stochastic map plot (time-stratified DEC and DEC+J)

Figure Caption: Plot of a single Biogeographical Stochastic Map (BSM) for a time-stratified DEC model on Hawaiian Psychotria. Note that a single BSM is just a single realization of a possible history. A sample of BSMs should be examined to determine the variability in possible histories.

Figure Caption: Plot of a single Biogeographical Stochastic Map (BSM) for a time-stratified DEC+J model on Hawaiian Psychotria. Note that a single BSM is just a single realization of a possible history. A sample of BSMs should be examined to determine the variability in possible histories.

Figure Caption: Histograms of the counts of different kinds of events found in each of the 50 BSMs. The x-axis gives the number of events, the y-axis gives the number of BSMs in which a specific number of events was observed. Model is a time-stratified DEC model on Hawaiian Psychotria.

Figure Caption: Histograms of the counts of different kinds of events found in each of the 50 BSMs. The x-axis gives the number of events, the y-axis gives the number of BSMs in which a specific number of events was observed. Model is a time-stratified DEC+J model on Hawaiian Psychotria.

Validation of BSMs against ML ancestral state probabilities (time-stratified DEC and DEC+J)

The script below can be run, verbatim, after the standard BioGeoBEARS example script that runs DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, and BAYAREALIKE+J. This script only runs Biogeographical Stochastic Mapping on the DEC model, but it could be easily modified to run it under the other models.

NOTE: THIS IS NOT A COMPLETE SCRIPT BY ITSELF. IT ASSUMES YOU HAVE JUST RUN THE standard BioGeoBEARS example script, which covers the source() commands, the working directories, etc.

I am including this run-BSM-after-the-example-script, because many users run the basic example script, then try running the time-stratified BSM example script, which won't work without a few minor modifications (to change the time-stratified BSM script into the non-time-stratified BSM script, the modifications are basically: change the working directory, skip the ML runs if you have already done those, then set model_name, set res=resDEC or whatever model you have saved, and set stratified=FALSE).

This script should "run out of the box", i.e. if you copy/paste into your R window. (On some machines, huge copy-pastes introduce errors, in that case, you have to copy-paste in moderate-sized chunks.) You could also save the code to a .R file and then run from the Terminal (from outside of R) with Rscript scriptfilename.R.

Notes:

This is a time-stratified DEC+J analysis. To run other models, or a non-time-stratified model you will have to edit the model setup; see the setups of 6 basic models in the main BioGeoBEARS example script.

As usual, you will have to change the script to specify your own working directory.

The input files (in this case, for a time-stratified analysis of Psychotria, with Hawaiian islands disappearing back in time) should auto-load from the BioGeoBEARSextdata directories specified. Change these directory locations/file names to use your own files.

Also, if you switch to a non-stratified analysis, be sure to change stratified=FALSE to stratified=TRUE.

Interpreting the outputs of Biogeographic Stochastic Mapping

Interpreting the output of Biogeographic Stochastic Mapping. I had thought that the BSM script was fairly self-explanatory, but based on the questions I have gotten, it seems that many users are overwhelmed by the amount of stuff that prints to the screen. Here is the key: the important stuff for the user is at the very end of the script.

Here is the screen output that you get if you run the above script through to completion. Notice that the key function, doing all of the summary work for you, is count_ana_clado_events():

A series of dispersal count matrices are printed to screen. In the dispersal count matrices, the row headings represent "from" (source areas), and the column headings represent "to" (destination areas).

The mean is the average number of dispersal events of each type, across 50 stochastic maps (or however many BSMs you ran). The second matrix is the standard deviation of these counts. Multiply the standard deviation by +/-1.96 to get the theoretical 95% confidence interval (if the histograms are normal distributions; if non-normal, doing a Bayesian credibility interval, using percentiles on the actual list of counts, would be better).

The first pair of matrices is for range-switching dispersal. Range-switching dispersal is controlled by the parameter 'a', which is fixed to 0 in DEC and related models, so all of these counts are zero.

The second pair of matrices is for for range-expansion dispersal. Range-switching dispersal is controlled by the parameter 'd', which is freely estimated in DEC and related models. If the d parameter is 0, these counts are zero, but otherwise there should be a few events. (In this particular DEC+J model on the time-stratified Hawaiian islands, there are only a few 'd' events, as founder-event dispersal dominates.)

The third pair of matrices adds up both types of anagenetic dispersal. Since 'a' was zero, this equals the 'd' matrices.

The fourth pair of matrices is for for founder-event/jump dispersal (which is cladogenetic). Founder-event/jump dispersal is controlled by the parameter 'j', which is freely estimated in DEC+J and related models. If the j parameter is 0, these counts are zero, but otherwise there should be a some j events. (In this particular DEC+J model on the time-stratified Hawaiian islands, j is quite a strong explainer of the data, as founder-event dispersal dominates.)

The fifth pair of matrices adds up ALL types of dispersal, both anagenetic and cladogenetic.

The last item printed to screen is "Summary of event counts from 50 BSMs". This contains the means and standard deviations of all event types. Readers will note that column "all_clado" (the sum of all cladogenetic events) has a mean of 18 events, with a standard deviation of 0. This is because there are 18 nodes in the Psychotria phylogeny, thus there are always exactly 18 cladogenetic events in every single stochastic map. This means, of course, that sympatric range-copying (e.g. K->K,K) is being counted as an event.

The row "sums" adds up all of the events across the 50 stochastic maps. Dividing the sums by 50 produces the row "means".

Accessing all of the summary tables

Accessing all of the summary tables. All of these tables, as well as the raw counts per stochastic map, are stored in the object counts_list. You can see all of these items with names(counts_list). You can access individual elements with e.g. counts_list$founder_totals_list.

Accessing the tables storing all of the actual stochastic maps

Accessing the tables storing all of the actual stochastic maps. If you run the script all of the way through to the end, the actual stochastic maps (exact event histories) are saved in these two lists of tables: clado_events_tables, and ana_events_tables.

The object clado_events_tables is a list of 50 tables, containing the cladogenetic range-inheritance events at each nodes. The object ana_events_tables is a list of 50 tables, containing the anagenetic events, on branches where anagenetic events occurred.

(Note that it is perfectly possible to have a stochastic map where no anagenetic events occur, particularly in smaller datasets and those with a strong j weight parameter.)

Readings lists and tables in R

Accessing these tables requires some basic R skillz. I cannot teach these to everyone individually. If you are totally at sea with R, you should probably learn basic R before attempting to manipulate tables.

But, just you get you started, this is how you would access the third table in the list of 50 cladogenetic events tables: clado_events_tables[[3]]. This could be a huge table, so it could be useful to just look at the top or bottom of the table, with head() or tail().

The column headings that come from prt() and chainsaw2()

#######################################################
# Reading BSM output
#######################################################
# The key outputs are:
#
# 1. The anagenetic events table:
head(ana_events_tables[[1]])
# 2. The cladogenetic events table:
tail(clado_events_tables[[1]][,-20])
# The first 9 columns of these tables are information about the
# nodes and branch of the tree. These columns are output
# by the BioGeoBEARS function prt() ("print tree to table").
# See here for a mini-tutorial about node numbers and branch
# numbers in APE/R, and about prt():
# http://phylo.wikidot.com/example-biogeobears-scripts#node_numbering
# MEANINGS OF PRT() columns:
# (Note: Like biologists/paleontologists, we are assuming tips are the top
# of the tree, and the root is at the bottom. Computer scientists often
# do it backwards.)
#
# node - The node number used in the standard APE phylo object
# node.type - Is the node "tip", "internal", or "root"?
# parent_br - What is the branch number for the branch below the node?
# (parent_br gives you the row of tr$edge and tr$edge.length)
# edge.length - The branch length
# ancestor - The node number at the bottom of the branch below the node
# daughter_nds - The node numbers of the daughters above the node
# time_bp - The time-before-present of the node (where the highest tip is age 0)
# fossils - If a tip, is it a fossil tip? (as determined by the
# fossils_older_than option of prt()
# label - The node name. Equals the tip species if a tip node, or "inNode"+
# node number, if internal.
# tipnames - (optional output) If get_tipnames=TRUE in prt(), this column
# contains the alphabetical, comma-delimited list of all tip species
# that descend from that node. This is useful as a unique
# identifier of a clade or bipartition across trees, which can be a
# huge problem in APE/R (which can relabel node numbers after any
# change, rotation, etc.)
# COLUMNS DEALING WITH TIME-STRATIFICATION
# Background: A time-stratified tree, in R, is a nightmare. Imagine taking a
# phylogeny, then taking a chainsaw to it, cutting across it
# at several different timepoints. Now you have a pile of "tree pieces";
# some of these will be subtrees (almost always with branches below the
# subtree root node), and some will just be segments (or "1-branch trees",
# if you like).
#
# Keeping track of all of this in R pretty much sucks, as the code
# has to pass all of the likelihoods up and down, keep track of which
# branches connect to what, have special code for single-branch segments,
# more special code for fossil branches that don't come up to 0, etc.
# This would all be much easier with an object-oriented approach, but,
# well, I'm stuck with this unless/until I reprogram everything from
# scratch in a new language. (Yes, I should have listened to my advisor
# and done everything in C++, but then we wouldn't have all the nice
# graphics, and I might still be fighting with compilers on some Windows32
# machine instead of doing science.)
#
# Anyhow, the tree-pieces are generated by the BioGeoBEARS function chainsaw2().
# This generates a list of tree-pieces. To keep track of all of this I have a
# big table, which includes the information from prt(), and then has
# similar information for each tree-piece.
# These columns track which stratum a tree piece is in, the beginning and ending of
# that stratum, the relative time point of the node within that stratum, and
# number and type of the tree-piece in the tree-pieces list produced by chainsaw2():
#
# stratum
# time_top
# time_bot
# reltimept
# piecenum
# piececlass
#
#
# The "SUB" columns contain the prt() output, but for the subtree pieces:
# SUBnode
# SUBnode.type
# SUBparent_br
# SUBedge.length
# SUBancestor
# SUBdaughter_nds
# SUBtime_bp
# SUBfossils
# SUBlabel
#

The meanings of the column headings of clado_events_tables

# COLUMNS RECORDING CLADOGENETIC RANGE-INHERITANCE EVENTS
# clado_event_type - What type of event at cladogenesis? (and letter
# for the parameter that BioGeoBEARS uses to control
# the process; see Figure 1 of Matzke (2013, Frontiers
# in Biogeography).
# sympatry (y)
# vicariance (v)
# subset sympatry (s)
# founder-event/jump dispersal (j)
# clado_event_txt - What specific event happened? E.g.:
# K->K,K would be a narrow sympatry event
# ABC->ABC,ABC would be a broad sympatry event
# ABC->A,BC would a vicariance event, etc.
# (NOTE: This is one reason SINGLE-LETTER CODES
# are REQUIRED by BSM)
# clado_dispersal_to - *IF* it's a dispersal event, dispersal to where?
# sampled_states_AT_nodes - What's the state number of the range at the node?
# (1-based index: typically,
# 1=null range
# 2=A
# 3=B
# 4=AB ...for a 2-area system)
# sampled_states_AT_brbots - State number at the bottom of the branch below the node
# left_desc_nodes - State number at top of the left descendant branch
# right_desc_nodes - State number at top of the right descendant branch
# samp_LEFT_dcorner - State number just after speciation of the node
# (bottom of the left descendant branch)
# samp_RIGHT_dcorner - State number just after speciation of the node
# (bottom of the right descendant branch)
# anagenetic_events_txt_below_node - For the branch below the node, records the
# events along the branch, in a plain-text, semicolon-
# delimited format. Use the function
# events_txt_list_into_events_table() to convert
# these to a table of events (as in the object
# ana_events_tables)
#

The meanings of the column headings of ana_events_tables

# COLUMNS RECORDING ANAGENETIC RANGE-INHERITANCE EVENTS
#
# NOTE CAREFULLY: An ana_events_table include cladogenetic information, because it is
# copied from the clado_events_table when clado_events_table$anagenetic_events_txt_below_node
# is converted into an ana_events_table. This helps us to keep track of the states at
# the top and bottom of each branch. However, this information will be COPIED to
# every anagenetic event. So, if there are several events on branch, the same cladogenetic
# information will be copied to every row of the anagenetic events table. SHORT VERSION:
# DON'T COUNT CLADOGENETIC EVENTS FROM THE ANAGENETIC EVENTS TABLE; THAT IS WHAT THE
# CLADOGENETIC EVENTS TABLE IS FOR!
#
# COLUMNS SPECIFIC TO AN ana_events_table (the others are as above)
#
# trynum - The number of tries that occurred before a successful
# simulation of this branch history.
# (If trynum is greater than the user-set maxtries, then
# the history was created with the so-called "manual"
# option that finds the minimum events and puts them
# in random order; see the BioGeoBEARS Google Group.)
# brlen - The branch length (or length of the segment, when time-
# stratified).
# current_rangenum_1based - The 1-based index of the range (the state) before the event.
# new_rangenum_1based - The 1-based index of the range (the state) after the event.
# current_rangetxt - The text version of the range before the event.
# new_rangetxt - The text version of the range after the event.
# abs_event_time - The absolute time of the event (time before the highest
# tip in the overall tree, which marks time=0 my before
# present.
# event_time - Relative event time (within the time stratum).
# event_type - Type of event (currently allowed anagenetic events are
# d, e, and sometimes a).
# event_txt - Event text, eg. A->AB or ABC->BC.
# new_area_num_1based - The 1-based index of the new area (note: area indexes are
# different than range indexes!) for a d or a event.
# lost_area_num_1based - The 1-based index for a lost area for an e event.
# ana_dispersal_from - The source area (important when there are multiple
# possible areas, e.g. ABC -> ABCF. The source area is
# probabilistically chosen according to the dispersal
# constraints etc. used in the model. The function doing this
# is simulate_source_areas_ana_clado(), run near the end of
# the example script.
# dispersal_to - The area being dispersed to (text version).
# extirpation_from - The area being lost (text version).
#

On Macs at least, ImageMagick's "convert" function will require that you have ghostscript (gs) installed (http://ghostscript.com/download/gsdnld.html). If you use MacPorts to install ImageMagick, ghostscript should come along with that.

For example, in Terminal, navigate to the directory with your PDF, then: