Toward an Integrated BICCN Data Set (IDS)

Data, Tools, and Knowledge

A central organizing principle of the BCDC is the data, tools, and knowledge structure. This three-tiered approach provides a natural way to organize the information architecture and the flow of data, from raw numerical observations, through mapping standards and analysis methods, to summary knowledge and education. A federated architecture based on a public-facing Cell Registry and Portal indexes all data, provides lower resolution data summaries, and holds pointers to the raw data sources, which allows for a flexible and more easily maintainable database. At the Data level within the registry, the BCDC is populated with QC/QA data that meet standards of quantification and mapping. Robust and consistent data models across providers, driven by the Data Standards Committee, ensure that data is accessible within the network and ready for effective mapping in the CCF at the Tools level, where it can be processed with various analysis and clustering tools or visualized through applications. At the highest level, Knowledge, the Collaborative Analysis Working Group will facilitate analysis between BICCN partners to reconcile and highlight corroborated results, and will provide input into which results should be presented or referenced.

The BCDC Portal is the main external-facing entry point to the BICCN, its members, data modalities, resources, and results. The portal is intended to be a living resource that can be frequently updated with news and relevant cell type related content. It provides a description of the BICCN Consortium and its goals, an explanation of cell type classification and its value for neuroscience, individual project pages, the BCDC strategy and timeline, and current data availability. Some initial possibilities for content are described below.

Data Levels

A convenient way to track data from primary raw data through to information-rich structured data is through data levels. Each data modality produced by the BICCN starts as Level 0 raw data and migrates through higher levels: data QC/QA (Level 1), spatial localization within the Common Coordinate Framework (CCF) (Level 2), computational feature extraction (Level 3), and finally integrated or knowledge-level content (Level 4). The general definition of each data level is given below, and level definitions specific to each U01/U19 will be added as they become available.

Level 0: (Raw) Raw data, including mapped sequence reads and raw non- or minimally compressed image series. Multivariate data sets are generated where applicable. These data are center specific and are not meant to be distributed publicly or to the BCDC.

Level 1: (QA/QC) Preprocessed data sets that have passed an initial round of QA/QC or computation. These data are intended for sharing within the BICCN network and for storage in the R24 archive centers, and they use the appropriate format and compression level for derived computation and analysis.

Level 2: (Linked) These data are mapped to a spatially determined region or stereotaxic site. Ultimately these data are accessible and displayable within the Common Coordinate Framework (CCF).

Level 3: (Featured) A derived data set summarizing relevant features and properties for a modality of interest. This level enables shared and novel algorithmic approaches across or within centers. Examples include differential expression, clustering analyses, and cross-modality analysis.

Level 4: (Integrated) Informative data sets that meet standards for public deployment and have value for the BICCN network and the larger community in advancing our understanding of cell types in the brain. Appropriate results will be released on the BICCN portal.
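The level scheme above can be sketched as a small ordered type. This is only an illustrative sketch: the names DataLevel, leaves_center, and next_level are hypothetical and are not part of any BCDC software. It assumes, from the definitions above, that Level 0 stays at the producing center and that Level 1 and higher is shared.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """BICCN data levels as described in the text; member names are illustrative."""
    RAW = 0         # center-specific raw data, not distributed
    QA_QC = 1       # preprocessed, passed an initial round of QA/QC
    LINKED = 2      # mapped to a spatial region or stereotaxic site
    FEATURED = 3    # derived features and properties for a modality
    INTEGRATED = 4  # meets standards for public deployment


def leaves_center(level: DataLevel) -> bool:
    """Assumption from the text: only Level 1 and above is shared beyond the center."""
    return level >= DataLevel.QA_QC


def next_level(level: DataLevel) -> DataLevel:
    """Return the following level, or the same level if already integrated."""
    return DataLevel(min(level + 1, DataLevel.INTEGRATED))
```

Using an ordered enumeration keeps comparisons such as "Level 1 or higher" explicit, which is how the levels are used in the rest of this page.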

The BCDC Cell Registry and BICCN Integrated Data Set

As data moves through the levels, from Level 0 raw data to Level 4 integrated data, it meets increasingly rigorous and systematic standards for mapping and presentation. All experimental cell data meeting Level 1 quality assurance is archived and will be made available through an indexing and search engine, the BCDC Cell Registry. This application, presently under development, is accessible from the portal and will provide a detailed survey of all BICCN data available at the time. As data is spatially resolved in location or ontology, it can be presented within the CCF (see Standards, BICCN Standards for CCF). When common sets of features and standard data summaries are computed at higher levels, the corresponding higher resolution data becomes available in the IDS viewer, which can use Allen Institute tools and products.

The BCDC Cell Registry (presently under development) provides an index of all Level 1 data (or Level 0 for transcriptomic data), pointers to raw data access, and an indication of data modality, provenance, and associated metadata. The Cell Registry does not offer any explicit analysis or computational resources, only search based on the associated metadata. Raw Level 0/1 data of the BICCN is intended to be stored in the R24 archives: at the University of Maryland (NeMO) for transcriptomic/epigenomic data, and at the Pittsburgh Supercomputing Center (MBIC). From a practical perspective, the BCDC will prefer to be aware only of data that has been deposited with the R24 archives. As we have argued, a global repository for all high-resolution data across all modalities, stored at a single facility, cloud based or otherwise, is at present not feasible.
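To make the registry's role concrete, here is a minimal sketch of an index entry and a metadata-only search. The Cell Registry is still under development, so everything here is an assumption made for illustration: the RegistryEntry fields, the search function, the metadata keys, and the placeholder pointer strings do not describe the actual registry schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegistryEntry:
    """Hypothetical index record: a pointer plus descriptive metadata, no raw data."""
    entry_id: str
    modality: str            # e.g. "transcriptomic"
    level: int               # data level of the indexed data set
    archive: str             # R24 archive holding the data
    raw_data_pointer: str    # placeholder location inside the archive
    provenance: str          # producing project or center
    metadata: Dict[str, str] = field(default_factory=dict)


def search(entries: List[RegistryEntry], **criteria: str) -> List[RegistryEntry]:
    """Return entries whose metadata matches every key=value criterion."""
    return [
        e for e in entries
        if all(e.metadata.get(key) == value for key, value in criteria.items())
    ]


# Made-up example entries; identifiers and pointers are placeholders.
entries = [
    RegistryEntry("cell-0001", "transcriptomic", 1, "NeMO",
                  "placeholder://nemo/cell-0001", "example-U19",
                  {"species": "mouse", "region": "cortex"}),
    RegistryEntry("cell-0002", "transcriptomic", 1, "NeMO",
                  "placeholder://nemo/cell-0002", "example-U01",
                  {"species": "mouse", "region": "hippocampus"}),
]

cortex_hits = search(entries, species="mouse", region="cortex")
```

The sketch reflects the federated design described above: the index carries only summaries and pointers, while the data itself stays in the archive.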

The BICCN Integrated Data Set (IDS) is the full realization of Level 3 mapped and feature-rich data that can be integrated with Allen Institute internal data. It comprises extracted quantitative features of each cell, together with access to the raw data, tools to manipulate and analyze the data, and a description of its relationship to the current state of our ontology. Data achieving Level 3 will be both listed in the Cell Registry and viewable in any of the Allen Institute online resources. This proposed architecture is illustrated below.

All properly quality controlled and archivally deposited Level 1 data will be indexed in the BCDC Cell Registry. To enable cross comparison and study, data must reach Level 3, such that it is anatomically localized and described by a compact feature attribute list. Data achieving this level of structure is potentially comparable with Allen Institute internally generated modalities and forms part of a larger Integrated Data Set (IDS).
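The routing rule in the paragraph above can be written as a short function. This is a sketch under stated assumptions: the function name destinations and the returned labels are hypothetical, and it assumes that deposit in an R24 archive is a precondition for both destinations, which the text states for the Cell Registry and only implies for the IDS.

```python
from typing import List


def destinations(level: int, archived: bool) -> List[str]:
    """Where a data set appears, given its data level and archive status.

    Assumptions: archived data at Level 1 or higher is indexed in the
    Cell Registry; archived data at Level 3 or higher also joins the IDS.
    """
    found = []
    if archived and level >= 1:
        found.append("Cell Registry")
    if archived and level >= 3:
        found.append("Integrated Data Set")
    return found


# A Level 2 data set is indexed but not yet part of the IDS.
level2 = destinations(2, archived=True)
# A Level 3 data set appears in both.
level3 = destinations(3, archived=True)
```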

For more information about the BICCN Integrated Data Set and upcoming plans, please contact info@biccn.org.