In previous posts in our data mining series, we laid out our initial technical framework for guiding data mining projects, then supplemented that with plug-ins to facilitate its use for R&I policy research specifically. These plug-ins helped to overcome the challenge of applying a generic framework to a specific thematic area. However, there was another major challenge that we identified in using data mining for policy research, and this difficulty prompted another revision: the reorientation of the framework to include a scoping phase. This week’s post explores that challenge and how the reorientation helps to solve it. Take a look!

As noted in earlier posts, the feedback loops in our initial framework provide mechanisms for projects to be iterative, offering users a way back to earlier steps when (for instance) limitations to available data force the team to select a new set of indicators. There are several feedback loops, and we employed them frequently in our case studies in an attempt to overcome various challenges. However, while the feedback loops technically provide the resources necessary to address these challenges, the need to employ these loops was so frequent and so acute that we came to revise our initial decision to portray the process as primarily linear.

Accordingly, we restructured the initial steps of the project (1–4 in the figure from last week’s post) as an explicitly iterative cycle; this is a scoping phase, in which the project design is carried out with small-scale sampling to ensure that the approach fits both the needs of the policy discussion and the realities of what’s technically feasible within the timeline, budget and other constraints on the project. The revised structure is shown here (without the domain plug-ins, which are re-integrated below):

This revised structure better accommodates the numerous iterations involved in bringing the project to implementation readiness. Data mining projects are much more difficult to forecast than projects that use more traditional and established research approaches, and this risk profile needs to be structurally accommodated. Failure is a real possibility, and expectations and management practices need to be adjusted in light of this reality. (Appropriate management also decreases the risk of failure, in addition to better preparing for that possibility.) The scoping phase allows for small-scale exploration, to design and determine the feasibility of a project before moving to full-scale implementation. Once again, a decision tree is provided to help navigate those discussions.

While the feedback loops are no longer explicitly pictured, opportunities for such feedback remain open throughout the process, though the iterative scoping phase is designed to cover the most frequent and important loops. Additionally, this structure enables the team to acknowledge the most significant barriers to project success and to make those barriers the central drivers of study design, in order to minimize the chances of failure.

For instance, if data availability issues are the major stumbling block to getting a project off the ground, then the team can start with the data that are available and figure out what valuable questions they might be able to answer—in effect reverse-engineering the project. Given that data availability is such a challenge in many R&I policy contexts, this frequently acts as a principal driver. Furthermore, when a resource is that scarce, it’s important to ensure that its full value is extracted—once a high-quality data source is identified, a range of potential study questions should be developed to ensure that no fruit is left to wither on the branch.

This structural adjustment to the framework has also reshuffled a few steps in the process. First of all, in the context of data mining, the selection of analysis methods and preparation of data cannot be sufficiently disentangled for the process to be totally linear. Rather, some data must be prepared, with small-scale analyses conducted to determine feasibility of full-scale implementation; “feasibility” here pertains to both the full-scale preparation of data in the implementation phase later on and the value of the kinds of analyses that can be undertaken with the data prepared in this manner. When working with big data sources, data preparation can include the creation of new indicators from unstructured information (such as text mining for sentiment analysis), which is itself a quasi-analysis. These steps are easily distinguished in traditional applications, but less so when working with novel data sources.
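
To make the "quasi-analysis" point concrete, here is a minimal sketch of how preparing unstructured text can already amount to building an indicator. It is an illustration only, not the method used in our case studies: the word lists, the function name `sentiment_indicator` and the sample documents are all hypothetical, and a real project would rely on a validated lexicon or model.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only).
POSITIVE = {"growth", "success", "innovative", "strong"}
NEGATIVE = {"decline", "failure", "weak", "risk"}


def sentiment_indicator(text: str) -> float:
    """Score a text as (positive - negative) / matched words.

    The result lies in [-1, 1]; texts with no lexicon words score 0.0.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0


# Made-up sample documents standing in for a small scoping sample.
docs = [
    "Strong growth in innovative firms",
    "Funding decline poses a risk",
    "The committee met on Tuesday",
]
scores = [sentiment_indicator(d) for d in docs]
print(scores)
```

Even in this toy form, choosing the lexicon and the scoring rule is an analytical decision made during data preparation, which is why the two steps are hard to separate.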

Second, data mining projects sometimes involve not only the extraction and analysis of an existing data source but also the creation of a totally new one (such as through web scraping). Acknowledging this reality, we have added a “Data Collection” step after the scoping phase, when the project moves to full-scale implementation. When designing a study, it’s seldom necessary to have access to the full data set to assess feasibility: a sample of the data along with a characterization of the data set’s full contents is usually enough. Accordingly, the full collection process does not need to fall within the scoping phase, and insisting that it do so can waste valuable time and other resources at a moment when the fledgling project is still just taking shape.
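
As a rough sketch of what "a sample plus a characterization" might look like in practice, the snippet below draws a fixed-size random sample from a stream of records (using standard reservoir sampling) while tallying how often each field is filled in. It assumes the records can be streamed once as dictionaries, for instance from a provider's export; where that is not possible, the characterization would have to come from the provider's documentation instead. The function name, field names and toy records are hypothetical.

```python
import random
from collections import Counter


def sample_and_characterize(records, k, seed=0):
    """Return (sample, profile) from a single pass over `records`.

    sample  -- up to k records chosen uniformly at random (reservoir sampling)
    profile -- total record count and the share of records with each field filled
    """
    rng = random.Random(seed)
    sample = []
    total = 0
    filled = Counter()
    for rec in records:
        total += 1
        # Characterization: count non-empty values per field.
        for field, value in rec.items():
            if value not in (None, ""):
                filled[field] += 1
        # Reservoir sampling: keep the first k, then replace at random.
        if len(sample) < k:
            sample.append(rec)
        else:
            j = rng.randrange(total)
            if j < k:
                sample[j] = rec
    coverage = {f: n / total for f, n in filled.items()} if total else {}
    return sample, {"total": total, "coverage": coverage}


# Toy records (made up): every second record is missing its abstract.
records = [{"id": i, "abstract": "text" if i % 2 == 0 else ""} for i in range(100)]
sample, profile = sample_and_characterize(records, k=10)
print(profile)
```

A profile like this can flag early on that a field central to the study design is too sparsely populated, before any resources are committed to full-scale collection.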

As noted above, the uncertainty involved in using data mining methods for policy research needs to be reflected not only in the technical structure but also in the expectations and project management practices that guide the work. We came to learn that these are equally important considerations in trying to get the best value out of data mining projects in a policy context; in fact, the technical and management aspects are deeply intertwined. Accordingly, we’ll discuss them in depth in the next post in this series.

Coda: R&I plug-ins with the revised framework

For those interested in seeing how the domain-specific R&I plug-ins fit into the revised framework structure, this image should sate your curiosity. The indicator inventory is used during the scoping phase only, whereas all the other plug-ins get used twice—once during the scoping phase as the study design is taking shape and being tested, and once again when the project shifts into full-scale implementation.

Data Mining. Knowledge and technology flows in priority domains within the private sector and between the public and private sectors. (2017). Prepared by Science-Metrix for the European Commission. ISBN 978-92-79-68029-8; DOI 10.2777/089

Note: All views expressed are those of the individual author and are not necessarily those of Science-Metrix or 1science.

About the author

Brooke Struck

Brooke Struck is the Senior Policy Officer at Science-Metrix in Montreal, where he puts his background in philosophy of science to good use in helping policy types and technical types to understand each other a little better every day. He also takes gleeful pleasure in unearthing our shared but buried assumptions, and generally gadfly-ing everyone in his proximity. He is interested in policy for science as well as science for policy (i.e., evidence-based decision-making), and is progressively integrating himself into the development of new bibliometric indicators at Science-Metrix to address emerging policy priorities. Before working at Science-Metrix, Brooke worked for the Canadian Federal Government. He holds a PhD in philosophy from the University of Guelph and a BA with honours in philosophy from McGill University.