Please refer to my Research Projects page for active projects in this theme.

In my research theme Model-Driven Analytic Systems I investigate intertwining aspects of the three dual phases of the Knowledge Discovery Process with a solution-oriented Design Science Research approach within an Applied Data Science (ADS) context. Building on Pritzker & May’s (2015) Data Science Venn diagram, Spruit & Lytras (2018) define the emerging research field of Applied Data Science as “the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts”. This contrasts with more theoretical data science, which primarily aims to develop novel statistical and machine learning techniques for improving data science itself. Nevertheless, novel applications of data science methodology and engineering in a particular scientific domain are likely to result in new theoretical data science research questions, in line with the UPADS (2017) Starting Document. I am particularly interested in meta-algorithmic modelling solutions with respect to three key dilemmas in data science: depth vs breadth, selection vs configuration, and accuracy vs transparency (Spruit & Jagesar, 2016).

Knowledge Discovery Process

Firstly, the problem investigation phase requires modelling of the application domain, including hypothesis generation, and exploratory data analysis to understand the often unstructured or semi-structured data. My strategic focus is on the Health domain, but I am also active in complex adaptive systems such as security, fisheries, and language. Three examples: First, Spruit et al. (2014) uncover the potential for data-driven long-term care. Second, Syed … Spruit (in press) capture the complexity of the fisheries domain through a latent topic analysis of most journal articles in the domain from 1990 to 2016. Third, Baars … Spruit (2016) develop a custom survey instrument to understand and quantify the influence of organisational characteristics within the information security domain through an adaptive maturity model for incremental process improvement.

Secondly, in the treatment design phase the data need to be semantically processed before they can model potential insights to help answer the raised hypotheses. This requires hypothesis-sensitive Natural Language Processing (NLP) techniques, where I focus on technique selection and configuration for decision making in daily practices. Three examples: First, in Spruit & Vlug (2015) we present a text snippet enrichment process for automatic classification of financial transactions. Second, Menger … Spruit (in press) develop an information extraction method for automatic de-identification of Dutch medical texts. Third, in Syed & Spruit (2017) we examine the quality of latent topics in scientific publications when employing Latent Dirichlet Allocation (LDA) analyses based on either abstract or full-text data.

Thirdly and finally, once the analytic model performs well in controlled computational experiments, the treatment validation phase shifts to software prototype engineering to create an adaptive analytic system with which one can determine the system’s utility determinants in daily practice. For example, Meulendijk, Spruit … (2015) evaluate their STRIPA analytic system’s usability for physicians to optimise medical records for polypharmacy patients by jointly measuring its effectiveness, efficiency and user satisfaction. It is in this knowledge deployment step that this research theme particularly aims to contribute to the body of knowledge on Information Infrastructure, which, in its broadest sense, is "the technical, social, and political framework that encompasses the people, technology, tools, and services used to facilitate the distributed, collaborative use of content over time and distance" (Borgman, 2010:19). An information infrastructure can refer to a schema-on-write data warehouse, a schema-on-read big data lake, or a distributed Spark-based computing cloud that is (to be) used in daily practices. I consider dilemmas such as interoperability vs uniformity, data quality vs usability, and standardisation vs situationality (e.g. Hanseth et al., 1996). Three examples: First, Dijk … Spruit (2017) describe a data quality resolving architecture in the justice domain. Second, Shen … Spruit (2016) present a federated information architecture for the medication review process in multinational clinical trials. Third, Seddik & Spruit (2018) demonstrate the SNPCurator analytic system for enriched, interactive literature mining of SNP-disease associations.

Design Science Research

My research is primarily solution-oriented (i.e. application-oriented), in contrast to being either purely applied or theoretical. Methodologically, an analytic system prototype functions as a research intervention instrument. The prototype is used to evaluate the design science artefact under development (e.g. a method, model, process, framework, or architecture), employing metrics such as effectiveness, efficiency and usability to determine the analytic system’s societal impact. I refer to Prat et al. (2014) for a complete overview of relevant artefact evaluation metrics.

The research artefacts are preferably modelled using meta-algorithmic modelling, which Spruit & Jagesar (2016) define as “an engineering discipline where sequences of algorithm selection and configuration activities are specified deterministically for performing analytical tasks based on problem-specific data input characteristics and process preferences”. This effectively extends the reusability and generalisability of the industry-standard knowledge discovery process (Lefebvre, Spruit …, 2015), thus contributing to the scientific body of knowledge by providing proven recipes for properly addressing the three key data science dilemmas given a problem-specific challenge.
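To make the definition above more concrete, the following is a minimal Python sketch of what a deterministic selection-and-configuration step could look like. It is not taken from Spruit & Jagesar (2016): the names DataProfile, Preferences and select_and_configure, the technique labels, and all thresholds are hypothetical placeholders chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProfile:
    """Hypothetical problem-specific data input characteristics."""
    n_rows: int
    is_text: bool


@dataclass(frozen=True)
class Preferences:
    """Hypothetical process preferences (accuracy vs transparency)."""
    needs_transparency: bool


def select_and_configure(profile: DataProfile, prefs: Preferences) -> dict:
    """Map a data profile and preferences to one technique plus configuration.

    The mapping is a fixed rule table, so identical inputs always yield the
    same recipe. The rules and thresholds are illustrative assumptions only.
    """
    if profile.is_text:
        # Selection: text input goes to a topic model.
        # Configuration: use more topics only for larger corpora.
        n_topics = 10 if profile.n_rows < 1000 else 50
        return {"technique": "topic_model", "config": {"n_topics": n_topics}}
    if prefs.needs_transparency:
        # Transparency preferred: a shallow, inspectable model.
        return {"technique": "decision_tree", "config": {"max_depth": 4}}
    # Otherwise favour accuracy over transparency.
    return {"technique": "gradient_boosting", "config": {"n_estimators": 200}}


if __name__ == "__main__":
    recipe = select_and_configure(
        DataProfile(n_rows=500, is_text=False),
        Preferences(needs_transparency=True),
    )
    print(recipe)
```

The point of the sketch is only that selection (which technique) and configuration (which parameter values) are both decided by explicit, repeatable rules over data characteristics and preferences, which is what makes such a recipe reusable across problems.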

My ADS research team performs this research in close collaboration with various research groups within and outside of Utrecht University, to elicit and utilise their domain expertise, including UMCU/Psychiatry, UMCU/WKZ, UMCU/Geriatrics, UMCU/Julius Centre, UMCU/Cardiology, UMCU/Cell Biology, UU/Bioinformatics, UU/Social Sciences, as well as RUG/Behavioral Neuroscience, OU/Information Science, Ministry of Security and Justice/WODC, Switzerland’s UBERN/Internal Medicine, Switzerland’s Fachhochschule Nordwestschweiz/Software engineering, and Norway’s Arctic University/Fisheries management, among many others.

References not in my publication list

  • Borgman, C. (2010). Scholarship in the Digital Age: Information, Infrastructure, and the Internet. MIT Press.
  • Hanseth, O., Monteiro, E., & Hatling, M. (1996). Developing information infrastructure: The tension between standardization and flexibility. Science, Technology, & Human Values, 21(4), 407-426.
  • Prat, N., Comyn-Wattiau, I., & Akoka, J. (2014). Artefact Evaluation in Information Systems Design-Science Research - a Holistic View. In: PACIS 2014 Proceedings, Paper 23.
  • Pritzker, P., & May, W. (2015). NIST Big Data Interoperability Framework (NBDIF): Volume 1: Definitions. NIST Special Publication 1500-1. Final Version 1. National Institute of Standards and Technology.
  • Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 13-22.
  • UPADS (2017). Starting document for the Focus area of the Utrecht Platform for Applied Data Science. Retrieved 20 Aug 2017, from
  • Wieringa, R. (2014). Design Science Methodology for Information Systems and Software Engineering. Springer.