Please refer to my Research Projects page for active projects in this theme.

My overarching research theme is Natural Language Processing Systems for Self-Service Data Science. The growing importance of self-service data science brings many new opportunities, but also new research challenges. The aims of this research theme are threefold. First, we should empower researchers, domain professionals and citizens to maximise the societal impact of data science technologies. Second, we should evaluate our analytic systems in daily practice, where they are needed: Society is our Lab! Third, we should design our analytic systems beforehand and curate our findings afterwards in meta-algorithmic models. Please refer to Spruit & Lytras (2018) for more information on this research theme.

I primarily focus on designing adaptive analytic systems to investigate aspects of the Knowledge Discovery Process, using a solution-oriented Design Science Research approach within an Applied Data Science (ADS) context. Spruit & Lytras (2018) define the emerging research field of Applied Data Science, based on Pritzker & May’s (2015) Data Science Venn diagram, as “the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts”. This contrasts with more theoretical data science, which primarily aims to develop novel statistical and machine learning techniques for improving data science itself. Nevertheless, novel applications of data science methodology and engineering in a particular scientific domain are likely to result in new theoretical data science research questions, in line with the UPADS (2017) Starting Document. I am particularly interested in meta-algorithmic modelling of usable, reusable solutions with respect to the three key dilemmas in the post-algorithmic era of data science: depth vs breadth, selection vs configuration, and accuracy vs transparency (Spruit & Jagesar, 2016).

Knowledge Discovery Process

Firstly, the problem investigation phase requires modelling of the application domain, including hypothesis generation, and exploratory data analysis to understand the often unstructured or semi-structured data. My strategic focus is on the Health domain, but I am also active in complex adaptive systems such as security, fisheries, and language. Three examples: First, Spruit et al. (2014) uncover the potential for data-driven long-term care. Second, Syed … Spruit (2018) capture the complexity of the fisheries domain through a latent topic analysis of 46,000+ journal articles in the domain from 1990 to 2016. Third, Baars … Spruit (2016) develop a custom survey instrument to understand and quantify the influence of organisational characteristics within the information security domain through an adaptive maturity model for incremental process improvement.

Secondly, in the treatment design phase the data need to be semantically processed before they can model potential insights to help answer the raised hypotheses. This requires hypothesis-sensitive Natural Language Processing (NLP) techniques, where I focus on technique selection and configuration for decision making in daily practices. Three examples: First, in Spruit & Vlug (2015) we present a text snippet enrichment process for automatic classification of financial transactions. Second, Menger … Spruit (2018) develop an information extraction method for automatic de-identification of Dutch medical texts. Third, in Syed & Spruit (2017) we examine the quality of latent topics in scientific publications when employing Latent Dirichlet Allocation (LDA) analyses based on either abstract or full-text data.

Thirdly and finally, once the analytic model performs well in controlled computational experiments, the treatment validation phase shifts to software prototype engineering to create an adaptive analytic system with which one can determine the determinants of the system’s utility in daily practice. For example, Meulendijk, Spruit … (2015) evaluate their STRIPA analytic system’s usability for physicians to optimise medical records for polypharmacy patients by jointly measuring its effectiveness, efficiency and user satisfaction. It is in this knowledge deployment step that this research theme particularly aims to contribute to the body of knowledge on Information Infrastructure, which in its broadest sense is “the technical, social, and political framework that encompasses the people, technology, tools, and services used to facilitate the distributed, collaborative use of content over time and distance” (Borgman, 2010:19). An information infrastructure can refer to a schema-on-write data warehouse, a schema-on-read big data lake or a distributed Spark-based computing cloud that is (to be) used in daily practices. I consider dilemmas such as interoperability vs uniformity, data quality vs usability, and standardisation vs situationality (e.g. Hanseth et al., 1996). Three examples: First, Dijk … Spruit (2017) describe a data quality resolving architecture in the justice domain. Second, Shen … Spruit (2016) present a federated information architecture for the medication review process in multinational clinical trials. Third, Seddik & Spruit (2018) demonstrate the SNPCurator analytic system for enriched, interactive literature mining of SNP-disease associations.
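The contrast between schema-on-write and schema-on-read mentioned above can be illustrated with a minimal sketch. It is an illustration only and does not describe any of the cited systems: the prescription records, the table layout and the function names are all hypothetical, and an in-memory SQLite table merely stands in for a data warehouse.

```python
import json
import sqlite3

# Hypothetical raw records; the second one lacks a dose.
RAW_EVENTS = [
    '{"patient": "p1", "drug": "aspirin", "dose_mg": 100}',
    '{"patient": "p2", "drug": "metformin"}',
]


def schema_on_write(raw_events):
    """Enforce a fixed schema at load time; records that violate it are rejected."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prescription ("
        "patient TEXT NOT NULL, drug TEXT NOT NULL, dose_mg INTEGER NOT NULL)"
    )
    rejected = []
    for line in raw_events:
        record = json.loads(line)
        row = {key: record.get(key) for key in ("patient", "drug", "dose_mg")}
        try:
            conn.execute(
                "INSERT INTO prescription (patient, drug, dose_mg) "
                "VALUES (:patient, :drug, :dose_mg)",
                row,
            )
        except sqlite3.IntegrityError:
            # A missing field becomes NULL and violates the NOT NULL constraint.
            rejected.append(record)
    return conn, rejected


def schema_on_read(raw_events, fields):
    """Keep raw records untouched; apply a projection only when reading."""
    return [
        {name: json.loads(line).get(name) for name in fields}
        for line in raw_events
    ]
```

In this sketch the data quality vs usability dilemma becomes visible: the schema-on-write loader guarantees complete rows but discards the incomplete record, whereas the schema-on-read projection keeps every record and leaves the handling of missing values to the consumer.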

Design Science Research

My research is primarily solution-oriented (i.e. application-oriented), in contrast to being either purely applied or theoretical. Methodologically, an analytic system prototype functions as a research intervention instrument. The prototype is used to evaluate the design science artefact under development (e.g. a method, model, process, framework, or architecture), employing metrics such as effectiveness, efficiency and usability to determine the analytic system’s societal impact. I refer to Prat et al. (2014) for a complete overview of relevant artefact evaluation metrics.

The research artefacts are preferably modelled for improved reusability and reproducibility using meta-algorithmic modelling, which Spruit & Jagesar (2016) define as “an engineering discipline where sequences of algorithm selection and configuration activities are specified deterministically for performing analytical tasks based on problem-specific data input characteristics and process preferences”. This effectively extends the reusability and generalisability of the industry-standard knowledge discovery process (Lefebvre, Spruit …, 2015), thus contributing to the scientific body of knowledge by providing proven recipes for properly addressing the three key data science dilemmas given a problem-specific challenge.
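To make the quoted definition more concrete, the following minimal sketch shows what a deterministic selection and configuration step could look like. It is not taken from Spruit & Jagesar (2016): the data characteristics, the preference flag, the decision rules, the thresholds and the technique names are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProfile:
    """Problem-specific data input characteristics (hypothetical fields)."""
    n_documents: int
    labelled: bool
    avg_tokens_per_document: float


@dataclass(frozen=True)
class Preferences:
    """Process preferences (hypothetical field)."""
    transparency_required: bool


def select_and_configure(profile: DataProfile, prefs: Preferences) -> dict:
    """Deterministically map a data profile and preferences to a technique
    and its configuration. All rules below are illustrative assumptions."""
    if not profile.labelled:
        # Unlabelled text: topic modelling, with a topic count that grows
        # with corpus size and is clamped to the range 5..100.
        n_topics = max(5, min(100, profile.n_documents // 500))
        return {"technique": "lda", "config": {"n_topics": n_topics}}
    if prefs.transparency_required:
        # Accuracy vs transparency: prefer an inspectable model.
        return {"technique": "decision_tree", "config": {"max_depth": 5}}
    if profile.avg_tokens_per_document < 20:
        # Short snippets: a simple model over word uni- and bigrams.
        return {"technique": "naive_bayes", "config": {"ngram_range": (1, 2)}}
    return {"technique": "linear_svm", "config": {"C": 1.0}}
```

Because the function has no random or hidden state, the same profile and preferences always yield the same recipe, which is the property that makes such a model reusable and reproducible across analyses.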

My ADS research team performs this research in close collaboration with various research groups within and outside Utrecht University, to elicit and utilise their domain expertise. These groups include UMCU/Psychiatry, UMCU/WKZ, UMCU/Geriatrics, UMCU/Julius Centre, UMCU/Cardiology, UMCU/Cell Biology, UU/Bioinformatics, UU/Social Sciences, as well as RUG/Behavioral Neuroscience, OU/Information Science, Ministry of Security and Justice/WODC, Switzerland’s UBERN/Internal Medicine, Switzerland’s Fachhochschule Nordwestschweiz/Software engineering, and Norway’s Arctic University/Fisheries management, among many others.

References not in my publication list

  • Borgman, C. (2010). Scholarship in the Digital Age: Information, Infrastructure, and the Internet. MIT Press.
  • Hanseth, O., Monteiro, E., & Hatling, M. (1996). Developing information infrastructure: The tension between standardization and flexibility. Science, Technology, & Human Values, 21(4), 407-426.
  • Prat, N., Comyn-Wattiau, I., & Akoka, J. (2014). Artefact Evaluation in Information Systems Design-Science Research - a Holistic View. In: PACIS 2014 Proceedings, Paper 23.
  • Pritzker, P., & May, W. (2015). NIST Big Data Interoperability Framework (NBDIF): Volume 1: Definitions. NIST Special Publication 1500-1. Final Version 1. National Institute of Standards and Technology.
  • Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 13-22.
  • UPADS (2017). Starting document for the Focus area of the Utrecht Platform for Applied Data Science. Retrieved 20 Aug 2017, from
  • Wieringa, R. (2014). Design Science Methodology for Information Systems and Software Engineering. Springer.