Here's a blog which highlights some of the more memorable events during my daily routine. Events include accepted or rejected papers (ACCEPT/REJECT), master thesis defenses by students I supervised (MBI), research presentations of papers I (co-)authored (TALK), grant awards and rejections (ACCEPT/REJECT), and important research interest statements, among others.

EDU: Applied Data Science masterclass

posted Dec 14, 2019, 4:58 AM by Marco Spruit   [ updated Dec 14, 2019, 7:27 AM ]

On Friday Dec 13 I gave a one-day masterclass about Applied Data Science for Data Science trainees in the context of our Life Long Learning programme. With 4 rounds of 75 minutes, it felt not unlike a sporting activity, but there was a great and enthusiastic atmosphere and I enjoyed myself a lot as well.

My focus was on explaining the entire knowledge discovery process based on CRISP-DM, illustrating each phase with a real-life healthcare case study. The corresponding learning objectives were to recognise the knowledge discovery process in applied data science, to understand the role of data science and its societal impact, and to identify trends and developments in data science technologies.

Topics included requirements elicitation, hypothesis-free data exploration, natural language processing, automated machine learning, and dashboard design. In the closing talk about current trends and developments I touched upon all other cool technologies and developments that I hadn't mentioned yet, including deep learning, transfer learning, active learning, and deep reinforcement learning... Pretty mind-boggling and fun!

Dr. Menger: Vincent's successful defense

posted Oct 3, 2019, 2:06 AM by Marco Spruit

Yesterday, on Oct 2, Vincent Menger quite successfully defended his dissertation "Knowledge Discovery in Clinical Psychiatry: Learning from electronic health records" in the Academiegebouw. This is another solid reference work for Utrecht University's Applied Data Science focus area research, especially related to Natural Language Processing applications and foundational research as envisioned by the Special Interest Group (SIG) Text Mining. Here are some key bits from this work which investigates the following overarching research question: "How can data from Electronic Health Records provide relevant insights for psychiatric care?"

In the first three research chapters of this work, he identifies key technical, organizational and ethical challenges related to knowledge discovery in EHRs. He introduces the CRISP-IDM process, where the I stands for Interactive, as a process model for collaboration based on data visualization. He introduces the Capable Reuse of EHR Data (CARED) framework, aiming to support health care institutions to design such infrastructure. He develops and validates the De-identification Method for Dutch Medical Text (DEDUCE), which aims to automatically remove information that can identify a patient from free text.

In the second part of this research, Vincent focuses on applying knowledge discovery techniques to EHR data to obtain new insights with potential to improve care. First he looks at violence risk assessment, using two clinical datasets to train models that can assess violence risk based on clinical text, and then performs a rigorous evaluation of their accuracy and generalizability. Finally, he turns to identifying psychiatric patient subgroups, and investigates how unsupervised learning can find robust and accurate stratifications of patients using cluster ensembles.

The two parts of this dissertation combined show that learning from EHRs, after addressing key challenges related to the nature of data, is a new and interesting approach with clear potential for improving psychiatric health care.

VACANCIES: Postdocs or PhD students in Dutch NLP

posted Jul 31, 2019, 1:07 PM by Marco Spruit   [ updated Sep 20, 2019, 1:06 PM ]

The Applied Data Science lab at Utrecht University (UU), the Information Systems group at Technical University Eindhoven (TUe) and the Psychiatry department of the University Medical Center Utrecht (UMCU) seek to appoint two full-time Postdoc researchers for the project “COVIDA: Computing Visits Data for Dutch Natural Language Processing in Mental Healthcare” led by Dr. Marco Spruit. The COVIDA project kickstarts the interuniversity and interdisciplinary COVIDA research group by furthering the state of the art in Natural Language Processing technologies for Dutch to improve daily practices in Mental Healthcare.

COVIDA’s scientific objective comprises the development of a hybrid Dutch language model to better understand human language in general, and Dutch Mental Healthcare language use in particular. We operate within the Design Science Research paradigm to model our computational experiment findings from both Computational Linguistics (i.e. knowledge-based) and Machine Learning (i.e. data-driven) inspired representations. Our societal contribution consists of a publicly available self-service facility for Natural Language Processing (NLP) of already routinely collected Dutch medical texts. Thus, COVIDA aims to deliver a game-changing innovation of Dutch mental healthcare institutions’ daily practices by enabling healthcare professionals throughout the Dutch language area to reuse the daily clinical notes written by nurses and doctors in patients’ EHRs to predict inpatient violence risk, depression, and more.

Your core research task is to design, implement and evaluate NLP pipelines for Dutch clinical texts which utilise linguistic and domain knowledge and structured data in a privacy-by-design architecture from both Deep/Transfer Learning and symbolic NLP perspectives.

TALKS: Invited @ DisCo 2019

posted Jun 6, 2019, 2:42 AM by Marco Spruit   [ updated Jun 23, 2019, 5:35 AM ]

Data Science & Society
I gave a keynote at the DisCo 2019 conference on E-learning – Unlocking the Gate of Education around the Globe. In this talk I introduced the focus area of Applied Data Science at Utrecht University and how the ADS Profile for master students can help empower them. As a running example I discussed the obligatory ADS Profile course Data Science & Society, which focuses on Knowledge Discovery with Big Data using Cloud Computing technologies. In addition, I chaired the opening DisCo session and participated in a discussion panel on the Future of Education.

It was a lot of fun doing it.

Shaheen Syed: From MSc to Dr

posted Mar 21, 2019, 5:43 AM by Marco Spruit   [ updated Dec 6, 2019, 8:08 AM ]

Yesterday was a great day: Shaheen Syed successfully defended his dissertation titled Topic Discovery from Textual Data: Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain in the Academiegebouw. In my opinion, a solid reference work for Utrecht University's Applied Data Science focus area research, especially related to Natural Language Processing applications and foundational research as envisioned by the Special Interest Group (SIG) Text Mining. Here are some key bits from this work.

The main research question in this thesis is: How can we improve the knowledge discovery process from textual data through latent topical perspectives? The first three chapters of this thesis seek to understand how different types of textual data, pre-processing steps, and hyper-parameter settings of probabilistic topic models affect the quality of the derived latent topics. The remaining three chapters are aimed at the interpretation of the latent topics and how such (raw) latent topics can be turned into useful (fisheries) domain knowledge. Throughout this thesis, and within each chapter, specific phases of the KDD process are covered. Combined, they provide guidelines on how to optimize the knowledge discovery process with the aim to understand the latent topical content of scientific publications better.

This work was funded by Horizon2020 Marie Skłodowska-Curie – ITN - ETN grant: SAF21.

Health Round Table @ICT.OPEN

posted Mar 21, 2019, 5:22 AM by Marco Spruit   [ updated Mar 21, 2019, 5:23 AM ]

On Tuesday, March 19, I attended ICT.OPEN2019, the Dutch conference for ICT research, and participated in the Round Table Session on Health. The purpose of the discussion was to jointly reflect with scientific researchers and other stakeholders (e.g. SMEs) on the Dutch digitization strategy. The topic of discussion was IT research for social and economic challenges. All participants prepared a Canvas as input for a 2-minute pitch and for the discussions. Here was my contribution:

EDU: Linguistics 101

posted Jan 8, 2019, 1:51 AM by Marco Spruit   [ updated Jan 8, 2019, 1:37 PM ]

Today I gave my favorite lecture again in the Data Analytics course: Linguistics 101! From Language in Context (Consonants, Vowels, Cognates, Indo-European, Language Changes, Diversity, Dialects) to illustrating why NLP is so hard, through Constituents, Collocations, Linguistic Levels, Ambiguity, and Other Difficulties. Obviously, I also cover Parts of Speech basics such as Nouns, Pronouns, Determiners and Adjectives, Verbs, and Other Parts of Speech in the first hour...

This provides the groundwork for the actual Natural Language Processing (NLP) tasks of calculating Text Similarity (Automatic Similarity Computation, Types Of Text Similarity), Stemming (Morphological Similarity, Stemming, Porter’s Stemming Method, Porter’s Algorithm, Examples of Measures) and Edit distance (Spelling Similarity, Edit Operations, Levenshtein Method, Example, Pronunciation & Lexis).

All in 2 hours ;-)
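
As a small illustration of the Levenshtein method listed under edit distance above, here is a minimal Python sketch. It is not taken from the lecture materials; the function name `levenshtein` and the example words are assumptions chosen for illustration. It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another, using the standard dynamic-programming formulation with two rows.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Illustrative word pair, not from the lecture.
    print(levenshtein("kitten", "sitting"))
```

Keeping only two rows of the table is a design choice for memory; a full table would be needed to also recover the actual sequence of edit operations.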

EDU: Data Science & Society 2018

posted Nov 13, 2018, 6:50 AM by Marco Spruit   [ updated Nov 13, 2018, 7:21 AM ]

The Applied Data Science Lab just finished teaching the Data Science & Society course for 120 students. We revised the course significantly, such that it captures the research fields as shown with their interdependent relationships in the conceptual Venn diagramme.

In a nutshell, illustrative of applied data science research, we regularly focused on relevant questions in a number of data science application domains including neonatology, epidemiology, geoscience, marketing, psychiatry, cell biology, and ethics & privacy, through a series of guest lectures. Thus, students can better understand the role of data science and its societal impact (ILO1). Next, students apply the CRISP-DM Knowledge Discovery Process in both lectures and many workshop sessions, with special attention to methodological issues in Big Data analyses such as p-value interpretation, multiple testing, replicability, overfitting, and construct validity. This teaches students to recognise the knowledge discovery processes in applied data science (ILO2). Throughout the course we maintained a Big Data focus, operationalised in a popular data science book review assignment, clarifying the particularities of big data in relation to data warehousing, SQL vs NoSQL, and ethical and privacy implications. In this way we help students identify trends and developments in big data technologies (ILO3). The Cloud Computing focus provides a thorough engineering component by utilising MS Azure as the Infrastructure-as-a-Service environment. Every student worked individually on their own personal Virtual Machine on weekly Hadoop and Spark assignments with real data and real research questions within an MS DevTest Labs context, mostly on Data Science Virtual Machine (DSVM) images. Thereby, students actually apply selected big data technologies to solve real-world problems (ILO4). All these tasks are performed to prepare students to help empower domain experts to run their own analyses, possibly by using pretrained models and APIs, to help realise our services computing-compatible vision of self-service data science.

We concluded the course with an online Remindo final exam which consisted of 85 multiple choice questions with the following resulting statistics as reported in Remindo:
We are quite content with the results, as the exam was intended to be more thorough than the Remindo midterm exam with 95 questions (which yielded significantly higher grades). The results are clearly close to normally distributed, with a good Cronbach's alpha score of >0.80. Must be a decent assessment, then!
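
For readers who want to see what such a reliability score measures, here is a minimal Python sketch of the usual Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of exam items. This is not how Remindo computes its report; the function name `cronbach_alpha` and the tiny score matrix are hypothetical and serve only as an illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Compute Cronbach's alpha.

    scores: one row per student, each row a list of per-item scores
    (for multiple choice items, e.g. 1 for correct and 0 for incorrect).
    """
    k = len(scores[0])  # number of items
    # Variance of each item across students.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the students' total scores.
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical data: 4 students, 3 items.
    example = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0],
    ]
    print(round(cronbach_alpha(example), 3))
```

The sketch uses population variances throughout; using sample variances consistently for both the items and the totals gives the same alpha, since only the ratio matters.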


posted Nov 10, 2018, 4:57 AM by Marco Spruit

Yesterday I presented Ingy Sarhan's poster Uncovering Algorithmic Approaches in Open Information Extraction: A Literature Review at the 30th Benelux Conference on Artificial Intelligence (BNAIC) in 's-Hertogenbosch, The Netherlands. 

I also attended the highly interesting talk by prof. Eyke Hüllermeier on On-the-Fly Machine Learning (OTF-ML), an extension of the idea of automated machine learning (AutoML). That is, the on-the-fly selection, configuration, provision, and execution of machine learning and data analytics functionality as requested by an end-user. This is highly similar to my definition of Automated (Adaptive) Analytic Systems, except that in my own Model-Driven Analytic Systems approach I strive for semi-automation at the most, and certainly not automated knowledge discovery processes. I will definitely take a look at the ML-Plan software to assess to what extent I can integrate it into my own research plans.

OUT: Applied Data Science in Patient-Centric Healthcare

posted May 23, 2018, 5:38 AM by Marco Spruit   [ updated Oct 30, 2018, 5:13 AM ]
Even though my research is frequently being published, I now have one paper out that I am particularly happy with and proud of, in a collaboration with my Greek friend Miltiadis: 
  • Spruit, M., & Lytras, M. (2018). Applied Data Science in Patient-centric Healthcare: Adaptive Analytic Systems for Empowering Physicians and Patients. Telematics and Informatics, 35(4), 643–653. [ISI impact factor: 3.398]
This strategic paper defines and positions my research theme as a research framework for Applied Data Science research on the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts. It introduces Adaptive Analytic Systems as a novel research perspective of the three intertwining aspects within the knowledge discovery process in healthcare: 
  1. domain and data understanding for physician- and patient-centric healthcare, 
  2. data preprocessing and modelling using natural language processing and big data analytic techniques, and 
  3. model evaluation and knowledge deployment through information infrastructures. 
We align these knowledge discovery aspects with the design science research steps of problem investigation, treatment design, and treatment validation, respectively, noting that the adaptive component in healthcare system prototypes may translate to data-driven personalisation aspects including personalised medicine. 

We then explore how applied data science for patient-centric healthcare can thus empower physicians and patients to more effectively and efficiently improve healthcare, through the included manuscripts in this special issue of the high-impact journal Telematics and Informatics.

Last but certainly not least, we propose Meta-Algorithmic Modelling as a solution-oriented design science research framework in alignment with the knowledge discovery process to address the three key dilemmas in the emerging “post-algorithmic era” of data science: depth versus breadth, selection versus configuration, and accuracy versus transparency.

NB: Elsevier provides free access to the paper until July 4, 2018!
