News

Here's a blog which highlights some of the more memorable events during my daily routine. Events include accepted or rejected papers (ACCEPT/REJECT), master's thesis defenses by students I supervised (MBI), research presentations of papers I (co-)authored (TALK), grant awards and rejections (ACCEPT/REJECT), and important research interest statements, among others.

VACANCIES: Postdocs in Dutch NLP

posted Jul 31, 2019, 1:07 PM by Marco Spruit

The Applied Data Science lab at Utrecht University (UU), the Information Systems group at Technical University Eindhoven (TUe) and the Psychiatry department of the University Medical Center Utrecht (UMCU) seek to appoint two full-time Postdoc researchers for the project “COVIDA: Computing Visits Data for Dutch Natural Language Processing in Mental Healthcare” led by Dr. Marco Spruit. The COVIDA project kickstarts the interuniversity and interdisciplinary COVIDA research group by furthering the state of the art in Natural Language Processing technologies for Dutch to improve daily practices in Mental Healthcare.

COVIDA’s scientific objective comprises the development of a hybrid Dutch language model to better understand human language in general, and Dutch Mental Healthcare language use in particular. We operate within the Design Science Research paradigm to model our computational experiment findings from both Computational Linguistics (i.e. knowledge-based) and Machine Learning (i.e. data-driven) inspired representations. Our societal contribution consists of a publicly available self-service facility for Natural Language Processing (NLP) of already routinely collected Dutch medical texts. Thus, COVIDA aims to deliver a game-changing innovation in the daily practices of Dutch mental healthcare institutions by enabling healthcare professionals throughout the Dutch language area to reuse the daily clinical notes that nurses and doctors record in patients’ EHRs to predict inpatient violence risk, depression, and more.


TALKS: Invited @ DisCo 2019

posted Jun 6, 2019, 2:42 AM by Marco Spruit   [ updated Jun 23, 2019, 5:35 AM ]

Data Science & Society
I gave a keynote at the DisCo 2019 conference on E-learning – Unlocking the Gate of Education around the Globe. In this talk I introduced the focus area of Applied Data Science at Utrecht University and how the ADS Profile for master students can help empower them. As a running example I discussed the obligatory ADS Profile course Data Science & Society, which focuses on Knowledge Discovery with Big Data using Cloud Computing technologies. In addition, I chaired the opening DisCo session and participated in a discussion panel on the Future of Education.

It was a lot of fun.

Shaheen Syed: From MSc to Dr

posted Mar 21, 2019, 5:43 AM by Marco Spruit   [ updated Mar 21, 2019, 5:45 AM ]

Yesterday was a great day: Shaheen Syed successfully defended his dissertation titled Topic Discovery from Textual Data: Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain in the Academiegebouw. In my opinion, it is a solid reference work for Utrecht University's Applied Data Science focus area research, especially related to Natural Language Processing applications and foundational research as envisioned by the Special Interest Group (SIG) Text Mining. Here are some key bits from this work.

The main research question in this thesis is: How can we improve the knowledge discovery process from textual data through latent topical perspectives? The first three chapters of this thesis seek to understand how different types of textual data, pre-processing steps, and hyper-parameter settings of probabilistic topic models affect the quality of the derived latent topics. The remaining three chapters are aimed at the interpretation of the latent topics and how such (raw) latent topics can be turned into useful (fisheries) domain knowledge. Throughout this thesis, and within each chapter, specific phases of the KDD process are covered. Combined, they provide guidelines on how to optimize the knowledge discovery process with the aim to understand the latent topical content of scientific publications better.

This work was funded by Horizon2020 Marie Skłodowska-Curie – ITN - ETN grant: SAF21.

Health Round Table @ICT.OPEN

posted Mar 21, 2019, 5:22 AM by Marco Spruit   [ updated Mar 21, 2019, 5:23 AM ]

On Tuesday 19 March I attended ICT.OPEN2019, the Dutch conference for ICT research, and participated in the Round Table Session on Health. The purpose of the discussion was to jointly reflect with scientific researchers and other stakeholders (e.g. SMEs) on the Dutch digitization strategy. The topic of discussion was IT research for social and economic challenges. All participants prepared a Canvas as input for a 2-minute pitch and for the discussions. Here was my contribution:

EDU: Linguistics 101

posted Jan 8, 2019, 1:51 AM by Marco Spruit   [ updated Jan 8, 2019, 1:37 PM ]

Today I gave my favorite lecture again in the Data Analytics course: Linguistics 101! From Language in Context (Consonants, Vowels, Cognates, Indo-European, Language Changes, Diversity, Dialects) to illustrating why NLP is so hard, through Constituents, Collocations, Linguistic Levels, Ambiguity, and Other Difficulties. Obviously, I also cover Parts of Speech basics such as Nouns, Pronouns, Determiners and Adjectives, Verbs, and Other Parts of Speech in the first hour...

This provides the groundwork for the actual Natural Language Processing (NLP) tasks of calculating Text Similarity (Automatic Similarity Computation, Types Of Text Similarity), Stemming (Morphological Similarity, Stemming, Porter’s Stemming Method, Porter’s Algorithm, Examples of Measures) and Edit distance (Spelling Similarity, Edit Operations, Levenshtein Method, Example, Pronunciation & Lexis).

All in 2 hours ;-)
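
To give a flavour of the edit distance part of the lecture, here is a minimal Python sketch of the Levenshtein method. It is an illustration only and not taken from the lecture slides; the function name and the example words are my own assumptions. It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one word into another.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character edits turning source into target."""
    # previous[j] = distance between the source prefix handled so far and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical example words: the classic textbook pair and a Dutch/German cognate.
print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("water", "wasser"))    # 2
```

The sketch keeps only two rows of the usual dynamic-programming table, which is enough when only the distance itself is needed and not the actual sequence of edit operations.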





EDU: Data Science & Society 2018

posted Nov 13, 2018, 6:50 AM by Marco Spruit   [ updated Nov 13, 2018, 7:21 AM ]

The Applied Data Science Lab just finished teaching the Data Science & Society course for 120 students. We revised the course significantly, such that it captures the research fields as shown with their interdependent relationships in the conceptual Venn diagram.

In a nutshell, illustrative of applied data science research, we regularly focused on relevant questions in a number of data science application domains including neonatology, epidemiology, geoscience, marketing, psychiatry, cell biology, and ethics & privacy, through a series of guest lectures. Thus, students can better understand the role of data science and its societal impact (ILO1). Next, students apply the CRISP-DM Knowledge Discovery Process in both lectures and many workshop sessions, also with special attention to methodological issues in Big Data analyses like p-value interpretation, multiple testing, replicability, overfitting, and construct validity. This teaches students to recognise the knowledge discovery processes in applied data science (ILO2). Throughout the course we maintained a Big Data focus, operationalised in a popular data science book review assignment, clarifying the particularities of big data in relation to data warehousing, SQL vs NoSQL, and ethical and privacy implications. In this way we help students identify trends and developments in big data technologies (ILO3). The Cloud Computing focus provides a thorough engineering component by utilising MS Azure as the Infrastructure-as-a-Service environment. Every student worked individually on their own personal Virtual Machine on weekly Hadoop and Spark assignments with real data and real research questions within an MS DevTest Labs context, mostly on Data Science Virtual Machine (DSVM) images. Thereby, students actually apply selected big data technologies to solve real-world problems (ILO4). All these tasks prepare students to help empower domain experts to run their own analyses, possibly by using pretrained models and APIs, to help realise our services computing-compatible vision of self-service data science.

We concluded the course with an online Remindo final exam which consisted of 85 multiple choice questions with the following resulting statistics as reported in Remindo:
We are quite content with the results, as the exam was intended to be more thorough than the Remindo midterm exam with 95 questions (which yielded significantly higher grades). It is clear that the results closely follow a normal distribution, with a good Cronbach's alpha score of >0.80. Must be a decent assessment, then!
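
For readers unfamiliar with Cronbach's alpha: it estimates the internal consistency of a test from the variance of the individual items relative to the variance of the total scores. Below is a minimal Python sketch of the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). It is not how Remindo computes its statistics (I do not know its implementation), and the tiny score matrix is made-up toy data, not our exam results.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix with one row per student, one column per item."""
    k = len(scores[0])  # number of items (questions)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("all students have the same total score")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up toy data: 1 = correct answer, 0 = incorrect answer.
toy_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(toy_scores), 2))
```

Values above roughly 0.80 are commonly read as good internal consistency, which is the rule of thumb applied to the exam above.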

TALKS: BNAIC 2018

posted Nov 10, 2018, 4:57 AM by Marco Spruit

Yesterday I presented Ingy Sarhan's poster Uncovering Algorithmic Approaches in Open Information Extraction: A Literature Review at the 30th Benelux Conference on Artificial Intelligence (BNAIC) in 's-Hertogenbosch, The Netherlands. 

I also attended the highly interesting talk by prof. Eyke Hüllermeier on On-the-Fly Machine Learning (OTF-ML), an extension of the idea of automated machine learning (AutoML). That is, the on-the-fly selection, configuration, provision, and execution of machine learning and data analytics functionality as requested by an end-user. This is highly similar to my definition of Automated (Adaptive) Analytic Systems, except that in my own Model-Driven Analytic Systems approach I strive for semi-automation at the most, and certainly not automated knowledge discovery processes. I will definitely take a look at the ML-Plan software to assess to what extent I can integrate it into my own research plans.

OUT: Applied Data Science in Patient-Centric Healthcare

posted May 23, 2018, 5:38 AM by Marco Spruit   [ updated Oct 30, 2018, 5:13 AM ]

https://authors.elsevier.com/c/1X2wl2dUkY816b
Even though my research is frequently being published, I now have one paper out that I am particularly happy with and proud of, in a collaboration with my Greek friend Miltiadis: 
  • Spruit, M., & Lytras, M. (2018). Applied Data Science in Patient-centric Healthcare: Adaptive Analytic Systems for Empowering Physicians and Patients. Telematics and Informatics, 35(4), 643–653. [ISI impact factor: 3.398] [pdf] [online]
This strategic paper defines and positions my research theme as a research framework for Applied Data Science research on the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts. It introduces Adaptive Analytic Systems as a novel research perspective of the three intertwining aspects within the knowledge discovery process in healthcare: 
  1. domain and data understanding for physician- and patient-centric healthcare, 
  2. data preprocessing and modelling using natural language processing and big data analytic techniques, and 
  3. model evaluation and knowledge deployment through information infrastructures. 
We align these knowledge discovery aspects with the design science research steps of problem investigation, treatment design, and treatment validation, respectively, noting that the adaptive component in healthcare system prototypes may translate to data-driven personalisation aspects including personalised medicine. 

We then explore how applied data science for patient-centric healthcare can thus empower physicians and patients to more effectively and efficiently improve healthcare, through the included manuscripts in this special issue of the high-impact journal Telematics and Informatics.

Last but certainly not least, we propose Meta-Algorithmic Modelling as a solution-oriented design science research framework in alignment with the knowledge discovery process to address the three key dilemmas in the emerging “post-algorithmic era” of data science: depth versus breadth, selection versus configuration, and accuracy versus transparency.

NB: Elsevier provides free access to the paper until July 4, 2018!

PRESS: Sleepwet outreach

posted Mar 22, 2018, 4:14 AM by Marco Spruit   [ updated Mar 22, 2018, 4:19 AM ]

My professional opinion on the Sleepwet/Wiv was published on the Utrecht University homepage as well as in the regional news headlines, after being prepared by our UU science editor/public information officer... Furthermore, together with other computer science colleagues in the Netherlands, we issued an extensive statement on our concerns with the current Intelligence and Security Services Act (Wet op de inlichtingen- en veiligheidsdiensten, Wiv).

The UU piece was subsequently picked up by the popular regional online newspaper DUIC one day before the national referendum on this topic. The official result will be made public a week from now, but it will be a close call either way.

Finally, next week a group of secondary school students will interview me about the implications of the Sleepwet for their school project... 
Outreach is important stuff!

https://sites.google.com/a/spru.it/marco/press/sleepwet.jpg
https://sites.google.com/a/spru.it/marco/files/20180320%20-%20Utrechtse%20onderzoekers%20plaatsen%20vraagtekens%20bij%20effectiviteit%20sleepwet%20-%20De%20Utrechtse%20Internet%20Courant.pdf?attredirects=0&d=1

PS: The nice Wifi wordcloud was made using www.woordwolk.nl on the table of contents of the US National Research Council's 2008 report Protecting Individual Privacy in the Struggle Against Terrorists.

TALKS: HealthINF 2018

posted Jan 23, 2018, 12:44 AM by Marco Spruit

Last week I presented the following two papers at the HEALTHINF 2018 conference:
  1. Speech Technology in Dutch Health Care: A Qualitative Study (19/01/2018). Poster at the 11th International Joint Conference on Biomedical Engineering Systems and Technologies. HEALTHINF 2018, Funchal, Portugal. 
  2. Devices Used for Non-invasive Tele homecare for Cardiovascular Patients: A Systematic Literature Review (19/01/2018). 11th International Joint Conference on Biomedical Engineering Systems and Technologies. HEALTHINF 2018, Funchal, Portugal. [15 min.]
