Here's a blog which highlights some of the more memorable events during my daily routine. Events include accepted or rejected papers (ACCEPT/REJECT), master's thesis defenses by students I supervised (MBI), research presentations of papers I (co-)authored (TALK), grant awards and rejections (ACCEPT/REJECT), and important research interest statements, among others.

Dr. Omta: a hybrid PhD defense

posted Oct 16, 2020, 1:07 AM by Marco Spruit   [ updated Oct 16, 2020, 1:09 AM ]

On Wednesday 14 October, Wienand Omta successfully defended his dissertation Knowledge Discovery in High Content Screening in Corona-proof hybrid style in the Academiegebouw. I was his co-promotor. His work concerns Big Data Analytics within the domain of High Content Screening (HCS), a technology that allows life scientists to analyze the effect of bioactive molecules on cellular phenotypes. It is perhaps now more important than ever before, as HCS technology is widely used in drug discovery projects, academia and the pharmaceutical industry, for example to search for a potential COVID-19 vaccine. Not only does Wienand's dissertation include various impact journal publications, such as the one on Combining Supervised and Unsupervised Machine Learning Methods for Phenotypic Functional Genomics Screening; ever since 2012 he has also worked on the HC StratoMineR platform, for which his spin-off company Core Life Analytics recently secured a 1 M EUR Series A investment. The future is bright!

TALKS: Self-Service Data Science @HEALTHINF 2020

posted Feb 29, 2020, 6:50 AM by Marco Spruit

The 13th International Health Informatics (HealthInf 2020) conference took place in Valletta and started, interestingly, with a 90-minute panel. The topic was the undeniable gap between research and development and, worse yet, between development and operation: this is the "long mile" between research and medical practice that keeps our best solutions from also becoming best practices and from achieving lasting impact at the point of care and on patients' illness trajectories and outcomes. "Has the time come to move from the technical and embrace a more socio-technical, holistic approach?"

Of the keynote speakers in the panel, Helena Canhão introduced her Patient Innovation project, which focuses on patient entrepreneurship and has already collected 1000 innovations. However, many of them have not yet passed the regulatory procedures that ensure patient safety. Roy Huddle specialises in visual analytics, which helps explain how AI works (XAI) and can be considered a key tool to develop Trust in combination with open data, open AI models, and external validation. Silvana Quaglini highlighted the role of medical professionals' attitudes and the need to educate the next generations of healthcare professionals to increase understanding of, and thus Trust in, decision support systems and AI technologies. Finally, Federico Cabitza explained the gap between research and practice in more depth, citing some interesting works with titles such as "The Last Mile: Where Artificial Intelligence Meets Reality", "Artificial Intelligence in Health Care: Will the Value Match the Hype?", and "The proof of the pudding: in praise of a culture of real-world validation for medical artificial intelligence". Unfortunately, at least within the regular programme, there were hardly any actual presentations on this key topic, once again illustrating the urgency of this viewpoint...

On a personal note, I presented our poster on Self-Service Data Science for Healthcare Professionals, which addresses this gap between research and practice by supporting the physicians in doing the data analysis themselves, as much as possible, capitalising upon the idea of "Trust Through Empowerment".

VACANCY: PhD position in Personalised Cybersecurity Risk Measurement

posted Feb 20, 2020, 4:32 AM by Marco Spruit   [ updated Mar 2, 2020, 2:18 PM ]

Full-time PhD student or postdoc position in Personalised Cybersecurity Risk Measurement at Utrecht University

Job description

The Applied Data Science lab at Utrecht University (UU) seeks to appoint a full-time and fully funded PhD student for the 4.8M EUR Horizon2020 EU project “GEIGER: The Geiger Cybersecurity Counter” on the topic “Digital Security and privacy for citizens and Small and Medium Enterprises and Micro Enterprises” (SU-DS03-2019-2020). This project builds in part upon the achievements of the SMESEC Horizon2020 EU project.
NB: We also invite qualified postdoc researchers in cybersecurity and data science to apply for a 2.5-year appointment.

The GEIGER project consists of 19 partners who will collaboratively develop an innovative solution with associated components and an Education Ecosystem addressing security, privacy and data protection risks of and for Small and Medium-sized Enterprises and Micro-enterprises (SMEs & MEs) in Europe. GEIGER will be developed by analogy with a Geiger counter for detecting atomic radiation that threatens human life. The GEIGER solution will be used for assessing, monitoring, and forecasting risks, and for reducing these risks by improving the SMEs’ & MEs’ security with well-curated tools and an education program targeting practitioners-in-practice as “Certified Security Defenders”, bringing security expertise sustainably to SMEs & MEs using existing vocational education frameworks.

At its core, GEIGER consists of a GEIGER Indicator that dynamically summarizes the current level of risk by evaluating measures undertaken for security defences among the participating SMEs & MEs. The GEIGER Indicator can be personalised by registering the enterprise’s profile and supports GDPR-compliant sharing and exchanging of data about incidents. The GEIGER Toolbox allows stepwise do-it-yourself assessment and improvement of the SMEs’ & MEs’ security, privacy, and data protection with lightweight controls and advice for improved protection at varied levels of sophistication. The included tools offer endpoint, server, and network protection and guide the SME or ME in a personalised manner in data hygiene, including access and security control, data privacy management, and backup practices.

The GEIGER Education Ecosystem offers experimental-based training and cyber range-enabled challenges and will be integrated into curricula of diverse professions of non-ICT experts, offering direct impact on SMEs & MEs through target group-oriented education. The GEIGER solution will be demonstrated in three complementary use cases within three countries. GEIGER will achieve sustainable impact by raising awareness of more than one million SMEs & MEs within a period of 2.5 years after its start.

The PhD student’s main tasks are to design and develop the personalised GEIGER Indicator, to co-develop a Cybersecurity knowledge graph relating all knowledge within security standards, and to lead the evaluation of the GEIGER solution in the three extensive use cases.

Qualifications


  1. You have (before 1 June 2020) an MSc in Data Science, Computer Science, Information Science, Information Security, Cybersecurity, Artificial Intelligence, Computational Linguistics, or another relevant area.
  2. You have excellent programming skills (at least in Python).
  3. You also have good English language skills in both academic writing and presentation.
  4. You are a team player, comfortable with working in a complex project involving multidisciplinary colleagues in different research groups.
  5. (For postdocs only:) You have a track record of publications in impact journals.
In a nutshell, you are ideally proficient in and enthusiastic about:
  • Data Science and Natural Language Processing -- from personalised metric design to component implementation utilising, integrating and benchmarking various data and knowledge sources;
  • Cybersecurity -- assessing practices and standardisation processes; and
  • Pursuing European societal impact -- educating domain professionals in cybersecurity such as SME apprentices, accountants, and start-ups.

Additional information

For more information, please contact Dr. Marco Spruit, associate professor Applied Data Science (UU), m.r.spruit AT uu DOT nl.

Please note that the GEIGER project and, therefore, this position will commence on June 1, 2020. You need not apply if you are still unavailable after this fixed start date. However, if you think you qualify and are interested, or if you think you know someone who may be qualified and interested in pursuing a PhD, please contact us for more information.

Applications should address each of the criteria mentioned under qualifications, and include the following documents:
  • cover letter;
  • curriculum vitae;
  • copy of a recent publication;
  • copy of relevant (PhD/MSc/MA) diplomas and grades.
Please do not submit your application by email but use the official UU application link instead.

The application deadline is Thursday 28 March 2020.

SMESEC: Cybersecurity Standardisation 2020

posted Feb 4, 2020, 7:43 AM by Marco Spruit   [ updated Feb 5, 2020, 5:28 AM ]

On February 3, 2020, the annual Cybersecurity Standardisation 2020 Conference, organised by ENISA, ETSI, CEN and CENELEC, was held in Brussels with around 400 participants: “Cybersecurity Standardization and the EU Cybersecurity Act - What's Up?”. Since UU is leading the standardisation task in the SMESEC Horizon2020 project, we were interested in finding out exactly “what’s up”. The EU’s strategic goal is pretty clear: to arrive at some sort of Energy Label Certification scheme, where all software products are accompanied by a simple assessment score like A+. The question, btw, is not whether this should happen at all, but whether it will already have materialised before 2023.

The three main panels during this day were about (1) The role of standardisation to support the certification framework, (2) Achievements in cybersecurity standardisation and the rolling plan of standardisation bodies, and (3) EU certification scheme – difficulties and success stories in relation to standards, and the road ahead. I was particularly interested in the second one, about achievements in cybersecurity standardisation, so here’s a more elaborate account of it. The panel included Alex Leadbeater (AL) from ETSI TC Cyber, Jean-Pierre Quemard (JPQ) from CEN/CENELEC JTC13, Marcus Pritsch (MP) as the consumer voice, Emilio Gonzalez (EG) representing the European Commission, and Roberto Cascella (RC) from ECSO.

As if the panel members had jointly prepared the session, there was considerable consensus throughout. AL mentioned 5G and IoT’s consumer security standard as the major achievements of last year, which was confirmed by MP as well. However, he noted that there is still a gap in (re)using existing standards instead of reinventing the wheel all the time. Also, certification needs to apply to both SMEs and big infrastructures. MP mentioned the need for a unified standard for IoT, a holistic one which integrates both the cloud connectivity and local device aspects. JPQ repeatedly confirmed the mantra of not wanting to reinvent the wheel in developing standards, and added the desire for a smaller scope of standards and the need to develop horizontal standards. Peer review could be better employed as a quality tool and, in order to take off, we need to train organisations to become more aware of the cybersecurity dimension. These objectives require a lightweight certification scheme: a security quickscan, perhaps?

#cyberactstdconf2020 #smesec #uu on cybersecurity standardisation achievements, today in Brussels

— Marco Spruit (@marcospruit) February 3, 2020
EG mentioned the Rolling Plan for Cybersecurity, pointing out that standardisation is a bottom-up process, and stressed the importance of promoting collaboration. Standardisation is a strategic tool! RC focused on the importance of finding out what the market needs, and noted that the ECSO state-of-the-art syllabus is openly available to... avoid everyone reinventing the wheel. In addition, after pointing out the importance of understanding the priorities with respect to standards, he interestingly mentioned that we are shifting from meta-schemes to security assessments. The thing is, this is exactly the conceptual foundation of the SMESEC project efforts... Work is still in progress but milestones have already been reported in our Journal of Intellectual Capital paper titled “Modelling adaptive information security for SMEs in a cluster” and our 2019 conference paper on “A Questionnaire Model for Cybersecurity Maturity Assessment for Critical Infrastructures”. We believe that an open reference model for personalisable maturity assessments of SMEs should be made available to EU organisations to address this cybersecurity challenge, but more on that later.

RC added that in general we simply lack the skills in cybersecurity, including in management. Nevertheless, certification should increase trust in the market. We need to be better able to mitigate risks. At the policy level this is being stimulated through ENISA/SDO collaboration on certification schemes. The concept of privacy-by-design is important here. Furthermore, it was reassuring to hear JPQ explicitly invite everyone to come and contribute to certification efforts. As UU and as a part of SMESEC, we intend to do just that in the coming months. AL then had some final words, stating and demonstrating that “Consumers don’t buy security”, so how can we as the EU realise our vision of a Cybersecurity Certification Scheme analogous to the highly successful Energy Efficiency Scheme? Learning from the past, he retold the car theft problem in the UK in the 1960s, which was basically turned around through Naming And Shaming to nudge the general public into changing their buying behavior. The recurring mention of the Energy Efficiency metric made me feel quite happy, as it seems to imply that there might be some real interest in our newly funded Horizon2020 Research & Innovation project, in which UU will, among other things, develop a simple yet personalised metric for cybersecurity, much like the Energy Efficiency metric! (More on this new project soon.)

A final question from the audience was what to do in the case of new technologies, for which there are by definition no specific standards available yet, e.g. AI products. Currently, the AI-specific standards are seemingly written by people outside this inner circle, which results in different terminology and, in turn, many interoperability issues. Luckily, the reply was quite unambiguous: Join working groups, contribute! And that is what UU will do as well, with our SMESEC contributions, which we believe may benefit many organisations throughout Europe.

EDU: Applied Data Science masterclass

posted Dec 14, 2019, 4:58 AM by Marco Spruit   [ updated Dec 14, 2019, 7:27 AM ]

On Friday Dec 13 I gave a one-day masterclass about Applied Data Science for Data Science trainees in the context of our Life Long Learning programme. With 4 rounds of 75 minutes, it felt not unlike a sporting activity, but there was a great and enthusiastic atmosphere and I enjoyed myself a lot as well.

My focus was on explaining the entire knowledge discovery process based on CRISP-DM, illustrating each phase with a real-life healthcare case study. The corresponding learning objectives were to recognise the knowledge discovery process in applied data science, to understand the role of data science and its societal impact, and to identify trends and developments in data science technologies.

Topics included requirements elicitation, hypothesis-free data exploration, natural language processing, automated machine learning, and dashboard design. In the closing talk about current trends and developments I touched upon all other cool technologies and developments that I hadn't mentioned yet, including deep learning, transfer learning, active learning, and deep reinforcement learning... Pretty mind-boggling and fun!

Dr. Menger: Vincent's successful defense

posted Oct 3, 2019, 2:06 AM by Marco Spruit

Yesterday, on Oct 2, Vincent Menger quite successfully defended his dissertation "Knowledge Discovery in Clinical Psychiatry: Learning from electronic health records" in the Academiegebouw. This is another solid reference work for Utrecht University's Applied Data Science focus area research, especially related to Natural Language Processing applications and foundational research as envisioned by the Special Interest Group (SIG) Text Mining. Here are some key bits from this work which investigates the following overarching research question: "How can data from Electronic Health Records provide relevant insights for psychiatric care?"

In the first three research chapters of this work, he identifies key technical, organizational and ethical challenges related to knowledge discovery in EHRs. He introduces the CRISP-IDM process, where the I stands for Interactive, as a process model for collaboration based on data visualization. He introduces the Capable Reuse of EHR Data (CARED) framework, aiming to support health care institutions to design such infrastructure. He develops and validates the De-identification Method for Dutch Medical Text (DEDUCE), which aims to automatically remove information that can identify a patient from free text.
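
To give a rough idea of what rule-based de-identification of free text can look like, here is a minimal sketch in Python. It is not the DEDUCE implementation: the patterns, the tag names and the `deidentify` function are hypothetical and chosen purely for illustration, and a real method needs far more extensive rules and validation.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the rules used by DEDUCE.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b0\d{9}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]


def deidentify(text, known_names=()):
    """Replace known names and simple pattern matches with category tags."""
    # Names known from structured record fields are replaced first.
    for name in known_names:
        text = re.sub(r"\b" + re.escape(name) + r"\b", "<PERSON>", text,
                      flags=re.IGNORECASE)
    # Then generic patterns such as dates, phone numbers and e-mail addresses.
    for tag, pattern in PATTERNS:
        text = pattern.sub("<" + tag + ">", text)
    return text


note = "Jan Jansen seen on 02-10-2019, call 0612345678."
print(deidentify(note, known_names=["Jan Jansen"]))
```

Replacing identifiers with category tags, rather than deleting them, keeps the sentence readable for later text mining.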

In the second part of this research, Vincent focuses on applying knowledge discovery techniques to EHR data to obtain new insights with potential to improve care. First he looks at violence risk assessment, using two clinical datasets to train models that can assess violence risk based on clinical text, and then performs a rigorous evaluation of their accuracy and generalizability. Finally, he turns to identifying psychiatric patient subgroups, and investigates how unsupervised learning can find robust and accurate stratifications of patients using cluster ensembles.
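
For readers unfamiliar with cluster ensembles, the general idea can be sketched in a few lines of Python. This is one common variant (a co-association matrix followed by a simple threshold-based merge), not necessarily the approach taken in the dissertation. The function names, the threshold and the toy labelings are my own assumptions.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of base clusterings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in labelings:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(labelings) for pair, c in counts.items()}


def consensus_clusters(labelings, threshold=0.5):
    """Merge items that co-occur in more than `threshold` of the clusterings."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in co_association(labelings).items():
        if score > threshold:
            parent[find(i)] = find(j)
    groups = {}
    for item in range(n):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Three toy clusterings of five patients; cluster labels are arbitrary per run.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
print(consensus_clusters(runs))
```

Because only co-membership is counted, the arbitrary label numbering of each individual run does not matter, which is what makes the combined stratification more robust than any single run.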

The two parts of this dissertation combined show that learning from EHRs, after addressing key challenges related to the nature of data, is a new and interesting approach with clear potential for improving psychiatric health care.

VACANCIES: Postdocs or PhD students in Dutch NLP

posted Jul 31, 2019, 1:07 PM by Marco Spruit   [ updated Sep 20, 2019, 1:06 PM ]

The Applied Data Science lab at Utrecht University (UU), the Information Systems group at Technical University Eindhoven (TUe) and the Psychiatry department of the University Medical Center Utrecht (UMCU) seek to appoint two full-time Postdoc researchers for the project “COVIDA: Computing Visits Data for Dutch Natural Language Processing in Mental Healthcare” led by Dr. Marco Spruit. The COVIDA project kickstarts the interuniversity and interdisciplinary COVIDA research group by furthering the state-of-the-art in Natural Language Processing technologies for Dutch to improve daily practices in Mental Healthcare.

COVIDA’s scientific objective comprises the development of a hybrid Dutch language model to better understand human language in general, and Dutch Mental Healthcare language use in particular. We operate within the Design Science Research paradigm to model our computational experiment findings from both Computational Linguistics (i.e. knowledge-based) and Machine Learning (i.e. data-driven) inspired representations. Our societal contribution consists of a publicly available self-service facility for Natural Language Processing (NLP) of already routinely collected Dutch medical texts. Thus, COVIDA aims to deliver a game-changing innovation in the daily practices of Dutch mental healthcare institutions by enabling healthcare professionals throughout the Dutch language area to reuse the daily clinical notes that nurses and doctors record in patients’ EHRs to predict inpatient violence risk, depression, and more.

Your core research task is to design, implement and evaluate NLP pipelines for Dutch clinical texts which utilise linguistic and domain knowledge and structured data in a privacy-by-design architecture from both Deep/Transfer Learning and symbolic NLP perspectives.

TALKS: Invited @ DisCo 2019

posted Jun 6, 2019, 2:42 AM by Marco Spruit   [ updated Jun 23, 2019, 5:35 AM ]

Data Science & Society
I gave a keynote at the DisCo 2019 conference on E-learning – Unlocking the Gate of Education around the Globe. In this talk I introduced the focus area of Applied Data Science at Utrecht University and how the ADS Profile for master students can help empower them. As a running example I discussed the obligatory ADS Profile course Data Science & Society, which focuses on Knowledge Discovery with Big Data using Cloud Computing technologies. In addition, I chaired the opening DisCo session and participated in a discussion panel on the Future of Education.

It was a lot of fun doing it.

Shaheen Syed: From MSc to Dr

posted Mar 21, 2019, 5:43 AM by Marco Spruit   [ updated Dec 6, 2019, 8:08 AM ]

Yesterday was a great day: Shaheen Syed successfully defended his dissertation titled Topic Discovery from Textual Data: Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain in the Academiegebouw. In my opinion, this is a solid reference work for Utrecht University's Applied Data Science focus area research, especially related to Natural Language Processing applications and foundational research as envisioned by the Special Interest Group (SIG) Text Mining. Here are some key bits from this work.

The main research question in this thesis is: How can we improve the knowledge discovery process from textual data through latent topical perspectives? The first three chapters of this thesis seek to understand how different types of textual data, pre-processing steps, and hyper-parameter settings of probabilistic topic models affect the quality of the derived latent topics. The remaining three chapters are aimed at the interpretation of the latent topics and how such (raw) latent topics can be turned into useful (fisheries) domain knowledge. Throughout this thesis, and within each chapter, specific phases of the KDD process are covered. Combined, they provide guidelines on how to optimize the knowledge discovery process with the aim to understand the latent topical content of scientific publications better.

This work was funded by Horizon2020 Marie Skłodowska-Curie – ITN - ETN grant: SAF21.

Health Round Table @ICT.OPEN

posted Mar 21, 2019, 5:22 AM by Marco Spruit   [ updated Mar 21, 2019, 5:23 AM ]

On Tuesday 19 March I attended ICT.OPEN2019, the Dutch conference for ICT research, and participated in the Round Table Session on Health. The purpose of the discussion was to jointly reflect with scientific researchers and other stakeholders (e.g. SMEs) on the Dutch digitization strategy. The topic of discussion was IT research for social and economic challenges. All participants prepared a Canvas as input for a 2-minute pitch and for the discussions. Here was my contribution:
