Here's a blog highlighting some of the more memorable events of my daily routine. Events include accepted or rejected papers (ACCEPT/REJECT), master's thesis defenses by students I supervised (MBI), research presentations of papers I (co-)authored (TALK), grant awards and rejections (ACCEPT/REJECT), and important research interest statements, among others.

VACANCY: Assistant Professor Data Science in Population Health (Tenure Track) in Leiden

posted Dec 27, 2020, 12:35 PM by Marco Spruit   [ updated Dec 30, 2020, 1:25 PM ]

What you do

This unique tenure track position offers the best of both worlds: 50% of your work will be performed from the Campus The Hague of the LUMC, and the other 50% from the Leiden Institute of Advanced Computer Science (LIACS) within the Faculty of Science of Leiden University. This means that you will be a strategic linking pin in various collaborations at the junction of data science and natural language processing in the broad area of population health. This position is embedded within the recently launched Population Health Living Lab (PHLL) The Hague, which allows you to contribute to a sustainable and robust realization of the most extensive population dataset within the Netherlands, and consequently to perform novel multidisciplinary data analyses. As assistant professor, you are expected to contribute to at least one of our overarching research themes on our Translational Data Science research agenda. Regarding teaching, you are expected to contribute around 50% of your appointment to LUMC’s Population Health Management (PHM) master’s program and LIACS’s curricula, which includes co-developing, co-teaching and coordinating the data science courses and the track itself, as well as thesis supervision.


  • You position yourself as an interorganizational linking pin in the Medical Delta ecosystem at the junction of Data Science initiatives in the broad area of Population Health
  • You contribute to the further development of the Population Health Living Lab (PHLL) ecosystem with respect to research related to data engineering and translational data science
  • You contribute to the Population Health Management master’s program by co-developing, co-teaching and coordinating data science courses as well as the track itself

What we ask

You’re an expert in either the research theme of Data Engineering/Information Science or (Big) Data Analytics/Machine Learning, and knowledgeable in the other. In addition, you are an expert in applying statistical methods and machine learning techniques to real data. You are conscientious and creative, and you have experience at the postdoctoral level with a strong publication record and a proven track record in teaching. Furthermore, you are experienced in raising research funds. You are passionate about investigating and utilizing data science technologies, focusing on state-of-the-art application-oriented research in Explainable AI, AutoML, big sensor/wearables data, speech recognition, natural language processing, affective computing, etc. You are skilled in Python development, using libraries such as scikit-learn, HuggingFace, PySyft, and Streamlit. Lastly, you are a skilled communicator and you work well collaboratively.

More information?

Hello Leiden!

posted Dec 1, 2020, 6:45 AM by Marco Spruit   [ updated Jan 14, 2021, 11:52 AM ]

Today is my first day as Professor of Advanced Data Science in Population Health at the Public Health & Primary Care (PHEG) department of the Leiden University Medical Centre (LUMC) and the Leiden Institute of Advanced Computer Science (LIACS) of the Faculty of Science (W&N)! Apart from being a great milestone in itself, here is my TOP-3 of Unique Selling Points why I am particularly excited:
  1. It is a formal DUAL APPOINTMENT, meaning that I am appointed at both LUMC and LIACS. This makes me the official linking pin for the many upcoming collaborations at the junction of data science and natural language processing in healthcare.
  2. In Leiden, my new colleagues have over the past years developed the LARGEST POPULATION DATASET in the Netherlands, with access to anonymised health records of 500K+ patients, using the Central Bureau of Statistics (CBS) as its Trusted Third Party. Pure gold!
  3. My primary affiliation is within a MULTIDISCIPLINARY setting on the campus The Hague: the Population Health Living Lab (PHLL). This is a so-called QUADRUPLE HELIX fieldlab, where Academia, Industry, Citizens, and Government all collaborate.
I'd like to thank everyone at Utrecht University for the many inspiring informal encounters, personal development programmes and research collaborations that I have had with many of you throughout these... 13 years. I have learned a bunch and it was a lot of fun!

But from now on, it is... Hello Leiden!

PS: I find it truly amazing to discover that my announcement on LinkedIn has been read over 13,000 times already after just one week!

Dr. Tawfik: Text Mining for Precision Medicine

posted Nov 25, 2020, 6:37 AM by Marco Spruit   [ updated Nov 25, 2020, 6:38 AM ]

Yesterday Noha Tawfik defended her dissertation Text Mining for Precision Medicine: Natural Language Processing, Machine Learning and Information Extraction for Knowledge Discovery in the Health Domain. In extreme COVID19 style, there were merely 8 people --including audience-- in the Senate Hall of the UU Academiegebouw. Nevertheless, Noha defended her PhD research admirably, with both competence and passion!

In Noha's first research phase, she mainly employed Information Extraction to automate the identification and analysis of Genome-Wide Association Studies for a particular disease, investigating the relation between different phenotypic traits and Single Nucleotide Polymorphisms known to be associated with that disease. In the second research phase, Noha expanded upon this work by applying Machine Learning algorithms to the problem of detecting contradictions between two statements extracted from abstracts of published articles, interpreting contradictory findings as likely Precision Medicine findings. In the third and final phase of her research, Noha refined her contradiction detection research in conformance with Natural Language Inference (NLI) best practices, and participated in the 2019 ACL "Medical Natural Language Inference" challenge, where she successfully competed against entire teams from various top universities.
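
Noha's final-phase task format can be sketched as a premise/hypothesis pair labelling problem. The toy heuristic below is purely my own illustration (her actual systems were trained models, not rules): it flags a pair as a contradiction when both statements share content words but differ in negation.

```python
# Toy illustration of sentence-pair labelling in the NLI style: a pair is a
# CONTRADICTION when the statements discuss the same content with opposite
# polarity. Real NLI systems learn far subtler cues; this is only a sketch.

NEGATION_CUES = {"not", "no", "never", "cannot"}

def tokenize(text: str) -> list[str]:
    return text.lower().replace(".", "").split()

def naive_nli_label(premise: str, hypothesis: str) -> str:
    p, h = tokenize(premise), tokenize(hypothesis)
    p_neg = any(t in NEGATION_CUES for t in p)
    h_neg = any(t in NEGATION_CUES for t in h)
    content_overlap = len((set(p) - NEGATION_CUES) & (set(h) - NEGATION_CUES))
    if content_overlap == 0:
        return "NEUTRAL"          # unrelated statements
    if p_neg != h_neg:
        return "CONTRADICTION"    # same topic, opposite polarity
    return "ENTAILMENT"           # same topic, same polarity (crude)

print(naive_nli_label("The variant is associated with diabetes.",
                      "The variant is not associated with diabetes."))
# → CONTRADICTION
```

Real medical NLI involves much subtler contradiction cues than negation words, which is precisely what makes the challenge such a worthwhile benchmark.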

All in all a truly excellent achievement in 4 years' time, with no less than 7 peer-reviewed publications!

Dr. Omta: a hybrid PhD defense

posted Oct 16, 2020, 1:07 AM by Marco Spruit   [ updated Oct 16, 2020, 1:09 AM ]

On Wednesday 14 October, Wienand Omta successfully defended his dissertation Knowledge Discovery in High Content Screening in Corona-proof hybrid style in the Academiegebouw, with me as his co-promotor. His work in Big Data Analytics within the domain of High Content Screening (HCS), a technology that allows life scientists to analyze the effect of bioactive molecules on cellular phenotypes, is perhaps now more important than ever, as HCS technology is widely used in drug discovery projects, in academia and in the pharmaceutical industry, for example in the search for a potential COVID19 vaccine. Wienand's dissertation not only includes various impact journal publications, such as the one on Combining Supervised and Unsupervised Machine Learning Methods for Phenotypic Functional Genomics Screening; ever since 2012 he has also worked on the HC StratoMineR platform, for which his spin-off company Core Life Analytics recently secured a 1 M EUR Series A investment. The future is bright!

TALKS: Self-Service Data Science @HEALTHINF 2020

posted Feb 29, 2020, 6:50 AM by Marco Spruit

The 13th International Conference on Health Informatics (HEALTHINF 2020) took place in Valletta, and started – interestingly – with a 90-minute panel. The topic was the undeniable gap between research and development, and, worse yet, between development and operation: this is the "long mile" between research and medical practice that separates our best solutions from also becoming best practices and from achieving lasting impact at the point-of-care, and on patients' illness trajectories and outcomes. “Has the time come to move from the technical and embrace a more socio-technical, holistic approach?”

Of the keynote speakers in the panel, Helena Canhão introduced her Patient Innovation project, which focuses on patient entrepreneurship and has already collected 1000 innovations; however, many of these have not yet passed the regulatory procedures that ensure patient safety. Roy Ruddle specialises in visual analytics, which helps explain how AI works (XAI) and can be considered a key tool for developing Trust, in combination with using open data, open AI models, and external validation. Silvana Quaglini highlighted the role of the attitude of medical professionals and the need to educate the next generations of healthcare professionals to increase understanding of, and thus Trust in, decision support systems and AI technologies. Finally, Federico Cabitza explained the gap between research and practice in more depth, citing some interesting works as well, with titles such as "The Last Mile: Where Artificial Intelligence Meets Reality", "Artificial Intelligence in Health Care: Will the Value Match the Hype?", and "The proof of the pudding: in praise of a culture of real-world validation for medical artificial intelligence". Unfortunately, at least within the regular programme, there were hardly any actual presentations on this key topic, once again illustrating the urgency of this viewpoint...

On a personal note, I presented our poster on Self-Service Data Science for Healthcare Professionals, which addresses this gap between research and practice by supporting the physicians in doing the data analysis themselves, as much as possible, capitalising upon the idea of "Trust Through Empowerment".

VACANCY: PhD position in Personalised Cybersecurity Risk Measurement

posted Feb 20, 2020, 4:32 AM by Marco Spruit   [ updated Mar 2, 2020, 2:18 PM ]

Full-time PhD student or postdoc position in Personalised Cybersecurity Risk Measurement at Utrecht University

Job description

The Applied Data Science lab at Utrecht University (UU) seeks to appoint a full-time and fully funded PhD student for the 4.8M EUR Horizon2020 EU project “GEIGER: The Geiger Cybersecurity Counter” on the topic “Digital Security and privacy for citizens and Small and Medium Enterprises and Micro Enterprises” (SU-DS03-2019-2020). This project builds in part upon the achievements of the SMESEC Horizon2020 EU project.
NB: We also invite qualified postdoc researchers in cybersecurity and data science to apply for a 2.5 years appointment.

The GEIGER project consists of 19 partners who will collaboratively develop an innovative solution with associated components and an Education Ecosystem addressing security, privacy and data protection risks of and for Small and Medium-sized Enterprises and Micro-enterprises (SMEs & MEs) in Europe. GEIGER will be developed in analogy to a Geiger counter for detecting atomic radiation threatening human life. The GEIGER solution will be used for assessing, monitoring, and forecasting risks, and for reducing these risks by improving the SMEs’ & MEs’ security with well-curated tools and an education program that targets practitioners-in-practice as “Certified Security Defenders”, bringing security expertise sustainably to SMEs & MEs through existing vocational education frameworks.

At its core GEIGER consists of a GEIGER Indicator that dynamically summarizes the current level of risk by evaluating the measures undertaken for security defences among the participating SMEs & MEs. The GEIGER Indicator can be personalised by registering the enterprise’s profile, and supports GDPR-compliant sharing and exchanging of data about incidents. The GEIGER Toolbox allows stepwise do-it-yourself assessment and improvement of the SMEs’ & MEs’ security, privacy, and data protection with lightweight controls and advice for improved protection at varied levels of sophistication. The included tools offer endpoint, server, and network protection and guide the SMEs & MEs in a personalised manner in data hygiene, including access and security control, data privacy management, and backup practices.
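
To make the idea of a personalised indicator concrete, here is a deliberately simplified sketch: threat severities are weighted by an enterprise profile, and the weighted residual risk is mapped to an energy-label-style grade. All threat names, weights, and grade boundaries are invented for illustration; this is not the actual GEIGER design.

```python
# Hypothetical sketch of a personalised risk indicator: base severities per
# threat, profile-dependent weights, and a mapping from the weighted residual
# risk to a simple grade. All numbers below are illustrative assumptions.

BASE_SEVERITY = {"phishing": 0.6, "ransomware": 0.9, "data_leak": 0.7}

PROFILE_WEIGHTS = {
    # a web shop handles customer data, so data leaks weigh heavier
    "webshop": {"phishing": 1.0, "ransomware": 1.0, "data_leak": 1.5},
    # a workshop with little online exposure
    "workshop": {"phishing": 0.8, "ransomware": 1.2, "data_leak": 0.5},
}

def risk_indicator(profile: str, mitigated: set[str]) -> float:
    """Weighted residual risk: mitigated threats no longer contribute."""
    weights = PROFILE_WEIGHTS[profile]
    return sum(sev * weights[t]
               for t, sev in BASE_SEVERITY.items()
               if t not in mitigated)

def grade(score: float) -> str:
    """Map the residual risk score to an energy-label-style grade."""
    for bound, label in [(0.5, "A"), (1.2, "B"), (2.0, "C")]:
        if score <= bound:
            return label
    return "D"

print(grade(risk_indicator("webshop", {"ransomware"})))
```

The point of the sketch is the shape of the computation: the same mitigation earns a different grade depending on the registered enterprise profile.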

The GEIGER Education Ecosystem offers experience-based training and cyber range-enabled challenges, and will be integrated into the curricula of diverse professions of non-ICT experts, offering direct impact on SMEs & MEs through target group-oriented education. The GEIGER solution will be demonstrated in three complementary use cases within three countries. GEIGER will achieve sustainable impact by raising awareness among more than one million SMEs & MEs within 2.5 years after its start.

The PhD student’s main tasks are to design and develop the personalised GEIGER Indicator, to co-develop a Cybersecurity knowledge graph relating all knowledge within security standards, and to lead the evaluation of the GEIGER solution in the three extensive use cases.


  1. The candidate should have (before 1 June 2020) an MSc in Data Science, Computer Science, Information Science, Information Security, Cybersecurity, Artificial Intelligence, Computational Linguistics, or another relevant area.
  2. You have excellent programming skills (at least in Python).
  3. You also have good English language skills in both academic writing and presentation.
  4. You are a team player, comfortable with working in a complex project involving multidisciplinary colleagues in different research groups.
  5. (For postdocs only:) You have a track record of publications in impact journals.
In a nutshell, you are ideally proficient in and enthusiastic about:
  • Data Science and Natural Language Processing -- from personalised metric design to component implementation utilising, integrating and benchmarking various data and knowledge sources;
  • Cybersecurity -- assessing practices and standardisation processes; and
  • Pursuing European societal impact -- educating domain professionals in cybersecurity such as SME apprentices, accountants, and start-ups.

Additional information

For more information, please contact Dr. Marco Spruit, associate professor Applied Data Science (UU), m.r.spruit AT uu DOT nl.

Please note that the GEIGER project and, therefore, this position will commence on June 1, 2020. Please do not apply if you will not be available by this fixed start date. However, if you think you qualify and are interested, or if you know someone who may be qualified and interested in pursuing a PhD, please contact us for more information.

Applications should address each of the criteria mentioned under qualifications, and include the following documents:
  • cover letter;
  • curriculum vitae;
  • copy of a recent publication;
  • copy of relevant (PhD/MSc/MA) diplomas and grades.
Please do not submit your application by email but use the official UU application link instead.

The application deadline is Thursday 28 March 2020.

SMESEC: Cybersecurity Standardisation 2020

posted Feb 4, 2020, 7:43 AM by Marco Spruit   [ updated Feb 5, 2020, 5:28 AM ]

On February 3 2020 in Brussels, the annual Cybersecurity Standardisation 2020 Conference, organised by ENISA, ETSI and CEN-CENELEC, was held with around 400 participants: “Cybersecurity Standardization and the EU Cybersecurity Act - What's Up?”. Since UU is leading the standardisation task in the SMESEC Horizon2020 project, we are interested in finding out exactly “what’s up”. The EU’s strategic goal is pretty clear: to arrive at some sort of Energy Label Certification scheme, where all software products are accompanied by a simple assessment score like A+. The question, btw, is whether this will have materialised before 2023, not whether it should happen at all.

The three main panels during this day were about (1) The role of standardisation to support the certification framework, (2) Achievements in cybersecurity standardisation and the rolling plan of standardisation bodies, and (3) EU certification scheme – difficulties and success stories in relation to standards, and the road ahead. I was particularly interested in the second one, about achievements in cybersecurity standardisation, so here’s a more elaborate account on that. The panel included Alex Leadbeater (AL) from ETSI TC Cyber, Jean-Pierre Quemard (JPQ) from CEN/CENELEC JTC13, Marcus Pritsch (MP) as the consumer voice, Emilio Gonzalez (EG) representing the EU commission, and Roberto Cascella (RC) from ECSO.

As if the panel members had jointly prepared the session, there was considerable consensus throughout. AL mentioned 5G and IoT’s consumer security standard as the major achievements of last year, which was confirmed by MP as well. However, he noted that there is still a gap in (re)using existing standards instead of reinventing the wheel all the time. Also, certification needs to apply to both SMEs and big infrastructures. MP mentioned the need for a unified standard for IoT, a holistic one which integrates both the cloud connectivity and local device aspects. JPQ repeatedly confirmed the mantra of not wanting to reinvent the wheel in developing standards, and added the desire for a smaller scope of standards and the need to develop horizontal standards. Peer review could be better employed as a quality tool, and in order to take off, we need to train organisations to become better aware of the cybersecurity dimension. These objectives require a lightweight certification scheme: a security quickscan, perhaps?

#cyberactstdconf2020 #smesec #uu on cybersecurity standardisation achievements, today in Brussels

— Marco Spruit (@marcospruit) February 3, 2020
EG mentioned the Rolling Plan for Cybersecurity, pointing out that standardisation is a bottom-up process, and the importance of promoting collaboration. Standardisation is a strategic tool! RC focused on the importance of finding out what the market needs, and that the ECSO state-of-the-art syllabus is openly available to... avoid everyone reinventing the wheel. In addition, he interestingly mentioned that, after pointing out the importance of understanding the priorities wrt standards, we are shifting from meta-schemes to security assessments. The thing is, this is exactly the conceptual foundation of the SMESEC project efforts... Work is still in progress but milestones have already been reported in our Journal of Intellectual Capital paper titled “Modelling adaptive information security for SMEs in a cluster” and our 2019 conference paper on “A Questionnaire Model for Cybersecurity Maturity Assessment for Critical Infrastructures”. We believe that an open reference model for personalisable maturity assessments of SMEs should be made available to EU organisations to address this cybersecurity challenge, but more on that later.

RC added that in general we simply lack the skills in cybersecurity, including in management. Nevertheless, the goal of certification would be to increase trust in the market. We need to be better able to mitigate risks. At the policy level this is being stimulated through ENISA/SDO collaboration wrt certification schemes. The concept of privacy-by-design is important here. Furthermore, it was reassuring to hear JPQ explicitly invite all to come, be welcome and contribute to certification efforts. As UU and as a part of SMESEC, we intend to do just that in the coming months. AL then had some final words, stating and demonstrating that “Consumers don’t buy security”, so how can we as the EU realise our vision of a Cybersecurity Certification Scheme analogous to the highly successful Energy Efficiency Scheme? Learning from the past, he retold the story of the car theft problem in the UK in the 1960s, which was basically turned around through Naming And Shaming to nudge the general public into changing their buying behavior. The recurring mention of the Energy Efficiency metric made me feel quite happy, as it seems to imply that there might be some real interest in our newly funded Horizon2020 Research & Innovation project, for which UU will develop a simple yet personalised metric for cybersecurity, much like the Energy Efficiency metric, among others! (More on this new project soon)

A final question from the audience asked what to do in the case of new technologies, when there are by definition no specific standards available yet, e.g. with AI products. What to do then? Currently, the AI-specific standards are seemingly written by people outside this inner circle, which results in different terminology being used and creates many interoperability issues. Luckily, the reply was quite unambiguous: join working groups, contribute! And that is what UU will do as well, with our SMESEC contributions, which we believe may benefit many organisations throughout Europe.

EDU: Applied Data Science masterclass

posted Dec 14, 2019, 4:58 AM by Marco Spruit   [ updated Dec 14, 2019, 7:27 AM ]

On Friday Dec 13 I gave a one-day masterclass about Applied Data Science for Data Science trainees in the context of our Life Long Learning programme. With 4 rounds of 75 minutes, it felt not unlike a sporting activity, but there was a great and enthusiastic atmosphere and I enjoyed myself a lot as well.

My focus was on explaining the entire knowledge discovery process based on CRISP-DM, and by illustrating each phase with a real-life healthcare case study. The corresponding learning objectives were to recognise the knowledge discovery process in applied data science, to understand the role of data science and its societal impact, and to identify trends and developments in data science technologies.

Topics included requirements elicitation, hypothesis-free data exploration, natural language processing, automated machine learning, and dashboard design. In the closing talk about current trends and developments I touched upon all other cool technologies and developments that I hadn't mentioned yet, including deep learning, transfer learning, active learning, and deep reinforcement learning... Pretty mind-boggling and fun!
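
For readers unfamiliar with CRISP-DM, the cycle the masterclass was structured around can be sketched in a few lines; the phases below are the standard CRISP-DM ones, with the iteration logic simplified for illustration (in the masterclass, each phase was illustrated with a healthcare case study instead).

```python
# The CRISP-DM knowledge discovery cycle, sketched as data plus a loop.
# In practice one iterates the first five phases until the evaluation is
# satisfactory, and only then deploys; two iterations are hard-coded here.

CRISP_DM_PHASES = [
    "business understanding",
    "data understanding",
    "data preparation",
    "modeling",
    "evaluation",
    "deployment",
]

def run_cycle(max_iterations: int = 2) -> list[str]:
    """Walk the phases, deferring deployment until after the last evaluation."""
    log = []
    for i in range(max_iterations):
        for phase in CRISP_DM_PHASES[:-1]:
            log.append(f"iteration {i + 1}: {phase}")
    log.append(f"iteration {max_iterations}: {CRISP_DM_PHASES[-1]}")
    return log

print(len(run_cycle()))  # 2 iterations x 5 phases + deployment = 11 steps
```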

Dr. Menger: Vincent's successful defense

posted Oct 3, 2019, 2:06 AM by Marco Spruit

Yesterday, on Oct 2, Vincent Menger quite successfully defended his dissertation "Knowledge Discovery in Clinical Psychiatry: Learning from electronic health records" in the Academiegebouw. This is another solid reference work for Utrecht University's Applied Data Science focus area research, especially related to Natural Language Processing applications and foundational research as envisioned by the Special Interest Group (SIG) Text Mining. Here are some key bits from this work which investigates the following overarching research question: "How can data from Electronic Health Records provide relevant insights for psychiatric care?"

In the first three research chapters of this work, he identifies key technical, organizational and ethical challenges related to knowledge discovery in EHRs. He introduces the CRISP-IDM process, where the I stands for Interactive, as a process model for collaboration based on data visualization. He introduces the Capable Reuse of EHR Data (CARED) framework, which aims to support health care institutions in designing such a data reuse infrastructure. He also develops and validates the De-identification Method for Dutch Medical Text (DEDUCE), which aims to automatically remove information that can identify a patient from free text.
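
To give a flavour of what rule-based de-identification looks like, here is a minimal sketch in the spirit of DEDUCE (explicitly not the actual DEDUCE implementation, whose pattern set is far more extensive): dates, phone numbers, and titled names are replaced by category tags. The patterns and the example note are illustrative only.

```python
import re

# Minimal rule-based de-identification sketch: each pattern maps a category
# of identifying information in free text to a placeholder tag.
PATTERNS = [
    (re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b"), "<DATE>"),             # d-m-yyyy dates
    (re.compile(r"\b0\d{9}\b"), "<PHONE>"),                           # 10-digit numbers
    (re.compile(r"\b(?:Dhr\.|Mevr\.|Dr\.)\s+[A-Z][a-z]+\b"), "<PERSON>"),  # titled names
]

def deidentify(text: str) -> str:
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

note = "Dr. Jansen was seen on 12-3-2020; call back on 0612345678."
print(deidentify(note))
# → <PERSON> was seen on <DATE>; call back on <PHONE>.
```

Real de-identification also needs lookup lists, context rules, and careful evaluation of recall, since any missed identifier is a privacy leak.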

In the second part of this research, Vincent focuses on applying knowledge discovery techniques to EHR data to obtain new insights with the potential to improve care. First he looks at violence risk assessment, using two clinical datasets to train models that can assess violence risk based on clinical text, and then performs a rigorous evaluation of their accuracy and generalizability. Finally, he turns to identifying psychiatric patient subgroups, and investigates how unsupervised learning can find robust and accurate stratifications of patients using cluster ensembles.
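
The cluster-ensemble idea in that last chapter can be illustrated with a toy co-association matrix: count how often each pair of patients ends up in the same cluster across several clusterings, and keep the pairs that co-occur in a majority of runs. This sketch shows the principle only, not the dissertation's actual method.

```python
from itertools import combinations

# Toy cluster-ensemble sketch: the co-association strength of a pair of
# patients is the fraction of clusterings that place them in the same cluster.
def co_association(clusterings: list[list[int]]) -> dict[tuple[int, int], float]:
    n = len(clusterings[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in clusterings:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(clusterings) for pair, c in counts.items()}

# Three clusterings of five patients (cluster labels are arbitrary per run).
runs = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
strength = co_association(runs)
stable_pairs = [p for p, s in strength.items() if s > 0.5]
print(stable_pairs)  # pairs that co-cluster in a majority of runs
```

A final consensus clustering would then be derived from this matrix; the point here is that agreement across runs, not any single run, defines the robust patient subgroups.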

The two parts of this dissertation combined show that learning from EHRs, after addressing key challenges related to the nature of data, is a new and interesting approach with clear potential for improving psychiatric health care.

VACANCIES: Postdocs or PhD students in Dutch NLP

posted Jul 31, 2019, 1:07 PM by Marco Spruit   [ updated Sep 20, 2019, 1:06 PM ]

The Applied Data Science lab at Utrecht University (UU), the Information Systems group at Technical University Eindhoven (TUe) and the Psychiatry department of the University Medical Center Utrecht (UMCU) seek to appoint two fulltime Postdoc researchers for the project “COVIDA: Computing Visits Data for Dutch Natural Language Processing in Mental Healthcare” led by Dr. Marco Spruit. The COVIDA project kickstarts the interuniversity and interdisciplinary COVIDA research group by furthering the state-of-the-art in Natural Language Processing technologies for Dutch to improve daily practices in Mental Healthcare.

COVIDA’s scientific objective comprises the development of a hybrid Dutch language model to better understand human language in general, and Dutch Mental Healthcare language use in particular. We operate within the Design Science Research paradigm to model our computational experiment findings from both Computational Linguistics (i.e. knowledge-based) and Machine Learning (i.e. data-driven) inspired representations. Our societal contribution consists of a publicly available self-service facility for Natural Language Processing (NLP) of already routinely collected Dutch medical texts. Thus, COVIDA aims to deliver a game-changing innovation in Dutch mental healthcare institutions’ daily practices by enabling healthcare professionals throughout the Dutch language area to reuse the daily clinical notes written by nurses and doctors in patients’ EHRs to predict inpatient violence risk, depression, and more.

Your core research task is to design, implement and evaluate NLP pipelines for Dutch clinical texts which utilise linguistic and domain knowledge as well as structured data in a privacy-by-design architecture, from both Deep/Transfer Learning and symbolic NLP perspectives.
