Welcome to a new series of blog posts, where researchers share their experiences of open science practices and the impact that sharing open data, code or protocols can have.
Geir Kjetil Sandve is Professor of Scientific Computing and Machine Learning at the University of Oslo, Norway. He develops machine learning methods for life sciences and public health, with recent work focusing on climate-sensitive disease prediction. As a senior researcher, Professor Sandve has extensive publishing, teaching, and supervision experience, with articles in journals such as Nature Machine Intelligence, including "Improving generalization of machine learning-identified biomarkers using causal modelling with examples from immune receptor diagnostics" and "The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires".
We asked Professor Sandve to share his experiences of open science practices and the impact that sharing open code can have on public health and research in the age of deep learning.
I do research within machine learning, and often my methodological research is closely inspired by real use cases. I try to always base my new methodological projects on some concrete need. I have worked in machine learning for life sciences for 20 years; I first worked in genomics, then pharmacoepidemiology, and, for some years, quite intensely with immune receptors, such as adaptive immune cells and how they recognise foreign threats. Most recently, I'm spending most of my time on how climate change is affecting health, such as machine learning for malaria and dengue fever, and predicting outbreaks ahead of time.
I think it was during my PhD, when we published quite a bit in the BMC series, including BMC Bioinformatics and Genome Biology, which published open access articles. I liked that for two reasons. As a computer science researcher, it’s not just about whether you can find a paper in the end but how quickly you can browse and navigate papers. Even though we had access to most of the papers we needed during my PhD, the open access ones were more convenient. That was my entry point.
For open software, it was more about transparency. I could see that when anything was open science, it was much easier to double-check whether things had been done correctly and feel reassured that I could trust the paper in detail. It is also much easier to build on. In my PhD, I spent most of my time on preparation and boilerplate aspects. When I went in a more open software direction, trying to create open science projects built on open software, I could focus more on new ideas in the context of existing work.
I went into research as a bit of an idealist. It's quite a tough sector to be in, so I feel we have to find motivation in some ideals of science. One is that if I do something, I want it to be as helpful to others as possible. I want others to be able to build on what I've done. The second is trust. In science, we shouldn't be forced to trust people. We should instead always be able to check each other’s work.
You’re never 100% proud of your research code. You always know that if you had unlimited time, you would make it even better. Allowing yourself to share it and say, ‘Okay, this is what I had time for,’ and being able to trust that you’ve made the right prioritisations is important. If you have done something wrong, there is a big chance that somebody will catch it.
Data sensitivity is one challenge, but another big challenge is the computational cost of ensuring reproducibility, especially when analyses are run frequently. We have practices to ensure transparency, reproducibility, and unbiasedness of results, but if you want to go into deep learning, you simply cannot follow them realistically within typical computational budgets. I think we are entering a phase where computation is sometimes so intense and demanding that we have to sacrifice a bit of this transparency and reproducibility in order to save computational time.
Also, it’s one thing to share code, but for it to be valuable, people need to be able to run it. That is becoming harder, as code may depend on particular GPUs or system setups. It may also rely on closed models from large commercial companies or require access to computing centres that are not openly available. I think these are the main challenges we face in the machine learning community specifically.
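One small, low-cost practice that helps shared code stay runnable is making its randomness explicit: if every stochastic step takes a seed, others can rerun an analysis and get bit-identical results. The sketch below is illustrative only (the function name and values are my own, not from the interview), using Python's standard library:

```python
import random

def reproducible_sample(seed, n=5):
    # Use a dedicated RNG instance seeded explicitly, rather than the
    # global random state, so reruns are deterministic and isolated.
    rng = random.Random(seed)
    return [rng.randint(0, 100) for _ in range(n)]

# Two runs with the same seed yield identical results,
# so a reader can verify a shared analysis exactly.
run_a = reproducible_sample(42)
run_b = reproducible_sample(42)
assert run_a == run_b
print(run_a)
```

Pairing seeds like this with pinned dependency versions (e.g. a lock file) addresses a good share of the "I can't run your code" problem, though it cannot fix hardware-specific nondeterminism on GPUs.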
“For researchers from smaller institutions, open science is crucial for being part of a broad international community beyond their own office.”
Geir Kjetil Sandve, Professor of Scientific Computing and Machine Learning, University of Oslo, Norway
We have one ongoing project that is not yet academically published, but it has already had an impact. I’m collaborating with the HISP Centre to develop DHIS2, which is the world’s largest open-source health information system. It is used as the national health information system in around 70 countries, where vaccination data, medicine stocks, tuberculosis outbreaks, and other health indicators are tracked.
I’ve worked with them for three years to develop machine learning methodologies to predict climate-sensitive disease outbreaks. While there are many open-source models for early warning systems, we are trying to build a platform that enables such models to be used operationally, as that can be quite challenging in practice. We are building an open community where everything is open source and where we share our ambitions and ideas transparently. The goal is to bring the community together so countries can retain control over the models and how they are used, make informed decisions, build local capacity, and communicate effectively with governments to ensure real-world impact.
It has been fascinating to me because it goes beyond publishing open access papers or open-source code. It requires an open development cycle, where we contribute core software platform technologies to a shared community and develop them collaboratively.
Another point is that open science is also a way of contributing to capacity building. For researchers from smaller institutions, open science (its processes, networks, software, and publications) is crucial for being part of a broad international community beyond their own office.
Yes, I think so. When I was hired for my permanent position, they wanted someone who could interact with existing research. I was able to say that I bring with me a lot of open and still actively developing code that others can contribute to. This is better for students as they can contribute to open-source code, which facilitates collaboration and interaction with the research environment. Open science practices have also increased the impact of my work. Several people have been inspired by the code I've made, and I have gotten collaborations both nationally and internationally based on it.
“Open science is a way of thinking about science and work that makes it feel meaningful and easier to stay motivated in the long run.”
Geir Kjetil Sandve, Professor of Scientific Computing and Machine Learning, University of Oslo, Norway
It is completely worth it. I think you gain so much by going in this direction. If you see it as a community and become conscious of your strengths and weaknesses, it will open so many collaboration and learning opportunities. Open science is a way of thinking about science and work that makes it feel meaningful and easier to stay motivated in the long run.
I feel the point of open science is that work should be possible to reproduce or reuse in practice. To me, open science is not about whether it’s theoretically possible with unlimited time to build on something but about ensuring it’s open in a way that actually invites reuse, transparency, and reproducibility. Sharing code without putting yourself in others’ shoes and considering how it might realistically be reused is not truly in the spirit of open science. It's not about checking a box; it's about actually contributing.
Geir Kjetil Sandve studied computer science at the Norwegian University of Science and Technology (NTNU). During his PhD, he surveyed, benchmarked and developed machine learning methodology for motif discovery in biosequences. For his postdoctoral studies at the University of Oslo, Norway, he broadened his understanding of statistics, collaborating with biologists and statisticians to pioneer statistical analysis of genomic co-localization. Currently, his main focus is on doing his part to help make our research environment fun but productive, brutally honest but supportive, and visionary while delivering on our promise.
Best practices for transparency and reuse: