The Scientist Decides Who Gains Access to Their Data, Says EOSC CZ Chief Manager Matej Antol

According to Matej Antol, Czech science now has an immense opportunity to make a rapid leap forward by improving its research data management. In this interview, he discusses how we can establish ourselves among the European, or even global, leaders in data storage, sharing, and accessibility. He also highlights the path science still has to travel, the role of the National Repository Platform, and what these changes will mean for scientists.

18 Dec 2024 Martina Čelišová


What are the most significant changes science will face in the future?

I expect that science will look entirely different in a hundred years. Today, we cannot systematically evaluate science in any other way than by "ticking off boxes" for published articles. Scientists are often forced to publish an excessive number of papers to accumulate as many citations as possible and climb the traditional career ladder. Moreover, a scientist's reputation today also depends on their visibility—how often they are seen or heard—rather than purely on the value of their outputs.

One of the fundamental flaws of current science is that research findings are often inaccessible. I believe and hope that the way science is conducted will soon become more rational in these and other respects.

Research data will play a major role in how science evolves. We already know that data are at least as valuable as their interpretation. Data can often be interpreted in multiple ways, and publishing only a single research outcome as a paper means losing a vast amount of information—and potential for further research. Moreover, data provide a way to verify the quality of research.

The FAIR approach to scientific inquiry and the broader Open Science concept offer a direction for improving the quality of science. Open Science advocates for making scientific results more accessible to people. FAIR-compliant data (findable, accessible, interoperable, and reusable) are essentially well-managed data, and through FAIR data management, the EOSC CZ initiative aims to enable better and simpler use of research data.


Should we prioritize quality over quantity?

Certainly, this is part of the issue. The main characteristics of standard scientific output, particularly article publishing, can be described as follows:

Firstly, there are incentives to publish what is referred to as MPU (Minimum Publishable Unit). This means that systemic pressures often lead researchers to publish four smaller articles rather than a single high-quality one, thereby accumulating more, let’s say, academic capital.

Secondly, and this is something that can be measured and quantified fairly well, there is a reproducibility problem. When someone publishes an article and another researcher attempts to replicate the results presented in it, the replication succeeds in only a small percentage of cases.

There are two main explanations for this: The most common reason, in my opinion, is that the methodology is not described with sufficient precision in the article, making replication of the research impossible. Alternatively, the author may have made an error or, in extreme cases, fabricated the data. These kinds of anomalies are numerous, and this broader issue is often referred to as the reproducibility crisis in contemporary science. Open Science and research data attached to publications aim to at least partially address this problem.





“One of the fundamental flaws of current science is that research findings are often inaccessible.”

When data is accessible, does that mean a scientist won’t dare to take shortcuts in their research? Will there be more room for scrutiny?

I’d rather frame it differently. When a scientist conducts research, writes a paper based on it, and someone wants to challenge them by saying the findings aren’t well-supported, the scientist can respond: “No, no, here’s all my data. Feel free to reproduce my conclusions yourself.” Today, we often release publications with a message like, “Look what I discovered! But I’m not really going to tell you how I discovered it.” The more complete the datasets a scientist publishes, the easier it is to replicate their research and verify the results. That’s something we’d like to encourage, even though data storage has its costs and the initial investment might seem significant.

Publishing articles means making them public and ensuring they remain accessible for a long time—potentially a hundred years. That costs money, even if a PDF file itself takes up minimal storage space. But when we want to provide access to the data underlying our findings, long-term storage becomes much more expensive. The current approach is rational because we lack the capacity for long-term data storage on a larger scale. Of course, there are exceptions, but in general we need to start dedicating more resources to preservation.

Open Science advocates for this: science is funded by public money, yet the results of science are not always easily accessible. Let’s try to make science more open, whether for other scientists or for citizens who directly or indirectly fund it.


Could we clarify what we actually mean by “data”? In interviews with data stewards, we’ve found that people often have very different interpretations of the term.

We distinguish between raw and processed data. Imagine a device like a telescope, an electron microscope, or a weather station. Such instruments generate data in large quantities—these are raw data. They are typically quite voluminous, but they are also the only form of data untouched by human interpretation: at this stage, no one has, intentionally or unintentionally, influenced or altered them.

On the other hand, we usually need to make significant adjustments to this raw data to extract meaningful insights. During this process, the data volume typically decreases substantially. Within EOSC CZ, we are working to build a consensus among scientific communities about what constitutes valuable data within specific domains and in what format and volume it makes sense to store them.






“The more complete the datasets a scientist publishes, the easier it is to replicate their research and verify the results.”

So you’re working to agree on what data are worth preserving. Is the situation similar across all disciplines?

Not at all. Different fields are at very different stages of data management maturity. Let me give you an example from a domain I’m personally familiar with: In recent years, artificial intelligence has enabled the generation of vast amounts of protein data, which are used in everything from drug development to tackling plastic pollution. This revolution was made possible partly because, for the past fifty years, a clear format and repositories for systematically storing protein data have been in place.

However, there are fields where the situation is far more complex. Take the images of the Earth's surface, for instance. If I’m creating hiking maps, I’m interested in the most recent image of the terrain. But if I’m studying environmental change or shifts in urban development, I need images that show how the landscape evolves over time. The same data can serve many purposes, each requiring different approaches to description and storage. Examples like this underscore the importance of learning how to handle data properly.


Can we estimate how long it will take for data management and storage to stabilize across all fields? And will it stabilize at all, or is it expected that data storage will keep evolving?

I hope it will evolve continuously, much like scientific publishing has. Scientists have been publishing for hundreds of years, and today we have established formats, forums, and publishers. Yet it wasn’t until recently that we realized we had an issue with persistent identifiers. Authors used to sign their work with just their name, but today there are so many researchers that several may share the same name. If I want to know exactly who an author is, I need an identifier—something akin to a social security number, a researcher ID. That’s a feature we’ve only integrated into publications fairly recently.

So even something as seemingly straightforward as putting a scientific result on paper and sharing it with other researchers has continued to evolve over time. With data, it will be even more complex and dynamic. It’s certainly not something we’ll resolve in five or ten years and then freeze in place.
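To make the identifier idea concrete, here is a minimal sketch in Python of how a persistent identifier (such as an ORCID iD, one widely used researcher identifier) disambiguates authors who share a name. The names, iD values, record structure, and lookup helper are all illustrative, not any registry's actual API.

```python
# Two researchers share the same name; only a persistent identifier
# (here an ORCID-style iD) tells them apart. The records and helper
# below are illustrative, not a real registry API.
authors = [
    {"name": "Jana Novakova", "orcid": "0000-0002-1825-0097"},
    {"name": "Jana Novakova", "orcid": "0000-0001-5109-3700"},
]

def find_author(orcid: str) -> dict | None:
    """Look up an author by persistent identifier rather than by an ambiguous name."""
    return next((a for a in authors if a["orcid"] == orcid), None)

print(find_author("0000-0002-1825-0097"))  # exactly one match, despite the shared name
```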






“So even something as seemingly straightforward as putting a scientific result on paper and sharing it with other researchers has continued to evolve over time. With data, it will be even more complex and dynamic.”

What does scientific data management look like now, what stage is it in, and what would the ideal scenario be?

No one knows exactly what ideal scientific data management should look like. However, scientific communities are starting to discuss this question, albeit cautiously for now. At the same time, the state of data management varies greatly between countries and across different fields.

On an individual level, there are already researchers who manage their data as well as they can—either because they need to work with it systematically themselves or because they see potential for its reuse and willingly share it with colleagues. On the other hand, some researchers are unwilling to share their data under any circumstances. The goal of national initiatives is to move this landscape forward on a broader scale.

At both the European and Czech levels, we are beginning to see progress in planning data workflows. There are now tools like Data Management Plans, which require researchers to outline what they plan to do with their data when submitting a research project for funding. This shows that even funding agencies recognize the value of research data.
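As a rough illustration of what a Data Management Plan captures, here is a minimal sketch of such a plan in structured form; the field names and values are hypothetical, not any funder's actual template.

```python
# Illustrative sketch of the questions a Data Management Plan answers.
# Field names and values are hypothetical, not a specific funder's template.
dmp = {
    "project": "Example soil-moisture study",
    "data_description": "Hourly sensor readings from 40 field stations",
    "formats": ["CSV", "NetCDF"],
    "estimated_volume_gb": 120,
    "storage_during_project": "institutional storage with nightly backups",
    "long_term_repository": "national repository platform",
    "access_policy": "open after a 12-month embargo",
    "responsible_person": "principal investigator",
}
```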

Some publishers already require researchers to submit their datasets alongside their publications, so small changes are also happening in this area—something we didn’t see just a few years ago. Our initiative aims to accelerate this progress by creating storage capacity for data, developing user-friendly tools for working with it, and providing support for education in data management.


What does the digital infrastructure of Czech science look like? How fragmented or coordinated is it?

It’s all interconnected, particularly through what are called research infrastructures. For example, e-INFRA CZ is the national research e-infrastructure for IT, providing networks, storage, computational capacities, and other services to enable researchers to work with data. e-INFRA CZ consists of three partners: CESNET, IT4Innovations (a supercomputing center in Ostrava), and CERIT-SC at Masaryk University, where I also work.

Why is this relevant? e-INFRA CZ essentially forms the backbone of the EOSC initiative in the Czech Republic. Alongside other research infrastructures and institutions, we are now building a National Repository Platform (NRP) to help researchers work with structured data.

Of course, there are already some storage capacities for research data within these infrastructures, but they are typically designed for different purposes. For instance, the storage in e-INFRA CZ was primarily intended for temporary data during computational analyses. Such data may be in formats that are not accessible or understandable to others. We are working to change this.

Our goal is to create an environment that supports the storage of FAIR data—data that is organized and enriched with appropriate metadata (descriptions) and accessible to others. These datasets will be in formats that are comprehensible and reusable by me, my research team, partner institutions, or even others, depending on the conditions I set.
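As a concrete illustration of data "enriched with appropriate metadata," here is a minimal sketch of a dataset record. The structure loosely echoes common metadata schemes such as DataCite, but the fields and values are simplified and hypothetical, including the placeholder DOI.

```python
# A minimal FAIR-style metadata record. The structure loosely follows
# common schemes such as DataCite, but the fields are simplified and
# the values (including the DOI) are placeholders for this sketch.
dataset_metadata = {
    "identifier": "doi:10.xxxx/example-dataset",   # persistent identifier (placeholder)
    "title": "Electron microscope images of polymer samples",
    "creators": [{"name": "Jana Novakova", "orcid": "0000-0002-1825-0097"}],
    "description": "Raw and processed images from 2023-2024 acquisition runs",
    "format": "TIFF",
    "license": "CC-BY-4.0",
    "access": "restricted",   # conditions set by the researcher
}
```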






“On an individual level, there are already researchers who manage their data as well as they can—either because they need to work with it systematically themselves or because they see potential for its reuse and willingly share it with colleagues.”

How can this evolution in science be explained to a researcher who is hesitant and fears their data won’t be secure in the National Repository Platform? That it might be misused or that hackers could attack the database?

The most important thing is that the system must be intuitive and easy to use. A researcher wants to focus on their science and shouldn’t have to struggle with a complicated IT system—it needs to work almost “seamlessly.”

At the same time, if I want to store data in an infrastructure, I need to trust that infrastructure. And some researchers simply don’t. They prefer to keep their data on a computer under their desk. But in reality, that approach carries far greater risks.

As for data security, I want to emphasize again that we’re not talking about openly accessible data but about properly managed FAIR data. We have an authentication and authorization system in place that allows researchers or research groups to define exactly who can access their data. Of course, in some cases, keeping data locally can be justified—for example, with particularly sensitive data. For now, we naturally cannot provide a solution for every single case.
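To illustrate the kind of rule such an authentication and authorization layer enforces, here is a minimal sketch; the datasets, users, groups, and helper function are hypothetical and do not represent the NRP's actual system.

```python
# Hypothetical access rules: each dataset lists who may read it.
# A toy illustration of authorization, not the NRP's real system.
access_rules = {
    "dataset-001": {"owner": "alice", "readers": {"alice", "bob", "research-group-x"}},
    "dataset-002": {"owner": "carol", "readers": {"carol"}},  # private to its owner
}

def can_read(user: str, groups: set[str], dataset_id: str) -> bool:
    """Allow access if the user, or any group they belong to, is listed as a reader."""
    rule = access_rules.get(dataset_id)
    if rule is None:
        return False
    return user in rule["readers"] or bool(groups & rule["readers"])

print(can_read("bob", set(), "dataset-001"))                  # True: listed directly
print(can_read("dana", {"research-group-x"}, "dataset-001"))  # True: via group membership
print(can_read("bob", set(), "dataset-002"))                  # False: not authorized
```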

The second aspect is the mindset of researchers. I touched on this earlier in the context of publishing scientific articles and collecting metrics. It’s not entirely true that sharing data brings no benefits. Today, we know that when a researcher makes their data available, and that data is used by someone else, it increases the likelihood that their publication will be cited. In addition, datasets can be directly linked to publications through persistent identifiers, creating even stronger connections between a researcher’s work and its impact.


How does it look within the EU and even beyond? Will the world be connected through data?

We are part of a European initiative aimed at breaking down the barriers that prevent researchers from accessing data, software, and other digital resources. This is the essence of EOSC – the European Open Science Cloud. Each country has its own infrastructure, customs, and legislation on how to handle research outcomes. We are building on the Czech infrastructure, but we are not isolating it—instead, we are striving for connections with the rest of Europe and the world, because an isolated approach would be meaningless.

As part of this initiative, working groups have been established to focus on various fields, involving leading Czech scientists with years of experience. These scientists are naturally in contact with colleagues from all over Europe and the world, and international solutions already exist at various stages of development. As a result, most of the repositories created here will be interconnected with European infrastructures and will serve entire scientific communities.









“The most important thing is that the system must be intuitive and easy to use. A researcher wants to focus on their science and shouldn’t have to struggle with a complicated IT system—it needs to work almost seamlessly.”

Can you say at what level the Czech Republic stands compared to the rest of the world?

By European standards, we are doing quite well. Germany, with its NFDI (National Research Data Infrastructure), is ahead of us, and we are in close contact with them. However, Germany is larger and functions as a federation, so their infrastructure is logically more distributed. We, on the other hand, have the advantage of being able to build a common foundation in a relatively centralized manner.

The Czech Republic has international collaborations in various fields, which gives us an awareness of what is happening in the world and enables us to ensure mutual compatibility. If we disagree with something emerging at the European level, we must have strong reasons for why we want to approach it differently. In that case, we engage in dialogue with Europe and try to convince others that (and why) our approach is better. And in some cases, we have been successful in doing so.


How will this evolution in science affect citizens? Will scientists be able to correct misinformation thanks to this?

It’s already happening to some extent. A good example, even though it has been widely discussed, is COVID, which had a direct impact on all citizens. Part of solving that problem also lay in how rationally we could work with data and how effectively we could share it. Politicians needed data for decision-making, and they had to rely on expert opinions, which in turn depended on what data and interpretations were available. The more advanced our infrastructure, the better the decisions we can make.

The entire communication between the scientific community and the rest of the world also relies on trust. And we can deepen that trust through accessible data, while also giving people the space to counter misinformation. We can communicate the results of scientific research, but sometimes just stating the claims is not enough for society to broadly accept them. However, when we add data to those claims, people can verify for themselves that it is indeed true. We don’t want to simply say, “Scientists have found that the planet is warming.” It’s much better to show the data and say, “Look at the millions of data points collected from thermometers spread across the globe.” We don’t have to rely solely on the interpretation of one person. Data, therefore, lend far greater credibility to the claims.


RNDr. Matej Antol, Ph.D.


is the principal project manager of the EOSC CZ project, the integration manager of the Czech e-infrastructure e-INFRA CZ, and the executive director of one of its three partners, the CERIT-SC infrastructure established at the Institute of Computing at Masaryk University. He has many years of experience in managing IT and research projects. His activities include building a platform for coordinating IT service management at MUNI and the SensitiveCloud environment for managing sensitive data, among others. His research focuses on managing and analyzing complex, high-dimensional data (image data, structural biology data, etc.) using artificial intelligence techniques.

