The scientist decides for themselves to whom they will make their data available, says EOSC CZ general manager Matej Antol

According to Matej Antol, Czech science now has a huge opportunity to move forward rapidly by improving the way research data is managed and made accessible. In the interview, he talks about how the Czech Republic can establish itself among the European, if not world, leaders in data storage, sharing and, above all, accessibility. He also highlights the road science still has to travel, how the creation of the national repository platform will help, and what this will mean for scientists themselves.

18 Dec 2024 Martina Čelišová


What are the most important changes facing science in the future?

I expect science will look very different in a hundred years. Today, we can't systematically evaluate science except by "collecting points" for articles. A scientist is sometimes forced to publish an absurdly high number of papers, because that is how they collect as many citations as possible and move up the standard career ladder. In addition, there is the aspect of popularity that contributes to a scientist's reputation: to what extent that reputation is built on the scientist being seen and heard somewhere, and to what extent their outputs are really valuable. One of the other major shortcomings of contemporary science is that we often simply cannot access the research results. I think and hope that the way science is done will soon become more rational in these and other respects.

I think research data will play a big role in where science goes. We already know that data is at least as valuable as its interpretation. Often, data can be interpreted in many ways, and by publishing just one research result in the form of an article, we lose a huge amount of information, and thus the potential for further research. Moreover, it is on data that the quality of research can be tested, among other things.

The FAIR approach to scientific research (data that is Findable, Accessible, Interoperable and Reusable) and the whole concept of Open Science offer a direction in which the question of the quality of science could move forward. Open Science says that the results of science should be more universally accessible. FAIR-compliant data is, in effect, data that is properly managed, and the EOSC CZ initiative helps make research data more easily usable through FAIR data management.


Does this mean going the way of quality rather than quantity?

That's definitely part of it. The main problems with standard scientific output - particularly with the publication of articles - can be summarized as follows: Firstly, there are now incentives to publish what is called the Minimum Publishable Unit (MPU) - the way the system is set up pushes a researcher to write four smaller papers rather than one good one and thereby accumulate more, let's say, academic capital. Secondly, and this is also quite easy to measure and quantify, there is the problem of reproducibility. If somebody writes an article and somebody else tries to replicate its results, only a small percentage succeed. There are two explanations for this: the first, and in my opinion the most common, is that the method is not described precisely enough in the article, so the research cannot be replicated. The second is that the author simply made a mistake or, in extreme cases, made things up. There are several such anomalies, broadly referred to as the crisis of contemporary science. Open Science, and the research data accompanying publications, try to address at least part of this problem.





“One of the other major shortcomings of contemporary science is that we often simply cannot access the research results.”

So when the data is available, a scientist will no longer dare to cut corners in their research? Will there be more oversight?

I'd rather turn the rhetoric around. If a scientist does some research, bases a paper on it, and someone wants to attack them for not having a proper foundation, they can say: "No, no, here's all my data; feel free to reproduce my conclusions yourself." Today, we put out publications that in effect say: "Look what I've come up with! But I'm not actually going to share the data behind it." The more complete the datasets a scientist publishes, the easier it is to replicate their research and validate the results. Although the investment may initially seem large, we would like to encourage this, even if it means bearing the cost of storing the data.

Publishing articles means making them public and ensuring they stay available for a long time, maybe a hundred years. This costs money, even though a PDF file itself holds a minimal amount of data. But if we also want to make the data on which the results are based available, long-term storage costs a lot more. So in a sense the current state of affairs is rational: we simply haven't had the capacity for long-term data storage. Of course, there are exceptions, but in general we need to start putting more resources into it. Open Science says: science is funded by public money, yet its results are not always easily accessible. So let us try to make science more open, whether to other scientists or to the citizens who pay for it directly or indirectly.


Could we step back and ask what data actually is? In talking to data stewards, we found that everyone often has a different idea of what data means.

We distinguish between raw and processed data. If we think of an instrument - a telescope, an electron microscope, or a weather station - such an instrument spits out data. This is raw and usually relatively voluminous data, but it is also the only data untouched by human intervention: no one has yet stepped into it, deliberately or unconsciously, so it is not altered in any way.

But on the other hand, we usually need to make a lot of changes to them to get real value out of them, typically with a significant reduction in volume. Within EOSC CZ, we seek consensus within the scientific communities on what is actually valuable data in a particular domain, and in what format and volume it makes sense to store it.






“The more complete the datasets a scientist publishes, the easier it is to replicate their research and validate the results.”

So you're looking for consensus on what data makes sense to store. Is it at a similar level across all fields?

It's not. Different domains are at various stages of maturity when it comes to data management. One example from a domain close to my own: over the last few years, with the help of artificial intelligence, we have obtained a large amount of protein data that finds applications in various fields, from drug development to dealing with plastic pollution. This revolution was possible only because a clear format, and repositories in which protein data are systematically stored, have existed for some fifty years.

However, there are domains where the situation is significantly more complex. An example is satellite imagery of the Earth. If I am making tourist maps, I am interested in the most recent images of the landscape. If I am studying environmental or urban change, I am interested in how the landscape has changed over time. The same data can thus have many different uses, each requiring a different approach to describing and storing it. Examples like this show how important it is to learn to handle data properly.


Can we estimate how long it will take for data management and storage to become established in all fields? And will it settle down at all, or is the assumption that data storage will also continue to evolve?

I hope this will keep evolving, just as publications are still evolving today. Scientists have been publishing for hundreds of years, and today we have established formats, forums and publishers. Yet we have only recently discovered that we have a problem with so-called persistent identifiers. Until recently, authors signed only with their name, but there can be several scientists with the same name. If I really want to know who the author is, I have to use some identifier, an analogy to a personal identification number: a researcher ID. And that is something we have introduced to publications only quite recently. So even something as seemingly trivial as writing a scientific result on paper and passing it between scientists is still evolving today. With data, it's even more complex, even more alive. It's certainly not something we will solve in five or ten years and then put away as finished.
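
As a purely illustrative aside, the sketch below (hypothetical Python; the names, DOIs and identifiers are invented) shows why a persistent researcher identifier such as an ORCID iD matters: grouping publications by name merges two different people, while grouping by identifier keeps them apart.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Author:
    name: str
    orcid: str  # persistent researcher identifier (placeholder values below)

# Two different researchers who happen to share a name
a1 = Author(name="Jana Novakova", orcid="0000-0001-0000-0001")
a2 = Author(name="Jana Novakova", orcid="0000-0002-0000-0002")

# Publications keyed by a made-up DOI, each pointing at its author
publications = {
    "10.1234/example.article.1": a1,
    "10.1234/example.article.2": a2,
}

by_name, by_orcid = {}, {}
for doi, author in publications.items():
    by_name.setdefault(author.name, []).append(doi)    # merges the two people
    by_orcid.setdefault(author.orcid, []).append(doi)  # keeps them apart

print(len(by_name))   # 1 -> a name alone cannot distinguish the authors
print(len(by_orcid))  # 2 -> the persistent identifier can
```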






“So even something as seemingly trivial as writing a scientific result on paper and passing it between scientists is still evolving today. With data, it's even more complex, even more alive.”

What does scientific data management look like now, what stage is it at and what should it ideally look like?

Nobody knows what the ideal management of scientific data should look like (laughs, editor's note). However, the scientific community is already discussing this question, albeit rather timidly. At the same time, the situation is indeed very different in different countries and in different domains. On an individual level, some researchers are already taking care of their data as best they can - either they need to work with it systematically themselves, or they see the potential for reuse and willingly share it with their colleagues. And then there are some who will not give their data away at any cost. The national initiative is designed to take this further across the board.

At the European and Czech level, we are already starting to see a shift in how work with data is planned. There are so-called Data Management Plans: if I submit a research project today and apply for funding, grant agencies use them to ask what I plan to do with the data. So even the funders themselves see the value in research data. Some publishers already want me to include the data with the publication, so small changes are already happening in this area. That wasn't the case just a few years ago. And our initiative wants to fan these flames, in a way - to create the capacity so there is a place to store data, to develop services that researchers are comfortable using, and to support education in this area.
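
For a rough idea of the questions such a plan answers, here is a hypothetical sketch in Python; the fields and values are illustrative only and do not follow any particular funder's template.

```python
# A hypothetical, simplified Data Management Plan record; real DMPs follow
# funder-specific templates, so treat the fields below as illustrative only.
data_management_plan = {
    "project": "Example soil-moisture study",
    "data_types": ["raw sensor CSV files", "cleaned time series", "analysis scripts"],
    "estimated_volume_gb": 250,
    "metadata_standard": "to be agreed with the domain community",
    "storage_during_project": "institutional e-infrastructure storage",
    "long_term_repository": "national repository platform (planned)",
    "access_policy": "restricted during the project, open after publication",
    "responsible_person": "principal investigator",
}

for question, answer in data_management_plan.items():
    print(f"{question}: {answer}")
```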


So what does the digital infrastructure behind Czech science look like? How fragmented or coordinated is it?

It's all connected, especially with the help of so-called research infrastructures. For example, e-INFRA CZ is a national IT research e-infrastructure that provides networks, storage, computing capacity and other services to enable researchers to work with data. e-INFRA CZ consists of three partners: CESNET, IT4Innovations, which is a supercomputing center in Ostrava, and CERIT-SC at Masaryk University, where I work. And why am I talking about this? e-INFRA CZ is actually at the core of the EOSC initiative in the Czech Republic. So, with this background, together with other research infrastructures and institutions, we are now jointly creating a national repository platform that will serve researchers working with structured data.

Of course, we already have some storage capacity for research data in these infrastructures, but it typically serves a different purpose. For example, the ones at e-INFRA CZ were primarily designed to store data during computation, i.e., for the time when it is being analyzed. This data may be in a format that is not accessible or understandable to others. And we are trying to change this situation. We are creating an environment that will allow us to store FAIR data - data that is sorted and provided with appropriate metadata (descriptions), and that can be accessed by other researchers. It will be in an easy-to-understand format and will be reusable by me, my research team colleagues, colleagues at the partner institution, or anyone else as I determine.
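
To make the idea of FAIR, metadata-rich data a little more concrete, here is a minimal hypothetical sketch in Python; the field names and values are invented and do not represent the schema of any particular repository.

```python
# A hypothetical sketch of the kind of metadata that helps make a dataset FAIR.
dataset_record = {
    "identifier": "doi:10.1234/example-dataset",        # persistent identifier (Findable)
    "title": "Hourly temperature readings, station X, 2023",
    "creators": [{"name": "Example Researcher", "orcid": "0000-0000-0000-0000"}],
    "description": "Raw sensor output plus the cleaning script used.",
    "format": "text/csv",                                # common format (Interoperable)
    "license": "CC-BY-4.0",                              # reuse conditions (Reusable)
    "access": {                                          # who may access it (Accessible)
        "level": "restricted",
        "allowed_groups": ["my-research-team", "partner-institution"],
    },
    "related_publication": "doi:10.1234/example.article",
}

def is_minimally_described(record: dict) -> bool:
    """Very rough check: does the record carry the basics needed for reuse?"""
    required = ["identifier", "creators", "format", "license", "access"]
    return all(record.get(field) for field in required)

print(is_minimally_described(dataset_record))  # True
```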






“On an individual level, some researchers are already taking care of their data as best they can - either they need to work with it systematically themselves, or they see the potential for reuse and willingly share it with their colleagues.”

How do you explain this evolution in science to a scientist who is not entirely sympathetic to it and is worried that their data will not be safe in the national repository platform? That someone will misuse it, or that hackers will attack the database?

The most important thing for me is that the system must be intuitive and easy to use. A scientist wants to do science and cannot struggle with an IT system. It has to be more or less "self-contained". At the same time, if I want to store data in an infrastructure, I simply have to trust it. And some researchers just don't trust it. They'd rather keep the data on their computer under their desk. But in doing so, they are realistically risking much more. As far as data security is concerned, I would like to stress once again that we are not talking about open data but about data that is properly managed, so-called FAIR data. We have an authentication and authorization system that allows a scientist or a research group to determine to whom the data will be available. Sometimes, however, keeping data purely local has a reasonable logic, for example in the case of particularly sensitive data. So, naturally, we cannot yet offer a solution to everyone.
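
As an illustration of that kind of authorization decision, here is a minimal hypothetical sketch in Python; the dataset identifiers, group names and policy structure are invented, not taken from any EOSC CZ service.

```python
# Hypothetical access policy: each dataset's owner decides which groups may read it.
ACCESS_POLICY = {
    "dataset-2024-017": {"owner": "alice", "readers": {"alice", "team-biology", "partner-uni"}},
    "dataset-2024-042": {"owner": "bob", "readers": {"bob"}},  # kept private by its owner
}

def can_read(dataset_id: str, user_groups: set[str]) -> bool:
    """Allow access only if one of the user's groups is on the dataset's reader list."""
    policy = ACCESS_POLICY.get(dataset_id)
    return policy is not None and bool(policy["readers"] & user_groups)

print(can_read("dataset-2024-017", {"team-biology"}))  # True: shared with this team
print(can_read("dataset-2024-042", {"team-biology"}))  # False: this dataset stays private
```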

The second thing is the mindset of scientists, which I mentioned at the beginning in connection with publishing scientific papers and collecting points. The worry doesn't quite hold, because we now know that if a scientist makes their data available and someone then uses that data, the related publication has a better chance of being found and cited. At the same time, you can directly link your datasets to your publications through persistent identifiers.


What is the situation within the EU, and perhaps also outside it? Will the world be connected through data?

We are part of a European initiative to break down the barriers that prevent researchers from accessing data, software and other digital resources. This is the essence of EOSC - the European Open Science Cloud. Each country has its own background, practices and legislation for handling research results. We are building on the Czech infrastructure, but we are not closing it off - on the contrary, we are trying to connect it with the rest of Europe and the world, because an isolated approach would make no sense.

The initiative has established working groups focused on various fields, which include leading Czech scientists with many years of experience. These scientists are naturally in contact with colleagues from all over Europe and the world and already have joint international solutions at various stages of development. As a result, most of the repositories built here will be linked to European infrastructures and will serve across scientific communities.









“The most important thing for me is that the system must be intuitive and easy to use. A scientist wants to do science and cannot struggle with an IT system. It has to be more or less "self-contained".”

Can we say where we are in the Czech Republic compared to the rest of the world?

By European standards, we are doing relatively well. The Germans and their NFDI (Nationale Forschungsdateninfrastruktur), with whom we are in close contact, seem to be ahead of us. But Germany is bigger and functions as a federation, so logically they have a more distributed infrastructure. We, on the other hand, have the advantage that we can take a relatively centralized approach to building a common base.

The Czech Republic has established international cooperation in various fields, thanks to which we are aware of what is happening worldwide and can ensure mutual compatibility. If we disagree with what is being created at the European level, we must have good reasons why we want it to be different here. In that case, we enter into a dialogue with Europe and try to convince others that (and why) our way is better. In some cases, we have been successful in doing so.


How can this evolution in science affect the ordinary citizen? Will scientists then be able to correct misinformation, for example?

It's already partly happening. A good example, albeit a much-discussed one, is the covid pandemic, which had a direct impact on all citizens. Part of the solution to that problem lies in how rationally we can work with data and how well we can share it. Politicians needed data to make decisions; they needed to rely on the opinions of experts, who in turn based their conclusions on the data and interpretations they could get their hands on. And the more advanced our infrastructure, the better decisions we will make.

The whole communication between the scientific community and the rest of the world is also based on trust. And we can deepen that with the available data while giving people the space to refute misinformation. We can communicate the results of scientific research, but the claims alone are sometimes not accepted by society at large. However, when we add data to these claims, people can verify for themselves that this is indeed the case. We don't just want to say: "Scientists have found that the planet is warming." It is much better to show the data and say, "Look at the millions of data points from thermometers all over the planet." So we don't have to trust just one person's interpretation. The data lends a lot more credibility to the claims.


RNDr. Matej Antol, Ph.D.


is the principal project manager of the EOSC-CZ project, the integration manager of the Czech e-infrastructure e-INFRA CZ, and the executive director of one of its three partners, the CERIT-SC infrastructure, established at the Institute of Computer Science at Masaryk University. He has many years of experience in managing IT and research projects. His activities include building a platform for coordinating IT service management at MUNI, the SensitiveCloud environment for managing sensitive data, and others. His research focuses on managing and analyzing complex and high-dimensional data (image data, structural biology data, etc.) using artificial intelligence techniques.

