Open Science and automation: Marek Cebecauer changes the approach to research data management

Marek Cebecauer is a biochemist and a pioneer in imaging techniques who is involved not only in immunology research but also in scientific data management. In this interview, he explains how chance encounters with leading experts led him to microscopy and to the challenges of working with complex data. He shares how Open Science and automation can make data management easier and more transparent. From introducing mandatory data management plans (DMPs) at the J. Heyrovský Institute of Physical Chemistry of the CAS (HIPC) to developing tools for the National Repository Platform (NRP), he explains why Open Science is crucial for the future of scientific research.

22 Aug 2024 Vladimíra Coufalová


Can you describe your research, what you focus on, and why you pursue science?

I have the advantage that my father is also a scientist. He took me to the lab when I was a kid and I liked it, so I stuck with it. It's been forty years. I have a degree in biochemistry, which I apply to immunological questions in my research. Around the year 2000 I moved into imaging and microscopy, because up until then I had mostly been using biochemical tools. I was lucky to come across people in London who were involved in developing the latest microscopy technologies, and I got the opportunity to use these technologies long before they were commercially available. This led me to the J. Heyrovský Institute of Physical Chemistry of the CAS, where the group of Martin Hof (now Director of the HIPC) has been working on fluorescence for a long time. Fluorescence is a phenomenon used in biological microscopy, and I found it useful to be near someone who understands it and knows how to use complex imaging techniques correctly. This is a huge advantage I have at the Heyrovský Institute, although it is of course more difficult to pursue biology on a chemistry campus. But at the moment we are working closely with Motol Hospital in clinical microbiology and it turns out that it is not that difficult to do biology-oriented science in chemical or physical institutes.


“But at the moment we are working closely with Motol Hospital in clinical microbiology and it turns out that it is not that difficult to do biology-oriented science in chemical or physical institutes.”

How did you get into research data management issues?

What led me to work with data was that I started using imaging techniques that produce not only large volumes of data, as most people know, but above all very complex data.


Can you explain the concept of complex data?

One image you see in a publication may have been created by putting together, for example, ten thousand individual images. These images were taken at different times and in different dimensions, and the space in biological imaging is truly multidimensional, certainly more than four-dimensional.
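To make this concrete, here is a minimal sketch with invented dimensions (not the actual dataset behind any particular figure) of how such a multidimensional imaging stack might be laid out:

```python
import numpy as np

# Hypothetical layout of one imaging experiment (invented numbers):
# 100 time points x 5 z-slices x 4 fluorescence channels x 5 stage positions,
# each frame 512x512 pixels -- 10,000 individual images behind one figure.
n_t, n_z, n_c, n_pos = 100, 5, 4, 5
frame_shape = (512, 512)

shape = (n_t, n_z, n_c, n_pos, *frame_shape)   # 6 dimensions in total
n_frames = n_t * n_z * n_c * n_pos             # 10,000 frames
size_gb = np.prod(shape) * np.dtype(np.uint16).itemsize / 1e9

print(f"{len(shape)}-dimensional stack, {n_frames} frames, about {size_gb:.1f} GB")
# Without metadata recording which axis is time, depth, channel or position,
# the raw numbers are effectively unusable -- hence the need for data management.
```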


So data management was a necessity?

In our group, the data was such a mess that we had to start managing it, because otherwise no one would have known where to find anything. I found out we weren't the only ones, but there was no one to help me with it, so I had to help myself. Gradually I found that I was becoming an expert in the context of Open Science.


How long have you been an Open Science expert?

I started doing it seriously three years ago. Until then, I had only been involved in data management issues with respect to my own research. We scientists are used to learning quickly. However, I still see relatively few scientists in EOSC and Open Science, which is understandable, because they prefer to spend time on their experiments. I still have my lab and my immunology research, I have just added a few more worries.


“However, I still see relatively few scientists in EOSC and Open Science, which is understandable, because they prefer to spend time on their experiments.”

What are the other worries? Keeping my data organized sounds like something I can put in a drawer and know where I have it, but I guess it's more complicated with data, right?

The drawer analogy actually fits quite well; we just need to add binders to the drawer. The problem is that the drawers have to be described and the right things have to go in them. A scientist is not a librarian or an archivist who knows exactly where things are, because their main job is to be creative. That's my case too. Putting things in order is not my strong point, but I am looking for new solutions. I'm mostly into automation.


“A scientist is not a librarian or an archivist who knows exactly where things are, because their main job is to be creative.”

Can you describe your automation solution in more detail?

I mean automation of data collection and of the collection of information about the data itself. I don't want to say that everything works perfectly; that is still a long way off. But most people have experienced having to fill in their name, affiliation, email address and so on over and over again, even in EOSC Association documents. We fill in the same thing over and over again, when the system should know who we are by now. What I'm working on is improving this so that, for example, the system recognises that it's me using my phone and fills in my affiliation straight away, instead of offering me a whole list of institutions in the Czech Republic. Using lab journals makes this even easier, because I can have ready-made templates and work structures with all the information about a given experiment. Often an experiment differs from the previous one in only two or three parameters. Because I've already written a protocol, it's not a problem to create a copy in the electronic journal where I just change the little things. Automation systems recognise different experiments, and each has its own identifier. This brings us back to the archivists, but the trick is that the scientist doesn't need to know that these processes are running in the background and doesn't need to assign identifiers to his experiments himself.
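A minimal sketch of the kind of automation described above, using made-up field names and a hypothetical user profile rather than the system actually in use at the institute: a new experiment record is created by copying a template, changing only the two or three parameters that differ, while the code fills in the operator's identity and an identifier in the background.

```python
import uuid
from copy import deepcopy
from datetime import date

# Hypothetical stored profile: filled in once, reused automatically so the
# researcher never types name, affiliation and e-mail again.
USER_PROFILE = {
    "name": "Jane Researcher",
    "affiliation": "J. Heyrovsky Institute of Physical Chemistry of the CAS",
    "email": "jane.researcher@example.org",
}

# A protocol written once as a template in the electronic lab journal.
TEMPLATE = {
    "protocol": "live-cell imaging",
    "parameters": {"laser_power_percent": 5, "exposure_ms": 50, "temperature_c": 37},
}

def new_experiment(template, **changed_parameters):
    """Copy a template, apply the few parameters that differ, and attach
    identity, date and a unique identifier without the user noticing."""
    record = deepcopy(template)
    record["parameters"].update(changed_parameters)
    record["operator"] = USER_PROFILE            # affiliation auto-filled from the profile
    record["date"] = date.today().isoformat()
    record["experiment_id"] = str(uuid.uuid4())  # identifier assigned in the background
    return record

# An experiment that differs from the template in only two parameters.
exp = new_experiment(TEMPLATE, exposure_ms=100, temperature_c=25)
print(exp["experiment_id"], exp["parameters"])
```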


How do these data management solutions reach other scientists and researchers?

Firstly, I lead one of the EOSC CZ working groups. I'm a biochemist, but I lead the working group on materials science and technology. This role came about because the focus on materials and technology at the Heyrovský Institute is very strong, and because I am involved in data management plans and in Open Science and scientific data management in general. Face-to-face meetings are one way. The information then spreads mainly virally.


Can you elaborate?

Two years ago, at the Heyrovský Institute, we introduced mandatory data management plans for every project that we develop. This is proving to help us raise awareness of the need for good data management. We have created our own model for data management plans which, in contrast to the way the EU uses them, focuses exclusively on data. The most interesting question we have in these plans is about data reuse, where the answer is wrong 99 percent of the time. People do not realise that even data from their decade-old research can still be relevant, that no research is greenfield. But even that data should be available if we want to do reproducible science. This question in the data management plan highlights this and helps to educate people about it.


“People do not realise that even data from their decade-old research can still be relevant, that no research is greenfield.”

Do you have Data Stewards on your team?

At the Heyrovský Institute, I founded the "Heyrovský Open Science Team", which started to form two years ago with Eva Pluharova and Stefan Swift. The team is gradually growing and you can say that it is a Data Stewardship team, because we intensively discuss all issues related to data management. We also have colleagues who are technically oriented, such as Michal Tarana and Jakub Chalupský, who are excellent developers and experts in the IT environment. Importantly, we have a comprehensive overview of the whole system, including policy, scientific and developer perspectives.


Can you describe the activity of the National Repository Platform project that you are leading?

In this project we are focusing on the automation of systems. The activity I am in charge of involves developing tools that will help to transfer data directly from the instruments to the platform and ensure that the data is correctly described, i.e. has FAIR metadata. We are developing our own systems, but we are also using tools already developed abroad, which we are adapting for the Czech NRP environment. The key for us is to cooperate with pilot repositories and to ensure FAIRification of data and workflows (data processing in computing infrastructures). If we have data from an instrument that needs to be processed, this data should be stored immediately in the vicinity of the computing capacity. This is so that the scientist can fully concentrate on his experiment and not have to think about how to transfer the data to the place where a colleague from Ostrava is processing it. The system has to ask him whether the data should go to the repository or whether it will be further processed, and where. The interesting thing about the whole process is that the system identifies its users, works automatically and the data does not end up where it should not be. In addition, the data remains protected throughout.
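The routing logic described above could look roughly like the following sketch, with hypothetical field names and destinations rather than the actual NRP tooling: data coming off an instrument is wrapped in descriptive (FAIR-style) metadata and then sent either to a repository or to storage next to the computing capacity, depending on whether further processing is planned.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Dataset:
    """Data coming off an instrument, to be wrapped with descriptive metadata."""
    path: str
    instrument: str
    operator: str
    needs_processing: bool
    metadata: dict = field(default_factory=dict)

def ingest(dataset: Dataset) -> str:
    """Describe the dataset and decide where it goes, so the scientist
    does not have to think about the transfer at all."""
    dataset.metadata.update({
        "identifier": str(uuid.uuid4()),                   # identifier assigned automatically
        "acquired": datetime.now(timezone.utc).isoformat(),
        "instrument": dataset.instrument,
        "operator": dataset.operator,                      # recognised user, not re-typed
    })
    # Raw data that still needs processing is staged next to the computing
    # capacity; finished data goes straight to the repository.
    return "compute-staging" if dataset.needs_processing else "repository"

ds = Dataset(path="/instruments/confocal-01/run-042",
             instrument="confocal-01",
             operator="jane.researcher",
             needs_processing=True)
print(ingest(ds), ds.metadata["identifier"])
```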


“This is so that the scientist can fully concentrate on his experiment and not have to think about how to transfer the data to the place where a colleague from Ostrava is processing it.”

To keep the data from being stolen?

No, that doesn't have to be malicious at all. But it shouldn't be the case that someone who thinks they're supposed to be working with the data gets access to it when they're not. Nor should it happen that someone inadvertently deletes data because they mistake it for something else. When the system is set up well, it can eliminate a number of naturally occurring errors where malicious intent may not be present.
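As a rough illustration of how such safeguards can work (a toy example with invented users and datasets, not the actual access model), each action requires an explicit grant, so mistaking someone else's data for your own cannot lead to it being read or deleted by accident:

```python
# Toy permission table: dataset -> user -> allowed actions (invented example).
PERMISSIONS = {
    "run-042": {
        "jane.researcher": {"read", "delete"},
        "ostrava.colleague": {"read"},
    },
}

def can(user: str, action: str, dataset: str) -> bool:
    """Return True only if the user was explicitly granted this action."""
    return action in PERMISSIONS.get(dataset, {}).get(user, set())

assert can("ostrava.colleague", "read", "run-042")
assert not can("ostrava.colleague", "delete", "run-042")   # accidental deletion is blocked
assert not can("someone.else", "read", "run-042")          # no access without a grant
```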


Where do you get new ideas?

My group works with the data itself, so a lot of ideas come up directly during our work. For example, the plenary sessions of the Research Data Alliance, which is a global initiative on Open Science and data management, are very inspiring. There we come across very interesting examples and approaches.


Can you cite a specific example?

For example, there are environmental analyses in the Amazon rainforest where sensors are designed not to disturb the ecosystem. These are tower-like structures, but the jungle is the jungle, and the sensors are often disturbed by something, and they are in areas where access is prohibited. It's interesting how remote control and repair of the sensors is handled so that everything communicates properly. These examples are fascinating, but the Czech-BioImaging infrastructure, which generates data across the Czech Republic, is not much different in principle. It is not the jungle that causes problems, but perhaps a new scientist who does not yet know how to use the instrument.


At the end, could you summarize what you think is really important in Open Science?

Open Science is crucial for transparency and reproducibility of scientific research. Scientists often collect data, combine it and produce publications, but these are subjective interpretations of the data. What is really important is that the data, if possible, is made public before publication. This will allow other scientists access to the original data and encourage open discussion of the results. An important part of Open Science is that scientists have the tools and skills needed to manage and share data. This process is not straightforward, as it involves many small challenges and limited funding.


“An important part of Open Science is that scientists have the tools and skills needed to manage and share data. This process is not straightforward, as it involves many small challenges and limited funding.”

Mgr. Marek Cebecauer, Ph.D.



