DEBS '25: Proceedings of the 19th ACM International Conference on Distributed and Event-based Systems

Full Citation in the ACM Digital Library

SESSION: KEYNOTES

Navigating the Challenges of AI-Driven Data Processing

Georgia Koutrika

AI-driven data processing is revolutionizing the way researchers analyze vast datasets, enabling intuitive interactions and insightful answers. However, the integration of AI in data processing also introduces several challenges that must be addressed to ensure its reliability and trustworthiness. Key issues include hallucinations, where AI systems generate incorrect information; bias, which can lead to unfair outcomes; ethical considerations, such as privacy and misuse; explainability, to ensure transparency; and answer provenance, to verify the accuracy of AI outputs. These challenges can undermine the effectiveness of AI systems and potentially lead to incorrect or even harmful decisions. This talk explores these issues and discuss mitigation strategies.

A Bird's-Eye View of Complex Streaming Data Analytics

Minos Garofalakis

Massive continuous data streams arise naturally in several dynamic big data analytics applications, such as enabling observability for complex distributed systems, network-operations monitoring in large ISPs, or incremental federated learning over dynamic distributed data. In such settings, usage information from numerous devices needs to be continuously collected and analyzed for interesting trends and real-time reaction to different conditions (e.g., anomalies/hotspots, DDoS attacks, or concept drifts). Streaming data raises important memory-, time-, and communication-efficiency issues, making it critical to carefully optimize the use of available computation and communication resources. We give a (biased) overview of some key algorithmic tools in the space of streaming data analytics, along with relevant applications and challenges.

Supporting AI pipelines in the computing continuum

Vana Kalogeraki

The technology ecosystem has experienced a profound and disruptive transformation in recent years. Groundbreaking technological innovations such as cloud computing, the Internet of Things (IoT) and Artificial Intelligence (AI) have paved the way for applications that were once deemed inconceivable, significantly enhancing quality of life and empowering businesses to make more informed decisions. The long-term potential of these technologies presents immense opportunities. The question we face now is how to automate the machine learning workflows and build systems that can meet the demands of scalability and efficiency, bridging the gap between AI innovations and end users to make AI widely accessible to diverse audiences. In this talk, we explore recent developments and highlight exciting prospects for supporting AI pipelines across the evolving computing continuum. In particular, we will discuss a series of research challenges and introduce novel techniques focused on resource management, energy efficiency, reliability, and privacy. Our discussion is motivated by critical application areas within the SmartCity domain, specifically: (a) transportation systems and (b) urban disaster and emergency response.

SESSION: WORKSHOP SUMMARIES

Workshop on Relaxed Semantics in Data Analytics Pipelines (RELAX 2025)

Deepak P
Dimitrios Tsoumakos
Thai Son Mai

Data analytics systems have become highly complex over time. This includes aspects such as the variety of data they process, the kind of computations they do, and the myriad types of scenarios they get deployed. This has created new challenges for distributed data systems. The RELAX 2025 workshop at DEBS 2025 conferences explores how relaxation of semantics could be a substantive pathway in navigating these challenges.

SESSION: RESEARCH TRACK

Run-Time Adaptation of Complex Event Forecasting

Manolis Pitsikalis
Elias Alevizos
Nikos Giatrakos
Alexander Artikis

Complex Event Forecasting (CEF) is a process whereby complex events of interest are forecast over a stream of simple events. CEF facilitates proactive measures by anticipating the occurrence of complex events. This proactive property, makes CEF a crucial task in many domains; for instance, in maritime situational awareness, forecasting the arrival of vessels at ports allows for better resource management, and higher operational efficiency. However, our world’s dynamic and evolving conditions necessitate the use of adaptive methods. For example, for safety reasons, maritime vessels may adapt their routes to avoid powerful swell waves; in fraud analytics, fraudsters evolve their tactics to avoid detection etc. CEF systems typically rely on probabilistic models, trained on historical data. This renders such CEF systems inherently susceptible to data evolutions that can invalidate their underlying models. To address this problem, we propose RTCEF, a novel framework for Run-Time Adaptation of CEF, based on a distributed, service-oriented architecture. We evaluate RTCEF on two use-cases and our reproducible results show that our proposed approach has significant benefits in terms of forecasting performance without sacrificing efficiency.

DIPSUM: Distributed Pattern Summaries for Efficient CEP Aggregates

Steven Purtzel
Samira Akili
Marc Sebastian Kühne
Matthias Weidlich

Complex Event Processing (CEP) continuously evaluates queries over event streams to detect patterns that yield actionable insights. In monitoring applications, however, not all patterns are relevant at any point in time. Especially for queries with permissive evaluation semantics yielding a large number of matches, it is often sufficient to compute only aggregated, summary results upon request, e.g., upon refreshing a dashboard. The optimization potential of on-demand aggregations over detected patterns has, so far, been mostly considered for centralized evaluation scenarios. For the setting of distributed event processors, it is not yet clear how to efficiently compute and distribute local aggregates.

To fill this gap, we propose DIPSUM , a framework for the efficient on-demand evaluation of CEP aggregate queries in distributed environments. DIPSUM combines the efficient aggregation over computationally expensive operators (Kleene closure, negation), with fine-grained routing of partial aggregates. It relies on a summary data structure that compactly captures match information and can be decomposed for distributed query evaluation. Experiments with real-world and synthetic data show that DIPSUM improves transmission costs, detection latencies, and throughput by several orders of magnitude compared to baseline strategies.

Automated Discovery of CEP Applications with Evolutionary Computing

Giulio Appetito
Eric Medvet
Vincenzo Gulisano

Complex event processing (CEP) is key for detecting patterns in digital systems (e.g., smart grids and vehicular networks) through platforms like Apache Flink CEP that decouple application logic from distributed execution in cloud-to-edge infrastructures. Yet, a barrier remains: system experts can identify relevant patterns but often lack programming skills to implement CEP applications, limiting effective use.

We present a preliminary study on using evolutionary computation to automate CEP application discovery from data. Experts provide examples of relevant event sequences for an evolutionary algorithm to evolve applications to detect similar patterns. Initial results are promising and highlight CEP-related challenges that open new research directions.

FEDAMON: A Forecast-Based, Error-Bounded and Data-Aware Approach to Continuous Distributed Monitoring

Yixing Zhang
Romaric Duvignau

Efficiently monitoring distributed systems is critical for applications such as data center load balancing, fleet management, and smart grid energy optimization. Traditional continuous monitoring solutions often require significant communication overhead, straining network resources. This paper addresses the continuous distributed monitoring problem, where a central coordinator needs to track statistics from numerous distributed nodes in real-time. We propose a novel forecast-based, error-bounded, and data-aware approach that significantly reduces communication costs while maintaining accurate monitoring. Instead of transmitting all observed values to the central coordinator, our event-based monitoring leverages lightweight forecasting models at edge nodes. Both the coordinator and distributed nodes predict the evolution of local values, communicating only when deviations exceed a predefined error threshold. To adapt to dynamically changing trends in data streams, we introduce a data-aware model selection strategy that optimizes the balance between communication frequency and monitoring accuracy. Our solution is evaluated on diverse datasets and results demonstrate a substantial reduction in communication overhead with minimal impacts on accuracy, outperforming baseline monitoring regarding communication complexity, e.g., sending, on average, only 10% of baseline update events while maintaining less than 2% average error across all monitored streams. Furthermore, we show that our standard parameter solution even surpasses the best calibrated single models, achieving up to a 17% improvement in communication overhead with identical guarantees on maximum error. Optimizing the control factor in data-aware approach leads to a 13% improvement in performance, reducing error by 1%, without incurring additional communication costs. We believe our approach offers a scalable and efficient solution, enabling fully automatic, real-time monitoring with optimized performance.

PPOIJ: Shared-Nothing Parallel Patterns for Efficient Online Interval Joins over Data Streams

Gabriele Mencagli
Yuriy Rymarchuk
Dalvan Griebler

Joining data streams is a fundamental stateful operator in stream processing. It involves evaluating join pairs of tuples from two streams that meet specific user-defined criteria. This operator is typically time-consuming and often represents the major bottleneck in several real-world continuous queries. This paper focuses on a specific class of join operator, named online interval join, where we seek join pairs of tuples that occur within a certain time frame of each other. Our contribution is to propose different parallel patterns for implementing this join operator efficiently in the presence of watermarked data streams and skewed key distributions. The proposed patterns comply with the shared-nothing parallelization paradigm, a popular paradigm adopted by most of the existing Stream Processing Engines. Among the proposed patterns, we introduce one based on hybrid parallelism, which is particularly effective in handling various scenarios in terms of key distribution, number of keys, batching, and parallelism as demonstrated in our experimental analysis.

EPL: The Event Processing Language for Streaming Data

Samuele Langhi
Riccardo Tommasini
Angela Bonifati
Thomas Bernhardt

Stream Processing (SP) engines play a crucial role in real-time analysis within the Big Data landscape, handling infinite data streams to analyze massive, noisy, and heterogeneous information flows. While initially inheriting programming interfaces from Hadoop MapReduce, a recent trend involves adopting declarative languages for expressing analyses. The Event Processing Language (EPL) and its implementation Esper, a mature query language in streaming and event processing, have gained prominence. EPL, with SQL-like syntax, uniquely combines Complex Event Processing (CEP) and streaming analytics. However, it lacks formal semantics. This work addresses this gap by formalizing a core fragment of EPL, focusing on the aspects of Data Definition Language (DDL) and Data Manipulation Language (DML). The formalization resolves semantic ambiguities, identifies potentially harmful constructs, and specifies EPL’s data and processing model. This effort addresses a major gap in the formalization of stream processing languages, aligning with recent initiatives from similar domains like graph query languages.

Histrio: a Serverless Actor System

Luca De Martini
Giorgio Natale Buttiglieri
Alessandro Margara

The serverless paradigm has gained significant traction for cloud applications, offering scalability while offloading infrastructure management and resource provisioning to providers. However, its adoption introduces a shift in programming models, adding complexity to software development. In Function-as-a-Service (FaaS), functions are stateless, requiring developers to manage external storage, concurrency, and state consistency, diverting focus from business logic. This paper presents Histrio, a programming model and execution environment that simplifies developing stateful FaaS applications. Built on the actor programming model, Histrio abstracts state management, database interaction, and concurrency, enhancing the actor model to optimize storage interactions. It ensures exactly-once processing, masking failures and guaranteeing consistent behavior. Compared to traditional FaaS implementations, Histrio reduces development complexity by minimizing operational code, while remaining scalable and configurable to balance performance and costs.

Vertical Scaling Can Save Time: Optimizing Container Scheduling to Handle Sudden Bursts

Aristotelis Peri
Michail Tsenos
Vana Kalogeraki

Serverless computing is a cloud computing model that has seen substantial growth in recent years. This model allows developers to focus only on their application development by offering a user-friendly programming interface, while all the complexity of the underlying infrastructure is taken care of by the platform, creating the illusion of always available resources. However, due to its dynamic nature, serverless computing faces challenges when handling sudden bursts of requests that can lead to increased latency as the platform dynamically allocates resources to meet the unexpected workload. The provisioning delay, known as "cold start" can negatively impact performance. Many approaches have been proposed to tackle this problem, including "pre-warming," where instances with preloaded runtime environments are set up in advance. However, "pre-warming" requires prediction models, which may struggle with unpredictable traffic or lead to resource wastage which is problematic, especially in edge environments with limited resources. Additionally, applications with complex dependency loading can still experience delays. To address this challenge, we propose a novel scheduling approach for serverless applications that utilizes vertical scaling. With vertical scaling, instead of creating new instances of the application functions we update existing instances by increasing their size (CPU and memory) to handle the burst of requests. This approach bypasses the limitations of "pre-warming" as it utilizes already loaded instances. To enhance the effectiveness of this method, we have also developed a resource prediction model that incorporates queuing theory and Holt’s forecasting technique. We have implemented this approach in our serverless orchestration framework, and our detailed experimental results highlight the efficiency and performance advantages of our approach compared to state of the art methods.

Reducio: Data Aggregation and Stability Detection for Industrial Processes Using In-Network Computing

Liam Tirpitz
Ike Kunze
Philipp Niemietz
Anna Kathrin Gerhardus
Thomas Bergs
Klaus Wehrle
Sandra Geisler

Modern manufacturing environments handle increasing numbers of raw data streams that carry large volumes of data. Searching this raw data for anomalous events, such as faults, failures, or degrading product quality, across multiple data sources can help engineers optimize the underlying manufacturing processes. Yet, facilitating corresponding analyses on the massive data streams is challenging: it requires powerful processing platforms, but strict latency requirements or limited bandwidths often make sending the full raw data to suitable, centralized locations (in the cloud or locally) impossible. As a middle ground, the processing logic for finding relevant events can also be placed on the edge, close to the data source, but this still requires high-speed compute capabilities.

In this paper, we show that in-network computing provides an effective solution to this dilemma. In particular, we design Reducio, which detects relevant events directly on the data path and dynamically adapts where and in which resolution data is forwarded for further analysis. Underneath, Reducio leverages the process semantics of clocked manufacturing processes to first aggregate raw data streams across multiple sensors and independent machines on a shop floor. It then uses the aggregates to detect anomalous events and assess the stability of the underlying processes to switch between data resolutions and identify machine or sensor malfunctions. We demonstrate the practicality of Reducio by applying a Tofino prototype to the clocked process of fineblanking in several experiments, which reveal that Reducio can detect instabilities in a timely manner while reducing the data volumes by up to 90% without losing important process information.

Detecting Continuous Group Movement Patterns in Densely Populated Areas

Andreas Morgen
Bernhard Seeger

Today, smart mobile devices are omnipresent and constantly generating massive spatio-temporal data streams. Among the challenging tasks in stream processing is the continuous derivation of higher-level knowledge, such as detecting movement patterns of objects and groups of objects with low latencies. In spatial stream processing, existing approaches to detecting groups are limited in their applicability because they are either designed for processing historical data only or address a simplified problem setting that is not sufficient for detecting continuously connected groups within densely populated areas. To address these deficiencies, we introduce a novel low-latency pattern-matching approach to group detection by combining recent results from complex event processing, spatial query processing, and community discovery. First, we formally introduce an expressive Complex Event Processing (CEP) operator offering continuous connectedness and life cycle management for group detection. Then, we propose efficient algorithms and various optimization strategies to improve throughput and overall latency. The results of our experiments confirm the efficiency of our approach and the positive effects of our optimization strategies. For cases where comparisons are possible, we report the results compared to existing approaches.

Impact of Network Topologies on Blockchain Performance

Vincenzo P. Di Perna
Marco Bernardo
Francesco Fabris
Sebastiao Amaro
Miguel Matos
Valerio Schiavoni

Since blockchains are increasingly adopted in real-world applications, it is of paramount importance to evaluate their performance across diverse scenarios. Although the network infrastructure plays a fundamental role, its impact on performance remains largely unexplored. Some studies evaluate blockchain in cloud environments, but this approach is costly and difficult to reproduce. We propose a cost-effective and reproducible environment that supports both cluster-based setups and emulation capabilities and allows the underlying network topology to be easily modified. We evaluate five industry-grade blockchains – Algorand, Diem, Ethereum, Quorum, and Solana – across five network topologies – fat-tree, full mesh, hypercube, scale-free, and torus – and different realistic workloads – smart contract requests and transfer transactions. Our benchmark framework, Lilith, shows that full mesh, hypercube, and torus topologies improve blockchain performance under heavy workloads. Algorand and Diem perform consistently across the considered topologies, while Ethereum remains robust but slower.

BlockFed: A Novel Federated Learning Framework Based On Hierarchical Weighted Aggregation

Reza Nourmohammadi
Kaiwen Zhang
Chamseddine Talhi

Federated learning (FL) has emerged as a promising paradigm for decentralized machine learning, enabling clients to collaboratively train models while keeping their data private. However, a key challenge in FL is the centralized aggregation of model updates, which can lead to inefficiencies and vulnerabilities, especially when data privacy is critical. This study presents a pioneering federated learning framework, BlockFed, which leverages a novel hierarchical aggregation approach to empower clients in collaboratively generating a global model through multiple levels of aggregation. A unique role definition mechanism is integrated to delineate clients’ roles and tasks in each learning round. Additionally, BlockFed incorporates the Particle Swarm Optimization (PSO) algorithm to solve an optimization problem for determining optimal weights in the weighted averaging aggregation, enabling faster convergence. To ensure secure and decentralized storage, IPFS and blockchain technologies are used to store local models and their corresponding hash pointers. The efficacy of BlockFed is evaluated using a genomic breast cancer dataset sourced from the GDC portal, achieving a remarkable 98% accuracy for the global model and demonstrating enhanced accuracy and convergence speed over the original framework.

Fenics: A Modular Framework for Security Evaluation in Decentralized Federated Learning

Shubham Saha
Sifat Nawrin Nova
Romaric Duvignau
Carla Fabiana Chiasserini

This paper presents Fenics, a modular framework for evaluating the resilience of Decentralized Federated Learning (DFL) networks under adversarial conditions. As a nascent field, DFL raises security challenges in decentralized network settings under adversarial behaviors. To our knowledge, Fenics is the first fully open-source framework of its kind, enabling user-defined topologies, multiple communication protocols, and customizable attack models to study how malicious node placement affects network performance. It integrates core components of DFL, including data distribution, dynamic node participation, and aggregation to establish the DFL architecture. We demonstrate the framework’s capabilities through different use cases under poisoning and delay attacks using the FashionMNIST dataset. The results validate its capability to reveal how node placement affects performance and expose network vulnerabilities. For example, poisoning attacks exhibit topology-dependent impacts, with accuracy dropping by over 55% in certain scenarios, leading to derailed convergence. Additionally, the extensive logging features of the framework enable post-simulation analysis and insightful interpretation. Its modular architecture, simple deployment, and customizable options make it a lightweight yet useful tool for in-depth research on DFL network security.

Practical Secure Aggregation by Combining Cryptography and Trusted Execution Environments

Romain de Laage
Peterson Yuhala
François-Xavier Wicht
Pascal Felber
Christian Cachin
Valerio Schiavoni

Secure aggregation enables a group of mutually distrustful parties, each holding private inputs, to collaboratively compute an aggregate value while preserving the privacy of their individual inputs. However, a major challenge in adopting secure aggregation approaches for practical applications is the significant computational overhead of the underlying cryptographic protocols, e.g. fully homomorphic encryption. This overhead makes secure aggregation protocols impractical, especially for large datasets. In contrast, hardware-based security techniques such as trusted execution environments (TEEs) enable computation at near-native speeds, making them a promising alternative for reducing the computational burden typically associated with purely cryptographic techniques. Yet, in many scenarios, parties may opt for either cryptographic or hardware-based security mechanisms, highlighting the need for hybrid approaches. In this work, we introduce several secure aggregation architectures that integrate both cryptographic and TEE-based techniques, analyzing the trade-offs between security and performance.

SESSION: INDUSTRY TRACK

IgNITE: Scheduling Pipeline-Parallel DNN Training Jobs on Heterogeneous Infrastructures

Phivos Dadamis
Dimitrios Tomaras
Vana Kalogeraki
Dimitrios Gunopulos

Deep Neural Networks (DNNs) have received significant attention in diverse domains such as finance, autonomous systems, text analytics and healthcare. However, DNN training is still dominated by memory-bound operations, making efficient execution on heterogeneous GPU clusters under timing constraints fundamentally challenging. In this work, we present IgNITE, a framework for scheduling pipeline-parallel DNN training jobs on heterogeneous edge clusters, leveraging batch partitioning and memory availability to optimize system throughput and efficiency. IgNITE features a batch partitioning scheme that dynamically splits the batches to fit device memory constraints and a distributed scheduling algorithm that jointly optimizes scheduling to satisfy both memory and end-to-end training time objectives. Our evaluation demonstrates that IgNITE outperforms state-of-the-art approaches by reducing the total memory requirements and improving time efficiency on heterogeneous clusters.

Efficient Compression for Data Analysis in IoT-Aware Process Orchestrations based on Time Series Point Cloud Reduction

Nataliia Klievtsova
Juergen Mangler
Stefanie Rinderle-Ma

IoT-aware process orchestrations can interlink thousands of interacting services and a plethora of parameters reflecting physical properties such as wear of tools or temperature, that influence the quality of the process outcome. The parameters are either available as input/output data flow of process tasks or as sensor data flow which is independently collected and associated with the control flow of a process. Depending on the task and the involved sensors, the velocity, volume, and variety of collected data can be high, which has a substantial impact on how efficiently data can be logged and analyzed. For runtime analysis, especially in edge service orchestrations (i.e., close to machines or patients for privacy and latency reasons) it is imperative not being overwhelmed by data. This is especially true for predictive analytics, that for example in manufacturing often deal with several 10.000 data-points per second. This paper introduces four novel approaches to compress sensor data associated with and contained in process logs and process event streams, specifically focusing on multi-gigabyte point clouds, i.e., sets of data points in n-dimensional space, with the goals of (1) providing runtime detection, prevention and correction of anomalies while (2) reducing the amount of data by several magnitudes. The proposed compression approaches are evaluated based on two different datasets from real-world production scenarios to see how they perform at a typical manufacturing analysis task, both in comparison with each other and non-compressed data.

A Tale of Many Streams: Characterizing a Hybrid Batch-Stream Production Workload in Digazu, a Data Lake supported by Apache Kafka and Flink

Donatien Schmitz
Luc Berrewaerts
Guillaume Rosinosky
Sabri Skhiri
Etienne Rivière

Many industries rely on analyzing large volumes of combined historical and live data. A data lake facilitates these operations by supporting an integrated data ingestion, storage, replay, and analysis workflow.

A modern data lake is distributed and combines a processing engine, able to seamlessly process large volumes of existing data as well as continuous flows of new data, such as Apache Flink, with a storage infrastructure able to ingest and replay this data, such as Apache Kafka.

This use of Flink in this setting departs from the commonly agreed model of stream processing queries operating over windows of events, maintaining a bounded and relatively small state per operator. Instead, hybrid batch-stream queries typically process an existing data set in its entirety before updating results with incoming stream data, leading to a large accumulated state. Given the industry’s importance of such usages, understanding their characteristics and how they differ from common assumptions in designing and evaluating stream processing systems is of utmost importance.

We present in this paper the analysis of a large-scale hybrid batch-stream workload collected from a production deployment of Digazu, a modern data lake building upon Kafka and Flink. We characterize 142 different sources of data and 129 hybrid batch-stream queries. Our analysis offers valuable insights into the nature of data and queries in typical data lake deployment, which will assist designers of such systems and associated benchmarks.

Bridging the Gap between CARISMA and SOME/IP-based Service Communication

Kevin Klein
Pascal Hirmer
Alexander Walz
Steffen Becker

Nowadays, the amount of software in modern cars as well as its complexity increases rapidly due to applications such as autonomous driving. Consequently, more powerful computing platforms are employed and software applications are developed in a distributed manner, e.g., by applying the widely established microservice pattern. Microservices introduce additional complexity and challenges regarding the communication between the individual services. In previous work, we introduced the CARISMA approach, which addresses these challenges and is specifically designed for application in the automotive domain. The main idea of the CARISMA approach is to employ a service mesh architecture that is adapted to the design of future electric/electronic (E/E) architectures. Initially, we focused on supporting gRPC-based inter-service communication. In this paper, we enhance CARISMA’s applicability to the automotive domain by enabling communication with SOME/IP-based services. SOME/IP is a protocol for inter-service communication widely adopted within the automotive domain. This allows our approach to be integrated with services running on established automotive platforms. We evaluate our concept by conducting a case study and benchmark based on a prototypical implementation.

In-situ Porosity Detection in Additive Manufacturing

Erik Sievers
Marina Papatriantafilou
Vincenzo Gulisano
Eduard Hryha
Lars Nyborg
Zhuoer Chen

Additive Manufacturing (AM) is a rapidly growing technology with applications in aerospace, automotive, and medical industries. Scalable AM requires in-situ quality monitoring to detect defects promptly. However, in-situ monitoring introduces scalability challenges due to high data volumes, rapid acquisition rates, and strict latency requirements. We introduce Hephaestus, a continuous in-situ monitoring system for data streams from optical monitoring sensors, able to detect porosity risks promptly and to balance accuracy and timeliness by adjusting the window of data used for porosity detection. Using data from two builds, we study this trade-off and the method’s cost-benefit towards early cancellation decisions.

SESSION: GRAND CHALLENGE TRACK

The DEBS 2025 Grand Challenge: Real-Time Monitoring of Defects in Laser Powder Bed Fusion (L-PBF) Manufacturing

Luca De Martini
Jawad Tahir
Alessandro Margara
Christoph Doblander
Sebastian Frischbier
Ruben Mayer
Hans-Arno Jacobsen

The DEBS 2025 Grand Challenge (GC) focuses on real-time defect detection in Laser Powder Bed Fusion (L-PBF) additive manufacturing. Participants must analyze streaming optical tomography images to identify temperature anomalies and cluster potential defects. The challenge defines a structured pipeline involving thresholding, temporal windowing, outlier analysis, and clustering. Solutions are evaluated using Challenger 2.1, a Kubernetes-based platform featuring REST APIs, containerized deployment, and fault-tolerance evaluations via Chaos Mesh. We also introduce Local-Challenger for local development and provide a Python reference implementation. Lastly, we report on the statistics of this year’s GC. This edition of the GC advances the state-of-the-art in defect detection in manufacturing processes by fostering the development of high-performant, real-time, and fault-tolerant cloud-native solutions.

Real-Time Thermal Defect Detection in Additive Manufacturing with Apache Flink

Zishuo Liu
Jingzhi Yan
Yixin Lao

The DEBS 2025 Grand Challenge addresses real-time defect detection in Laser Powder Bed Fusion (L-PBF) additive manufacturing. Participants must analyze a continuous stream of optical tomography images (per-layer temperature maps) to identify potential defects during the build process. In this paper, we present two stream-processing optimization based on Apache Flink system design that detect defects on-the-fly via local outlier analysis and clustering. The first optimization, termed the offset optimization, precomputes static neighbor offsets to accelerate the calculation of local neighborhood mean temperatures for each pixel. The second optimization, termed the Beta optimization, further improves performance by caching the mean temperatures of “close” neighboring points and updating only those pixels affected when the 3-layer sliding window advances. Implemented in Java on Flink, our approach significantly outperforms the baseline Python implementation provided by the challenge organizers, achieving up to 11× lower latency and 25.3× higher throughput on the challenge dataset. These optimizations demonstrate that real-time in situ defect detection for L-PBF is feasible with substantial performance gains over naive approaches.

Flink Solution for ACM DEBS Grand Challenge 2025

Owen Mariani
Prathmesh Sonawane
Tanish Bhowmick

The DEBS Grand Challenge 2025 focuses on creating a real-time monitoring system for defects during Laser Powder Bed Fusion (L-PBF) additive manufacturing. We present a distributed, event-driven solution built on Apache Flink that ingests streams of 16-bit optical tomography images, computes local temperature deviations, and clusters them to highlight possible defect regions. Deployed on a Flink cluster under Kubernetes, our system processes about 7.07 batches of images per second with an end-to-end latency around 6.12 seconds, enabling intervention in the printing process.

SESSION: POSTERS AND DEMOS

ATOS, An Open-Source Platform for Testing AI-Based Automated Vehicle Systems in Integrated Simulated and Physical Scenarios

Victor Jarlow
Lukas Wikander
Samuel Thorén
Robert Brenick
Xianbiao Hu
Timo Kero

The increasing complexity of automated and AI-based vehicle systems necessitates novel testing methodologies to ensure safety, reliability, and performance in diverse and dynamic scenarios. This paper introduces AV Test Operating System (ATOS), an open-source platform that seamlessly integrates simulated and physical testing environments for Advanced Driver-Assistance Systems and Automated Driving systems. AV Test Operating System (ATOS) supports comprehensive scenario evaluations under varied conditions such as weather, traffic, and connectivity environments by automating and orchestrating tests involving multiple simultaneous virtual and physical objects. Key contributions of this work include: 1) The introduction and design of AV Test Operating System (ATOS), an open-source platform for automating and orchestrating complex vehicle testing scenarios. 2) A method for evaluating the repeatability of automated test orchestration systems. 3) Data analysis of two scenarios where AV Test Operating System (ATOS) was used in real-life AV testing. Results demonstrate AV Test Operating System (ATOS)’s effectiveness in executing repeatable and reliable tests across diverse configurations, highlighting its utility for research institutions, vehicle manufacturers, and testing facilities. Using AV Test Operating System (ATOS) in repeated runs, the time delay when triggering dynamic events was within 1 millisecond with virtual objects and 7 milliseconds when using physical objects. The positional variation between runs using virtual and real-life objects amounted to 6 and 20 cm, respectively.

Towards a DSL for hybrid secure computation

Romain de Laage

Fully homomorphic encryption (FHE) and trusted execution environments (TEE) are two approaches to provide confidentiality during data processing. Each approach has its own strengths and weaknesses. In certain scenarios, computations can be carried out in a hybrid environment, using both FHE and TEE. However, processing data in such hybrid settings presents challenges, as it requires to adapt and rewrite the algorithms for the chosen technique. We propose a domain-specific language (DSL) for secure computation that allows to express the computations to perform and execute them using a backend that leverages either FHE or TEE, depending on what is available.

CNN and RVV Co-design for Efficient Model Serving

Sonia Rani Gupta
Nikela Papadopoulou
Jing Chen
Miquel Pericàs

Convolutional algorithm performance depends on layer dimensions, with SIMD demands and cache sharing influencing runtime selection. To identify the best settings, we perform a co-design exploration of convolutional layer parameters and three algorithms: Direct, im2col+GEMM, and Winograd, jointly with hardware parameters for RISC-V vector architectures. Our results show that incorporating hardware parameters with layer dimensions boosts execution time and efficiency, emphasizing the need for co-design.

VEnOM: A Vector Embedding Operator Modelling Framework

Andreas Loizou
Nikolaos Andriotis
Dimitrios Tsoumakos

The massive increase in available data sources among organisations has generated new challenges for scalable and efficient data analytics. Selecting the highest-quality datasets for a specific analytics task can be cumbersome, especially when the number of available inputs is very large. In this demonstration, we present VEnOM, a modular modelling system that addresses this challenge: Through VEnOM, users have high-precision predictions of the result of an analytics operator for a random input dataset at hand, without actually executing it. VEnOM leverages dataset similarity and adaptive modelling in order to accurately infer operator outputs for heterogeneous such operators and dataset types through its modular design. In this demonstration, we showcase the modelling of multiple operators from the domain of machine learning and graph analytics that receive tabular and graph datasets as input.

V-Mon: Scalable and Fault-Tolerant Stream Processing Pipeline for Monitoring Vehicular Data Validity

Carl-Magnus Wall
Måns Josefsson
Martin Hilgendorf
Marina Papatriantafilou
Binay Mishra

With constant increases in edge devices in industry settings, increases in data rates naturally follow. However, with high, unbounded data rates, traditional (store-then-process) database procedures and batch-based processing are struggling to remain performant. To this end, processing streams of data continuously is an increasingly appealing approach, targeting low latency, high scalability and real-time data processing. This work examines design considerations as well as performance trade-offs for a stream processing pipeline targeting stateful analysis. The pipeline implementation employs Apache Kafka, Apache Flink and Apache Druid, and is studied through an example use case at Volvo Trucks, focusing on signal data set validity analysis. Performance evaluation of the pipeline reveals that the throughput requirements of the use case are satisfied, while also achieving sub-second latencies and offering a degree of fault tolerance. The pipeline also shows promise of adapting well to different levels of scale, providing enough headroom for a tenfold increase in data volumes over current demands. Further, the extensible nature of the pipeline enables the support of various feature extraction methods, e.g., data synopsis and sketching, and alternative data representations, e.g., knowledge graphs.

StreamPlane: Flexible Reconfiguration of Streaming Application Runtime Topology

Takdir
Salman Ahmed Shaikh
Akiyoshi Matono
Hiroyuki Kitagawa
Toshiyuki Amagasa

Modern distributed stream processing engines (DSPEs) execute tasks in parallel and utilize their built-in peer-to-peer data transfer to deliver intermediate results across distributed nodes. When systems need reconfiguration to adapt to the underlying infrastructure, such as migrating operators from one node to another, backup-and-restore procedures must be conducted, which may cause existing systems to halt. The rapid increase in continuously generated streaming data requires resilient and flexible systems that cannot afford such delays. Additionally, the topology may involve heterogeneous, multi-cluster computing environments, such as edge and cloud servers. Addressing these issues is impractical in current streaming systems. We present StreamPlane to enable flexible and seamless control and reconfiguration of streaming topology on the fly. The key idea is to leverage a distributed in-memory cluster as the control plane on top of the existing DSPEs. StreamPlane maintains runtime information to control the execution of distributed streaming tasks. StreamPlane delivers and temporarily stores in-transit data between nodes to facilitate efficient reconfiguration. Furthermore, it keeps the computational states accessible during reconfiguration. In this demonstration, we present our prototype to attendees to showcase our achievements with the proposed scheme.

Towards Stream-Based Monitoring for EVM Networks

Emanuel Onica
Claudiu-Nicu Bărbieru
Andrei Arusoaie
Oana-Otilia Captarencu
Ciprian Amariei

We believe that leveraging real-time blockchain operational data is of particular interest in the context of the current rapid expansion of rollup networks in the Ethereum ecosystem. Given the compatible but also competing ground that rollups offer for applications, stream-based monitoring can be of use both to developers and to EVM networks governance. In this paper, we discuss this perspective and propose a basic monitoring pipeline.

Important Dates

Events	Dates (AoE)
Research Papers
Abstract Submission	~~January 31st, 2025~~ February 10th, 2025
Paper Submission	~~February 7th, 2025~~ February 17th, 2025
Rebuttal (start)	March 21st, 2025
Rebuttal (end)	March 28th, 2025
Notification	April 4th, 2025
Camera Ready	May 2nd, 2025
Submission Dates
Industry and Application Papers	~~March 23rd, 2025~~ April 6th, 2025
Tutorials and Workshops	February 28th, 2025
Posters and Demos	~~April 11th, 2025~~ April 18th, 2025
Grand Challenge Short Paper	May 2nd, 2025
Doctoral Symposium	~~May 16th, 2025~~ May 23rd, 2025
Notification Dates
Tutorials and Workshops	March 14th, 2025
Industry and Application Papers	~~April 21st, 2025~~ April 27th, 2025
Posters and Demos	April 25th, 2025
Doctoral Symposium	~~May 23rd, 2025~~ May 30th, 2025
Camera Ready
Industry and Application Papers	May 2nd, 2025
Posters and Demos	May 2nd, 2025
Tutorials and Workshops	May 2nd, 2025
Grand Challenge	May 9th, 2025
Grand Challenge Platform
Registration	December, 2024
Platform Opens	February 15th, 2025
Platform Closes	May 9th, 2025
Conference
Conference	June 10th–13th 2025

PROCEEDINGS