TELKA: Twin-Enhanced Learning for Kubernetes Applications

Zaccarini, Mattia; Poltronieri, Filippo; Stefanelli, Cesare; Tortonesi, Mauro
2024

Abstract

Chaos engineering is the discipline of injecting computing and network faults, such as increased network latency and unavailable computing nodes, into an IT system to help developers identify and address problems that could arise in a production environment. Several tools have emerged to ease the application of chaos engineering to complex IT systems, particularly microservice and container-based applications deployed on Kubernetes. However, applying such tools requires several phases, from defining a steady state to establishing an effective response plan in case something goes wrong. To ease the use of chaos engineering for improving the resilience of Kubernetes applications, this work presents TELKA (Twin-Enhanced Learning for Kubernetes Applications), a smart scheduler for Kubernetes that combines chaos engineering, Digital Twin (DT), and Reinforcement Learning (RL) methodologies to mitigate the effects of computing and network faults. Instead of interacting directly with the physical Kubernetes application, TELKA learns by interacting with a digital twin, thus reducing the learning time and the operational costs of applying chaos engineering. Experimental results compare TELKA with other approaches to show its effectiveness in mitigating the adverse effects of injected faults.
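
The idea of learning against a twin rather than the live cluster can be illustrated with a minimal sketch. This is not the TELKA implementation: the `ToyTwin` class, its latency figures, the one-step tabular value update (the single-step special case of Q-learning), and every parameter below are hypothetical and chosen only for illustration. The twin injects a fault into one simulated node per episode, and the agent learns which node to place a workload on.

```python
import random


class ToyTwin:
    """Hypothetical stand-in for a digital twin of a small cluster (not TELKA's twin)."""

    def __init__(self, n_nodes=3, base_latency_ms=10.0, fault_penalty_ms=200.0, seed=0):
        self.n_nodes = n_nodes
        self.base = base_latency_ms
        self.penalty = fault_penalty_ms
        self.rng = random.Random(seed)
        self.faulty = 0

    def inject_fault(self):
        """Chaos step: degrade one randomly chosen node and return its index as the state."""
        self.faulty = self.rng.randrange(self.n_nodes)
        return self.faulty

    def place(self, node):
        """Simulate placing the workload on `node`; the reward is the negative latency."""
        latency = self.base + (self.penalty if node == self.faulty else 0.0)
        return -latency


def train(twin, episodes=2000, alpha=0.1, epsilon=0.1, seed=1):
    """Learn a value table q[state][action] from one-step episodes on the twin."""
    rng = random.Random(seed)
    q = [[0.0] * twin.n_nodes for _ in range(twin.n_nodes)]
    for _ in range(episodes):
        state = twin.inject_fault()
        if rng.random() < epsilon:
            # Explore: try a random placement.
            action = rng.randrange(twin.n_nodes)
        else:
            # Exploit: pick the placement with the highest estimated value.
            action = max(range(twin.n_nodes), key=lambda a: q[state][a])
        reward = twin.place(action)
        # One-step update: move the estimate toward the observed reward.
        q[state][action] += alpha * (reward - q[state][action])
    return q


def greedy_policy(q):
    """For each fault state, return the placement with the highest learned value."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


if __name__ == "__main__":
    twin = ToyTwin()
    policy = greedy_policy(train(twin))
    for faulty_node, chosen in enumerate(policy):
        print(f"fault on node {faulty_node}: place workload on node {chosen}")
```

Because every episode runs inside the simulated twin, the agent can try thousands of faulty placements without degrading a real application, which is the cost argument the abstract makes for twin-based learning.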

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2570493