Let me briefly describe my project. I'm building containerized security training environments (future goal: randomized security environments), aimed at helping local students and organizations with their information security training needs.
Current setup: I have an instance group that autoscales according to load and runs a script to add nodes to and remove nodes from the swarm. Deployment requests arrive through Pub/Sub topics and are deployed with the docker stack deploy command. It was tested by 4-5 people and seemed to work perfectly until we started trials with students at my own college.
The problem: after 20-25 people had deployed onto the swarm, new deployments were no longer assigned port numbers. I don't understand why. Resource usage looked fine, yet swarm wasn't assigning ports. Only after I restarted the whole instance did swarm update the deployments with ports.
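To pin down which deployments are affected when this happens, something like the following could be run on a manager node. It is only a diagnostic sketch, not a fix: it compares the ports each service requests (`Spec.EndpointSpec.Ports` in the `docker service inspect` JSON) with the ports swarm has actually assigned (`Endpoint.Ports`). The function names are my own, and it assumes the services publish ports through swarm rather than some other mechanism.

```python
import json
import subprocess


def services_missing_ports(inspected):
    """Return names of services that request published ports but have fewer assigned.

    `inspected` is the list parsed from the JSON printed by
    `docker service inspect <id> [<id> ...]`.
    """
    missing = []
    for svc in inspected:
        name = svc["Spec"]["Name"]
        # Ports the service asked for (the published port may be left for swarm to pick).
        requested = svc["Spec"].get("EndpointSpec", {}).get("Ports") or []
        # Ports swarm has actually assigned to the service.
        assigned = svc.get("Endpoint", {}).get("Ports") or []
        published = [p for p in assigned if p.get("PublishedPort")]
        if requested and len(published) < len(requested):
            missing.append(name)
    return missing


def inspect_all_services():
    """Collect inspect data for every service in the swarm (needs a manager node)."""
    ids = subprocess.run(
        ["docker", "service", "ls", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not ids:
        return []
    out = subprocess.run(
        ["docker", "service", "inspect", *ids],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)
```

Calling `services_missing_ports(inspect_all_services())` right after a failed deployment would list the services that are still waiting for a port, which at least shows whether every new stack is affected or only some of them.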
I knew swarm has a task-history-limit option, which defaults to 5, and I thought that might be the issue and the reason it couldn't handle concurrent deployments. But the same thing happened again later (after 40 deployments), even after I set it to a higher number, on upgraded infrastructure and with low utilization in the logs. I'm still losing sleep over not knowing the real reason this is happening.
Sample deployment stack: Sample Scenario of how my deployment looks · GitHub. The aim is to deploy this environment on demand, with each environment isolated to its respective user.
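For context, the per-user deployment step can be sketched roughly like this. It is a simplified illustration rather than my exact script: the `lab-` prefix, the `stack.yml` file name, and the function names are placeholders. The idea is that each user gets their own stack name, so `docker stack deploy` creates that user's services and stack-defined networks under a separate prefix.

```python
import re
import subprocess


def stack_name_for(user_id, prefix="lab"):
    """Derive a stack name from a user id (hypothetical naming scheme)."""
    # Collapse anything that is not a lowercase letter or digit into a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", user_id.lower()).strip("-")
    if not slug:
        raise ValueError("user id has no usable characters")
    return f"{prefix}-{slug}"


def deploy_command(user_id, compose_file="stack.yml"):
    """Build the docker CLI invocation for one user's environment."""
    return [
        "docker", "stack", "deploy",
        "--compose-file", compose_file,
        stack_name_for(user_id),
    ]


def deploy_for_user(user_id, compose_file="stack.yml"):
    """Run the deployment and return the completed process for logging."""
    return subprocess.run(
        deploy_command(user_id, compose_file),
        capture_output=True, text=True,
    )
```

In my setup the Pub/Sub message handler would call something like `deploy_for_user(user_id)` for each incoming request and log the return code and stderr.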