I have an app, let's call it MyApp, that I would like to run in a cluster on, let's say, 3 nodes (but it could be 2, 5, 10, any number of nodes). Each instance of MyApp is independent of the other instances - each gets its own set of initial data, which needs to be stored in a DB so it can be accessed by MySecondApp.
So, to summarize,
1. There is some initial data in the DB; it is accessed by MySecondApp.
2. MySecondApp can operate on the initial data: add/update/delete.
3. There are instances of MyApp, and each instance gets part of the initial data; they are independent of each other. I was thinking that MySecondApp could read all the initial data and serve as a coordinator, but for that it needs to know which instances exist. Does Docker provide that? MySecondApp would get all instances of MyApp (a service registry) and initialize them one by one (see the sketch after this list). Should I implement this as a custom solution, or can I use some framework?
4. If the initial data is updated via MySecondApp, the changes need to be propagated to the MyApp instance that is operating on that data.
5. (Optionally) If additional data is added, I can create a new instance of MyApp that will handle the added data.
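To make the coordinator idea from point 3 concrete, here is roughly what I had in mind (just a sketch; `discover_instances()` is exactly the part I don't know how to implement, and the `/init` endpoint on MyApp is made up):

```python
import requests  # assumes each MyApp instance exposes a small HTTP API


def discover_instances():
    """Return the addresses of all running MyApp instances.
    This is the missing piece: does Docker (Swarm) provide it?"""
    raise NotImplementedError


def distribute(partitions):
    # Hand one partition of the initial data to each MyApp instance.
    instances = discover_instances()
    for address, partition in zip(instances, partitions):
        requests.post(f"http://{address}:8080/init", json=partition)
```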
I am now looking into Docker to see what features it provides for running my scenario. So far, I understand that by using Docker Swarm I can start 3 instances of MyApp on 3 nodes. However, I cannot find out how to provide each instance with its initial data. It seems to me that I cannot use Docker Configs for this, because MySecondApp should be able to update the initial data and the changes should be propagated to the correct instances of MyApp (those operating on the modified data).
Which of the items on the list above can be handled by Docker (Swarm)?
I am very new to all of this so any help would be great!
If you store the data in a relational database – a postgres or mysql container, say, or an external database not managed by Docker – then you don’t have this problem of moving data around. At startup time you’d give each of the application containers the host name of the database and some sort of partition identifier (a SQL schema name or a data set ID value) and every one would talk to the same database.
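As a minimal sketch (assuming Postgres via psycopg2, a hypothetical `initial_data` table with a `data_set_id` column, and `DB_HOST`/`DATA_SET_ID` environment variables passed to each container at startup):

```python
import os

import psycopg2  # assumes Postgres; any SQL database works the same way

conn = psycopg2.connect(host=os.environ["DB_HOST"], dbname="myapp",
                        user="myapp", password=os.environ["DB_PASSWORD"])

# Each MyApp instance only ever reads the rows of its own partition.
with conn.cursor() as cur:
    cur.execute("SELECT payload FROM initial_data WHERE data_set_id = %s",
                (os.environ["DATA_SET_ID"],))
    rows = cur.fetchall()
```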
No. If you merely need to know about the data sets that exist, and they’re all in one central database, you could do an SQL query and get them. If you need to know about the actual worker processes, they could connect to the coordinator process to get directions. You could also use some sort of service discovery system or configuration store (Hashicorp’s Consul; CoreOS’s etcd; the Kubernetes Service subsystem; Apache ZooKeeper) to directly record what other processes exist.
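To illustrate the last option (a sketch using the python-consul client, assuming a Consul agent reachable at `consul:8500`; the service name, address, and port are made up):

```python
import consul

c = consul.Consul(host="consul", port=8500)

# Each worker registers itself at startup...
c.agent.service.register("myapp", service_id="myapp-1",
                         address="10.0.1.5", port=8080)

# ...and the coordinator asks Consul which workers are currently healthy.
index, entries = c.health.service("myapp", passing=True)
workers = [(e["Service"]["Address"], e["Service"]["Port"]) for e in entries]
```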
(This is a complicated problem with a lot of established practices, and I’d read up on distributed systems and/or pick a prebuilt library for what you’re doing before writing your own.)
Note that you’re not limited to one instance per node; if you have large nodes or your workers aren’t that busy it can be useful to run more instances than nodes.
Unfortunately, the solution with the DB alone won't work for me, because if the data gets updated, the change still needs to be propagated to the relevant MyApp instance. I saw that Docker Swarm provides service discovery, and I'm trying to understand how I would integrate that into my solution. Would that be sufficient for my needs, or would I still need one of the systems/configuration stores that you proposed?
None of the core Docker or Docker-related technologies will really help you with this. Possibly you’re looking for a messaging system like RabbitMQ or Apache Kafka that would let you push out notifications of updated data; I gather in Kafka it’s common to replicate entire data sets.
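For instance, with RabbitMQ you could publish each update to a topic exchange, routed by a data-set ID, and have each MyApp instance subscribe only to its own data set (a sketch using the pika client; the exchange name and routing-key scheme are just examples):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
channel = connection.channel()
channel.exchange_declare(exchange="data-updates", exchange_type="topic")

# MySecondApp publishes an update, keyed by the data set it belongs to:
channel.basic_publish(exchange="data-updates", routing_key="dataset.42",
                      body='{"id": 7, "value": "updated"}')

# Each MyApp instance (in its own process) binds a queue for its data set only:
queue = channel.queue_declare(queue="", exclusive=True).method.queue
channel.queue_bind(exchange="data-updates", queue=queue, routing_key="dataset.42")
channel.basic_consume(queue=queue, auto_ack=True,
                      on_message_callback=lambda ch, method, props, body: print(body))
channel.start_consuming()
```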
That is, instead of an architecture where you have a single master that knows who all of the workers are and pushes work to them, I feel like it would be easier to build and manage a system where you have a pool of workers that can ask for work. That “asking” vs. “telling” difference helps solve problems like figuring out what to do if you have an update for a particular datum but the worker who would normally handle it is unavailable.
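A sketch of the "asking" side, assuming the pending updates live in a Postgres `work_items` table with a `claimed_by` column (names hypothetical); `FOR UPDATE SKIP LOCKED` keeps two workers from claiming the same row:

```python
import os
import time

import psycopg2


def process(item):
    """Placeholder for whatever MyApp actually does with a work item."""
    print("processing", item)


conn = psycopg2.connect(host=os.environ["DB_HOST"], dbname="myapp")
worker_id = os.environ.get("HOSTNAME", "worker")  # container hostname as ID

while True:
    with conn, conn.cursor() as cur:
        # Atomically claim one unclaimed item; SKIP LOCKED makes concurrent
        # workers skip rows another worker is already claiming.
        cur.execute("""
            UPDATE work_items SET claimed_by = %s
            WHERE id = (SELECT id FROM work_items
                        WHERE claimed_by IS NULL
                        LIMIT 1 FOR UPDATE SKIP LOCKED)
            RETURNING id, payload""", (worker_id,))
        item = cur.fetchone()
    if item is None:
        time.sleep(5)  # nothing to do; ask again later
    else:
        process(item)
```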
There are two approaches I can think of: you can store the data as part of the image, which may make the image big, but at least the data is already there; or you can use a volume mount. Likely you would want it to be scalable. For initial data that is file-based you can use something like GlusterFS; for databases you can probably set up a cluster like https://hub.docker.com/r/mysql/mysql-cluster/
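If you go the volume route, you can attach the same mount to every replica when you create the service (a sketch using the Docker SDK for Python; the image name, volume name, and replica count are just examples):

```python
import docker
from docker.types import ServiceMode

client = docker.from_env()

# Start 3 replicas of MyApp, each with a shared data volume mounted
# read-only at /data. A GlusterFS-backed volume plugs in the same way.
client.services.create(
    image="myapp:latest",
    name="myapp",
    mounts=["myapp-data:/data:ro"],
    mode=ServiceMode("replicated", replicas=3),
)
```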
I would have thought that you want the data that relates to your app instances to be generated by those instances. Can't you generate (or load) the data set for MyApp as part of its start-up configuration/bootstrap? It would be relatively easy to identify a different initial data set for each instance if you need that, and you can create a stack in which MySecondApp depends on the MyApp instances, so that by the time MySecondApp starts, all the data is ready.
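A sketch of what that bootstrap could look like (everything here is hypothetical: the `DATA_SET_ID` environment variable tells each instance which data set is its, and `data_set_exists()`/`load_data_set()` stand in for whatever your storage layer provides):

```python
import os


def data_set_exists(data_set_id):
    """Hypothetical check, e.g. a SELECT against the shared database."""
    return False


def load_data_set(data_set_id):
    """Hypothetical loader: read a seed file, call an API, or generate data."""
    print("loading data set", data_set_id)


def bootstrap():
    # Runs once at container start, before MyApp begins serving.
    data_set_id = os.environ["DATA_SET_ID"]  # which data set this instance owns
    if not data_set_exists(data_set_id):
        load_data_set(data_set_id)


if __name__ == "__main__":
    bootstrap()
    # ...then start MyApp proper
```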