Data Volume Recommendation

I have a large Postgres database that I’m building into a Docker container.

Right now, I’m downloading a compressed database dump and restoring it during the docker build process, so a 1.7 GB compressed file turns into a Docker image of about 40 GB.

The downside, of course, is that deploying the Docker image or pulling it from DTR is super slow.

I was wondering if anyone had any suggestions on how to better manage the volume data.

This is my current Dockerfile.

FROM internal_db:base

COPY *.Fc /tmp/sql/
COPY /tmp/sql/

RUN /usr/pgsql-10/bin/pg_ctl -D /postgresql/pg_data/10/ start && \
  /tmp/sql/ /tmp/sql && \
  /usr/pgsql-10/bin/pg_ctl -D /postgresql/pg_data/10/ stop

USER root

RUN /usr/bin/rm -Rf /tmp/sql

USER postgres

The script it calls does a few sanity checks and then ends up running this line, which does most of the work:

pg_restore -p 5432 --dbname=postgres -Fc --create --verbose --jobs=4 --no-tablespaces /tmp/sql/akdb-extract-for-docker.Fc

On the one hand, I like the fact that the data is in a container, because I can take advantage of the image reset to restore the DB data to its original state. At the same time, having a container that large seems like an anti-pattern.

Does anyone have a better suggestion on how to do this? (This has also caused serious issues in the past, with Docker VMs running out of space/memory and so on while building the image.)

Is there any particular reason why you aren’t storing the Postgres data in a persistent volume/mount? In this situation, I would usually have the image form a basic database server, and I’d import the data by calling the Postgres import commands (via docker run).

Then, to add persistence, I would mount a persistent volume/host folder to the postgres data directory (SHOW data_directory;).
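
A minimal sketch of that suggestion, not a tested setup. The image name, data directory, and pg_restore line are taken from the first post; the volume name `pgdata`, the container name `akdb`, and the host directory `./dumps` are hypothetical. It also assumes the base image starts the Postgres server as its default command. When a new, empty named volume is mounted over a directory that already has content in the image, Docker copies that content into the volume first.

```shell
#!/usr/bin/env bash
# Sketch only: VOLUME, CONTAINER and the ./dumps directory are hypothetical.

IMAGE="internal_db:base"            # base image from the Dockerfile above
DATA_DIR="/postgresql/pg_data/10"   # data directory passed to pg_ctl above
VOLUME="pgdata"                     # hypothetical named volume
CONTAINER="akdb"                    # hypothetical container name

# Start the server with its data directory on a named volume, and make the
# dump available read-only inside the container.
# Assumption: the image's default command starts Postgres.
start_db() {
  docker volume create "$VOLUME"
  docker run -d --name "$CONTAINER" \
    -v "$VOLUME:$DATA_DIR" \
    -v "$PWD/dumps:/tmp/sql:ro" \
    "$IMAGE"
}

# Restore the dump into the running server. The restored data ends up in
# the volume, not in an image layer.
restore_dump() {
  docker exec "$CONTAINER" pg_restore -p 5432 --dbname=postgres -Fc --create \
    --verbose --jobs=4 --no-tablespaces /tmp/sql/akdb-extract-for-docker.Fc
}
```

Calling `start_db` and then `restore_dump` once would leave the image small, with the restored data living in the `pgdata` volume across container restarts.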

Or have I misunderstood what you’re trying to do?

So, a few things. Please correct me on any of this if I’m just being naive or misunderstanding your suggestion.

  1. Loading the data can take anywhere from 20-30, which is why it’s currently part of the image.

  2. By persistent volume, I assume you just mean declaring it in the image with the VOLUME keyword and then mounting it via the -v parameter or the appropriate docker-compose setting.

I know Docker has refined volumes over time, but I believe it’s still essentially mapping a local directory to a folder in the container.

My issue with volumes is that if I want to ‘reset’ the image to its original state, I would have to rebuild the image once more to restore it. Is there support for an immutable volume? Or basically: volume state 0 + delta, where a reset wipes the delta and reverts to state 0?

Thanks for any help, and I hope this clears things up a bit.

You can always mount the volume with the :ro flag, to make it read only, but that would make it impossible to update the database in that location. (You could write new transactions elsewhere, though, to make it possible to always start at point-zero, then quickly rebuild previous changes.)

In any event, you definitely do NOT want to include 40GB of data in a Docker image! That’s what the whole volume system was designed to avoid. No one wants to push/pull/load/deal with an image of that size. (Including you, from the sound of your original post!)
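
One way the read-only idea could be combined with a resettable working copy, as a sketch under assumptions rather than a tested recipe: keep the restored data in a seed volume that is only ever mounted with :ro, and copy it into a throwaway working volume whenever a reset is needed. The volume names `pgdata_seed` and `pgdata_work` and the container name `akdb` are hypothetical. The sketch also assumes the seed was captured while the server was cleanly stopped, that the image has GNU coreutils, and that the image has no entrypoint that would swallow the `sh -c` command.

```shell
#!/usr/bin/env bash
# Sketch only: volume and container names are hypothetical.

IMAGE="internal_db:base"            # base image from the Dockerfile above
DATA_DIR="/postgresql/pg_data/10"   # data directory from the Dockerfile above
SEED_VOLUME="pgdata_seed"           # hypothetical: holds the restored "state 0"
WORK_VOLUME="pgdata_work"           # hypothetical: throwaway working copy
CONTAINER="akdb"                    # hypothetical container name

# Throw away the working copy and re-create it from the seed.
reset_db() {
  docker rm -f "$CONTAINER" >/dev/null 2>&1 || true
  docker volume rm -f "$WORK_VOLUME"
  docker volume create "$WORK_VOLUME"
  # The seed is mounted read-only, so this step cannot modify it.
  # Running as root lets cp -a keep file ownership; the chown/chmod calls
  # give the volume root the same owner and mode as the seed's root.
  docker run --rm --user root \
    -v "$SEED_VOLUME:/seed:ro" \
    -v "$WORK_VOLUME:/work" \
    "$IMAGE" sh -c 'cp -a /seed/. /work/ && chown --reference=/seed /work && chmod --reference=/seed /work'
}

# Start Postgres on the writable working copy.
start_db() {
  docker run -d --name "$CONTAINER" -v "$WORK_VOLUME:$DATA_DIR" "$IMAGE"
}
```

A reset would then be `reset_db` followed by `start_db`. This is a file-level copy of the data directory, so it avoids re-running pg_restore, but it still has to copy the full data set each time.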

Yeah, the reason I’m building the image locally is that it’s faster, and a smaller download, to fetch an extract and build the image locally than to push/pull the data.

If we can write new transactions to a different location and wipe that out, that’d be great, though I have no clue how that would be accomplished. I’m not sure of a Docker or Postgres way of doing that.
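
On the Postgres side, one possible approach (an assumption, not something proposed in the thread) is to keep the restored data as a pristine database that nothing writes to, and clone a working database from it with CREATE DATABASE ... TEMPLATE. Resetting then means dropping the working database and cloning it again. The container name `akdb` and the database names `akdb_pristine` and `akdb_work` are hypothetical, and the sketch assumes psql is on the PATH inside the container.

```shell
#!/usr/bin/env bash
# Sketch only: container and database names are hypothetical.

CONTAINER="akdb"             # hypothetical container running Postgres
PRISTINE_DB="akdb_pristine"  # hypothetical: restored once, never modified
WORK_DB="akdb_work"          # hypothetical: the database applications use

# Drop the working database and clone it again from the pristine one.
# Both statements fail while other sessions are connected to the database
# being dropped or to the template, so clients must disconnect first.
reset_work_db() {
  docker exec "$CONTAINER" psql -p 5432 --dbname=postgres \
    -c "DROP DATABASE IF EXISTS $WORK_DB" \
    -c "CREATE DATABASE $WORK_DB TEMPLATE $PRISTINE_DB"
}
```

Cloning from a template is a file-level copy inside the server, so the data volume can stay mounted and the server can keep running during a reset.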