EDIT: I figured out what was going on, and unless I’ve missed something, it’s pretty fatal to running “docker-for-azure” in production: the logs will consume all the containers’ disks and can’t be rotated. See my post below.
Expected behavior
After deleting a log file, the recreated file should contain only new log entries.
Actual behavior
After deleting a log file, it is recreated with all of its previous contents.
Additional Information
Steps to reproduce the behavior
Create a container that logs continuously to stdout and run it in the background
Mount the Azure log storage onto a VM (I used a non-Docker CentOS one)
Rotate/move/delete the log file, and check with “ls” that it is gone.
Observe that the file is recreated with all the previous contents.
e.g.
[master01:/m/l/manager000000] ls -l
total 512
-rwxrwxrwx. 1 root root 300 Mar 21 15:43 logtest-b83295ecc6e3
[master01:/m/l/manager000000] head logtest-b83295ecc6e3
2017-03-21T15:43:49+0000
For internal use only.
2017-03-21T15:43:53+0000
Few things are harder to put up with than the annoyance of a good example.
-- "Mark Twain, Pudd'nhead Wilson's Calendar"
2017-03-21T15:43:58+0000
I can live without
[master01:/m/l/manager000000] rm -f logtest-b83295ecc6e3
[master01:/m/l/manager000000] ls -l
total 0
... wait for new log msgs ....
[master01:/m/l/manager000000] ls -l
total 2
-rwxrwxrwx. 1 root root 1258 Mar 21 15:44 logtest-b83295ecc6e3
[master01:/m/l/manager000000] head logtest-b83295ecc6e3
2017-03-21T15:43:49+0000
For internal use only.
2017-03-21T15:43:53+0000
Few things are harder to put up with than the annoyance of a good example.
-- "Mark Twain, Pudd'nhead Wilson's Calendar"
2017-03-21T15:43:58+0000
I can live without
Note that the original file contents still exist.
I don’t know whether this is something to do with Azure storage, Docker buffering the content and recreating the file, or something else entirely. It’s very strange. Additionally, the file’s line count seems to vary randomly.
So far I haven’t seen any evidence of log rotation; the logs all seem to hang around forever on the storage container. And worse, as shown above, when attempting to manually rotate them by mounting the storage onto an external VM, they magically reappear!
So I’ve done a bit of investigation, and the logging as it currently stands is, quite frankly, bizarre. Unless I’ve missed something, it also wildly contradicts the statements on this project’s web pages about the logging being “self cleaning” and automatically managed.
Process
The containers’ output is routed via syslog to a logging container (docker4x/logger-azure).
This runs a python script that provides a syslog endpoint.
This script appends the log message to a file local to the container (in a volume mount), and then uploads the entire file to Azure storage with every single message! (A rough sketch of this pattern follows the list below.)
This means that:
All logs are stored indefinitely in the local volume mount, so it will rapidly run out of disk space.
As log files grow beyond a few MB, it will upload vast amounts of data every time a single line is written.
You can’t delete files from Azure storage; if you do, they just get recreated.
You can’t monitor files for changes (e.g. with Logstash or other tools), as each file is continuously being deleted and recreated.
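For illustration, the pattern described above boils down to something like the following. This is a minimal sketch of my understanding, not the actual server.py, and it assumes the legacy azure-storage Python SDK; SHARE_NAME, LOG_DIR and handle_syslog_message are hypothetical names:

```python
# Minimal sketch of the pattern described above -- NOT the actual server.py.
# Assumes the legacy `azure-storage` SDK's FileService; SHARE_NAME, LOG_DIR
# and handle_syslog_message are hypothetical.
import os
from azure.storage.file import FileService

file_service = FileService(account_name=os.environ["SA_NAME"],
                           account_key=os.environ["SA_KEY"])
SHARE_NAME = "manager000000"   # one share per host
LOG_DIR = "/log"               # local volume mount inside the logger container

def handle_syslog_message(container_name, message):
    local_path = os.path.join(LOG_DIR, container_name)

    # 1. Append the message to a local file that is never rotated or pruned...
    with open(local_path, "a") as f:
        f.write(message + "\n")

    # 2. ...then re-upload the ENTIRE file to Azure File storage for every
    #    single log line, which also recreates the file if it was deleted.
    file_service.create_file_from_path(SHARE_NAME, None, container_name, local_path)
```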
This is completely broken. What was supposed to be the easy task of clearing out our stale logs from Azure has forced us to re-evaluate the entire project, as there is no way we can go into production with something like this.
You can view the Python script that does this with: docker run -ti docker4x/logger-azure:azure-v1.13.1-ga-2 cat server.py
Ironically, it also contains the comment # @TODO implement cleanup for container…
@ddebroy, you seem quite informed about this project; would you be able to respond to this? Thanks!
Hi @ddebroy, thanks for the response. Do you have an ETA for fixing this? We’re quite close to moving our on-site web apps into our Azure swarm, and this is a major blocker. Alternatively, is there a way to replace the logging container with our own? I can obviously just stop the container and run something else on the port, but that won’t persist across reboots or updates.
@markvr We will most likely have a patch release out next week for 17.03, and we will address this issue there. Stopping the editions_logger container on all nodes of an Azure swarm is one temporary workaround. You can achieve this by running the following from one of the masters through swarm-exec, which will ensure the command gets executed as a global swarm service, i.e. on all existing and fresh nodes:
@ddebroy I’ve looked into some alternatives in case we need to replace this manually. I’m not a Python developer, but I think the main options would be either to use an append blob, which is better designed for this use case, or a combination of resize_file and update_range on the file, without the file ever hitting the local disk. Of course you then have the problem of what to do if the storage requests fail, but it’s probably better to drop logs than to have everything crash when the disks fill up.
You can’t mount blobs as file shares, which makes them harder to access and manage using standard tools, but they are better suited to the use case of rapidly appending data, such as logs.
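For reference, the append-blob option would look roughly like this. This is only a sketch, assuming the legacy azure-storage SDK’s AppendBlobService; LOG_CONTAINER and append_log are names I’ve made up:

```python
# Sketch of the append-blob alternative: append only the new line, never
# re-upload the file or touch local disk. Assumes the legacy `azure-storage`
# SDK; LOG_CONTAINER and append_log are made-up names.
import os
from azure.storage.blob import AppendBlobService

blob_service = AppendBlobService(account_name=os.environ["SA_NAME"],
                                 account_key=os.environ["SA_KEY"])
LOG_CONTAINER = "docker4azurelogs"

def append_log(container_name, message):
    blob_name = container_name  # e.g. "logtest-b83295ecc6e3"
    if not blob_service.exists(LOG_CONTAINER, blob_name):
        blob_service.create_blob(LOG_CONTAINER, blob_name)
    # Each call appends a block to the blob; existing content is never rewritten.
    blob_service.append_blob_from_text(LOG_CONTAINER, blob_name, message + "\n")
```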
Incidentally, it would be easier if all the files were in the same share in a flat list. Currently the files from each host are in a separate share, but really the host a container ran on is largely irrelevant, and having these as separate shares means we need to call the Azure API to find all the shares and then iterate through them to find the relevant log file. If all the files were simply in the same directory, with the filenames as they currently are (i.e. named after the service name and container/task ID), it would make finding all the logs for a particular service much simpler!
@markvr You are correct in your observation that blobs are hard to work with, so we are planning to stick to files for now.
The fix we are implementing is going to use resize_file and update_range to update the file. We will cache the latest logs in a “chunk” and, for each container, upload the chunk when its size grows beyond a certain threshold or after a certain timeout. We do not want to call Azure APIs for each log message, as that may cause chatty containers to force Azure into throttling other API calls (for health checks, for example) beyond a certain frequency. So we will use some disk space, but keep it bounded.
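Roughly speaking, the approach will look something like the sketch below (illustrative only; the exact names, chunk size and timeout are not final, and it assumes the legacy azure-storage FileService API):

```python
# Illustrative sketch of the planned chunked upload -- names, thresholds and
# details are not final. Assumes the legacy `azure-storage` FileService with
# resize_file / update_range.
import os
import time
from azure.storage.file import FileService

file_service = FileService(account_name=os.environ["SA_NAME"],
                           account_key=os.environ["SA_KEY"])
SHARE = "docker4azurelogs"
CHUNK_BYTES = 64 * 1024   # flush when the buffered chunk exceeds this size...
CHUNK_TIMEOUT = 5.0       # ...or when this many seconds have passed

class ChunkedLogWriter(object):
    def __init__(self, file_name):
        self.file_name = file_name
        self.buffer = b""
        self.offset = 0                    # current size of the remote file
        self.last_flush = time.time()
        file_service.create_file(SHARE, None, file_name, content_length=0)

    def write(self, message):
        self.buffer += message.encode("utf-8") + b"\n"
        if (len(self.buffer) >= CHUNK_BYTES
                or time.time() - self.last_flush > CHUNK_TIMEOUT):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        new_size = self.offset + len(self.buffer)
        # Grow the remote file, then write only the new chunk into the new range.
        file_service.resize_file(SHARE, None, self.file_name, new_size)
        file_service.update_range(SHARE, None, self.file_name, self.buffer,
                                  start_range=self.offset, end_range=new_size - 1)
        self.offset = new_size
        self.buffer = b""
        self.last_flush = time.time()
```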
As far as the layout of the files is concerned, we do need to distinguish between containers started with docker run (i.e. not swarm services) that have the same name on different nodes. We can keep the structure flat by appending the hostname of the node as a suffix to the name of the file.
That sounds like it will be a vast improvement! Can the timeout/buffer be made configurable? Possibly this could even be configurable per service using labels or something? To find out why a service is failing we obviously need the logs, and so if there is any significant delay between the service failing and the logs arriving in Azure it will make it slower to investigate and get the service back up again.
Having the hostname as part of the filename would also be fine; even using the hostname as the directory would be OK. It’s having them as separate shares that makes it hard to enumerate all the files for a particular service.
@ddebroy If it’s of any interest, for a different reason (haproxy traffic logs) I used syslog-ng and Cloudstor to mount an Azure share and write syslog data to it. It seems to work OK, and is a lot simpler than using the Azure REST API. It does then obviously have a dependency on the Cloudstor volume plugin.
@markvr … Very nice workaround. Are you first killing the editions_logger container and then deploying a syslog-ng container?
Your observation that not using the Azure APIs is simpler is absolutely correct. In fact, in our patch we also changed the mechanism (discussed above) to mount the share over SMB (thus very similar to Cloudstor) and just append the syslog entries to the files over CIFS. It should be out any time now. Another advantage of this over the APIs is that a chatty container won’t lead to overall Azure API invocations being throttled, since Azure has very strict bounds on the rate of API calls.
@ddebroy no, this wasn’t to replace the Docker logging. It was a separate logger to capture the load balancer traffic logs, which are sent over syslog but just to an endpoint in the swarm. I don’t see why the same thing couldn’t work though.
Centralized logging is a critical component of many modern infrastructure stacks. To have these logs indexed and searchable proves invaluable for debugging application and system issues as they come up. Out of the box, Docker for Azure forwards logs from containers to a native cloud provider abstraction (a storage account in the created resource group).
How is Docker for Azure configuring the container logging by default so that the logs are indexed and searchable? How can we see the container logs in this manner (searchable)? Do you mean it enables something like OMS Log Analytics?
Docker for Azure will stream all the logs from all containers on all hosts to a storage account in the Docker4Azure resource group that ends in *logs. When you go there (through the Azure portal), you will find a specific Azure File Storage share named docker4azurelogs. You can then either use the Azure portal to explore all the log files from each container in there, or mount the SMB share (//sa_id_for_logs.file.core.windows.net/docker4azurelogs) and grep through the files using any file system utility.
How would Docker 4 Azure users benefit from storing the logs in either:
A table in a Storage Account? (instead of just files)
Into OMS Log Analytics?
Right now, with a Docker 4 Azure swarm cluster of 15+ services (each having more than one container), searching and grepping the logs as you mentioned is… rather difficult: I need to know exactly which worker node was hosting which service’s container, and when.
I’m assuming that if we route the logs to either of these instead, I can at least query for a particular log fragment in a single source.
What do you think? Maybe I can work something out for this.
Using a table in a Storage Account: If each log entry generated by the Docker daemon on each host in the swarm is sent to a fresh entity/row in an Azure Table, I think it will lead to too many Azure API calls and you will run into Azure API invocation throttling issues (a rough sketch of this per-row pattern follows these points). If Azure API calls from the swarm start to get throttled, it will affect the overall stability of the swarm, so it is not recommended (especially if you end up with chatty containers that log a lot).
OMS Logs: This sounds like a good idea. You will probably need to have an OMS agent for syslog deployed in a container on each host that listens for syslog events, filters out/processes the ones from dockerd, and sends them over.
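To make the first point concrete, per-row writes would mean roughly the following, i.e. one Azure API call per log line (a sketch only, assuming the legacy azure.storage.table TableService; the table name and key scheme are made up):

```python
# Sketch of per-log-line Azure Table writes (illustrative only; assumes the
# legacy `azure-storage` TableService -- table name and key scheme are made up).
import os
import uuid
from azure.storage.table import TableService

table_service = TableService(account_name=os.environ["SA_NAME"],
                             account_key=os.environ["SA_KEY"])
TABLE_NAME = "dockerlogs"

def log_to_table(host, container_name, message):
    # One insert_entity call -- i.e. one Azure API request -- per log line.
    # A few chatty containers can quickly exhaust the storage account's request
    # limits and get the whole swarm's API calls throttled.
    table_service.insert_entity(TABLE_NAME, {
        "PartitionKey": "%s-%s" % (host, container_name),
        "RowKey": str(uuid.uuid1()),   # unique, roughly time-ordered row key
        "message": message,
    })
```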
How can I prune logs in the Azure storage container? I have 4 nodes and 2 services. I’ve done quite a few rounds of docker service create followed by docker service rm. Each time I do this, it creates log files with new IDs. In one day of messing around, I now have 500 log files for just my two services and 4 nodes.
I’ve tried deleting them with Azure Storage Explorer, but Docker still holds file locks on many of them, even after the service is gone.
I don’t suppose there is a docker logs prune or some such command coming?