Communication timeout when uploading to SD, m105 not being sent


#1

What is the problem?
During the upload of a gcode file to the printer's SD card, the upload will hang with timeout messages. Looking at serial.log it seems that the command to get a fresh response from the print (M105) is not being sent. I have also (once) seen a hang during printing and believe that this was a similar problem, but the upload to SD card is easier to reproduce.

What did you already try to solve it?
I've repeated the test several times, the time taken to hit the problem varies but it always seems to eventually lock up. I've tried both "standard" ok mode and ADVANCED_OK (see the output below). I have also run the test using Octoprint safe mode. I have serial.log enabled and the end of the file shows the following output....

2018-12-23 20:38:39,832 - Send: N89742 G1 X116.020 Y113.817 E0.01363*108
2018-12-23 20:38:39,840 - Recv: ok N89742 P15 B3
2018-12-23 20:38:39,861 - Send: N89743 G1 X116.538 Y113.505 E0.01171*110
2018-12-23 20:38:39,869 - Recv: ok N89743 P15 B3
2018-12-23 20:38:39,885 - Recv:  T:19.29 /0.00 B:19.30 /0.00 @:0 B@:0
2018-12-23 20:38:39,894 - Send: N89744 G1 X117.118 Y113.227 E0.01245*109
2018-12-23 20:38:39,905 - Recv: ok N89744 P15 B3
2018-12-23 20:38:39,911 - Send: N89745 G1 X117.739 Y113.001 E0.01279*96
2018-12-23 20:38:39,915 - Recv: ok N89745 P15 B3
2018-12-23 20:38:39,922 - Send: N89746 G1 X118.385 Y112.835 E0.01290*102
2018-12-23 20:38:39,927 - Recv: ok N89746 P15 B3
2018-12-23 20:38:39,933 - Send: N89747 G1 X119.032 Y112.735 E0.01268*97
2018-12-23 20:38:39,937 - Recv: ok N89747 P15 B3
2018-12-23 20:38:41,883 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:38:43,896 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:38:45,883 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:38:47,882 - Recv:  T:19.29 /0.00 B:19.22 /0.00 @:0 B@:0
2018-12-23 20:38:49,882 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:38:51,882 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:38:53,882 - Recv:  T:19.29 /0.00 B:19.14 /0.00 @:0 B@:0
2018-12-23 20:38:55,882 - Recv:  T:19.29 /0.00 B:18.83 /0.00 @:0 B@:0
2018-12-23 20:38:57,883 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:38:59,881 - Recv:  T:19.29 /0.00 B:19.34 /0.00 @:0 B@:0
2018-12-23 20:39:01,882 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:03,882 - Recv:  T:19.29 /0.00 B:18.79 /0.00 @:0 B@:0
2018-12-23 20:39:05,880 - Recv:  T:19.29 /0.00 B:19.18 /0.00 @:0 B@:0
2018-12-23 20:39:07,880 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:09,880 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:11,880 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:11,887 - Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
2018-12-23 20:39:13,881 - Recv:  T:19.29 /0.00 B:19.34 /0.00 @:0 B@:0
2018-12-23 20:39:15,880 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:17,880 - Recv:  T:19.29 /0.00 B:18.79 /0.00 @:0 B@:0
2018-12-23 20:39:19,880 - Recv:  T:19.29 /0.00 B:19.02 /0.00 @:0 B@:0
2018-12-23 20:39:21,880 - Recv:  T:19.29 /0.00 B:18.91 /0.00 @:0 B@:0
2018-12-23 20:39:23,880 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:25,883 - Recv:  T:19.29 /0.00 B:18.95 /0.00 @:0 B@:0
2018-12-23 20:39:27,880 - Recv:  T:19.29 /0.00 B:19.02 /0.00 @:0 B@:0
2018-12-23 20:39:29,878 - Recv:  T:19.29 /0.00 B:19.14 /0.00 @:0 B@:0
2018-12-23 20:39:31,878 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:33,879 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:35,878 - Recv:  T:19.29 /0.00 B:19.18 /0.00 @:0 B@:0
2018-12-23 20:39:37,878 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:39,878 - Recv:  T:19.29 /0.00 B:19.14 /0.00 @:0 B@:0
2018-12-23 20:39:41,878 - Recv:  T:19.29 /0.00 B:18.98 /0.00 @:0 B@:0
2018-12-23 20:39:43,878 - Recv:  T:19.29 /0.00 B:19.37 /0.00 @:0 B@:0
2018-12-23 20:39:43,891 - Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
2018-12-23 20:39:45,878 - Recv:  T:19.29 /0.00 B:19.37 /0.00 @:0 B@:0
2018-12-23 20:39:47,878 - Recv:  T:19.29 /0.00 B:18.75 /0.00 @:0 B@:0
2018-12-23 20:39:49,878 - Recv:  T:19.29 /0.00 B:19.18 /0.00 @:0 B@:0
2018-12-23 20:39:51,878 - Recv:  T:19.29 /0.00 B:19.22 /0.00 @:0 B@:0
2018-12-23 20:39:53,878 - Recv:  T:19.29 /0.00 B:19.37 /0.00 @:0 B@:0
2018-12-23 20:39:55,878 - Recv:  T:19.29 /0.00 B:19.37 /0.00 @:0 B@:0
2018-12-23 20:39:57,881 - Recv:  T:19.29 /0.00 B:19.37 /0.00 @:0 B@:0

Additional information about your setup (OctoPrint version, OctoPi version, printer, firmware, octoprint.log, serial.log or output on terminal tab, ...)
Octoprint is running on Octopi and is version 1.3.10 I am using Marlin Bugfix 2.0.x firmware.

I am a contributor to Marlin and have worked on the LPC176X USB stack (including on the USB CDC communications). I have been looking into a report of issues with Octoprint communications getting timeouts when using Marlin on a LPC176X based board and in the process of testing ran into the problem described here. From the log files it seems that Marlin has correctly responded to Octoprint and that for some reason Octoprint is no longer sending gcode (in particular it does not seem to send the M105 command that is used to provoke Marlin to respond). I've had a quick look at the Octoprint code but I'm not familiar with it and would appreciate any thoughts on what may be going on here. Thanks.


#2

I'm guessing that @foosel will want to read this, noting that she'll be away for the Christmas break and back soon enough.


#3

Quick update. I realised last night that I had not mentioned something....

In my tests to now, I always had two browsers connected to octoprint (I started the SD card upload on my PC and then "monitored" things from my Chromebook). Last night I ran a test using only the PC and with no browser windows open on any other system, this time for the first time I was able to complete the transfer of the 20Mbyte test file I have been using. Obviously not conclusive as just a single test, but I thought I'd update the information here. I will be running more tests using just a single browser connection today as and when I get chance.

I'm not familiar with the octoprint architecture, is it likely that having multiple browsers connected to a single instance could have an impact? Also When running this test I typically have the "control" tab active which will have a video stream of the printer visible in it (streamed from a Pi cam connected to the rpi).

Oh and I should also add that in my very first tests I had the "lightning bolt" low voltage symbol show up from time to time. To ensure that power a glitch was not the cause of the problem I ordered a genuine Pi 2.5A power supply and have been using this in all of my recent tests. On these tests no power problems have been indicated.

Thanks again for any help.


#4

Further update. I've now run 4 tests (each with the same 20MByte file) with only a single browser connected and three of the four tests succeeded. The failed test failed in an identical fashion to the problem described in the first post. I'm not sure what to make of this. Previously I ran 5 tests and all of them failed. Now this is still a small sample size but it does seem that having two browsers connected may make things worse?


#5

@gloomyandy - Have you tried enabling HOST_KEEPALIVE_FEATURE in marlin? These timeouts look exactly like the ones I was having, and enabling that fixed it. (HOST_KEEPALIVE_FEATURE causes Marlin to output "busy: processing" messages when there is a long period without an OK response; Octoprint sees that and is satisfied that the printer is still working on the unacknowledged command.)


#6

Yep just checked HOST_KEEPALIVE_FEATURE is already enabled. But I don't think the busy stuff really applies when uploading gcode to an SD card file (as none of the commands are actually executed). That's one of the reasons I've been using it for testing.


#7

Are you sure there isn't an unacknowledged command, earlier in the log? (You only attached a partial log, starting with "Send: N89742...", so it's possible there was an earlier command that wasn't acknowledged -- and that could be what caused the eventual timeout).

The Octoprint comm timeout would generally mean that Octoprint saw neither a matching command acknowledgement (ok) nor a "busy: processing" within the timeout period. All the sent commands in your partial log are acknowledged, so perhaps there was an earlier command for which an OK wasn't received?


#8

Well I can't be 100% sure (as I've not checked all of the file) but I've just been back over 3 minutes worth of log and can see no sign of a missing "ok". I thought octoprint used a simple send and wait protocol (send a line, wait for the ok then send the next line), if it doesn't how does it detect a missing ok without having to go to the end of the upload before discovering the missing item? Also I don't understand why there is no sign of octoprint sending "m105" my understanding is that it should be doing that to try and prompt Marlin to respond, but I can see no sign that these are ever sent. I would have expected them to show up in the serial log. I would assume that ocotprint will send this command even if it has not seen an "ok" as this is part of the recovery process?


#9

Ok so I had a little more time today (as you can probably guess it is a quiet Xmas for me!) and so I have been investigating the failed transfers a little more. I selected the portion of serial.log that contains all of the send/recv operations from the start to the failure point. I then ran a grep on them looking for "Send: N" and used wc on this and then repeated the same for "Recv: ok" both give the same number of occurrences so I'm reasonably sure that no ok responses have gone missing.


#10

I recommend opening a full bug report -- with all hardware/software configuration details and full copies of octoprint.log and serial.log files. You're only providing the information here that you think is relevant -- and you may be correct, or you may be missing something. Gina will be back in a few days and may have a better idea (having looked at full logs and configuration) of how to isolate the problem.


#11

Just want to second this. Please open a full bug report so this can be looked into.


#12

As requested please see issue: https://github.com/foosel/OctoPrint/issues/2996

So I've spent a fair bit of time digging into this and I'm pretty sure I now understand what is going on....

First off you need to understand how octoprint is performing I/O to the USB serial port in particular when writing data. It is basically using the pyserial library. For a write that has a timeout specified this library will perform a non-blocking write followed by a call to select to check if the fd is writable. This call to select will be made using whatever write timeout value has been specified. It will make this second call even if the write was a success and all of the requested bytes have been written (which is a little suspect). Normally what happens is that select determines that the fd is writable and so returns intermediately. However very occasionally this does not happen, instead the select will block and will wait until the timeout period expires (even though the fd either was writable or becomes writable almost straight away). I suspect that this behaviour may be a result of a bug/race condition in the USB-CDC driver but can't be sure. Unfortunately Octoprint has specified a relatively long write timeout of 10000 seconds (see https://github.com/foosel/OctoPrint/blob/112a8b9c8bf7f5895f6a29dbe3d98ea4d9e0f400/src/octoprint/util/comm.py#L2461). This in effect means that the write will block for ever. This in turn blocks the Octoprint output thread and so no recovery is possible.

I've spent a fair bit of time instrumenting the pyserial code and trying various things and it is clear that the actual I/O should not be blocking, the previous write has succeeded and all of the data is happily transmitted to Marlin. Basically the select should not block. If you specify a shorter timeout then the select will complete after this period and will report that the fd is writable.

A workaround for this issue is to set the write timeout value to 0 this will result in the write using non-blocking code (and so will not call select after a write). Octoprint already contains code in the write loop to handle the situation of not all of the data being written. This code will retry the write a (user specified) number of times and will fail if this does not succeed. It seems to me this is actually a better way to handle writes that do not succeed as it will handle the case of a partial write more cleanly (only retrying the part of the write that failed). One possible downside is that because the retry happens immediately a small temporary pause may cause this code to detect a failure, it may be worth adding a small delay into this code if retrying for more than say the first time (thus allowing the system time to actual write the data).

I'd be happy to post a pull request for the changes I've made if you are interested.


#13

Definitely! Against the maintenance branch please (bug fix/improvement, not new feature). Just be aware that I might take a bit until I get around to review, I'm just returned to work today after a needed break over the holidays and will need a couple days to get through all of the backlog :joy:


#14

Pull request here https://github.com/foosel/OctoPrint/pull/3003

Screwed up the original PR try this one!


#15

Whoa. This sounds like it explains my own experiences here (freezup) and here (file deletion).