Puma 4: Hammering Out H13s - A Debugging Story
For quite some time we've received reports from our larger customers about a mysterious H13 - Connection closed error showing up for their Ruby applications. Curiously, it only ever happened around the time they were deploying or scaling their dynos. Even more peculiar, it only happened to relatively high scale applications. We couldn't reproduce the behavior on an example app. This is a story about distributed coordination, the TCP API, and how we debugged and fixed a bug in Puma that only shows up at scale.
Connection closed
First of all, what even is an H13 error? From our error page documentation:
This error is thrown when a process in your web dyno accepts a connection, but then closes the socket without writing anything to it. One example where this might happen is when a Unicorn web server is configured with a timeout shorter than 30s and a request has not been processed by a worker before the timeout happens. In this case, Unicorn closes the connection before any data is written, resulting in an H13.
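The Unicorn timeout is just one way to hit this; at the TCP level the condition is simply "accept a connection, then close it without writing a response." Here's a minimal Ruby sketch of my own (not from the docs, using the standard library's TCPServer) that produces exactly that failure mode:

```ruby
require 'socket'

# Listen the way a web server would, e.g. on the dyno's $PORT.
server = TCPServer.new(ENV.fetch("PORT", 9292).to_i)

loop do
  client = server.accept   # the router's connection is accepted...
  client.close             # ...but closed before any HTTP response is written.
                           # From the router's perspective this is an H13:
                           # "Connection closed without response".
end
```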
Fun fact: Our error codes start with the letter of the component where they came from. Our routing code is all written in Erlang and is named "Hermes", so any error codes from Heroku that start with an "H" indicate an error from the router.
The documentation gives an example of an H13 error code with the Unicorn webserver, but it can happen any time a connection is closed by a server without a response being written. Here's an example showing how to reproduce an H13 explicitly with a node app.
What does it mean for an application to get an H13? Essentially, every one of these errors corresponds to a customer who got an error page. Serving a handful of errors every time the app restarts, deploys, or auto-scales is an awful user experience, so it's worth it to find and fix.
Debugging
I have maintained the Ruby buildpack for several years, and part of that job is to handle support escalations for Ruby tickets. In addition to the normal deployment issues, I've been developing an interest in performance, scalability, and web servers (I recently started helping to maintain the Puma webserver). Because of these interests, when a tricky issue comes in from one of our larger customers, especially one that only happens at scale, I take a particular interest.
To understand the problem, you need to know a little about the nature of sending distributed messages. Webservers are inherently distributed systems, and to make things more complicated, we frequently use distributed systems to manage our distributed systems.
In the case of this error, it didn't seem to come from a customer's application code, i.e. they didn't seem to have anything misconfigured. It also only seemed to happen when a dyno was being shut down.
To shut down a dyno, two things have to happen. We need to send a SIGTERM to the processes on the dyno, which tells the webserver to safely shut down. We also need to tell our router to stop sending requests to that dyno since it will be shut down shortly.
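For a sense of what "safely shut down" means on the process side, here's a rough sketch of my own (Puma handles all of this internally; this is only illustrative): trap SIGTERM, stop accepting new work, finish what's in flight, then exit.

```ruby
shutting_down = false

# Signal handlers should do as little as possible; just flip a flag
# that the main loop checks on each iteration.
Signal.trap("TERM") { shutting_down = true }

until shutting_down
  # Stand-in for "accept and serve one request".
  sleep 0.1
end

# At this point a real server would finish any in-flight requests before exiting.
puts "SIGTERM received, shutting down cleanly"
```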
These two operations happen on two different systems. The dyno runs on one server; the router, which serves our requests, is a separate system, and it's itself a distributed system. It turns out that while both systems get the message at about the same time, the router might still let a few requests trickle into the dyno that is being shut down after it receives the SIGTERM.
That explains the problem then, right? The reason this only happens on apps with a large amount of traffic is that they get so many requests there is a greater chance of a race condition between when the router stops sending requests and when the dyno receives the SIGTERM.
That sounds like a bug with the router then, right? Before we get too deep into the difficulties of distributed coordination, I noticed that other apps with just as much load weren't getting H13 errors. What did that tell me? It told me that the distributed behavior of our system wasn't to blame. If other webservers can handle this just fine, then we need to update our webserver, Puma in this case.
Reproduction
When you're dealing with a distributed systems issue that's reliant on a race condition, reproducing it can be tricky. While pairing on the issue with another Heroku engineer, Chap Ambrose, we hit on an idea. First, we would reproduce the H13 behavior in any app to figure out what curl exit code we would get; then we could try to reproduce the exact failure conditions with a more complicated example.
A simple reproduction rack app looks like this:
```ruby
app = Proc.new do |env|
  current_pid = Process.pid
  signal      = "SIGKILL"
  Process.kill(signal, current_pid)

  ['200', {'Content-Type' => 'text/html'}, ['A barebones rack app.']]
end

run app
```
When you run this config.ru with Puma and hit it with a request, you'll get a connection that is closed without a response being written. That was pretty easy.
The curl exit code when a connection is closed like this is 52, so now we can detect when it happens.
```
$ curl localhost:9292
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (52) Empty reply from server
```
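Since that exit code is the signal we care about, a small sketch (my own, assuming the reproduction app is running on localhost:9292) of checking it programmatically from Ruby, which is essentially what the parallel script below relies on via `$?`:

```ruby
# Shell out to curl and inspect its exit status; 52 means the server accepted
# the connection but replied with nothing ("Empty reply from server").
`curl -s localhost:9292`
status = $?.exitstatus

if status == 52
  puts "Reproduced: connection closed without a response"
else
  puts "No repro, curl exited with #{status}"
end
```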
A more complicated reproduction happens when SIGTERM is sent but requests keep coming in. To facilitate that, we ended up with a reproduction that looks like this:
```ruby
app = Proc.new do |env|
  puma_pid = File.read('puma.pid').to_i
  Process.kill("SIGTERM", puma_pid)
  Process.kill("SIGTERM", Process.pid)

  ['200', {'Content-Type' => 'text/html'}, ['A barebones rack app.']]
end

run app
```
This config.ru rack app sends a SIGTERM to itself and its parent process on the first request, so other requests will still be coming in while the server is shutting down.
Then we can write a script that boots this server and hits it with a bunch of requests in parallel:
```ruby
threads = []

threads << Thread.new do
  puts `puma > puma.log` unless ENV["NO_PUMA_BOOT"]
end

sleep(3)

require 'fileutils'
FileUtils.mkdir_p("tmp/requests")

20.times do |i|
  threads << Thread.new do
    request = `curl localhost:9292/?request_thread=#{i} &> tmp/requests/requests#{i}.log`
    puts $?
  end
end

threads.map { |t| t.join }
```
When we run this reproduction, we see that it gives us the exact behavior we're looking for. Even better, when this code is deployed on Heroku, we can see that an H13 error is triggered:
```
2019-05-10T18:41:06.859330+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=GET path="/?request_thread=6" host=ruby-h13.herokuapp.com request_id=05696319-a6ff-4fad-b219-6dd043536314 fwd="<ip>" dyno=web.1 connect=0ms service=5ms status=503 bytes=0 protocol=https
```
You can get all this code and some more details in the reproduction script repo. And here's the Puma issue I was using to track the behavior.
Closing the connection closed issues
With a reproduction script in hand, it was possible for us to add debugging statements to Puma internals to see how it behaved while experiencing this issue.
With a little investigation, it turned out that Puma never explicitly closed the socket of the connection. Instead, it relied on the process stopping to close it.
What exactly does that mean? Every time you type a URL into a browser, the request gets routed to a server. On Heroku, the request goes to our router. The router then attempts to connect to a dyno (server) and pass it the request. The underlying mechanism that allows this is the webserver (Puma) on the dyno opening up a TCP socket on a $PORT. The request is accepted onto the socket, and it will sit there until the webserver (Puma) is ready to read it in and respond to it.
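You can see that queueing behavior with a small standard-library sketch of my own (not Puma's code): a connection can be made, and a request written, before the server ever calls accept; the kernel parks it on the listening socket until the server reads it in.

```ruby
require 'socket'

server = TCPServer.new(9292)
server.listen(1024)     # roughly the backlog size mentioned below; the OS may cap it

# No one has called accept yet, but a client can still connect: the kernel
# holds the connection in the listen backlog until the server reads it in.
client = TCPSocket.new("localhost", 9292)
client.write("GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")

conn = server.accept          # only now does the webserver "see" the waiting request
puts conn.readpartial(4096)   # the request was sitting on the socket the whole time

conn.close
client.close
server.close
```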
What behavior do we want to happen to avoid this H13 error? In the error case, the router tries to connect to the dyno, it's successful, and since the request is accepted by the dyno, the router expects the dyno to handle writing the response. If instead the socket is closed when the router tries to pass on the request, it will know that Puma cannot respond, and it will retry passing the connection to another dyno. There are times when a webserver might reject a connection, for example, if the socket is full (the default is to only allow 1024 connections in the socket backlog), or if the entire server has crashed.
In our case, closing the socket is what we want. It correctly communicates to the router to do the right thing (try passing the connection to another dyno, or hold onto it in the case where all dynos are restarting).
So the solution to the problem was to explicitly close the socket before attempting to shut down. Here's the PR. The primary magic is just one line:
```ruby
@launcher.close_binder_listeners
```
If you're a worrier (I know I am), you might be afraid that closing the socket prevents any in-flight requests from being completed successfully. Lucky for us, closing a socket prevents incoming requests but still allows us to respond to existing requests. If you don't believe me, think about how you could test it with one of my example repos above.
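In that spirit, here's a tiny standard-library sketch (again my own, not Puma's implementation) showing that closing a listening socket doesn't affect a connection that has already been accepted:

```ruby
require 'socket'

server = TCPServer.new(9292)
client = TCPSocket.new("localhost", 9292)
conn   = server.accept        # an "in-flight" request: already accepted

server.close                  # stop listening: brand-new connections are now refused,
                              # which is what tells the router to try another dyno

conn.write("HTTP/1.1 200 OK\r\nContent-Length: 4\r\n\r\ndone")
conn.close                    # the already-accepted connection still gets its response

puts client.read              # prints the full 200 response
```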
Testing distributed behavior
I don't know if this behavior in Puma broke at some point, or maybe it never worked. To try to make sure that it continues to work in the future, I wanted to write a test for it. I reached out to dannyfallon, who has helped out on some other Puma issues, and we remote paired on the tests using Tuple.
The tests ended up being not terribly different from our example reproduction above, but it was pretty tricky to get them to behave consistently.
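For a sense of the shape, here's a rough sketch of my own (not the test that landed in Puma): boot the reproduction config.ru, fire parallel requests while it shuts itself down, and assert that no request saw a connection closed without a response (curl exit code 52).

```ruby
require 'minitest/autorun'

class GracefulShutdownTest < Minitest::Test
  def test_no_connection_closed_errors
    # Boot the reproduction app; it will SIGTERM itself on the first request.
    pid = spawn("puma", "config.ru", "--pidfile", "puma.pid", out: "puma.log")
    sleep 3  # crude "wait for boot"; a real test needs something sturdier

    # Hammer the server with parallel requests and collect curl exit codes.
    statuses = 20.times.map do |i|
      Thread.new do
        system("curl -s localhost:9292/?request_thread=#{i} > /dev/null 2>&1")
        $?.exitstatus
      end
    end.map(&:value)

    refute_includes statuses, 52, "a connection was closed without a response"
  ensure
    Process.wait(pid) if pid
  end
end
```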
With an issue that doesn't regularly show up unless it's on an app at scale, it's essential to test, as Charity Majors would say, "in production". We had several Heroku customers who were seeing this error try out my patch. They reported some other issues, which we were able to resolve; after fixing those, it looked like the errors were gone.
Rolling out the fix
Puma 4, which came with this fix, was recently released. We reached out to a customer who was using Puma and seeing a large number of H13s, and this release stopped them in their tracks.
Learn more about Puma 4 below.
— Richard Schneeman 🤠 (@schneems) June 25, 2019
Source: https://blog.heroku.com/puma-4-hammering-out-h13s-a-debugging-story