Jul 7, 2021
This post explains, with a worked example, how the Client Puzzle Protocol (CPP) may be implemented (almost) entirely in HTTP.
The Client Puzzle Protocol at a high level is a way to slow down automated bots crawling a site so that they approach the speed at which humans would normally browse it. The bots are slowed down by requiring them to solve a “puzzle” or show “proof of work” in order to obtain a secret that allows them to continue to access the site.
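To make "proof of work" concrete, here is a minimal sketch of a generic hash-based puzzle. It is not the puzzle this post builds (that one is based on gzip decompression); the function names, the SHA-256 choice, and the `difficulty_bits` parameter are assumptions made purely for illustration.

```python
import hashlib
import itertools


def solve_puzzle(challenge: bytes, difficulty_bits: int) -> int:
    """Search for a nonce whose SHA-256 digest has `difficulty_bits` leading zero bits.

    Hypothetical helper: expected work doubles with every extra difficulty bit.
    """
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce


def verify_solution(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Checking a solution costs a single hash, however long the search took."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

The asymmetry is the point: the client performs many hashes on average, while the server verifies with one.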
Rate limiting bots is always tricky. Well coded bots like Googlebot or Bingbot understand HTTP status codes such as “429: Too Many Requests” and react appropriately by slowing down.
Badly behaved bots do the opposite: the HTTP response telling them to slow down causes them to go even faster.
My solution, as implemented below, specifically addresses the above problem, but is meant to be easily adaptable to different types of traffic flows. Let’s dig in.
I use HAProxy for load balancing. The following is a simple commented example config:
    frontend website_fe
        #Listen on Port 80
        bind *:80

        #Create in memory stick table that stores the rolling average request rate per 10 seconds.
        #Expire entries from table after 30s of inactivity
        stick-table type ip size 100k expire 30s store http_req_rate(10s)
        http-request track-sc0 src

        #Rules
        #Rule that is true if the query string contains "sup3r-s3cr3t" our token value
        acl has_token query -i -m reg .*sup3r-s3cr3t
        #Rule that is true if the requesting IP has made greater than 5 requests in 10 seconds
        acl too_fast sc_http_req_rate(0) gt 5
        #Rule that is true if the requesting IP has made greater than 10 requests in 10 seconds
        acl way_too_fast sc_http_req_rate(0) gt 10
        #Rule that is true if the requesting IP has made greater than 15 requests in 10 seconds
        acl way_way_too_fast sc_http_req_rate(0) gt 15

        #Actions (Processed in order)
        #HTTP 403 Deny if > 15 requests
        http-request deny if way_way_too_fast
        #HTTP 429 Deny if > 10 requests
        http-request deny deny_status 429 if way_too_fast
        #Route to the website backend if request includes secret token
        use_backend website_be if has_token
        #Route to the antibot backend if > 5 requests and does not have secret token
        use_backend antibot_be if too_fast !has_token
        #Default: requests that matched no rule above (normal-rate traffic without a token)
        default_backend website_be

    backend website_be
        mode http
        #Route request to local service listening on port 8080
        server website 127.0.0.1:8080

    backend antibot_be
        mode http
        #Route request to local service listening on port 8095
        server antibot 127.0.0.1:8095
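As we can see above, requests are handled as follows (rules are processed in order):
- More than 15 requests in 10 seconds: denied with HTTP 403.
- More than 10 requests in 10 seconds: denied with HTTP 429.
- Request includes the secret token: routed to the website backend.
- More than 5 requests in 10 seconds without the secret token: routed to the antibot backend.
- Anything else: routed to the website backend by default.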
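The routing order in the config can be sketched as a plain function. This is only a model of the decision logic for reasoning about edge cases, not something HAProxy runs; the function name `route` and its return strings are hypothetical.

```python
def route(requests_in_10s: int, has_token: bool) -> str:
    """Mirror the order of the http-request and use_backend rules in the frontend."""
    # http-request deny rules run first, so a token does not exempt a very fast client.
    if requests_in_10s > 15:
        return "deny 403"
    if requests_in_10s > 10:
        return "deny 429"
    # use_backend rules: the first match wins.
    if has_token:
        return "website_be"
    if requests_in_10s > 5:
        return "antibot_be"
    # No rule matched: default_backend.
    return "website_be"
```

For example, `route(7, False)` lands on the antibot backend, while `route(7, True)` goes straight to the website.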
Pretty simple right? This is all functionality that’s built into HAProxy by default, which is why I absolutely love it. Now that we’ve established the general structure of routing, let’s look at how to implement CPP into the mix.
On port 8095 we have an HTTP service listening that serves up puzzles that must be solved in order to retrieve the secret token.
For the following script I utilize Python 3 with the standard-library socketserver and zlib modules:
    #!/usr/bin/env python3
    import socketserver
    import zlib


    def build_big_gzip():
        # Pad with NULL bytes, then append the HTML that redirects with the token
        zero_padding = b'\00' * 999999999
        html = """
    <html><meta http-equiv="refresh" content="0;url=http://antibot/?secretToken=sup3r-s3cr3t" /></html>
    """.encode('utf8')
        out = zero_padding + html
        print("Decompressed Size: ", len(out)//1024)
        # wbits=31 produces a gzip container (header and trailer included)
        z = zlib.compressobj(-1, zlib.DEFLATED, 31)
        return z.compress(out) + z.flush()


    # Build the payload once at startup and reuse it for every request
    biggzip = build_big_gzip()
    print("Compressed Size: ", len(biggzip)//1024)


    class MyTCPHandler(socketserver.BaseRequestHandler):
        def handle(self):
            data = self.request.recv(1024)
            print(data.decode())
            # sendall() keeps writing until the whole buffer is sent; send() may stop early
            self.request.sendall(b'HTTP/1.1 200 OK\r\n')
            self.request.sendall(b'Content-Encoding: gzip\r\n')
            self.request.sendall(b'Content-Type: text/html; charset=UTF-8\r\n')
            self.request.sendall(b'Connection: close\r\n')
            self.request.sendall(b'\r\n')
            self.request.sendall(biggzip)


    class MyTCPSever(socketserver.TCPServer):
        allow_reuse_address = True


    with MyTCPSever(('0.0.0.0', 8095), MyTCPHandler) as server:
        server.serve_forever()
At the top of the script, I build a bytes object with 999999999 NULL bytes, then append a minimal HTML file that uses META Refresh to redirect to a URL including the secret token.
The NULL bytes are important: they help prevent any HTML parser from thrashing while it parses the page, yet still allow us to pad the file arbitrarily.
This bytes object is then gzip compressed and stored in a variable.
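Building the padding as one bytes object needs roughly 1 GB of RAM at startup (and about twice that while the HTML is appended). As a sketch of an alternative, the same kind of payload could be produced by feeding the compressor in chunks. The function name `build_gzip_puzzle` and its parameters are hypothetical, and the output is not claimed to be byte-identical to the script's.

```python
import zlib


def build_gzip_puzzle(padding_bytes: int, html: bytes, chunk_size: int = 1 << 20) -> bytes:
    """Gzip `padding_bytes` NULL bytes followed by `html`, one chunk at a time."""
    compressor = zlib.compressobj(-1, zlib.DEFLATED, 31)  # wbits=31: gzip container
    chunk = b"\x00" * chunk_size
    parts = []
    remaining = padding_bytes
    while remaining > 0:
        n = min(chunk_size, remaining)
        parts.append(compressor.compress(chunk[:n]))
        remaining -= n
    parts.append(compressor.compress(html))
    parts.append(compressor.flush())
    return b"".join(parts)
```

Peak memory then stays close to the chunk size plus the compressed output, rather than the full decompressed size.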
The compressed size (sent over the wire to the client) of the HTML is 949 KB. The client must decompress the full stream (~1 GB, or about 954 MiB) and parse the trailing HTML to follow the redirect and pick up the token.
That’s it, that’s the puzzle. A massive gzip stream that must be decompressed to retrieve the token.
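From the client's side, solving the puzzle amounts to inflating the whole stream and reading the redirect at the end. The sketch below shows one way a client could do that while keeping only the tail in memory. The function name `extract_token`, the chunk sizes, and the regular expression are assumptions for illustration; a real browser simply decodes the response and follows the META Refresh.

```python
import re
import zlib


def extract_token(gzip_payload: bytes, tail_size: int = 4096, chunk_size: int = 16 * 1024):
    """Inflate the gzip stream chunk by chunk, keeping only the last `tail_size` bytes.

    Returns (token or None, total number of decompressed bytes).
    """
    decompressor = zlib.decompressobj(31)  # wbits=31: expect a gzip container
    tail = b""
    total = 0
    for start in range(0, len(gzip_payload), chunk_size):
        data = decompressor.decompress(gzip_payload[start:start + chunk_size])
        total += len(data)
        tail = (tail + data)[-tail_size:]
    data = decompressor.flush()
    total += len(data)
    tail = (tail + data)[-tail_size:]
    # The token sits in the META Refresh URL at the very end of the page.
    match = re.search(rb"secretToken=([\w-]+)", tail)
    return (match.group(1).decode() if match else None), total
```

The client cannot skip ahead: the HTML only appears once every byte of padding before it has been decompressed.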
CPU cheap for the server to validate, CPU heavy for the client.