Jul 7, 2021
This post explains, with a worked example, how the Client Puzzle Protocol (CPP) may be implemented (almost) entirely in HTTP.
The Client Puzzle Protocol, at a high level, is a way to slow down automated bots crawling a site so that they approach the speed at which humans would normally browse it. The bots are slowed down by requiring them to solve a “puzzle” or show “proof of work” in order to obtain a secret that allows them to continue to access the site.
In the link above, a proof of concept is implemented by Scott Contini using JavaScript. This will absolutely work for normal browsers, but most bots are not browsers and do not support JavaScript.
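To make the idea of a “puzzle” concrete, here is a minimal sketch of one common form of client puzzle, a hash-based proof of work. It is only an illustration: the helper names, the use of SHA-256 and the difficulty value are my assumptions, and this is not the puzzle used later in this post.

```python
import hashlib
import itertools
import os


def make_challenge() -> bytes:
    """Server side: issue a random challenge (hypothetical helper)."""
    return os.urandom(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count how many leading bits of the digest are zero."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce, roughly 2**difficulty hashes on average."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

The asymmetry is the point: solve() loops over many hashes while verify() computes exactly one.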
Rate limiting bots is always tricky. Well-coded bots such as Googlebot or Bingbot understand HTTP status codes such as “429: Too Many Requests” and react appropriately by slowing down.
And so they do exactly that. The HTTP response telling them to slow down causes them to go even faster.
My solution, as implemented below, specifically addresses the above problem, but is meant to be easily adaptable to different types of traffic flows. Let’s dig in.
I use HAProxy for load balancing. The following is a simple commented example config:
frontend website_fe
    #Listen on Port 80
    bind *:80
    #Create in memory stick table that stores the rolling average request rate per 10 seconds.
    #Expire entries from table after 30s of inactivity
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src

    #Rules
    #Rule that is true if the query string contains "sup3r-s3cr3t", our token value
    acl has_token query -i -m reg .*sup3r-s3cr3t
    #Rule that is true if the requesting IP has made greater than 5 requests in 10 seconds
    acl too_fast sc_http_req_rate(0) gt 5
    #Rule that is true if the requesting IP has made greater than 10 requests in 10 seconds
    acl way_too_fast sc_http_req_rate(0) gt 10
    #Rule that is true if the requesting IP has made greater than 15 requests in 10 seconds
    acl way_way_too_fast sc_http_req_rate(0) gt 15

    #Actions (Processed in order)
    #HTTP 403 Deny if > 15 requests
    http-request deny if way_way_too_fast
    #HTTP 429 Deny if > 10 requests
    http-request deny deny_status 429 if way_too_fast
    #Route to the website backend if request includes secret token
    use_backend website_be if has_token
    #Route to the antibot backend if > 5 requests and does not have secret token
    use_backend antibot_be if too_fast !has_token
    #Default for everything else: requests without a token at 5 or fewer requests in 10 seconds
    default_backend website_be

backend website_be
    mode http
    #Route request to local service listening on port 8080
    server website 127.0.0.1:8080

backend antibot_be
    mode http
    #Route request to local service listening on port 8095
    server antibot 127.0.0.1:8095
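As we can see above, requests are handled as follows (the deny rules are checked first): an IP that has made more than 15 requests in 10 seconds gets an HTTP 403; more than 10, an HTTP 429; otherwise a request carrying the secret token goes to the website backend; a request without the token from an IP above 5 requests in 10 seconds goes to the antibot backend; and everything else goes to the website backend by default.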
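The routing decisions in the config above can be modeled in a few lines of Python, which is handy for checking the thresholds by hand. This is only a sketch: RateTracker and route are hypothetical names, and the sliding-window counter is a simplification of the stick table, not how HAProxy computes http_req_rate internally.

```python
import re
from collections import defaultdict, deque

WINDOW_SECONDS = 10
# Same pattern as the has_token ACL, case-insensitive like "-i"
TOKEN_RE = re.compile(r".*sup3r-s3cr3t", re.IGNORECASE)


class RateTracker:
    """Per-IP request counter over a sliding window (simplified stick table)."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.hits = defaultdict(deque)

    def record(self, ip, now):
        """Record a request at time `now`; return the count inside the window."""
        timestamps = self.hits[ip]
        timestamps.append(now)
        # Drop requests that have fallen out of the window
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        return len(timestamps)


def route(rate, query):
    """Mirror the order of the frontend rules: denies first, then backends."""
    has_token = TOKEN_RE.match(query) is not None
    if rate > 15:
        return "deny 403"
    if rate > 10:
        return "deny 429"
    if has_token:
        return "website_be"
    if rate > 5:
        return "antibot_be"
    return "website_be"  # default_backend
```

Note that a valid token does not exempt a client from the 429 and 403 denies, since those rules run before any backend is chosen.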
Pretty simple, right? This is all functionality that’s built into HAProxy by default, which is why I absolutely love it. Now that we’ve established the general structure of routing, let’s look at how to bring CPP into the mix.
On port 8095 we have an HTTP service listening that serves up puzzles that must be solved in order to retrieve the secret token.
For the following script I utilize Python 3 with the socketserver and zlib modules from its standard library:
#!/usr/bin/env python3
import socketserver
import zlib


def build_big_gzip():
    # Padding of NULL bytes followed by a tiny HTML page carrying the token
    zero_padding = b'\00' * 999999999
    html = """
<html><meta http-equiv="refresh" content="0;url=http://antibot/?secretToken=sup3r-s3cr3t" /></html>
""".encode('utf8')
    out = zero_padding + html
    print("Decompressed Size (KB): ", len(out)//1024)
    # wbits=31 makes zlib emit a gzip header and trailer
    z = zlib.compressobj(-1, zlib.DEFLATED, 31)
    return z.compress(out) + z.flush()


# Built once at startup and reused for every response
biggzip = build_big_gzip()
print("Compressed Size (KB): ", len(biggzip)//1024)


class MyTCPHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request.recv(1024)
        print(data.decode(errors='replace'))
        # sendall() keeps writing until the whole buffer is sent; send() may stop early
        self.request.sendall(b'HTTP/1.1 200 OK\r\n')
        self.request.sendall(b'Content-Encoding: gzip\r\n')
        self.request.sendall(b'Content-Type: text/html; charset=UTF-8\r\n')
        self.request.sendall(b'Connection: close\r\n')
        self.request.sendall(b'\r\n')
        self.request.sendall(biggzip)


class MyTCPServer(socketserver.TCPServer):
    allow_reuse_address = True


with MyTCPServer(('0.0.0.0', 8095), MyTCPHandler) as server:
    server.serve_forever()
At the top of the script, I build a bytes object with 999999999 NULL bytes, then append a minimal HTML file that uses META Refresh to redirect to a URL including the secret token.
The NULL bytes are important: they let us pad the file arbitrarily while helping prevent an HTML parser from thrashing as it parses the page.
This bytes object is then gzip compressed and stored in a variable.
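Building the padding as one giant bytes object needs a lot of memory at startup. As a possible variation (my sketch, not what the script in this post does), the same gzip stream can be produced by feeding the compressor one chunk at a time. The function name and chunk size are assumptions.

```python
import zlib


def build_gzip_streaming(padding_bytes, html, chunk_size=1024 * 1024):
    """Gzip `padding_bytes` NULL bytes followed by `html`, chunk by chunk."""
    # wbits=31 selects the gzip container, as in the original script
    compressor = zlib.compressobj(-1, zlib.DEFLATED, 31)
    chunk = b"\x00" * chunk_size
    parts = []
    remaining = padding_bytes
    while remaining > 0:
        n = min(chunk_size, remaining)
        parts.append(compressor.compress(chunk[:n]))
        remaining -= n
    parts.append(compressor.compress(html))
    parts.append(compressor.flush())
    return b"".join(parts)
```

Only one chunk of padding and the compressed output are held in memory, so peak usage stays in the low megabytes instead of around a gigabyte.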
The compressed size of the HTML (sent over the wire to the client) is 949 KB. The client must decompress the full stream (999,999,999 bytes of padding, roughly 1 GB) and parse the trailing HTML to follow the redirect and pick up the token.
That’s it, that’s the puzzle. A massive gzip stream that must be decompressed to retrieve the token.
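For a sense of what the client has to do, here is a rough sketch of a solver that inflates the response body in bounded chunks and pulls the redirect URL out of the trailing HTML. The function name, chunk sizes and the regular expression are my assumptions; a real browser does this work implicitly when it honours Content-Encoding: gzip and the META refresh.

```python
import re
import zlib

# Matches the META refresh tag used by the puzzle page
META_REFRESH_RE = re.compile(
    rb'http-equiv="refresh"\s+content="\d+;\s*url=([^"]+)"', re.IGNORECASE
)


def solve_puzzle(compressed, chunk_size=64 * 1024, tail_size=4096):
    """Return (redirect_url, decompressed_byte_count) for a gzip body."""
    decompressor = zlib.decompressobj(31)  # wbits=31: expect a gzip header
    tail = b""
    total = 0
    data = compressed
    while data:
        # Produce at most chunk_size bytes of output per call
        out = decompressor.decompress(data, chunk_size)
        total += len(out)
        tail = (tail + out)[-tail_size:]  # keep only the end of the stream
        data = decompressor.unconsumed_tail
    out = decompressor.flush()
    total += len(out)
    tail = (tail + out)[-tail_size:]
    match = META_REFRESH_RE.search(tail)
    url = match.group(1).decode() if match else None
    return url, total
```

Even though the solver only keeps the last few kilobytes, it still has to inflate every byte of padding to get there, which is where the client's CPU time goes.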
CPU cheap for the server to validate, CPU heavy for the client.