Anti-Bot Client Puzzle Protocol in HTTP

Jul 7, 2021

This post explains, with a worked example, how the Client Puzzle Protocol (CPP) may be implemented (almost) entirely in HTTP.

  • Side Note: I’m _mattata on Twitter, you should give me a follow. I do stuff like this often.

The Client Puzzle Protocol, at a high level, is a way to slow down automated bots crawling a site so that their speed approaches that of a human browsing the site. The bots are slowed down by requiring them to solve a “puzzle” or show “proof of work” in order to obtain a secret that allows them to continue to access the site.
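To make the "proof of work" idea concrete, here is a minimal sketch of one common puzzle shape: a hash search. It is not the mechanism this post ends up using (that one is built from gzip), and the function names and the 16-byte challenge size are assumptions chosen for illustration.

```python
import hashlib
import os

def make_challenge() -> bytes:
    """Server side: a random value so answers cannot be precomputed."""
    return os.urandom(16)

def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: bytes, counter: int, difficulty: int) -> bool:
    """Server side: one hash, no matter how long the client searched."""
    digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force search, about 2**difficulty hashes on average."""
    counter = 0
    while not verify(challenge, counter, difficulty):
        counter += 1
    return counter
```

The asymmetry is the point: `solve` does thousands of hashes at even modest difficulty, while `verify` does exactly one.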

Scott Contini implemented a proof of concept of this in JavaScript. That will absolutely work for normal browsers, but most bots are not browsers and do not support JavaScript.

Rate limiting bots is always tricky. Well-coded bots like Googlebot or Bingbot understand HTTP status codes such as “429: Too Many Requests” and react appropriately by slowing down.

  • Other bots? Well, all they see is that the server is now responding super fast, which means they can crawl even faster.

And so they do exactly that. The HTTP response telling them to slow down causes them to go even faster.

My solution, implemented below, specifically addresses the problem above, but it is meant to be easy to adapt to other types of traffic flows. Let’s dig in.

Implementation Specifics

I use HAProxy for load balancing. The following is a simple commented example config:

frontend website_fe
    #Listen on Port 80
    bind *:80
    #Create in memory stick table that stores the rolling average request rate per 10 seconds.
    #Expire entries from table after 30s of inactivity
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src

    #Rules

    #Rule that is true if the query string contains "sup3r-s3cr3t", our token value (case-insensitive, regex is unanchored)
    acl has_token query -i -m reg .*sup3r-s3cr3t
    #Rule that is true if the requesting IP has made greater than 5 requests in 10 seconds
    acl too_fast sc_http_req_rate(0) gt 5
    #Rule that is true if the requesting IP has made greater than 10 requests in 10 seconds
    acl way_too_fast sc_http_req_rate(0) gt 10
    #Rule that is true if the requesting IP has made greater than 15 requests in 10 seconds
    acl way_way_too_fast sc_http_req_rate(0) gt 15

    #Actions (Processed in order)

    #HTTP 403 Deny if > 15 requests
    http-request deny if way_way_too_fast
    #HTTP 429 Deny if > 10 requests
    http-request deny deny_status 429 if way_too_fast
    #Route to the website backend if request includes secret token
    use_backend website_be if has_token
    #Route to the antibot backend if > 5 requests and does not have secret token
    use_backend antibot_be if too_fast !has_token
    #Everything else (normal traffic under the thresholds, without a token) goes to the website backend
    default_backend website_be

backend website_be
    mode http
    #Route request to local service listening on port 8080
    server website 127.0.0.1:8080

backend antibot_be
    mode http
    #Route request to local service listening on port 8095
    server antibot 127.0.0.1:8095

As we can see above, the following actions will occur (all counts are per IP, per 10 seconds):

  • 5 requests or fewer: Normal Operation
  • Greater than 5 requests:
    • If no secret token (sup3r-s3cr3t) is provided in the request, route to the antibot backend to retrieve the puzzle and secret token
    • If a secret token (sup3r-s3cr3t) is provided in the request, route to the website as normal
  • Greater than 10 requests: HTTP 429 Status Code
  • Greater than 15 requests: HTTP 403 Status Code
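The rule order above can be modeled as a small Python function. This is only a sketch for reasoning about the config, not something HAProxy runs; the function name `route` and the string return values are hypothetical.

```python
def route(requests_in_window: int, has_token: bool) -> str:
    """Return the action for one request, checked in the same order as the config."""
    # Deny rules run first, so they apply even to clients holding the token
    if requests_in_window > 15:
        return "403"
    if requests_in_window > 10:
        return "429"
    # Token holders go straight to the website
    if has_token:
        return "website_be"
    # Fast clients without a token get the puzzle
    if requests_in_window > 5:
        return "antibot_be"
    # Everyone else falls through to the default backend
    return "website_be"
```

One thing the model makes easy to see: a client with the token is still denied once it goes above 10 requests in the window, because the deny rules are evaluated before any backend is chosen.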

Pretty simple, right? This is all functionality that’s built into HAProxy by default, which is why I absolutely love it. Now that we’ve established the general structure of routing, let’s look at how to add CPP to the mix.

Client Puzzle Protocol

On port 8095 we have an HTTP service listening that serves up puzzles that must be solved in order to retrieve the secret token.

The following script uses only the Python 3 standard library (socketserver and zlib):

#!/usr/bin/env python3

import socketserver
import zlib

def build_big_gzip():
    # ~1 GB of NUL bytes; deflate shrinks long runs of zeros to almost nothing
    zero_padding = b'\x00' * 999999999
    # Minimal page whose META refresh carries the secret token
    html = """
    <html><meta http-equiv="refresh" content="0;url=http://antibot/?secretToken=sup3r-s3cr3t" /></html>
    """.encode('utf8')
    out = zero_padding + html
    print("Decompressed Size (KiB): ", len(out)//1024)
    # wbits=31 makes zlib emit a gzip header and trailer
    z = zlib.compressobj(-1,zlib.DEFLATED,31)
    return z.compress(out) + z.flush()

biggzip = build_big_gzip()
print("Compressed Size (KiB): ", len(biggzip)//1024)


class MyTCPHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Read (and log) the request; its contents do not change the response
        data = self.request.recv(1024)
        print(data.decode(errors='replace'))
        # sendall() keeps writing until everything is sent; send() may stop early
        self.request.sendall(
            b'HTTP/1.1 200 OK\r\n'
            b'Content-Encoding: gzip\r\n'
            b'Content-Type: text/html; charset=UTF-8\r\n'
            b'Connection: close\r\n'
            b'\r\n'
        )
        self.request.sendall(biggzip)

class MyTCPServer(socketserver.TCPServer):
    allow_reuse_address = True

with MyTCPServer(('0.0.0.0', 8095), MyTCPHandler) as server:
    server.serve_forever()

At the top of the script, I build a bytes object with 999999999 NULL bytes, then append a minimal HTML file that uses META Refresh to redirect to a URL that includes the secret token.

The NULL bytes are important: they keep HTML parsers from thrashing while trying to parse the page, yet still let us pad the file arbitrarily.

This bytes object is then gzip-compressed and stored in a variable.
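Building the whole padded payload in memory before compressing it needs well over 1 GB of RAM at startup. As a hedged alternative, the sketch below feeds the compressor one chunk at a time so that only the compressed output is kept. The function name `build_gzip_streaming` and the 1 MiB chunk size are assumptions, not part of the original script.

```python
import zlib

def build_gzip_streaming(padding_size: int, html: bytes, chunk_size: int = 1 << 20) -> bytes:
    """Compress `padding_size` NUL bytes followed by `html` without building the raw payload."""
    compressor = zlib.compressobj(-1, zlib.DEFLATED, 31)  # wbits=31 selects gzip framing
    chunk = b"\x00" * chunk_size
    parts = []
    remaining = padding_size
    while remaining > 0:
        n = min(remaining, chunk_size)
        # Reuse the same zero-filled buffer for every chunk of padding
        parts.append(compressor.compress(chunk[:n]))
        remaining -= n
    # The HTML goes last, so the token sits at the very end of the stream
    parts.append(compressor.compress(html))
    parts.append(compressor.flush())
    return b"".join(parts)
```

Called as `build_gzip_streaming(999999999, html)`, it should decompress to the same content as the one-shot version, though the compressed bytes are not guaranteed to be identical.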

The compressed size of the HTML (what is sent over the wire to the client) is 949 KB. The client must decompress the full stream (999,999,999 bytes of padding plus the HTML, roughly 1 GB) and parse the trailing HTML to follow the redirect and pick up the token.

That’s it, that’s the puzzle. A massive gzip stream that must be decompressed to retrieve the token.
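For a sense of what "solving" looks like from the client side, here is a minimal sketch of a non-browser client that inflates the stream piece by piece and keeps only the tail, where the META refresh lives. The function name `extract_redirect`, the tail size, and the regular expression are assumptions for illustration; a real browser simply renders the page.

```python
import re
import zlib

def extract_redirect(gzip_chunks, tail_size: int = 4096, max_out: int = 1 << 20):
    """Inflate a gzip stream chunk by chunk and return the META refresh URL, if any."""
    inflater = zlib.decompressobj(31)  # wbits=31: expect gzip framing
    tail = b""
    for chunk in gzip_chunks:
        data = chunk
        while data:
            # Cap each call's output so memory stays bounded
            out = inflater.decompress(data, max_out)
            tail = (tail + out)[-tail_size:]
            data = inflater.unconsumed_tail
    tail = (tail + inflater.flush())[-tail_size:]
    match = re.search(rb'url=([^"\'>\s]+)', tail)
    return match.group(1).decode() if match else None
```

Memory stays small, but there is no shortcut around the work: every padded byte still has to pass through the inflater before the token shows up.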

CPU cheap for the server to validate, CPU heavy for the client.

Video Proof of Concept

Additional Notes

  • The HTML must be prepended with whitespace; otherwise the time browsers take to solve the puzzle varies wildly, presumably because they parse the HTML eagerly.
  • Using META Refresh in HTML works because it embeds the token in the puzzle. A Location header redirect cannot be used, because it skips the puzzle entirely.
  • HTTP trailers are headers that can be added at the end of a chunked response and could theoretically be used.
    • I couldn’t get it to work with any trailer headers in Chrome/Firefox.
    • I’m fairly sure the compatibility table is incorrect.
    • The web browsers I could get to pick up the trailer would puke on the chunked encoding and behave really strangely (downloading a randomly named temp file, then failing and saying the stream didn’t exist in the first place).
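For reference, this is roughly what a chunked response carrying a trailer looks like on the wire. It is a sketch of the HTTP/1.1 framing only and does not show that the approach works; as noted above, browsers did not cooperate. The function name and the trailer name `X-Puzzle-Token` used in the example call are hypothetical.

```python
def chunked_response_with_trailer(body: bytes, trailer_name: str, trailer_value: str,
                                  chunk_size: int = 65536) -> bytes:
    """Build a raw HTTP/1.1 response whose token travels in a trailer field."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Encoding: gzip\r\n"
        "Content-Type: text/html; charset=UTF-8\r\n"
        "Transfer-Encoding: chunked\r\n"
        f"Trailer: {trailer_name}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    parts = [head]
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        # Each chunk: size in hex, CRLF, data, CRLF
        parts.append(f"{len(chunk):x}\r\n".encode("ascii") + chunk + b"\r\n")
    # Zero-length last chunk, then the trailer section, then a blank line
    parts.append(b"0\r\n" + f"{trailer_name}: {trailer_value}\r\n".encode("ascii") + b"\r\n")
    return b"".join(parts)
```

A call such as `chunked_response_with_trailer(biggzip, "X-Puzzle-Token", "sup3r-s3cr3t")` would produce bytes that could be written to the socket in place of the fixed headers and body in the handler above.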
