blog

git clone https://git.ce9e.org/blog.git

commit
ceba0e53974eacf40c829dcc995e6926e946ff6c
parent
7a6c12d77196964392fd5e8129081ce7ad3f3c49
Author
Tobias Bengfort <tobias.bengfort@posteo.de>
Date
2025-05-24 18:10
add post on anubis

Diffstat

A _content/posts/2025-05-24-anubis/index.md 202 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 files changed, 202 insertions, 0 deletions


diff --git a/_content/posts/2025-05-24-anubis/index.md b/_content/posts/2025-05-24-anubis/index.md

@@ -0,0 +1,202 @@
---
title: Should Proof-of-Work be standardized for HTTP?
date: 2025-05-24
tags: [code]
description: "Anubis seems to be emerging as the go-to tool to combat scraping. But is it ultimately a good idea?"
---

A couple of days ago I tried to access the excellent Arch Wiki and was greeted
with this text instead (emphasis mine):

> You are seeing this because the administrator of this website has set up
> **Anubis** to protect the server against the scourge of **AI companies
> aggressively scraping websites**. This can and does **cause downtime** for
> the websites, which makes their resources inaccessible for everyone.
>
> Anubis is a compromise. Anubis uses a **Proof-of-Work** scheme in the
> vein of Hashcash, a proposed proof-of-work scheme for reducing email spam.
> The idea is that at individual scales the additional load is ignorable, but
> at mass scraper levels it adds up and makes scraping much more expensive.
>
> …
>
> Please note that Anubis requires the use of **modern JavaScript** features
> that plugins like JShelter will disable. Please disable JShelter or other
> such plugins for this domain.

The Arch Wiki is not the only page that is affected by scraping. [Drew
DeVault](https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html)
also wrote a great article about it, explaining some of the issues sysadmins
are currently facing.

Anubis seems to be emerging as the go-to tool to combat scraping. But it
requires JavaScript, so it also blocks any legitimate attempt to request the
page with anything but a modern web browser. You can also [configure it to let
in specific User Agents](https://anubis.techaro.lol/docs/admin/policies/), but
then attackers could just bypass the protection by using those User Agent
strings.

I believe the JavaScript issue could be fixed by implementing something like
Anubis' Proof-of-Work scheme at the protocol level. But standardizing it would
also be an endorsement of the concept as a whole. So let's first look into it:
How does it work, how does it compare to other mitigations, and is it
ultimately a good idea?

## What is bad about AI companies scraping websites

Scraping on its own just means that programs use HTML that was generated for
humans instead of dedicated APIs to get information. I often end up using that
technique when an API is not available.

There are actually a lot of bots that regularly request HTML that is intended
for humans. For example, search engine crawlers like the Google bot regularly
scan the whole web to update their index. However, in that case search engines
and website owners have a mutual interest: allowing users to find the content.
So they play nice with each other: The crawlers use a unique `User-Agent`
header and voluntarily respect any restrictions that are defined in
`robots.txt`.

What AI companies are doing, on the other hand, seems to be much more similar
to DDoS attacks: Servers get flooded with requests with `User-Agent` headers
that look like regular browsers. They also come from many different IP
addresses, so it is hard to distinguish them from organic traffic.

One issue that is sometimes mentioned is that AI companies only take, but
give nothing back. Search engines cause a little bit of load, but they also
send users to the page. AI companies, on the other hand, just use the content
as training data and do not retain a link to the source.

I am not sure what I think about that. On the one hand, I think this issue is
mostly caused by ad-based monetization, which is a scourge on its own.
Spreading information, in whichever way people want to, is a good thing! On the
other hand, I also don't like when rich companies steal from open source
communities. In the case of the Arch Wiki, the content is published under [GNU
FDL](https://www.gnu.org/licenses/fdl-1.3.html), so scraping it for training AI
models is actually illegal.

For me, the main issue with these attacks (let's call them what they are) is
that they exhaust all resources to the point where servers cannot handle
requests that come from real human users.

## Mitigations

The first line of defense is **performance optimization**. Servers can handle
many more requests if they don't require a lot of resources. However, at some
point this will no longer be sufficient and we need to start blocking requests.
The fundamental issue then is how to distinguish good requests from bad
requests. How can that even be defined?

**CAPTCHAs** define good requests as those that were initiated by humans, so
they require that clients pass a Turing test. While this definition is useful
in some cases, it is not useful for many other cases where we explicitly want
to allow scraping.

**Rate limiting** defines good requests by their frequency. I find this to be a
much better definition for most situations because it roughly translates to
resource usage. However, rate limiting requires that we can identify which
requests come from the same source. If attackers use different IP addresses and
User Agents, it is hard to even realize that all those requests belong
together.
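
For illustration, a minimal token-bucket limiter keyed by the client IP might
look like the sketch below (names and numbers are made up for this example).
The key is exactly the weak spot: if requests arrive from thousands of
different IP addresses, every source looks new and the limiter never triggers.

```python
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second
BURST = 10.0  # maximum bucket size

# One bucket per client IP; rotating IPs defeats this, as described above.
buckets = defaultdict(lambda: {"tokens": BURST, "updated": time.monotonic()})

def allow(client_ip: str) -> bool:
    bucket = buckets[client_ip]
    now = time.monotonic()
    elapsed = now - bucket["updated"]
    bucket["tokens"] = min(BURST, bucket["tokens"] + elapsed * RATE)
    bucket["updated"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False
```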

In that case, we can do **active monitoring** and constantly update our
blocking rules based on request patterns. But there is no guarantee that we
will actually find any patterns. It is also a huge amount of work.

The new idea that **Proof-of-Work** brings to the table is that good requests
need to contribute some of their own resources. However, we do not actually
share the work between client and server. Instead, the client just wastes some
CPU time on complex calculations to signal that it is willing to do its part.
In a way, this is the cryptographic version of [Bullshit
Jobs](https://en.wikipedia.org/wiki/Bullshit_Jobs). Proof-of-Work does not
prevent scrapers from exhausting server resources; it only gives them an
incentive not to.

## Proof-of-Work in Anubis

Anubis is deployed as a proxy in front of the actual application. When a client
first makes a request, Anubis instead serves a page with some JavaScript that
tries to find a string so that `sha256(string + challenge)` starts with
`difficulty` zeroes. Once that string is found, it is sent back to the server.
On success, Anubis stores the challenge and response in a cookie and then
finally lets the user pass to the application.

The challenge is not random. It contains the IP address, current week, and a
secret. This way, a new proof must be calculated for every device, week, and
service.

For further details, see the [Anubis
documentation](https://anubis.techaro.lol/docs/design/how-anubis-works).
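
In Python, a stripped-down sketch of that scheme could look like this. It is
not Anubis' actual code: the exact challenge layout, hash input order, and
difficulty handling are simplified assumptions based on the description above.

```python
import hashlib
from itertools import count

def make_challenge(ip: str, week: str, secret: str) -> str:
    # Binds the proof to a device, week, and service (simplified layout).
    return hashlib.sha256(f"{ip}:{week}:{secret}".encode()).hexdigest()

def solve(challenge: str, difficulty: int) -> str:
    # Brute-force a string so that sha256(string + challenge)
    # starts with `difficulty` hex zeroes.
    for nonce in count():
        digest = hashlib.sha256(f"{nonce}{challenge}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return str(nonce)

def verify(challenge: str, response: str, difficulty: int) -> bool:
    # The server only needs a single hash to check the proof.
    digest = hashlib.sha256(f"{response}{challenge}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```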

## Proof-of-Work in HTTP

This exact mechanism could be integrated into HTTP by adding a new
authentication scheme:

```http
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Proof-Of-Work algorithm=SHA-256, difficulty=5, challenge=ABC
```

```http
Authorization: Proof-Of-Work algorithm=SHA-256, difficulty=5, challenge=ABC, response=XYZ
```

A JavaScript/cookie fallback that works a lot like Anubis could be added for
browsers that do not yet support the new scheme. Also, IP-based exceptions
could be added for important clients like the Google bot until they add
support.

Supporting this scheme at the protocol level would allow support to be
implemented in clients that do not execute JavaScript, e.g. curl. It would also
open new use cases that do not necessarily involve web browsers, e.g.
protecting resource-intensive API endpoints.
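
As a rough sketch, a non-browser client could handle such a challenge like
this (the scheme itself is hypothetical, and the header parsing here is
deliberately naive):

```python
import hashlib
import urllib.error
import urllib.request
from itertools import count

def solve(challenge: str, difficulty: int) -> str:
    # Same brute-force search as in the sketch above.
    for nonce in count():
        digest = hashlib.sha256(f"{nonce}{challenge}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return str(nonce)

def fetch(url: str) -> bytes:
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as err:
        if err.code != 401:
            raise
        # e.g. WWW-Authenticate: Proof-Of-Work algorithm=SHA-256, difficulty=5, challenge=ABC
        scheme, _, rest = err.headers["WWW-Authenticate"].partition(" ")
        params = dict(p.strip().split("=", 1) for p in rest.split(","))
        difficulty = int(params["difficulty"])
        challenge = params["challenge"].strip('"')
        response = solve(challenge, difficulty)
        auth = (f"{scheme} algorithm={params['algorithm']}, "
                f"difficulty={difficulty}, challenge={challenge}, response={response}")
        retry = urllib.request.Request(url, headers={"Authorization": auth})
        return urllib.request.urlopen(retry).read()
```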

## Distribution of Work

Proof-of-Work only works as intended if:

-   it causes negligible load on the server
-   it causes negligible load for casual users
-   it causes significant load for scrapers

But is that the case?

In the case of Anubis, I would say clearly no. The proof takes less than 2
seconds to compute and then stays valid for a whole week. I do not see how that
could ever be considered *significant load*.

Why do people who deploy Anubis still see positive results? I guess this is
mostly because they do something unconventional that scrapers have not yet
adapted to. This is a completely valid mitigation in itself. But it ceases to
work as soon as it becomes too prevalent, so standardizing it would be
counterproductive. And it also doesn't really require wasting CPU time. Just
setting a cookie would work just as well.

Let's look at a more meaningful approach: The server has to verify the proof on
every request, so the client should have to calculate a proof on (nearly) every
request, too. This could be achieved by including the exact URL in the
challenge and reducing the validity to something like 5 minutes.
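
A possible challenge construction for that variant (again just an illustrative
sketch, not an existing implementation):

```python
import hashlib
import time

def make_challenge(ip: str, url: str, secret: str, validity: int = 300) -> str:
    # Binding the challenge to the exact URL and a 5-minute time bucket means
    # a proof cannot be reused for other pages or for very long.
    bucket = int(time.time()) // validity
    return hashlib.sha256(f"{ip}:{url}:{bucket}:{secret}".encode()).hexdigest()
```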

For casual users, I would consider an increase in load time of ~20% as
acceptable. Let's say that is something like 200ms on average. The Arch Wiki
has close to 30,000 pages, so downloading all of them would require clients to
waste ~100 minutes of CPU time. While this is not nothing, I am also not
convinced that this is enough of an obstacle to discourage scrapers.
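
The back-of-the-envelope math behind that number:

```python
pages = 30_000            # approximate size of the Arch Wiki
seconds_per_proof = 0.2   # ~200ms of extra CPU time per page
print(pages * seconds_per_proof / 60)  # 100.0 minutes of CPU time
```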

Also, this whole idea assumes that attackers even care about their resource
usage. DDoS attacks are commonly executed via botnets where attackers have
taken over regular people's devices. In that case, attackers don't really care
about resource use because they don't pay the bill.

## Conclusion

So should the Proof-of-Work scheme be standardized? Performance optimizations
and *doing something unconventional* will only get us so far. We need something
better. And in order to make Proof-of-Work useful, it needs to be standardized.

But does it actually work? I was genuinely excited about Anubis. I liked its
premise:

> The idea is that at individual scales the additional load is ignorable, but
> at mass scraper levels it adds up and makes scraping much more expensive.

But on closer inspection I am not really sure if that balance can be achieved.