commit ceba0e53974eacf40c829dcc995e6926e946ff6c
parent 7a6c12d77196964392fd5e8129081ce7ad3f3c49
Author: Tobias Bengfort <tobias.bengfort@posteo.de>
Date:   2025-05-24 18:10

add post on anubis
Diffstat
A | _content/posts/2025-05-24-anubis/index.md | 202 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
1 file changed, 202 insertions, 0 deletions
diff --git a/_content/posts/2025-05-24-anubis/index.md b/_content/posts/2025-05-24-anubis/index.md
@@ -0,0 +1,202 @@
---
title: Should Proof-of-Work be standardized for HTTP?
date: 2025-05-24
tags: [code]
description: "Anubis seems to emerge as the go-to tool to combat scraping. But is it ultimately a good idea?"
---

A couple of days ago I tried to access the excellent Arch Wiki and was greeted
with this text instead (emphasis mine):

> You are seeing this because the administrator of this website has set up
> **Anubis** to protect the server against the scourge of **AI companies
> aggressively scraping websites**. This can and does **cause downtime** for
> the websites, which makes their resources inaccessible for everyone.
>
> Anubis is a compromise. Anubis uses a **Proof-of-Work** scheme in the
> vein of Hashcash, a proposed proof-of-work scheme for reducing email spam.
> The idea is that at individual scales the additional load is ignorable, but
> at mass scraper levels it adds up and makes scraping much more expensive.
>
> …
>
> Please note that Anubis requires the use of **modern JavaScript** features
> that plugins like JShelter will disable. Please disable JShelter or other
> such plugins for this domain.

The Arch Wiki is not the only page that is affected by scraping. [Drew
DeVault](https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html)
also wrote a great article about it, explaining some of the issues sysadmins
are currently facing.

Anubis seems to be emerging as the go-to tool to combat scraping. But it
requires JavaScript, so it also blocks any legitimate attempt to request the
page with anything but a modern web browser. You can also [configure it to let
in specific User Agents](https://anubis.techaro.lol/docs/admin/policies/), but
then attackers could just bypass the protection by using those User Agent
strings.

I believe the JavaScript issue could be fixed by implementing something like
Anubis' Proof-of-Work scheme on the protocol level. But standardizing it would
also be an endorsement of the concept as a whole. So let's first look into it:
How does it work, how does it compare to other mitigations, and is it
ultimately a good idea?

## What is bad about AI companies scraping websites

Scraping on its own just means that programs use HTML that was generated for
humans instead of dedicated APIs to get information. I often end up using that
technique when an API is not available.

There are actually a lot of bots that regularly request HTML that is intended
for humans. For example, search engine crawlers like the Google bot regularly
scan the whole web to update their index. However, in that case search engines
and website owners have a mutual interest: allowing users to find the content.
So they play nice with each other: The crawlers use a unique `User-Agent`
header and voluntarily respect any restrictions that are defined in
`robots.txt`.

What AI companies are doing, on the other hand, seems to be much more similar
to DDoS attacks: Servers get flooded with requests whose `User-Agent` headers
look like those of regular browsers. They also come from many different IP
addresses, so it is hard to distinguish them from organic traffic.
One issue that is sometimes mentioned is that AI companies only take, but
give nothing back. Search engines cause a little bit of load, but they also
send users to the page. AI companies, on the other hand, just use the content
as training data and do not retain a link to the source.

I am not sure what I think about that. On the one hand, I think this issue is
mostly caused by ad-based monetization, which is a scourge on its own.
Spreading information, in whichever way people want to, is a good thing! On the
other hand, I also don't like it when rich companies steal from open source
communities. In the case of the Arch Wiki, the content is published under the
[GNU FDL](https://www.gnu.org/licenses/fdl-1.3.html), so scraping it for
training AI models is actually illegal.

For me, the main issue with these attacks (let's call them what they are) is
that they exhaust all resources to the point where servers cannot handle
requests that come from real human users.

## Mitigations

The first line of defense is **performance optimization**. Servers can handle
many more requests if each one requires fewer resources. However, at some
point this will no longer be sufficient and we need to start blocking requests.
The fundamental issue then is how to distinguish good requests from bad
requests. How can that even be defined?

**CAPTCHAs** define good requests as those that were initiated by humans, so
they require that clients pass a Turing test. While this definition is useful
in some cases, it is not useful for the many other cases where we explicitly
want to allow scraping.

**Rate limiting** defines good requests by their frequency. I find this to be a
much better definition for most situations because it roughly translates to
resource usage. However, rate limiting requires that we can identify which
requests come from the same source. If attackers use different IP addresses and
User Agents, it is hard to even realize that all those requests belong
together.

In that case, we can do **active monitoring** and constantly update our
blocking rules based on request patterns. But there is no guarantee that we
will actually find any patterns. It is also a huge amount of work.

The new idea that **Proof-of-Work** brings to the table is that good requests
need to contribute some of their own resources. However, we do not actually
share the work between client and server. Instead, the client just wastes some
CPU time on some complex calculations to signal that it is willing to do its
part. In a way, this is the cryptographic version of [Bullshit
Jobs](https://en.wikipedia.org/wiki/Bullshit_Jobs). Proof-of-Work does not
prevent scrapers from exhausting server resources, but it creates an economic
incentive to scrape less aggressively.

## Proof-of-Work in Anubis

Anubis is deployed as a proxy in front of the actual application. When a client
first makes a request, Anubis instead responds with a page containing some
JavaScript that tries to find a string so that `sha256(string + challenge)`
starts with `difficulty` zeroes. Once that string is found, it is sent back to
the server. On success, Anubis stores the challenge and response in a cookie
and then finally lets the user pass to the application.
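To make the mechanism concrete, here is a minimal sketch of both sides in
Python. This is purely illustrative, not Anubis' actual code (Anubis runs the
search in the browser, in JavaScript); the function names are made up, and
"zeroes" is interpreted as leading hex digits here.

```python
import hashlib
import itertools


def solve(challenge: str, difficulty: int) -> str:
    """Client side: find a string so that sha256(string + challenge)
    starts with `difficulty` zeroes."""
    for nonce in itertools.count():
        candidate = str(nonce)
        digest = hashlib.sha256((candidate + challenge).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return candidate


def verify(challenge: str, response: str, difficulty: int) -> bool:
    """Server side: checking the proof takes a single hash."""
    digest = hashlib.sha256((response + challenge).encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the whole point: the client needs on the order of
16^difficulty hash attempts on average, while the server verifies the result
with a single hash.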
The challenge is not random. It contains the IP address, the current week, and
a secret. This way, a new proof must be calculated for every device, week, and
service.

For further details, see the [Anubis
documentation](https://anubis.techaro.lol/docs/design/how-anubis-works).

## Proof-of-Work in HTTP

This exact mechanism could be integrated into HTTP by adding a new
authentication scheme:

```http
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC
```

```http
Authorization: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC response=XYZ
```

A JavaScript/cookie fallback that works a lot like Anubis could be added for
browsers that do not yet support the new scheme. Also, IP-based exceptions
could be added for important clients like the Google bot until they add
support.

Supporting this scheme on the protocol level would make it possible to
implement support in clients that do not execute JavaScript, e.g. curl. It
would also open up new use cases that do not necessarily involve web browsers,
e.g. protecting resource-intensive API endpoints.
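As a sketch of what that would enable: a client without any JavaScript engine
could handle the challenge in a few lines. The following Python is hypothetical
(no such scheme is standardized yet), reuses the `solve()` function from the
sketch above, and assumes the space-separated header syntax from the example:

```python
import requests  # third-party HTTP client, used here for brevity


def fetch_with_pow(url: str) -> requests.Response:
    """Fetch a URL; if the server answers 401 with a Proof-Of-Work
    challenge, solve it and retry with an Authorization header."""
    resp = requests.get(url)
    if resp.status_code != 401:
        return resp

    # e.g. "Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC"
    scheme, _, rest = resp.headers.get("WWW-Authenticate", "").partition(" ")
    if scheme != "Proof-Of-Work":
        return resp
    params = dict(item.split("=", 1) for item in rest.split())

    response = solve(params["challenge"], int(params["difficulty"]))
    auth = (
        f"Proof-Of-Work algorithm={params['algorithm']} "
        f"difficulty={params['difficulty']} "
        f"challenge={params['challenge']} response={response}"
    )
    return requests.get(url, headers={"Authorization": auth})
```

The same handful of lines could live in curl or in any HTTP library, which is
exactly what a JavaScript-based challenge rules out.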
## Distribution of Work

Proof-of-Work only works as intended if:

- it causes negligible load on the server
- it causes negligible load for casual users
- it causes significant load for scrapers

But is that the case?

In the case of Anubis, I would say clearly no. The proof takes less than 2
seconds to compute and then stays valid for a whole week. I do not see how that
could ever be considered *significant load*.

Why do people who deploy Anubis still see positive results? I guess this is
mostly because they do something unconventional that scrapers have not yet
adapted to. This is a completely valid mitigation in itself. But it ceases to
work as soon as it becomes too prevalent, so standardizing it would be
counter-productive. And it doesn't really require wasting CPU time either.
Simply setting a cookie would work just as well.

Let's look at a more meaningful approach: The server has to verify the proof on
every request, so the client should have to calculate a proof on (nearly) every
request, too. This could be achieved by including the exact URL in the
challenge and reducing the validity to something like 5 minutes.

For casual users, I would consider an increase in load time of ~20% as
acceptable. Let's say that is something like 200 ms on average. The Arch Wiki
has close to 30,000 pages, so downloading all of them would require clients to
waste roughly 30,000 × 200 ms ≈ 100 minutes of CPU time. While this is not
nothing, I am also not convinced that it is enough of an obstacle to discourage
scrapers.

Also, this whole idea assumes that attackers even care about their resource
usage. DDoS attacks are commonly executed via botnets where attackers have
taken over regular people's devices. In that case, attackers don't really care
about resource use because they don't pay the bill.

## Conclusion

So should the Proof-of-Work scheme be standardized? Performance optimizations
and *doing something unconventional* will only get us so far. We need something
better. And in order to make Proof-of-Work useful, it needs to be standardized.

But does it actually work? I was genuinely excited about Anubis. I liked its
premise:

> The idea is that at individual scales the additional load is ignorable, but
> at mass scraper levels it adds up and makes scraping much more expensive.

But on closer inspection, I am not really sure that this balance can be
achieved.