Attn: Cooking blog-ish preamble incoming. If you're gagging for the technical stuff, skip to the next heading.

What's happened to the news business is terrifying and everyone is worse off for it. The integrity of journalism is greatly diminished -- public trust in journalism far more so. When I used to scramble words for money, paywalls were considered a cyanide pill. A surefire way to dome your illustrious ad-supported publication. Publishers just did not believe anyone would pay to support the things they thought were important.

This seemed obviously wrong to me then, and those publishers seemed obviously short-sighted and stupid. The content industry's core problem was that it was incentivised to publish the most base, humiliating slop it could because that's what the advertising model begged for, and the only way out of that was to reorient the incentives: stop appealing to the needs of advertisers, start appealing to the needs of readers. 10 years later, am I vindicated? Sure, in one or two ways. Paywalls are de rigueur and thousands of publications from back when are now electron dust. A few problems remain:

  1. The reporting hasn't gotten any better at already established publications. It might've just stopped things getting much worse, much faster, though. That said, the normalisation of paywalls by these bigger pubs has allowed fantastic, predominantly reader-supported startups like 404 Media to flourish.
  2. Publications often use paywalls to double dip their readers. Subscriptions grant access to content, but don't remove advertising or third-party tracking.
  3. The paywalls aren't very robust.

I spent my entire journalism career arguing about the first two; lately I've been curious about the third. To be clear, this is the least important. If most people pay to support journalism while some can get it for free, it's probably not worth an international summit. But as an after-school research project, it tickles me for a couple of reasons:

  1. Authentication is mostly a solved problem on the web, at least at a very basic level. Auth issues still abound, but if you visit a page that requires a login to view, you generally don't expect to... still be able to see the things anyway? But that's how some paywalls work. Arguably, some of the examples below constitute something very similar to Execution After Redirect (EAR) vulnerabilities. Kooky!
  2. I'm pretty interested in how technology can help protect and support journalism. Making good paywalls is part of that.

Anyway, let's read the news.

(Quick disclaimer: the below was undertaken with the desire to do no more than observe requests made in the course of regular usage of each site. Obviously as responsible security researchers we're not trying to exploit servers and applications that aren't ours, right? What this is: a relatively passive investigation of how content is served by the publications in question. What this isn't: a comprehensive effort to exploit or test their access controls. I mean, I don't do that for free, you know?)

Sydney Morning Herald

The Sydney Morning Herald is incredibly generous. Evidently supporters of the open web, they've exposed a public API for retrieving published assets at https://api.smh.com.au/api/content/v0/assets/<id>. Finding these IDs is easy enough, they're in most URLs. For example, author links throughout the site look like:

https://www.smh.com.au/by/jessica-mcsweeney-p5376l

And article links look like:

https://www.smh.com.au/politics/federal/mps-behaving-badly-is-the-ugly-public-face-of-democracy-20240104-p5ev37.html

That six-character string at the end, "p5ev37" etc., is an ID. So you could retrieve an article like:

https://api.smh.com.au/api/content/v0/assets/p5ev37

This returns a pretty big JSON object with all kinds of metadata like edit/publish dates and author name, and, handily, a blob of Unicode-escaped article text.
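The URL-to-API dance can be sketched in a few lines of Python. The ID pattern (a "p" followed by five alphanumerics) is my own guess from the example URLs; the API path is as described above.

```python
import re

# Guessed ID format: "p" plus five alphanumerics at the end of a page URL,
# optionally followed by ".html".
ASSET_ID_RE = re.compile(r"(p[0-9a-z]{5})(?:\.html)?$")

def asset_api_url(page_url: str) -> str:
    """Build the public asset API URL from a regular SMH page URL."""
    match = ASSET_ID_RE.search(page_url)
    if not match:
        raise ValueError(f"no asset ID found in {page_url!r}")
    return f"https://api.smh.com.au/api/content/v0/assets/{match.group(1)}"

print(asset_api_url(
    "https://www.smh.com.au/politics/federal/"
    "mps-behaving-badly-is-the-ugly-public-face-of-democracy-20240104-p5ev37.html"
))
# → https://api.smh.com.au/api/content/v0/assets/p5ev37
```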

Someone reading this blob would also notice elements like:

<x-placeholder id="11b16dd91118418e4f3de93d020381efc5b2bd98">

These are replaced during rendering with elements like anchor and image tags accessed by the corresponding SHA-1 hash ID. In the JSON response they look like:

When the page loads, they're handily assembled into an article. Sweet.
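That assembly step can be approximated with a simple substitution. The fragment HTML below is hypothetical; in reality each hash ID maps to an element delivered elsewhere in the JSON response.

```python
import re

# Hypothetical mapping of placeholder hash IDs to their rendered HTML fragments.
fragments = {
    "11b16dd91118418e4f3de93d020381efc5b2bd98": '<a href="https://example.org/">a link</a>',
}

def render(body: str, fragments: dict) -> str:
    """Swap each <x-placeholder> tag for the fragment keyed by its SHA-1 ID."""
    return re.sub(
        r'<x-placeholder id="([0-9a-f]{40})"\s*/?>',
        lambda m: fragments.get(m.group(1), ""),
        body,
    )

body = 'Read this: <x-placeholder id="11b16dd91118418e4f3de93d020381efc5b2bd98">'
print(render(body, fragments))
# → Read this: <a href="https://example.org/">a link</a>
```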

Let's say you're not a REST fan. Everything's GraphQL now right? Alright cool guy, you can do that too. You can query a GraphQL endpoint at /graphql with something like:

query { assetJSON(shortId: "p5eurl") { asset { body extensions } error { message } } }

And that will return a list of objects like:
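Issuing that query is a single POST. The payload shape is standard GraphQL-over-HTTP; the endpoint path is from above but the host is an assumption on my part, so the actual request is left commented as a sketch.

```python
import json

# The query from above, wrapped in a standard GraphQL-over-HTTP payload.
payload = {
    "query": 'query { assetJSON(shortId: "p5eurl") '
             '{ asset { body extensions } error { message } } }'
}

# A request might then look like (host assumed, so left commented here):
# import requests
# resp = requests.post("https://www.smh.com.au/graphql", json=payload)
# article = resp.json()

print(json.dumps(payload)[:40])
```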

Alright, maybe you're not an API person at all. In the browser, the site takes a second to decide whether you're allowed to read it or not. Once it decides you're not a subscriber, it removes the extra content and throws up the paywall. This content is loaded via JSON.parse() calls into two variables:

window.GLOBAL_VARIABLES

window.INITIAL_STATE

In window.GLOBAL_VARIABLES you'll find this object:

In order for the site to determine whether it should throw up the paywall, that "ENABLED" attribute must be true. If an APT or perhaps nation-state threat actor were to intercept and alter the HTTP response before the rest of the site loaded, they could return "ENABLED": false and frame you for stealing the news.
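As a thought experiment, the interception step is a one-line rewrite of the response body. Only the "ENABLED" attribute is observed above; the surrounding config blob here is hypothetical, and a function like this would be wired into an intercepting proxy such as mitmproxy.

```python
import re

def disable_paywall_flag(body: str) -> str:
    """Rewrite a serialized paywall config so "ENABLED" reads false."""
    return re.sub(r'"ENABLED"\s*:\s*true', '"ENABLED":false', body)

# Hypothetical config blob; only the ENABLED attribute is real.
sample = '{"PAYWALL":{"ENABLED":true}}'
print(disable_paywall_flag(sample))
# → {"PAYWALL":{"ENABLED":false}}
```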

Finally, if you wanted it the really really easy way, because the entire DOM (well, the important bit) is rendered before the paywall is triggered, you could just download it like we're in the 90s.

wget https://www.smh.com.au/politics/federal/article-title-here.html -O page.html; firefox page.html

(Nobody else saved HTML pages to their disk to spare their ailing dialup connection? Just me? Okay.)

Hell, why didn't you say that first? Well, it wouldn't be as interesting, duh. But besides that, it also brings all the embedded scripts with it. When you load an article on the Sydney Morning Herald, it sends requests to a little over 120 domains. A couple of those are first-party, but the vast majority are for tracking and analytics. Kind of barf.

New York Times

The New York Times will also allow you to curl/wget a page, although it only loads assets and scripts from 60+ domains. I was kind of surprised by this for a couple of reasons. First of all, man, it's the Times. Second, they actually do implement some filtering depending on the client accessing the content. If a User-Agent header contains the string "python", it'll fail. Presumably this is to prevent popular automated scraping methods. For example, Python's requests library will make HTTP requests with a User-Agent like the following:

User-Agent: python-requests/2.28.2

It's not a very good protection mechanism though. If you're doing some requests work and want to change this, it's as simple as passing headers into your requests call:

import requests

headers = {"User-Agent": "my-cool-user-agent-NOT-a-hacker"}
requests.get("https://example.org/", headers=headers)

And now your HTTP request will look like:

GET / HTTP/2
Host: example.org
User-Agent: my-cool-user-agent-NOT-a-hacker

We digress. The point is, even though it fails on a Python User-Agent string, it apparently permits curl's.

Also similar to the SMH, the content is loaded into a variable called window.preloadedData. It looks like a GraphQL response and the Times does expose a GraphQL endpoint at https://samizdat-graphql.nytimes.com/graphql. Unfortunately any requests it makes for this content are invisible and we don't have access to the schema so we can only daydream about what that query might look like. Just kidding, we can just search responses for " query ". Because some requests are initiated client-side, we can stumble across a suggestive looking block like:

query ArticleQuery($id: String!, $bodyCount: Int!, $isTheWeekly: Boolean!) {
  article: workOrLocation(id: $id) {
    __typename
    ... on Article {
      id
      compatibility {
        isOak
        hasInlineEmbeddedInteractives
      }
      archiveProperties {
        timesMachineUrl
        lede @stripHtml
        thumbnails {
          height
          url
          width
        }
      }
      collections @filterEmpty {
        id
        url
        slug
      }
      tone
      section {
        id
        name
        displayName
        url
        uri
      }
      subsection {
        id
        name
        displayName
        url
        uri
      }
      sprinkledBody {
        content @filterEmpty {
          __typename
        }
      }
      storyFormat
      ...storyAdRequirements_article
      ...isCompatible_article
      ...isRTL_article
      ...StoryTrackingMapper_article
      ...AssembledArticlePage_article
    }
    ... on RelocatedWork {
      targetUrl
    }
    ... on ExpiredWork {
      id
      uri
    }
  }
}
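Stumbling across blocks like that can be automated: scan any captured response bodies for operation headers. The sample text here is hypothetical and much shorter than a real page.

```python
import re

def find_queries(text: str) -> list[str]:
    """List the named GraphQL operations embedded in a response body."""
    return re.findall(r"query\s+(\w+)\s*\(", text)

# Hypothetical captured response snippet.
sample = 'junk "query ArticleQuery($id: String!, $bodyCount: Int!) { ... }" junk'
print(find_queries(sample))
# → ['ArticleQuery']
```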

And with a bit more work we can find out that the id variable takes in an article's relative path. An id like:

"id": "/2024/01/05/us/trump-supreme-court-colorado-ballot.html"

...will return a very basic structure for an article's contents. In reality, it still requires a lot more work to return any useful content as the above query only returns __typename for the given parameters. To actually get article text, we need to navigate the structure using inline fragments in the sprinkledBody object. Obviously we don't wanna fuzz any parameters, but by carefully examining the API responses in the DOM we can take a stab at it. With a little bit of work you can pass a sprinkledBody object in the query like:

And get back something like:

...which will return the entire article's text (truncated here, obvi.) Neat!
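Putting it together, a request body might pair the query text with its variables like so. The id format is the relative path observed above; the bodyCount and isTheWeekly values are guesses at plausible defaults, and the query text itself is elided.

```python
import json

# Query text elided; see the ArticleQuery block above.
article_query = "query ArticleQuery($id: String!, $bodyCount: Int!, $isTheWeekly: Boolean!) { ... }"

payload = {
    "operationName": "ArticleQuery",
    "query": article_query,
    "variables": {
        "id": "/2024/01/05/us/trump-supreme-court-colorado-ballot.html",
        "bodyCount": 10,       # guessed default
        "isTheWeekly": False,  # guessed default
    },
}
print(json.dumps(payload["variables"], indent=2))
```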

Wired

If you remember the days when visiting a site's home page and clicking a page or two added up to a handful of requests, remember to do your stretches today. Thoroughly in the modern era, Wired makes like a thousand requests on load from more than 350 domains. Man I just wanted to read the news not alert the whole goddamn internet.

Wired is pretty similar to the Times. Any article can be grabbed using curl/wget or any other suitable method. The article's content and metadata are loaded into a variable called window.__PRELOADED_STATE__. There's not much more to say here; you'd expect Conde to have it wrapped up tight, so besides a good ol' fashioned scraping, the content stays on the page.
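For the scraping route, pulling the preloaded state out of a saved page is a short exercise. The sample HTML is hypothetical; the real object is far larger, and a regex like this assumes the assignment is valid JSON ending in "};".

```python
import json
import re

def extract_state(html: str, var: str = "window.__PRELOADED_STATE__") -> dict:
    """Pull the JSON object assigned to a global variable in a page's HTML."""
    match = re.search(re.escape(var) + r"\s*=\s*(\{.*?\});", html, re.DOTALL)
    if not match:
        raise ValueError(f"{var} not found in page")
    return json.loads(match.group(1))

# Hypothetical page; real articles embed a much larger object.
sample = '<script>window.__PRELOADED_STATE__ = {"article": {"title": "Hello"}};</script>'
print(extract_state(sample)["article"]["title"])
# → Hello
```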

<3

So there's my afternoon of research. The above paywalls weren't picked for any special reason besides that I've been a subscriber to two of them (not any more, and they were a nightmare to cancel; dark-patterns blog post when?) and a frequent reader of the third. Taken charitably, the permeability of all of them is a gift to inventive readers. But could we say they seem like half-hearted implementations indicating a half-hearted reorienting of incentives? Could we? We could say that? Let's not put too fine a point on it.

What else did we learn? Most of those paywall-jumping sites probably just run curl from the server and render the results. No wizardry necessary.

If you know someone who knows someone who wants to get yelled at, give them my email. Until then, send me more paywalls.