Utility for collecting the list of pervasive resources from the HTTP Archive dataset.
The rules for what URLs qualify as “pervasive” for the purpose of using a shared cache are laid out in the Cache sharing for extremely-pervasive resources document.
The script automates the process: it collects candidate URLs from the last six months of HTTP Archive crawls, filters them based on those restrictions, and generates patterns for the resulting resources.
This is the query that is run for each of the last six months of data to produce an initial dataset to filter from:
    #standardSQL
    SELECT
      url,
      ANY_VALUE(dest) as dest,
      ANY_VALUE(size) as size,
      ANY_VALUE(request_headers) as request_headers,
      ANY_VALUE(response_headers) as response_headers,
      body_hash,
      COUNT(*) as num
    FROM (
      SELECT
        url,
        PARSE_NUMERIC(JSON_VALUE(payload, "$._objectSize")) as size,
        JSON_VALUE(payload, "$._body_hash") as body_hash,
        req_h.value as dest,
        request_headers,
        response_headers
      FROM
        `httparchive.crawl.requests`,
        UNNEST (request_headers) as req_h,
        UNNEST (response_headers) as resp_h
      WHERE
        date = "{year}-{month}-01" AND
        JSON_VALUE(payload, "$._body_hash") IS NOT NULL AND
        lower(resp_h.name) = "cache-control" AND
        lower(resp_h.value) LIKE "%public%" AND
        lower(req_h.name) = "sec-fetch-dest" AND
        (lower(req_h.value) = "script" OR lower(req_h.value) = "style" OR lower(req_h.value) = "empty") AND
        PARSE_NUMERIC(JSON_VALUE(payload, "$._responseCode")) = 200 AND
        PARSE_NUMERIC(JSON_VALUE(payload, "$._objectSize")) > 1000
    ) Hashes
    GROUP BY
      url,
      body_hash
    HAVING
      COUNT(*) > 20000
    ORDER BY
      num DESC
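The `{year}` and `{month}` placeholders in the query suggest that one query string is generated per crawl month. The following is a minimal sketch of how that could be done, not the script's actual code. The names `QUERY_TEMPLATE`, `last_n_months`, and `build_queries` are hypothetical, the template is an abbreviated stand-in for the full query, and counting the six months from the month of `today` is an assumption.

```python
from datetime import date

# Abbreviated stand-in for the full query (assumption: only the date
# placeholder matters for this illustration).
QUERY_TEMPLATE = (
    "SELECT url FROM `httparchive.crawl.requests` "
    'WHERE date = "{year}-{month}-01"'
)


def last_n_months(today, n=6):
    """Return (year, month) string pairs for the n most recent months, newest first.

    Assumption: the month containing `today` counts as the first month.
    """
    months = []
    year, month = today.year, today.month
    for _ in range(n):
        months.append((f"{year:04d}", f"{month:02d}"))
        month -= 1
        if month == 0:
            # Step back across a year boundary.
            year, month = year - 1, 12
    return months


def build_queries(today, n=6):
    """Fill the date placeholder once per month."""
    return [
        QUERY_TEMPLATE.format(year=y, month=m)
        for y, m in last_n_months(today, n)
    ]


if __name__ == "__main__":
    for query in build_queries(date.today()):
        print(query)
```

If the crawl for the current month is not yet complete, passing a date from the previous month as `today` would shift the window back by one month.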
- [...] empty destination that do not include a use-as-dictionary response header.
- [...] public in the cache-control response header.
- [...] set-cookie response header.