[url_pattern_index] improve README
Adds core concepts important to the component, including an overview
of building the n-gram index and how URL pattern matching works.
Change-Id: Ia5f00be3835614b644680b1fc08539154cfe2ef6
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3475219
Auto-Submit: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Dave Vandyke <kzar@chromium.org>
Reviewed-by: Kelvin Jiang <kelvinjiang@chromium.org>
Commit-Queue: Kelvin Jiang <kelvinjiang@chromium.org>
Cr-Commit-Position: refs/heads/main@{#987660}
diff --git a/components/url_pattern_index/README b/components/url_pattern_index/README
deleted file mode 100644
index 755e9df3..0000000
--- a/components/url_pattern_index/README
+++ /dev/null
@@ -1,12 +0,0 @@
-UrlPatternIndex component can be used to build an index over a set of URL rules,
-and speed up matching network requests against these rules.
-
-A URL rule (see flat::UrlRule structure) describes a subset of network requests
-that it targets. The essential element of the rule is its URL pattern, which is
-a simplified regular expression (a string with wildcards). UrlPatternIndex is
-mainly based on text fragments extracted from the patterns.
-
-The component uses FlatBuffers serialization library to represent the rules and
-the index. The key advantage of the format is that it does not require
-deserialization. Once built, the data structure can be stored on disk or
-transferred, then copied/loaded/memory-mapped and used directly.
diff --git a/components/url_pattern_index/README.md b/components/url_pattern_index/README.md
new file mode 100644
index 0000000..99c5878
--- /dev/null
+++ b/components/url_pattern_index/README.md
@@ -0,0 +1,129 @@
+# `UrlPatternIndex` overview
+
+The UrlPatternIndex component can be used to build an index over a set of URL
+rules, and speed up matching network requests against these rules.
+
+A URL rule (see `flat::UrlRule` structure) describes a subset of network
+requests that it targets. The essential element of the rule is its URL pattern,
+which is a simplified regular expression (a string with wildcards).
+`UrlPatternIndex` is mainly based on text fragments extracted from the patterns.
+
+The component uses the [FlatBuffers serialization
+library](https://google.github.io/flatbuffers/) to represent the rules and the
+index. The key advantage of the format is that it does not require
+deserialization. Once built, the data structure can be stored on disk or
+transferred, then copied/loaded/memory-mapped and used directly.
+
+# Detailed design
+
+## `UrlPattern`s
+
+The component is built around an underlying concept of a URL pattern, defined in
+the class `UrlPattern`. These patterns are largely inspired by patterns in
+[EasyList / Adblock Plus filters](https://adblockplus.org/filter-cheatsheet) and
+are documented in more detail in the [declarativeNetRequest
+documentation](https://developer.chrome.com/docs/extensions/reference/declarativeNetRequest/#type-RuleCondition).
+
+## Building the index
+
+The underlying goal of the index format is to efficiently check to see if URLs
+match any URL patterns contained in the index. The data structure used here is
+an N-gram filter. An N-gram is a string consisting of N (up to 8) bytes. Currently,
+the component has chosen to use [`kNGramSize = 5`](https://source.chromium.org/chromium/chromium/src/+/main:components/url_pattern_index/url_pattern_index.h;drc=e89a43f45befc5c8e549d765018524d2f81c8765;l=54).
+
+The strategy used in this component is to build a data structure which maps
+`NGram -> vector<UrlRule>`, by finding all N-grams associated with a given URL
+pattern, and picking one of them (the most distinctive one, see
+`UrlPatternIndexBuilder::GetMostDistinctiveNGram`). The URL pattern is then
+inserted into the map associated with that N-gram.
+
+Note: URL patterns have special characters like `*` and `^` which implement
+special wildcard matching. N-grams are built only _between_ these special
+characters.
+
+For example, the URL pattern `foo.com/*abc*` will generate the following 5-grams:
+```
+foo.c
+oo.co
+o.com
+.com/
+```
+
+See
+[url_pattern_index.fbs](https://source.chromium.org/chromium/chromium/src/+/main:components/url_pattern_index/flat/url_pattern_index.fbs)
+for the raw underlying Flatbuffers format which builds the N-gram filter using a
+[custom hash table](https://source.chromium.org/chromium/chromium/src/+/main:components/url_pattern_index/closed_hash_map.h)
+implementation.
+
+## Querying the index
+
+Querying a built index is very similar to building the index in the first place.
+Given a URL, it is broken into all of it's component N-grams, just like the URL
+pattern was above. For example, the URL `https://foo.com/?q=abcdef` would
+generate the following 5-grams:
+```
+https
+ttps:
+tps:/
+ps://
+s://f
+://fo
+//foo
+/foo.
+foo.c
+oo.co
+o.com
+.com/
+com/?
+om/?q
+m/?q=
+/?q=a
+?q=ab
+q=abc
+=abcd
+abcde
+bcdef
+```
+With these N-grams extracted, we can just consider all of the `UrlPattern`s
+which are associated with those N-grams. See `FindMatchInFlatUrlPatternIndex`
+and `FindMatchAmongCandidates` for this logic.
+
+Many of these N-grams match ones that are also present in the `foo.com/*abc*`
+example above , so we can be sure that that URL pattern will be considered
+during pattern evaluation.
+
+## Fallback rules
+You might be thinking "what about URLs whose length is less than N, or
+patterns that generate no N-grams?" We will make sure to put all rules like that
+into a special list called the `fallback_rules` which are applied to every URL
+unconditionally.
+
+## Checking an individual `UrlPattern`
+
+This logic is encapsulated in `UrlPattern::MatchesUrl`. This essentially
+consists of splitting a URL pattern by the `*` wildcard, and considering each
+subpattern in between the `*`s.
+
+There is some complexity here to deal with:
+- `^` separator matching, which matches any ASCII symbol except letters, digits,
+ and the following: `'_', '-', '.', '%'`. See
+ [fuzzy_pattern_matching](https://source.chromium.org/chromium/chromium/src/+/main:components/url_pattern_index/fuzzy_pattern_matching.h).
+- `|` Left/right anchors, which specifies the beginning or end of a URL.
+- `||` Domain anchors, which specifies the start of a (sub-)domain of a URL.
+
+After all this complexity is dealt with, the bulk of the subpattern logic is
+simply `StringPiece::find / std::search`! This component used to use something
+much more complicated ([Knuth-Morris-Pratt
+algorithm](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm)),
+but benchmarking on real URLs proved the simple solution was more optimal (and
+removed the need for a preprocessing step at indexing time), so it was
+[removed](https://codereview.chromium.org/2793993002/).
+
+For example, in checking if `https://foo.com/?q=abcdef` matches `foo.com/*abc*`,
+the component will:
+
+- Split the URL pattern into two pieces: `foo.com/` and `abc`.
+- Try to find `foo.com/` in `https://foo.com/?q=abcdef`, which is a match!
+- Remove the matching prefix
+- Try to find `abc` in `?q=abcdef`, which is a match! This is the last pattern,
+ so return true
\ No newline at end of file