components/url_pattern_index/README.md

UrlPatternIndex overview

The UrlPatternIndex component can be used to build an index over a set of URL rules, and speed up matching network requests against these rules.

A URL rule (see flat::UrlRule structure) describes a subset of network requests that it targets. The essential element of the rule is its URL pattern, which is a simplified regular expression (a string with wildcards). UrlPatternIndex is mainly based on text fragments extracted from the patterns.

The component uses the FlatBuffers serialization library to represent the rules and the index. The key advantage of the format is that it does not require deserialization. Once built, the data structure can be stored on disk or transferred, then copied/loaded/memory-mapped and used directly.

Detailed design

UrlPatterns

The component is built around an underlying concept of a URL pattern, defined in the class UrlPattern. These patterns are largely inspired by patterns in EasyList / Adblock Plus filters and are documented in more detail in the declarativeNetRequest documentation.

Building the index

The goal of the index format is to efficiently check whether a URL matches any of the URL patterns contained in the index. The data structure used here is an N-gram filter. An N-gram is a string consisting of N (up to 8) bytes. Currently, the component uses kNGramSize = 5.
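The 8-byte upper bound suggests that an N-gram can be held in a single 64-bit integer. The following is a minimal sketch of that idea, not the component's actual code: PackNGram is a hypothetical name, and the byte order (first byte ends up most significant) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical helper: packs an N-gram of up to 8 bytes into one 64-bit key.
// The first byte of the N-gram ends up in the most significant used position.
std::uint64_t PackNGram(std::string_view ngram) {
  assert(ngram.size() <= 8);  // more than 8 bytes would not fit in 64 bits
  std::uint64_t key = 0;
  for (unsigned char byte : ngram) {
    key = (key << 8) | byte;
  }
  return key;
}
```

With a fixed N, distinct N-grams map to distinct keys, so the integer can stand in for the string as a hash table key.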

The strategy used in this component is to build a data structure which maps NGram -> vector<UrlRule>, by finding all N-grams associated with a given URL pattern, and picking one of them (the most distinctive one, see UrlPatternIndexBuilder::GetMostDistinctiveNGram). The URL pattern is then inserted into the map associated with that N-gram.

Note: URL patterns have special characters like * and ^ which implement special wildcard matching. N-grams are built only between these special characters.

For example, the URL pattern foo.com/*abc* will generate the following 5-grams:

foo.c
oo.co
o.com
.com/
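The extraction above can be sketched as a sliding window that restarts at every special character. This is an illustration only: PatternNGrams is a hypothetical helper, and it treats just * and ^ as special because those are the two characters named in the note above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: returns every n-gram of `pattern` that lies entirely
// between the special characters '*' and '^'.
std::vector<std::string> PatternNGrams(std::string_view pattern,
                                       std::size_t n = 5) {
  assert(n > 0);
  std::vector<std::string> ngrams;
  std::size_t piece_start = 0;
  for (std::size_t i = 0; i <= pattern.size(); ++i) {
    const bool at_end = (i == pattern.size());
    if (!at_end && pattern[i] != '*' && pattern[i] != '^') continue;
    // [piece_start, i) is a maximal run without special characters.
    for (std::size_t j = piece_start; j + n <= i; ++j) {
      ngrams.emplace_back(pattern.substr(j, n));
    }
    piece_start = i + 1;
  }
  return ngrams;
}
```

For foo.com/*abc* this yields the four 5-grams listed above; the piece abc is shorter than 5 bytes and contributes nothing.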

See url_pattern_index.fbs for the underlying FlatBuffers format, which builds the N-gram filter using a custom hash table implementation.

Querying the index

Querying a built index is very similar to building the index in the first place. Given a URL, it is broken into all of its component N-grams, just like the URL pattern was above. For example, the URL https://foo.com/?q=abcdef would generate the following 5-grams:

https
ttps:
tps:/
ps://
s://f
://fo
//foo
/foo.
foo.c
oo.co
o.com
.com/
com/?
om/?q
m/?q=
/?q=a
?q=ab
q=abc
=abcd
abcde
bcdef

With these N-grams extracted, we can just consider all of the UrlPatterns which are associated with those N-grams. See FindMatchInFlatUrlPatternIndex and FindMatchAmongCandidates for this logic.

Several of these N-grams are also present in the foo.com/*abc* example above, so we can be sure that this URL pattern will be considered during pattern evaluation.

Fallback rules

You might be thinking “what about URLs whose length is less than N, or patterns that generate no N-grams?” All such rules are placed in a special list called fallback_rules, which is applied to every URL unconditionally.
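The build, query, and fallback steps described so far can be put together in a small in-memory model. This is a sketch under stated assumptions, not the FlatBuffers-backed implementation: ToyIndex, NGrams, Add, and Candidates are hypothetical names, patterns are stored as plain strings, and "most distinctive" is assumed to mean the N-gram with the fewest rules filed under it so far.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

constexpr std::size_t kN = 5;  // mirrors kNGramSize from the text

// All n-grams of `text`. For patterns, n-grams may not span '*' or '^'.
std::vector<std::string> NGrams(std::string_view text, bool is_pattern) {
  std::vector<std::string> out;
  std::size_t start = 0;
  for (std::size_t i = 0; i <= text.size(); ++i) {
    const bool boundary = i == text.size() ||
                          (is_pattern && (text[i] == '*' || text[i] == '^'));
    if (!boundary) continue;
    for (std::size_t j = start; j + kN <= i; ++j) {
      out.emplace_back(text.substr(j, kN));
    }
    start = i + 1;
  }
  return out;
}

// Toy stand-in for the real index; every name here is hypothetical.
struct ToyIndex {
  std::unordered_map<std::string, std::vector<std::string>> by_ngram;
  std::vector<std::string> fallback_rules;

  void Add(const std::string& pattern) {
    const std::vector<std::string> ngrams = NGrams(pattern, true);
    if (ngrams.empty()) {  // no usable n-gram: the rule must always be checked
      fallback_rules.push_back(pattern);
      return;
    }
    // Assumption: "most distinctive" = fewest rules filed under it so far.
    const std::string* best = &ngrams.front();
    for (const std::string& g : ngrams) {
      if (Count(g) < Count(*best)) best = &g;
    }
    by_ngram[*best].push_back(pattern);
  }

  // Patterns that still need a full match check against `url`. May contain
  // duplicates if the same n-gram occurs more than once in the URL.
  std::vector<std::string> Candidates(std::string_view url) const {
    std::vector<std::string> out = fallback_rules;
    for (const std::string& g : NGrams(url, false)) {
      auto it = by_ngram.find(g);
      if (it != by_ngram.end()) {
        out.insert(out.end(), it->second.begin(), it->second.end());
      }
    }
    return out;
  }

 private:
  std::size_t Count(const std::string& g) const {
    auto it = by_ngram.find(g);
    return it == by_ngram.end() ? 0 : it->second.size();
  }
};
```

In this model a URL shorter than N bytes produces no N-grams, so only the fallback rules are returned as candidates for it.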

Checking an individual UrlPattern

This logic is encapsulated in UrlPattern::MatchesUrl. This essentially consists of splitting a URL pattern by the * wildcard, and considering each subpattern in between the *s.

There is some complexity here to deal with:

  • ^ separator matching, which matches any ASCII symbol except letters, digits, and the following: '_', '-', '.', '%'. See fuzzy_pattern_matching.
  • | Left/right anchors, which specify the beginning or end of a URL.
  • || Domain anchors, which specify the start of a (sub-)domain of a URL.
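The ^ separator rule from the first bullet can be written as a small predicate. This is a sketch of the rule as stated here, not the code in fuzzy_pattern_matching: IsSeparator is a hypothetical name, and treating every non-ASCII byte as a non-separator is an assumption drawn from the word "ASCII" in the description.

```cpp
#include <cassert>

// Hypothetical predicate: true if `c` would be matched by '^' under the rule
// "any ASCII symbol except letters, digits, and '_', '-', '.', '%'".
bool IsSeparator(char c) {
  const unsigned char u = static_cast<unsigned char>(c);
  if (u > 0x7F) return false;  // assumption: non-ASCII bytes never match
  const bool is_letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  const bool is_digit = c >= '0' && c <= '9';
  const bool is_exempt = c == '_' || c == '-' || c == '.' || c == '%';
  return !(is_letter || is_digit || is_exempt);
}
```

Under this rule the characters /, :, ? and = in a URL all count as separators, while the dots inside a hostname do not.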

After all this complexity is dealt with, the bulk of the subpattern logic is simply StringPiece::find / std::search! This component used to use something much more complicated (the Knuth-Morris-Pratt algorithm), but benchmarking on real URLs showed that the simple solution performed better (and removed the need for a preprocessing step at indexing time), so the Knuth-Morris-Pratt implementation was removed.

For example, in checking if https://foo.com/?q=abcdef matches foo.com/*abc*, the component will:

  • Split the URL pattern into two pieces: foo.com/ and abc.
  • Try to find foo.com/ in https://foo.com/?q=abcdef, which is a match!
  • Remove everything up to and including the match, leaving ?q=abcdef.
  • Try to find abc in ?q=abcdef, which is a match! This is the last piece, so return true.
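The steps above can be sketched as a loop over the pieces between the * wildcards. This is a deliberately reduced illustration, not UrlPattern::MatchesUrl itself: MatchesUnanchored is a hypothetical name, and the sketch handles only unanchored patterns with * wildcards, leaving out ^ separators and the | and || anchors.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical simplification: matches an unanchored pattern containing only
// '*' wildcards against `url` by finding each piece in order.
bool MatchesUnanchored(std::string_view pattern, std::string_view url) {
  std::size_t piece_start = 0;
  while (piece_start <= pattern.size()) {
    std::size_t star = pattern.find('*', piece_start);
    if (star == std::string_view::npos) star = pattern.size();
    const std::string_view piece =
        pattern.substr(piece_start, star - piece_start);
    if (!piece.empty()) {
      const std::size_t pos = url.find(piece);
      if (pos == std::string_view::npos) return false;
      // Drop everything up to and including the match before the next piece.
      url.remove_prefix(pos + piece.size());
    }
    piece_start = star + 1;
  }
  return true;
}
```

Taking the first occurrence of each piece is sufficient in this unanchored case, because an earlier match always leaves at least as much of the URL available for the remaining pieces.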