Add a tracing framework (really just logging).

This isn't a performance tracing framework (unlike the old ruy tracing).
This is about understanding what happens inside a ruy::Mul with a view
toward documenting how ruy works.

Added a 'parametrized_example' to help play with this tracing on any
flavor of ruy::Mul call. This also serves as a more elaborate example
of how to call ruy::Mul, and as a single binary containing several
different instantiations of the ruy::Mul template, which is useful
for measuring binary size and showing a breakdown of ruy symbols in
a document.
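For reference, the simplest flavor of such a ruy::Mul call (plain float,
no quantization) has roughly the following shape, modeled on ruy's
example code; this is a sketch that needs the ruy headers and library to
build:

```cpp
#include "ruy/ruy.h"

int main() {
  ruy::Context context;

  // 2x2 matrices as plain arrays; ruy::Matrix adopts pointers, it does
  // not copy the data.
  const float lhs_data[] = {1, 2, 3, 4};
  const float rhs_data[] = {1, 2, 3, 4};
  float dst_data[4];

  ruy::Matrix<float> lhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);
  ruy::Matrix<float> rhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);
  ruy::Matrix<float> dst;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);

  // A default-constructed MulParams means an unquantized float multiply;
  // other flavors (quantized, clamped, etc.) are expressed through it.
  ruy::MulParams<float, float> mul_params;
  ruy::Mul(lhs, rhs, mul_params, &context, &dst);
  return 0;
}
```

Each distinct combination of element types and MulParams instantiates the
ruy::Mul template separately, which is what makes a multi-flavor binary
useful for binary-size measurements.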

A few code changes beyond tracing slipped in:
 - Improved logic in determining the traversal order in MakeBlockMap:
   In rectangular cases, since we first do the top-level rectangularness
   subdivision with linear traversal anyway, the traversal order only
   applies within each subdivision past that, so it should be based
   on sizes already divided by rectangularness. In practice this nudges
   1000x400x2000 from kFractalHilbert to kFractalU on Pixel4, without
   making an observable perf difference in that case.
 - Removed the old RUY_BLOCK_MAP_DEBUG logging code: superseded.
   Kept only a minimal hook to force a block_size_log2 choice.
 - Wrote new comments on BlockMap internals.
 - Fixed Ctx::set_runtime_enabled_paths to behave as documented:
   passing Path::kNone reverts to the default behavior (auto detect).
 - Exposed Context::set_runtime_enabled_paths.
 - Renamed UseSimpleLoop -> GetUseSimpleLoop (easier to read trace).

PiperOrigin-RevId: 352695092
19 files changed
tree: 192a231991a43336b9049b38dd651b35d4157935
README.md

The ruy matrix multiplication library

This is not an officially supported Google product.

ruy is a matrix multiplication library. Its focus is to cover the matrix multiplication needs of neural network inference engines. Its initial user has been TensorFlow Lite, where it is used by default on the ARM CPU architecture.

ruy supports both floating-point and 8-bit-integer-quantized matrices.

Efficiency

ruy is designed to achieve high performance not just on very large sizes, as is the focus of many established libraries, but on the actual sizes and shapes of matrices that matter most in current TensorFlow Lite applications. This often means quite small sizes, e.g. 100x100 or even 50x50, and all sorts of rectangular shapes. It's not as fast as completely specialized code for each shape, but it aims to offer a good compromise of speed across all shapes and a small binary size.

Documentation

Some documentation will eventually be available in the doc/ directory; see doc/README.md.