Native Symbols

This doc describes how SuperSize breaks down native binaries into symbols.

Overview

Native symbols are those with a section of:

  • .text (executable code)
  • .rodata (read-only data)
  • .data (writable data)
  • .data.rel.ro (data that is read-only after ELF relocations are applied)
  • .bss (symbols that are zero-initialized; these consume no space in the binary, so they are generally ignored despite still being collected)

There are 3 modes that SuperSize can use to break an ELF down into symbols:

  • linker_map - Uses linker map + build directory to create symbols.
  • dwarf - Uses debug information to create symbols.
  • sections - Creates one symbol for each ELF section.

Mode: linker_map

This is the mode that produces the largest number of symbols, and thus is the preferred mode. Information provided only by this mode:

  • Path information for symbols outside of .text
    • DWARF information is complete for .text symbols (maybe because stack symbolization is a primary use-case?), but incomplete or missing for symbols in other sections.
  • String literals (.rodata symbols that look like "some string dat...").
    • Linker map files contain ** merge strings entries, which tell us where string tables exist within .rodata.
  • object_path, which is useful for attributing STL usages to individual source files.
  • Path aliases - when an inline symbol is used by multiple source files, we attribute the symbol's cost equally among the files.
  • Linker-generated symbols. E.g. Switch tables.
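
The path alias attribution described above can be sketched as follows. This is a minimal illustration, not SuperSize's actual code: the function names and the (size, paths) tuple shape are hypothetical, chosen only to show the equal split.

```python
from collections import defaultdict


def attribute_alias_cost(size, source_paths):
    """Split one symbol's size equally among the source files that share it."""
    share = size / len(source_paths)
    return {path: share for path in source_paths}


def per_file_totals(symbols):
    """Sum the shares per file.

    symbols: iterable of (size, [source_paths]) tuples (hypothetical shape).
    """
    totals = defaultdict(float)
    for size, paths in symbols:
        for path, share in attribute_alias_cost(size, paths).items():
            totals[path] += share
    return dict(totals)


# A 12-byte inline symbol used by three files, plus a 10-byte symbol in a.cc.
totals = per_file_totals([(12, ["a.cc", "b.cc", "c.cc"]), (10, ["a.cc"])])
```

Each of the three files is charged 4 bytes for the shared symbol, so the per-file totals still add up to the binary's real size.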

Data Sources

  • build.ninja is parsed to get:
    • List of .o and .a files that were inputs to the linker.
    • Mapping of .cc -> .o files.
  • All .o (and .a) files are parsed:
    • with nm to get symbol list.
    • Non-ThinLTO: with nm to get list of string literals
    • ThinLTO: with llvm-bcanalyzer to get list of string literals
  • ELF file is parsed with nm to get the list of symbol names that were identical-code-folded to the same address.
  • Linker map (created via -Wl,-Map=output.map) parsed to get:
    • Full list of symbols that comprise the binary.
    • Location of string tables (** merge strings entries).
    • Non-ThinLTO: object_path (.o file) associated with each symbol
    • Note:
      • With ThinLTO, object_path points to a hashed filename within the thinlto cache (not useful).
      • When multiple symbols are folded together due to Identical Code Folding, the linker map file lists only one of them.
  • ELF file string tables are parsed by looking for \0 bytes and creating string literal symbols for each string therein.
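
As a rough illustration of the last data source, the sketch below splits a string table on \0 bytes and emits one entry per literal. The function name, the tuple layout, and the 16-character preview length are assumptions made for the example; they are not taken from SuperSize.

```python
def split_string_table(data, base_address, preview_len=16):
    """Return (address, size, preview) for each NUL-terminated string in data.

    data: raw bytes of one string table.
    base_address: address of the first byte of the table (assumed known from
        the linker map's ** merge strings entry).
    """
    literals = []
    offset = 0
    while offset < len(data):
        end = data.find(b"\0", offset)
        # An unterminated final run is treated as extending to the table end.
        next_offset = len(data) if end == -1 else end + 1
        raw = data[offset:next_offset]
        text = raw.rstrip(b"\0").decode("utf-8", errors="replace")
        if len(text) > preview_len:
            text = text[:preview_len] + "..."
        # The size includes the terminating NUL, so sizes sum to len(data).
        literals.append((base_address + offset, len(raw), '"%s"' % text))
        offset = next_offset
    return literals


table = b"hello\0some string data here\0"
literals = split_string_table(table, 0x1000)
```

Counting the terminator in each literal's size keeps the literals' total equal to the size of the string table they came from.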

Algorithm

  1. Create initial symbol list from linker map.
  2. Assign object paths by seeing which .o files define each symbol (match up the names).
    • When multiple files define the same symbol, create symbol aliases.
  3. Create string literal symbols from string tables, and assign them paths based on which .o files define the same string literal.
  4. Assign source_path using the .o -> .cc mapping from build.ninja.
    • This means that .h files are never listed as sources. No information about inlined symbols is gathered (by design).
  5. Create symbol aliases when nm reports multiple symbols mapping to the same address.
  6. Normalize source_path by removing generated path prefix (and adding FLAG_GENERATED) when applicable.
  7. Normalize symbol names.
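
Steps 2 and 4 amount to two dictionary lookups. The sketch below assumes the nm output and the build.ninja mapping have already been parsed into plain dicts; the function name and data shapes are hypothetical and only illustrate the matching.

```python
from collections import defaultdict


def assign_paths(symbol_names, names_by_object, source_by_object):
    """Return {symbol_name: [(object_path, source_path), ...]}.

    symbol_names: names taken from the linker map.
    names_by_object: {object_path: [symbol names defined there]}, as if
        collected by running nm on each .o file.
    source_by_object: {object_path: source_path}, as if read from build.ninja.
    """
    objects_by_name = defaultdict(list)
    for object_path, names in names_by_object.items():
        for name in names:
            objects_by_name[name].append(object_path)

    result = {}
    for name in symbol_names:
        # More than one entry means the symbol becomes a set of path aliases.
        result[name] = [
            (obj, source_by_object.get(obj, ""))
            for obj in sorted(objects_by_name.get(name, []))
        ]
    return result


names_by_object = {
    "obj/a.o": ["foo()", "std::vector<int>::size()"],
    "obj/b.o": ["bar()", "std::vector<int>::size()"],
}
source_by_object = {"obj/a.o": "a.cc", "obj/b.o": "b.cc"}
paths = assign_paths(
    ["foo()", "std::vector<int>::size()", "linker_generated"],
    names_by_object,
    source_by_object,
)
```

Here the STL symbol is defined in both object files and so gets two aliases, while the linker-generated symbol matches no object file and is left without a path.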

Mode: dwarf

Creates symbols using only an ELF with debug information enabled. Requires compiler flag -gmlt to enable full source paths (rather than just basename).

Algorithm

  1. Create initial symbol list with nm --print-size.
  2. Add name aliases using output from nm (this could have been done at the same time as the previous step, but is done as a separate step in order to share logic with linker_map mode).
  3. Use dwarfdump to find all DW_TAG_compile_unit and DW_AT_ranges entries and create a map of address range -> source path.
  4. Assign source paths to .text symbols based on symbol address.
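
The address range -> source path lookup in the last two steps can be done with a sorted list and a binary search. This is a sketch under the assumption that ranges do not overlap and that end addresses are exclusive; the function names are hypothetical and the dwarfdump parsing itself is not shown.

```python
import bisect


def build_range_index(ranges):
    """Sort (start, end, source_path) ranges for binary search.

    Ranges are assumed to be non-overlapping, with end exclusive.
    """
    ordered = sorted(ranges)
    starts = [start for start, _, _ in ordered]
    return starts, ordered


def source_path_for(address, index):
    """Return the source path whose range contains address, or None."""
    starts, ordered = index
    # Find the last range that starts at or before the address.
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0:
        start, end, path = ordered[i]
        if start <= address < end:
            return path
    return None


index = build_range_index([
    (0x2000, 0x2400, "b.cc"),
    (0x1000, 0x1800, "a.cc"),
])
path = source_path_for(0x1004, index)
```

Addresses that fall in a gap between ranges return None, which matches the idea that only symbols covered by debug information receive a source path.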

Why not use Bloaty?

Bloaty is an excellent tool, and produces size information with similar fidelity to “dwarf” mode, as it uses the same data source. We did not use Bloaty since “dwarfdump” was already readily available and gave similar results. It would be nice to also have a “bloaty” mode so that we could more directly compare outputs.

Mode: sections

This mode uses readelf -s to create one symbol for each ELF section. It is used for native files where no debug information or linker map file is available, and for native files whose ABI does not match the --abi-filter.

Data Normalization

Some manipulation happens in order to make names and paths more human-readable.

  • (anonymous::) is removed from names (and stored as a symbol flag).
  • [clone] suffix is removed (and stored as a symbol flag).
  • vtable for Foo -> Foo [vtable]
  • Mangling done by linkers is undone (e.g. prefixing with “unlikely.”)
  • Names are processed into:
    • name: Name without template and argument parameters.
    • template_name: Name without argument parameters.
    • full_name: Name with all parameters.
  • LLVM function outlining creates many OUTLINED_FUNCTION_* symbols. These are renamed to ** outlined functions or ** outlined functions * (count), and are de-duped so an address can have at most one such symbol.
    • Update: Outlining was ARM64-only, and has been disabled in our build due to performance regressions.
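
To make the name / template_name / full_name split and the vtable rewrite concrete, here is a simplified sketch. It is not SuperSize's implementation: the helper names are hypothetical, and it deliberately ignores awkward cases such as operator<, operator(), and parenthesized namespace markers.

```python
def strip_brackets(name, open_ch, close_ch):
    """Remove every balanced open_ch...close_ch group (nested ones too)."""
    out = []
    depth = 0
    for ch in name:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def split_names(full_name):
    """Return (name, template_name, full_name) for a demangled symbol."""
    template_name = strip_brackets(full_name, "(", ")")
    name = strip_brackets(template_name, "<", ">")
    return name, template_name, full_name


def normalize_vtable(name):
    """Rewrite 'vtable for Foo' as 'Foo [vtable]'; leave other names alone."""
    prefix = "vtable for "
    if name.startswith(prefix):
        return name[len(prefix):] + " [vtable]"
    return name


full = "base::Foo<int>::Bar(std::vector<int> const&)"
name, template_name, full_name = split_names(full)
```

For this input, name is base::Foo::Bar and template_name is base::Foo<int>::Bar, so symbols that differ only in template or argument parameters can be grouped under one name.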