# Strings in V8
Strings are a fundamental data type in JavaScript, and V8 uses a complex hierarchy of string representations to optimize various operations like concatenation, slicing, and internalization.
## String Representation Hierarchy
All strings in V8 inherit from the `String` class (defined in `src/objects/string.h`). V8 uses different concrete classes depending on how the string was created and how it is used.
### 1. Sequential Strings (`SeqString`)
Captures sequential string values where the characters are stored directly in the object.
* **`SeqOneByteString`**: Characters are stored as 8-bit Latin-1 code units. Used when every character fits in Latin-1 (U+0000 to U+00FF), which includes pure ASCII strings.
* **`SeqTwoByteString`**: Characters are stored as 16-bit UTF-16 code units. Used for strings containing non-Latin-1 characters.
### 2. Cons Strings (`ConsString`)
Describes string values built by using the addition operator (`+`) on strings.
* Instead of copying characters immediately, a `ConsString` is a pair of pointers to the two constituent strings.
* This creates a binary tree of strings, where the leaves are non-Cons strings.
* **Benefit**: Fast concatenation without copying.
* **Flattening**: When a `ConsString` is read or becomes too deep, V8 may "flatten" it by allocating a sequential string and copying the characters into it.
* **Minimum Size**: Cons strings have a minimum size. Very short concatenations may result in a sequential string instead of a cons string to avoid the overhead of small trees.
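
The cons-tree idea above can be sketched in a few lines of standalone C++. This is a simplified model, not V8's implementation: `Node`, `MakeLeaf`, `Concat`, and `Flatten` are hypothetical names, and the minimum-size rule is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: a node is either a flat leaf or a cons pair.
struct Node {
  std::string flat;                   // characters, used only by leaves
  std::shared_ptr<const Node> left;   // first half, used only by cons nodes
  std::shared_ptr<const Node> right;  // second half, used only by cons nodes
  std::size_t length = 0;

  bool is_cons() const { return left != nullptr; }
};

using NodePtr = std::shared_ptr<const Node>;

NodePtr MakeLeaf(std::string chars) {
  auto node = std::make_shared<Node>();
  node->length = chars.size();
  node->flat = std::move(chars);
  return node;
}

// Constant-time concatenation: store two pointers, copy no characters.
NodePtr Concat(NodePtr a, NodePtr b) {
  assert(a && b);
  auto node = std::make_shared<Node>();
  node->length = a->length + b->length;
  node->left = std::move(a);
  node->right = std::move(b);
  return node;
}

// Flattening: one left-to-right walk that copies every leaf into a single
// buffer. An explicit stack avoids deep recursion on unbalanced trees.
std::string Flatten(const NodePtr& root) {
  std::string out;
  out.reserve(root->length);
  std::vector<const Node*> stack{root.get()};
  while (!stack.empty()) {
    const Node* node = stack.back();
    stack.pop_back();
    if (node->is_cons()) {
      stack.push_back(node->right.get());  // popped after the left subtree
      stack.push_back(node->left.get());
    } else {
      out += node->flat;
    }
  }
  return out;
}
```

In this model, repeated `Concat` calls only allocate small pair nodes; the cost of copying characters is deferred until `Flatten` is called.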
### 3. Sliced Strings (`SlicedString`)
Describes strings that are substrings of another sequential string.
* Instead of copying characters for `substr()` or `slice()`, a `SlicedString` contains a pointer to the parent string, an offset, and a length.
* **Benefit**: Fast slicing without copying.
* **Limitation**: Keeps the parent string alive in memory, even if only a small slice is needed.
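
A minimal sketch of the parent/offset/length layout, and of the memory-retention limitation, might look like the following. `Slice` and `MakeSlice` are hypothetical names, and `std::shared_ptr` stands in for a GC reference; none of this is V8 code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical model of a sliced string: a view into a parent string.
struct Slice {
  std::shared_ptr<const std::string> parent;  // keeps the whole parent alive
  std::size_t offset;
  std::size_t length;

  // Character access adds the offset to the index into the parent.
  char at(std::size_t i) const {
    assert(i < length);
    return (*parent)[offset + i];
  }

  // Copying the characters out yields a string independent of the parent.
  std::string Copy() const { return parent->substr(offset, length); }
};

// Creating a slice copies no characters, regardless of its length.
Slice MakeSlice(std::shared_ptr<const std::string> parent, std::size_t offset,
                std::size_t length) {
  assert(offset + length <= parent->size());
  return Slice{std::move(parent), offset, length};
}
```

As long as a `Slice` exists in this model, the full parent buffer stays allocated, even when the slice covers only a few characters of it.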
### 4. Thin Strings (`ThinString`)
Describes string objects that are just references to another string object.
* They are used for **in-place internalization** when the original string cannot actually be internalized in-place.
* In these cases, the original string is converted to a `ThinString` pointing at its internalized version (which is allocated as a new object).
* In terms of memory layout, they can be thought of as "one-part cons strings".
* **Benefit**: Avoids updating all handles pointing to the original string when it is internalized.
* **GC Behavior**: The GC may, but is not guaranteed to, patch pointers to thin strings so that they point directly to the internalized string, eventually allowing the thin string to be reclaimed.
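
The forwarding behavior can be illustrated with a small model. `StringObject`, `MakeThin`, and `Chars` are hypothetical names chosen for this sketch; in V8 the conversion changes the object's map rather than setting a field.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a heap string that can later become "thin".
struct StringObject {
  std::string chars;                     // payload of an ordinary string
  const StringObject* actual = nullptr;  // forwarding pointer once thin

  bool is_thin() const { return actual != nullptr; }
};

// Readers follow the forwarding pointer if the object has become thin.
const std::string& Chars(const StringObject& s) {
  return s.is_thin() ? s.actual->chars : s.chars;
}

// Converts `original` in place into a reference to `internalized`.
// Pointers to `original` stay valid and observe the same characters.
void MakeThin(StringObject& original, const StringObject& internalized) {
  assert(!internalized.is_thin());
  assert(Chars(original) == internalized.chars);
  original.chars.clear();
  original.actual = &internalized;
}
```

Existing pointers to the original object need no update in this model, which mirrors the benefit described above.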
### 5. External Strings (`ExternalString`)
Describes string values that are backed by a string resource that lies outside the V8 heap (e.g., in the embedder like Chrome or Node.js).
* V8 must ensure that the resource is not deallocated while the `ExternalString` is live.
* They come in one-byte and two-byte variants, similar to sequential strings. V8 accesses the characters directly from the external resource, avoiding copying the data into the V8 heap.
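
The relationship between an external string and its resource can be modeled as below. This is a standalone sketch rather than the V8 embedder API: `OneByteResource`, `StaticResource`, and `ExternalStringModel` are hypothetical, and `std::shared_ptr` is only one assumed way to keep the resource alive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical interface for character data owned by the embedder.
class OneByteResource {
 public:
  virtual ~OneByteResource() = default;
  virtual const char* data() const = 0;
  virtual std::size_t length() const = 0;
};

// Example resource backed by a NUL-terminated buffer that outlives it.
class StaticResource final : public OneByteResource {
 public:
  explicit StaticResource(const char* text)
      : text_(text), length_(std::strlen(text)) {}
  const char* data() const override { return text_; }
  std::size_t length() const override { return length_; }

 private:
  const char* text_;
  std::size_t length_;
};

// Model of an external string: it has no character storage of its own.
class ExternalStringModel {
 public:
  explicit ExternalStringModel(std::shared_ptr<const OneByteResource> resource)
      : resource_(std::move(resource)) {
    assert(resource_);
  }
  std::size_t length() const { return resource_->length(); }
  // Reads go straight to the embedder's buffer; nothing is copied.
  char at(std::size_t i) const {
    assert(i < length());
    return resource_->data()[i];
  }
  const char* raw() const { return resource_->data(); }

 private:
  std::shared_ptr<const OneByteResource> resource_;  // keeps resource alive
};
```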
## String Transitions and Internalization
### Internalization
When a string is used as a property key (e.g., `obj["prop"]`), V8 **internalizes** it: it ensures that the **String Table** (a hash table) holds exactly one instance of that string value.
* If the string is already internalized, it returns the existing instance.
* If not, it adds it to the table.
* If a `SeqString` is internalized, V8 changes it to an `InternalizedString` in place when possible.
* If it cannot be changed in place (e.g., if it's a `ConsString`), V8 creates a new `InternalizedString` and converts the original string into a `ThinString` pointing to the new one.
### Flattening
As mentioned above, `ConsString` instances are tree structures. To read characters efficiently or pass them to APIs that expect flat buffers, V8 will flatten the tree into a single `SeqString`.
## String Instance Types and Bitfield
V8 uses the `InstanceType` field in the object Map to identify the specific representation and encoding of a string. For strings, the high-order bits (bits 7-15) are cleared, and the lower bits form a bitfield:
* **Bits 0-2 (Representation)**:
    * `000`: Sequential String
    * `001`: Cons String
    * `010`: External String
    * `011`: Sliced String
    * `101`: Thin String
* **Bit 3 (Encoding)**:
    * `0`: Two-Byte (UTF-16)
    * `1`: One-Byte (Latin-1)
* **Bit 4 (Uncached External)**: Set if the data pointer of an external string is not cached.
* **Bit 5 (Internalization)**:
    * `0`: Internalized String
    * `1`: Not Internalized String
* **Bit 6 (Shared)**: Set if the string is accessible by more than one thread.
This bitfield layout allows V8 to perform extremely fast checks (e.g., checking if a string is one-byte or internalized) using simple bitwise operations.
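
To make the mask-and-compare checks concrete, the sketch below transcribes the layout described above into constants and predicates. The constant and function names are illustrative and are not claimed to match V8's identifiers; only the bit positions come from the text.

```cpp
#include <cstdint>

using InstanceType = std::uint16_t;

// Bit layout as described above (names are illustrative).
constexpr InstanceType kNotStringMask = 0xFF80;      // bits 7-15
constexpr InstanceType kRepresentationMask = 0b111;  // bits 0-2
constexpr InstanceType kSeqTag = 0b000;
constexpr InstanceType kConsTag = 0b001;
constexpr InstanceType kExternalTag = 0b010;
constexpr InstanceType kSlicedTag = 0b011;
constexpr InstanceType kThinTag = 0b101;
constexpr InstanceType kOneByteBit = 1u << 3;
constexpr InstanceType kUncachedExternalBit = 1u << 4;
constexpr InstanceType kNotInternalizedBit = 1u << 5;
constexpr InstanceType kSharedBit = 1u << 6;

// A string has all of bits 7-15 cleared.
constexpr bool IsString(InstanceType t) { return (t & kNotStringMask) == 0; }

constexpr InstanceType Representation(InstanceType t) {
  return static_cast<InstanceType>(t & kRepresentationMask);
}

constexpr bool IsOneByte(InstanceType t) { return (t & kOneByteBit) != 0; }

// Because "internalized" is encoded as 0, one mask-and-compare answers
// "is this a string AND is it internalized?".
constexpr bool IsInternalizedString(InstanceType t) {
  return (t & (kNotStringMask | kNotInternalizedBit)) == 0;
}

// Representation and encoding can likewise be tested together.
constexpr bool IsSeqOneByte(InstanceType t) {
  return (t & (kRepresentationMask | kOneByteBit)) == (kSeqTag | kOneByteBit);
}
```

One consequence of this encoding, visible in `IsInternalizedString`, is that choosing `0` for the common positive case lets two conditions collapse into a single comparison against zero.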
## The String Table
The **String Table** is a hash table that stores all internalized strings.
### How it Works
* **Uniqueness**: Every string value in the table is unique.
* **Lookup**: When V8 needs to internalize a string, it first computes its hash and looks it up in the String Table.
* **Sharing**: If found, the existing string instance is returned. If not found, the new string is added to the table.
* **Use Case**: Property names, symbol descriptions, and common identifiers are internalized to allow fast comparison by pointer equality instead of character-by-character comparison.
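
The lookup-or-insert behavior can be approximated with a standard hash set. This is a sketch under simplifying assumptions, not V8's table: `StringTable` and `Intern` here are hypothetical, and `std::unordered_set` is used because pointers to its elements remain valid when the table rehashes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical model of a string table holding one instance per value.
class StringTable {
 public:
  // Returns the canonical instance for `value`, inserting it on first use.
  const std::string* Intern(const std::string& value) {
    // insert() hashes the value and returns the existing element if present.
    const std::string* canonical = &*table_.insert(value).first;
    assert(*canonical == value);
    return canonical;
  }

  std::size_t size() const { return table_.size(); }

 private:
  std::unordered_set<std::string> table_;
};
```

Once two keys have been interned through the same table, comparing the returned pointers is enough to decide whether the keys are equal; no characters need to be compared.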
### Thread Safety
* **Shared String Table**: V8 can be configured to use a single shared string table across all isolates in a process (enabled by default when the V8 Sandbox or shared isolates are used).
* **Locking**: Access to the shared string table is protected by locks to ensure thread safety when multiple isolates are internalizing strings concurrently.
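
As a rough illustration of lock-protected access, the sketch below guards a table with a single mutex. `SharedStringTable` is a hypothetical name, and the single coarse lock is an assumption made for brevity; the text above does not specify V8's locking granularity.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical model of a string table shared by several threads.
class SharedStringTable {
 public:
  // Lookup and insertion happen under one lock, so two threads that intern
  // the same value concurrently receive the same canonical instance.
  const std::string* Intern(const std::string& value) {
    std::lock_guard<std::mutex> guard(mutex_);
    const std::string* canonical = &*table_.insert(value).first;
    assert(*canonical == value);
    return canonical;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return table_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_set<std::string> table_;
};
```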
## File Structure
* `src/objects/string.h`: Main header file defining the string hierarchy.
* `src/objects/string.tq`: Torque definitions for strings.
* `src/snapshot/code-serializer.cc`: Handles serialization of strings for code caching.