High performance Infrastructure to run an executable with all its dependencies mapped in.
One line description: How to push 10000 files worth 2Gb in total to tens of bots within a few seconds, even on Windows or Android.
Enables running an executable on an arbitrary machine by describing the executable's runtime dependency efficiently. Archival and retrieval of the needed files scale sub-linearly. Stale files are evicted from the cache automatically.
To be able to run an executable along with all its dependencies fast, the required files must be transferred. Many techniques can be used, like:
Using the Chromium infrastructure as a real-world example, the Chromium Continuous Integration (CI) infrastructure uses a set of “builders” and testers". The “builders” compiles and archive the generated unit tests. The “testers” run one or multiple tests, or a fraction of a test, i.e. a shard.
Testers have historically required a full checkout to run the tests since they have no information about which data files are read by which test, even less if they are accessed at all. In addition, the executables were transferred as a large zip. That has became problematic as the zip grew larger than 2gb.
Syncing multiple tens of GB of sources is a constant cost that doesn‘t scale well when the tests are sharded across multiple testers. As we want to spread the test execution on more bots, the need to reduce constant costs grows, since it’s becoming the major cost of running a test on a bot.
Authors looked at using a faulty file system instead, e.g. a copy-on-write compile plus mounting the partition on the testers to run the test from the checkout. The problem is that it puts a significant burden on the hardware providing these partitions and the round-trip latency is worsened, since the underlying infrastructure has no idea what data will be needed upfront. This has the net effect of multiplying the latency on the end-to-end execution duration.
For the Chromium team to scale properly, a more deterministic approach needs to be applied to the way tests are run. Guessing what is needed to run a test, manually keeping a list of executables to zip in a python script to send them over to testers is not scalable. So a totally different approach is to make the tests to run in a “pure” environment. Make sure running the tests is idempotent from the surrounding environment. From a directory-tree point of view, the best way to do it is the map the test executable into a temporary directory before running it.
We highly recommend reading the Google engtools blog posts about building in the cloud. In particular, read Testing at the speed and scale of Google and Build in the Cloud: Distributing Build Steps for background information about why you want to do that.
The process works as follow:
.isolatedfile is generated. See IsolateDesign for tools that generate these files. It contains the
.isolatedfile are archived on the isolate server with
run_isolated.pyis provided the SHA-1 of the
.isolatedfile, retrieves the dependencies and maps them all in a temporary directory. It runs the executable and delete the temporary directory afterward.
isolateserver.py interacts with a Content Addressed Cache file datastore hosted on AppEngine named isolate-server.
isolateserver.py can optionally archive on an NFS/Samba share or locally albeit at the loss of cache eviction functionality.
isolate-server is the main infrastructure that acts as a very efficient file cache.
Goal: specifies “how to run a command with all its runtime dependencies”. It's more than just specifying the dependencies, it also describes how to run a command, which includes the relative working directory to start the executable from and the actual command line to execute.
.isolated file is
.isolate file processed by
isolate.py. It lists the exact content of each file that needs to be mapped in the relative path, the content being addressed by it's SHA-1. It also lists the command that should be used to run the executable and the relative directory to start the command in.
.isolated files are JSON file, unlike
.isolate which are python files.
The root is a dictionary with the following keys:
algo: Hashing algorithm used to hash the content. Normally
command: exact command to run as a list.
files: list of dictionary, each key being the relative file path, and the entry being a dict determining the properties of the file. Exactly one of
lmust be present.
mmust be present only on POSIX systems.
h: file content's SHA-1
l: link destination iff a symlink
m: POSIX file mode (required on POSIX, ignored on non-POSIX).
s: file size iff not a symlink
t: type of the file iff not the default of
includes: references another
.isolatedfile for additional files or to provide the command. In practice, this is used to reduce
.isolatedfile size by moving rarely changed test data files in a separate
read_only: boolean to specify if all the files should be read-only. This will be eventually enforced.
relative_cwd: relative directory inside the temporary directory tree when the command should be executed from.
version: version of the file format. Increment the minor version for non breaking changes and the major version if code written for a previous version shouldn't be able to parse it.
There are the following file types;
basic: All normal files, the default type.
ar: An ar archive containing a large number of small files.
tar: An tar archive containing a large number of small files.
.isolated format supports the
includes key to split and merge back list of files in separate
.isolated files. It is in stark contrast with more traditional tree of trees structure like the git tree object rooted to a git commit object.
The reason is to leave a lot of room to the tool generating the
.isolated files to be able to package low-churn files versus high-churn files into a small number of
.isolated files. As a practical example, we can state that test data files are low churn, as an example the rate of change of every
.js file is low. The files in <(PRODUCT_DIR) are high-churn files since they are usually different at each build. It's up to the tool generating the
.isolated files to generate the most optimal setup.
That's the primary design decision to use a flat list and not a tree-based storage for the file mapping like the git tree objects. The reason is that splitting the high-churn files from the low-churn files is not necessarily directly describable in term of directories.
It's important to clarify here that
.isolate are not related in any way to the
.isolated files. Reread the sentence if necessary. The first one is to reuse a common list of runtime dependencies for multiple targets, the second is to optimize the overall size of the
.isolated files to archive and load for bots.
Goal: performance tiered Content Addressed Cache with native compression support. That is, it caches data, with each entry‘s key being the hash of the entry’s content. The performance tier of each entry is specified at each entry upload time, so higher performance can be stored in higher performance backend for faster retrieval.
As such, it is ill named, it doesn't know about the
.isolated file formats. It is a cache, not a permanent data store, so while the semantics can be similar to a Content Addressed Storage, it is much more restricted in its usage and optimized for very fast lookups.
Each item's timestamp is refreshed on storage or request for presence. Fetching, like running
run_isolated.py, never updates the timestamp. Only storing does, like running
The cache uses a global 7 days eviction policy by default so objects are deleted automatically if not tested for presence.
To help future-proof the server, all the objects are stored in a specified namespace. The namespace is used as a signal to specify which hashing algorithm is used (defaults to SHA-1) and if the objects are stored in their plain format or transformed (compressed). The namespace logic also has a special case for temporary objects, any “temporary” namespace is evicted after 1 day instead of the default 7 days.
It is interesting to compare the choice of embedding the hashing algorithm in the namespace instead of each key, like how camlistore does. It slightly reduces the strings overhead and simplifies sending the hashes as binary bytes. A single request handling several items doesn't have to switch of hashing algorithm per item. It is a requirement and is implicitly enforced that a single
.isolated has all its items referenced in the same namespace.
As such, there is a close relationship between the
.isolated and the namespace, since both must use the exact same hash algorithm.
TODO: It could be valuable to use 2 layers of namespaces instead of one, so that the hash algorithm and compression algorithm can be specified independently.
Some files are more important that others. In particular,
.isolated files must have much lower fetch latency than the other ones since they are the bottleneck to fetch more data, i.e. all the dependencies. These high-priority files are stored in memcache in addition to the datastore, so the retrieve operation can complete with a lower latency.
To optimize small object retrieval, small objects (with a current cut off at 20kb, heuristics needs to be done to select a better value) are stored directly inline in the datastore instead of the AppEngine BlobStore or Cloud Storage to reduce inefficient I/O for small objects.
Like most SCM like git and hg but unlike most CAS,
isolate-server supports on-the-wire and in-storage compression while using the uncompressed data to calculate the hash key. Unlike git,
isolate-server doesn't recompress on the fly and do not do inter-file compression.
The reason for the on-the-wire compressed transfer is to greatly reduce the network I/O. It is based on the assumption that most objects are build outputs, usually executables, so they are usually both large and highly compressible. It is important for that the
.isolated files do not need to be modified to switch from the non-compressed namespace to a compressed one so the key is the same for the compressed and uncompressed version but they are stored in different namespaces.
The server is optimized for warm cache usage; the most frequent use case is that a large number of files are already in the cache on store operation. The way to do this is to batch requests for presence at 100 items per HTTP request, greatly reducing the network overhead and latency. Then for each cache miss, the item is uploaded as separate HTTP POST.
The actual algorithm used to do this is a bit more involved, files with recent timestamps are more likely to be not present on the server so they are looked up first. Same wise, larger files are looked up first, since they will incur the largest latency to be uploaded. The batching of requests is gradual, the first request specify a low number of highly probably cache misses, and as the cache misses lower, the batches are larger.
The number of supported requests is designed to be limited for its specific intended use case. See the current API by visiting a live instance with path
/_ah/api/explorer to see the generated documentation.
It's interesting to look at the trade offs with a few Content Addressed Storage systems. Note that the other CAS compared here are not caches but real datastores but the comparison is still useful from an optimization stand-point. Using git (a source control system), bup (a backup software based on git), camlistore (a one-size-fits-all datastore) as comparison.
|Feature vs Tool||git||bup||camlistore||isolateserver|
|Transparent compression||yes||yes||no||yes (but explicit)|
|Inter-file compression||yes||yes||no (but reuse chunks with same rolling hash)||no|
|Independent files request||no (supported through gitweb but inefficient I/O wise due to inter-file compression. Optimized to fetch a whole commit and its history)||no (supported through bup web but not optimal I/O wise due to rolling hash split chunks)||yes||yes|
|Efficient binary files support||no||yes (rolling hash)||yes||yes|
|Access control||external (usually ssh)||external (like git)||yes (AppEngine or native)||limited (IP or AppEngine)|
|Per file ACL||no, everything is visible||no||yes, a subset can be shared||no|
|Permanent reference to mutated objects||yes (tag)||no (each backup is independent?)||yes (permanode)||no|
|Tree of objects||explicit (root tree of a commit)||explicit (like git)||explicit||implicit (isolated files)|
|Automatic eviction policy||yes (GC unreferenced objects)||no||yes (GC unreferenced objects)||yes (explicit LRU, independent of reference tree)|
|Easy to delete older versions of an object||no (shallow clones aren't efficient)||no (near impossible to delete anything efficiently)||Yes (? not sure)||yes|
|Designed to work in distributed setup||yes||yes||yes||no|
|Hash algorithm can be changed||no||no||yes||yes|
|Priority support / fresh object memcaching||no||no||no||yes (explicit)|
|Fast remote object lookup||yes (by git commit only)||yes (like git)||no||yes (arbitrary)|
Note that the server double-checks the SHA-1 of the content uploaded, and will discard the data if there is a mismatch.
Every step is optimized for the “warm” use case.
isolateserver.pylookups hundreds files at a time on Isolate Server to look for presence, and multiple HTTP requests for lookup are done simultaneously. The files are sorted via an heuristics to query the most likely cache misses first.
run_isolated.pykeeps a LRU-based local cache to reduce network I/O, so only new files need to be fetched.
run_isolated.pyuses hardlinks on all OSes to reduce file I/O when creating a temporary tree. A multiple thousands tree can be mapped in mere seconds.
run_isolated.pyfetches multiple files simultaneously to reduce the overall effect of HTTP fetching latency.
.isolatedfiles in memcache for higher performance. Since they are a bottleneck to fetch the remaining dependencies, these files needs to be fetched first before fetching any other file.
To achieve better scalability, this project enables being able to confine each test to a limited view of the available files. In practice, the bottleneck of a Continuous Integration infrastructure will become:
Isolated Server is designed to run on AppEngine so it can be considered a “single distributed server”.
There is no redundancy in the Isolate Server, as it is running on App Engine.
Isolate Server require a valid GAIA account to access the content. An IP whitelist table is also available.