 # Debugging with Swarming

 This document outlines how to debug a test failure on a specific builder
 configuration without needing to repeatedly upload new CL revisions or do CQ dry
 runs.

 [TOC]

 ## Overview & Terms

 *Swarming* is a system operated by the infra team that schedules and runs tasks
 under a specific set of constraints, like "this must run on a macOS 10.13 host"
 or "this must run on a host with an Intel GPU". It is somewhat similar to part
 of [Borg], or to [Kubernetes].

 An *isolate* is an archive containing all the files needed to do a specific task
 on the swarming infrastructure. It contains binaries as well as any libraries
 they link against and any supporting data. An isolate can be thought of as a tarball,
 but held by the "isolate server" and identified by a hash of its contents. The
 isolate also includes the command(s) to run, which is why the command is
 specified when building the isolate, not when executing it.
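To make "identified by a hash of its contents" concrete, here is a toy Python model of content addressing. It is not the isolate server's real manifest format: the function name `toy_isolate_digest`, the JSON manifest, and the use of SHA-1 are all assumptions made for illustration.

```python
import hashlib
import json


def toy_isolate_digest(files: dict[str, bytes], command: list[str]) -> str:
    """Return a hex digest identifying a toy, content-addressed archive.

    Hypothetical illustration only; this is NOT the real .isolated format.
    The digest covers every file's contents and the command to run, so
    changing either one produces a different identifier.
    """
    manifest = {
        "command": command,
        # Record path -> hash of that file's bytes.
        "files": {path: hashlib.sha1(data).hexdigest()
                  for path, data in files.items()},
    }
    # sort_keys makes the serialization deterministic, so identical inputs
    # always hash to the same digest regardless of dict insertion order.
    blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()
```

Because the command is part of what gets hashed, two archives with the same files but different commands get different identifiers, which mirrors why the command is fixed when the isolate is built.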

 Normally, when you do a CQ dry run, something like this happens:

 ```
   for type in builders_to_run:
     targets = compute_targets_for(type)
     isolates = use_swarming_to_build(type, targets) # uploads isolates for targets
     wait_for_swarming_to_be_done()

     for isolate in isolates:
       use_swarming_to_run(type, isolate) # downloads isolates onto the bots used
     wait_for_swarming_to_be_done()
 ```

 When you do a CQ retry on a specific set of bots, that simply constrains
 `builders_to_run` in the pseudocode above. However, if you're trying to rerun a
 specific target on a specific bot to reproduce or debug a failure, a CQ retry
 still wastes a lot of time: it builds and runs *all* targets, even when it is
 restricted to one bot.

 Fortunately, you can manually invoke some steps of this process. What you really
 want to do is:

 ```
   isolate = use_swarming_to_build(type, target) # can't do this yet, see below
   use_swarming_to_run(type, isolate)
 ```

 or perhaps:

 ```
   isolate = upload_to_isolate_server(target_you_built_locally)
   use_swarming_to_run(type, isolate)
 ```

 ## Building an isolate

 At the moment, you can only build an isolate locally, like so (commands you type
 begin with `$`):

 ```
 $ tools/mb/mb.py isolate //$outdir $target
 ```

 This will produce some files in `$outdir`. The two most pertinent are
 `$outdir/$target.isolate` and `$outdir/$target.isolated`. If you've already built
 `$target`, you can save some CPU time by running `tools/mb/mb.py` with `--no-build`:

 ```
 $ tools/mb/mb.py isolate --no-build //$outdir $target
 ```
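If you drive this step from a script, the command line can be assembled in Python. This is a minimal sketch: `mb_isolate_argv`, `isolate_outputs` and `build_isolate` are hypothetical helper names, and it assumes you run from the root of a Chromium `src` checkout with `outdir` given relative to it (for example `out/Default`, which becomes `//out/Default`).

```python
import subprocess


def mb_isolate_argv(outdir: str, target: str, no_build: bool = False) -> list[str]:
    """Assemble the `tools/mb/mb.py isolate` command shown above."""
    argv = ["tools/mb/mb.py", "isolate"]
    if no_build:
        # Reuse an existing build of `target` instead of rebuilding it.
        argv.append("--no-build")
    # As in the documented command, the output directory is written as "//$outdir".
    argv += ["//" + outdir, target]
    return argv


def isolate_outputs(outdir: str, target: str) -> tuple[str, str]:
    """Return the .isolate and .isolated paths that the upload step needs."""
    return f"{outdir}/{target}.isolate", f"{outdir}/{target}.isolated"


def build_isolate(outdir: str, target: str, no_build: bool = False) -> tuple[str, str]:
    # Runs mb.py from the current directory (assumed to be the checkout root)
    # and raises CalledProcessError if it fails.
    subprocess.run(mb_isolate_argv(outdir, target, no_build), check=True)
    return isolate_outputs(outdir, target)
```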

 Support for building an isolate using swarming, which would allow you to build
 for a platform you can't build for locally, does not yet exist.

 ## Uploading an isolate

 You can then upload the resulting isolate to the isolate server:

 ```
 $ tools/swarming_client/isolate.py archive \
       -I https://isolateserver.appspot.com \
       -i $outdir/$target.isolate \
       -s $outdir/$target.isolated
 ```
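The same upload command, assembled as an argument list in Python so it can be logged or passed to `subprocess.run`. This is a sketch only; `isolate_archive_argv` is a hypothetical helper, and the paths follow the `$outdir/$target` convention used above.

```python
import shlex

ISOLATE_SERVER = "https://isolateserver.appspot.com"


def isolate_archive_argv(outdir: str, target: str,
                         server: str = ISOLATE_SERVER) -> list[str]:
    """Assemble the `isolate.py archive` upload command shown above."""
    return [
        "tools/swarming_client/isolate.py", "archive",
        "-I", server,                          # isolate server to upload to
        "-i", f"{outdir}/{target}.isolate",    # the .isolate file
        "-s", f"{outdir}/{target}.isolated",   # the .isolated file
    ]


# Print a copy-pasteable, shell-quoted version of the command.
print(shlex.join(isolate_archive_argv("out/Default", "unit_tests")))
```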

 You may need to log in to `https://isolateserver.appspot.com` to do this:

 ```
 $ python tools/swarming_client/auth.py login \
       --service=https://isolateserver.appspot.com
 ```

 The `isolate.py` tool will emit something like this:

 ```
 e625130b712096e3908266252c8cd779d7f442f1  unit_tests
 ```

 Do not interrupt it with Ctrl-C after it prints this line, even if it seems to
 hang for a minute; let it finish.
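When scripting the upload, you need the hash from that output line for the next step. A minimal Python sketch, assuming each result line has the form shown above (a 40-character hex hash, whitespace, then the target name); `parse_archive_output` is a hypothetical helper, and any line not in that form is ignored.

```python
import re

# One result line: 40 lowercase hex digits, whitespace, then the target name.
_RESULT_LINE = re.compile(r"^([0-9a-f]{40})\s+(\S+)$")


def parse_archive_output(text: str) -> dict[str, str]:
    """Map each target name to the isolate hash printed for it."""
    hashes = {}
    for line in text.splitlines():
        match = _RESULT_LINE.match(line.strip())
        if match:
            digest, target = match.groups()
            hashes[target] = digest
    return hashes


output = "e625130b712096e3908266252c8cd779d7f442f1  unit_tests\n"
isolate_hash = parse_archive_output(output)["unit_tests"]
```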

 ## Running an isolate

 Now that the isolate is on the isolate server with hash `$hash` from the
 previous step, you can run on bots of your choice:

 ```
 $ tools/swarming_client/swarming.py trigger \
     -S https://chromium-swarm.appspot.com \
     -I https://isolateserver.appspot.com \
     -d pool $pool \
     $criteria \
     -s $hash
 ```
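Scripted, the trigger command might look like the following Python sketch. `swarming_trigger_argv` is a hypothetical helper; `criteria` stands in for `$criteria` as a flat list of flags such as `["-d", "os", "Linux"]`, and the pool defaults to "Chrome".

```python
from typing import Sequence

SWARMING_SERVER = "https://chromium-swarm.appspot.com"
ISOLATE_SERVER = "https://isolateserver.appspot.com"


def swarming_trigger_argv(isolate_hash: str, pool: str = "Chrome",
                          criteria: Sequence[str] = ()) -> list[str]:
    """Assemble the `swarming.py trigger` command shown above."""
    return [
        "tools/swarming_client/swarming.py", "trigger",
        "-S", SWARMING_SERVER,
        "-I", ISOLATE_SERVER,
        "-d", "pool", pool,   # the pool is itself selected as a dimension
        *criteria,            # further dimensions, e.g. -d os Linux
        "-s", isolate_hash,   # hash of the uploaded isolate
    ]
```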

 There are two more things you need to fill in here. The first is the pool name;
 you should pick "Chrome" unless you know otherwise. The pool is the collection
 of hosts from which swarming will try to pick bots to run your tasks.

 The second is the criteria, which is how you specify which bot(s) you want your
 task scheduled on. Criteria are expressed as "dimensions", each passed as
 `-d key val` or `--dimension=key val`. In fact, the `-d pool $pool` in the
 command above selects bots by the "pool" dimension. There are many possible
 dimensions; one useful one is "os", as in `-d os Linux`. Other examples
 include:

 * `-d os Mac10.13.6` to select a specific OS version
 * `-d device_type "Pixel 3"` to select a specific Android device type
 * `-d gpu 8086:1912` to select a specific GPU

 The [swarming bot list] allows you to see all the dimensions and the values they
 can take on.
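Dimensions map naturally onto a key/value mapping. The Python sketch below (the helper name `dimension_flags` is hypothetical) expands one into `-d key val` flags and uses `shlex.join` to show the shell quoting that a value such as `Pixel 3` needs when you paste the flags into a terminal.

```python
import shlex


def dimension_flags(dimensions: dict[str, str]) -> list[str]:
    """Expand {"key": "val", ...} into ["-d", "key", "val", ...]."""
    flags = []
    for key, value in dimensions.items():
        flags += ["-d", key, value]
    return flags


criteria = dimension_flags({"device_type": "Pixel 3"})
# The space in "Pixel 3" means the value must be quoted on a shell command line.
print(shlex.join(criteria))  # -d device_type 'Pixel 3'
```

When the list is passed directly to `subprocess.run`, no quoting is needed; each element reaches the program as a single argument.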

 If you need to pass additional arguments to the test, simply add
 `-- $extra_args` to the end of the `swarming.py trigger` command line - anything
 after the `--` will be passed directly to the test.
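In a script, the `--` separator and the test arguments are simply appended to the argument list. This Python sketch uses `shlex.split` so that a quoted argument string is tokenized the way a shell would; the helper name `append_test_args` is hypothetical, and the googletest flags in the example are assumptions about what your test accepts.

```python
import shlex
from typing import Sequence


def append_test_args(argv: Sequence[str], extra_args: str) -> list[str]:
    """Return argv followed by `--` and the shell-tokenized test arguments."""
    return [*argv, "--", *shlex.split(extra_args)]


# Trigger command abbreviated for brevity.
argv = append_test_args(
    ["tools/swarming_client/swarming.py", "trigger",
     "-s", "e625130b712096e3908266252c8cd779d7f442f1"],
    "--gtest_filter='FooTest.*' --gtest_repeat=10",
)
```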

 When you invoke `swarming.py trigger`, it will emit two pieces of information: a
 URL for the task it created, and a command you can run to collect the results of
 that task. For example:

 ```
 Triggered task: ellyjones@chromium.org/os=Linux_pool=Chrome/e625130b712096e3908266252c8cd779d7f442f1
 To collect results, use:
   tools/swarming_client/swarming.py collect -S https://chromium-swarm.appspot.com 46fc393777163310
 Or visit:
   https://chromium-swarm.appspot.com/user/task/46fc393777163310
 ```

 The `collect` command shown there blocks until the task is complete and then
 outputs the task's results. Alternatively, you can load the URL to watch the
 task's progress.
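If you trigger tasks from a script, you can pull the task ID out of that output and build the `collect` command yourself. A minimal Python sketch, assuming the output format matches the example above; `task_id_from_trigger_output` and `collect_argv` are hypothetical helper names.

```python
import re
from typing import Optional

SWARMING_SERVER = "https://chromium-swarm.appspot.com"

# The task URL printed by `swarming.py trigger`, e.g. .../user/task/46fc393777163310
_TASK_URL = re.compile(r"https://\S+/user/task/([0-9a-f]+)")


def task_id_from_trigger_output(text: str) -> Optional[str]:
    """Return the task ID from the printed task URL, or None if absent."""
    match = _TASK_URL.search(text)
    return match.group(1) if match else None


def collect_argv(task_id: str, server: str = SWARMING_SERVER) -> list[str]:
    """Assemble the `swarming.py collect` command, which blocks until done."""
    return ["tools/swarming_client/swarming.py", "collect", "-S", server, task_id]
```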

 ## run-swarmed.py

 A lot of this logic is wrapped up in `tools/run-swarmed.py`, which you can run
 like this:

 ```
 $ tools/run-swarmed.py $outdir $target
 ```

 See the `--help` option of `run-swarmed.py` for more details about that script.

 ## mb.py run

 Similar to `tools/run-swarmed.py`, `mb.py run` bundles much of the logic into a
 single command line. Unlike `tools/run-swarmed.py`, `mb.py run` allows the user
 to specify extra arguments to pass to the test, but its command line is messier.

 To use it, run:

 ```
 $ tools/mb/mb.py run \
     -s --no-default-dimensions \
     -d pool $pool \
     $criteria \
     $outdir $target \
     -- $extra_args
 ```
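The same `mb.py run` invocation, assembled in Python. This is a sketch only: `mb_run_argv` is a hypothetical helper, `criteria` stands in for `$criteria` as a flat list of `-d key val` flags, and the `--` separator is added only when there are test arguments to pass.

```python
from typing import Sequence


def mb_run_argv(outdir: str, target: str, pool: str = "Chrome",
                criteria: Sequence[str] = (),
                extra_args: Sequence[str] = ()) -> list[str]:
    """Assemble the `tools/mb/mb.py run` command shown above."""
    argv = [
        "tools/mb/mb.py", "run",
        "-s", "--no-default-dimensions",  # flags as given in the command above
        "-d", "pool", pool,
        *criteria,
        outdir, target,
    ]
    if extra_args:
        # Everything after `--` is meant for the test itself.
        argv += ["--", *extra_args]
    return argv
```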

 ## Other notes

 If you are looking at a Swarming task page, be sure to check the bottom of the
 page, which gives you commands to:

 * Download the contents of the isolate the task used
 * Reproduce the task's configuration locally
 * Download all output results from the task locally

 [borg]: https://ai.google/research/pubs/pub43438
 [kubernetes]: https://kubernetes.io/
 [swarming bot list]: https://chromium-swarm.appspot.com/botlist