 # A Crash Course in Debugging with about:net-internals

 This document is intended to help get people started debugging network errors
 with about:net-internals, with some commonly useful tips and tricks.  It
 focuses on how to use some of its features to investigate bug reports, rather
 than serving as a feature overview.

 It would probably be useful to read
 [life-of-a-url-request.md](life-of-a-url-request.md) before this document.

 # What Data Net-Internals Contains

 about:net-internals provides a view of browser activity from net/'s perspective.
 For this reason, it lacks knowledge of tabs, navigation, frames, resource types,
 etc.

 The top level network stack object is the URLRequestContext.  The Events View
 has information for all Chrome URLRequestContexts that are hooked up to the
 single, global, ChromeNetLog object.  This includes both incognito and
 non-incognito profiles, among other things.  The Events view only shows events for
 the period that net-internals was open and running, and is incrementally updated
 as events occur.  The code attempts to add a top level event for URLRequests
 that were active when the tab was opened, to help debug hung requests, but
 that's best-effort only, and only includes requests for the current profile and
 the system URLRequestContext.

 The other views are all snapshots of the current state of the main
 URLRequestContext's components, and are updated on a 5 second timer.  These will
 show objects that were created before about:net-internals was opened.  Most
 debugging is done with the Events view (which will be all this document
 covers), but it's good to be aware of this distinction.

 # Events vs Sources

 The Events view shows events logged by the NetLog.  The NetLog model is that
 long-lived network stack objects, called sources, emit events over their
 lifetime.  When looking at the code, a "BoundNetLog" object contains a source
 ID, and a pointer to the NetLog the source emits events to.  Some events have a
 beginning and end point (during which other subevents may occur), and some only
 occur at a single point in time.  Generally only one event can be occurring for a
 source at a time.  If there can be multiple events doing completely independent
 things, the code often uses new sources to represent the parallelism.
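
The source/event model above can be sketched in a few lines of Python.  This is only a toy illustration, not the NetLog implementation: the `Event` record, the `BEGIN` / `END` / `NONE` phase names, and the `pair_spans` helper are all assumptions made up for this example, and the event names are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical phase labels for this sketch (not taken from net/).
BEGIN, END, NONE = "BEGIN", "END", "NONE"


@dataclass(frozen=True)
class Event:
    source_id: int    # ID of the source that emitted the event
    event_type: str   # placeholder event name
    phase: str        # BEGIN / END for spans, NONE for point-in-time events
    time_ms: int


Span = Tuple[str, int, Optional[int]]  # (event_type, start_ms, end_ms or None)


def pair_spans(events: List[Event]) -> Dict[int, List[Span]]:
    """Group events by source and match each BEGIN with its END.

    A span whose END never arrives keeps end_ms=None, which is roughly
    what a hung source looks like in this toy model.
    """
    spans: Dict[int, List[Span]] = {}
    open_spans: Dict[int, List[int]] = {}  # per source: indexes of open spans
    for ev in sorted(events, key=lambda e: e.time_ms):
        out = spans.setdefault(ev.source_id, [])
        stack = open_spans.setdefault(ev.source_id, [])
        if ev.phase == BEGIN:
            stack.append(len(out))
            out.append((ev.event_type, ev.time_ms, None))
        elif ev.phase == END:
            # Close the most recently opened span of the same type.
            for pos in range(len(stack) - 1, -1, -1):
                idx = stack[pos]
                if out[idx][0] == ev.event_type:
                    out[idx] = (ev.event_type, out[idx][1], ev.time_ms)
                    del stack[pos]
                    break
        else:
            # Point-in-time event: start and end coincide.
            out.append((ev.event_type, ev.time_ms, ev.time_ms))
    return spans


example_events = [
    Event(1, "REQUEST_ALIVE", BEGIN, 0),   # never ends in this example
    Event(2, "SSL_CONNECT", BEGIN, 2),
    Event(1, "CANCELLED", NONE, 7),
    Event(2, "SSL_CONNECT", END, 9),
]
example_spans = pair_spans(example_events)

# Sources that still have an unfinished span.
unfinished = [
    (source_id, event_type)
    for source_id, source_spans in example_spans.items()
    for (event_type, _start, end) in source_spans
    if end is None
]
```

Here source 1 ends up with one span that never closes plus one point-in-time event, while source 2 has a single completed span.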

 "Sources" correspond to certain net objects, however, multiple layers of net/
 will often log to a single source.  Here are the main source types and what they
 include (Excluding HTTP2 [SPDY]/QUIC):

 * URL_REQUEST:  This corresponds to the URLRequest object.  It includes events
 from all the URLRequestJobs, HttpCache::Transactions, NetworkTransactions,
 HttpStreamFactoryImpl::Requests, HttpStream implementations, and
 HttpStreamParsers used to service a response.  If the URL_REQUEST follows HTTP
 redirects, it will include each redirect.  This is a lot of stuff, but generally
 only one object is doing work at a time.  This event source includes the full URL
 and generally includes the request / response headers (Except when the cache
 handles the response).

 * HTTP_STREAM_JOB:  This corresponds to HttpStreamFactoryImpl::Job (Note that
 one Request can have multiple Jobs).  It also includes its proxy and DNS
 lookups.  HTTP_STREAM_JOB log events are separate from URL_REQUEST because
 two stream jobs may be created and race against each other, in some cases -
 one for QUIC, and one for HTTP.  One of the final events of this source
 indicates how an HttpStream was created (Reusing an existing SOCKET /
 HTTP2_SESSION / QUIC_SESSION, or creating a new one).

 * CONNECT_JOB:  This corresponds to the ConnectJob subclasses that each socket
 pool uses.  A successful CONNECT_JOB returns a SOCKET.  The events here vary a
 lot by job type.  Their main event is generally either to create a socket, or
 request a socket from another socket pool (Which creates another CONNECT_JOB)
 and then do some extra work on top of that - like establish an SSL connection on
 top of a TCP connection.

 * SOCKET:  These correspond to TCPSockets, but may also have other classes
 layered on top of them (Like an SSLClientSocket).  This is a bit different from
 the other classes, where the name corresponds to the topmost class instead of
 the bottommost one.  This is largely an artifact of the fact that the socket is
 created first, and then SSL (Or a proxy connection) is layered on top of it.
 SOCKETs may be reused between multiple requests, and a request may end up
 getting a socket created for another request.

 * HOST_RESOLVER_IMPL_JOB:  These correspond to HostResolverImpl::Job.  They
 include information about how long the lookup was queued, each DNS request that
 was attempted (With the platform or built-in resolver) and all the other sources
 that are waiting on the job.

 When one source depends on another, the code generally logs an event with a
 "source_dependency" value to both sources, which lets you jump between the two
 related events.
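
Following "source_dependency" links by hand is essentially a graph walk.  The sketch below shows that walk over a list of event records.  The record layout (`source_id`, `source_type`, and a `params` dictionary holding `source_dependency`) is an assumption chosen for this example rather than a documented format, and `related_sources` is a hypothetical helper.

```python
from collections import deque
from typing import Dict, List, Set


def related_sources(events: List[dict], start_id: int) -> List[int]:
    """Breadth-first walk over source_dependency links, in both directions.

    Returns source IDs in the order they are reached from start_id.
    """
    links: Dict[int, Set[int]] = {}
    for ev in events:
        dep = ev.get("params", {}).get("source_dependency")
        if dep is None:
            continue
        a, b = ev["source_id"], dep["id"]
        # Treat the link as bidirectional so the walk can go up or down.
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)

    seen = [start_id]
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for neighbor in sorted(links.get(current, ())):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


# Assumed record layout; IDs and the dependency chain are made up.
example_log = [
    {"source_id": 10, "source_type": "URL_REQUEST",
     "params": {"source_dependency": {"id": 11, "type": "HTTP_STREAM_JOB"}}},
    {"source_id": 11, "source_type": "HTTP_STREAM_JOB",
     "params": {"source_dependency": {"id": 12, "type": "CONNECT_JOB"}}},
    {"source_id": 12, "source_type": "CONNECT_JOB",
     "params": {"source_dependency": {"id": 13, "type": "SOCKET"}}},
    {"source_id": 20, "source_type": "URL_REQUEST", "params": {}},
]

down_from_request = related_sources(example_log, 10)
up_from_socket = related_sources(example_log, 13)
```

Starting from the URL_REQUEST walks down to the SOCKET, and starting from the SOCKET walks back up, which mirrors jumping between related events in the UI.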

 # Debugging

 When you receive a report from the user, the first thing you'll generally want
 to do is find the URL_REQUEST[s] that are misbehaving.  If the user gives an ERR_*
 code or the exact URL of the resource that won't load, you can just search for
 it.  If it's an upload, you can search for "post", or if it's a redirect issue,
 you can search for "redirect".  However, you often won't have much information
 about the actual problem.  There are two filters in net-internals that can help
 in a lot of cases:

 * "type:URL_REQUEST is:error" will restrict the list to URL_REQUEST objects with
 an error of some sort (red background).  Cache errors are often non-fatal, so
 you should generally ignore those, and look for a more interesting one.

 * "type:URL_REQUEST sort:duration" will show the longest-lived requests first.
 This is often useful in finding hung or slow requests.

 For a list of other filter commands, you can mouse over the question mark on
 about:net-internals.
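
To make the effect of those two filters concrete, here is a rough sketch that applies the same kinds of terms to in-memory summaries.  It is not how net-internals implements filtering: `SourceSummary`, `apply_filter`, and the matching rules are assumptions for illustration, and only the filter terms quoted above are handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceSummary:
    """Hypothetical one-line summary of a source, as a list row might show."""
    source_id: int
    source_type: str
    start_ms: int
    end_ms: Optional[int]  # None while the source is still active
    is_error: bool
    description: str


def duration_ms(src: SourceSummary, now_ms: int) -> int:
    """Lifetime so far; active sources are measured up to now_ms."""
    end = src.end_ms if src.end_ms is not None else now_ms
    return end - src.start_ms


def apply_filter(sources: List[SourceSummary], query: str,
                 now_ms: int) -> List[SourceSummary]:
    """Handle 'type:X', 'is:error', 'sort:duration', and plain-text terms."""
    result = list(sources)
    sort_by_duration = False
    for term in query.split():
        if term.startswith("type:"):
            wanted = term[len("type:"):]
            result = [s for s in result if s.source_type == wanted]
        elif term == "is:error":
            result = [s for s in result if s.is_error]
        elif term == "sort:duration":
            sort_by_duration = True
        else:
            # Plain text: case-insensitive substring match (an assumption).
            result = [s for s in result
                      if term.lower() in s.description.lower()]
    if sort_by_duration:
        result.sort(key=lambda s: duration_ms(s, now_ms), reverse=True)
    return result


example_sources = [
    SourceSummary(1, "URL_REQUEST", 0, 50, False, "https://example.test/a"),
    SourceSummary(2, "URL_REQUEST", 10, None, False, "https://example.test/hung"),
    SourceSummary(3, "URL_REQUEST", 20, 30, True, "https://example.test/broken"),
    SourceSummary(4, "SOCKET", 5, 500, True, "example.test:443"),
]

errors = apply_filter(example_sources, "type:URL_REQUEST is:error", now_ms=1000)
slowest = apply_filter(example_sources, "type:URL_REQUEST sort:duration", now_ms=1000)
```

In this example the still-active request sorts first under "sort:duration", which is why that filter tends to surface hung requests.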

 Once you locate the problematic request, the next step is to figure out where the
 problem is - it's often one of the last events, though it could also be related
 to response or request headers.  You can use "source_dependency" links to drill
 down into other related sources, or up from layers below URL_REQUEST.

 You can use the name of an event to search for the code responsible for that
 event, and try to deduce what went wrong before/after a particular event.  Note
 that the event names shown in net-internals are not the entire string names
 used in the code, so search for the name as a substring rather than as an
 exact, whole-string match.

 Some things to look for while debugging:

 * CANCELLED events almost always come from outside the network stack.

 * Changing networks and entering / exiting suspend mode can have all sorts of
 fun and exciting effects on underway network activity.  Network changes log a
 top level NETWORK_CHANGED event with no source - the event itself is treated as
 its own source.  Suspend events are currently not logged.

 * URL_REQUEST_DELEGATE / DELEGATE_INFO events mean a URL_REQUEST is blocked on a
 URLRequest::Delegate or the NetworkDelegate, which are implemented outside the
 network stack.  A request will sometimes be CANCELLED here for reasons known only
 to the delegate.  Or the delegate may cause a hang.  In general, to debug issues
 related to delegates, one needs to figure out which method of which object is
 causing the problem.  The object may be a NetworkDelegate, a
 ResourceThrottle, a ResourceHandler, the ResourceLoader itself, or the
 ResourceDispatcherHost.

 * Sockets are often reused between requests.  If a request is on a stale
 (reused) socket, what was the previous request that used the socket, and how
 long ago was it made?

 * SSL negotiation is a process fraught with peril, particularly with broken
 proxies.  Such connections will generally stall or fail in the SSL_CONNECT
 phase at the SOCKET layer.

 * Range requests have magic to handle them at the cache layer, and are often
 issued by the media and PDF code.

 * Late binding:  HTTP_STREAM_JOBs are not associated with any CONNECT_JOB until
 a CONNECT_JOB actually connects.  This is so the highest priority pending job
 gets the first available socket (Which may be a new socket, or an old one that's
 freed up).  For this reason, it can be a little tricky to relate hung
 HTTP_STREAM_JOBs to CONNECT_JOBs.

 * Each CONNECT_JOB belongs to a "group", which has a limit of 6 connections.  If
 all CONNECT_JOBs belonging to a group (The CONNECT_JOB's description field) are
 stalled waiting on an available socket, the group probably has 6 sockets that
 are hung - either hung trying to connect, or used by stalled requests and
 thus outside the socket pool's control.

 * There's a limit on the number of DNS resolutions that can be started at once.  If
 everything is stalled while resolving DNS addresses, you've probably hit this
 limit, and the DNS lookups are also misbehaving in some fashion.
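
The late binding and group limit items above interact, and a toy model can make that easier to picture.  The sketch below is not Chromium's socket pool: the `Group` class, its method names, and the convention that a larger number means higher priority are assumptions for this example.  Only the limit of 6 comes from the text above.

```python
import heapq
from typing import List, Optional, Tuple

MAX_SOCKETS_PER_GROUP = 6  # per-group limit described above


class Group:
    """Toy model of one socket pool group with late binding."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sockets_in_use = 0
        # Heap of (-priority, arrival order, request id): the highest
        # priority wins, and ties go to the earliest arrival.
        self._pending: List[Tuple[int, int, str]] = []
        self._arrivals = 0

    def request_socket(self, request_id: str, priority: int) -> None:
        """Queue a request; it is not tied to any particular connect yet."""
        heapq.heappush(self._pending, (-priority, self._arrivals, request_id))
        self._arrivals += 1

    def socket_available(self) -> Optional[str]:
        """A connect finished or a socket was freed: bind it now.

        Returns the request that gets the socket, or None if nothing is
        waiting or the group is already at its limit.
        """
        if not self._pending or self.sockets_in_use >= MAX_SOCKETS_PER_GROUP:
            return None
        _, _, request_id = heapq.heappop(self._pending)
        self.sockets_in_use += 1
        return request_id

    def release_socket(self) -> None:
        self.sockets_in_use -= 1

    def is_stalled(self) -> bool:
        """Requests are waiting and every allowed socket is in use."""
        return bool(self._pending) and self.sockets_in_use >= MAX_SOCKETS_PER_GROUP


group = Group("example.test:443")
group.request_socket("low", priority=1)
group.request_socket("high", priority=5)
first = group.socket_available()    # the later, higher-priority request
second = group.socket_available()

for i in range(5):
    group.request_socket(f"r{i}", priority=1)
handed_out = [group.socket_available() for _ in range(5)]
stalled_at_limit = group.is_stalled()

group.release_socket()              # one of the six sockets frees up
after_release = group.socket_available()
```

In this model the last request only makes progress once a socket is released.  If all 6 sockets are held by hung connects or stalled requests, nothing is ever released and the whole group stays stalled.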

 # Miscellany

 These are just miscellaneous things you may notice when looking through the
 logs.

 * URLRequests that look to start twice for no obvious reason.  These are
 typically main frame requests, and the first request is AppCache.  You can just
 ignore it and move on with your life.

 * Some HTTP requests are not handled by URLRequestHttpJobs.  These include
 things like HSTS redirects (URLRequestRedirectJob), AppCache, ServiceWorker,
 etc.  These generally don't log as much information, so it can be tricky to
 figure out what's going on with these.

 * Non-HTTP requests also appear in the log, and also generally don't log much
 (blob URLs, chrome URLs, etc).

 * Preconnects create an "HTTP_STREAM_JOB" source that may create multiple
 CONNECT_JOBs (or none) and is then destroyed.  These can be identified by the
 "SOCKET_POOL_CONNECTING_N_SOCKETS" events.