| <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> |
| <HTML> |
| |
| <HEAD> |
| <link rel="stylesheet" href="designstyle.css"> |
| <title>Gperftools CPU Profiler</title> |
| </HEAD> |
| |
| <BODY> |
| |
| <p align=right> |
| <i>Last modified |
| <script type=text/javascript> |
| var lm = new Date(document.lastModified); |
| document.write(lm.toDateString()); |
| </script></i> |
| </p> |
| |
| <p>This is the CPU profiler we use at Google. There are three parts |
| to using it: linking the library into an application, running the |
| code, and analyzing the output.</p> |
| |
| <p>On the off-chance that you should need to understand it, the CPU |
| profiler data file format is documented separately, |
| <a href="cpuprofile-fileformat.html">here</a>. |
| |
| |
| <H1>Linking in the Library</H1> |
| |
| <p>To install the CPU profiler into your executable, add |
| <code>-lprofiler</code> to the link-time step for your executable. |
| (It's also probably possible to add in the profiler at run-time using |
| <code>LD_PRELOAD</code>, e.g. |
| <code>% env LD_PRELOAD="/usr/lib/libprofiler.so" <binary></code>, |
| but this isn't necessarily recommended.)</p> |
| |
| <p>This does <i>not</i> turn on CPU profiling; it just inserts the |
| code. For that reason, it's practical to just always link |
| <code>-lprofiler</code> into a binary while developing; that's what we |
| do at Google. (However, since any user can turn on the profiler by |
| setting an environment variable, it's not necessarily recommended to |
| install profiler-linked binaries into a production, running |
| system.)</p> |
| |
| |
| <H1>Running the Code</H1> |
| |
| <p>There are several alternatives to actually turn on CPU profiling |
| for a given run of an executable:</p> |
| |
| <ol> |
| <li> <p>Define the environment variable CPUPROFILE to the filename |
| to dump the profile to. For instance, if you had a version of |
| <code>/bin/ls</code> that had been linked against libprofiler, |
| you could run:</p> |
| <pre>% env CPUPROFILE=ls.prof /bin/ls</pre> |
| </li> |
| <li> <p>In addition to defining the environment variable CPUPROFILE |
| you can also define CPUPROFILESIGNAL. This allows profiling to be |
| controlled via the signal number that you specify. The signal number |
| must be unused by the program under normal operation. Internally it |
| acts as a switch, triggered by the signal, which is off by default. |
| For instance, if you had a copy of <code>/bin/chrome</code> that had been |
| been linked against libprofiler, you could run:</p> |
| <pre>% env CPUPROFILE=chrome.prof CPUPROFILESIGNAL=12 /bin/chrome &</pre> |
| <p>You can then trigger profiling to start:</p> |
| <pre>% killall -12 chrome</pre> |
| <p>Then after a period of time you can tell it to stop which will |
| generate the profile:</p> |
| <pre>% killall -12 chrome</pre> |
| </li> |
| <li> <p>In your code, bracket the code you want profiled in calls to |
| <code>ProfilerStart()</code> and <code>ProfilerStop()</code>. |
| (These functions are declared in <code><gperftools/profiler.h></code>.) |
| <code>ProfilerStart()</code> will take |
| the profile-filename as an argument.</p> |
| </li> |
| </ol> |
| |
| <p>In Linux 2.6 and above, profiling works correctly with threads, |
| automatically profiling all threads. In Linux 2.4, profiling only |
| profiles the main thread (due to a kernel bug involving itimers and |
| threads). Profiling works correctly with sub-processes: each child |
| process gets its own profile with its own name (generated by combining |
| CPUPROFILE with the child's process id).</p> |
| |
| <p>For security reasons, CPU profiling will not write to a file -- and |
| is thus not usable -- for setuid programs.</p> |
| |
| <p>See the include-file <code>gperftools/profiler.h</code> for |
| advanced-use functions, including <code>ProfilerFlush()</code> and |
| <code>ProfilerStartWithOptions()</code>.</p> |
| |
| |
| <H2>Modifying Runtime Behavior</H2> |
| |
| <p>You can more finely control the behavior of the CPU profiler via |
| environment variables.</p> |
| |
| <table frame=box rules=sides cellpadding=5 width=100%> |
| |
| <tr valign=top> |
| <td><code>CPUPROFILE_FREQUENCY=<i>x</i></code></td> |
| <td>default: 100</td> |
| <td> |
| How many interrupts/second the cpu-profiler samples. |
| </td> |
| </tr> |
| |
| <tr valign=top> |
| <td><code>CPUPROFILE_REALTIME=1</code></td> |
| <td>default: [not set]</td> |
| <td> |
| If set to any value (including 0 or the empty string), use |
| ITIMER_REAL instead of ITIMER_PROF to gather profiles. In |
| general, ITIMER_REAL is not as accurate as ITIMER_PROF, and also |
| interacts badly with use of alarm(), so prefer ITIMER_PROF unless |
| you have a reason prefer ITIMER_REAL. |
| </td> |
| </tr> |
| |
| </table> |
| |
| |
| <h1><a name="pprof">Analyzing the Output</a></h1> |
| |
| <p><code>pprof</code> is the script used to analyze a profile. It has |
| many output modes, both textual and graphical. Some give just raw |
| numbers, much like the <code>-pg</code> output of <code>gcc</code>, |
| and others show the data in the form of a dependency graph.</p> |
| |
| <p>pprof <b>requires</b> <code>perl5</code> to be installed to run. |
| It also requires <code>dot</code> to be installed for any of the |
| graphical output routines, and <code>gv</code> to be installed for |
| <code>--gv</code> mode (described below). |
| </p> |
| |
| <p>Here are some ways to call pprof. These are described in more |
| detail below.</p> |
| |
| <pre> |
| % pprof /bin/ls ls.prof |
| Enters "interactive" mode |
| % pprof --text /bin/ls ls.prof |
| Outputs one line per procedure |
| % pprof --gv /bin/ls ls.prof |
| Displays annotated call-graph via 'gv' |
| % pprof --gv --focus=Mutex /bin/ls ls.prof |
| Restricts to code paths including a .*Mutex.* entry |
| % pprof --gv --focus=Mutex --ignore=string /bin/ls ls.prof |
| Code paths including Mutex but not string |
| % pprof --list=getdir /bin/ls ls.prof |
| (Per-line) annotated source listing for getdir() |
| % pprof --disasm=getdir /bin/ls ls.prof |
| (Per-PC) annotated disassembly for getdir() |
| % pprof --text localhost:1234 |
| Outputs one line per procedure for localhost:1234 |
| % pprof --callgrind /bin/ls ls.prof |
| Outputs the call information in callgrind format |
| </pre> |
| |
| |
| <h3>Analyzing Text Output</h3> |
| |
| <p>Text mode has lines of output that look like this:</p> |
| <pre> |
| 14 2.1% 17.2% 58 8.7% std::_Rb_tree::find |
| </pre> |
| |
| <p>Here is how to interpret the columns:</p> |
| <ol> |
| <li> Number of profiling samples in this function |
| <li> Percentage of profiling samples in this function |
| <li> Percentage of profiling samples in the functions printed so far |
| <li> Number of profiling samples in this function and its callees |
| <li> Percentage of profiling samples in this function and its callees |
| <li> Function name |
| </ol> |
| |
| <h3>Analyzing Callgrind Output</h3> |
| |
| <p>Use <a href="http://kcachegrind.sourceforge.net">kcachegrind</a> to |
| analyze your callgrind output:</p> |
| <pre> |
| % pprof --callgrind /bin/ls ls.prof > ls.callgrind |
| % kcachegrind ls.callgrind |
| </pre> |
| |
| <p>The cost is specified in 'hits', i.e. how many times a function |
| appears in the recorded call stack information. The 'calls' from |
| function a to b record how many times function b was found in the |
| stack traces directly below function a.</p> |
| |
| <p>Tip: if you use a debug build the output will include file and line |
| number information and kcachegrind will show an annotated source |
| code view.</p> |
| |
| <h3>Node Information</h3> |
| |
| <p>In the various graphical modes of pprof, the output is a call graph |
| annotated with timing information, like so:</p> |
| |
| <A HREF="pprof-test-big.gif"> |
| <center><table><tr><td> |
| <img src="pprof-test.gif"> |
| </td></tr></table></center> |
| </A> |
| |
| <p>Each node represents a procedure. The directed edges indicate |
| caller to callee relations. Each node is formatted as follows:</p> |
| |
| <center><pre> |
| Class Name |
| Method Name |
| local (percentage) |
| <b>of</b> cumulative (percentage) |
| </pre></center> |
| |
| <p>The last one or two lines contains the timing information. (The |
| profiling is done via a sampling method, where by default we take 100 |
| samples a second. Therefor one unit of time in the output corresponds |
| to about 10 milliseconds of execution time.) The "local" time is the |
| time spent executing the instructions directly contained in the |
| procedure (and in any other procedures that were inlined into the |
| procedure). The "cumulative" time is the sum of the "local" time and |
| the time spent in any callees. If the cumulative time is the same as |
| the local time, it is not printed.</p> |
| |
| <p>For instance, the timing information for test_main_thread() |
| indicates that 155 units (about 1.55 seconds) were spent executing the |
| code in <code>test_main_thread()</code> and 200 units were spent while |
| executing <code>test_main_thread()</code> and its callees such as |
| <code>snprintf()</code>.</p> |
| |
| <p>The size of the node is proportional to the local count. The |
| percentage displayed in the node corresponds to the count divided by |
| the total run time of the program (that is, the cumulative count for |
| <code>main()</code>).</p> |
| |
| <h3>Edge Information</h3> |
| |
| <p>An edge from one node to another indicates a caller to callee |
| relationship. Each edge is labelled with the time spent by the callee |
| on behalf of the caller. E.g, the edge from |
| <code>test_main_thread()</code> to <code>snprintf()</code> indicates |
| that of the 200 samples in <code>test_main_thread()</code>, 37 are |
| because of calls to <code>snprintf()</code>.</p> |
| |
| <p>Note that <code>test_main_thread()</code> has an edge to |
| <code>vsnprintf()</code>, even though <code>test_main_thread()</code> |
| doesn't call that function directly. This is because the code was |
| compiled with <code>-O2</code>; the profile reflects the optimized |
| control flow.</p> |
| |
| <h3>Meta Information</h3> |
| |
| <p>The top of the display should contain some meta information |
| like:</p> |
| <pre> |
| /tmp/profiler2_unittest |
| Total samples: 202 |
| Focusing on: 202 |
| Dropped nodes with <= 1 abs(samples) |
| Dropped edges with <= 0 samples |
| </pre> |
| |
| <p>This section contains the name of the program, and the total |
| samples collected during the profiling run. If the |
| <code>--focus</code> option is on (see the <a href="#focus">Focus</a> |
| section below), the legend also contains the number of samples being |
| shown in the focused display. Furthermore, some unimportant nodes and |
| edges are dropped to reduce clutter. The characteristics of the |
| dropped nodes and edges are also displayed in the legend.</p> |
| |
| <h3><a name=focus>Focus and Ignore</a></h3> |
| |
| <p>You can ask pprof to generate a display focused on a particular |
| piece of the program. You specify a regular expression. Any portion |
| of the call-graph that is on a path which contains at least one node |
| matching the regular expression is preserved. The rest of the |
| call-graph is dropped on the floor. For example, you can focus on the |
| <code>vsnprintf()</code> libc call in <code>profiler2_unittest</code> |
| as follows:</p> |
| |
| <pre> |
| % pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof |
| </pre> |
| <A HREF="pprof-vsnprintf-big.gif"> |
| <center><table><tr><td> |
| <img src="pprof-vsnprintf.gif"> |
| </td></tr></table></center> |
| </A> |
| |
| <p>Similarly, you can supply the <code>--ignore</code> option to |
| ignore samples that match a specified regular expression. E.g., if |
| you are interested in everything except calls to |
| <code>snprintf()</code>, you can say:</p> |
| <pre> |
| % pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof |
| </pre> |
| |
| |
| <h3>Interactive mode</a></h3> |
| |
| <p>By default -- if you don't specify any flags to the contrary -- |
| pprof runs in interactive mode. At the <code>(pprof)</code> prompt, |
| you can run many of the commands described above. You can type |
| <code>help</code> for a list of what commands are available in |
| interactive mode.</p> |
| |
| <h3><a name=options>pprof Options</a></h3> |
| |
| For a complete list of pprof options, you can run <code>pprof |
| --help</code>. |
| |
| <h4>Output Type</h4> |
| |
| <p> |
| <center> |
| <table frame=box rules=sides cellpadding=5 width=100%> |
| <tr valign=top> |
| <td><code>--text</code></td> |
| <td> |
| Produces a textual listing. (Note: If you have an X display, and |
| <code>dot</code> and <code>gv</code> installed, you will probably |
| be happier with the <code>--gv</code> output.) |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--gv</code></td> |
| <td> |
| Generates annotated call-graph, converts to postscript, and |
| displays via gv (requres <code>dot</code> and <code>gv</code> be |
| installed). |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--dot</code></td> |
| <td> |
| Generates the annotated call-graph in dot format and |
| emits to stdout (requres <code>dot</code> be installed). |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--ps</code></td> |
| <td> |
| Generates the annotated call-graph in Postscript format and |
| emits to stdout (requres <code>dot</code> be installed). |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--pdf</code></td> |
| <td> |
| Generates the annotated call-graph in PDF format and emits to |
| stdout (requires <code>dot</code> and <code>ps2pdf</code> be |
| installed). |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--gif</code></td> |
| <td> |
| Generates the annotated call-graph in GIF format and |
| emits to stdout (requres <code>dot</code> be installed). |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--list=<<i>regexp</i>></code></td> |
| <td> |
| <p>Outputs source-code listing of routines whose |
| name matches <regexp>. Each line |
| in the listing is annotated with flat and cumulative |
| sample counts.</p> |
| |
| <p>In the presence of inlined calls, the samples |
| associated with inlined code tend to get assigned |
| to a line that follows the location of the |
| inlined call. A more precise accounting can be |
| obtained by disassembling the routine using the |
| --disasm flag.</p> |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--disasm=<<i>regexp</i>></code></td> |
| <td> |
| Generates disassembly of routines that match |
| <regexp>, annotated with flat and |
| cumulative sample counts and emits to stdout. |
| </td> |
| </tr> |
| </table> |
| </center> |
| |
| <h4>Reporting Granularity</h4> |
| |
| <p>By default, pprof produces one entry per procedure. However you can |
| use one of the following options to change the granularity of the |
| output. The <code>--files</code> option seems to be particularly |
| useless, and may be removed eventually.</p> |
| |
| <center> |
| <table frame=box rules=sides cellpadding=5 width=100%> |
| <tr valign=top> |
| <td><code>--addresses</code></td> |
| <td> |
| Produce one node per program address. |
| </td> |
| </tr> |
| <td><code>--lines</code></td> |
| <td> |
| Produce one node per source line. |
| </td> |
| </tr> |
| <td><code>--functions</code></td> |
| <td> |
| Produce one node per function (this is the default). |
| </td> |
| </tr> |
| <td><code>--files</code></td> |
| <td> |
| Produce one node per source file. |
| </td> |
| </tr> |
| </table> |
| </center> |
| |
| <h4>Controlling the Call Graph Display</h4> |
| |
| <p>Some nodes and edges are dropped to reduce clutter in the output |
| display. The following options control this effect:</p> |
| |
| <center> |
| <table frame=box rules=sides cellpadding=5 width=100%> |
| <tr valign=top> |
| <td><code>--nodecount=<n></code></td> |
| <td> |
| This option controls the number of displayed nodes. The nodes |
| are first sorted by decreasing cumulative count, and then only |
| the top N nodes are kept. The default value is 80. |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--nodefraction=<f></code></td> |
| <td> |
| This option provides another mechanism for discarding nodes |
| from the display. If the cumulative count for a node is |
| less than this option's value multiplied by the total count |
| for the profile, the node is dropped. The default value |
| is 0.005; i.e. nodes that account for less than |
| half a percent of the total time are dropped. A node |
| is dropped if either this condition is satisfied, or the |
| --nodecount condition is satisfied. |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--edgefraction=<f></code></td> |
| <td> |
| This option controls the number of displayed edges. First of all, |
| an edge is dropped if either its source or destination node is |
| dropped. Otherwise, the edge is dropped if the sample |
| count along the edge is less than this option's value multiplied |
| by the total count for the profile. The default value is |
| 0.001; i.e., edges that account for less than |
| 0.1% of the total time are dropped. |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--focus=<re></code></td> |
| <td> |
| This option controls what region of the graph is displayed |
| based on the regular expression supplied with the option. |
| For any path in the callgraph, we check all nodes in the path |
| against the supplied regular expression. If none of the nodes |
| match, the path is dropped from the output. |
| </td> |
| </tr> |
| <tr valign=top> |
| <td><code>--ignore=<re></code></td> |
| <td> |
| This option controls what region of the graph is displayed |
| based on the regular expression supplied with the option. |
| For any path in the callgraph, we check all nodes in the path |
| against the supplied regular expression. If any of the nodes |
| match, the path is dropped from the output. |
| </td> |
| </tr> |
| </table> |
| </center> |
| |
| <p>The dropped edges and nodes account for some count mismatches in |
| the display. For example, the cumulative count for |
| <code>snprintf()</code> in the first diagram above was 41. However |
| the local count (1) and the count along the outgoing edges (12+1+20+6) |
| add up to only 40.</p> |
| |
| |
| <h1>Caveats</h1> |
| |
| <ul> |
| <li> If the program exits because of a signal, the generated profile |
| will be <font color=red>incomplete, and may perhaps be |
| completely empty</font>. |
| <li> The displayed graph may have disconnected regions because |
| of the edge-dropping heuristics described above. |
| <li> If the program linked in a library that was not compiled |
| with enough symbolic information, all samples associated |
| with the library may be charged to the last symbol found |
| in the program before the library. This will artificially |
| inflate the count for that symbol. |
| <li> If you run the program on one machine, and profile it on |
| another, and the shared libraries are different on the two |
| machines, the profiling output may be confusing: samples that |
| fall within shared libaries may be assigned to arbitrary |
| procedures. |
| <li> If your program forks, the children will also be profiled |
| (since they inherit the same CPUPROFILE setting). Each process |
| is profiled separately; to distinguish the child profiles from |
| the parent profile and from each other, all children will have |
| their process-id appended to the CPUPROFILE name. |
| <li> Due to a hack we make to work around a possible gcc bug, your |
| profiles may end up named strangely if the first character of |
| your CPUPROFILE variable has ascii value greater than 127. |
| This should be exceedingly rare, but if you need to use such a |
| name, just set prepend <code>./</code> to your filename: |
| <code>CPUPROFILE=./Ägypten</code>. |
| </ul> |
| |
| |
| <hr> |
| <address>Sanjay Ghemawat<br> |
| <!-- Created: Tue Dec 19 10:43:14 PST 2000 --> |
| <!-- hhmts start --> |
| Last modified: Fri May 9 14:41:29 PDT 2008 |
| <!-- hhmts end --> |
| </address> |
| </BODY> |
| </HTML> |