File Formats

Static Information Files

Static information files are generated by sampler-cc during program compilation. They describe instrumented code but contain no information about any individual run. Think of them as an extension of the symbolic debugging information generated during normal compilation.

Embedding and Extraction

File is a bit of a misnomer. These pieces of information are not initially placed in their own files. Rather, static information is embedded directly within each instrumented object file, shared library, or executable. Custom ELF sections hold each piece of static data.

Embedding ensures that the static information remains tightly associated with the code it describes even across renaming, linking, archive extraction, etc. However, it is often useful to extract this information into a standalone form for additional analysis. The /usr/local/lib/sampler/tools/extract-section tool may be used for this purpose. It is run as follows:

extract-section {section-name} { executable | shared-library | object ...}

section-name names the ELF section containing the desired data, including the leading period (for example, .debug_site_info). Remaining arguments are ELF executables, shared libraries, or object files. The named section is read from each of these and written to standard output in sequence.

Note

Although extract-section will happily copy data from any ELF section you name, it should only be used for extracting static sampler information. Normalization applied during extraction means that you are not seeing a byte-for-byte copy of the named section. Use objcopy for more general manipulation tasks.

When shipping precompiled binaries to large numbers of users, you should extract, save, and then remove the sampler's static data sections from your binaries. This data can be large, and the typical end user does not need to download, store, or use it. Removal is made easier by the fact that all of these extra ELF sections are marked as debugging sections: the standard strip will remove them along with all other debug information. Our RPM building tools do exactly this, and save the extracted static data files in auxiliary *-samplerinfo packages analogous to Red Hat's *-debuginfo packages. The *-samplerinfo packages are available for all to see, but typically only a developer would download and install them.

Site Information

The site information file lists the instrumentation sites added to each compilation unit. This is the main key used to decode dynamic feedback reports. When embedded within an instrumented binary, it is always found in the .debug_site_info ELF section. When extracted into a standalone file, it is conventionally stored with the extension .sites.

The format of this file is a hybrid of XML and tab-delimited columnar data. This is intended as a compromise between structure, efficiency of storage, and ease of processing.

Compilation Units and Schemes

At the top level, a static site information file consists of a sequence of sections marked by XML-style sites tags:

<sites unit="unit signature" scheme="scheme name">
…
</sites>
<sites unit="unit signature" scheme="scheme name">
…
</sites>
⋮
<sites unit="unit signature" scheme="scheme name">
…
</sites>

Note

Unlike true XML, there is no prologue and no single root tag. The first line is the <sites> start tag for the first compilation unit and the last line is the </sites> end tag for the last compilation unit.

A single <sites></sites> section describes the instrumentation sites for one compilation unit with one instrumentation scheme. Each sites start tag carries two attributes:

unit

a 128-bit identifying signature for this compilation unit expressed as 32 lower case hexadecimal digits

scheme

the name of an instrumentation scheme as given to sampler-cc's -fsampler-scheme flag

When data is collected during a run, it is also marked with these same two attributes. Thus this (unit signature, scheme name) pair serves to connect dynamic data with the static sites that collected it.

A complex application that links together several object files will contain several sites sections. If multiple instrumentation schemes were used within a single compilation unit, then multiple sites sections will appear with the same compilation unit signature but differing scheme names. It is even possible, though rare, for a single object file to be linked into an executable multiple times, in which case all of its sites sections will be duplicated as well. In all cases where multiple sites sections are present, their order is arbitrary. In particular, do not assume that the dynamic data in feedback reports appears in the same order as these static sites sections.
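
The top-level layout described above can be read with a few lines of Python. This is a minimal sketch rather than part of the sampler tools: the function name parse_sites is hypothetical, and the pattern assumes each start tag sits on its own line with the unit attribute before the scheme attribute, exactly as shown in the layout above.

```python
import re
from collections import defaultdict

# Assumed tag shape: <sites unit="32 hex digits" scheme="name">, one per line.
START_TAG = re.compile(r'<sites unit="([0-9a-f]{32})" scheme="([^"]+)">')


def parse_sites(text):
    """Group site lines by (unit signature, scheme name).

    Each key maps to a list of sections rather than a single section,
    because the same pair may appear more than once when an object file
    is linked in multiple times.
    """
    sections = defaultdict(list)
    current = None
    for line in text.splitlines():
        match = START_TAG.fullmatch(line.strip())
        if match:
            current = []
            sections[match.groups()].append(current)
        elif line.strip() == "</sites>":
            current = None
        elif current is not None:
            # Keep the raw tab-delimited line; field decoding is separate.
            current.append(line)
    return dict(sections)
```

Keying on the (unit signature, scheme name) pair, instead of relying on position, reflects the fact that the order of sites sections is arbitrary.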

Site Details

Within a single sites section, each line describes one instrumentation site for the given compilation unit and scheme. The order here is fixed and matches the order of counters appearing in the corresponding section of the dynamic feedback report from each run.

Details for each site are given as a sequence of tab-delimited fields. The initial fields are common to all instrumentation schemes:

  1. source file name

  2. source line number

  3. name of function containing site

  4. control flow graph number of site (unique within function)

Additional fields are specific to the instrumentation scheme that induced this site:

  • atoms

    1. the lvalue which may access shared, mutable memory

    2. one of:

      read

      access reads from the given location

      write

      access writes to the given location

  • bounds

    1. the left hand side of the instrumented assignment

    2. one of:

      local

      assignment is to a named local variable

      global

      assignment is to a named global variable

      mem

      assignment is to an indirectly addressed memory location

    3. one of:

      direct

      assignment is to a direct base location with no offset

      field

      assignment is to a named field within a structure

      index

      assignment is to an indexed element within an array

  • branches

    1. the predicate of the instrumented branch

  • float-kinds

    1. the left hand side of the instrumented assignment

    2. one of:

      local

      assignment is to a named local variable

      global

      assignment is to a named global variable

      mem

      assignment is to an indirectly addressed memory location

    3. one of:

      direct

      assignment is to a direct base location with no offset

      field

      assignment is to a named field within a structure

      index

      assignment is to an indexed element within an array

  • function-entries: no additional fields

  • g-object-unref

    1. the object argument in the instrumented call to g_object_unref

  • returns

    1. the callee in the instrumented function call

  • scalar-pairs

    1. the left hand side of the instrumented assignment

    2. one of:

      local

      assignment is to a named local variable

      global

      assignment is to a named global variable

      mem

      assignment is to an indirectly addressed memory location

    3. one of:

      direct

      assignment is to a direct base location with no offset

      field

      assignment is to a named field within a structure

      index

      assignment is to an indexed element within an array

    4. the right hand side of the instrumented assignment

    5. one of:

      local

      site is comparing the assigned value with a named local variable

      global

      site is comparing the assigned value with a named global variable

      const

      site is comparing the assigned value with a compile-time constant

      local-init

      site is comparing the assigned value with a named local variable whose value is definitely initialized at that site

      local-uninit

      site is comparing the assigned value with a named local variable whose value is possibly uninitialized at that site
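
The field lists above translate directly into a lookup table. The following Python sketch decodes one site line into named fields. The dictionary keys (lhs, storage, addressing, and so on) are labels invented here for illustration and do not appear in the file itself; the sketch also assumes that lines carry no trailing tab.

```python
# Names for the four fields shared by every scheme, in file order.
COMMON_FIELDS = ("file", "line", "function", "cfg_node")

# Illustrative names for the scheme-specific fields, in file order.
SCHEME_FIELDS = {
    "atoms": ("lvalue", "access"),
    "bounds": ("lhs", "storage", "addressing"),
    "branches": ("predicate",),
    "float-kinds": ("lhs", "storage", "addressing"),
    "function-entries": (),
    "g-object-unref": ("object",),
    "returns": ("callee",),
    "scalar-pairs": ("lhs", "storage", "addressing", "rhs", "rhs_kind"),
}


def decode_site(scheme, line):
    """Split one tab-delimited site line into a dict of named fields."""
    names = COMMON_FIELDS + SCHEME_FIELDS[scheme]
    values = line.rstrip("\n").split("\t")
    if len(values) != len(names):
        raise ValueError(
            f"{scheme}: expected {len(names)} fields, got {len(values)}"
        )
    site = dict(zip(names, values))
    site["line"] = int(site["line"])  # source line number
    return site
```

For example, decode_site("branches", "main.c\t10\tmain\t1\tx > 0") yields a dictionary whose "predicate" entry is "x > 0".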

Control Flow Graph

Not yet written. Will eventually document our representation of program control flow graphs as stored in the .debug_sampler_cfg ELF section or in .cfg standalone files.

Dynamic Feedback Reports

A dynamic feedback report consists of instrumentation data and possibly other debugging information collected from a single run of an instrumented program. When using high-level program launchers in conjunction with a report collection server, a dynamic report arrives at the server each time an instrumented application exits. When using low-level environment variables, a dynamic report is written into the selected file descriptor or file name as the program exits.

At the top level, a dynamic feedback report consists of a sequence of sections marked by XML-style report tags:

<report id="subreport name">
…
</report>
<report id="subreport name">
…
</report>
⋮
<report id="subreport name">
…
</report>

Note

Unlike true XML, there is no prologue and no single root tag. The first line is the <report> start tag for the first subreport and the last line is the </report> end tag for the last subreport.
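
Because a report is a flat sequence of report sections, it can be split without a real XML parser. The Python sketch below is one way to do so; split_report is a hypothetical helper, and it assumes that each report start and end tag occupies a line of its own.

```python
import re

# Assumed tag shape: <report id="name"> on a line by itself.
REPORT_START = re.compile(r'<report id="([^"]+)">')


def split_report(text):
    """Return (id, body lines) pairs in their order of appearance."""
    subreports = []
    body = None
    for line in text.splitlines():
        match = REPORT_START.fullmatch(line.strip())
        if match:
            body = []
            subreports.append((match.group(1), body))
        elif line.strip() == "</report>":
            body = None
        elif body is not None:
            # Nested tags such as <samples> are kept as ordinary body lines.
            body.append(line)
    return subreports
```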

The following subsections describe the subreports currently in use.

samples subreport

The first subreport always has id="samples". The samples subreport contains the final recorded values for all instrumentation sites. It is designed to be small and therefore easy to send to a central collection server. For this reason, it cannot be understood by itself. A samples subreport must be decoded using the static site information files generated when the application was built. Taken together, the samples subreport and the static site information files connect observed dynamic behaviors with static source features such as functions, files, and line numbers.

A samples subreport consists of a sequence of sections marked by XML-style samples tags:

<samples unit="unit signature" scheme="scheme name">
…
</samples>
<samples unit="unit signature" scheme="scheme name">
…
</samples>
⋮
<samples unit="unit signature" scheme="scheme name">
…
</samples>

Each samples section gives the final instrumentation data for one instrumentation scheme in one compilation unit. As noted earlier, sites sections in the static site information file and samples sections in a samples subreport are not guaranteed to appear in the same order. However, the mandatory unit and scheme attributes have the same meaning in both cases. For any given (unit signature, scheme name) pair, the corresponding section of the samples subreport gives the measured values for a run and the corresponding section of the static site information file relates that information to the application source code.

Within one samples section, each instrumentation site reports its measurements on one line, with multiple values delimited by tabs. See Instrumentation schemes for a description of each scheme's recorded data values. The order of lines within a samples section is fixed and corresponds, line by line, with the corresponding sites section of some static site information file. Usually the static site information is drawn from the main instrumented application, but it may also come from shared libraries or dynamically loaded plugins, each of which has its own static lists of instrumentation sites.

Dynamically loaded plugins are a special case, in that they may appear multiple times in a single samples subreport. If a plugin is loaded and unloaded multiple times while the application is running, each unload reports on all of that plugin's sites just before unloading. Each reload of the plugin resets all of the plugin's instrumentation site data to its initial values (e.g. 0 for counters), with no memory of the earlier load. When examining feedback reports from applications with instrumented plugins, it is up to you to merge these repeated samples sections appropriately. For counter-based schemes, the right thing to do is simply sum corresponding counters from multiple sections. For the bounds scheme, which is not counter-based, take the minimum of all corresponding minima and the maximum of all corresponding maxima.
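
The two merging rules can be sketched in Python as follows. Both functions are hypothetical helpers, and both take repeated samples sections that have already been parsed into integers. The bounds version further assumes that each site line holds exactly two values, the minimum followed by the maximum; check that layout against the description of the bounds scheme before relying on it.

```python
def merge_counters(sections):
    """Sum corresponding counters across repeated samples sections.

    sections is a list of sections; each section is a list of site rows;
    each row is a list of integer counters.
    """
    return [
        [sum(column) for column in zip(*rows)]
        for rows in zip(*sections)
    ]


def merge_bounds(sections):
    """Combine assumed (minimum, maximum) pairs across repeated sections."""
    return [
        (min(low for low, _ in rows), max(high for _, high in rows))
        for rows in zip(*sections)
    ]
```

Merging two loads of a plugin with counters [[1, 2], [0, 5]] and [[3, 4], [1, 0]] gives [[4, 6], [1, 5]].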

Aligning and merging multiple samples and sites sections can be tedious. The /usr/local/lib/sampler/tools/resolveSamples tool provides simple merging to support basic data analysis. It is run as follows:

resolveSamples { executable | shared-library | object | standalone site information file ...}

Standard input to resolveSamples should be a samples subreport, starting with the first <samples> start tag and ending after the last </samples> end tag. Arguments on the command line may be any mixture of extracted site information files or instrumented binary files with static site information still embedded within them. Output is a sequence of lines containing only tab-delimited columnar data, with no XML-like tags. Each instrumentation site appears on a single line with the following initial fields:

  1. file name from the resolveSamples command line in which this site was found

  2. signature of the compilation unit in which this site was found

  3. name of the instrumentation scheme that induced this site

These initial fields are followed by:

  • all static information fields for this site

  • all dynamic values reported for this site

The flat, uniform structure of a fully resolved samples report can be convenient for basic data analysis on small numbers of runs. However, the size and redundancy of the static information fields make this approach undesirable when processing hundreds or thousands of feedback reports.
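
The output layout described above amounts to a line-by-line join of one sites section with the matching samples section. The following Python sketch illustrates that join only; it is not the implementation of resolveSamples, and the function name and parameters are assumptions made for the example.

```python
def resolve(source_name, unit, scheme, site_lines, sample_lines):
    """Join one sites section with its samples section, line by line.

    source_name is the file in which the sites section was found.
    Both line lists hold raw tab-delimited text without tags.
    """
    if len(site_lines) != len(sample_lines):
        raise ValueError("sites and samples sections differ in length")
    return [
        "\t".join([source_name, unit, scheme, site, sample])
        for site, sample in zip(site_lines, sample_lines)
    ]
```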

Not yet written. Will eventually document timestamps sections, which also appear in the samples subreport when site time stamping is enabled at instrumentation time, and the /usr/local/lib/sampler/tools/resolveTimestamps tool.

main-backtrace subreport

In the event of a crash, the dynamic feedback report contains an additional subreport describing the execution stack of the running thread at the point of failure. Output is as generated by the backtrace_symbols function from the GNU C library. For example:

ccrypt[0x804ff1c]
/lib/tls/libc.so.6[0xaa78c8]
ccrypt[0x804c3df]
ccrypt[0x804c80e]
ccrypt[0x804a34b]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0xa94e23]
ccrypt[0x8049171]

Using debug information recorded when the application was built, these raw addresses can be further resolved to function names and line numbers, as one would expect to see in a debugger:

return     module              function        file        line
0x804ff1c  ccrypt              handleSignal    report.c    87
0xaa78c8   /lib/tls/libc.so.6  ??              ??          0
0x804c3df  ccrypt              traverse_file   traverse.c  451
0x804c80e  ccrypt              traverse_files  traverse.c  485
0x804a34b  ccrypt              main            main.c      516
0xa94e23   /lib/tls/libc.so.6  ??              ??          0
0x8049171  ccrypt              _start          ??          0

Note that this only reveals code locations. Values of local variables or other program data are not reported.
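
Before any symbol lookup, the raw backtrace lines must be split into module, optional symbol, and return address. The Python sketch below handles the two line shapes that appear in the raw example earlier in this section; parse_frame is a hypothetical helper, and other shapes that backtrace_symbols may produce are not covered.

```python
import re

# module, then optional "(symbol+offset)", then "[address]".
FRAME = re.compile(
    r"(?P<module>[^(\[]+?)"
    r"(?:\((?P<symbol>[^)]*)\))?"
    r"\s*\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def parse_frame(line):
    """Split one backtrace line into module, function, offset, address."""
    match = FRAME.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized backtrace line: {line!r}")
    function, _, offset = (match["symbol"] or "").partition("+")
    return {
        "module": match["module"],
        "function": function or None,
        "offset": int(offset, 16) if offset else None,
        "address": int(match["address"], 16),
    }
```

The returned address is an integer, so it can be compared or formatted freely before being handed to whatever tool maps addresses to source lines.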

Subreport Extraction

It is often useful to extract a single subreport from a dynamic feedback report. For example, the resolveSamples tool expects to see just a samples subreport, not an entire feedback report. The /usr/local/lib/sampler/tools/extract-report tool may be used for this purpose. It is run as follows:

extract-report {report id}

Standard input to extract-report should be a raw dynamic feedback report. It prints the subreport with the requested id on standard output and discards the rest. For example, one might use this in conjunction with resolveSamples as follows:

extract-report samples <raw-report.log | resolveSamples myapp myplugin.so libmylib.so