Linux Kernel Tracing

Ftrace is a tracing mechanism built into the kernel which gives the ability to developers to trace specific events by inserting appropriate tracepoints into the appropriate code sites.

Except from the trace event feature, the kernel can be configured to provide tracing support to dynamically trace function calls and various latencies helping the developers to profile and analyze the performance of their code.

This post will focus on how to implement trace events and tracepoints and manage the trace event output.

The trace data generated by trace events are stored in a ring buffer. When a trace event is implemented, a function is defined and associated to this trace event. This function is called probe and its main role is to register the event data into the ring buffer. Then, the user can output the trace data registered in the ring buffer via the debug virtual filesystem.

When a tracepoint of a given trace event is encountered in the code and its trace event is enabled, the probe function is called at the tracepoint’s callsite. If the trace event is disabled, its function is simply not called. The above result is achieved because the probe function is not called directly but it is called via a wrapper function, with the same prototype, which internally tests whether the trace event is enabled before calling the probe. So, in effect, the wrapper function is the function that will be called at the tracepoint callsite.

A simplification of the wrapper function is shown below.

void trace_##name(proto)
	if (trace_event_##name_is_enabled)

It can be observed that the name of the tracepoint consists of the name of the trace event prepended with the prefix ‘trace_’. Hence, the name of the tracepoint function of an event called, for example, ‘cmd_status’ will be ‘trace_cmd_status’.
‘proto’ corresponds to the function parameters and ‘args’ are the arguments with which will be called the probe.

So, there are two basic restrictions when defining the probe. First, the probe function and the tracepoint function must share the same prototype, void(*)(proto) and second the variables used as probe’s args must be defined in proto’s parameters.

There are special macros used to define a trace event. Follows an example of the cmd_status trace event.

	    TP_PROTO(u32 type, u32 status),
	    TP_ARGS(type, status),
		    __field(u32, type)
		    __field(u32, status)
		    __entry->type = type;
		    __entry->status = status;
	    TP_printk("cmd = %d, status = %d",

A more simplified version can be:

TRACE_EVENT(name, proto, args, struct entry, assign, print)
name:         the name of the trace event
proto:        the prototype of the tracepoint and probe funtions'
args:         the arguments that will be passed when probe is called
struct entry: the definition of the fields of the struct that will
              be recorded into the ring buffer. This struct will
              contain the data we want to trace
assign:       the functions that will be called to assign values to
              the fields
print:        the format and arguments that will be used to print
              the data stored in the ring buffer entry

The above trace event example will produce something similar to the following, at least as far as i have understood so far:

void trace_cmd_status(u32 type, u32 status)
    if (trace_event_cmd_status_is_enabled)
         ((void(*)(u32 type, u32 status))(probe)(type, status);

struct trace_entry_cmd_status
    u32	type;
    u32	status;

void probe(u32 type, u32 status)
    struct trace_entry_cmd_status *__entry;

    __entry->type = type;
    __entry->status = status;

    register __entry into the ring buffer for event cmd_status;

void ftrace_printk_cmd_status()
    struct trace_entry_cmd_status *__entry;

    __entry = read entry from the ring buffer for event cmd_status;

    printk("cmd = %d, status = %d", __entry->type, __entry->status);

So, to trace an event in the code, it is required to define the probe function and output format, and to specify the desired callsites by inserting tracepoints. To add a tracepoint for the cmd_status event, simply call trace_cmd_status() with the appropriate arguments.

The trace events must be defined in a header file which is called trace header. At the beginning of the trace header, which for example is named xhci-trace.h add the following lines:

#define TRACE_SYSTEM xhci-hcd

#if !defined(__XHCI_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define __XHCI_TRACE_H

#include  <linux/tracepoint.h>

The first pair of lines define the name for the trace system which is essentially a means to group all the trace events defined for the xhci-hcd.

The second pair of lines define a check where the second condition, defined(TRACE_HEADER_MULTI_READ), permits the trace header to be included multiple times. That is essential because the macros used to define the trace events must be processed multiple times.

At the end of xhci-trace.h, add:

#endif /* __XHCI_TRACE_H */

/* this part must be outside the header guard */


#define TRACE_INCLUDE_FILE xhci-trace

#include <trace/define_trace.h>

The first pair of lines, outside protection, prompt the trace/define_trace.h to search in the current directory for the xhci-trace.h header and not in the default, include/trace/events.

The second pair of lines tell the trace/define_trace.h the name of the trace header, which is xhci-trace.h of course. The .h will be appended to the TRACE_INCLUDE_FILE automatically.

Except from the trace header, there is need of another file to keep the tracing code self contained. The second file called, for example, xhci-trace.c contains the following lines:

#include "xhci-trace.h"

The first line will prompt the trace/define_trace.h to create the necessary ftrace structures.
The definition of CREATE_TRACE_POINTS must be included in only one file (that is the reason of this file’s existence) and must be followed by the trace header.

Finally, it is necessary to add in the search path to the path to the trace header. That can be done by adding to the Makefile the following line:

CFLAGS_xhci-trace.o := -I$(src)

The code that implements the trace events, as it is already mentioned, is placed in the trace header.

  • The control of ftrace is performed via the debugfs and the tracing directory. The tracing management can be performed either directly via read/write operations on the tracing files, or indirectly with the help of trace-cmd tool. Also, the trace-cmd tool can be installed with graphical interface, called KernelShark.

    To mount the debugfs, do:

    $ mount -t debugfs debugfs /sys/kernel/debug/

    Then, enter the tracing directory:

    $ cd  /sys/kernel/debug/tracing/

    The contents of the tracing directory may vary depending on the tracing config options set when the kernel was compiled. Here, follows a list of the contents that appear in my tracing directory with a brief description of their use.

    	Contains all the tracers supported by the current kernel
    	For example, the built-in tracers for my kernel are
            blk and nop.
    	The nop tracer simply displays the output of ftrace_printk()
    	The block tracer is a special-purpose tracer that generates
            traces related to the I/O traffic of a given block device
    	Other tracers can be also configured, for tracing:
    	- all function calls: function, function_graph
    	- the time passed with interrupts or preemption disabled:
    	  irqsoff, preemptoff, preemtirqsoff
    	- the process wakeup latencies: wakeup, wakeup_rt
    	- the stack: stack tracer
    	- the context switches: sched_switch
    	Don't forget CONFIG_DYNAMIC_FTRACE !!
    	Contains all the trace events supported by the current
            kernel configuration.
            Each entry in this file has the following format:
    	<trace system name>:<trace event name>
    	Contains the size in KB of the CPU trace ring buffer.
            If the size of trace output exceeds the buffer size,
            the contents of the 'trace' file will be overwritten.
    	Contains the accumulated size in KB of all CPU ring
    	Shows the tracer that is currently enabled. To select
            another tracer, choose the desirable tracer from available
            tracers and write its name into this file, for example:
    	$ echo blk > current_tracer
    	To enable just the event tracing feature, choose the
            nop tracer.
    	Classifies events per trace system. Each trace system
            subdirectory contains a directory per trace event and
            the files 'enable' and 'filter'.
    	The file 'enable' can be used to enable or disable all
            the events of the corresponding trace system, by writing
            1 or 0 respectively.
    	For example, to enable all trace events of trace system, do:
    	$ echo 1 > events/<trace system name>/enable
    	To filter the trace events with respect to the contents of
            their fields, add the appropriate filter expression in the
            file 'filter'.
            Filter expressions have the following format:
    	  <field> <relational op> <value>
    	More complex expressions can be created adding logical
            operators. To disable the filter, echo 0.
    	Each trace event subdirectory contains four files:
    	enable, filter, format and id.
    	To enable a specific event of a trace system, do:
    	$ echo 1 > events/<trace system name>/<trace event name>/enable
    	The id and the fields of a trace event appear in files
            'id' and 'format'.
    	Writing 1 into this file will free the trace ring buffer.
            If the option disable_on_free is set, then writing 1 into
            this file will also disable tracing.
    	Contains the available configuration options for the trace
            output. To enable or disable a specific tracing option,
            write 1 or 0 respectively to the corresponding option file.
    	Groups the trace output files (trace, trace_pipe and
            trace_pipe_rawper) per CPU and provides per CPU stats.
    	Contains the process names and ids running producing the
            registered events.
    	Contains the currently enabled trace events and can be
            used to enable or disable specific tracing events from
            the available events.
    	To enable for example the timer:timer_init event, do:
    	$ echo timer:timer_init >> set_event
    	The contents of this file can be managed easily using
            regular expressions.
            For example, to disable the timer:timer_init event, do:
    	$ echo '!timer:timer_init' >> set_event
    	To enable all the events of timer trace system, do:
    	$ echo 'timer:*' >> set_event
    	And to disable them, do:
    	$ echo '!timer:*' >> set_event
    	Contains the trace output produced by the currently enabled
            tracer. To display its content, do:
    	$ cat trace
    	The principal fields of each trace entry are:
    	TASK-PID:  it is composed by the process name and id.
    	CPU:       it is the CPU on which the process was running.
    	TIMESTAMP: it represents the time passed from the
                       initializatioN of tracing and its units
                       depend on the trace_clock mode set (i think).
    	FUNCTION:  it is the trace event name.
    	The trace is the last element and shows up at the end of
            each entry.
    	Shows the clock used to take the timestamp values of traces.
    	Is used to introduce markers in the trace output.
    	Lists all the available tracing options for configuring
            the trace output. The disabled options have the prefix 'no'.
    	So, to enable, for example the 'raw' tracing option, do:
    	$ echo raw > tracing_options
    	And, to disable it again, do:
    	$ echo noraw > tracing_options
    	Contains the trace output and is used to pipe the current
            trace output into a command. Once its content is consumed,
            it will block until new tracing output is generated.
    	Shows a mask representing on which CPUs tracing is currently
            enabled. To enable tracing on specific CPUs, write the
            appropriate mask into this file.
    	Shows whether the tracing mechanism is currently enabled and
            can be used to enable or disable tracing by writing 1 or 0,
    	Is used in function duration tracing to set and display the
            duration filter value. Function duration tracing traces the
            function entries and exits to evaluate function's duration.
            The tracing threshold is used for filtering the function
            duration trace by duration.
    	The threshold value is in microseconds and a zero value
            means that no duration threshold is applied in trace output.

    In order to enable early boot tracing for built-in modules,
    use boot option:


    where, event-list is a comma separated list of events. The events shall have the format:

    Posted in Uncategorized | Leave a comment

    Diving into the linux kernel

    In 7 days from now, my internship in the Linux USB3.0 controller driver will
    start and I still don’t know what to say. That just leaves me speechless.

    Not fingerless.

    Circumstances vortex, that used to send me to keep company to the deep-sea
    vampire squids, this summer casts me out in antarctica to meet the “penguins”.

    Well … a hot summer spent among penguins, iceweasels and icedoves will be
    after all an exciting “cool” summer.

    the brainless

    Posted in Uncategorized | Leave a comment