How to send patches with git-send-email

This tutorial assumes that you have already made some changes to your local kernel tree and that these changes have been committed.
It describes the steps to follow in order to create and send a patch series using git-send-email.

First, you need to determine which of your commits you want to send, so do:

$ git log --pretty=oneline --abbrev-commit

The output, in my case, looks like:

db868ad xhci: remove conversion from generic to pci device in xhci_mem.c
c010f0c xhci: remove unnecessary check in xhci_free_stream_info()
a166493 xhci: fix SCT_FOR_CTX(p) macro
56e4cd3 xhci: replace USB_MAXINTERFACES with config->desc.bNumInterface

Let's assume that I want to send the last 3 commits, i.e. db868ad, c010f0c and a166493. The first thing I need to do is to create patches for these commits and store them in a local directory, e.g. ~/patches/

Patches sent with git-send-email should have been generated with git-format-patch. Patches in other formats may fail to be processed by git-send-email. So, to create the patches I do:

$ git format-patch -o ~/patches/ -3 HEAD

HEAD may be omitted since it is implied by default when you do not state the starting commit-id. If you want, for example, to create patches for the last 3 commits starting from commit c010f0c i.e. c010f0c, a166493 and 56e4cd3, you need to alter the above command in the following way:

$ git format-patch -o ~/patches/ -3 c010f0c

If you observe the patches created in ~/patches/, you will notice that the patch subjects got prefixed with [PATCH n/m]. If you intend to send a patch as RFC, you can alter the subject prefixes into [RFC n/m] by doing:

$ git format-patch --subject-prefix="RFC" -o ~/patches/ -3

Now you have created your patches and you are ready to send them. Note that if you do not want to keep a copy of your patches in a local directory, the above patch creation steps can be omitted and you can use git-send-email to create and send patches for your commits directly. How to do that is described below.

The next step is to tell git-send-email which SMTP server to use for sending your patches and to specify its parameters, e.g. encryption protocol, port etc. If you have a Gmail account, you can use the following commands. Otherwise, you need to alter them accordingly.

$ git config --global sendemail.smtpuser <your mail>
$ git config --global sendemail.smtpserver
$ git config --global sendemail.smtpencryption tls
$ git config --global sendemail.smtpserverport 587

You can also configure your password by doing:

$ git config --global sendemail.smtppass <your passwd>

However, that is not recommended since your password will be written unencrypted in ~/.gitconfig. This command has been included just for completeness. If smtppass has not been set, you will be prompted for your password every time you send a patchset.

Every time a patch is sent, your mail address is CC’ed by default. To prevent git-send-email from sending you back copies of your emails, do:

$ git config --global sendemail.suppresscc self

You can check ~/.gitconfig to see whether you have set up git-send-email correctly. Its contents should look similar to the following:

[user]
	email =
	name = Xenia Ragiadakou
[sendemail]
	smtpuser =
	smtpserver =
	smtpencryption = tls
	smtpserverport = 587
	suppresscc = self

Now, before proceeding with git-send-email options, try to send the patch series to your email account by doing:

$ git send-email --to <your mail> ~/patches/*.patch

If the above command fails, maybe there is a typo in your git-send-email configuration. The error message will help you track down possibly misconfigured settings in your ~/.gitconfig.

As I already stated above, if you are not interested in keeping a copy of your patches in a local directory, you can run git-send-email directly on your commits. Try it:

$ git send-email -3 --to=<your mail>

The above command will create the patchset in a subdirectory in /tmp/ and send it to your email. You can specify a different subject prefix as well, using the --subject-prefix option. For example:

$ git send-email -3 --subject-prefix="RFC" --to=<your mail>

Now, let's have a look at some useful git-send-email options. A complete list can be found in:

--to=<destination address>
	If you forget to set it, you will be prompted for it.

--cc=<cc'ed address>
	If you want to cc more than one address, you need to repeat --cc for
	each CC'ed address.

--[no-]chain-reply-to
	With --no-chain-reply-to, all patches following the first patch will
	be sent as replies to the first (shallow threading). This is the
	recommended way to send patch series to mailing lists. This is also
	the default so you don't need to set it explicitly.
	[PATCH 1/4] ...
		[PATCH 2/4] ...
		[PATCH 3/4] ...
		[PATCH 4/4] ...
	With --chain-reply-to, each patch will be sent in reply to the previous
	one (deep threading).
	[PATCH 1/4] ...
		[PATCH 2/4] ...
			[PATCH 3/4] ...
				[PATCH 4/4] ...

--[no-]thread
	With --thread, In-Reply-To and References headers are added to the
	sent emails. This is enabled by default.

--in-reply-to=<message id>
	It is used in order to send the patchset as a reply to an email with
	the specified Message ID. This is particularly useful when you want
	to send a revised version of a patchset because it won't break the
	existing thread and will help reviewers to follow your changes to
	the patchset. When --thread and --no-chain-reply-to are specified,
	threading will look like:
	[PATCH 1/4] ...
		[PATCH 2/4] ...
		[PATCH 3/4] ...
		[PATCH 4/4] ...
		[PATCH v2 1/4] ...
			[PATCH v2 2/4] ...
			[PATCH v2 3/4] ...
			[PATCH v2 4/4] ...

--compose
	With this option you can edit and send an introductory message to your
	patch series. You can specify the subject either when you directly edit
	the mail or using the option --subject. That can be helpful in case you
	want to add a cover letter to describe, for example, the changes
	introduced after the last revision of the patchset.
	Keep in mind that you need to set the subject prefix explicitly
	if you use this method for creating your cover letter (another method
	is the --cover-letter option of git-format-patch), e.g.:
	$ git send-email -3 --subject="[RFC 0/3] ..." --compose --to=<your mail>

The last step is to identify to whom the patchset should be sent. That can be done with the following command:

$ perl scripts/ < <your patch>

For instance, in the case of 0002-xhci-fix-SCT_FOR_CTX-p-macro.patch, I did:

$ perl scripts/ < 0002-xhci-fix-SCT_FOR_CTX-p-macro.patch

And the output was:

Sarah Sharp  (supporter:USB XHCI DRIVER)
Greg Kroah-Hartman  (supporter:USB SUBSYSTEM) (open list:USB XHCI DRIVER) (open list)

To send the patchset, I do:

$ git send-email --to
  --cc --cc
  --cc ~/patches/*.patch

Now, you are ready to send your patchsets to the kernel mailing lists using git-send-email 🙂

Before attempting that, though, please read the documentation on submitting patches to increase the chances of your patches being accepted.


How to enable and tune Dynamic Debugging for xHCI

Dynamic debugging is a kernel debug mechanism that allows Linux users and developers to dynamically enable or suppress kernel debugging statements.

The debugging statements which can be managed via the dynamic debug interface are those that have been written using pr_debug() or dev_dbg(). Dynamic debug does not operate on debugging messages written with functions other than those two.

To take advantage of the dynamic debug feature, the kernel has to be compiled with CONFIG_DYNAMIC_DEBUG on. So, before proceeding, make sure that your configuration file includes the following line:

CONFIG_DYNAMIC_DEBUG=y

Another thing that you should pay attention to, in the case of xhci, is that CONFIG_USB_DEBUG should not be set. If the USB_DEBUG configuration option is turned on, the debugging statements of the usb host controller drivers will all be enabled by default, blowing up your logs. That happens because when CONFIG_USB_DEBUG is set, the usb host controller drivers are compiled with the DEBUG flag.
The code in linux/dynamic_debug.h (the relevant parts are shown below) shows that when DEBUG is defined, _DPRINTK_FLAGS_DEFAULT gets defined as _DPRINTK_FLAGS_PRINT. Therefore, the condition of the if statement in dynamic_pr_debug() succeeds and the message gets printed by default.


#if defined DEBUG
#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT
#else
#define _DPRINTK_FLAGS_DEFAULT 0
#endif

#define DEFINE_DYNAMIC_DEBUG_METADATA(name, fmt)                    \
            static struct _ddebug  __aligned(8)                     \
            __attribute__((section("__verbose"))) name = {          \
                   /* ... remaining fields omitted ... */           \
                   .flags =  _DPRINTK_FLAGS_DEFAULT,                \
            }

#define dynamic_pr_debug(fmt, ...)                                  \
    do {                                                            \
            DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt);         \
            if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))  \
                    __dynamic_pr_debug(&descriptor, pr_fmt(fmt),    \
                                       ##__VA_ARGS__);              \
    } while (0)
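
The effect of the DEBUG flag can be imitated in a few lines of userspace C. This is only a sketch of the idea and not the kernel implementation: the names ddebug_model, MODEL_FLAGS_PRINT, MODEL_FLAGS_DEFAULT and model_pr_debug() are made up for illustration, and the descriptor is reduced to its flags field.

```c
#include <stdio.h>

#define MODEL_FLAGS_PRINT (1 << 0)	/* plays the role of _DPRINTK_FLAGS_PRINT */

/* Compile with -DDEBUG to have every statement enabled from the start. */
#if defined DEBUG
#define MODEL_FLAGS_DEFAULT MODEL_FLAGS_PRINT
#else
#define MODEL_FLAGS_DEFAULT 0
#endif

/* Hypothetical stand-in for the per-callsite struct _ddebug descriptor. */
struct ddebug_model {
	unsigned int flags;
};

/* Print msg only if the print flag is set; return 1 if it was printed. */
static int model_pr_debug(const struct ddebug_model *d, const char *msg)
{
	if (d->flags & MODEL_FLAGS_PRINT) {
		fputs(msg, stderr);
		return 1;
	}
	return 0;
}
```

Writing '+p' or '-p' to the control file corresponds, in this model, to setting or clearing MODEL_FLAGS_PRINT in the descriptor at runtime.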

The debugging statements can be dynamically enabled and disabled by writing to a debugfs file. So, make sure that debugfs is mounted (it is usually mounted at /sys/kernel/debug). Under debugfs, there is a subdirectory called ‘dynamic_debug’ that contains the file ‘control’.
In this file, you have to write queries with a specific format in order to enable/disable the debugging statements contained in a module, a file, a function or even a line. In that way, you can enable only the messages that really interest you. That saves you from searching through the logs and from the need to recompile or even reboot your system (unless you are interested in early boot debugging statements).

How the dynamic debug interface can be manipulated is explained in great detail in Documentation/dynamic-debug-howto.txt, so let’s make the rest of the discussion more xhci specific.

======= How to enable dynamic debug for xhci at early boot =======

The reason I decided to write this post in the first place is that in Debian I was using the boot option ddebug_query to enable dynamic debugging for xhci. However, when I moved to Arch Linux, I realised that this option is deprecated and I had to use the dyndbg option instead.
Therefore, if xhci is compiled as a built-in module, you need to add the following boot option:

dyndbg="<query>" (for instance, dyndbg='module xhci_hcd +p')

However, if the xhci driver has been compiled as a loadable module, the boot option should be:

xhci_hcd.dyndbg="<query>" (no need to specify again the module
                                 e.g xhci_hcd.dyndbg=+p)

The boot option can be set at boot by pressing ‘e’ in the GRUB menu or, if you want it to be set at every boot, you can add it to your default linux command line by editing the /etc/default/grub file as follows:


Don’t forget to update the grub by doing ‘update-grub’ or ‘grub-mkconfig -o /boot/grub/grub.cfg’.

The query can target a specific file (i.e. file <file name> +p), a specific function (i.e. func <function name> +p) or a specific line (i.e. line <line num> +p).
Also, more than one query can be issued, using ‘;’ as a delimiter to separate them.

======= How to enable dynamic debug for xhci at runtime =======

As already mentioned above, dynamic debugging is tuned by echoing queries into the <debugfs>/dynamic_debug/control file.
For example, to enable xhci debug statements do:

echo 'module xhci_hcd +p' > <debugfs>/dynamic_debug/control

To disable them do:

echo 'module xhci_hcd -p' > <debugfs>/dynamic_debug/control

If you are interested in debug messages generated by a group of functions, for example, you can write all the associated queries in a batch file and then use that file to enable them all together.
I would like to dwell on this a bit, since I have implemented some xhci event traces that do exactly the above (isolate debug messages). For example, the functionality of the xhci_dbg_cancel_urb trace, which aims to display the debug messages related to urb cancellation, can be implemented just by importing the following batch file, let’s call it urb_cancel.batch, into <debugfs>/dynamic_debug/control:

module xhci_hcd func xhci_urb_dequeue +p
module xhci_hcd func xhci_find_new_dequeue_state +p
module xhci_hcd func td_to_noop +p
module xhci_hcd func xhci_handle_cmd_stop_ep +p
module xhci_hcd func xhci_stop_endpoint_command_watchdog +p
module xhci_hcd func xhci_handle_cmd_set_deq +p

And, then, do:

cat urb_cancel.batch > <debugfs>/dynamic_debug/control
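
If you would rather generate such a batch file than type it by hand, each line can be produced with a small helper. The function below is a sketch only; build_func_query() is a hypothetical name and it covers just the 'module ... func ... +p/-p' form used in the batch file above.

```c
#include <stdio.h>

/*
 * Format one dynamic debug query of the form
 * "module <module> func <func> +p" (or "-p" to disable) into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int build_func_query(char *buf, size_t size, const char *module,
			    const char *func, int enable)
{
	int n = snprintf(buf, size, "module %s func %s %cp",
			 module, func, enable ? '+' : '-');

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Calling it once per function name and writing each result on its own line gives a file that can be fed to the control file in the same way as urb_cancel.batch.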

Note that using this method in place of separate debug message traces does not have any overhead and does not require additional code, decreasing xhci size.
So, I will double the number of patches I have sent by reverting my own patches :/

Apart from ‘p’, that indicates whether or not the statement will get printed, there are also some other flags like ‘f’, ‘l’, ‘m’ and ‘t’ that can be enabled/disabled if you want to include/exclude in the printed message the name of the function, the line number, the module name or the associated thread, respectively.

If you want to disable all the flags, use flag ‘_’, instead of disabling them individually.

OK, I’m done! For more information, refer to Documentation/dynamic-debug-howto.txt, which was recently updated and which was the main source for this post.


Use libusb to issue a clear halt to an endpoint

Currently, I am working on fixing the xhci_endpoint_reset() function so that an endpoint gets reset properly when the usb device driver calls usb_reset_endpoint() through a call to usb_clear_halt().

In order to trigger the bug, I have to issue a clear halt to an endpoint. For that job, I wrote a simple userspace program using libusb. I called the source file ‘xhci_resetep.c’ and its contents are presented below:

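/* xhci_resetep.c */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libusb-1.0/libusb.h>

#define VENDOR  0x13fe
#define PRODUCT 0x1a00

#define EP_DIR_OUT 0x0
#define EP_DIR_IN 0x80

/* Return 0 if dev matches VENDOR:PRODUCT, -1 otherwise. */
int is_myusb(libusb_device *dev)
{
	struct libusb_device_descriptor desc;

	if (libusb_get_device_descriptor(dev, &desc))
		return -1;

	if (desc.idVendor == VENDOR && desc.idProduct == PRODUCT)
		return 0;

	return -1;
}

int main(int argc, char *argv[])
{
	int c;	/* getopt() returns an int, not a char */
	const char *ep_dir = NULL;
	unsigned int ep = 0, iface = 0;
	libusb_device **udev_list;
	libusb_context *ctx = NULL;
	libusb_device *my_udev = NULL;
	ssize_t udev_cnt, i;
	libusb_device_handle *handle;
	int ret;

	while ((c = getopt(argc, argv, "e:d:i:")) != -1) {
		switch (c) {
		case 'i':
			iface = strtoul(optarg, NULL, 0);
			break;
		case 'e':
			ep = strtoul(optarg, NULL, 0);
			break;
		case 'd':
			ep_dir = optarg;
			break;
		default:
			fprintf(stderr, "ERR: Invalid options\n");
			return -1;
		}
	}

	if (!ep) {
		fprintf(stderr, "ERR: -e: Specify endpoint number\n");
		return -1;
	}

	/* ep_dir stays NULL if -d was not given */
	if (ep_dir && !strcmp(ep_dir, "in")) {
		ep |= EP_DIR_IN;
	} else if (ep_dir && !strcmp(ep_dir, "out")) {
		ep |= EP_DIR_OUT;
	} else {
		fprintf(stderr, "ERR: -d [in/out]: Invalid direction\n");
		return -1;
	}

	if (libusb_init(&ctx)) {
		fprintf(stderr, "ERR: failed to initialize libusb\n");
		return -1;
	}

	/* a negative return value is a libusb error code */
	udev_cnt = libusb_get_device_list(ctx, &udev_list);
	if (udev_cnt < 0) {
		fprintf(stderr, "ERR: failed to get the device list\n");
		libusb_exit(ctx);
		return -1;
	}
	printf("Number of attached usb devices = %zd\n", udev_cnt);

	for (i = 0; i < udev_cnt; i++) {
		if (!is_myusb(udev_list[i])) {
			my_udev = udev_list[i];
			break;
		}
	}

	if (!my_udev) {
		fprintf(stderr, "ERR: udev not found\n");
		ret = -1;
		goto cleanup;
	}

	ret = libusb_open(my_udev, &handle);
	if (ret) {
		fprintf(stderr, "ERR: failed to open udev ");
		switch (ret) {
		case LIBUSB_ERROR_NO_MEM:
			fprintf(stderr, "(memory allocation failed)\n");
			break;
		case LIBUSB_ERROR_ACCESS:
			fprintf(stderr, "(no permission)\n");
			break;
		case LIBUSB_ERROR_NO_DEVICE:
			fprintf(stderr, "(udev not found)\n");
			break;
		default:
			fprintf(stderr, "(error %d)\n", ret);
		}
		ret = -1;
		goto cleanup;
	}

	/* detach udev's driver, if present */
	if (libusb_kernel_driver_active(handle, iface) == 1)
		libusb_detach_kernel_driver(handle, iface);

	ret = libusb_claim_interface(handle, iface);
	if (ret) {
		switch (ret) {
		case LIBUSB_ERROR_NOT_FOUND:
			fprintf(stderr, "ERR: iface not found\n");
			break;
		case LIBUSB_ERROR_BUSY:
			fprintf(stderr, "ERR: iface claimed by another "
					"program or driver\n");
			break;
		case LIBUSB_ERROR_NO_DEVICE:
			fprintf(stderr, "ERR: udev not found\n");
			break;
		default:
			fprintf(stderr, "ERR: failed to claim iface (%d)\n", ret);
		}
		ret = -1;
		goto release;
	}

	ret = libusb_clear_halt(handle, ep);
	if (ret) {
		switch (ret) {
		case LIBUSB_ERROR_NOT_FOUND:
			fprintf(stderr, "ERR: ep 0x%02x not found\n", ep);
			break;
		case LIBUSB_ERROR_NO_DEVICE:
			fprintf(stderr, "ERR: udev not found\n");
			break;
		default:
			fprintf(stderr, "ERR: clear halt failed (%d)\n", ret);
		}
		ret = -1;
	}

	libusb_release_interface(handle, iface);
release:
	/* give the device back to its kernel driver */
	libusb_attach_kernel_driver(handle, iface);
	libusb_close(handle);
cleanup:
	libusb_free_device_list(udev_list, 1);
	libusb_exit(ctx);

	return ret;
}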

To build xhci_resetep.c, do:

$ gcc xhci_resetep.c -lusb-1.0 -o resetep

The executable ‘resetep’ takes the following command line parameters:

-e    : endpoint number
-d    : direction of the endpoint, in or out
-i    : interface number

Also, you need to set the values of the VENDOR and PRODUCT macros in the source file so that they correspond to the vendor and product ids of the usb device to which you want to issue the clear halt.
To set the above parameters properly, retrieve the necessary information using the following commands:

$ usb-devices
$ lsusb -v
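
The -e and -d parameters end up in a single byte, the endpoint address, which is what libusb_clear_halt() expects: the endpoint number in the low four bits and bit 7 set for IN endpoints. A minimal sketch of that combination follows; ep_address() is a hypothetical helper and is not part of the program above.

```c
#include <stdint.h>

#define EP_DIR_OUT 0x00
#define EP_DIR_IN  0x80

/* Combine an endpoint number (0-15) and a direction into an endpoint address. */
static uint8_t ep_address(unsigned int number, int is_in)
{
	return (uint8_t)((number & 0x0f) | (is_in ? EP_DIR_IN : EP_DIR_OUT));
}
```

So endpoint number 1 with direction "in" becomes address 0x81, while the same number with direction "out" stays 0x01.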

For example:

$ sudo ./resetep -e 1 -d in -i 0
$ dmesg | tail -5
[   75.253235] xhci_hcd 0000:00:10.0: Endpoint 0x81 not halted, refusing to reset

And voila! the bug was reproduced …


xHCI Interrupts

This post starts with a brief overview of how interrupts are implemented on PCI platforms.

There are three methods to implement interrupts on PCI platforms:

1) legacy interrupts

The devices attached to a PCI bus are equipped with an external interrupt pin which is connected to a dedicated interrupt line on the bus. The PCI bus has a limited number of such out-of-band lines. More specifically, there are four dedicated interrupt lines, named INTA, INTB, INTC and INTD.
To distribute the bus interrupt lines evenly across devices and reduce sharing, device interrupt pins and bus interrupt lines are multiplexed in the following way:

[Table: assignment of the interrupt pins of the 1st, 2nd and 3rd device to the bus interrupt lines]
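
One common way to spread the pins over the four lines is a simple rotation by device number, which is also what the Linux kernel assumes for devices behind a PCI-to-PCI bridge. The sketch below shows that rotation; it is an assumption that the scheme described here follows it exactly, since actual board wiring may differ, and swizzle_pin() is an illustrative name.

```c
/*
 * Rotate a device interrupt pin onto a bus interrupt line.
 * pin: 1..4 for INTA..INTD, dev: device number on the bus.
 * Returns the bus line, again encoded as 1..4 for INTA..INTD.
 */
static unsigned int swizzle_pin(unsigned int dev, unsigned int pin)
{
	return ((pin - 1 + dev) % 4) + 1;
}
```

With this rotation, INTA of device 0 stays on INTA, INTA of device 1 lands on INTB, and so on, so that devices using only their first pin do not all share the same line.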

Since CPU interrupt pins are expensive and limited, PCI interrupt lines are connected to the input pins of an interrupt controller, PIC or APIC. Whenever an interrupt triggers, the interrupt controller asserts the CPU’s interrupt pin. The OS can figure out which input pin of the interrupt controller triggered the interrupt and acknowledge it by reading/writing the IO/Memory-mapped registers of the interrupt controller.
The input pins of the interrupt controller are numbered and are referred to as IRQ lines. The device’s interrupt pin and IRQ line are reported in its PCI Configuration registers and can be viewed with the command:

$ lspci -b -vv

The above holds only for PCI buses. The PCI-Express bus, which is implemented as a point-to-point interconnect, has no dedicated out-of-band interrupt lines. Instead, devices attached to PCIe have to implement the MSI/MSI-X in-band interrupt mechanism to trigger interrupts (this is described below). However, for backward compatibility with drivers that do not support MSI/MSI-X interrupts, PCIe capable devices can emulate legacy interrupts.

2) MSI interrupts

MSI interrupts are in-band interrupts, i.e. no dedicated interrupt pins exist; instead, interrupts are reported via the same lines used for data transfers. An MSI capable device triggers an interrupt by sending a special packet called an MSI Message. Whether a device is MSI capable is indicated by the MSI Capability structure in its PCI Configuration registers. A device can support up to 32 MSI Messages (interrupts). The number the device supports is reported in bits [3:1] of the Message Control field of the MSI Capability structure, while the number of MSI Messages actually allocated to it is written in bits [6:4] of the same register (note that each number is encoded as a power of 2, so you need to compute 1 << num to get the corresponding decimal value).
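
The two bit fields can be decoded as follows. This is a small sketch with made-up helper names; it only assumes the 16-bit Message Control layout described above.

```c
#include <stdint.h>

/* Bits [3:1]: number of messages the device supports, log2 encoded. */
static unsigned int msi_vectors_supported(uint16_t msg_ctrl)
{
	return 1u << ((msg_ctrl >> 1) & 0x7);
}

/* Bits [6:4]: number of messages allocated to the device, log2 encoded. */
static unsigned int msi_vectors_allocated(uint16_t msg_ctrl)
{
	return 1u << ((msg_ctrl >> 4) & 0x7);
}
```

For a Message Control value of 0x27, for instance, bits [3:1] hold 3 and bits [6:4] hold 2, so the device supports 8 messages and 4 of them have been allocated.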
The implementation of MSI, as of any interrupt mechanism, is dependent on the underlying hardware. On x86 platforms, an MSI Message is a memory write that is received by the interrupt controller, more specifically the local APIC, and ends up mapped to the correct virtual IRQ line. When an interrupt triggers, the device writes the contents of the Message Data field of its MSI Capability structure to the address held in the Message Address field, so that the interrupt controller gets the necessary information about which interrupt to raise.
To see whether MSI messages are enabled for a pci device, as well as the contents of the MSI Capability structure, search the lspci -vv output for a line starting with:

Capabilities: MSI:

3) MSI-X interrupts

MSI-X is an enhancement to MSI. It can provide up to 2048 interrupts and supports a different Address and Data for each vector. Whether a device is MSI-X capable, as well as the number of supported MSI-X vectors (i.e. the size of the MSI-X Table), is reported in the MSI-X Capability structure in its PCI Configuration registers. The MSI-X Capability structure also provides pointers to the MSI-X Table and the bit-per-vector Pending Bit Array (PBA) structures, which reside in memory-mapped address space. The MSI-X Table holds the Address and Data for each supported MSI-X message as well as a Control field which is used for masking the interrupt corresponding to this message. The PBA, as its name suggests, is an array of bits with size equal to the number of supported MSI-X vectors, where each bit, if set, indicates a pending interrupt. When the device wants to deliver an interrupt, it sets the PBA bit of the corresponding MSI-X vector and, if the associated entry in the MSI-X Table does not have its Control field set to masked, the device writes Data to the memory location indicated in the Address field.
To see whether a pci device is MSI-X capable, as well as the contents of the MSI-X Capability structure, search the lspci -vv output for a line starting with:

Capabilities: MSI-X:
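
The interplay between the MSI-X Table, the mask bit and the PBA can be modelled in userspace. The sketch below is a toy model under several assumptions: all names are hypothetical, the table has only 8 entries, and the memory write is replaced by recording the address and data of the last message. It also assumes that a pending bit is cleared once its message has been sent.

```c
#include <stdint.h>

#define MODEL_NVEC		8	/* hypothetical table size */
#define MSIX_CTRL_MASKED	0x1	/* mask bit of the Control field */

struct msix_entry_model {
	uint64_t addr;	/* Message Address */
	uint32_t data;	/* Message Data */
	uint32_t ctrl;	/* Control field */
};

struct msix_model {
	struct msix_entry_model table[MODEL_NVEC];
	uint8_t pba;		/* one pending bit per vector */
	uint64_t last_addr;	/* address of the last message sent */
	uint32_t last_data;	/* data of the last message sent */
};

/* Send vector vec if it is pending and unmasked; return 1 if sent. */
static int msix_try_deliver(struct msix_model *m, unsigned int vec)
{
	if (!(m->pba & (1u << vec)) || (m->table[vec].ctrl & MSIX_CTRL_MASKED))
		return 0;
	m->last_addr = m->table[vec].addr;	/* stands in for the memory write */
	m->last_data = m->table[vec].data;
	m->pba &= (uint8_t)~(1u << vec);
	return 1;
}

/* The device signals an interrupt: mark it pending, then try to send it. */
static int msix_raise(struct msix_model *m, unsigned int vec)
{
	m->pba |= (uint8_t)(1u << vec);
	return msix_try_deliver(m, vec);
}

/* Software unmasks a vector; a pending interrupt is sent at that point. */
static int msix_unmask(struct msix_model *m, unsigned int vec)
{
	m->table[vec].ctrl &= ~(uint32_t)MSIX_CTRL_MASKED;
	return msix_try_deliver(m, vec);
}
```

Raising a masked vector only sets its pending bit; the message goes out later, when the vector is unmasked.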

What’s the main MSI/MSI-X benefit? Interrupt affinity

Aside from the gains in size from eliminating the need for external interrupt pins, and aside from eliminating interrupt sharing, the MSI/MSI-X mechanism enables the binding of interrupts to a specific CPU. Modern SMP systems that support process affinity see significant performance benefits from increased cache hits when the interrupt is delivered to the CPU running the process associated with this interrupt. For SMP systems with more than 32 cores, MSI-X eliminates the need for the re-vectoring logic that MSI requires to implement interrupt affinity.

After this “brief” description, let’s move to kernel stuff …

The information related to each interrupt is stored in an ‘irq_desc’ structure. Each interrupt is referenced by a numeric value corresponding to an entry in an interrupt descriptor table kept by the kernel. Each irq descriptor is associated with a list of handlers (in case of shared interrupts) via its ‘action’ field. Physical IRQ lines (i.e. the input pins of interrupt controllers) are a limited resource. The number of available IRQ lines depends on the interrupt controller. On systems having master and slave PICs, 16 IRQ lines are used, while on systems with an APIC, 32 IRQ lines are available, 8 of which can be used by PCI devices.

The xhci driver in its pci probe function calls usb_hcd_pci_probe(), which in turn calls pci_enable_device(). When pci_enable_device() is called, the pci kernel code first checks whether the pci device is msi/msi-x capable. If it isn’t, the Interrupt Pin is read from its PCI Configuration registers and the kernel searches for an available IO APIC IRQ line routed to this pin. If one is found, the kernel updates the contents of the Interrupt Line register of the device’s PCI Configuration Space so that the driver is informed of the IRQ line on which to request interrupts. If the device is msi/msi-x capable, the xhci driver needs to call pci_enable_msi_block() or pci_enable_msix(), respectively. The kernel uses the structure ‘msi_desc’ to store the information related to MSI interrupts (such as the virtual irq number, the last message etc).
pci_enable_msi_block() takes care of allocating and initializing the requested msi descriptors: it creates as many entries in the irq table as the number of MSI vectors, assigns the irq numbers and writes the appropriate values to the Address and Data fields of the MSI Capability structure. The ‘msi_list’ field of the pci_dev structure can be used to find the assigned irq numbers when requesting an MSI interrupt.

pci_enable_msix() --> TODO

After that, xhci can request to associate an interrupt to a handler using request_irq(). This request must be performed before the device is instructed to generate interrupts.

request_irq(irq_num, handler, irq_flags, name, driver)

The number of interrupts supported by the xHC host controller is reported via the MAX_INTRS field of the HCSPARAMS1 register. Normally, the same number shall be reported in its MSI/MSI-X Capability structure. These interrupts are used to signal to the xhci driver that a new Transfer Event has been posted in one of its Event Rings. However, the current implementation of the driver uses only one Event Ring, with the intention to extend the number of Event Rings in the future, hence only one interrupt is used at the moment.
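
As an illustration, the number of interrupters can be pulled out of the structural parameters register (HCSPARAMS1 in the xHCI specification) with a shift and a mask. The helper names below are hypothetical; the bit positions assumed are bits [7:0] for the number of device slots and bits [18:8] for the number of interrupters.

```c
#include <stdint.h>

/* Bits [7:0] of HCSPARAMS1: maximum number of device slots. */
static unsigned int hcs_max_slots(uint32_t hcs_params1)
{
	return hcs_params1 & 0xff;
}

/* Bits [18:8] of HCSPARAMS1: maximum number of interrupters. */
static unsigned int hcs_max_intrs(uint32_t hcs_params1)
{
	return (hcs_params1 >> 8) & 0x7ff;
}
```

A register value of 0x04000820, for example, would describe a controller with 32 device slots and 8 interrupters.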

When the xhci driver wants a transfer to trigger an interrupt on completion or when a short packet is detected, i.e. wants the xHC to register a Transfer Event on the Event Ring and send an MSI/MSI-X message, it sets the IOC (Interrupt-On-Completion) or the ISP (Interrupt-on-Short-Packet) bit, respectively, to 1 and specifies the interrupter number (not the irq number) in the Interrupter Target field of the Transfer TRB.

Mind your step when you are reading my posts!!!
Please don’t trip on the often blurry line between fiction and reality


Issuing commands to xHC

To issue commands to the USB device or the xHC, the xhci driver uses two basic structures, the Command TRBs and the Command Ring. In effect, the Command Ring is a circular buffer in host memory where the Command TRBs to be passed to the xHC are stored. The xHC reads the Command TRBs placed on the Command Ring by the xhci driver using DMA read transfers.

In order for the above scheme to work, two elementary things are needed:
1) to indicate to the xHC that there is an unhandled Command TRB on the Command Ring waiting for execution
2) and to communicate to the xHC the DMA address of the Command Ring

These are performed with the use of two memory-mapped registers:
1) the first by writing to the Host Controller Doorbell Register.
2) and the second via the Command Ring Control Register.
More specifically, the xHC has an internal register, called the Command Ring Dequeue Pointer, whose value is initialized with the value written to the memory-mapped Command Ring Control Register and is updated by the xHC when it finishes processing each stored command. When a command is completed, the xHC writes the value of this internal register to the Command Completion Event TRB that it posts in the Event Ring.

The xhci driver is informed about the completion status of the commands passed to the xHC via different circular buffers called Event Rings. The Event Rings, which also reside in host memory, are populated by the xHC with the so-called Event TRBs using DMA write transfers.

So, when the xhci driver wants to issue a command to the xHC, it needs to build the appropriate Command TRB, enqueue it in the Command Ring and ring the xHC doorbell by writing to the Host Controller Doorbell register.
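
The producer/consumer relation between the driver and the xHC can be pictured with a plain circular buffer. The model below is deliberately simplified and is an assumption-laden sketch: it ignores link TRBs, cycle bits and DMA, the ring has only four slots, the doorbell is just a counter, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_SIZE 4	/* hypothetical; chosen small for the example */

/* A TRB modelled as four 32-bit words. */
struct trb_model {
	uint32_t field[4];
};

struct ring_model {
	struct trb_model trbs[MODEL_RING_SIZE];
	size_t enqueue;			/* next slot the driver writes */
	size_t dequeue;			/* next slot the controller reads */
	unsigned int doorbell_rings;	/* counts simulated doorbell writes */
};

/* Driver side: queue a command and ring the doorbell; -1 if the ring is full. */
static int ring_queue_command(struct ring_model *r, const struct trb_model *trb)
{
	size_t next = (r->enqueue + 1) % MODEL_RING_SIZE;

	if (next == r->dequeue)
		return -1;	/* one slot stays empty to tell full from empty */
	r->trbs[r->enqueue] = *trb;
	r->enqueue = next;
	r->doorbell_rings++;
	return 0;
}

/* Controller side: fetch the next command; -1 if nothing is pending. */
static int ring_fetch_command(struct ring_model *r, struct trb_model *out)
{
	if (r->dequeue == r->enqueue)
		return -1;
	*out = r->trbs[r->dequeue];
	r->dequeue = (r->dequeue + 1) % MODEL_RING_SIZE;
	return 0;
}
```

In the real hardware the dequeue side lives inside the xHC (the Command Ring Dequeue Pointer), and the driver only learns how far the controller has progressed from the Command Completion Events.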

The Command Ring is allocated in xhci_mem_init():

xhci->cmd_ring = xhci_ring_alloc(xhci, 1, 1, TYPE_COMMAND, flags);

And its DMA address is written in the Command Ring Control Register:

val_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
val_64 = (val_64 & (u64) CMD_RING_RSVD_BITS) |
		(xhci->cmd_ring->first_seg->dma & (u64) ~CMD_RING_RSVD_BITS) |
		xhci->cmd_ring->cycle_state;
xhci_write_64(xhci, val_64, &xhci->op_regs->cmd_ring);
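
The masking in the snippet above can be tried out in isolation. The sketch below assumes that the reserved/flag bits are the low six bits of the register (a mask of 0x3f) and that the lowest bit carries the ring cycle state; crcr_value() is a hypothetical helper and not a driver function.

```c
#include <stdint.h>

#define MODEL_CMD_RING_RSVD_BITS 0x3fULL	/* assumed: low six bits */

/*
 * Keep the low bits of the old register value, take the remaining bits
 * from the ring DMA address and OR in the cycle state.
 */
static uint64_t crcr_value(uint64_t old, uint64_t ring_dma,
			   unsigned int cycle_state)
{
	return (old & MODEL_CMD_RING_RSVD_BITS) |
	       (ring_dma & ~MODEL_CMD_RING_RSVD_BITS) |
	       (cycle_state & 1);
}
```

This also shows why the ring has to be aligned: any address bits that fall inside the low six bits are simply dropped.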

The function for ringing the xHC doorbell is the xhci_ring_cmd_db(), in which the doorbell register is written:

xhci_writel(xhci, DB_VALUE_HOST, &xhci->dba->doorbell[0]);

The xhci driver stores all information related to a command into the structure xhci_command, that is defined as follows:

struct xhci_command {
	struct xhci_container_ctx	*in_ctx;
	u32				status;
	struct completion		*completion;
	union xhci_trb			*command_trb;
	struct list_head		cmd_list;
};

The ‘completion’ field is used when the xhci driver waits for the completion of the command; otherwise ‘completion’ is set to NULL. The completion structure is initialized using init_completion(cmd->completion).
The xhci driver calls wait_for_completion_interruptible_timeout(cmd->completion, time_interval) to wait for the command’s completion for a predefined time interval. The completion is triggered using complete(cmd->completion). In order for a command to be allocated with its ‘completion’ field allocated and initialized, you need to call xhci_alloc_command() with the third argument ‘allocate_completion’ set to true. If wait_for_completion_interruptible_timeout() returns 0, the time interval has passed and complete() has not been called; if it returns a negative code, a signal interrupted the thread waiting on this completion; and if it returns a positive value, complete() was called before the time interval expired.
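
The three outcomes can be told apart with a simple check on the return value. The following is a userspace sketch with hypothetical names that only mirrors the decision a caller has to make; it does not call any kernel function.

```c
enum wait_outcome {
	WAIT_TIMED_OUT,		/* 0: the interval passed without complete() */
	WAIT_INTERRUPTED,	/* negative: a signal interrupted the wait */
	WAIT_COMPLETED		/* positive: complete() was called in time */
};

/* Classify a return value of the kind described above. */
static enum wait_outcome classify_wait(long ret)
{
	if (ret == 0)
		return WAIT_TIMED_OUT;
	if (ret < 0)
		return WAIT_INTERRUPTED;
	return WAIT_COMPLETED;
}
```

Only in the WAIT_COMPLETED case is it meaningful to look at the ‘status’ field of the command.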

The ‘status’ field is the command completion status reported by the Command Completion Event TRB and it is written into this field before the command completes.

The ‘cmd_list’ field is used to form a node on a device’s command list. Each device is associated with a list of the commands that have been issued to it and have a completion thread associated with them. New commands to the device are allocated and added at the end of the device’s command list.
When a command completes, or more precisely when xhci_complete_cmd_in_cmd_wait_list() is called, the command is deleted from the device’s command list.

The fields ‘command_trb’ and ‘in_ctx’ are pointers to the data structures that hold the command and its parameters. Briefly, the Command TRB is a 16-byte structure reporting the type of the command, the slot and the endpoint IDs to which the command is issued and the DMA address of the Input Context, in case the command wants to configure the slot or endpoint contexts.

After the allocation and initialization of the xhci_command structure, the command is entered in the Command Ring using xhci_queue_<command_type>() functions, that set appropriately, depending on the command type, the fields of the Command TRB and update the Command Ring with the new Command TRB. To inform xHC that a new command has been enqueued, we ring the doorbell using xhci_ring_cmd_db(). Furthermore, the xhci_command is inserted to the device command list.

After the command execution, xHC generates a Command Completion Event TRB, posts it to the xhci driver via the Event Ring and generates a interrupt. The xhci has more than one Event Rings, each one associated with an interrupt line (more about the xhci interrupters’ interface will be said in a follow-on post). The Command Completion Event TRB informs xhci driver about the command’s completion status. The xHC interrupts are handled by the xhci_irq() routine, which basically acknowledges the interrupt and calls xhci_handle_event() to handle the event accordingly to its type (apart from the Command Completion Events, there are also other types of events passed to the Event Ring such as Transfer Events). Hence, the Command Completion Events are actually handled by handle_cmd_completion(). Different type of commands are handled differently upon completion. xHC will post a Command Completion Event on every command issued to the Command Ring.

Under some situations, xhci driver needs to wait on the completion of a command before comtinue its tasks. For example, the driver have to wait for the completion of a Configure Endpoint command before issueing further commands to the host controller. That ‘s why the driver calls wait_for_completion_interruptible_timeout() after queueing a Configure Endpoint command and the command completion handler has to call complete() for that command. The same stands for the Address Device, Evaluate Context and Enable Slot commands. If you are interested in waiting on the completion of other commands, you need to ensure that complete() will be called in the code path, otherwise the timer will expires although the Command Completion Event was generated by xHC.

Another thing to take care of when issuing a command is to remove the xhci_command struct from the device command list and free it once it is no longer needed.
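The life cycle described above (allocate, queue, ring the doorbell, complete, unlink, free) can be sketched as a small user-space model. This is not driver code: struct cmd, struct cmd_list, cmd_queue(), cmd_complete_event() and cmd_finish() are hypothetical names invented for the illustration, the "ring" is a plain linked list, and a boolean flag stands in for the kernel's completion mechanism.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for xhci_command: one pending command. */
struct cmd {
	uint64_t trb_dma;   /* address of the command TRB (made up here) */
	int status;         /* completion code copied from the event */
	bool completed;     /* stand-in for the kernel completion */
	struct cmd *next;   /* link in the pending-command list */
};

/* Hypothetical stand-in for the device command list. */
struct cmd_list {
	struct cmd *head;
	unsigned int doorbell_rings;
};

/* Allocate a command, put it on the list and "ring the doorbell". */
static struct cmd *cmd_queue(struct cmd_list *list, uint64_t trb_dma)
{
	struct cmd *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->trb_dma = trb_dma;
	c->next = list->head;
	list->head = c;
	list->doorbell_rings++;
	return c;
}

/* Model of the completion handler: find the command and complete it. */
static bool cmd_complete_event(struct cmd_list *list, uint64_t trb_dma,
			       int status)
{
	for (struct cmd *c = list->head; c; c = c->next) {
		if (c->trb_dma == trb_dma) {
			c->status = status;
			c->completed = true;   /* complete() in the driver */
			return true;
		}
	}
	return false;   /* no waiter: nobody would be woken up */
}

/* Unlink the command from the list and free it; returns its status. */
static int cmd_finish(struct cmd_list *list, struct cmd *c)
{
	int status = c->status;

	for (struct cmd **p = &list->head; *p; p = &(*p)->next) {
		if (*p == c) {
			*p = c->next;
			break;
		}
	}
	free(c);
	return status;
}
```

In this model, a caller would queue a command with cmd_queue(), wait until the completed flag is set by cmd_complete_event(), and then call cmd_finish(), which covers both the unlinking and the freeing that are easy to forget.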

Detailed documentation on the available commands and their functionality can be found in the xHCI Specification.

To be updated (probably)…

Posted in Uncategorized | Leave a comment

DMA mask setup for xHCI

The xHCI interface defines data structures that are used by the xhci-hcd driver and the xHC host controller to manage the USB devices. The buffers referenced by these data structures are allocated in host memory, and transfers between these host memory buffers and the xHC host controller are performed using DMA. DMA is critical for achieving USB 3.0 speeds.

DMA allows the xHC host controller to access the host memory without the CPU’s intervention. To do so, the xHC host controller must be able to take control of the bus and initiate a bus transaction. On PCI platforms, the xHC host controller is given bus mastering capability during PCI device enumeration via a call to pci_set_master().

As a peripheral, the xHC host controller must use bus addresses to address the DMA buffers allocated in host memory. So there must be an initial mechanism through which the xHC host controller learns the bus addresses of the DMA buffers. On PCI platforms, the xHC host controller’s register set is mapped into host memory during PCI enumeration. The xhci-hcd driver writes the bus address (or DMA address) of some DMA buffers into some of these registers; those buffers in turn contain the DMA addresses of other buffers, and so on. The xHC then uses these addresses to perform DMA transfers to and from the corresponding buffers.

For the data structures used by the xHC host controller, the xhci-hcd driver must allocate memory and then map it to DMA addresses. This memory must be physically contiguous and reside in a DMA-able region of memory, meaning that it can be reached using a bus address. If a DMA buffer has been allocated in a non-DMA-able region, the DMA mapping function must set up a bounce buffer, and transfers between the xHC host controller and the original buffer are then performed via the bounce buffer.

The Linux kernel provides developers of peripheral drivers with a generic DMA API that is architecture and bus independent. It simplifies the allocation and mapping of DMA buffers by abstracting the architecture- and bus-specific DMA setup layers. The xhci-hcd driver uses this DMA API to allocate its data structures and map their virtual addresses to DMA addresses.

The DMA API defines the opaque type dma_addr_t to hold a bus address and distinguishes between two types of DMA mappings, streaming and coherent. Coherent mappings guarantee that changes to the content of DMA buffers are immediately visible to both the driver and the peripheral. Streaming mappings, on the other hand, do not guarantee cache coherency; with them, coherent DMA operations rely on the developer using the streaming mapping functions correctly. Coherent mappings have additional overhead to set up and use compared to streaming mappings, so streaming mappings are preferred for single transfers.

In xhci-hcd, all the buffers used for communicating data between the xHC host controller and the driver are allocated using coherent mappings. The exception is the URB buffers, which hold the USB packets passed between the host controller and the device driver and are mapped using streaming mappings.

The size of DMA addresses is constrained by the width of the internal address register of the peripheral’s DMA engine and by the number of bus address lines. DMA transactions to buffers whose bus addresses need more bits than the smaller of these two widths cannot be performed. The number of bits that can be used to hold a DMA address is hardware specific and is declared by setting the DMA mask. Setting the DMA mask to the highest supported value enables the xHC host controller to address a larger memory region and improves the system’s performance by eliminating the use of bounce buffers.
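The relation between address width, DMA mask and reachable buffers can be made concrete with a little arithmetic. The sketch below is a user-space illustration only; dma_bit_mask(), usable_dma_bits() and dma_reachable() are hypothetical helpers, with dma_bit_mask() written to give the same values as the kernel's DMA_BIT_MASK() macro.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask with the low n bits set (n up to 64), in the spirit of DMA_BIT_MASK(). */
static uint64_t dma_bit_mask(unsigned int n)
{
	return n >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
}

/* The usable width is the smaller of the device and bus address widths. */
static unsigned int usable_dma_bits(unsigned int dev_bits, unsigned int bus_bits)
{
	return dev_bits < bus_bits ? dev_bits : bus_bits;
}

/* True if the whole buffer [addr, addr + len) can be addressed under mask. */
static bool dma_reachable(uint64_t addr, uint64_t len, uint64_t mask)
{
	if (len == 0)
		return addr <= mask;
	/* Written this way to avoid overflowing addr + len. */
	return addr <= mask && len - 1 <= mask - addr;
}
```

With a 32-bit mask, a 4 KB buffer starting at 0xfffff000 is still reachable, while any buffer that touches address 0x100000000 is not; in the kernel, such a buffer is the kind that would need a bounce buffer.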

To set up the DMA mask for a device, the fields dma_mask and coherent_dma_mask of the generic ‘struct device’ must be set. dma_mask is a pointer to the DMA mask used in streaming DMA transfers, while coherent_dma_mask holds the DMA mask used in coherent DMA transfers.

The xHC host controller reports its addressing capabilities via the HCCPARAMS register. The host system’s addressing capability is architecture and bus dependent. Hence, to check whether the DMA addressing mode supported by the xHC host controller is also supported by the host, the device’s dma_mask and coherent_dma_mask must be set using dma_set_mask() and dma_set_coherent_mask().

The definition of dma_set_mask() is found in the arch/ subdirectory since its implementation is architecture specific. The code, before setting the dma_mask to point to the appropriate DMA bitmask, checks whether the pointer has been initialized and whether the required bitmask is supported by the current architecture. For instance, for the x86, the implementation of dma_set_mask() is:

int dma_set_mask(struct device *dev, u64 mask)
{
         if (!dev->dma_mask || !dma_supported(dev, mask))
                 return -EIO;
         *dev->dma_mask = mask;
         return 0;
}

Hence, dma_set_mask() returns 0 on success and a negative error code otherwise.
On PCI platforms, pci_device_add() initializes the dma_mask pointer during PCI device enumeration to point to the dma_mask field of the pci_dev, which is set to 32 bits. So, by default, devices attached to PCI are assumed to be 32-bit DMA capable (SAC, single address cycle). On other platforms, the caller of dma_set_mask() must verify that the dma_mask pointer is not NULL, otherwise the function will fail even if the bitmask is supported by the platform.

For the x86, the implementation of dma_set_coherent_mask() is:

int dma_set_coherent_mask(struct device *dev, u64 mask)
{
         if (!dma_supported(dev, mask))
                 return -EIO;
         dev->coherent_dma_mask = mask;
         return 0;
}

As mentioned above, the addressing capabilities of the USB 3.0 host controller are reported via the HCCPARAMS register and can be 32- or 64-bit. If the xHC is 64-bit capable (DAC), the xhci-hcd driver has to set the dma_mask to 64 bits to avoid the use of bounce buffers and of the IOMMU.
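The mask setup can be illustrated with a user-space model that mirrors the two x86 functions quoted above and adds a fallback from 64-bit to 32-bit masks. Everything here is a sketch under stated assumptions: struct model_dev, the model_*() functions and the platform_mask field are hypothetical, the ac64 flag stands for the 64-bit addressing capability read from HCCPARAMS, and the rule used by model_dma_supported() is a simplification chosen for the example, not the kernel's actual check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct device. */
struct model_dev {
	uint64_t *dma_mask;          /* streaming mask, may be NULL */
	uint64_t coherent_dma_mask;  /* coherent mask */
	uint64_t platform_mask;      /* what the modelled host can address */
};

/* Assumption: the host accepts any mask that does not exceed its own. */
static bool model_dma_supported(const struct model_dev *dev, uint64_t mask)
{
	return mask <= dev->platform_mask;
}

/* Same shape as the x86 dma_set_mask() shown in the text. */
static int model_dma_set_mask(struct model_dev *dev, uint64_t mask)
{
	if (!dev->dma_mask || !model_dma_supported(dev, mask))
		return -EIO;
	*dev->dma_mask = mask;
	return 0;
}

/* Same shape as the x86 dma_set_coherent_mask() shown in the text. */
static int model_dma_set_coherent_mask(struct model_dev *dev, uint64_t mask)
{
	if (!model_dma_supported(dev, mask))
		return -EIO;
	dev->coherent_dma_mask = mask;
	return 0;
}

/*
 * Try 64-bit masks when the controller is 64-bit capable, otherwise
 * (or on failure) fall back to 32-bit masks.
 * Returns the number of bits chosen, or a negative error code.
 */
static int model_setup_masks(struct model_dev *dev, bool ac64)
{
	const uint64_t mask64 = ~UINT64_C(0);
	const uint64_t mask32 = UINT64_C(0xffffffff);

	if (ac64 && !model_dma_set_mask(dev, mask64) &&
	    !model_dma_set_coherent_mask(dev, mask64))
		return 64;
	if (!model_dma_set_mask(dev, mask32) &&
	    !model_dma_set_coherent_mask(dev, mask32))
		return 32;
	return -EIO;
}
```

The model also shows why the NULL check on the dma_mask pointer matters: with an uninitialized pointer, both attempts fail even though the modelled platform would accept the mask.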

In general, it is recommended to unmap DMA regions as soon as they are no longer used for DMA transfers, and to use dma_set_mask() and dma_set_coherent_mask() instead of assigning values directly to the DMA and coherent DMA masks.



Trace-cmd is a tool implemented by Steven Rostedt. Its purpose is to facilitate the users’ control over the kernel tracing mechanism and to add support for more elaborate parsing and display of the trace output through plugins.

The trace-cmd tool can be found in the following repository:


To download and install it, do:

$ git clone git://
$ cd trace-cmd
$ make && sudo make install

The above commands will build and install the trace-cmd tool in /usr/local/bin. If you prefer to install it in another directory, do instead:

$ make && make prefix=<dir> install

When you use the trace-cmd tool, the debug filesystem is mounted automatically (if it is not already mounted) on /sys/kernel/debug/.

Trace-cmd writes the generated traces in a file, called trace.dat, and displays the content of this file either with the default format, as defined in the corresponding trace event’s definition, or it uses a plugin, if available, to parse and display the file content.

Trace-cmd plugins are written in C or in Python and can be found under the directory /usr/local/lib/trace-cmd/plugins or <dir>/lib/trace-cmd/plugins.

For the Python plugins to work, you need to install the python-dev and swig packages.

As described in the previous post, a trace event can be enabled by writing directly to the debugfs tracing files, using either of the following:

$ echo 1 > events/<trace system>/<trace event>/enable
$ echo <trace system>:<trace event> >> set_event

And to disable it, use either of the following:

$ echo 0 > events/<trace system>/<trace event>/enable
$ echo '!<trace system>:<trace event>' >> set_event

To view the trace output, we do:

$ cat trace

An alternative way, using trace-cmd, would be:

$ trace-cmd record -e <trace system>:<trace event>
$ trace-cmd report

‘record’ will enable the trace event and record its trace in the trace.dat file until Ctrl-C is pressed, while ‘report’ will load the plugin if available and will display the trace.
A subsequent record command will overwrite the previous contents in trace.dat.

If you want to record more than one trace event, for instance:
– all the trace events of xhci-hcd, you can do:

$ trace-cmd record -e xhci-hcd

– multiple trace events, you can do:

$ trace-cmd record -e xhci-hcd:xhci_dbg_context_change -e xhci-hcd:xhci_cmd_completion

And then filter the report according to a specific trace event:

$ trace-cmd report -F xhci_cmd_completion

Or even filter according to the value of an event’s structure field:

$ trace-cmd report -F xhci_cmd_completion:dma==0x374060c0

The filter can also name the trace system explicitly, in the general form <trace system>/<trace event>:<filter expression>, for example:

$ trace-cmd report -F xhci-hcd/xhci_cmd_completion:dma==0x374060c0

The ‘report’ command will use the plugin, if available, to report the traces. If you don’t want to load any plugins and you want to view the traces in the default format (not the raw trace), do:

$ trace-cmd report -N

If you are a raw foodist do:

$ trace-cmd report -R

To list all available trace events do:

$ trace-cmd list -e

To list all available tracers do:

$ trace-cmd list -t

To list all available plugins do:

$ trace-cmd list -P

This command, though, does not list my plugin written in Python. So maybe it lists only the C plugins under ../lib/trace-cmd/plugins.

When you have finished tracing and want your performance and buffers back, do:

$ trace-cmd reset

I think the most important commands (to me) have been covered so far. More extensive documentation on the trace-cmd commands and options can be found in the downloaded trace-cmd files, under the Documentation/ subdirectory.


Linux Kernel Tracing

Ftrace is a tracing mechanism built into the kernel that lets developers trace specific events by inserting tracepoints at the appropriate code sites.

Apart from the trace event feature, the kernel can be configured to dynamically trace function calls and various latencies, helping developers to profile and analyze the performance of their code.

This post will focus on how to implement trace events and tracepoints and manage the trace event output.

The trace data generated by trace events are stored in a ring buffer. When a trace event is implemented, a function is defined and associated with this trace event. This function is called the probe, and its main role is to record the event data in the ring buffer. The user can then read the trace data recorded in the ring buffer via the debug virtual filesystem.

When a tracepoint of a given trace event is encountered in the code and its trace event is enabled, the probe function is called at the tracepoint’s callsite. If the trace event is disabled, its function is simply not called. The above result is achieved because the probe function is not called directly but it is called via a wrapper function, with the same prototype, which internally tests whether the trace event is enabled before calling the probe. So, in effect, the wrapper function is the function that will be called at the tracepoint callsite.

A simplification of the wrapper function is shown below.

void trace_##name(proto)
{
	if (trace_event_##name_is_enabled)
		((void(*)(proto))(probe))(args);
}

It can be observed that the name of the tracepoint consists of the name of the trace event prepended with the prefix ‘trace_’. Hence, the name of the tracepoint function of an event called, for example, ‘cmd_status’ will be ‘trace_cmd_status’.
‘proto’ corresponds to the function parameters and ‘args’ are the arguments with which the probe is called.

So, there are two basic restrictions when defining the probe. First, the probe function and the tracepoint function must share the same prototype, void(*)(proto), and second, the variables used as the probe’s args must be defined in proto’s parameters.

Special macros are used to define a trace event. An example, the cmd_status trace event, follows.

TRACE_EVENT(cmd_status,
	    TP_PROTO(u32 type, u32 status),
	    TP_ARGS(type, status),
	    TP_STRUCT__entry(
		    __field(u32, type)
		    __field(u32, status)
	    ),
	    TP_fast_assign(
		    __entry->type = type;
		    __entry->status = status;
	    ),
	    TP_printk("cmd = %d, status = %d",
		      __entry->type, __entry->status)
);

In simplified form:

TRACE_EVENT(name, proto, args, struct entry, assign, print)
name:         the name of the trace event
proto:        the prototype of the tracepoint and probe functions
args:         the arguments that will be passed when probe is called
struct entry: the definition of the fields of the struct that will
              be recorded into the ring buffer. This struct will
              contain the data we want to trace
assign:       the statements that will be executed to assign values
              to the fields
print:        the format and arguments that will be used to print
              the data stored in the ring buffer entry

The above trace event example will produce something similar to the following, at least as far as I have understood so far:

void trace_cmd_status(u32 type, u32 status)
{
    if (trace_event_cmd_status_is_enabled)
         ((void(*)(u32 type, u32 status))(probe))(type, status);
}

struct trace_entry_cmd_status {
    u32	type;
    u32	status;
};

void probe(u32 type, u32 status)
{
    struct trace_entry_cmd_status *__entry;

    __entry->type = type;
    __entry->status = status;

    /* register __entry into the ring buffer for event cmd_status */
}

void ftrace_printk_cmd_status()
{
    struct trace_entry_cmd_status *__entry;

    /* __entry = read entry from the ring buffer for event cmd_status */

    printk("cmd = %d, status = %d", __entry->type, __entry->status);
}

So, to trace an event in the code, it is required to define the probe function and output format, and to specify the desired callsites by inserting tracepoints. To add a tracepoint for the cmd_status event, simply call trace_cmd_status() with the appropriate arguments.
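The interplay between the wrapper, the probe and the ring buffer can be tried out in user space. The following is a toy model only, not what the TRACE_EVENT() macro really generates: the fixed-size array ring[], the counter ring_head, probe_cmd_status() and print_last_cmd_status() are hypothetical names invented for the illustration, and the ring simply overwrites its oldest entry when full.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 4   /* tiny on purpose, so that overwriting is easy to see */

struct trace_entry_cmd_status {
	uint32_t type;
	uint32_t status;
};

static struct trace_entry_cmd_status ring[RING_SIZE];
static unsigned int ring_head;   /* total number of entries ever written */
static bool trace_event_cmd_status_is_enabled;

/* Probe: records the event data into the ring buffer. */
static void probe_cmd_status(uint32_t type, uint32_t status)
{
	struct trace_entry_cmd_status *entry = &ring[ring_head % RING_SIZE];

	entry->type = type;
	entry->status = status;
	ring_head++;
}

/* Wrapper placed at the call site: calls the probe only when enabled. */
static void trace_cmd_status(uint32_t type, uint32_t status)
{
	if (trace_event_cmd_status_is_enabled)
		probe_cmd_status(type, status);
}

/*
 * Format the newest entry into buf, using the same format string as the
 * TP_printk() of the example. Returns the length written, or -1 if the
 * ring buffer is still empty.
 */
static int print_last_cmd_status(char *buf, size_t len)
{
	const struct trace_entry_cmd_status *entry;

	if (ring_head == 0)
		return -1;
	entry = &ring[(ring_head - 1) % RING_SIZE];
	return snprintf(buf, len, "cmd = %u, status = %u",
			(unsigned int)entry->type,
			(unsigned int)entry->status);
}
```

Calling trace_cmd_status() while the flag is false records nothing, which matches the behaviour of a disabled trace event; once the flag is set, each call adds one entry and the fifth call starts overwriting the oldest one.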

The trace events must be defined in a header file called the trace header. At the beginning of the trace header, named for example xhci-trace.h, add the following lines:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM xhci-hcd

#if !defined(__XHCI_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define __XHCI_TRACE_H

#include  <linux/tracepoint.h>

The first pair of lines define the name for the trace system which is essentially a means to group all the trace events defined for the xhci-hcd.

The second pair of lines define a check where the second condition, defined(TRACE_HEADER_MULTI_READ), permits the trace header to be included multiple times. That is essential because the macros used to define the trace events must be processed multiple times.

At the end of xhci-trace.h, add:

#endif /* __XHCI_TRACE_H */

/* this part must be outside the header guard */

#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .

#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_FILE xhci-trace

#include <trace/define_trace.h>

The first pair of lines, outside the header guard, tells trace/define_trace.h to search for the xhci-trace.h header in the current directory and not in the default one, include/trace/events.

The second pair of lines tells trace/define_trace.h the name of the trace header, which is xhci-trace.h of course. The .h is appended to TRACE_INCLUDE_FILE automatically.

Apart from the trace header, another file is needed to keep the tracing code self-contained. This second file, called for example xhci-trace.c, contains the following lines:

#define CREATE_TRACE_POINTS
#include "xhci-trace.h"

The first line prompts trace/define_trace.h to create the necessary ftrace structures.
The definition of CREATE_TRACE_POINTS must appear in only one file (that is the reason for this file’s existence) and must be followed by the inclusion of the trace header.

Finally, the path of the trace header must be added to the include search path. That can be done by adding the following line to the Makefile:

CFLAGS_xhci-trace.o := -I$(src)

The code that implements the trace events, as already mentioned, is placed in the trace header.

  • ftrace is controlled via debugfs and its tracing directory. Tracing can be managed either directly, via read/write operations on the tracing files, or indirectly, with the help of the trace-cmd tool. trace-cmd can also be installed with a graphical front end called KernelShark.

    To mount the debugfs, do:

    $ mount -t debugfs debugfs /sys/kernel/debug/

    Then, enter the tracing directory:

    $ cd  /sys/kernel/debug/tracing/

    The contents of the tracing directory may vary depending on the tracing config options set when the kernel was compiled. Here follows a list of the files that appear in my tracing directory, with a brief description of their use.

    available_tracers
    	Contains all the tracers supported by the current kernel.
    	For example, the built-in tracers for my kernel are
            blk and nop.
    	The nop tracer simply displays the output of ftrace_printk().
    	The block tracer is a special-purpose tracer that generates
            traces related to the I/O traffic of a given block device.
    	Other tracers can be also configured, for tracing:
    	- all function calls: function, function_graph
    	- the time passed with interrupts or preemption disabled:
    	  irqsoff, preemptoff, preemptirqsoff
    	- the process wakeup latencies: wakeup, wakeup_rt
    	- the stack: stack tracer
    	- the context switches: sched_switch
    	Don't forget CONFIG_DYNAMIC_FTRACE !!

    available_events
    	Contains all the trace events supported by the current
            kernel configuration.
            Each entry in this file has the following format:
    	<trace system name>:<trace event name>

    buffer_size_kb
    	Contains the size in KB of the CPU trace ring buffer.
            If the size of trace output exceeds the buffer size,
            the contents of the 'trace' file will be overwritten.

    buffer_total_size_kb
    	Contains the accumulated size in KB of all CPU ring
            buffers.

    current_tracer
    	Shows the tracer that is currently enabled. To select
            another tracer, choose the desired tracer from the available
            tracers and write its name into this file, for example:
    	$ echo blk > current_tracer
    	To enable just the event tracing feature, choose the
            nop tracer.

    events
    	Classifies events per trace system. Each trace system
            subdirectory contains a directory per trace event and
            the files 'enable' and 'filter'.
    	The file 'enable' can be used to enable or disable all
            the events of the corresponding trace system, by writing
            1 or 0 respectively.
    	For example, to enable all trace events of a trace system, do:
    	$ echo 1 > events/<trace system name>/enable
    	To filter the trace events with respect to the contents of
            their fields, add the appropriate filter expression in the
            file 'filter'.
            Filter expressions have the following format:
    	  <field> <relational op> <value>
    	More complex expressions can be created by adding logical
            operators. To disable the filter, echo 0 into 'filter'.
    	Each trace event subdirectory contains four files:
    	enable, filter, format and id.
    	To enable a specific event of a trace system, do:
    	$ echo 1 > events/<trace system name>/<trace event name>/enable
    	The id and the fields of a trace event appear in files
            'id' and 'format'.

    free_buffer
    	Writing 1 into this file will free the trace ring buffer.
            If the option disable_on_free is set, then writing 1 into
            this file will also disable tracing.

    options
    	Contains the available configuration options for the trace
            output. To enable or disable a specific tracing option,
            write 1 or 0 respectively to the corresponding option file.

    per_cpu
    	Groups the trace output files (trace, trace_pipe and
            trace_pipe_raw) per CPU and provides per CPU stats.

    saved_cmdlines
    	Contains the names and ids of the processes that produced
            the registered events.

    set_event
    	Contains the currently enabled trace events and can be
            used to enable or disable specific tracing events from
            the available events.
    	To enable for example the timer:timer_init event, do:
    	$ echo timer:timer_init >> set_event
    	The contents of this file can be managed easily using
            regular expressions.
            For example, to disable the timer:timer_init event, do:
    	$ echo '!timer:timer_init' >> set_event
    	To enable all the events of timer trace system, do:
    	$ echo 'timer:*' >> set_event
    	And to disable them, do:
    	$ echo '!timer:*' >> set_event

    trace
    	Contains the trace output produced by the currently enabled
            tracer. To display its content, do:
    	$ cat trace
    	The principal fields of each trace entry are:
    	TASK-PID:  it is composed of the process name and id.
    	CPU:       it is the CPU on which the process was running.
    	TIMESTAMP: it represents the time passed since the
                       initialization of tracing and its units
                       depend on the trace_clock mode set (I think).
    	FUNCTION:  it is the trace event name.
    	The trace is the last element and shows up at the end of
            each entry.

    trace_clock
    	Shows the clock used to take the timestamp values of traces.

    trace_marker
    	Is used to introduce markers in the trace output.

    trace_options
    	Lists all the available tracing options for configuring
            the trace output. The disabled options have the prefix 'no'.
    	So, to enable, for example the 'raw' tracing option, do:
    	$ echo raw > trace_options
    	And, to disable it again, do:
    	$ echo noraw > trace_options

    trace_pipe
    	Contains the trace output and is used to pipe the current
            trace output into a command. Once its content is consumed,
            it will block until new tracing output is generated.

    tracing_cpumask
    	Shows a mask representing on which CPUs tracing is currently
            enabled. To enable tracing on specific CPUs, write the
            appropriate mask into this file.

    tracing_on
    	Shows whether the tracing mechanism is currently enabled and
            can be used to enable or disable tracing by writing 1 or 0,
            respectively.

    tracing_thresh
    	Is used in function duration tracing to set and display the
            duration filter value. Function duration tracing traces the
            function entries and exits to evaluate a function's duration.
            The tracing threshold is used for filtering the function
            duration trace by duration.
    	The threshold value is in microseconds and a zero value
            means that no duration threshold is applied in trace output.

    In order to enable early boot tracing for built-in modules,
    use the boot option:

    trace_event=[event-list]

    where event-list is a comma-separated list of events. The events shall have the format:

    <trace system name>:<trace event name>


    Diving into the linux kernel

    In 7 days from now, my internship in the Linux USB3.0 controller driver will
    start and I still don’t know what to say. That just leaves me speechless.

    Not fingerless.

    Circumstances vortex, that used to send me to keep company to the deep-sea
    vampire squids, this summer casts me out in antarctica to meet the “penguins”.

    Well … a hot summer spent among penguins, iceweasels and icedoves will be
    after all an exciting “cool” summer.

    the brainless
