Category Archives: Kernel Hacking

Article about ‘perf annotate’

Just found out about Ravi’s article about ‘perf annotate’, concise yet covers most features, including cross-annotation, go read it!

Looking for users of new syscalls

Recently Linux got a new syscall to get extended information about files, a super ‘stat’, if you will, read more about it at LWN.

So I grabbed the headers with the definitions for the statx arguments to tools/include/ so that ‘perf trace’ can use them to beautify, i.e. to appear as
a bitmap of strings, as described in this cset.

To test it I used one of things ‘perf trace’ can do and that ‘strace’ does not: system wide stracing. To look if any of the programs running on my machine was using the new syscall I simply did, using strace-like syntax:

# perf trace -e statx

After a few minutes, nothing… So this fedora 25 system isn’t using it in any of the utilities I used in these moments, not surprising, glibc still needs wiring statx up.

So I found out about samples/statx/test-statx.c, and after installing the kernel headers and pointing the compiler to where those files were installed, I restarted that system wide ‘perf trace’ session and ran the test program, much better:

# trace -e statx
16612.967 ( 0.028 ms): statx/562 statx(dfd: CWD, filename: /etc/passwd, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffef195d660) = 0
33064.447 ( 0.011 ms): statx/569 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW|STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffc5484c790) = 0
36050.891 ( 0.023 ms): statx/576 statx(dfd: CWD, filename: /etc/motd, flags: SYMLINK_NOFOLLOW, mask: BTIME, buffer: 0x7ffeb18b66e0) = 0
38039.889 ( 0.023 ms): statx/584 statx(dfd: CWD, filename: /home/acme/.bashrc, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff1db0ea90) = 0
^C#

Ah, to get filenames fetched we need to put in place a special probe, that will collect filenames passed to the kernel right after the kernel copies it from user memory:

[root@jouet ~]# perf probe 'vfs_getname=getname_flags:72 pathname=result->name:string'
Added new event:
probe:vfs_getname    (on getname_flags:72 with pathname=result->name:string)

You can now use it in all perf tools, such as:

perf record -e probe:vfs_getname -aR sleep 1

[root@jouet ~]# trace -e open touch /etc/passwd
0.024 ( 0.011 ms): touch/649 open(filename: /etc/ld.so.cache, flags: CLOEXEC) = 3
0.056 ( 0.018 ms): touch/649 open(filename: /lib64/libc.so.6, flags: CLOEXEC) = 3
0.481 ( 0.014 ms): touch/649 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
0.553 ( 0.012 ms): touch/6649 open(filename: /etc/passwd, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
[root@jouet ~]#

Make sure you have CONFIG_DEBUG_INFO set in your kernel build or that the matching debuginfo packages are installed. This needs to be done just once per boot, ‘perf trace’ will find it in place and use it.

Lastly, if ‘perf’ is hardlinked to ‘trace’, then the later will be the same as ‘perf trace’.

Getting backtraces from arbitrary places

Needs debuginfo, either in a package-debuginfo rpm or equivalent or by building with ‘cc -g’:

[root@jouet ~]# perf probe -L icmp_rcv:52 | head -15

  52  	if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
      		/*
      		 * RFC 1122: 3.2.2.6 An ICMP_ECHO to broadcast MAY be
      		 *  silently ignored (we let user decide with a sysctl).
      		 * RFC 1122: 3.2.2.8 An ICMP_TIMESTAMP MAY be silently
      		 *  discarded if to broadcast/multicast.
      		 */
  59  		if ((icmph->type == ICMP_ECHO ||
  60  		     icmph->type == ICMP_TIMESTAMP) &&
      		    net->ipv4.sysctl_icmp_echo_ignore_broadcasts) {
      			goto error;
      		}
      		if (icmph->type != ICMP_ECHO &&
      		    icmph->type != ICMP_TIMESTAMP &&
[root@jouet ~]# perf probe icmp_rcv:59
Added new event:
  probe:icmp_rcv       (on icmp_rcv:59)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv -aR sleep 1

[root@jouet ~]# perf trace --no-syscalls --event probe:icmp_rcv/max-stack=5/
     0.000 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
  1025.876 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
^C[root@jouet ~]#

Humm, lots of redundant info, guess we could do away with those ([kernel.kallsyms) in all the callchain lines…

dwarves on the spotlight

Recently I saw somebody boasting that Linus now uses Fedora and I don’t know why I found it silly. But then when Ingo Molnar even mentioned the name of the package where codiff is available I found that I can be silly too 😉

tuna & the oscilloscope

Since I haven’t blogged about these toys, please take a look at Carsten’s article at OSADL and perhaps at my OLS 2008 paper too.

ait and tuna

Work is supposed to take most of your time, right? Survival, mouths to feed, all that stuff… But it can be fun, even if not _directly_ kernel related. Sure, there is life outside the kernel, hey, “kernel”… I keep listening to the head-honchos (heck, I had to use that term, it looks like spanish, the language of our capital, Buenos Aires!): “forget about the kernel, the action is somewhere else, in (l)userland!”

So here I am, in userland, playing with GUI stuff, drag and drop! Python! GTK! Wow! Sounds boring? No, I don’t think so. Discovery time is never boring. Its fun to try for hours to grok some new semantic domain. Even if you, in another life, wrote a DOS GUI system out of reading a marketing insertion on Unix Review in 1988 😛

So what have I been doing in this strange land? Well, when you work you have to show the numbers, and explain them, and remember what happened when you switched that knob or applied that patch, no?

Damn, alchemy doesn’t requires you remember all that stuff, just taste the new stuff, if you think its not poisonous, that is.

But if you need to remember… There is something cooking for this year OLS, or to the first conference after it if it thinks I’m too bollocksish 🙂

And to _show you the code_, full of python newbie mistakes… but hey, I even dared to write python bindings for such supposedly interesting stuff as ethtool and schedutils.

And finally to what uses it: AIT, because short, meaningless names are en vogue. Anyway, try tuna, a tuning application that has a cool name, one that I unfortunately didn’t came up with but fortunately was near the genius that though about it, thanks!

And thanks to whoever did the right thing and brought Evgeniy to kernelplanet, he is da guy from Russia! Get healthy and in crazy coding frenzy mode again!

libdwfl conversion

Finally I found time to convert dwarves to use the more active libdwfl library found in elfutils, thanks to Ulrich Drepper and Roland McGrath we can forget about most of the low level details related to RELA architectures and get the dwarves to work with kernel modules on architectures such as x86_64, where we need to do relocation work on the .debug_info section, now just enjoy pahole & friends working on more targets.

There is still some work to do on two of the dwarves that handle more than one target, codiff and ctracer, where dwfl_default_argp is not enough as it only handles one side of the game needed for these two dwarves, we need to open another set to compare in the case of codiff and another set for resolving the kprobes functions in the case of ctracer.

News fom dwarves land

Pahole:

Now we can also move bitfields around to combine it with other fields, moving a list of members with the same offset (the definition of a bitfield) to after other members that have holes bigger than the size of the (possibly type demoted) list of members comprising a bitfield, also there is a new pahole command line option to show all the steps involved in reorganizing a struct, –show_reorg_steps, look here for how it looks like.

Also related to this pahole news is the fact that the class__reorganize() & supporting code was moved to libdwarves.so, where its is made useful for the other dwarves, which leads us to the next dwarves news:

Ctracer:

The class tracer now is past its stone age, getting fast to an exciting iron age: printk was replaced by a not so carefully (as in SMP?!?! I only have a dual pentium 100 machine here, damn!) stolen from blktrace (I hope that the UTT effort is merged sooner than later so that I can use it), namely using debugfs + relay to, ho hum, relay the internal state of the class (a funny name for structs) being traced to userspace, ah, since I’m talking about the internal state: the way that it is collected at the probe points now is much, much improved, looking like blktrace, except for the fact that the data being collected is the subset of the members that are “reducible” to a basic type (signed and unsigned int, long, char, long long), which, for now, its just the basic types, but will be shortly augmented by a set of “reducers” for things like spinlock_t, wait_queue_head_t, etc.

Ok, lets break the paragraph a bit to get away from my problem with writing: very long paragraphs…

Back to how ctracer packs the internal state: have you seen how pahole can reorganize structs to fill holes and reduce the structure size? Yeah, we just look at all the struct members, looking for DW_TAG_basic_type members and leave just those, then call class__reorganize() and the struct gets in the best possible layout to save space at each probe point.

Ctracer also uses the same opportunity to generate a tool called ctracer2ostra.c, that is used to convert the binary traces to a colon separated list of formatted values easily parsable by ostra-cg, the callgraph tool that produces nifty buzzword soups (CSS, javascript, ajax in the future, cross-referencing with LXR, python matplotlib graphs of values that each traced class member got in the course of tracing, and plottings for the times each of the “class” “methods” took).

The new improved process for this is detailed in the README.ctracer file, but to give you a quick glance of what it involves:

rpm -ivh http://oops.ghostprotocols.net:81/acme/dwarves/rpm/libdwarves1-0-14.i386.rpm http://oops.ghostprotocols.net:81/acme/dwarves/rpm/dwarves-0-14.i386.rpm
mkdir foo
cd foo
ln -s /usr/lib/ctracer/* .
make CLASS=sock
insmod ctracer.ko
# do some networking kung foo fighting activity or change CLASS to your preferred struct (but beware, as we use do_gettimeofday (how lame) you can get some crashes if tracing something more fishy than struct sock methods)
cat /sys/kernel/debug/ctracer.o > /tmp/ctracer.log
rmmod ctracer.ko
make callgraphs
# Enjoy things like this, where I was bold enough to try this thing on my main machine while doing all sorts of stuff, believe me, it was just one or two dozen harmless oopses, kprobes trying to justify its paycheck!

And to top it all I’ve been getting help from Davi Arnaut on the global variables and DW_AT_location front, were he contributed basic support for where variables are in the computer memory pyramid (Registers? Stack?), he even wrote the first non-acme dwarf, pglobal, that shows all the global variables in your dear project binary files, now its just a matter of extending this to use libebl and get the register names for each arch to go to the next ctracer level, where we’ll stop using costly Jay-probes and get the elusive parameters directly from their hiding place!

Kill-a-hole: Reorganizing struct layouts

Say a project has this struct:

struct cheese {
char name[52];
char a;
int b;
char c;
int d;
short e;
};

And we want to see how the layout looks like more precisely:

[acme@filo examples]$ pahole swiss_cheese cheese
/* <11b> /home/acme/git/pahole/examples/swiss_cheese.c:3 */
struct cheese {
char name[52]; /* 0 52 */
char a; /* 52 1 */

/* XXX 3 bytes hole, try to pack */

int b; /* 56 4 */
char c; /* 60 1 */

/* XXX 3 bytes hole, try to pack */

/* — cacheline 1 boundary (64 bytes) — */
int d; /* 64 4 */
short int e; /* 68 2 */
}; /* size: 72, cachelines: 2 */
/* sum members: 64, holes: 2, sum holes: 6 */
/* padding: 2 */
/* last cacheline: 8 bytes */
[acme@filo examples]$

Heck, what a swiss cheese! Surely we can do better, huh? Lets ask pahole for a little help:

[acme@filo examples]$ pahole -kV swiss_cheese cheese
/* moving c(size=1) to after a(offset=52, size=1, hole=3) */
/* moving e(size=2) to after c(offset=53, size=1, hole=2) */

/* <11b> /home/acme/git/pahole/examples/swiss_cheese.c:3 */
struct cheese {
char name[52]; /* 0 52 */
char a; /* 52 1 */
char c; /* 53 1 */
short int e; /* 54 2 */
int b; /* 56 4 */
int d; /* 60 4 */
/* — cacheline 1 boundary (64 bytes) — */
}; /* size: 64, cachelines: 1 */
/* saved 8 bytes and 1 cacheline! */
[acme@filo examples]$

Much better now, no?

Ok, lets try something more interesting, like some Linux kernel structs, the output is big, so here are some links for some structs that spent some jiffies on pahole’s spa:

Type demotion of bigger than needed bitfields will help getting more saved 🙂

ctracer made easy

Make sure you have the kernel-debuginfo package installed!

rpm -ivh http://oops.ghostprotocols.net:81/acme/dwarves/rpm/libdwarves1-0-9.i386.rpm http://oops.ghostprotocols.net:81/acme/dwarves/rpm/dwarves-0-9.i386.rpm
mkdir foo
cd foo
ln -s /usr/lib/ctracer/Makefile .
ln -s /usr/lib/ctracer/ctracer_jprobe.c .
make CLASS=sock
insmod ctracer.ko
# do some networking activity or just move the mouse 🙂
dmesg # or tail /var/log/messages
rmmod ctracer

Now to work on kahole, kill a hole, a tool that will reorganize structs to kill holes and that will provide the bits needed for struct “views”, i.e. to specify a list of fields in a struct that are of interest for collection at each probe point and that will also be used to create a userspace utility that will read the relay channel (/sys/kernel/ctracer0) and generate what ostra-cg, the callgraph tool from the OSTRA days needs to generate things such as this session collecting struct task_struct methods calls in the Linux kernel, ah, ostra-cg also produces this methods statistics and from this page you get plottings such as this one for sched_
fork calls
, workload is forgotten right now, but as soon as I get my new test machine we’ll have more interesting graphics 🙂