Looking for users of new syscalls

Recently Linux got a new syscall to get extended information about files, a super ‘stat’, if you will; read more about it at LWN.

So I grabbed the headers with the definitions for the statx arguments into tools/include/ so that ‘perf trace’ can use them to beautify the syscall arguments, i.e. to show them as
a bitmap of strings, as described in this cset.

To test it I used one of the things ‘perf trace’ can do that ‘strace’ does not: system-wide stracing. To see if any of the programs running on my machine was using the new syscall, I simply did, using strace-like syntax:

# perf trace -e statx

After a few minutes, nothing… So this Fedora 25 system isn’t using it in any of the utilities I used in those moments, which is not surprising: glibc still needs statx wired up.
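Until glibc grows a wrapper, exercising the new syscall from userspace means going through syscall(2) directly. Here is a minimal python sketch of that using ctypes — just a sketch, assuming an x86_64 box (where the statx syscall number is 332) and with the constants copied from the UAPI headers:

import ctypes, os

libc = ctypes.CDLL(None, use_errno = True)

SYS_statx = 332			# x86_64 only, other arches use other numbers
AT_FDCWD = -100			# resolve relative paths from the cwd
AT_SYMLINK_NOFOLLOW = 0x100
STATX_BASIC_STATS = 0x7ff	# everything stat() already returns
STATX_BTIME = 0x800		# plus the new creation time

buf = ctypes.create_string_buffer(256)	# larger than struct statx
ret = libc.syscall(SYS_statx, AT_FDCWD, b"/etc/passwd",
		   AT_SYMLINK_NOFOLLOW,
		   STATX_BASIC_STATS | STATX_BTIME, buf)
if ret != 0:
	errno = ctypes.get_errno()
	raise OSError(errno, os.strerror(errno))

Running something like that under the system-wide ‘perf trace -e statx’ session should produce a line much like the test-statx ones below.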

So I found out about samples/statx/test-statx.c, and after installing the kernel headers and pointing the compiler to where those files were installed, I restarted that system-wide ‘perf trace’ session and ran the test program. Much better:

# trace -e statx
16612.967 ( 0.028 ms): statx/562 statx(dfd: CWD, filename: /etc/passwd, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffef195d660) = 0
33064.447 ( 0.011 ms): statx/569 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW|STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffc5484c790) = 0
36050.891 ( 0.023 ms): statx/576 statx(dfd: CWD, filename: /etc/motd, flags: SYMLINK_NOFOLLOW, mask: BTIME, buffer: 0x7ffeb18b66e0) = 0
38039.889 ( 0.023 ms): statx/584 statx(dfd: CWD, filename: /home/acme/.bashrc, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff1db0ea90) = 0
^C#

Ah, to get the filenames fetched we need to put in place a special probe that will collect filenames passed to the kernel, right after the kernel copies them from user memory:

[root@jouet ~]# perf probe 'vfs_getname=getname_flags:72 pathname=result->name:string'
Added new event:
  probe:vfs_getname    (on getname_flags:72 with pathname=result->name:string)

You can now use it in all perf tools, such as:

	perf record -e probe:vfs_getname -aR sleep 1

[root@jouet ~]# trace -e open touch /etc/passwd
0.024 ( 0.011 ms): touch/649 open(filename: /etc/ld.so.cache, flags: CLOEXEC) = 3
0.056 ( 0.018 ms): touch/649 open(filename: /lib64/libc.so.6, flags: CLOEXEC) = 3
0.481 ( 0.014 ms): touch/649 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
0.553 ( 0.012 ms): touch/649 open(filename: /etc/passwd, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
[root@jouet ~]#

Make sure you have CONFIG_DEBUG_INFO set in your kernel build, or that the matching debuginfo packages are installed. This needs to be done just once per boot; ‘perf trace’ will find the probe in place and use it.

Lastly, if ‘perf’ is hardlinked to ‘trace’, then the latter works the same as ‘perf trace’.

Getting backtraces from arbitrary places

This needs debuginfo, either from a -debuginfo rpm (or equivalent) or from building the kernel with ‘cc -g’:

[root@jouet ~]# perf probe -L icmp_rcv:52 | head -15

  52  	if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
      		/*
      		 * RFC 1122: 3.2.2.6 An ICMP_ECHO to broadcast MAY be
      		 *  silently ignored (we let user decide with a sysctl).
      		 * RFC 1122: 3.2.2.8 An ICMP_TIMESTAMP MAY be silently
      		 *  discarded if to broadcast/multicast.
      		 */
  59  		if ((icmph->type == ICMP_ECHO ||
  60  		     icmph->type == ICMP_TIMESTAMP) &&
      		    net->ipv4.sysctl_icmp_echo_ignore_broadcasts) {
      			goto error;
      		}
      		if (icmph->type != ICMP_ECHO &&
      		    icmph->type != ICMP_TIMESTAMP &&
[root@jouet ~]# perf probe icmp_rcv:59
Added new event:
  probe:icmp_rcv       (on icmp_rcv:59)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv -aR sleep 1

[root@jouet ~]# perf trace --no-syscalls --event probe:icmp_rcv/max-stack=5/
     0.000 probe:icmp_rcv:(ffffffffb47b7f9b)
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
  1025.876 probe:icmp_rcv:(ffffffffb47b7f9b)
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
^C[root@jouet ~]#

Hmm, lots of redundant info; I guess we could do away with those ([kernel.kallsyms]) markers in all the callchain lines…

Friggin mystery…

I was working on adding a thin python binding for the evsel and evlist abstractions I introduced in the perf tools when I stumbled on this, what peterz calls the “friggin mystery mmap events”…

Let’s start with the simple python script that started this:

import perf

def main():
	cpus = perf.cpu_map()		# all online CPUs
	threads = perf.thread_map()	# all threads, i.e. system wide
	evsel = perf.evsel(task = 1, comm = 1,
			   wakeup_events = 1, sample_period = 1,
			   sample_id_all = 1,
			   sample_type = perf.SAMPLE_PERIOD | perf.SAMPLE_CPU | perf.SAMPLE_TID)
	evsel.open(cpus = cpus, threads = threads)
	evlist = perf.evlist()
	evlist.add(evsel)
	evlist.mmap(cpus = cpus, threads = threads)	# set up the ring buffers
	while True:
		evlist.poll(timeout = -1)		# sleep until events arrive
		for cpu in cpus:
			event = evlist.read_on_cpu(cpu)
			if not event:
				continue
			print "cpu: %2d, pid: %4d, tid: %4d" % (event.sample_cpu,
								event.sample_pid,
								event.sample_tid),
			print event

if __name__ == '__main__':
    main()

So what does this script do? It asks the kernel perf infrastructure to generate events every time a thread is created and every time one exits.

Running it:

[root@emilia linux]# tools/perf/python/twatch.py
cpu:  0, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0, length: 0x694ae632, offset: 0, filename:  }
cpu:  3, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x3, length: 0xc804d5ca, offset: 0, filename:  }
cpu:  6, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x6, length: 0xbba638e9, offset: 0, filename:  }
cpu:  1, pid: 1893, tid: 1893 { type: mmap, pid: 1893, tid: 1893, start: 0x1, length: 0x7899c915, offset: 0, filename:  }
cpu:  2, pid:   43, tid:   43 { type: mmap, pid: 43, tid: 43, start: 0x2, length: 0xca5a622e, offset: 0, filename:  }
cpu:  5, pid: 1523, tid: 1525 { type: mmap, pid: 1523, tid: 1525, start: 0x5, length: 0x9a4cd05f, offset: 0, filename:  }
cpu:  4, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x4, length: 0x7a4d4989, offset: 0, filename:  }

cpu:  2, pid: 1543, tid: 1543 { type: fork, pid: 1970, ppid: 1543, tid: 1970, ptid: 1543, time: 515738951270}
cpu:  6, pid: 1970, tid: 1970 { type: comm, pid: 1970, tid: 1970, comm: sshd }
cpu:  2, pid:   60, tid:   60 { type: fork, pid: 1971, ppid: 60, tid: 1971, ptid: 60, time: 515743769565}
cpu:  3, pid: 1971, tid: 1971 { type: fork, pid: 1972, ppid: 1971, tid: 1972, ptid: 1971, time: 515743810636}
cpu:  7, pid: 1972, tid: 1972 { type: comm, pid: 1972, tid: 1972, comm: modprobe }
cpu:  7, pid: 1972, tid: 1972 { type: exit, pid: 1972, ppid: 1972, tid: 1972, ptid: 1972, time: 515745080846}
cpu:  1, pid: 1971, tid: 1971 { type: exit, pid: 1971, ppid: 1971, tid: 1971, ptid: 1971, time: 515746027272}
cpu:  7, pid: 1972, tid: 1972 { type: exit, pid: 1972, ppid: 1972, tid: 1972, ptid: 1972, time: 515745086164}
cpu:  1, pid: 1971, tid: 1971 { type: exit, pid: 1971, ppid: 1971, tid: 1971, ptid: 1971, time: 515746034982}
cpu:  7, pid: 1970, tid: 1970 { type: fork, pid: 1973, ppid: 1970, tid: 1973, ptid: 1970, time: 515747725784}
cpu:  0, pid: 1278, tid: 1281 { type: fork, pid: 1278, ppid: 1278, tid: 1974, ptid: 1281, time: 515803257141}
cpu:  3, pid: 1278, tid: 1974 { type: comm, pid: 1278, tid: 1974, comm: rs:main Q:Reg }
cpu:  2, pid: 1973, tid: 1973 { type: exit, pid: 1973, ppid: 1973, tid: 1973, ptid: 1973, time: 515803913973}
cpu:  0, pid: 1970, tid: 1970 { type: fork, pid: 1975, ppid: 1970, tid: 1975, ptid: 1970, time: 515804961019}
cpu:  2, pid: 1973, tid: 1973 { type: exit, pid: 1973, ppid: 1973, tid: 1973, ptid: 1973, time: 515803917806}
cpu:  3, pid: 1975, tid: 1975 { type: fork, pid: 1976, ppid: 1975, tid: 1976, ptid: 1975, time: 515808031062}
cpu:  7, pid: 1976, tid: 1976 { type: comm, pid: 1976, tid: 1976, comm: bash }

Everything seems nice except for those first 8 mmap events, one per CPU, that I didn’t ask for in the evsel (event selector) constructor, here:

	evsel = perf.evsel(task = 1, comm = 1,
			   wakeup_events = 1, sample_period = 1,
			   sample_id_all = 1,
			   sample_type = perf.SAMPLE_PERIOD | perf.SAMPLE_CPU | perf.SAMPLE_TID)

See ‘task = 1’? That asks for fork and exit events to be generated in the ring buffer; ‘comm = 1’ asks for COMM events (the short, 16 bytes max, process name). But I didn’t specify ‘mmap = 1’, which would ask for events to be generated when executable mmaps take place…
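For contrast, had I actually wanted those events, the change would be a one-liner — a sketch, assuming the binding accepts the ‘mmap’ keyword with the same spelling as the other perf_event_attr bits it already takes:

	evsel = perf.evsel(task = 1, comm = 1, mmap = 1,	# mmap = 1 is the only change
			   wakeup_events = 1, sample_period = 1,
			   sample_id_all = 1,
			   sample_type = perf.SAMPLE_PERIOD | perf.SAMPLE_CPU | perf.SAMPLE_TID)

But the script above does not do that, and yet the mmap events show up.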

Since I know that those events are stashed in the ring buffer by the perf_event_mmap_output routine, I thought about using ‘perf probe’ to figure out what was causing them, so I did:

[root@emilia linux]# perf probe mmap_out=perf_event_mmap_output
Added new event:
  probe:mmap_out       (on perf_event_mmap_output)

You can now use it in all perf tools, such as:

	perf record -e probe:mmap_out -aR sleep 1

[root@emilia linux]#

That created a new event, one that will fire every time perf_event_mmap_output is called. To reduce typing, I aliased it to ‘mmap_out’.

Now let’s use it, asking for callchains to be taken every time this event happens:

[root@emilia linux]# perf record -e probe:mmap_out -g tools/perf/python/twatch.py 
cpu:  7, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x7, length: 0xc6b8162f, offset: 0, filename:  }
cpu:  2, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x2, length: 0xdf243434, offset: 0, filename:  }
cpu:  3, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x3, length: 0xf9f2d7ab, offset: 0, filename:  }
cpu:  5, pid: 1893, tid: 1893 { type: mmap, pid: 1893, tid: 1893, start: 0x5, length: 0xa00963c1, offset: 0, filename:  }
cpu:  1, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x1, length: 0x95744474, offset: 0, filename:  }
cpu:  0, pid: 1893, tid: 1893 { type: mmap, pid: 1893, tid: 1893, start: 0, length: 0xa0655031, offset: 0, filename:  }
cpu:  4, pid:    0, tid:    0 { type: mmap, pid: 0, tid: 0, start: 0x4, length: 0xb2b21883, offset: 0, filename:  }
^CTraceback (most recent call last):
  File "tools/perf/python/twatch.py", line 41, in 
    main()
[ perf record: Woken up 1 times to write data ]
  File "tools/perf/python/twatch.py", line 30, in main
    evlist.poll(timeout = -1)
[ perf record: Captured and wrote 0.012 MB perf.data (~545 samples) ]
KeyboardInterrupt
[root@emilia linux]#

And there they are. Let’s see what caused them:

[root@emilia linux]# perf report --stdio
# Events: 30 
#
# Overhead    Command      Shared Object                  Symbol
# ........  .........  .................  ......................
#
   100.00%  twatch.py  [kernel.kallsyms]  [k] perf_event_mmap_output
            |
            --- perf_event_mmap_output
                perf_event_mmap
               |          
               |--96.67%-- mmap_region
               |          do_mmap_pgoff
               |          |          
               |          |--93.10%-- sys_mmap_pgoff
               |          |          sys_mmap
               |          |          system_call
               |          |          __mmap
               |          |          |          
               |          |           --100.00%-- _dl_map_object
               |          |          
               |           --6.90%-- elf_map
               |                     load_elf_binary
               |                     search_binary_handler
               |                     load_script
               |                     search_binary_handler
               |                     do_execve
               |                     sys_execve
               |                     stub_execve
               |                     0x3d068a6dd7
               |          
                --3.33%-- install_special_mapping
                          arch_setup_additional_pages
                          load_elf_binary
                          search_binary_handler
                          load_script
                          search_binary_handler
                          do_execve
                          sys_execve
                          stub_execve
                          0x3d068a6dd7

#
# (For a higher level overview, try: perf report --sort comm,dso)
#
[root@emilia linux]# 

Ok, so it is the loading of ELF files: the python interpreter being exec’ed and the libraries the dynamic loader maps for it…

The investigation is still underway, but using the perf tools in this way seemed so cool that I thought I’d share it here 🙂