Using ftrace

laumy
Performance tools
2024-08-27

tracer

irqsoff

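While interrupts are disabled, the CPU cannot respond to interrupts (NMI and SMI excepted) and cannot react to external events. Timer interrupts and mouse interrupts, for example, are held off, which shows up as system latency.
The irqsoff tracer records how long interrupts stay disabled. When a new maximum latency is reached, the tracer saves the trace leading up to that latency point. Each time a new maximum is reached, the previously saved trace is discarded and the new one is kept. To reset the maximum, write 0 to tracing_max_latency.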

# echo 0 > options/function-trace
# echo irqsoff > current_tracer
# echo 1 > tracing_on
# echo 0 > tracing_max_latency
# ls -ltr
[...]
# echo 0 > tracing_on
# cat trace

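In the example shown in the figure above, the maximum latency is 3603 us. Interrupts were disabled in default_idle_call and re-enabled in __do_softirq. The lines to look at are "=> started at: default_idle_call" and "=> ended at: __do_softirq", which name the function where interrupts were turned off and the function where they were turned back on.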
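
The three values discussed above can be pulled out of a saved irqsoff trace with a few lines of awk. This is only a minimal sketch: the function name irqsoff_summary is made up for illustration, and it assumes the header uses the "# latency: N us" and "=> started at:" / "=> ended at:" lines described above.

```shell
# Hypothetical helper: summarize the header of an irqsoff trace.
# Reads the file given as $1, or standard input when no file is given.
irqsoff_summary() {
    awk '
        /^# latency:/    { print "max_latency_us=" $3 }   # "# latency: 3603 us, ..."
        /=> started at:/ { print "started_at=" $NF }      # function that disabled irqs
        /=> ended at:/   { print "ended_at=" $NF }        # function that re-enabled irqs
    ' ${1:+"$1"}
}
```

Typical use would be "cat trace | irqsoff_summary" from inside the tracing directory.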

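In the example above, function-trace was turned off, so the functions executed while this tracer was measuring were not traced. With function-trace enabled (echo 1 > options/function-trace) the output becomes much longer, because every function executed during that window is printed.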

To print the trace in function-graph form, also run echo 1 > options/display-graph.

Sometimes cat trace comes back empty. The tracing threshold may be set too high. Try a smaller value.

echo 5 > tracing_thresh        # set the threshold to 5 μs

function

function is the function tracer. It is started from the debug filesystem with echo function > current_tracer.

# echo function > current_tracer
# echo 1 > tracing_on
# usleep 1
# echo 0 > tracing_on
# cat trace

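Note that the function tracer stores its data in a ring buffer, so the newest data can overwrite the oldest. Stopping the tracer with echo is sometimes not enough, because by the time the command runs the trace may already have overwritten the data you wanted to capture. It is therefore better to disable tracing directly from the program, which lets you stop the trace at exactly the point you are interested in. To disable tracing from a C program, code like the following can be used: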

int trace_fd;
[...]

int main(int argc, char *argv[]) {
    [...]
    trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
    [...]
    if (condition_hit()) {
        write(trace_fd, "0", 1);
    }
    [...]
}
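
When the program under test cannot be modified, a coarser option is to keep the recording window as narrow as possible from the shell. The sketch below is an illustration only: trace_window is a made-up name, and the tracing directory is passed in as the first argument rather than assumed.

```shell
# Hypothetical helper: record only while one command runs.
# usage: trace_window TRACING_DIR command [args...]
trace_window() {
    tw_dir=$1
    shift
    echo 1 > "$tw_dir/tracing_on" || return 1   # start recording
    "$@"                                        # run the command of interest
    tw_status=$?
    echo 0 > "$tw_dir/tracing_on"               # stop right after it returns
    return "$tw_status"
}
```

For example, trace_window /sys/kernel/debug/tracing usleep 1. This is still less precise than writing to tracing_on from inside the program, because the shell adds its own delay before tracing stops.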

Tracing a single thread:

# cat set_ftrace_pid
  no pid
# echo 3111 > set_ftrace_pid
# cat set_ftrace_pid
  3111
# echo function > current_tracer
# cat trace | head

To trace a program from the moment it starts running, the following example program can be used.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>

    #define _STR(x) #x
    #define STR(x) _STR(x)
    #define MAX_PATH 256

    const char *find_tracefs(void)
    {
           static char tracefs[MAX_PATH+1];
           static int tracefs_found;
           char type[100];
           FILE *fp;

           if (tracefs_found)
               return tracefs;

           if ((fp = fopen("/proc/mounts","r")) == NULL) {
               perror("/proc/mounts");
               return NULL;
           }

           while (fscanf(fp, "%*s %"
                     STR(MAX_PATH)
                     "s %99s %*s %*d %*d\n",
                     tracefs, type) == 2) {
               if (strcmp(type, "tracefs") == 0)
                       break;
           }
           fclose(fp);

           if (strcmp(type, "tracefs") != 0) {
               fprintf(stderr, "tracefs not mounted");
               return NULL;
           }

           /* a tracefs mount point is the tracing directory itself, so no "/tracing/" suffix is appended */
           tracefs_found = 1;

           return tracefs;
    }

    const char *tracing_file(const char *file_name)
    {
           static char trace_file[MAX_PATH+1];
           snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
           return trace_file;
    }

    int main (int argc, char **argv)
    {
        if (argc < 2)   /* argv[1] is the command to exec */
                exit(-1);

        if (fork() > 0) {
                int fd, ffd;
                char line[64];
                int s;

                ffd = open(tracing_file("current_tracer"), O_WRONLY);
                if (ffd < 0)
                        exit(-1);
                write(ffd, "nop", 3);

                fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
                s = sprintf(line, "%d\n", getpid());
                write(fd, line, s);

                write(ffd, "function", 8);

                close(fd);
                close(ffd);

                execvp(argv[1], argv+1);
        }

        return 0;
    }

This can also be done with a simple script:

  #!/bin/bash

  tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
  echo 0 > $tracefs/tracing_on
  echo $$ > $tracefs/set_ftrace_pid
  echo function > $tracefs/current_tracer
  echo 1 > $tracefs/tracing_on
  exec "$@"

function graph tracer

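The function graph tracer is similar to the function tracer, except that it probes a function on both entry and exit. This is done with a dynamically allocated stack of return addresses in each task_struct. On function entry the tracer overwrites the return address of each traced function to install its own probe, so the original return address is kept on the return-address stack in the task_struct. Probing both ends of a function enables special features such as measuring a function's execution time and having a reliable call stack from which to draw the call graph. This tracer is useful in several situations:

- Finding the cause of strange kernel behavior and seeing in detail what happens in a given area.
- Tracking down odd latencies whose source is hard to find.
- Quickly finding the call path that leads to a particular function.
- Peeking into a running kernel to see what is going on.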

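Several columns can be enabled and disabled dynamically.
- cpu number: the number of the CPU on which the function executed, shown by default. It is sometimes better to trace a single CPU only (see tracing_cpumask), otherwise the function calls can appear out of order when the trace switches between CPUs. To hide it: echo nofuncgraph-cpu > trace_options.
- duration: the execution time of the function, shown on the line of the closing brace, or on the same line as the function itself if it is a leaf function. To turn it off: echo nofuncgraph-duration > trace_options.
- If the start of a function is not in the trace buffer, the function name is shown after its closing brace. To show the name after the closing brace for the other functions as well: echo funcgraph-tail > trace_options.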
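
With the duration column enabled, a saved function_graph trace can be filtered for slow calls. The following is a rough sketch under some assumptions: slow_functions is a hypothetical name, each line is assumed to look like "CPU) [marker] DURATION us | function", and a closing-brace line is reported as "}" unless funcgraph-tail is enabled.

```shell
# Hypothetical helper: print calls whose duration is at least MIN_US.
# usage: slow_functions MIN_US [TRACE_FILE]   (reads stdin without a file)
slow_functions() {
    awk -v min="$1" '
        /^#/ { next }                             # skip header lines
        {
            n = split($0, part, "|")              # left: cpu/duration, right: call
            if (n < 2) next
            m = split(part[1], f, " ")
            for (i = 2; i <= m; i++) {
                if (f[i] == "us" && f[i - 1] + 0 >= min + 0) {
                    text = part[2]
                    sub(/^[ \t]+/, "", text)      # drop call-depth indentation
                    print f[i - 1], text
                }
            }
        }
    ' ${2:+"$2"}
}
```

For example, cat trace | slow_functions 10 lists everything that took 10 us or longer.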

dynamic ftrace

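When CONFIG_DYNAMIC_FTRACE is enabled, the system runs with almost no overhead while function tracing is disabled. It works because gcc's -pg option automatically inserts a call to mcount at the start of every kernel function (this is architecture-dependent: starting with gcc 4.6, -mfentry was added for x86, which calls "__fentry__" instead of "mcount").
At compile time, every C file object is run through the recordmcount program (located in the scripts directory), which parses the ELF headers of the object and finds every location in the .text section that calls mcount.
It then creates a section named "__mcount_loc" that holds references to all of the mcount call sites in .text. At link time, the __mcount_loc sections of all objects are merged into a single __mcount_loc table.
NOTE: not every section is traced. Functions can be excluded with notrace or by other means, and inlined functions are not traced. cat available_filter_functions shows which functions can be traced.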

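The process is shown in the figure above. At boot, before SMP is initialized, the dynamic ftrace code scans this table and patches every location into a nop instruction. It also records the locations, which are added to the available_filter_functions list. Modules are processed as they are loaded and before they are executed. When a module is unloaded, its functions are removed from the ftrace function list as well.
Once dynamic tracing is turned on, how a call site is modified depends on the architecture. The usual method is to place a breakpoint at the location to be modified and sync all CPUs, then modify the rest of the instruction, sync all CPUs again, and finally remove the breakpoint. Because call sites are patched dynamically, tracing can be limited to selected functions. The call sites of all other functions execute a nop, so performance is barely affected.
The kernel provides two files for enabling and disabling tracing of specific functions: set_ftrace_filter and set_ftrace_notrace. available_filter_functions lists the functions that can be traced.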

# echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on
# usleep 1
# echo 0 > tracing_on
# cat trace
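
Function names differ between kernel versions and architectures, so it can help to check each name against available_filter_functions before writing it to the filter. A minimal sketch, assuming the tracing directory is passed as the first argument (set_filter_checked is a made-up name):

```shell
# Hypothetical helper: write only names listed in available_filter_functions.
# usage: set_filter_checked TRACING_DIR func [func...]
set_filter_checked() {
    sf_dir=$1
    shift
    : > "$sf_dir/set_ftrace_filter" || return 1   # ">" clears the old filter
    for sf_fn in "$@"; do
        # Lines may carry a " [module]" suffix, so compare the first field only.
        if awk -v fn="$sf_fn" '$1 == fn { found = 1 } END { exit !found }' \
               "$sf_dir/available_filter_functions"; then
            echo "$sf_fn" >> "$sf_dir/set_ftrace_filter"   # ">>" appends
        else
            echo "skipping $sf_fn: not in available_filter_functions" >&2
        fi
    done
}
```

For example, set_filter_checked /sys/kernel/debug/tracing sys_nanosleep hrtimer_interrupt.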

set_ftrace_filter accepts wildcard patterns, for example:

<match>*: functions that begin with <match>
*<match>: functions that end with <match>
*<match>*: functions that contain <match>
<match1>*<match2>: functions that begin with <match1> and end with <match2>
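
Before writing a pattern to set_ftrace_filter, its effect can be previewed against a function list. The sketch below is an approximation: it uses the shell's own case pattern matching, which is assumed to behave like the four forms listed above, and preview_filter is a hypothetical name.

```shell
# Hypothetical helper: list the functions a wildcard pattern would select.
# usage: preview_filter 'PATTERN' FUNCTION_LIST_FILE
preview_filter() {
    pf_pattern=$1
    # Each line is "name" or "name [module]"; only the name is matched.
    while read -r pf_name pf_rest; do
        case $pf_name in
            $pf_pattern) echo "$pf_name" ;;
        esac
    done < "$2"
}
```

For example, preview_filter 'hrtimer_*' available_filter_functions. The pattern must be quoted so the shell does not expand it against file names first.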

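The set_ftrace_filter interface also supports filter commands, in the format <function>:<command>:<parameter>

- mod: enables function filtering per module, for example to select only the write* functions in the ext3 module
      echo 'write*:mod:ext3' > set_ftrace_filter
- traceon/traceoff: turns tracing on or off when the given function is hit. The parameter sets how many times tracing is turned on or off. If it is not given, there is no limit. For example, to turn tracing off the first 5 times a schedule bug is hit
      echo '__schedule_bug:traceoff:5' > set_ftrace_filter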

dynamic ftrace with function graph tracer

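The function tracer and the function graph tracer were both explained above, but some special features are available only in the function graph tracer. To trace one function and all of its children, just write its name to set_graph_function.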

echo function_graph > current_tracer
echo __do_fault > set_graph_function
echo 1 > tracing_on
...
echo 0 > tracing_on

other

ftrace has a master switch, /proc/sys/kernel/ftrace_enabled. Writing 0 or 1 to it disables or enables function tracing. It is enabled by default. For more detail see Documentation/trace/ftrace.rst.

Using events

sched_switch

sched_switch is a static tracepoint event. An example follows.

echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# enable sched_switch

echo 'prev_pid == 1162 || next_pid == 1244' > /sys/kernel/debug/tracing/events/sched/sched_switch/filter
# set a filter on the sched_switch event so that it is recorded only when
# the task being switched out is PID 1162 or the task being switched in
# is PID 1244; without a filter there is far too much output

echo "" > /sys/kernel/debug/tracing/trace
# clear the trace buffer

echo 1 > /sys/kernel/debug/tracing/tracing_on
# start tracing

cat /sys/kernel/debug/tracing/trace
# view the result

The output looks like this:

# tracer: nop
#
# nop latency trace v1.1.5 on 5.15.147
# --------------------------------------------------------------------
# latency: 0 us, #15/15, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
#    -----------------
#    | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
#    -----------------
#
#                    _------=> CPU#            
#                   / _-----=> irqs-off        
#                  | / _----=> need-resched    
#                  || / _---=> hardirq/softirq 
#                  ||| / _--=> preempt-depth   
#                  |||| / _-=> migrate-disable 
#                  ||||| /     delay           
#  cmd     pid     |||||| time  |   caller     
#     \   /        ||||||  \    |    /       
  <idle>-0         0d..2. 95713us$: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
  <idle>-0         0d..2. 10104901us$: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
  <idle>-0         0d..2. 20114087us$: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
wifi_dae-1162      2d..2. 25998599us!: sched_switch: prev_comm=wifi_daemon prev_pid=1162 prev_prio=120 prev_state=R+ ==> next_comm=sugov:0 next_pid=88 next_prio=-1
wifi_dae-1162      2d..2. 25999269us!: sched_switch: prev_comm=wifi_daemon prev_pid=1162 prev_prio=120 prev_state=R ==> next_comm=sugov:0 next_pid=88 next_prio=-1
wifi_dae-1162      2d..2. 25999412us!: sched_switch: prev_comm=wifi_daemon prev_pid=1162 prev_prio=120 prev_state=R+ ==> next_comm=sugov:0 next_pid=88 next_prio=-1
  <idle>-0         0d..2. 25999584us!: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
wifi_dae-1162      3d..2. 26000075us*: sched_switch: prev_comm=wifi_daemon prev_pid=1162 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
  <idle>-0         0d..2. 26012127us!: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
  <idle>-0         0d..2. 26012552us!: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
 logread-495       0d..2. 26013054us$: sched_switch: prev_comm=logread prev_pid=495 prev_prio=120 prev_state=S ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
rcu_pree-14        3d..2. 27564266us#: sched_switch: prev_comm=rcu_preempt prev_pid=14 prev_prio=120 prev_state=I ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
   <...>-1163      3d..2. 27567664us!: sched_switch: prev_comm=wifi_daemon prev_pid=1163 prev_prio=120 prev_state=S ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
   <...>-1163      3d..2. 27568473us$: sched_switch: prev_comm=wifi_daemon prev_pid=1163 prev_prio=120 prev_state=S ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120
  <idle>-0         3d..2. 30116102us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=wpa_supplicant next_pid=1244 next_prio=120

As the output shows, only switches whose incoming task is next_pid=1244 or whose outgoing task is prev_pid=1162 are printed.
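
A long sched_switch trace is easier to read once it is reduced to counts. The sketch below counts how many times each PID was switched in. It assumes every record carries a next_pid=N field as in the output above. count_next_pid is a made-up name.

```shell
# Hypothetical helper: count sched_switch records per incoming PID.
# Reads the file given as $1, or standard input when no file is given.
count_next_pid() {
    awk '
        /sched_switch:/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^next_pid=/) {
                    pid = $i
                    sub(/^next_pid=/, "", pid)    # keep only the number
                    count[pid]++
                }
            }
        }
        END { for (pid in count) print pid, count[pid] }
    ' ${1:+"$1"} | sort -n                        # order by PID
}
```

Each output line is "PID count", for example from cat trace | count_next_pid.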

irq

echo 1 > /sys/kernel/debug/tracing/events/irq/irq_handler_entry/enable
echo 1 > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/enable
# enable tracing of interrupt handler entry and exit

echo "irq == 62" > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/filter
echo "irq == 62" > /sys/kernel/debug/tracing/events/irq/irq_handler_entry/filter
# filter by irq; the interrupt number to filter on can be found with cat /proc/interrupts

echo "" > /sys/kernel/debug/tracing/trace
# clear the trace buffer

echo 1 > /sys/kernel/debug/tracing/tracing_on
# start tracing

cat /sys/kernel/debug/tracing/trace
# view the result


/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 354/354   #P:4
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         sugov:0-86      [000] d.h..  1306.666839: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
         sugov:0-86      [000] d.h..  1306.666847: irq_handler_exit: irq=62 ret=handled
         sugov:0-86      [000] d.h..  1306.667226: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
         sugov:0-86      [000] d.h..  1306.667229: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] d.h1.  1306.673856: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] d.h1.  1306.673864: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] d.h1.  1306.722395: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] d.h1.  1306.722414: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] dNh1.  1306.722680: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] dNh1.  1306.722688: irq_handler_exit: irq=62 ret=handled
         sugov:0-86      [000] d.h1.  1306.722773: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
         sugov:0-86      [000] d.h1.  1306.722778: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] d.h1.  1306.730310: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] d.h1.  1306.730313: irq_handler_exit: irq=62 ret=handled
         sugov:0-86      [000] d.h1.  1306.734280: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
         sugov:0-86      [000] d.h1.  1306.734292: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] d.h1.  1306.734520: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] d.h1.  1306.734526: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] d.h1.  1306.734717: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] d.h1.  1306.734722: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] dNh1.  1306.757440: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] dNh1.  1306.757453: irq_handler_exit: irq=62 ret=handled
         sugov:0-86      [000] d.h1.  1306.757530: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
         sugov:0-86      [000] d.h1.  1306.757536: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] d.h1.  1306.792327: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] d.h1.  1306.792340: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] dNh1.  1306.792629: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] dNh1.  1306.792636: irq_handler_exit: irq=62 ret=handled
         sugov:0-86      [000] d.h..  1306.792811: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
         sugov:0-86      [000] d.h..  1306.792817: irq_handler_exit: irq=62 ret=handled
          <idle>-0       [000] d.h1.  1306.821387: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
          <idle>-0       [000] d.h1.  1306.821407: irq_handler_exit: irq=62 ret=handled
         sugov:0-86      [000] d.h..  1306.821615: irq_handler_entry: irq=62 name=rwnx_hostwake_irq
         sugov:0-86      [000] d.h..  1306.821620: irq_handler_exit: irq=62 ret=handled
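
Since every irq_handler_entry is followed by a matching irq_handler_exit, the two timestamps give the time spent in the handler. The sketch below pairs them per irq number and reports the count and the longest handler run. It assumes the default (non-latency) output format shown above, with the timestamp in seconds directly before the event name, and irq_handler_times is a hypothetical name.

```shell
# Hypothetical helper: pair irq_handler_entry/exit and report per-irq stats.
# Reads the file given as $1, or standard input when no file is given.
irq_handler_times() {
    awk '
        {
            for (i = 2; i < NF; i++) {
                if ($i != "irq_handler_entry:" && $i != "irq_handler_exit:")
                    continue
                ts = $(i - 1)
                sub(/:$/, "", ts)                 # "1306.666839:" -> seconds
                irq = $(i + 1)                    # e.g. "irq=62"
                if ($i == "irq_handler_entry:") {
                    start[irq] = ts
                } else if (irq in start) {
                    us = (ts - start[irq]) * 1000000
                    n[irq]++
                    if (us > max[irq]) max[irq] = us
                    delete start[irq]
                }
                break
            }
        }
        END {
            for (irq in n)
                printf "%s count=%d max_us=%.0f\n", irq, n[irq], max[irq]
        }
    ' ${1:+"$1"}
}
```

For the first two entry/exit pairs in the output above this would report a maximum of 8 us for irq=62.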

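What is the difference between irqsoff and events/irq? irqsoff measures how long interrupts stay disabled, while events/irq records interrupt handling activity, such as entering and leaving an interrupt handler.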

Recap

# sched_switch tracing can be enabled in any of the following 3 ways.
 echo sched:sched_switch >> /sys/kernel/debug/tracing/set_event
 echo sched_switch >> /sys/kernel/debug/tracing/set_event
 echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable

# cat set_event shows which events are being traced

cat /sys/kernel/debug/tracing/set_event

# events can also be limited to a process through set_event_pid
echo 1244 > /sys/kernel/debug/tracing/set_event_pid

# cat trace_pipe shows trace output live
cat /sys/kernel/debug/tracing/trace_pipe

# to clear a filter that was set, write 0 to it
echo 0 > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/filter

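Summary:

Tracers include function, function_graph and others. Without a filter they trace globally. Events are static tracepoints at specific places in the kernel, such as the entry and exit of a handler. The kernel has built-in ones such as irq and sched_switch.
tracing_on is the on/off switch for recording: write 1 to turn it on and 0 to turn it off.
trace is the main output file of ftrace and records all trace events. It keeps earlier records, so clear it with echo "" > trace when they are no longer needed. trace_pipe is a live output stream that does not keep history.
Events and tracers can be used at the same time.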