制造、调试OOPS

1.前言
    在Linux系统里出现kernel panic时会打印OOPS信息,把当前的寄存器状态、堆栈信息、完整的Call trace都打印出来,以便我们更好的定位错误。下面我们将会自已制造一个OOPS,并采用多种方式来定位调试OOPS,以便大家更好的了解内核,深入探索内核,利用内核异常信息来定位处理问题。
2.制造OOPS
    这次我们以模块的形式来制造一个OOPS,文件名为oops_example.c,内容如下:
    #include <linux/kernel.h>
    #include <linux/module.h>
    static __init int oops_example_init(void)
    {
            int *p = 0;
            printk("Hi OOPS!\n");
            *p = 12;
            return 0;
    }
    static __exit void oops_example_exit(void)
    {
            printk("Bye OOPS!\n");
    }
    module_init(oops_example_init);
    module_exit(oops_example_exit);
    相应的Makefile文件内容如下:
    obj-m += oops_example.o
    CUR_PATH:=$(shell pwd)
    LINUX_KERNEL_PATH:=/home/xinu/linux-3.13.6
    all:
            make -C $(LINUX_KERNEL_PATH) M=$(CUR_PATH) modules
    clean:
            make -C $(LINUX_KERNEL_PATH) M=$(CUR_PATH) clean
    相应的源码文件目录树如下:
    oops_example/{Makefile,oops_example.c}
    当我们make后,执行sudo insmod oops_example.ko,再运行dmesg命令,可以看到如下输出:
    [198326.094748] oops_example: module license 'unspecified' taints kernel.
    [198326.094753] Disabling lock debugging due to kernel taint
    [198326.095023] Hi OOPS!
    [198326.095031] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [198326.095079] IP: [<ffffffffa0018012>] oops_example_init+0x12/0x1000 [oops_example]
    [198326.095121] PGD 1fa60a067 PUD 1f9808067 PMD 0
    [198326.095149] Oops: 0002 [#1] SMP
    [198326.095171] Modules linked in: oops_example(PO+) bnep rfcomm bluetooth i915 snd_hda_codec_hdmi snd_hda_codec_via snd_hda_intel snd_hda_codec snd_hwdep snd_pcm drm_kms_helper snd_seq_midi drm snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd psmouse alx soundcore snd_page_alloc serio_raw i2c_algo_bit mdio mac_hid lpc_ich ppdev video parport_pc lp parport
    [198326.095398] CPU: 6 PID: 8616 Comm: insmod Tainted: P           O 3.11.0-rc4+ #3
    [198326.095433] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Z77M-D3H, BIOS F13 09/19/2012
    [198326.095478] task: ffff880103a05c40 ti: ffff8801fa964000 task.ti: ffff8801fa964000
    [198326.096427] RIP: 0010:[<ffffffffa0018012>]  [<ffffffffa0018012>] oops_example_init+0x12/0x1000 [oops_example]
    [198326.097414] RSP: 0018:ffff8801fa965cf8  EFLAGS: 00010296
    [198326.098394] RAX: 0000000000000008 RBX: ffffffffa0018000 RCX: 0000000000000006
    [198326.099392] RDX: 0000000000000007 RSI: 0000000000000007 RDI: ffff88021f38d490
    [198326.100408] RBP: ffff8801fa965cf8 R08: 000000000000000a R09: 0000000000000001
    [198326.101432] R10: 0000000000000345 R11: 0000000000000344 R12: ffffffffa011f000
    [198326.102483] R13: 0000000000000000 R14: ffff8801ffb6dcc0 R15: ffffffffa011f050
    [198326.103528] FS:  00007f1aef978700(0000) GS:ffff88021f380000(0000) knlGS:0000000000000000
    [198326.104581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [198326.105628] CR2: 0000000000000000 CR3: 0000000103bd5000 CR4: 00000000001407e0
    [198326.106701] Stack:
    [198326.107771]  ffff8801fa965d78 ffffffff8100024e ffff8801ffb6dcc0 ffffffffa011f050
    [198326.108896]  ffff8801fa965d48 ffffffff810400e3 0000000000000000 0000000000000000
    [198326.110029]  ffffffffa011f018 ffffffffa001a000 ffff8801fa965d78 00000000efa1d5ab
    [198326.111178] Call Trace:
    [198326.112319]  [<ffffffff8100024e>] do_one_initcall+0x4e/0x170
    [198326.113488]  [<ffffffff810400e3>] ? set_memory_nx+0x43/0x50
    [198326.114669]  [<ffffffff810b61c9>] load_module+0x1369/0x1ab0
    [198326.115856]  [<ffffffff810b2f90>] ? show_initstate+0x50/0x50
    [198326.117048]  [<ffffffff810b69bc>] SyS_init_module+0xac/0xd0
    [198326.118246]  [<ffffffff816a3ec2>] system_call_fastpath+0x16/0x1b
    [198326.119448] Code: <c7> 04 25 00 00 00 00 0c 00 00 00 31 c0 5d c3 00 00 00 00 00 00 00
    [198326.120746] RIP  [<ffffffffa0018012>] oops_example_init+0x12/0x1000 [oops_example]
    [198326.122009]  RSP <ffff8801fa965cf8>
    [198326.123268] CR2: 0000000000000000
    [198326.128801] ---[ end trace c5dcd2af51846c6e ]---
    终于,我们制造出了一个OOPS,下面要分析和调试这OOPS。
3.分析OOPS
    对于上面打印出来的OOPS信息,我们分析下相关内容:
    BUG: unable to handle kernel NULL pointer dereference at           (null)
    该句向我们报了个BUG,说明存在不能被处理空指针。
    IP: [<ffffffffa0018012>] oops_example_init+0x12/0x1000 [oops_example]
<p>    IP是指令指针,此处的ffffffffa0018012为出错时的IP值,该值对就oops_example_init+0x12,那么oops_example_init函数的入口地址就是ffffffffa0018012-12=ffffffffa0018000了,这里0x12表示出错位置对应在oops_example_init函数里的偏移,而0x1000表示该函数的大小,即offset/size。</p> <p>    Oops: 0002 [#1] SMP </p> <p>    该句中#1表示该Oops发生的次数,而0002是错误码(16进制),在arch/x86/mm/fault.c(3.11.0-rc4版本的Kernel,下同)文件中有如下注释:
    bit 0 ==    0: no page found          1: protection fault
    bit 1 ==    0: read access               1: write access
    bit 2 ==    0: kernel-mode access  1: user-mode access
    bit 3 ==                                         1: use of reserved bit detected
    bit 4 ==                                         1: fault was an instruction fetch
此处是在内核态进行写操作时找不到页面而出错了。
 有时,OOPS还会打印出Tainted信息,如上面例子有如下打印:
 CPU: 6 PID: 8616 Comm: insmod Tainted: P           O 3.11.0-rc4+ #3
其后的P用来表明内核是因什么原因被污染了(Tainted),在oops-tracing.txt文件中有相应的说明,而P表示我们没有在代码里使用MODULE_LICENSE(“GPL”);声明模块使用GPL协议,加载后会Tainted到内核,故而有该提示,如果加上的话,Tainted:P将显示为Not tainted。
下面简要说明这些标志代表的意思:
G : Modules loaded have a GPL or compatible license.
P : Proprietary module has been loaded.
F : Module was force loaded by “insmod –f”.
S : SMP kernel running on hardware that hasn’t been certified as safe to run multiprocessor.
R : Module was force unloaded by “rmmod –f”.
M : Processor has reported a Machine Check Exception.
B : System has hit bad_page.
U : User or user application specifically requested.
D : Kernel has died recently.
A : ACPI table overridden.
W : Taint on warning.
C : Staging driver loaded.
I : Kernel run in the a severe bug’s platform firmware.
O : Externally-built module loaded.
    RIP: 0010:[<ffffffffa0018012>]  [<ffffffffa0018012>] oops_example_init+0x12/0x1000 [oops_example]
  RIP是64位的指令指针寄存器,0010是代码段寄存器的值,oops_example_init+0x12/0x1000上面有分析过了,即<symbol>+offset/length。
   从RIP到CR4之间都是CPU寄存器的值,而Stack:到Call Trace:之间的内容为栈的值,而Call Trace:到Code:之间为OOPS发生前函数的调用顺序列表(逆序),而Code:一整行表示在OOPS发生时正被运行的机器码所在段以16进制转储的值。
4.调试OOPS
1)objdump
使用objdump -S oops_example.ko命令来反汇编,可以输出如下内容:
       oops_example.ko:     file format elf64-x86-64
      Disassembly of section .init.text: </p><p>        0000000000000000 <init_module>:
        #include <linux/kernel.h>
        #include <linux/module.h>
        static __init int oops_example_init(void)
        {
          0:   55                      push   %rbp
                int p = 0;
                printk(“Hi OOPS!\n”);
          1:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
          8:   31 c0                   xor    %eax,%eax
        #include <linux/kernel.h>
        #include <linux/module.h>
        static __init int oops_example_init(void)
        {
           a:   48 89 e5                mov    %rsp,%rbp
                int *p = 0;
                printk(“Hi OOPS!\n”);
           d:   e8 00 00 00 00          callq  12 <init_module+0x12>
                *p = 12;
         12:   c7 04 25 00 00 00 00    movl   $0xc,0x0
         19:   0c 00 00 00
                return 0;
        }
          1d:   31 c0                   xor    %eax,%eax
          1f:   5d                      pop    %rbp
          20:   c3                      retq  
        Disassembly of section .exit.text:
        0000000000000000 <cleanup_module>:
       static __exit void oops_example_exit(void)
        {
           0:   55                      push   %rbp
                printk(“Bye OOPS!\n”);
           1:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
           8:   31 c0                   xor    %eax,%eax
                return 0;
        }
        static __exit void oops_example_exit(void)
        {
          a:   48 89 e5                mov    %rsp,%rbp
               printk(“Bye OOPS!\n”);
          d:   e8 00 00 00 00          callq  12 <cleanup_module+0x12>
        }
          12:   5d                      pop    %rbp
          13:   c3                      retq  
        对照上面的OOPS信息中IP: [<ffffffffa0018012>] oops_example_init+0x12/0x1000 [oops_example]一句,我们很快可以定位到其对应12:   c7 04 25 00 00 00 00    movl   $0xc,0x0这一句,前面的12刚好与0x12这一偏移对应上,而该位置上一句
p = 12;刚好是该汇编语句对应的C代码,而再往上跟可以知道int p = 0; 而汇编代码刚好将0xc(即12)写入0地址,故而有了上面的NULL指针提示。
2)gdb
运行gdb oops_example.ko命令,有如下输出:
       Copyright (C) 2012 Free Software Foundation, Inc.
        License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
        This is free software: you are free to change and redistribute it.
        There is NO WARRANTY, to the extent permitted by law.  Type “show copying”
        and “show warranty” for details.
        This GDB was configured as “x86_64-linux-gnu”.
        For bug reporting instructions, please see:
        <http://bugs.launchpad.net/gdb-linaro/>…
        Reading symbols from /home/guochongxin/xinu/xinu/linux_kernel_driver_l1/oops_example/oops_example.ko…done.
        (gdb)
        接下来添加符号文件到调试器,使用add-symbol-file命令,第一个参数是目标文件,第二个参数是text setcion的起始地址,可使用sudo cat /sys/module/oops_example/sections/.init.text命令获得,我这边得到的是0xffffffffa0018000,接下来在上面的(gdb)后面输入add-symbol-file oops_example.ko 0xffffffffa0018000后有如下输出:
        (gdb) add-symbol-file oops_example.ko 0xffffffffa0018000
        add symbol table from file “oops_example.ko” at
                .text_addr = 0xffffffffa0018000
        (y or n)
        此时输入y确认后有如下输出:
        Reading symbols from /home/guochongxin/xinu/xinu/linux_kernel_driver_l1/oops_example/oops_example.ko…done.
        (gdb)
        在OOPS有RIP开头一行,可知道出错位置的函数名为oops_example_init,接下来在上面(gdb)后输入disassemble oops_example_init后有如下输出:
        Dump of assembler code for function oops_example_init:
           0x0000000000000024 <+0>:     push   %rbp
          0x0000000000000025 <+1>:     mov    $0x0,%rdi
          0x000000000000002c <+8>:     xor    %eax,%eax
          0x000000000000002e <+10>:    mov    %rsp,%rbp
           0x0000000000000031 <+13>:    callq  0x36 <oops_example_init+18>
           0x0000000000000036 <+18>:    movl   $0xc,0x0
          0x0000000000000041 <+29>:    xor    %eax,%eax
           0x0000000000000043 <+31>:    pop    %rbp
          0x0000000000000044 <+32>:    retq  
        End of assembler dump.
        (gdb)
        从上面可以知道函数的起始地址为0x0000000000000024 ,而RIP一行里可看到其偏移为0x12,两者相加后为0x0000000000000036,刚好对应上面反汇编的movl   $0xc,0x0这行,接下来我们显示该地址对应的C源码,在上面的(gdb)后输入list *0x0000000000000036后有如下输出:
        0x36 is in oops_example_init (/home/guochongxin/xinu/xinu/linux_kernel_driver_l1/oops_example/oops_example.c:10).
        5       {
        6               int *p = 0;
        7
        8               printk(“Hi OOPS!\n”);
        9
        10              *p = 12;
        11
        12              return 0;
        13      }
        14
        可以看到源文件路径冒号后有10,刚好对应序号10对应的一行,即出错处在
p = 12;一行,这样就定位到了问题位置了,如何解决就不用说明了吧!
5.后记
    Kernel(3.11.0-rc4)配置项中需要将Kernel hacking —> Compile-time checks and compiler options —> [*] Compile the kernel with debug info一项(CONFIG_DEBUG_INFO)选上。
    除了上面提及的调试方法,在Linux Kernel源码下还有scripts/decodecode脚本来帮忙处理。
6.参考资料
《Understanding a Kernel Oops!》
《Linux kernel OOPS debugging》
《根据linux Oops定位错误代码行》
《Linux内核的Oops》
 Documentation/oops-tracing.txt


评论

此博客中的热门博文

I/O映射之I/O端口

通过Netlink检测网线插拔

使用seq_file