Saturday, July 5, 2014

0x000000D1 Debugging - NotMyFault exploration (x64)

I've discussed 0xD1 debugging before, but I figured I'd go into a different 0xD1 scenario and show it from a few different angles, using NotMyFault to force a bug check.

Download NotMyFault here.

--------------------

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

This indicates that a kernel-mode driver attempted to access pageable memory at a process IRQL that was too high.

We're all familiar with this bug check, so let's move on to what I wanted to talk about.

Let's go ahead and run !analyze -v:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: fffff8a0066eb800, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff88002af7385, address which referenced memory

Arg1, fffff8a0066eb800, is the memory that was referenced. Either the address is invalid, or it is pageable memory that was touched at an IRQL that was too high.
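To make the four arguments easier to read at a glance, here is a minimal Python sketch that turns them into a labelled summary. The function name `describe_d1` and the lookup tables are my own illustration, not part of WinDbg. The argument meanings are the ones printed in the !analyze -v output above.

```python
# Hypothetical helper for illustration only; not a WinDbg or Windows API.
IRQL_NAMES = {0: "PASSIVE_LEVEL", 1: "APC_LEVEL", 2: "DISPATCH_LEVEL"}
OPERATIONS = {0: "read", 1: "write"}  # as listed for Arg3 in the output above


def describe_d1(arg1, arg2, arg3, arg4):
    """Label the four arguments of a 0xD1 bug check."""
    return {
        "memory_referenced": f"{arg1:016x}",
        "irql": IRQL_NAMES.get(arg2, f"IRQL {arg2}"),
        "operation": OPERATIONS.get(arg3, f"other ({arg3})"),
        "faulting_instruction": f"{arg4:016x}",
    }


summary = describe_d1(0xFFFFF8A0066EB800, 0x2, 0x0, 0xFFFFF88002AF7385)
for key, value in summary.items():
    print(f"{key:22} {value}")
```

For this dump that reads as: a read of fffff8a0066eb800 at DISPATCH_LEVEL by the instruction at fffff88002af7385.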

kd> !pte fffff8a0066eb800
                                           VA fffff8a0066eb800
PXE at FFFFF6FB7DBEDF88    PPE at FFFFF6FB7DBF1400    PDE at FFFFF6FB7E280198    PTE at FFFFF6FC50033758
contains 000000007AC84863  contains 000000000367B863  contains 000000006B4C6863  contains 00003B5000000000
pfn 7ac84     ---DA--KWEV  pfn 367b      ---DA--KWEV  pfn 6b4c6     ---DA--KWEV  not valid
                                                                                  PageFile:  0
                                                                                  Offset: 3b50
                                                                                  Protect: 0

Using our handy !pte command, which shows the page table and page directory entries for an address, we can see that this is not a valid address, despite appearing to be one at first glance. Why is it not valid? As the last column of the output shows (not valid, PageFile: 0, Offset: 3b50), it's because the page behind this address is currently in the pagefile.
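If you're curious where the four entry addresses in the !pte output come from, they can be reproduced with a little arithmetic. The sketch below assumes the fixed self-map bases that older x64 Windows builds used (newer Windows 10 builds randomize them), which is consistent with the addresses in this dump. The function name `entry_addresses` is mine.

```python
# Assumption: fixed paging-structure bases, as on older x64 Windows builds.
PTE_BASE = 0xFFFFF68000000000
PDE_BASE = 0xFFFFF6FB40000000
PPE_BASE = 0xFFFFF6FB7DA00000
PXE_BASE = 0xFFFFF6FB7DBED000


def entry_addresses(va):
    """Return the addresses of the PXE, PPE, PDE and PTE that map va."""
    va48 = va & 0xFFFFFFFFFFFF  # only the low 48 bits are translated
    return {
        "PXE": PXE_BASE + ((va48 >> 39) << 3),  # 8 bytes per entry
        "PPE": PPE_BASE + ((va48 >> 30) << 3),
        "PDE": PDE_BASE + ((va48 >> 21) << 3),
        "PTE": PTE_BASE + ((va48 >> 12) << 3),
    }


for name, address in entry_addresses(0xFFFFF8A0066EB800).items():
    print(f"{name} at {address:016X}")
```

Each printed address should line up with the matching column of the !pte output above.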

Why can't we just page it in? That's not how the Windows memory manager works in kernel-mode. If we're at IRQL 2 (DISPATCH_LEVEL) or higher, which we are (see argument 2), page faults cannot be serviced, so nothing can be paged in and we bug check.
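The raw PTE value from the !pte output (00003B5000000000) can be picked apart by hand too. The bit positions below are an assumption based on the software PTE layout of x64 Windows builds from this era, and they vary between Windows versions. They do reproduce the PageFile, Offset and Protect values that !pte printed. `would_bugcheck` is a deliberately simplified model of the rule described above, not how the kernel actually implements it.

```python
DISPATCH_LEVEL = 2


def decode_software_pte(pte):
    """Split a non-valid (software) PTE into fields. Layout is an assumption."""
    return {
        "valid": bool(pte & 1),              # bit 0
        "pagefile": (pte >> 1) & 0xF,        # bits 1-4: pagefile number
        "protect": (pte >> 5) & 0x1F,        # bits 5-9
        "offset": pte >> 32,                 # bits 32-63: offset in pagefile
    }


def would_bugcheck(pte, irql):
    """Simplified rule: a non-valid page touched at IRQL >= 2 cannot be fixed up."""
    return not (pte & 1) and irql >= DISPATCH_LEVEL


fields = decode_software_pte(0x00003B5000000000)
print(fields["valid"], fields["pagefile"], hex(fields["offset"]), fields["protect"])
print(would_bugcheck(0x00003B5000000000, irql=2))
```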

Great, so we know why the system crashed. However, what caused it?

--------------------

Let's go ahead and dump the stack:

kd> k
Child-SP          RetAddr           Call Site
fffff880`032f4448 fffff800`02a912a9 nt!KeBugCheckEx
fffff880`032f4450 fffff800`02a8ff20 nt!KiBugCheckDispatch+0x69
fffff880`032f4590 fffff880`02af7385 nt!KiPageFault+0x260
fffff880`032f4720 fffff880`02af7727 myfault+0x1385
fffff880`032f4870 fffff800`02dac127 myfault+0x1727
fffff880`032f48d0 fffff800`02dac986 nt!IopXxxControlFile+0x607
fffff880`032f4a00 fffff800`02a90f93 nt!NtDeviceIoControlFile+0x56
fffff880`032f4a70 00000000`76df138a nt!KiSystemServiceCopyEnd+0x13
00000000`0023edc8 00000000`00000000 0x76df138a

So here we have our call stack. Rather than annotating the calls with <--- arrows, I'll comment below it, because I don't want to destroy the formatting of the stack. Keep in mind that the stack reads from the bottom up.

We start out at the bottom with something in user-mode that we don't have symbols for, which is why it shows as 0x76df138a rather than a resolved name we can understand. How did I know we started out in user-mode? Good question! On x64, user-mode addresses live in the lower half of the address space, so their upper digits are all zeros (00000000`76df138a), whereas kernel-mode addresses start with ffff.

The missing name is also down to the fact that this is a kernel dump, which we can see towards the top of our crash dump within WinDbg:

Kernel Summary Dump File: Only kernel address space is available

With that said, we cannot see what the application was doing before it went down into kernel-mode.

So we know that some application (at 0x76df138a) did something and called down into kernel-mode. Every frame above 0x76df138a is kernel-mode. On x64 you can tell because the Child-SP addresses, such as fffff880`032f4a00, start with fffff, which implies kernel-mode.
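The user/kernel split described above can be written down as a small check. This is a minimal sketch that uses the architectural split of the x64 canonical address space (lower half for user-mode, upper half for kernel-mode). The real user-mode limit on a given Windows version is lower than the architectural one, and the helper names are mine.

```python
KERNEL_START = 0xFFFF800000000000  # first address of the canonical upper half
USER_END = 0x00007FFFFFFFFFFF      # last address of the canonical lower half


def parse_address(text):
    """Accept WinDbg-style addresses such as fffff880`032f4a70 or 0x76df138a."""
    return int(text.replace("`", ""), 16)


def classify(address):
    if address >= KERNEL_START:
        return "kernel"
    if address <= USER_END:
        return "user"
    return "non-canonical"


# A few addresses taken from the call stack above.
for text in ("fffff880`032f4a70", "fffff800`02a90f93", "00000000`0023edc8", "0x76df138a"):
    print(f"{text:20} {classify(parse_address(text))}")
```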

We can see it goes through a few functions and then ends up in myfault. Shortly afterwards we hit a page fault (trying to page in memory from the pagefile, a big no-no).

--------------------

If we take a look at the trap frame:

kd> .trap 0xfffff880032f4590
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000005000000 rbx=0000000000000000 rcx=0000000000002481
rdx=fffffa8001810000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff88002af7385 rsp=fffff880032f4720 rbp=fffff880032f4b60
 r8=0000000000012408  r9=0000000000000810 r10=fffff80002a12000
r11=0000000000000002 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po nc
myfault+0x1385:
fffff880`02af7385 8b03            mov     eax,dword ptr [rbx] ds:00000000`00000000=????????

The first very important thing here is the note that the trap frame does not contain all registers, and that some may be zeroed or incorrect. The big question is why? Well, the trap frame generation code on x64 versions of Windows does not save the contents of non-volatile registers.

With that said, registers such as rbx, rdi, rsi, etc. are either zeroed out or incorrect. This is because on x64 any code that runs after the trap frame is generated is expected to save any non-volatile registers it uses in its own frame and restore them afterwards. Saving them in the trap frame as well is seen as an unnecessary step in a hot path within the kernel.
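As a rough aid, the snippet below scans a .trap register dump and lists the registers that the Microsoft x64 calling convention treats as non-volatile, since those are the ones not to take at face value. It is a conservative sketch: it flags every callee-saved general-purpose register except rsp (which the CPU itself pushes when the exception is taken), even though some of them may in fact have been captured. The names `SUSPECT` and `parse_registers` are mine.

```python
import re

# Callee-saved general-purpose registers in the Microsoft x64 calling
# convention, minus rsp, which the CPU saves on exception entry.
SUSPECT = {"rbx", "rbp", "rdi", "rsi", "r12", "r13", "r14", "r15"}

DUMP = """\
rax=0000000005000000 rbx=0000000000000000 rcx=0000000000002481
rdx=fffffa8001810000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff88002af7385 rsp=fffff880032f4720 rbp=fffff880032f4b60
 r8=0000000000012408  r9=0000000000000810 r10=fffff80002a12000
r11=0000000000000002 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
"""


def parse_registers(text):
    """Return {register name: value} for every name=value pair in the dump."""
    pairs = re.findall(r"\b(r\w+)=([0-9a-f]{16})", text)
    return {name: int(value, 16) for name, value in pairs}


registers = parse_registers(DUMP)
untrusted = sorted(name for name in registers if name in SUSPECT)
print("treat with care:", ", ".join(untrusted))
```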

Extremely detailed article with much more info here.

Moving on to the instruction we failed on: we were reading a 32-bit value into eax from the address stored in the rbx register:

mov     eax,dword ptr [rbx]

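Uh oh, rbx is zeroed out. With that said, we can't !pte the address in the register to double check it. The address that the bug check itself recorded as referenced (Arg1, fffff8a0066eb800) is the one we already ran through !pte earlier, so we just need to assume that this all occurred because myfault attempted to access memory that was either paged out or invalid (which it did).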

--------------------

If you wanted extra proof, or wanted to see whether NotMyFault was behind the crash, you could dump all of the processes at the time of the crash to look for a correlation. In this case, you'd use !process 0 0. The flags are important here, and as always you can check the WinDbg help file or MSDN for info.

PROCESS fffffa80040a7060
    SessionId: 1  Cid: 0654    Peb: 7fffffd4000  ParentCid: 0708
    DirBase: 670ea000  ObjectTable: fffff8a00666c330  HandleCount:  68.
    Image: NotMyfault.exe

We can see we did indeed have a NotMyFault process running at the time of the crash, so at this point we can assume that this is very likely the cause of the crash.
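On a busy system the !process 0 0 listing can be long, so if you've saved it to a text file it may be easier to search it with a script. Here is a minimal sketch that pulls the EPROCESS address and image name out of each PROCESS block. The function name `list_images` is mine, and the sample text is just the single entry shown above.

```python
import re

SAMPLE = """\
PROCESS fffffa80040a7060
    SessionId: 1  Cid: 0654    Peb: 7fffffd4000  ParentCid: 0708
    DirBase: 670ea000  ObjectTable: fffff8a00666c330  HandleCount:  68.
    Image: NotMyfault.exe
"""


def list_images(text):
    """Return (eprocess address, image name) pairs from !process 0 0 output."""
    pattern = re.compile(r"PROCESS\s+(\S+).*?Image:\s+([^\r\n]+)", re.S)
    return [(addr, image.strip()) for addr, image in pattern.findall(text)]


matches = [entry for entry in list_images(SAMPLE) if "notmyfault" in entry[1].lower()]
print(matches)
```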

Hope you enjoyed reading!
