gpt_loader.sys revisited, file read problem

By dose | June 28, 2015

It’s been over a year since I last analyzed and fixed a bug in the Paragon GPT
loader driver, which enables us Windows XP users to use GPT-partitioned drives
beyond 2TB in size.
Last time, I fixed a severe bug that caused the driver to crash.
This time, a user reported a strange bug with the driver in the comments section,
one that I had also run into once but ignored at first:

The Problem

When reading files that are located beyond the 2TB boundary, massive memory usage
occurs and, as a result, the computer slows to a crawl. This is especially
a problem if you are copying large files from your >2TB hard disk.
More information about the problem can be found in the comments on the last
fix, where a user reported it.

Time to take a further look at the problem:

When copying a file from T: to S:, where T: is the GPT drive that gpt_loader.sys
handles, the physical memory usage increases a lot during the copy and the
lazy writer thread starts flushing data to the SOURCE file on T:, as can be seen
in Filemon.

Looking at the call stack of the first write to the source file, the write
operation originates in nt!MiMappedPageWriter.
This is the lazy writer thread that periodically sweeps through the dirty
pages and flushes them to disk.
So the first conclusion is that there must be some memory-mapped view of
the source file whose pages became dirty for some reason. As they are dirty, the
system has to keep their content in memory, which in turn seems to cause the huge
memory usage that ignores the disk cache limits.
Dirty pages also need to be flushed back to disk, which causes additional load
for no reason and may theoretically even be dangerous, as a file that is
only being read could get corrupted on power loss. As files that are read normally
don’t get corrupted, I assume that the pages are marked dirty even though
their content hasn’t changed.
The thread is processing the MmMappedPageWriterList.

Now when trying with another file not in cache and checking some statistics,
it can be seen that there are many dirty pages for the SOURCE file being copied
(which are being flushed back to disk):

lkd> !memusage
...
Control Valid Standby Dirty Shared Locked PageTables  name
8a0c75d8   748  77632 577680     0     0     0  mapped_file( (2013-12-28 20-10) Polt- ORF2 N.ts )
8a074c80   208  654244  1636     0     8     0  mapped_file( (2013-12-28 20-10) Polt- ORF2 N.ts )
...

Note the large Dirty value of the first entry!

lkd> !ca 8a0c75d8

ControlArea  @ 8a0c75d8
Segment      e55cf488  Flink      00000000  Blink        00000000
Section Ref         1  Pfn Ref       2cc80  Mapped Views        4
User Ref            0  WaitForDel        0  Flush Count         0
File Object  8a182408  ModWriteCount     0  System Views        4

Flags (8080) File WasPurged

File: \(2013-12-28 20-10) Polt- ORF2 N.ts

Segment @ e55cf488
Type nt!_MAPPED_FILE_SEGMENT not found.
lkd> !ca 8a074c80

ControlArea  @ 8a074c80
Segment      e386d998  Flink      00000000  Blink        00000000
Section Ref         1  Pfn Ref       2cc50  Mapped Views        2
User Ref            0  WaitForDel        0  Flush Count         0
File Object  8a1d4628  ModWriteCount     0  System Views        2

Flags (8080) File WasPurged

File: \(2013-12-28 20-10) Polt- ORF2 N.ts

Segment @ e386d998
Type nt!_MAPPED_FILE_SEGMENT not found.
lkd> !fileobj 8a182408

\(2013-12-28 20-10) Polt- ORF2 N.ts

Device Object: 0x8b560be8   \Driver\gpt_loader
Vpb: 0x8b57a450
Event signalled
Access: Read SharedRead

Flags:  0xc0062
Synchronous IO
Sequential Only
Cache Supported
Handle Created
Fast IO Read

FsContext: 0xe5584850    FsContext2: 0xe55849a8
Private Cache Map: 0x89e93b50
CurrentByteOffset: 2cc50000
Cache Data:
Section Object Pointers: 8a1dba3c
Shared Cache Map: 89e93a78         File Offset: 2cc50000
Vacb: 8b5d87f8
Your data is at: d3ad0000
lkd> !fileobj 8a1d4628

\(2013-12-28 20-10) Polt- ORF2 N.ts

Device Object: 0x8b578e30   \Driver\Ftdisk
Vpb: 0x8b586af0
Event signalled
Access: Read Write SharedRead SharedWrite

Flags:  0x43062
Synchronous IO
Sequential Only
Cache Supported
Modified
Size Changed
Handle Created

FsContext: 0xe1584990    FsContext2: 0xe1584ae8
Private Cache Map: 0x89dd72d0
CurrentByteOffset: 2cc50000
Cache Data:
Section Object Pointers: 896adb14
Shared Cache Map: 89dd71f8         File Offset: 2cc50000
Vacb: 8b5dba68
Your data is at: c2ad0000

lkd> !object 8a182408
Object: 8a182408  Type: (8b60ee70) File
ObjectHeader: 8a1823f0 (old version)
HandleCount: 1  PointerCount: 3
Directory Object: 00000000  Name: \(2013-12-28 20-10) Polt- ORF2 N.ts {HarddiskGptVolume1}

So this is the file being READ from the GPT disk, as suspected, and it has dirty
pages for some unknown reason.
It is possible that the view originates from the cache manager.
The cache manager normally has some sort of write throttling so that the available
cache memory cannot be exceeded, but as this occurs on a file that is only READ, the
throttling doesn’t have any effect here, leading to excessive memory usage.

So it is time to have a look at what gpt_loader is actually doing in its
processing routine for read/write. Translated to pseudo C code, it’s
basically the following (heavily shortened to the relevant calls):

ATA_PASS_THROUGH_DIRECT InputBuffer;
IO_STATUS_BLOCK IoStatusBlock;
NTSTATUS Status;
KEVENT Event;
PIRP AtaIRP;
union {
  USHORT AtaFlags;
  BOOL bRead;
} flg;
DWORD IoStatusInformation; // Returned later in Irp->IoStatus.Information as number of bytes transferred
DWORD nSectors = IoGetCurrentStackLocation(Irp)->Parameters.Read.ByteOffset.QuadPart / this->dw124;
DWORD nSectorsRead; // Number of sectors handled by the current pass-through request

flg.bRead = IoGetCurrentStackLocation(Irp)->MajorFunction == IRP_MJ_READ;
InputBuffer.DataBuffer = MmGetSystemAddressForMdlSafe(Irp->MdlAddress, HighPagePriority);
flg.AtaFlags = ATA_FLAGS_48BIT_COMMAND | ATA_FLAGS_USE_DMA | (flg.bRead?ATA_FLAGS_DATA_IN:ATA_FLAGS_DATA_OUT);

for (IoStatusInformation = 0; nSectors > 0; IoStatusInformation+=InputBuffer.DataTransferLength)
{
  nSectorsRead = nSectors>(31 * (4096 / this->nBytesPerSector))?(31 * (4096 / this->nBytesPerSector)):nSectors;
  InputBuffer.DataTransferLength = nSectorsRead * this->nBytesPerSector;
  InputBuffer.AtaFlags = flg.AtaFlags;
  // Omitted here: Fill InputBuffer with ATA-read command and data to read/write ...
  KeInitializeEvent(&Event, NotificationEvent, FALSE);
  AtaIRP = IoBuildDeviceIoControlRequest(
    IOCTL_ATA_PASS_THROUGH_DIRECT,
    this->DeviceObject,
    &InputBuffer,
    sizeof(InputBuffer),
    &InputBuffer,
    sizeof(InputBuffer),
    0,
    &Event,
    &IoStatusBlock);
  if ((Status = IoCallDriver(this->DeviceObject, AtaIRP)) == STATUS_PENDING) {
    KeWaitForSingleObject(&Event, Executive, KernelMode, FALSE, NULL);
    Status = IoStatusBlock.Status;
  }
  if (!NT_SUCCESS(Status)) break;
  nSectors -= nSectorsRead;
  InputBuffer.DataBuffer = (PUCHAR)InputBuffer.DataBuffer + nSectorsRead * this->nBytesPerSector;
}
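
The way the loop splits a request into pieces of at most 31 pages can be tried out in user mode. The following is only a minimal sketch, not the driver's code: the helper names max_sectors_per_request and count_requests are made up for illustration, and the 31-page limit is simply taken from the pseudo-code above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES       4096u
#define MAX_PAGES_PER_REQUEST 31u   /* limit taken from the pseudo-code above */

/* Largest number of sectors a single pass-through request may carry. */
static uint32_t max_sectors_per_request(uint32_t bytes_per_sector)
{
    assert(bytes_per_sector > 0 && bytes_per_sector <= PAGE_SIZE_BYTES);
    return MAX_PAGES_PER_REQUEST * (PAGE_SIZE_BYTES / bytes_per_sector);
}

/* Number of pass-through requests needed to transfer n_sectors,
   using the same "take the maximum, then the remainder" rule as the loop. */
static uint32_t count_requests(uint32_t n_sectors, uint32_t bytes_per_sector)
{
    uint32_t max_chunk = max_sectors_per_request(bytes_per_sector);
    uint32_t requests = 0;

    while (n_sectors > 0) {
        uint32_t chunk = n_sectors > max_chunk ? max_chunk : n_sectors;
        n_sectors -= chunk;
        requests++;
    }
    return requests;
}
```

With 512-byte sectors, one request carries at most 248 sectors, which is exactly 31 * 4096 bytes.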

What the documentation says, and what we can see here, is that the
IOCTL_ATA_PASS_THROUGH_DIRECT call requires not an MDL but a virtual address
to read the data to. So the driver does the obvious: it gets a virtual
address from the MDL via MmGetSystemAddressForMdlSafe and passes that pointer
to the lower-level ATA driver so that the buffer gets filled.
Seems fine, right? And it obviously works.
But from what I can see, the following happens down the chain, which causes the
unpleasant phenomenon mentioned above:
The lower-level driver atapi.sys needs an MDL to read to, so
in IdeAtaPassThroughSetupIrp it calls IoAllocateMdl for the virtual address
passed in, assigns the MDL to Irp->MdlAddress,
locks it for write access with MmProbeAndLockPages and passes the call
through to the next driver. When the pass-through is done, it calls
its function IdeAtaPassThroughFreeIrp, which does MmUnlockPages(Irp->MdlAddress).
On unlock, the pages that were locked for write access are marked as Modified,
causing the unpleasant behaviour mentioned above.

Fixing it

So in order to circumvent this problem, the gpt_loader.sys driver would instead
need to allocate a buffer of 0x1F000 bytes (the maximum size supported for a block
is 4096 * 31 bytes, and it’s better to allocate the buffer once and reuse
it than to allocate and free it on every call, which looks a bit
expensive), let the lower-level ATAPI driver read into that buffer,
and then memcpy the data from this buffer to the input buffer
to avoid marking the pages dirty.
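
The idea can be illustrated in user mode. The following is a minimal sketch under stated assumptions, not the driver's code: fake_device_read is a hypothetical stand-in for the ATA pass-through call, and read_via_bounce is a made-up name for the loop that only ever hands the fixed intermediate buffer to the "device".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BOUNCE_SIZE 0x1F000u   /* 31 * 4096 bytes, the per-request maximum */

/* Hypothetical stand-in for the pass-through call: fills buf with len bytes
   of "disk" content starting at byte offset. */
static void fake_device_read(const uint8_t *disk, size_t offset,
                             uint8_t *buf, size_t len)
{
    memcpy(buf, disk + offset, len);
}

/* Read len bytes into dest, but let the device write only into bounce.
   Returns the number of bytes transferred. */
static size_t read_via_bounce(const uint8_t *disk, size_t offset,
                              uint8_t *dest, size_t len, uint8_t *bounce)
{
    size_t done = 0;

    while (done < len) {
        size_t chunk = len - done;
        if (chunk > BOUNCE_SIZE)
            chunk = BOUNCE_SIZE;
        fake_device_read(disk, offset + done, bounce, chunk); /* device fills the bounce buffer */
        memcpy(dest + done, bounce, chunk);                   /* plain copy to the caller's buffer */
        done += chunk;
    }
    return done;
}
```

The bounce buffer is allocated once by the caller and reused for every chunk, which mirrors the allocate-once approach described above.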

Now, can this be fixed by patching? It seems to be quite hard, as we must
actually add instructions to the driver without increasing its size or
overwriting vital functions.
The first problem is the buffer space. This turns out to be easy. In generateLoader,
memory for the handling class is allocated with:

HandlerClass = malloc_pool(0x154u, NonPagedPool);

.00010877: 57                           push        edi
.00010878: 6854010000                   push        000000154
.0001087D: E8B4650000                   call       .000016E36

So we just add 0x1F000 to the size of the class structure and address
HandlerClass+0x154 as the buffer. This also ensures that it gets freed properly
on exit without the need to add a free function:

.00010878: 6854F10100                   push        00001F154
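
To double-check an instruction patch like this one, the bytes of a push with a 32-bit immediate can be generated instead of typed by hand. This is a small user-mode sketch; encode_push_imm32 is a hypothetical helper name and not part of the driver or the patcher.

```c
#include <stdint.h>

#define CLASS_SIZE  0x154u    /* original allocation size from the listing */
#define BUFFER_SIZE 0x1F000u  /* 31 * 4096 bytes for the intermediate buffer */

/* Encode "push imm32": opcode 0x68 followed by the immediate in
   little-endian byte order. */
static void encode_push_imm32(uint32_t imm, uint8_t out[5])
{
    out[0] = 0x68;
    out[1] = (uint8_t)(imm);
    out[2] = (uint8_t)(imm >> 8);
    out[3] = (uint8_t)(imm >> 16);
    out[4] = (uint8_t)(imm >> 24);
}
```

Called with CLASS_SIZE, it reproduces the bytes 68 54 01 00 00 of the original instruction; called with CLASS_SIZE + BUFFER_SIZE, it gives the bytes for the enlarged allocation.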

The harder part is fixing the processIrp routine. Looking at the pseudo-code
above, we basically need to change the routine to the following:

ATA_PASS_THROUGH_DIRECT InputBuffer;
IO_STATUS_BLOCK IoStatusBlock;
NTSTATUS Status;
KEVENT Event;
PIRP AtaIRP;
union {
  USHORT AtaFlags;
  BOOL bRead;
} flg;
DWORD IoStatusInformation; // Returned later in Irp->IoStatus.Information as number of bytes transferred
DWORD nSectors = IoGetCurrentStackLocation(Irp)->Parameters.Read.ByteOffset.QuadPart / this->dw124;
DWORD nSectorsRead; // Number of sectors handled by the current pass-through request
PBYTE Buffer = MmGetSystemAddressForMdlSafe(Irp->MdlAddress, HighPagePriority);

flg.bRead = IoGetCurrentStackLocation(Irp)->MajorFunction == IRP_MJ_READ;
InputBuffer.DataBuffer = flg.bRead?this->offs154:Buffer;
flg.AtaFlags = ATA_FLAGS_48BIT_COMMAND | ATA_FLAGS_USE_DMA | (flg.bRead?ATA_FLAGS_DATA_IN:ATA_FLAGS_DATA_OUT);

for (IoStatusInformation = 0; nSectors > 0; IoStatusInformation+=InputBuffer.DataTransferLength)
{
  nSectorsRead = nSectors>(31 * (4096 / this->nBytesPerSector))?(31 * (4096 / this->nBytesPerSector)):nSectors;
  InputBuffer.DataTransferLength = nSectorsRead * this->nBytesPerSector;
  InputBuffer.AtaFlags = flg.AtaFlags;
  // Omitted here: Fill InputBuffer with ATA-read command and data to read/write ...
  KeInitializeEvent(&Event, NotificationEvent, FALSE);
  AtaIRP = IoBuildDeviceIoControlRequest(
    IOCTL_ATA_PASS_THROUGH_DIRECT,
    this->DeviceObject,
    &InputBuffer,
    sizeof(InputBuffer),
    &InputBuffer,
    sizeof(InputBuffer),
    0,
    &Event,
    &IoStatusBlock);
  if ((Status = IoCallDriver(this->DeviceObject, AtaIRP)) == STATUS_PENDING) {
    KeWaitForSingleObject(&Event, Executive, KernelMode, FALSE, NULL);
    Status = IoStatusBlock.Status;
  }
  if (!NT_SUCCESS(Status)) break;
  nSectors -= nSectorsRead;
  if (flg.AtaFlags & ATA_FLAGS_DATA_IN) {
    // Read: copy from the intermediate buffer to the MDL buffer and advance the destination only
    RtlCopyMemory(Buffer, InputBuffer.DataBuffer, InputBuffer.DataTransferLength);
    Buffer += nSectorsRead * this->nBytesPerSector;
  } else InputBuffer.DataBuffer = (PUCHAR)InputBuffer.DataBuffer + nSectorsRead * this->nBytesPerSector;
}

First, we need more space on the stack for our pointer:

.00015DEA: 8BFF          mov    edi,edi
.00015DEC: 55            push   ebp
.00015DED: 8BEC          mov    ebp,esp
.00015DEF: 81EC8C000000  sub    esp,00000008C
.00015DF5: A1008C0100    mov    eax,[00018C00]

So, change it to sub esp, 90h, so that [ebp-90h] is our new pointer:

.00015DEF: 81EC90000000  sub    esp,000000090

As there is new code to add, we need to create a new section for it,
because there is not enough space to stuff all of it into the original function.
We can cut 0x200 bytes off the end of the .reloc section and create a new
code section for our code there.
But due to the alignment of .reloc, we also have to change the
section table to remove the discardable flag from .reloc, otherwise our code
will vanish when .reloc gets discarded. This unfortunately adds 1.75KB of
memory usage to our driver, but that shouldn’t hurt you too much,
I guess 😉
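
Removing the discardable flag comes down to clearing one bit in the Characteristics field of the section header. The sketch below uses the flag values from the PE/COFF format; the helper names are made up for illustration, and the post does not state the actual Characteristics values in gpt_loader.sys, so any concrete input value is an assumption.

```c
#include <stdint.h>

/* Section characteristic bits as defined by the PE/COFF format. */
#define IMAGE_SCN_CNT_CODE             0x00000020u
#define IMAGE_SCN_CNT_INITIALIZED_DATA 0x00000040u
#define IMAGE_SCN_MEM_DISCARDABLE      0x02000000u
#define IMAGE_SCN_MEM_EXECUTE          0x20000000u
#define IMAGE_SCN_MEM_READ             0x40000000u

/* Clear the discardable bit so that the section stays resident after
   the driver has been initialized. All other bits are left untouched. */
static uint32_t make_non_discardable(uint32_t characteristics)
{
    return characteristics & ~IMAGE_SCN_MEM_DISCARDABLE;
}

/* Flags one would typically give a new code section (an assumption,
   not taken from the patched driver): code, executable, readable. */
static uint32_t code_section_flags(void)
{
    return IMAGE_SCN_CNT_CODE | IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ;
}
```

For a .reloc section with the commonly seen value 0x42000040 (initialized data, discardable, readable), make_non_discardable returns 0x40000040.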

Next we have to ensure that our new buffer pointer gets initialized properly
with the target and that InputBuffer.DataBuffer is set up correctly to point to our
new buffer. Here is the original code where the buffer gets initialized:

.00015E92: 8945CC        mov    [ebp][-34],eax        ; InputBuffer.DataBuffer
.00015E95: 3BC3          cmp    eax,ebx
.00015E97: 7517          jne    .000015EB0
.00015E99: BE170000C0    mov    esi,0C0000017

We move this to a separate routine so that we can place a
call here:

.00015E92: E8694F0000    call   .00001AE00
.00015E97: 7517          jne    .000015EB0
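
The displacement of such a near call is relative to the address of the instruction that follows it. A small sketch for checking hand-assembled calls (the helper names are made up for illustration):

```c
#include <stdint.h>

/* Target of a 5-byte "call rel32" (opcode E8) located at addr.
   The displacement is little-endian and counted from the next instruction;
   unsigned wrap-around takes care of negative displacements. */
static uint32_t call_rel32_target(uint32_t addr, const uint8_t insn[5])
{
    uint32_t rel = (uint32_t)insn[1]
                 | ((uint32_t)insn[2] << 8)
                 | ((uint32_t)insn[3] << 16)
                 | ((uint32_t)insn[4] << 24);
    return addr + 5u + rel;
}

/* Displacement to encode for a call at addr that should reach target. */
static uint32_t call_rel32_disp(uint32_t addr, uint32_t target)
{
    return target - (addr + 5u);
}
```

For the call placed at 00015E92, the bytes E8 69 4F 00 00 decode to the target 0001AE00, which is where the new routine lives.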

Now there is one very important thing to consider: The routine we are patching is
a read/write routine, so we only need to do all that buffer copy magic on read,
not on write or we will be toast!
[ebp][-49] contains a flag that is set when reading and not set when writing.
We can use that.

In our new routine at 00001AE00:

.0001AE00: 385DB7         cmp    [ebp][-49],bl         ; Check if we want to read or write
.0001AE03: 8BD8           mov    ebx,eax               ; On write, use the MDL buffer in eax directly like it used to be
.0001AE05: 740C           je     .00001AE13            ; Jump on write, on read instead:
.0001AE07: 898570FFFFFF   mov    [ebp][-00000090],eax  ; Fill our stack variable with ptr to dest buffer
.0001AE0D: 8D9E54010000   lea    ebx,[esi][00000154]   ; On read, point to the buffer that we allocated in the class
.0001AE13: 895DCC         mov    [ebp][-34],ebx        ; Set InputBuffer.DataBuffer to class-buffer on read, to eax (direct MDL buffer) on write
.0001AE16: 33DB           xor    ebx,ebx               ; Restore abused ebx to 0
.0001AE18: 3BC3           cmp    eax,ebx               ; Do comparison we had to eliminate for CALL
.0001AE1A: C3             retn                         ; ...and back

Next comes the part IoStatusInformation+=InputBuffer.DataTransferLength at the
end of the loop, which needs to be adapted so that the content of the temporary
buffer can be copied to the I/O buffer from the MDL:

.00016001: 11559C        adc    [ebp][-64],edx
.00016004: 0FAF45A8      imul   eax,[ebp][-58]
.00016008: 0145CC        add    [ebp][-34],eax       ; InputBuffer.DataBuffer+=eax
.0001600B: 8B45C0        mov    eax,[ebp][-40]       ; eax=InputBuffer.DataTransferLength
.0001600E: 014594        add    [ebp][-6C],eax       ; IoStatusInformation+=eax
.00016011: 395DA0        cmp    [ebp][-60],ebx       ; nSectors==0?
.00016014: 0F87C6FEFFFF  ja     .000015EE0
.0001601A: EB32          jmps   .00001604E

ecx isn’t used for anything in this routine from this point on, so
we can reuse it as the counter for memcpy without saving it.
eax isn’t used anywhere else either, so we load the transfer length into ecx
instead of eax here and keep eax for incrementing the buffer pointer later
(as DataTransferLength can theoretically be < nSectorsRead*BytesPerSector
on incomplete reads, although that shouldn’t happen).
But we also need to skip InputBuffer.DataBuffer+=eax on read operations,
as we always read into the same temp buffer and only increment
the dest pointer we copy the memory to. Therefore we move the load of
InputBuffer.DataTransferLength up to overwrite the add and then call
our new routine at AE20:

.00016008: 8B4DC0        mov    ecx,[ebp][-40]       ; ecx=InputBuffer.DataTransferLength
.0001600B: 014D94        add    [ebp][-6C],ecx       ; IoStatusInformation+=ecx
.0001600E: E80D4E0000    call   .00001AE20           ; Call our new routine
.00016013: 90            nop                         ; Padding

In our new routine, we can finally do the copy.
As copying up to 0x1F000 bytes with rep movsb doesn’t look particularly fast,
it is better to use memcpy, which is located at 6CF2 in our driver version.
Unfortunately, the bRead flag shares a union with the USHORT AtaFlags, which gets set at
5ED3, so we have to test the ATA flags now:

Read = ATA_FLAGS_48BIT_COMMAND | ATA_FLAGS_USE_DMA | ATA_FLAGS_DATA_IN = 0x1A,
Write = ATA_FLAGS_48BIT_COMMAND | ATA_FLAGS_USE_DMA | ATA_FLAGS_DATA_OUT = 0x1C
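
The two values can be reproduced from the flag definitions. The constants below are the ones from the WDK header ntddscsi.h; the helper functions are only a sketch with made-up names.

```c
#include <stdint.h>

/* Flag values as defined in ntddscsi.h. */
#define ATA_FLAGS_DRDY_REQUIRED 0x01
#define ATA_FLAGS_DATA_IN       0x02
#define ATA_FLAGS_DATA_OUT      0x04
#define ATA_FLAGS_48BIT_COMMAND 0x08
#define ATA_FLAGS_USE_DMA       0x10

/* Flags the driver uses for a read (is_read != 0) or a write request. */
static uint16_t ata_flags_for(int is_read)
{
    return (uint16_t)(ATA_FLAGS_48BIT_COMMAND | ATA_FLAGS_USE_DMA |
                      (is_read ? ATA_FLAGS_DATA_IN : ATA_FLAGS_DATA_OUT));
}

/* Decide read vs. write from the flags by testing the DATA_IN bit. */
static int is_read_request(uint16_t flags)
{
    return (flags & ATA_FLAGS_DATA_IN) != 0;
}
```

The patched assembly compares the low byte against 0x1C instead, which is equivalent as long as the driver only ever produces these two combinations.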

.0001AE20: 8A5DB6        mov    bl,[ebp][-4A]        ; Fetch ATA Flags
.0001AE23: 80FB1C        cmp    bl,01C               ; Read or write?
.0001AE26: 7505          jne    .00001AE2D           ; Jump on read, on write do:
.0001AE28: 0145CC        add    [ebp][-34],eax       ; InputBuffer.DataBuffer+=eax
.0001AE2B: EB1A          jmps   .00001AE47           ; Do other missing stuff and back, on read do:
.0001AE2D: 50            push   eax                  ; Save eax to increment Dest ptr
.0001AE2E: 51            push   ecx                  ; Length = ecx
.0001AE2F: FF75CC        push   d,[ebp][-34]         ; Source = InputBuffer.DataBuffer
.0001AE32: FFB570FFFFFF  push   d,[ebp][-00000090]   ; Destination = Dest ptr to current loc in MDL
.0001AE38: E8B5BEFFFF    call   memcpy               ; ntoskrnl.exe
.0001AE3D: 83C40C        add    esp,00C              ; Fix the stack
.0001AE40: 58            pop    eax                  ; Restore eax
.0001AE41: 018570FFFFFF  add    [ebp][-00000090],eax ; Update Dest ptr to current loc in MDL
.0001AE47: 33DB          xor    ebx,ebx              ; Restore ebx
.0001AE49: 395DA0        cmp    [ebp][-60],ebx       ; nSectors == 0? (we deleted this with our call)
.0001AE4C: C3            retn                        ; ...and back

Patch

Of course, use this at your own risk; I do not guarantee anything, but for me
this fixes the bug and the driver now works flawlessly 🙂
I wrote a little patcher that
patches the driver accordingly. Just run it and,
if it patched successfully, reboot the system to load the fixed version of the
driver.
Feel free to try it, and if you are also suffering from this problem, you can
leave a comment on whether this actually fixes it for you too.

If you haven’t done so already, also apply the first patch, which fixes a
BSOD problem. Every patch addresses one specific problem only, so this patch
doesn’t contain the fixes from the BSOD patch.

For those who use crappy antivirus programs like Antivir: don’t get fooled by the generic signature match TR/Downloader.gen (which is really stupid, as the executable doesn’t even call any Internet functions, so how should it download anything?). You can check with Virustotal.
If you have such an antivirus program, use this build instead, which is a larger executable but isn’t subject to false positives.


Comments

  1. Isidro - 07/6/2015 at 03:50

    I have two GPT_LOADER, one that comes with Hitachi HD (although it does work without an Hitachi), and the one in Paragon. Patch1 solved BSOD. Patch2 caused read errors (without GPT drives): Chipset intel B85, two 2TB drives using standard partitions. Returning to non Patch2 solved read errors. Will test the non Hitachi GPT_LOADER and report later.

  2. Isidro - 07/8/2015 at 05:54

    Non Hitachi GPT_loader works ok with both patches.
    Only annoyance (of all versions) is the BSOD when one tries to disable an HD from Device Manager, and the impossibility of detecting a newly hot-plugged HD. (Detect new hardware does not detect it if gptloader is installed.)

  3. Atlan - 07/8/2015 at 18:59

    GPT Loader is a SCSI driver and works only in PATA mode, not in SATA mode or RAID. PATA doesn’t support hot plugging for HDs. If you switched the SATA controller mode in the BIOS from SATA (AHCI) to PATA compatible, you can’t change HDs while the system is running if they use the GPT driver. That means hard disks bigger than 2TB, or smaller than 2TB but with a GPT partition. A 2TB hard disk with MBR shouldn’t result in a BSOD when disabling it. If you can use different settings per controller, only the controller set to PATA will result in a BSOD. That is normal. Only SATA supports hot plugging.

    The Hitachi GPT Loader is a special version from Paragon for Hitachi. There are some differences only Hitachi knows. Paragon doesn’t offer support for the Hitachi version of GPT Loader.

  4. Atlan - 07/8/2015 at 19:05

    With the GPT Loader from Paragon Disk Manager Professional 12, both patches work great here now. Really good work. All my tests were successful. No bugs or errors during copy transfers. Files were always binary identical in my tests. No problems with system performance anymore. Thanks.

  5. DCT - 10/18/2015 at 13:03

    Hello! I’ve found another issue using gpt_loader – BSOD 0x00000035 NO_MORE_IRP_STACK_LOCATIONS
    It seems to happen when there are many HDD/Flash drives connected to the motherboard and there are read/write operations on a 3TB HDD. The bug is severe on my configuration when there are 7 HDDs connected but doesn’t happen when there are only 2-3 HDDs.
    BSOD is caused by PartMgr.sys+921 (according to BlueScreenView).
    The current way to fix it is to replace disk.sys and partmgr.sys with the files from Windows Server 2003 SP2 (the server OS is more tolerant of many HDDs; the files can be unpacked from the SP2 update for Windows Server 2003 available on the Microsoft site). But replacing the original Windows XP files is not quite a good idea.
    Is there any smart way to fix this bug? For example, is it possible to change the maximum size of the IRP stack of PartMgr.sys? Or could it be fixed by patching gpt_loader.sys to use fewer IRPs?

  6. DCT - 10/18/2015 at 14:37

    > Only annoyance (of all versions), is the BSOD when one tries to disable an HD from Device Manager, and the impossibility to detect a new HD hot plugged. (Detect new hardware does not detect it if gptloader installed).

    See my reply in 1st patch topic how to fix it.
