Dissecting Windows Section Objects


The post has been improved with the corrections kindly reported by Andrea Allievi.

Instead of introduction

We can't imagine Windows without section objects (file mapping objects, in Windows API terms), and it is hard to find a Windows kernel subsystem that doesn't rely on them. The great idea behind section objects is that instead of calling the Windows file APIs to work with a file, you can read virtual memory to get file data and write virtual memory to change it. But this simple concept isn't simple under the hood. To keep this difficult topic manageable, we'll stick to the x86 edition of Windows with 32-bit pointers.

Don't worry if you can't follow everything; even skilled Windows Internals readers may have difficulties with this topic. I recommend reading the corresponding chapter of the Windows Internals book, because this blog post is dense with technical detail and describes fairly low-level things.

The basic terms

So if you're ready, let's get started. First, we need to take a quick look at some technical terms, because we can't get the full picture without understanding each of them. After that we'll focus on each of them in detail.

Section object - a kernel object described by the _SECTION structure. In Windows API terms it's called a file mapping object. There are two types of section objects: pagefile-backed sections and file-backed sections. The first type is used when processes want to share a region of virtual memory. A file-backed section reflects the contents of an actual file on disk.
Virtual Memory Manager (VMM) - a set of Mm functions in ntoskrnl that are responsible for all operations related to virtual and physical memory. The VMM also creates, maintains and deletes section objects as well as their substructures (see below).
I/O manager - in the context of our topic, these are the Io functions in ntoskrnl that the VMM uses to perform I/O operations on the mapped file data. This subsystem just initiates I/O operations, which are actually performed by file system drivers and disk drivers on device stacks.
PTE (Page Table Entry) - a structure that is used by the CPU and the VMM to translate virtual addresses to physical ones.
Proto-PTE (Prototype PTE, PPTE) - a special type of invalid PTE that is used only by the VMM (not the CPU) to work with section objects. It serves as an intermediate level for translating virtual addresses to the mapped section pages (file data). A PPTE points to a subsection and helps the VMM find the file data that should be located in the corresponding virtual memory pages.
PTE pointing to PPTE - a special type of hardware-invalid PTE (with the valid (V) flag zeroed) that is designed to find the corresponding PPTE in the Segment structure (PPT).
Prototype page table (PPT) - an array of PPTEs that is a part of the Segment structure. Once the process maps a section, the VMM fills the hardware PTEs of the virtual pages with pointers to the elements of this array. When the process unmaps a section, the VMM removes the pointers to PPTEs from the hardware PTEs.
Segment - a data structure that provides the section object with the information needed to calculate pointers to subsections; it also contains the PPT.
Segment Control Area (or just Control Area, CA) - a structure containing the information required to perform I/O operations that read or write data of the mapped file. It's stored in the nonpaged pool. With the help of CAs the VMM can address the same file both as binary and as executable.
Subsection - a data structure containing the information needed to calculate offsets relative to the beginning of the mapped file using PPTEs. There is normally only one subsection if the file was mapped as binary. If it was mapped as executable, there is one subsection per section of the mapped executable, plus one for the PE header.
Page fault (#PF) for section - a situation (an exception) when a thread tries to access a virtual page mapped to the section, but its PTE is marked as not valid.
Modified page writer - system threads that are responsible for synchronizing modified file data in virtual memory with the file on disk.
Page Frame, Page Frame Number, PFN database - terms describing physical memory: a physical memory page, its number, and the database of these numbers. The latter includes information about all physical memory pages (page frames) and is designed to track the status of each of them.

Diving deeper into the Section kernel objects

A section is a kernel object that is created and maintained by the VMM. The MmCreateSection function creates the kernel object, allocating memory for it from the paged pool, initializes its fields, and creates the Control Area and Segment structures if needed (see MiCreateImageFileMap, MiCreateDataFileMap). To create an object, the caller of MmCreateSection must provide a pointer to a FileObject that describes the file to be mapped. Using the FileObject, the functions mentioned above initialize the Control Area and Segment structures.
MmCreateSection is responsible not only for initializing a Section object, but also for initializing and maintaining the important SECTION_OBJECT_POINTERS structure that FILE_OBJECT->SectionObjectPointer points to. You can see its definition below.

typedef struct _SECTION_OBJECT_POINTERS {
    PVOID DataSectionObject;
    PVOID SharedCacheMap;
    PVOID ImageSectionObject;
} SECTION_OBJECT_POINTERS;

.DataSectionObject points to the Control Area structure if the file is mapped as binary;
.ImageSectionObject points to the Control Area structure if the file is mapped as executable;
.SharedCacheMap points to the shared cache map (see Inside the Windows Cache Manager). This field is used by the Cache Manager to cache file data.
As you can see, all three fields point to the structures needed to perform a certain type of file operation. The SECTION_OBJECT_POINTERS structure is created by the FSD when it gets a request to create (open) a file. The Cache Manager deals with .SharedCacheMap. Even if there are no sections for the file object (i.e. .DataSectionObject and .ImageSectionObject are NULL), .SharedCacheMap is almost always initialized (for disk files), because the Cache Manager caches parts of the file to provide quick access to its data. To create .DataSectionObject and .ImageSectionObject, the VMM uses the functions MiCreateDataFileMap and MiCreateImageFileMap.
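As a rough illustration of how these three pointers are used, the sketch below models looking up an existing Control Area by mapping type. The MAPPING_TYPE enum and the LookupControlArea helper are hypothetical names invented for this example (they are not kernel definitions), and PVOID is redefined locally so the snippet compiles outside the WDK.

```c
#include <stddef.h>

typedef void *PVOID;

/* Same layout as the structure shown above. */
typedef struct _SECTION_OBJECT_POINTERS {
    PVOID DataSectionObject;   /* Control Area for the binary (data) mapping */
    PVOID SharedCacheMap;      /* owned by the Cache Manager */
    PVOID ImageSectionObject;  /* Control Area for the executable (image) mapping */
} SECTION_OBJECT_POINTERS;

/* Hypothetical helper type, not a kernel definition. */
typedef enum { MAP_AS_DATA, MAP_AS_IMAGE } MAPPING_TYPE;

/* Returns the Control Area that already exists for the requested mapping
   type, or NULL if the file has not been mapped that way yet (in which
   case the VMM would have to build one). */
static PVOID LookupControlArea(const SECTION_OBJECT_POINTERS *sop,
                               MAPPING_TYPE type)
{
    if (sop == NULL)
        return NULL;
    return (type == MAP_AS_IMAGE) ? sop->ImageSectionObject
                                  : sop->DataSectionObject;
}
```

The point of the model is only that a data mapping and an image mapping of the same file are tracked through two independent pointers, so one can be NULL while the other is set.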

NTSTATUS MmCreateSection(
    OUT PVOID *SectionObject,
    IN ACCESS_MASK DesiredAccess,
    IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,
    IN PLARGE_INTEGER MaximumSize,
    IN ULONG SectionPageProtection,
    IN ULONG AllocationAttributes,
    IN HANDLE FileHandle OPTIONAL,
    IN PFILE_OBJECT File OPTIONAL)

These arguments have the same meaning as the corresponding arguments of NtCreateSection.

Take a look at the Control Area structure

The segment control area (or just Control Area, CA) is a structure containing the information necessary to perform I/O operations with a section. It's stored in the nonpaged pool and is described by the following structure.
The Control Area holds:
Pointer to a Segment containing information from the PE file header and a PPTE array.
Pointer to a File Object describing the mapped file that will be used for I/O operations.
An array of subsections, located right after the CA structure in virtual memory, containing the data needed to calculate file offsets.
The Control Area structure contains flags that indicate what kind of data is addressed by the section. When the VMM creates a CA object for an executable file using MiCreateImageFileMap, its size is equal to the size of the CA structure plus the size of one Subsection structure multiplied by the number of subsections (i.e. the number of PE sections + 1 for the PE header). It's important to note that, for an image file map, all _SUBSECTION structures are located immediately after the Control Area and their number is stored in the NumberOfSubsections field. The subsections of one section (Control Area) are linked into a list via .NextSubsection. The !ca command of WinDbg prints information about a Control Area.
We can also explore these structures manually for the first three subsections.
Further, we'll discuss this output in more detail.

As mentioned earlier, the FILE_OBJECT structure references a very important structure called _SECTION_OBJECT_POINTERS. This structure addresses two CAs, one for the binary mapping type and one for the case when the file is mapped as executable (the same file can be mapped as both binary and executable). These CAs point to different Segments with their own PPTE tables. The structure is maintained by the FSD.

Subsections are allocated in virtual memory immediately after the CA structure. For example, if the Control Area describes an executable view, then
ControlArea = ExAllocatePoolWithTag (NonPagedPool, sizeof(CONTROL_AREA) + (sizeof(SUBSECTION) * SubsectionsAllocated), 'iCmM');
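To make the layout concrete, here is a minimal user-mode sketch of the same idea: one allocation holds the Control Area header followed by its subsections. The structures are heavily simplified stand-ins (the real CONTROL_AREA and SUBSECTION have many more fields), calloc replaces the nonpaged pool allocation, and the helper names are made up for this example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins, not the kernel definitions. */
typedef struct _SUBSECTION {
    struct _CONTROL_AREA *ControlArea;
    struct _SUBSECTION   *NextSubsection;
    unsigned long         StartingSector;
    unsigned long         PtesInSubsection;
} SUBSECTION;

typedef struct _CONTROL_AREA {
    void         *Segment;
    void         *FilePointer;
    unsigned long NumberOfSubsections;
} CONTROL_AREA;

/* Size of one block holding the header plus all subsections. */
static size_t ControlAreaAllocationSize(unsigned long subsections)
{
    return sizeof(CONTROL_AREA) + sizeof(SUBSECTION) * (size_t)subsections;
}

/* The first subsection starts right behind the CONTROL_AREA header. */
static SUBSECTION *FirstSubsection(CONTROL_AREA *ca)
{
    return (SUBSECTION *)(ca + 1);
}

/* Allocates the block and links the subsections via NextSubsection. */
static CONTROL_AREA *AllocateControlArea(unsigned long subsections)
{
    CONTROL_AREA *ca = calloc(1, ControlAreaAllocationSize(subsections));
    if (ca == NULL)
        return NULL;
    ca->NumberOfSubsections = subsections;
    SUBSECTION *sub = FirstSubsection(ca);
    for (unsigned long i = 0; i < subsections; i++) {
        sub[i].ControlArea = ca;
        sub[i].NextSubsection = (i + 1 < subsections) ? &sub[i + 1] : NULL;
    }
    return ca;
}
```

Because the subsections sit in the same block, walking them needs nothing but the Control Area pointer: start at FirstSubsection(ca) and follow NextSubsection until it is NULL.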

A few words about Subsections

A subsection (_SUBSECTION) is a data structure containing the information needed to calculate file offsets for the mapped file using the PPTEs. For a binary mapping there's only one subsection, but if the file is mapped as executable, there are as many subsections as there are sections in the executable (plus one for the PE header). Since all the PTEs describing one subsection have the same page protection bits (copy-on-write, read-only, etc.), it is logical to maintain one data structure for all of them. This data structure is called a Subsection. Every PPTE points to its corresponding subsection, for both binary and executable mapping types. Moreover, the subsection contains the starting sector of the PE section. It's taken from the PE header as Raw_section_offset/SECTOR_SIZE. The subsection also stores a pointer to the first PPTE in the segment's PPTE table and the number of PTEs for this subsection (i.e. the number of virtual pages for this PE section, its VirtualSize rounded up to a multiple of PAGE_SIZE). Having the address of the subsection structure (executable mapping type), we can easily calculate the offset in the PE file that a given PPTE describes, using the distance between the base PTE and the current one. If Pte is a pointer to a PPTE, the formula is:

((((PUCHAR)Pte - (PUCHAR)Subsection->SubsectionBase) / sizeof(PTE)) << PAGE_SHIFT) + Subsection->StartingSector * SECTOR_SIZE

or for x86

((((PUCHAR)Pte - (PUCHAR)Subsection->SubsectionBase) / 4) << 12) + Subsection->StartingSector * SECTOR_SIZE

If Subsection is a pointer to the subsection, then the first PTE that describes it is FirstPte = &Subsection->SubsectionBase[0], and its (exclusive) boundary is LastPte = &Subsection->SubsectionBase[Subsection->PtesInSubsection]. I.e. if Pte is the address of a PPTE that belongs to this subsection, then &Subsection->SubsectionBase[0] <= Pte < &Subsection->SubsectionBase[Subsection->PtesInSubsection].
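The offset formula and the bounds check can be sketched in plain C as follows. The SUBSECTION structure here is reduced to the three fields the formula needs, and the function names are made up for this example. Note the explicit parentheses around the shift: in C, << binds less tightly than +, so the shift has to be grouped before the sector offset is added.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12u     /* 4 KB pages on x86 */
#define SECTOR_SIZE 0x200u  /* 512-byte sectors, as assumed in the text */

typedef uint32_t MMPTE;     /* 4-byte PTE on x86 without PAE */

/* Reduced to the fields used by the formula. */
typedef struct _SUBSECTION {
    MMPTE   *SubsectionBase;     /* first PPTE of this subsection */
    uint32_t PtesInSubsection;   /* number of PPTEs */
    uint32_t StartingSector;     /* start of the data in the file, in sectors */
} SUBSECTION;

/* FirstPte <= pte < LastPte */
static bool PteBelongsToSubsection(const SUBSECTION *sub, const MMPTE *pte)
{
    return pte >= sub->SubsectionBase &&
           pte <  sub->SubsectionBase + sub->PtesInSubsection;
}

/* File offset described by a PPTE of this subsection. Pointer subtraction
   already divides the byte distance by sizeof(MMPTE). */
static uint64_t PpteToFileOffset(const SUBSECTION *sub, const MMPTE *pte)
{
    uint64_t pageIndex = (uint64_t)(pte - sub->SubsectionBase);
    return (pageIndex << PAGE_SHIFT) +
           (uint64_t)sub->StartingSector * SECTOR_SIZE;
}
```

For instance, a PPTE that lies 0x200 bytes (0x80 entries) past SubsectionBase in a subsection with StartingSector 0 maps to file offset 0x80000.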

Exploring the Segment structure

Unlike the Control Area structure, which is designed for performing I/O operations with a file, the Segment stores information about a PE file that was taken from its PE header. For a binary file, this data isn't used. The Segment also stores the Proto-PTE table (array), which addresses offsets from the beginning of the mapped file through the Subsection structures. For example, if the VMM needs to load file data from the mapped file into virtual memory, it takes the invalid hardware PTE that caused the page fault from the page table and uses it to locate the corresponding Proto-PTE entry in the Segment table. Next, using the Control Area structure and the calculated file offset, the VMM reads data from the file into virtual memory.

MmCreateSection creates segments using the following functions. This happens only if the file is mapped for the first time; otherwise the function gets a pointer to the existing segment via the FileObject. Note that no matter how many sections have been created for the file object, there's always only one Segment structure per type of mapping (binary, executable) for all of them. The same applies to Control Area structures: there's only one Control Area per type of mapping regardless of the number of created sections.

NTSTATUS MiCreateImageFileMap (IN PFILE_OBJECT File, OUT PSEGMENT *Segment)

NTSTATUS MiCreateDataFileMap (
    IN PFILE_OBJECT File,
    OUT PSEGMENT *Segment,
    IN PUINT64 MaximumSize,
    IN ULONG SectionPageProtection,
    IN ULONG AllocationAttributes,
    IN ULONG IgnoreFileSizing)

As you can see, MiCreateImageFileMap accepts fewer arguments, because it reads all the necessary information from the PE header of the executable file to be mapped. The other arguments are described in the NtCreateSection documentation.

The following structure describes Segment.
ControlArea - pointer to the corresponding CA.
TotalNumberOfPtes - roughly mapped_file_size/PAGE_SIZE.
SizeOfSegment - size of the structure in bytes. MiCreateImageFileMap calculates it as SizeOfSegment = sizeof(SEGMENT) + (sizeof(MMPTE) * ((ULONG)TotalNumberOfPtes - 1)) + sizeof(SECTION_IMAGE_INFORMATION).
PrototypePte - pointer to an array of PPTEs. In fact, it's NewSegment->PrototypePte = &NewSegment->ThePtes[0].
ThePtes - an array of PPTEs, the prototype page table.
Perhaps the following image gives you a better understanding.
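The SizeOfSegment calculation can be tried out with a simplified model. The "- 1" in the formula is there because the SEGMENT structure already embeds the first element of ThePtes. The SEGMENT and SECTION_IMAGE_INFORMATION definitions below are placeholders that keep only what the size formula needs (their real layouts differ), and ImageSegmentSize is a name invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t MMPTE;  /* 4-byte PTE on x86 without PAE */

/* Placeholder: the real structure holds data taken from the PE header. */
typedef struct _SECTION_IMAGE_INFORMATION {
    uint32_t Placeholder[12];
} SECTION_IMAGE_INFORMATION;

/* Simplified SEGMENT keeping only the fields listed above. */
typedef struct _SEGMENT {
    void    *ControlArea;
    uint32_t TotalNumberOfPtes;
    uint32_t SizeOfSegment;
    MMPTE   *PrototypePte;
    MMPTE    ThePtes[1];   /* first element of the variable-length PPTE array */
} SEGMENT;

/* Mirrors the SizeOfSegment formula; totalNumberOfPtes must be at least 1. */
static size_t ImageSegmentSize(uint32_t totalNumberOfPtes)
{
    return sizeof(SEGMENT)
         + sizeof(MMPTE) * ((size_t)totalNumberOfPtes - 1)
         + sizeof(SECTION_IMAGE_INFORMATION);
}
```

Each additional page of the mapped image therefore costs exactly one more 4-byte PPTE in the segment.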

Behind the curtain of Section PTEs

As mentioned many times above, PPTEs and the hardware PTEs pointing to them are the key to properly understanding virtual address translation for mapped sections. The difference between them is that the former are stored in the Segment object, while the latter live in the process's page table. Both can be in two major states, valid and invalid (the P bit in the structure). A zeroed bit means that the mapped page is absent from physical memory and signals the VMM that its content should be read from disk. If the P bit is set, the virtual page is resident in physical memory and no additional action is required from the VMM. An invalid PTE has a flag signaling that it points to a PPTE, i.e. belongs to a memory mapped file. Once a thread tries to access an invalid memory page, a page fault exception occurs and the VMM exception handler analyzes the PTE to learn what kind of page it describes. There are several types of invalid PTEs, but we won't discuss them here. Also note that for a resident virtual page the VMM stores a pointer to the PPTE and its value in the PFN database. Let's take a look at the format of these structures. You can see the format of the PTE pointing to a PPTE in the following picture.


Once you have the prototype index (PrototypeIndex), you can calculate the PPTE address with this formula: PrototypePteAddress = MmPagedPoolStart + (PrototypeIndex << 2).
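A hedged sketch of this decoding step follows. The bit positions are an assumption: they follow the commonly documented x86 non-PAE _MMPTE_PROTOTYPE layout (Valid in bit 0, ProtoAddressLow in bits 1-7, ReadOnly in bit 8, WhichPool in bit 9, Prototype in bit 10, ProtoAddressHigh in bits 11-31), so check them against dt nt!_MMPTE_PROTOTYPE on your target build. All function names are invented for the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* ASSUMED x86 non-PAE layout of a PTE pointing to a PPTE:
   bit 0 Valid, bits 1-7 ProtoAddressLow, bit 8 ReadOnly, bit 9 WhichPool,
   bit 10 Prototype, bits 11-31 ProtoAddressHigh. */
#define PTE_VALID_BIT      0x00000001u
#define PTE_PROTOTYPE_BIT  0x00000400u

/* True for a hardware-invalid PTE that is flagged as pointing to a PPTE. */
static bool IsPrototypeFormatPte(uint32_t pte)
{
    return (pte & PTE_VALID_BIT) == 0 && (pte & PTE_PROTOTYPE_BIT) != 0;
}

/* Joins the low 7 bits and the high 21 bits into one 28-bit index. */
static uint32_t PrototypeIndex(uint32_t pte)
{
    uint32_t low  = (pte >> 1) & 0x7Fu;
    uint32_t high = pte >> 11;
    return (high << 7) | low;
}

/* PrototypePteAddress = MmPagedPoolStart + (PrototypeIndex << 2) */
static uint32_t PrototypePteAddress(uint32_t pagedPoolStart, uint32_t pte)
{
    return pagedPoolStart + (PrototypeIndex(pte) << 2);
}

/* Inverse of PrototypeIndex, handy for experiments. */
static uint32_t MakePrototypeFormatPte(uint32_t index)
{
    return ((index >> 7) << 11) | PTE_PROTOTYPE_BIT | ((index & 0x7Fu) << 1);
}
```

The shift by 2 reflects the 4-byte size of an x86 PPTE: the index counts entries, not bytes.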

Below you can see PPTE format.


SubsectionAddress = MmSubsectionBase + (SubsectionIndex << 3), where SubsectionIndex is the index stored in the PPTE. MmSubsectionBase is usually equal to MmNonPagedPoolStart, because the WhichPool bit is usually set to 1.

Note that the PTE format in the prototype array can vary and is not always a _MMPTE_SUBSECTION.

Now, using our knowledge, we can put all the pieces together and make a complete picture of the actions for getting file data when a thread tries to access a virtual page belonging to a mapped file.
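The overall flow can be imitated with a toy user-mode model: a view of a few pages, one subsection, and a flag per PPTE saying whether the page is already resident. Everything here (TOY_SECTION, ToyReadByte, the in-memory "file") is hypothetical and only mirrors the order of steps described in this post: find the page, check the PPTE, compute the file offset through the subsection, read the page, and mark the PPTE valid.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   0x1000u
#define SECTOR_SIZE 0x200u
#define VIEW_PAGES  4u

/* Toy model of one data-mapped file with a single subsection. */
typedef struct {
    const uint8_t *FileData;                      /* stands in for the file on disk */
    uint32_t       StartingSector;                /* subsection start, in sectors */
    int            PpteValid[VIEW_PAGES];         /* 1 = page already resident */
    uint8_t        Frames[VIEW_PAGES][PAGE_SIZE]; /* "physical" pages */
    unsigned       ReadsIssued;                   /* simulated disk reads */
} TOY_SECTION;

/* Simulates a read of byte `offset` inside the mapped view.
   The caller keeps offset below VIEW_PAGES * PAGE_SIZE. */
static uint8_t ToyReadByte(TOY_SECTION *s, uint32_t offset)
{
    uint32_t page = offset / PAGE_SIZE;   /* which PTE would fault */
    if (!s->PpteValid[page]) {
        /* PPTE is invalid: derive the file offset and "read from disk". */
        uint32_t fileOffset = page * PAGE_SIZE +
                              s->StartingSector * SECTOR_SIZE;
        memcpy(s->Frames[page], s->FileData + fileOffset, PAGE_SIZE);
        s->PpteValid[page] = 1;
        s->ReadsIssued++;
    }
    return s->Frames[page][offset % PAGE_SIZE];
}
```

Reading two bytes from the same page triggers only one simulated disk read, which is the whole point of demand paging for mapped files: the I/O happens on the first touch, not when the view is created.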

A little practice

Let's get to the Proto-PTE table. Take a random process, dump its basic information and go to the table.
We can go a bit deeper and calculate the offsets manually. To explore these structures it's better to take information from the cache slots, because for ordinary user-mode processes the kernel can delay the creation of the Proto-PTE table until a thread addresses the mapped file data. I got a list of the cache slots on my system and selected the one describing the registry hive file NTUSER.DAT. Since it's a data file, there's only one subsection for its Control Area.
Now we can calculate the file offset starting from which the file is mapped to the cache slot using this formula.

FileOffset_LSN = ((((PUCHAR)Pte - (PUCHAR)Subsection->SubsectionBase) / 4) << 12) + Subsection->StartingSector * SECTOR_SIZE

(0xE15B7208 - 0xE15B7008) / 4 * 0x1000 + 0 = 0x80000; you can see this value in the VACB structure above (Offset: 0x00080000).

Here's another example.
Now let's look at a more interesting case with PE files, whose Control Area has more than one subsection (one Subsection per PE section). We can simplify our task and skip the first steps, starting with the Control Areas. The !memusage command can help us.
We can see the addresses of the Control Area structures in the first column. Print it for ole32.dll.
For clarity, copy the results to the table. If we open the PE file in the Cerbero PE Insider tool, we'll see that it has five sections. Note that the first subsection is allocated for the PE header.
We can see that subsections 2 and 3 are executable and match the PE sections .text and .orpc. This means that they address the PPTEs of code sections. The 4th subsection describes global data and has copy-on-write protection. The rest are available for read access only.
The first subsection describes the file header and starts at offset 0. On disk, the PE header fits in two sectors, but after it's mapped to virtual memory, it occupies one virtual page and, therefore, can be described by one PPTE.
The second subsection describes the first code section of the PE file. It starts at the second sector (0x400 / SECTOR_SIZE == 2). The virtual size of this section is 0x11EF5E; rounding it up to a multiple of the page size gives 0x11EF5E + 0xA2 = 0x11F000, and 0x11F000 / PAGE_SIZE = 0x11F. This value matches the number of PTEs in the subsection. We can calculate the number of sectors for the section from the raw size in the header: 0x11F000 / 0x200 = 0x8F8.
The third subsection also contains code and starts at sector 0x11F400 / 0x200 = 0x8FA. The size is 0x6000 bytes (in this case we take the physical size, as it's larger than the virtual one), 0x6000 / 0x1000 = 6 PTEs.
The 4th subsection starts at 0x125400 / 0x200 = 0x92A, size 0x7000 / 0x1000 = 7 PTEs.
The 5th subsection starts at 0x12BA00 / 0x200 = 0x95D, size 0x2000 / 0x1000 = 2 PTEs.
The 6th subsection starts at 0x12D200 / 0x200 = 0x969, size 0xE000 / 0x1000 = 0xE PTEs.
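The arithmetic above is easy to reproduce with three small helpers. The function names are made up for this sketch; the constants are the 512-byte sector and 4 KB page sizes used throughout the post.

```c
#include <stdint.h>

#define PAGE_SIZE   0x1000u
#define SECTOR_SIZE 0x200u

/* Starting sector of a PE section from its raw file offset. */
static uint32_t StartingSector(uint32_t rawOffset)
{
    return rawOffset / SECTOR_SIZE;
}

/* Number of PTEs needed for a size, rounded up to whole pages. */
static uint32_t PtesForSize(uint32_t sizeInBytes)
{
    return (sizeInBytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

/* Number of sectors occupied by a raw size, rounded up to whole sectors. */
static uint32_t SectorsForSize(uint32_t rawSize)
{
    return (rawSize + SECTOR_SIZE - 1) / SECTOR_SIZE;
}
```

Feeding in the ole32.dll numbers from the list gives the same values: StartingSector(0x11F400) is 0x8FA and PtesForSize(0x11EF5E) is 0x11F.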
Let's check out the formula mentioned above in practice. Take the third subsection, which describes the ole32.dll section starting at offset 0x8FA (in sectors).

Dispatching #PF exceptions for mapped files

As we know, the I/O Manager and the VMM minimize performance overhead by performing most of their operations asynchronously and on demand. Many Windows developers may not be aware of this principle, because synchronous operations are the default in the Windows API, while for the Native and kernel APIs the situation is reversed. This principle also applies to section objects. When the MapViewOfFile Windows API returns control to the calling thread, it doesn't mean that these subsystems have copied the file data into virtual memory. Likewise, when a thread modifies the mapped file data in virtual memory, it doesn't mean that the changes are immediately flushed to the physical file. Instead, the VMM delays the actual I/O operation until a thread of the process tries to access the file data through virtual memory. Once that happens, a #PF exception occurs and the VMM initiates an I/O operation to read the file data into virtual memory. Most of the work in this case falls on the shoulders of the MiDispatchFault function.

There are several possible situations for sections describing file-backed data. Note that a section PPTE can be in the same states as a hardware PTE.
The PPTE is invalid and points to a subsection. In this case, the VMM needs to load the corresponding file data from disk into physical memory. The MiResolveMappedFileFault function is responsible for this, but the actual I/O operation is initiated by MiDispatchFault.
The PPTE is valid and points to a page frame. The VMM just needs to fill the hardware PTE with this frame number.
The PPTE is marked as copy-on-write.
MiDispatchFault calls MiResolveProtoPteFault passing it a pointer to PTE and PPTE. MiResolveProtoPteFault works with PPTE as well as with usual PTE, because PPTE can be in the same states as hardware PTE. The function starts by validating the PPTE, i e whether it's located in physical memory or not.
After checking the access rights to the page, the function checks for the case when the PPTE is marked as Demand Zero and its hardware PTE is marked as copy-on-write. In this case, the VMM resolves the fault by calling MiResolveDemandZeroFault and passing it a pointer to the real PTE. Next, MiResolveProtoPteFault makes the PPTE valid; before that, the PPTE can be in one of the following states: Demand Zero, Transition, Page File, Pagefile-backed, File-backed.
The MiResolveProtoPteFault and MiResolveMappedFileFault functions perform important steps: they reserve a page frame (physical page), initialize the corresponding entry in the PFN database, and prepare an MDL structure and a special ReadBlock structure for the subsequent disk read operation. You can see the entire process in the following diagram.
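The three situations listed above can be expressed as a small decision function. This is a deliberately simplified model, not the logic of MiDispatchFault: TOY_PPTE, FAULT_ACTION and ClassifyProtoFault are invented names, only two PPTE properties are modeled, and the assumption that copy-on-write matters only for a write access to an already resident page is ours.

```c
#include <stdbool.h>

typedef enum {
    RESOLVE_SHARE_RESIDENT_PAGE,  /* PPTE valid: put its frame number into the hardware PTE */
    RESOLVE_READ_FROM_FILE,       /* PPTE invalid, points to a subsection: page in from the file */
    RESOLVE_COPY_ON_WRITE         /* write to a copy-on-write page: private copy needed */
} FAULT_ACTION;

/* Hypothetical, reduced view of a PPTE. */
typedef struct {
    bool Valid;        /* PPTE points to a page frame */
    bool CopyOnWrite;  /* page protection is copy-on-write */
} TOY_PPTE;

/* Picks the action for a fault on a page backed by the given PPTE. */
static FAULT_ACTION ClassifyProtoFault(TOY_PPTE ppte, bool isWrite)
{
    if (!ppte.Valid)
        return RESOLVE_READ_FROM_FILE;       /* data must come from disk first */
    if (isWrite && ppte.CopyOnWrite)
        return RESOLVE_COPY_ON_WRITE;
    return RESOLVE_SHARE_RESIDENT_PAGE;
}
```

In this model a read fault on a resident page is the cheap case: no I/O is needed, the hardware PTE just receives the frame number already recorded in the PPTE.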

Inside the Page Writers

In the last part of this blog post we're gonna discuss the Mapped Page Writer subsystem (thread), which is a part of the Modified Page Writer subsystem (or just thread). At this point we already know how the VMM and the I/O manager read mapped file data into a process's virtual memory in order to provide access to it. But what about writing file data? As mentioned above, the VMM minimizes performance overhead by performing operations that involve disk I/O on demand. That's why the actual read operation on a memory mapped file only happens when a thread tries to access a file view and not when MapViewOfFile is executed.

The VMM has two system threads called MiModifiedPageWriter and MiMappedPageWriter. In fact, MiModifiedPageWriter just creates MiMappedPageWriter thread and shifts the rest of work to MiModifiedPageWriterWorker. These last two functions (MiModifiedPageWriterWorker and MiMappedPageWriter) are two infinite loops that can be called Modified Page Writer, because they implement all its functionality. The first one is responsible for gathering information about modified pages belonging to the page file (MiGatherPagefilePages) and about modified pages belonging to the mapped files (MiGatherMappedPages). It also adjusts the frequency of the flushing operations or how often modified pages will be written to disk. The second thread takes the information prepared by MiModifiedPageWriterWorker and performs the actual disk write operation (for mapped files).

The main part of MiModifiedPageWriterWorker is an infinite loop that waits on the MiMappedPagesTooOldEvent event. This event can be set under several circumstances and controls how often flushing is performed. To flush at a fixed time interval, the VMM uses a timer object and a DPC object (MiModifiedPageWriterTimerDpc), i.e. the DPC handler MiModifiedPageWriterTimerDispatch is called every time the timer expires. Since this handler is executed at the high IRQL DISPATCH_LEVEL (2, the scheduler level), the system ensures its operation in privileged mode. The MmInitSystem function initializes this object during system startup, and when MiModifiedPageWriterWorker needs to gather dirty memory pages for the first time, it sets the timer. The timer is set for 3 seconds.
I forgot to mention that physical memory pages (frames) that were modified since the section was mapped are called Dirty. The CPU sets the Dirty bit in the PTE on the first write operation to the page. Once the VMM has processed this page in a certain way, it resets the bit. Without it, the VMM wouldn't be able to track the changes made by the thread on the mapped page and synchronize them with the file data on disk.
To make flushing cheaper, MiMappedPageWriter doesn't flush every dirty page separately. Instead, MiGatherMappedPages gathers into one packet (MMMOD_WRITER_MDL_ENTRY) a set of dirty frames that belong to the same section and are adjacent to the dirty frame being processed. The MMMOD_WRITER_MDL_ENTRY structure describes a set of dirty PFNs that should be written to disk. In fact, the VMM uses two MDL lists, one for the paging file and another for sections. The pool of MDL entries for paging files is allocated in the NtCreatePagingFile function, which is responsible for creating page files. For memory mapped files, the corresponding list is MmMappedFileHeader.
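The clustering idea can be sketched as follows: starting from one known dirty page, extend the run in both directions while the neighbors are dirty too, up to a size limit. This is only an illustration of gathering adjacent dirty pages. DIRTY_RUN and GatherDirtyRun are hypothetical names, and the scan order and the maxPages limit are assumptions of this sketch, not details taken from MiGatherMappedPages.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t First;  /* index of the first page in the run */
    size_t Count;  /* number of consecutive dirty pages (0 = nothing to write) */
} DIRTY_RUN;

/* Expands around the dirty page `start` to cover adjacent dirty pages of
   the same section, limited to maxPages pages in total. */
static DIRTY_RUN GatherDirtyRun(const bool *dirty, size_t pages,
                                size_t start, size_t maxPages)
{
    DIRTY_RUN run = { start, 0 };
    if (start >= pages || !dirty[start] || maxPages == 0)
        return run;
    run.Count = 1;
    /* Grow backwards while the previous page is dirty. */
    while (run.First > 0 && dirty[run.First - 1] && run.Count < maxPages) {
        run.First--;
        run.Count++;
    }
    /* Grow forwards while the next page is dirty. */
    while (run.First + run.Count < pages &&
           dirty[run.First + run.Count] && run.Count < maxPages) {
        run.Count++;
    }
    return run;
}
```

One write request covering a run of several adjacent pages replaces several single-page writes, which is where the reduction in overhead comes from.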

As mentioned above, MiMappedPageWriter is responsible for initiating the disk write operation. Here's its pseudocode, with the details omitted.
Please take a look at the following diagram to understand the entire process.

However, a thread can force the VMM to write modified data to disk immediately. Windows API provides applications with the FlushViewOfFile function. The VMM's internal function MiFlushSectionInternal is responsible for flushing section data and calls IoAsynchronousPageWrite to perform disk write operation.

Instead of conclusion

Thank you for your attention, and I hope you enjoyed the blog post. Windows sections are quite a difficult topic, especially for beginners, because you need to have an idea of other Windows kernel subsystems to understand it properly.

If you have any comments or remarks, please let me know and feel free to contact me. I'm going to cover several other topics on Windows VMM internals, such as the PFN database, hyperspace and virtual address translation.