Implementing Plausible Crash Recovery

April 2nd, 2014 / Open Source / PLCrashReporter
By: landonf

Yesterday we announced Plausible Crash Recovery, a working crash recovery system built on top of PLCrashReporter. Upon a crash, the recovery implementation steps backwards from the crashing function, restoring non-volatile register state and returning nil to the original caller.

Despite the fact that it was released on April 1st and was, indeed, a prank, it does actually work. That doesn’t mean that using it is a good idea, though, and today I figured I’d explain how it works, why you don’t want to use it as-is, and where the underlying technology might actually be applicable.

If you haven’t already checked out the source and played with it yourself, give it a go. You can try plugging in some different crashing bugs of your own, and see how it behaves.

As a fair warning, I’m going to sacrifice some precision in my explanations below for the sake of overall clarity; there are a lot of details and edge cases that must be accounted for when implementing a crash reporter (or in this case, a crash recovery system), and if you’re interested in digging in further, feel free to stop by and chat with us on the freenode #plcrashreporter IRC channel.

April Fools

I know a number of folks assumed — like many of the other April Fools absurdities — that Plausible Crash Recovery didn’t actually work. Despite the fact that we bolted on a goofy UI, the code works as advertised, as befits a proper hack. The “restoration” UI, despite being totally unnecessary, even shows the actual steps taken to restore thread state; the only exception being the “Reticulating Splines” step – that part I made up.

Disaster Averted

The last time I was privy to a fun April Fools Day prank was back in the 90s, when some co-workers implemented a local man-in-the-middle attack on common stock ticker sites, proxying and adjusting the returned data for our .com to show a precipitous fall (or rise, I don’t recall which). What was fun about the prank wasn’t the actual effect it had on people; as I recall, nobody was seriously fooled, which was probably a good thing for all involved.

Rather, what made my coworkers’ prank fun (for me, anyway) was that it was a good hack. It was the kind of wacky technical implementation that you can do when you decide that it’s OK to break the rules and see what neat ideas come of it. It reminded me of the ethos that drove the fabled MacHack conferences, the source of gems such as Quinn “The Eskimo!”’s 2002 “Best Hack” winner, Firestarter, which demonstrated that just plugging in a FireWire cable was enough to allow DMA writes to the target computer’s video buffer (in this case, displaying flames at the bottom of the screen).

So when it struck me that PLCrashReporter actually had the tooling necessary to implement a bad clone of a bad Visual Basic feature, actually implementing Crash Recovery seemed like a good April 1st hack — in the classic meaning of a hack.

Of course, like most hacks, the fact that it mostly works doesn’t mean you should actually use it.

Rolling Back Time

I often think of PLCrashReporter itself as a “time machine debugger” — it ideally provides a view into the past that can be used to reconstruct the state of the process and debug a long-past failure. Crash Recovery takes this time machine metaphor much further — using PLCrashReporter’s async-safe stack unwinding to step backwards from the crashing function, restoring non-volatile register state and returning nil to the original caller.

To understand how this works, we first need to understand the state that represents a thread of execution, and what parts of it must be rewound to return to the original caller — as well as what can’t be rewound, but really should be.

For any given process, if you were to pause it at a moment in time (say, when a crash occurs), the crashed function’s execution state would be fully encapsulated in:

  • Global State (including the heap, file descriptors, memory mappings, etc …)
  • Thread Stack
  • CPU Register State

If we want to restart execution in the crashed function’s caller, we need to work backwards from the current process state to restore as much of the caller’s previous state as we can.

Global State

Global state includes (but certainly isn’t limited to) the heap, file descriptors, shared data structures, and even the process’ current working directory. Any part of global state that is changed during execution of the crashing function represents modified state that must be rolled back if we wish to perfectly restore the thread to its pre-crashed state.

Unfortunately, restoring global state is a non-starter: there’s no way for us to know what was changed. For example, if the crashed thread has corrupted the heap (or it was already corrupt), we can’t restore the heap to a non-corrupted state, and the application will likely just crash again. However, there are plenty of crashes that don’t involve corruption or mutation of global state, in which case we don’t need to restore any global state to allow execution to continue in the caller.

In other words, despite this limitation, we can actually recover from a large number of common crashes without having the facility to roll back global state. Of course, the less mutable shared state you use, the more recoverable your crashes are. Funny as that might sound in the context of an April Fool’s hack, that’s actually the principle behind the “fail fast” semantics often supported in functional programming languages. A failed thread can simply be discarded if there’s no chance it will leave behind partially modified shared state, and the process state will remain fully consistent.

Thread Stack

The thread’s stack maintains state for each called subroutine via a series of stack frames. At the time of the crash, the current stack frame is represented via the following state:

  • The stack pointer points to the current top of the stack. Any new stack allocations will likewise adjust the stack pointer.
  • The frame pointer usually (depending on the architecture, calling conventions, and emitted code) points to a fixed stack allocation that is at a fixed offset from the caller’s original stack pointer, and is used to store the caller’s return address and original frame pointer.
  • The return address is the address to which the called function should return upon completion. Depending on the architecture and calling conventions, this may be stored in a register (as it is on ARM), or may be stored on the stack via the frame pointer (x86).

To restore the caller’s original stack, as well as to determine the code address at which we should restart execution to simulate a return, we need to derive the caller’s stack pointer, frame pointer, and return address from the current thread’s stack state. If we’re able to successfully determine those original values, then we’ll have successfully restored the stack, as well as the execution address.

Of course, if the crashed function smashes any of this data, or the caller’s stack frame, or some other critical data on the stack, we can’t actually recover reliably; should we try, we’ll likely just trigger a secondary crash.

While that’s the basic premise, the actual process of performing the stack unwinding is a bit tricky. On some architectures (including 32-bit iOS and 32-bit Mac OS X), the frame pointer is almost always stored in a fixed register, and can be easily fetched from the crashed thread’s register state. The caller’s original stack state can simply be directly fetched or computed from the frame pointer register.
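As a rough illustration of the frame-pointer case, here is a minimal sketch in C that pops a single frame from a simulated stack. It assumes an x86-style frame record (the saved frame pointer, with the return address stored directly above it) and models the stack as an array of words indexed by word address. The names frame_state and pop_frame are hypothetical and are not PLCrashReporter APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three values that describe where a frame is executing (toy model). */
typedef struct { uint64_t pc, sp, fp; } frame_state;

/* Assumed layout: stack[fp] holds the caller's frame pointer and
 * stack[fp + 1] holds the return address, so the caller's stack pointer
 * sits just above that two-word record. Returns 1 on success, 0 if the
 * frame record would fall outside the simulated stack. */
int pop_frame(const uint64_t *stack, size_t words, frame_state *st) {
    assert(stack != NULL && st != NULL);

    uint64_t fp = st->fp;
    if (fp >= words || words - fp < 2)
        return 0;                      /* smashed or bogus frame pointer */

    frame_state caller;
    caller.fp = stack[fp];             /* caller's saved frame pointer */
    caller.pc = stack[fp + 1];         /* address to resume at */
    caller.sp = fp + 2;                /* stack as it was before the call */
    *st = caller;
    return 1;
}
```

The bounds check stands in for the much more careful validation a real unwinder needs; a frame pointer that has been overwritten by the crashing function will happily point anywhere.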

On other architectures, however, things aren’t so simple. On Mac OS X x86-64, for example, there is no requirement that the frame pointer be saved in a machine register. Instead, additional unwind data is provided by the compiler; this data defines how the caller’s state may be restored from the current thread state: values may be computed from existing registers, existing stack values, as fixed offsets, or through a variety of other mechanisms. This relates to how we restore register state, and we’ll cover how this works in more detail below.

Register State

The crashed thread’s register state represents the processor’s execution state at the time of the crash. During execution, the crashed function may have overwritten some of the caller’s register values; since the crashed function will never have the opportunity to restore those overwritten values, restoring the caller’s state will require that we somehow determine:

  • Which registers are expected by the caller to have been preserved (i.e., callee-preserved registers).
  • Which of those registers have been modified and require restoration.
  • How to actually restore the original values for those registers.

To answer the first question, we simply need to look at the platform’s defined calling conventions. For Apple’s platforms, these are defined in the iOS ABI Function Call Guide and Mac OS X ABI Function Call Guide. The calling conventions define callee-preserved and caller-preserved registers:

  • Callee-preserved registers (or, non-volatile registers) must be preserved by the called function, if it overwrites the caller’s original register value(s). These are the registers that we must restore, if they’ve been overwritten.
  • Caller-preserved registers (or, volatile registers) must be preserved by the calling function, if it requires later access to those values. These registers may be freely overwritten, and do not need to be restored prior to returning to the caller.

This answers the first question, but we’re left with a conundrum: when execution stops in the middle of a crashing function, how do we know which non-volatile registers have been modified, and how do we know how to restore their original values?

Unfortunately, on Apple’s 32-bit platforms (ARM and i386), the answer is that we don’t. This information is not available, and we simply have to restore what stack state we can and hope that’s enough. Surprisingly, this actually works fairly often. It is, of course, a terrible idea, and one of many good reasons why “crash recovery” ought to be considered a hack, and not an actually usable product.

On Apple’s 64-bit platforms (x86-64 and ARM64), however, this information is provided via the same unwind data that allows us to pop the thread’s stack frame; we can interpret the unwind data at crash time to perform non-volatile register restoration.

Leveraging Unwind Data

Background: Exception Unwinding

We’ve already established that on 32-bit Apple platforms, our ability to unwind the stack is limited due to the lack of unwinding data. The reason for this actually has to do with how exceptions are handled on the platform. On 32-bit Apple systems, when a try/catch/finally block is declared, the current thread’s state is actually saved via setjmp() (or equivalent functionality), and pushed onto a per-thread stack of exception handlers; when it comes time to find an exception handler, the stack is popped until a matching handler is reached, and the equivalent of longjmp() is used to re-load that thread state, resuming execution.
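To make the setjmp()/longjmp() scheme concrete, here is a minimal sketch in C of a handler stack. It is only an illustration of the general technique, not the code Apple’s runtimes actually use: the names handler_stack, throw_error, and try_parse_digit are hypothetical, and a single static stack stands in for the per-thread one.

```c
#include <assert.h>
#include <setjmp.h>

#define MAX_HANDLERS 16

/* Stack of saved thread states, one entry per active "try" block.
 * A real runtime keeps one such stack per thread. */
static jmp_buf handler_stack[MAX_HANDLERS];
static int handler_depth = 0;
static int thrown_code = 0;

/* "throw": pop the innermost handler and reload its saved state. */
void throw_error(int code) {
    assert(handler_depth > 0);
    thrown_code = code;
    handler_depth--;
    longjmp(handler_stack[handler_depth], 1);
}

/* Some deeply nested work that may fail. */
static int parse_digit(char c) {
    if (c < '0' || c > '9')
        throw_error(1);
    return c - '0';
}

/* "try/catch": the state is saved on every entry, thrown or not. */
int try_parse_digit(char c, int fallback) {
    assert(handler_depth < MAX_HANDLERS);
    if (setjmp(handler_stack[handler_depth]) == 0) {
        handler_depth++;               /* handler is now active */
        int value = parse_digit(c);
        handler_depth--;               /* normal exit: pop the handler */
        return value;
    }
    /* Reached via longjmp(); throw_error() already popped the handler. */
    return fallback;
}
```

Note that the only frame that can be resumed is the one that called setjmp(); nothing here describes how to unwind the frames in between, which is exactly the limitation for debuggers and crash reporters.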

This approach has two downsides; first of all, there’s no way for a debugger or crash reporter to use the exception unwinding information to unwind arbitrary intermediate frames. The only time exception unwinding information is available is when a catch block is executed, and in that case, it’s only possible to restore the specifically saved thread state. Secondly, there is the issue of runtime cost. At each catch or finally statement, the thread state must be saved and pushed onto a stack, even if it’s never used.

The alternative approach, and what is used on Apple’s 64-bit platforms, is the use of so-called zero-cost exceptions. Rather than recording thread state at runtime, the compiler builds a lookup table that covers all code in an executable. This table defines how to accurately unwind a single frame from any valid instruction address, as well as providing language/runtime-specific definitions of where try/catch/finally blocks are defined, and how to handle them.

As a result, it’s not necessary to do any work at runtime if an exception is not thrown; hence the name “zero-cost exceptions”. If an exception is thrown, the language/exception runtime must consult the lookup table to correctly unwind the stack.

As it turns out, this is exactly the same information that debuggers, crash reporters, and evil crash recovery hacks need to perform their own stack unwinding.

Interpreting the Unwind Data

To correctly unwind a frame in our crash recovery system, we need to actually interpret the unwind data, and extract the rules necessary to calculate, load, or otherwise restore the caller’s original register and stack state.

Conceptually, it helps to think of the unwind data as being stored as a two-column table; each row in the table represents an instruction address within the binary (the first column), and the second column holds the unwind instructions necessary to restore the caller’s state. To perform the unwind operation, we first need to find the row that represents the instruction at which the crash occurred, and then apply any restoration rules defined at that row.
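The conceptual table can be sketched in a few lines of C. This is a toy model under stated assumptions, not the DWARF or Compact Unwind format: memory is an array of words, each row covers a range of addresses, the return address is assumed to sit one word below the caller’s stack pointer, and each row may describe at most one spilled callee-preserved register. All names (unwind_row, find_row, unwind_one_frame, demo_unwind) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { R_PC, R_SP, R_S0, R_COUNT };    /* toy register file */
#define NO_REG (-1)

typedef struct { uint64_t r[R_COUNT]; } thread_state;

typedef struct {
    uint64_t start_pc, end_pc;  /* column 1: row covers [start_pc, end_pc) */
    int64_t  cfa_offset;        /* column 2: caller's SP = SP + cfa_offset */
    int      saved_reg;         /* spilled callee-preserved register, or NO_REG */
    int64_t  saved_offset;      /* its slot, relative to the caller's SP */
} unwind_row;

/* Binary search over rows sorted by address. */
const unwind_row *find_row(const unwind_row *rows, size_t n, uint64_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < rows[mid].start_pc)      hi = mid;
        else if (pc >= rows[mid].end_pc)  lo = mid + 1;
        else                              return &rows[mid];
    }
    return NULL;
}

static int load_word(const uint64_t *mem, size_t words, int64_t addr, uint64_t *out) {
    if (addr < 0 || (uint64_t)addr >= words) return 0;
    *out = mem[addr];
    return 1;
}

/* Apply the row's rules; on failure the state is left untouched. */
int unwind_one_frame(const unwind_row *rows, size_t n,
                     const uint64_t *mem, size_t words, thread_state *ts) {
    assert(rows != NULL && mem != NULL && ts != NULL);
    const unwind_row *row = find_row(rows, n, ts->r[R_PC]);
    if (row == NULL) return 0;

    int64_t caller_sp = (int64_t)ts->r[R_SP] + row->cfa_offset;
    thread_state caller = *ts;
    if (!load_word(mem, words, caller_sp - 1, &caller.r[R_PC])) return 0;
    if (row->saved_reg != NO_REG &&
        !load_word(mem, words, caller_sp + row->saved_offset, &caller.r[row->saved_reg]))
        return 0;
    caller.r[R_SP] = (uint64_t)caller_sp;
    *ts = caller;
    return 1;
}

/* One function at 0x2000: a prologue row and a body row that spills S0. */
int demo_unwind(uint64_t crash_pc, uint64_t crash_sp, thread_state *out) {
    static const unwind_row rows[] = {
        { 0x2000, 0x2004, 1, NO_REG, 0 },
        { 0x2004, 0x2040, 4, R_S0,  -2 },
    };
    uint64_t mem[16] = {0};
    mem[11] = 0x1010;                  /* return address pushed by the call */
    mem[10] = 77;                      /* caller's S0, spilled by the body */
    out->r[R_PC] = crash_pc;
    out->r[R_SP] = crash_sp;
    out->r[R_S0] = 999;                /* value clobbered by the crashed function */
    return unwind_one_frame(rows, 2, mem, 16, out);
}
```

The two rows show why the lookup is keyed by instruction address: the same function needs different restoration rules depending on how far into it the crash happened.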

In reality, such a direct encoding of the unwind table would be prohibitively enormous. To solve that, complex encoding schemes are used to minimize duplication and maximize data re-use; on Apple platforms, these are DWARF and Apple’s own Compact Unwind.

DWARF

DWARF is a (mostly) platform- and architecture-neutral standard for defining debugging information, including unwind data. To add support for a new architecture in PLCrashReporter’s DWARF implementation, it’s generally sufficient to simply add a mapping between DWARF register numbers defined by the platform vendor, and the actual registers they represent; interpreting the format operates entirely in terms of these abstract register numbers.

The encoding is capable of representing almost any possible set of unwind rules; the lookup tables and restoration rules are implemented as a versatile interpreted bytecode, including a Turing-complete set of DWARF expression opcodes. Amusingly, this aspect of DWARF has been used to implement in-process arbitrary code execution without native code. If you’re curious, you can peruse PLCrashReporter’s DWARF expression interpreter source here.

While enormously useful (and necessary!), the versatility of DWARF comes at the cost of the encoding’s conciseness; this is what Apple set out to address with their non-standard Compact Unwind encoding.

Compact Unwind

Apple’s Compact Unwind encoding is architecture-specific, non-portable, and is unique to Apple. Whereas DWARF can represent almost any set of rules necessary to perform unwinding, Apple took a different approach with the compact unwind encoding — it’s only capable of representing a limited set of unwinding rules, but these rules cover all (or just about all) of the code constructs actually emitted by the compiler. In exchange for these limitations, the compact unwind encoding is, well, compact; it’s much smaller than the corresponding DWARF representation, which means appreciably smaller binaries.

The Compact Unwind encoding can’t represent the full range of unwinding rules that may be required, and as such, it’s used in concert with DWARF. At link time, any DWARF rules that can be represented using the compact unwind encoding will be converted by ld, and the DWARF data will be discarded.

Since the original DWARF data is discarded, this means that correct crash reporting (and, in the case of Crash Recovery, frame unwinding) requires both DWARF and Compact Unwind support. You can find PLCrashReporter’s Compact Unwind implementation (for x86-64 and ARM64) here.

Applying the Unwind Changes

As part of our work to support crash reporting on 64-bit platforms, we already had implemented full DWARF and Compact Unwind support in PLCrashReporter, including the APIs necessary to represent register modifications across stack frames; we implemented this with the eventual goal of including non-volatile register state for all frames in all threads in the crash report.

We had to do very little to implement the Crash Recovery system; it was a simple matter of calling directly into our unwinding APIs from our signal handler, and applying the computed register results to the ucontext_t containing the signal thread state. If you return directly from a signal handler, any changes made to ucontext_t within the signal handler will be applied to the target thread. By modifying the ucontext_t, we’re able to update the stack pointer, frame pointer, and any non-volatile registers. In addition, by setting the instruction pointer, we actually cause the thread to resume in the crashed function’s caller upon return from the signal handler.
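The “apply the results to the ucontext_t” step might look roughly like the following C sketch. It is not the PLCrashReporter glue code: the register layout inside ucontext_t differs on every platform, so this version only fills in the glibc x86-64 layout (uc_mcontext.gregs indexed by REG_RIP and friends) and reports “unsupported” everywhere else, including Apple platforms. The caller_state struct and both function names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* asks glibc to expose the REG_* indices */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output of the unwinder: where the caller should resume. */
typedef struct { uint64_t pc, sp, fp; } caller_state;

#if defined(__linux__) && defined(__x86_64__) && defined(REG_RIP)
#define HAVE_CONTEXT_PATCH 1
#else
#define HAVE_CONTEXT_PATCH 0
#endif

/* Rewrite the saved thread state so that returning from the signal
 * handler resumes in the caller with a zeroed integer return value.
 * Returns 1 if the context was patched, 0 on unsupported platforms. */
int apply_caller_state(ucontext_t *uc, const caller_state *cs) {
    assert(uc != NULL && cs != NULL);
#if HAVE_CONTEXT_PATCH
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)cs->pc;  /* resume address */
    uc->uc_mcontext.gregs[REG_RSP] = (greg_t)cs->sp;  /* caller's stack pointer */
    uc->uc_mcontext.gregs[REG_RBP] = (greg_t)cs->fp;  /* caller's frame pointer */
    uc->uc_mcontext.gregs[REG_RAX] = 0;               /* integer/pointer return value */
    return 1;
#else
    return 0;
#endif
}

/* Read back the resume address (0 where the layout is not modelled). */
uint64_t context_pc(const ucontext_t *uc) {
    assert(uc != NULL);
#if HAVE_CONTEXT_PATCH
    return (uint64_t)uc->uc_mcontext.gregs[REG_RIP];
#else
    return 0;
#endif
}
```

In a handler installed with SA_SIGINFO, the third argument is a void pointer to this ucontext_t; the patched values take effect only when the handler returns normally.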

Since it’s just a little bit of glue on top of PLCrashReporter’s existing async-safe APIs, the Crash Recovery code only took about a day to write; if you’d like to take a look, the signal handler additions can be found here.

Returning Nil to the Caller

Having implemented unwinding, the last thing we needed to do was set the return value to nil. I have to admit we cheated a bit here; we ignored floating point and structure return types.

To handle pointer return types (including Objective-C objects), we simply set the return value register to 0x0. This handles most return values, but in the case where structures are returned on the stack, or special handling is required for floating point, you’ll see unexpected results.

Conclusion

While the Crash Recovery implementation is an interesting technical exploration of what’s possible, it would be a terrible idea to actually use it as a blanket “fix” for crashes, even if it worked absolutely perfectly. The nature of a crash is such that the current process state is, by default, undefined; if it was defined, it wouldn’t have crashed. Blindly attempting to proceed can do worse than crash; data corruption and deadlocks are entirely likely.

That doesn’t mean that this avenue of exploration is bereft of value, however. For example, if we extended the PLCrashReporter APIs to directly support the idea of “patch and continue”, we could support some pretty common operations that currently require custom per-architecture+platform implementations in runtime VMs, such as trapping “optimistic” error handling cases: managed code could use this mechanism to exclude NULL or divide-by-zero checks in generated machine code, instead trapping the signals, verifying that the failure occurs within managed code, and converting the signal into a language-level stack-unwinding NullPointerException or DivideByZero exception.
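As a hedged sketch of that “optimistic” pattern, the C code below performs a load with no NULL check and turns the resulting fault into an ordinary error return. It deliberately swaps in a simpler technique than the one discussed in this post: it uses sigsetjmp()/siglongjmp() rather than unwind-table-driven patching of the ucontext_t, and a flag stands in for checking that the faulting address lies within managed code. The names managed_load and trap_handler are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf managed_entry;
static volatile sig_atomic_t in_managed_code = 0;

static void trap_handler(int signo) {
    if (!in_managed_code) {
        /* Not our fault to handle: restore the default action and return,
         * so the re-executed instruction terminates the process as usual. */
        signal(signo, SIG_DFL);
        return;
    }
    in_managed_code = 0;
    siglongjmp(managed_entry, 1);      /* convert the trap into an "exception" */
}

/* Loads *address without checking it first. Returns 0 and stores the value
 * on success, or -1 if the load faulted (the "NullPointerException" case). */
int managed_load(const volatile int *volatile address, int *out) {
    assert(out != NULL);

    struct sigaction sa, old_segv, old_bus;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = trap_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    volatile int status = -1;
    if (sigsetjmp(managed_entry, 1) == 0) {
        in_managed_code = 1;
        *out = *address;               /* optimistic: no NULL check emitted */
        in_managed_code = 0;
        status = 0;
    }

    /* Put back whatever handlers were installed before. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return status;
}
```

A real runtime would install its handlers once, compare the faulting program counter against its generated code ranges, and unwind through its own frames rather than jumping to a single saved location.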

A more aggressive avenue of exploration is the idea of emergency “hot patches” deployable directly from a crash reporting service. If your shipped application is unexpectedly crashing across your entire user-base with a call to CFRelease(NULL), and you know it’s safe to work around the issue, a crash reporting service could support feeding a PLCrashReporter-based hotpatch to your application, working around the issue until you could actually ship a release.

After all, there’s no upside to having customers being frustrated and continuing to submit crash reports for a known issue.

It’s not clear that we’ll see any of these ideas — or any of the others floating around our heads — in an actual shipping product, but it’s mighty fun to hack something out and see what they might look like.


Introducing Plausible Crash Recovery

April 1st, 2014 / Announcements / Open Source / Plausible Labs
By: landonf

Update: Check out the post-April Fools Follow-up, which delves deeply into the actual implementation of Plausible Crash Recovery, and where this work could actually see practical use.

Sheer performance and deep insight are essential in a crash reporting solution like PLCrashReporter, but our hardcore team is never satisfied by just pushing the envelope — we’re here to destroy it.

Today, I’m extremely pleased to announce the future of iOS and Mac OS crash reporting: Plausible Crash Recovery™.

Plausible Crash Recovery™ works almost by magic, automatically detecting iOS and Mac application crashes, and resuming execution at the next available statement, ensuring that your users never have to deal with a crashed application again. Why just report crashes when you can prevent them?

Crash Recovery is a bit like a time machine, using PLCrashReporter’s best-in-class async-safe stack unwinding to step backwards from the crashing function, restoring non-volatile register state and returning nil to the original caller — think of it like nil messaging on steroids. It truly has to be seen to be believed:

[Video]

Of course, our engineers weren’t satisfied until Plausible Crash Recovery™ handled more fatal signals than any other crash recovery product on the market. NULL dereference? No problem. CFRelease(NULL)? Piece of cake. Sending an Objective-C message to invalid memory? We’ve got you covered.

Developer Preview – Available Today

We could not be happier to get these improvements into the hands of billions of app developers.

If you want to take Plausible Crash Recovery™ for a spin, we’re making it available to early adopters today. Our April 1st preview release contains both the source code to PLCrashReporter with Plausible Crash Recovery™, as well as iOS and Mac OS X demo applications that you can use to test Plausible Crash Recovery™ immediately.

To use PLCrashReporter with Plausible Crash Recovery™ in your own code, simply link against the provided iOS or Mac OS X PLCrashReporter.framework and enable the crash reporter.

Warning: While PLCrashReporter with Plausible Crash Recovery™ does actually work as advertised, it has seen limited testing, and application developers are cautioned to pay close attention to the release date of this announcement prior to shipping PLCrashReporter with Plausible Crash Recovery™ in an actual product.

We also must give credit to Microsoft Visual Basic’s ground-breaking On Error Resume Next, which directly inspired the implementation of PLCrashReporter with Plausible Crash Recovery™.


Calling all Colorists and Space Cadets: Color/Space Launches Today

January 23rd, 2014 / Announcements
By: plausible

Here at Plausible Labs, coding often seeps into our dreams. On a cool night in San Francisco last summer, one of our engineers woke up with the vision for Color/Space, a small game for iOS that we’re launching today.

Color/Space takes its players on a fast-paced space mission where they must mix colors to keep planets from escaping a central star’s gravity. Players choose challenges from any of the six levels, from grayscale all the way to the CMYK model. Along the way, users can learn something new about various color models. With each new round, time moves a bit faster and the stakes are raised to keep the planets in orbit.

The game puts players’ speed, imagination, and color theory knowledge to the test while being an interstellar getaway for designers, gamers, and space cadets. Available on the App Store now!



Crittercism Joins the PLCrashReporter Consortium!

January 22nd, 2014 / Announcements / Open Source
By: landonf

Plausible Labs is extremely pleased to announce that Crittercism has joined the PLCrashReporter Consortium, providing significant support for the ongoing open-source development of PLCrashReporter.

Plausible CrashReporter provides an open source in-process crash reporting framework for use on both the iPhone and Mac OS X, used by first-tier commercial crash reporting services like Crittercism.

Ongoing open source development work is sponsored directly by the members of the PLCrashReporter Consortium, as well as by individual application developers through our application developer support services.

Here at Plausible Labs, we’re big believers in the idea that complex development tools (such as compilers, debuggers, and crash reporters) benefit from being developed openly and under a liberal license, so as to allow for wide adoption, peer review, and technical validation of the implementation across the widest possible user base. This development model has made PLCrashReporter one of the most reliable, well-architected, and feature-complete crash reporters available for iOS and Mac OS X.

We’ve laid out an ambitious project roadmap for this new year, starting with a few goals that we believe are imperative to growing PLCrashReporter’s utility and value:

  • Increase the scope and depth of useful data gathered in our crash reports, while maintaining our strict user privacy requirements.
  • Expand the user base of the library to ensure the continued health of the project.
  • Work with implementors of managed runtimes (such as Xamarin, Unity3d, RubyMotion, and RoboVM) to improve compatibility between PLCrashReporter and their managed runtime (some of which already use PLCrashReporter).
  • Maintain our focus on reliability by introducing technical solutions to provide even stronger reliability guarantees as the scope and complexity of the library continues to grow.
  • Improve usage and integration documentation, targeted at both 3rd party integrators, and application developers, to help encourage a healthy development ecosystem.

If your platform or application relies on PLCrashReporter, we’re always interested in hearing about your priorities, and your feedback on our roadmap. The support of existing and new sponsors makes an enormous difference in both the scope and scale of what we can produce, and the sponsorship of companies like Crittercism is imperative to the success of the project.



Plausible’s VoodooPad for iOS is out! Or: how to migrate.

December 23rd, 2013 / VoodooPad
By: mikeash

A new version of VoodooPad for iOS is out, the first one from Plausible Labs. Version 2.0.7 brings very few changes over 2.0.6: a couple of bug fixes and an updated Dropbox SDK. The big change is that it’s now our build, and with our build comes a new app on the App Store that customers need to grab.

Background

The App Store originally had no way to transfer apps from one developer to another. If an app was ever purchased by a new company, as we’ve done with VoodooPad, there were only two choices:

  1. Give the selling company’s entire iTunes Connect account to the new company.
  2. Take the app down from the selling company’s account, then post it as a new app under the new company’s account, and somehow migrate all the users.

Neither option was great, but they were the only choices. The first one only works if it’s the seller’s only product, and the second one is tough on users.

This year, Apple added the ability to transfer individual apps between developers. Hooray! We used this with the Mac version of VoodooPad and it worked great.

Unfortunately, this functionality comes with a huge gotcha: if an app has ever so much as gazed upon iCloud, it cannot be transferred. I’m not clear on exactly why this is, but there is apparently some inherent technical limitation in Apple’s backend code that makes this impossible. Presumably it’s the same sort of technical limitation which prevented app transfers altogether for so many years.

While VoodooPad for iOS has never used iCloud, it did have iCloud enabled as part of an experiment, and that was enough to make it ineligible for transfer. That means we’re stuck with the two bad choices above. The first is out of the question, as Flying Meat also has Acorn on the App Store and so can’t just give us the entire account. That leaves us with the second choice: take down Flying Meat’s VoodooPad for iOS, and post our own.

What To Do

Flying Meat has posted 2.0.7 on their account, but that will be the last version there. For future updates, you must switch to our version. Since it’s considered a separate app in the App Store, you need to obtain it manually. You can do so here:

https://itunes.apple.com/us/app/voodoopad/id777792045?ls=1&mt=8

If you’re like me, you may be thinking, “But wait, won’t I have to buy it again?” Normally that would be the case, since Apple doesn’t see the two apps as being related in any way, and so your previous purchase doesn’t count for the new version. To mitigate this, we’ve made the new version free through January 21st, to give everyone time to move over to the new version. Please make sure you grab the new version now, while it’s free. Once you do this, you’ll need to transfer your documents across, since it acts like a completely new app and doesn’t share any data with the old one.

Document Transfer

If you’re syncing with Dropbox then the transfer process is extremely easy. Just set up Dropbox in the new copy of VoodooPad and all of your documents will automatically sync to it.

If you’re not syncing with Dropbox, then I definitely recommend setting it up. It’s free, it’s easy to use, and it works great. You’ll need to set up both the old and new copies of VoodooPad with Dropbox, and then the new copy will sync the documents from the old copy.

If you’re not syncing with Dropbox and don’t want to, then you can either use WiFi sync with a Mac to transfer your documents, or you can use iTunes. Check out the documentation on syncing and transferring for more information:

https://plausible.coop/voodoopad/docs-5/syncing%20and%20transferring.html

However you transfer the documents across, double-check to make sure that the new copy of the app has everything, and then you can go ahead and delete the old copy from your device.

Summary

The Plausible Labs version of VoodooPad for iOS is now on the App Store, but as a separate app. It’s free through the middle of next month so that existing users can move over at no cost. Existing users need to:

  1. Download a copy of the new app from the App Store.
  2. Transfer their existing data into the new app, either through Dropbox, WiFi sync, or iTunes.

It’s also a great time for new users to check out VoodooPad for iOS, since you can also grab it for no cost during this migration period.

This is the final piece we had to take care of in moving VoodooPad from Flying Meat to Plausible Labs. We’re looking forward to being able to fully concentrate on improvements to both the Mac and iOS versions of the app now that we’ve tamed the App Store beast.

Thanks for your patience with all of this, and please don’t hesitate to contact us if you have any questions or problems with the iOS migration.


PLCrashReporter 1.2 Release Candidate

December 16th, 2013 / Uncategorized
By: landonf

I’m pleased to announce the first release candidate of PLCrashReporter 1.2. Plausible CrashReporter provides an open source in-process crash reporting framework for use on both the iPhone and Mac OS X, and is used by the preeminent crash reporting and analytics services for Mac OS X and iOS.

This release adds support for more precise stack unwinding on ARM64 using DWARF eh_frame and Apple compact unwind metadata. As far as I’m aware, PLCrashReporter is the only 3rd-party crash reporter on iOS to support eh_frame/compact unwinding, and as such, provides the most accurate backtraces available for the platform. (See below for the technical details.)

Significant changes since 1.2-beta2:

  • Production ARM64 support.
  • Improved ARM64 unwinding (DWARF eh_frame and Apple compact unwind support).

Full Changelog.

This release was funded by Plausible Labs, HockeyApp, and Flurry via the PLCrashReporter Consortium. Our thanks goes out to the Consortium Members that make our work on PLCrashReporter possible.

New ARM64 Unwinding Implementation

On Mac OS X, Apple’s compiler toolchain includes additional metadata within each binary that can be used to perform near-perfect backtraces — whether you’ve crashed in objc_msgSend, inside of a C++ function, or even inside of a custom assembly trampoline.

Without this metadata, crash reporting implementations must apply heuristics to determine the best unwind strategy. In many cases those heuristics guess wrong, producing incorrect or incomplete backtraces that make bugs harder to understand.

With the introduction of ARM64, Apple has brought support for this additional metadata to iOS, where we were able to add support to PLCrashReporter. This involved reverse engineering the incorrectly documented compact unwind format (rdar://15057141 – Incorrect compact unwind documentation in compact_unwind.h), and implementing a full suite of ARM64 hand-written assembly regression tests that we used to ensure the correctness of our unwinder and compatibility with Apple’s implementation.

We’re comfortable saying that PLCrashReporter provides the absolute best crash reporting data available from any 3rd-party on-device crash reporting library for Mac OS X or iOS.

Technical Details

On ARM/ARM64 (ARM), the return address — that is, the address at which execution should resume when the called function returns — is stored in the link register (lr) when a function is called. This address (generally) points to the next valid instruction directly following the branch instruction.

When the called function returns, it uses the value stored in the link register to branch back to the original caller. If the called function itself calls a function, the link register will be overwritten; its contents must therefore be saved and restored so that the called function can return to its own caller. In most functions generated by the compiler, the link register will be stored in a standard location, reachable via the frame pointer. This is what a naive crash reporter will rely on to walk the stack frames and produce a backtrace.
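To make the naive approach concrete, here is a minimal sketch of a frame-pointer walk in C. It runs against a simulated stack (an array of 64-bit words indexed by "address") rather than real thread state. The frame_walk function and the layout it assumes (caller's frame pointer at fp, saved link register in the next word) are illustrative assumptions modelled on a conventional frame record, not PLCrashReporter's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk a chain of frame records in a simulated stack.
 *
 * Each frame record is two consecutive words:
 *   stack[fp]     = the caller's frame pointer (0 ends the chain)
 *   stack[fp + 1] = the saved link register (return address into the caller)
 *
 * Writes up to `max` return addresses into `out` and returns the count.
 */
size_t frame_walk(const uint64_t *stack, size_t stack_len,
                  uint64_t fp, uint64_t *out, size_t max)
{
    assert(stack != NULL && out != NULL);

    size_t count = 0;
    while (fp != 0 && count < max) {
        /* Refuse to read a frame record that does not fit in the stack. */
        if (stack_len < 2 || fp > stack_len - 2)
            break;

        uint64_t next_fp = stack[fp];
        out[count++] = stack[fp + 1];

        /* The stack grows down, so a caller's record must sit at a higher
         * address than its callee's. Anything else is corruption or a loop. */
        if (next_fp != 0 && next_fp <= fp)
            break;
        fp = next_fp;
    }
    return count;
}
```

Note that the walker trusts every function to have saved its link register in the expected slot; that trust is exactly what breaks in the cases described next.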

Where this approach fails is when a function behaves differently than expected. objc_msgSend() is one example: it neither sets up a stack frame nor saves its link register. objc_msgSend works as a trampoline; it looks up and branches directly to the method implementation, and once complete, is no longer relevant in a backtrace. If a crash occurs in objc_msgSend, and a naive backtrace implementation simply consults the stack where the link register is normally saved, it’ll fetch the contents of the previous function’s link register, and thus skip a frame. The information will still be available from the crash report (you can fetch the link register from the report and determine what the caller should be), but this is hardly ideal when generating easy-to-use backtraces.
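The skipped frame is easy to reproduce in a small simulation. The sketch below is hypothetical throughout (the thread_state struct, the build_backtrace function, and the leaf_is_frameless flag are names invented for illustration). It builds a backtrace from a simulated stack, and shows how knowing that the crashing function is frameless lets the walker take the first caller from the link register instead of from the stack. In a real unwinder that knowledge would have to come from unwind metadata, not from a flag passed in by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the crashed thread's registers. */
typedef struct {
    uint64_t pc; /* address at which the crash occurred */
    uint64_t lr; /* link register at the time of the crash */
    uint64_t fp; /* frame pointer, as an index into the simulated stack */
} thread_state;

/*
 * Build a backtrace from a simulated stack of frame records
 * (stack[fp] = caller's fp, stack[fp + 1] = saved lr, fp 0 ends the chain).
 *
 * If leaf_is_frameless is nonzero, the crashing function never pushed a
 * frame record: its return address is still in lr, and fp still belongs
 * to its caller.
 */
size_t build_backtrace(const thread_state *ts, const uint64_t *stack,
                       size_t stack_len, int leaf_is_frameless,
                       uint64_t *out, size_t max)
{
    assert(ts != NULL && stack != NULL && out != NULL);

    size_t n = 0;
    if (n < max)
        out[n++] = ts->pc;

    /* A frameless leaf's caller is only recoverable from the link register. */
    if (leaf_is_frameless && n < max)
        out[n++] = ts->lr;

    uint64_t fp = ts->fp;
    while (fp != 0 && n < max) {
        if (stack_len < 2 || fp > stack_len - 2)
            break;
        uint64_t next_fp = stack[fp];
        out[n++] = stack[fp + 1];
        if (next_fp != 0 && next_fp <= fp)
            break;
        fp = next_fp;
    }
    return n;
}
```

With a crash at 0x1000 inside a frameless function called from A (lr = 0xA000), where A's frame record returns into B (0xB000) and B's into C (0xC000), the naive walk yields 0x1000, 0xB000, 0xC000 and silently drops A. Passing leaf_is_frameless yields 0x1000, 0xA000, 0xB000, 0xC000.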

In more complex cases, the frame pointer and link register may be entirely indecipherable by a naive frame walker — for instance, if a crash occurs inside of a custom assembly trampoline that has overwritten both, or if the crash occurs inside of a managed runtime that generates non-standard native code.

To handle these cases, Apple’s toolchain provides information on the location of non-volatile register data, including the link register, and the steps necessary to access it, making it possible to generate a ‘perfect’ backtrace. These instructions are provided via both DWARF unwind data and Apple’s compact unwind encoding. The compact unwind encoding provides a strict subset of the DWARF unwinding data while consuming considerably less space, at the cost of only being able to represent a subset of the standard well-defined methods for unwinding a single frame. The DWARF format and implementation, on the other hand, is considerably more complex; it even includes a full, Turing-complete opcode interpreter that may be used during the unwinding process to express the location and steps to retrieve register values. The DWARF data can be used to express even the most complex and bizarre unwinding steps for a hand-written function. As far as we’re aware, PLCrashReporter is the only crash reporting solution to provide full support for client-side unwinding via both DWARF and Apple’s compact unwinding data.
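To give a feel for what "an opcode interpreter that expresses how to retrieve a register value" means, here is a toy evaluator for a tiny slice of the DWARF expression language. The opcode values and the LEB128 operand encoding follow the DWARF specification; everything else is an assumption made for brevity. Only three opcodes are supported (none of the branching ones), memory is a word-indexed array rather than a real address space, and dwarf_eval is a hypothetical name, not PLCrashReporter's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values as assigned by the DWARF specification. */
enum {
    DW_OP_deref       = 0x06, /* pop an address, push the word stored there */
    DW_OP_plus_uconst = 0x23, /* add a ULEB128 operand to the top of stack */
    DW_OP_breg0       = 0x70  /* breg0..breg31: push register n + SLEB128 offset */
};

#define NUM_REGS  32
#define STACK_MAX 16

/* Decode an unsigned LEB128 value; returns 0 on truncated input. */
static int read_uleb128(const uint8_t **p, const uint8_t *end, uint64_t *out)
{
    uint64_t result = 0;
    unsigned shift = 0;
    while (*p < end && shift < 64) {
        uint8_t byte = *(*p)++;
        result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
        if ((byte & 0x80) == 0) { *out = result; return 1; }
    }
    return 0;
}

/* Decode a signed LEB128 value; returns 0 on truncated input. */
static int read_sleb128(const uint8_t **p, const uint8_t *end, int64_t *out)
{
    uint64_t result = 0;
    unsigned shift = 0;
    while (*p < end && shift < 64) {
        uint8_t byte = *(*p)++;
        result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
        if ((byte & 0x80) == 0) {
            if (shift < 64 && (byte & 0x40))
                result |= ~(uint64_t)0 << shift; /* sign-extend */
            *out = (int64_t)result;
            return 1;
        }
    }
    return 0;
}

/*
 * Evaluate `expr` against a register file and a simulated memory in which
 * "addresses" are indices into `mem`. Returns 1 and stores the top of the
 * expression stack in *result on success, 0 on any error.
 */
int dwarf_eval(const uint8_t *expr, size_t len, const uint64_t regs[NUM_REGS],
               const uint64_t *mem, size_t mem_len, uint64_t *result)
{
    assert(expr != NULL && regs != NULL && result != NULL);

    uint64_t stack[STACK_MAX];
    size_t sp = 0;
    const uint8_t *p = expr, *end = expr + len;

    while (p < end) {
        uint8_t op = *p++;
        if (op >= DW_OP_breg0 && op < DW_OP_breg0 + NUM_REGS) {
            int64_t off;
            if (sp == STACK_MAX || !read_sleb128(&p, end, &off)) return 0;
            stack[sp++] = regs[op - DW_OP_breg0] + (uint64_t)off;
        } else if (op == DW_OP_plus_uconst) {
            uint64_t add;
            if (sp == 0 || !read_uleb128(&p, end, &add)) return 0;
            stack[sp - 1] += add;
        } else if (op == DW_OP_deref) {
            if (sp == 0 || mem == NULL || stack[sp - 1] >= mem_len) return 0;
            stack[sp - 1] = mem[stack[sp - 1]];
        } else {
            return 0; /* opcode outside this toy subset */
        }
    }
    if (sp == 0) return 0;
    *result = stack[sp - 1];
    return 1;
}
```

An expression such as "breg29 1, deref" then reads as "take register 29 (the frame pointer on ARM64), add one word, and load the value stored there", which is one way of saying where a saved link register lives. The compact encoding, by contrast, can only select from a fixed menu of such layouts.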


Further Thoughts on the VoodooPad Acquisition

November 6th, 2013 / VoodooPad
By: mikeash

My name is Mike Ash, and I’d like to talk to you about our acquisition of VoodooPad and the way I’ve put it to use for many different things over the years. I’m a software engineer here at Plausible Labs. I wear a lot of other hats as well. Plausible Labs is a cooperative, which means that all of us here at Plausible share in the ownership and decision-making of the company. I helped get the ball rolling on VoodooPad, although the decision was made collectively as befits our corporate structure. I’m also going to be the principal engineer working on VoodooPad for the moment, and I’m happy to get my hands dirty.

Some of you may know me from NSBlog, home of the Friday Q&A series of articles about Mac and iOS programming. As it happens, I do all of my blog writing in VoodooPad these days.

I first discovered VoodooPad way back when it was fresh and new, around the original release in 2003. I was enchanted at the prospect of being able to organize my personal data in a personal wiki. Before that, I had a mass of text files strewn about my computer. A lot of them lived on the desktop. More of them lived in a folder on the desktop, where I occasionally “organized” files that I’d kept around for a while. Still more lived in random, unfathomable places.

I started a VoodooPad document with that old version and called it “Random Notes”. I still have that same document today, now upgraded to a VoodooPad 5 document and filled with lots of data from over the years. I started out by filling it up with those assorted text files and finally getting rid of my desktop clutter. (It came back within hours, but now it’s at least limited to images and zip files and other non-textual information.) Then I started adding more. Now this document is filled with everything from insurance policy numbers to assorted programming tricks to notes on places to fly gliders in China. Some of this stuff is tremendously old and completely useless to me now, but the personal wiki concept lets me keep everything around without any clutter, and I never have to worry about deleting something I’ll later need.

With the release of VoodooPad 5, it gained support for the Markdown syntax. By a wonderful coincidence, my blog uses the Markdown syntax. I saw this as the perfect opportunity to move away from a painful system of text files and scripts and streamline the process of writing my articles. I wrote a small script that sends an article to my blog server, and now I do all of my writing from within my master VoodooPad document. It’s not built as a blogging platform, but the power and flexibility of the system lets me mold it to my needs.

VoodooPad has been indispensable to me for years. Looking back, I believe I’ve used it for a longer time than any other piece of third-party software I’ve used since I first laid hands on a Commodore 64 back in the mid-80s. I’m excited to move up from a user of the app to a developer, and hope that we can give many more years of solid updates for a powerful and flexible app.

The first Plausible release of VoodooPad is version 5.1.3. It contains a few bug fixes that Gus, head of Flying Meat, made since 5.1.2, and it also changes names and URLs around as appropriate. We’re now hard at work on a 5.2 release following the road map from Gus. I’m a firm believer in software that’s done when it’s done, so I don’t want to make any promises on a timeframe, but I certainly don’t want to keep you waiting too long!


VoodooPad Acquisition

November 6th, 2013 / VoodooPad
By: mikeash

We are pleased to announce the acquisition of the VoodooPad personal wiki from Flying Meat.

We’re big fans of VoodooPad and put it to a lot of use. We’re excited to take over the reins from Flying Meat and hope that we can live up to the expectations they have set.

For the moment, existing and new users should notice few changes. For the future, we hope to be worthy stewards of the app, and maintain its spirit and quality while continuing to improve it.

Product acquisitions can be worrying to existing customers. The new owners can come in with overly-ambitious plans that end up not working out. We plan to take things slow and steady and make sure we thoroughly understand the app and its user base before we look for any exciting new directions.

We’d love to get to know VoodooPad users. If you have any questions or comments, or would just like to say hello, please write in! We’d be delighted to hear from you.


PLCrashReporter 1.2-beta1 (and ARM64 Support!)

September 13th, 2013 / Announcements / Open Source
By: landonf

I’m pleased to announce the first beta release of PLCrashReporter 1.2. Plausible CrashReporter provides an open source in-process crash reporting framework for use on both the iPhone and Mac OS X, and is used by most of the first-tier commercial crash reporting services for Mac OS X and iOS.

This is the first major update to PLCrashReporter’s design since the 1.0 release. There are a lot of significant improvements, and we’ve set the stage for some even more significant enhancements for Mac OS X and iOS in the future. The extensive work on this release was funded by Plausible Labs and HockeyApp via the PLCrashReporter Consortium.

New features in this release include:

  • Experimental ARM64 support.
  • Mach-based exception handling on Mac OS X and iOS (configurable).
  • Client-side symbolication using the Mach-O symbol table and Objective-C meta-data (configurable).
  • Enhanced stack unwinding using both DWARF and Apple’s Compact Unwind data when available (i386 and x86-64; ARM64 forthcoming).
  • Support for tracking preserved non-volatile registers across frame walking. Allows for providing non-volatile register state for frames other than the first frame when using compact or DWARF-based unwinding.
  • Back-end support for out-of-process execution.
  • A unique incident identifier is now included in all reports.
  • Reports now include the application’s start time. This can be used to determine (along with the crash report timestamp) if an application is crashing on launch.
  • Build and runtime configuration support for enabling/disabling local symbolication and Mach exception-based reporting.
  • Mac OS X x86-64 is now a fully supported target.
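
As a rough illustration of the crash-on-launch check mentioned in the list above, the comparison boils down to a subtraction and a threshold. This is a sketch only: the crashed_on_launch function is hypothetical, it assumes both timestamps have already been read out of the report as seconds since the epoch, and the threshold is something each application would have to pick for itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the crash happened within threshold_secs of the
 * application's start. Both timestamps are seconds since the epoch.
 */
bool crashed_on_launch(int64_t start_time, int64_t crash_time,
                       int64_t threshold_secs)
{
    assert(threshold_secs >= 0);

    /* A crash timestamp earlier than the start time suggests the clock
     * changed in between; don't guess in that case. */
    if (crash_time < start_time)
        return false;

    return (crash_time - start_time) <= threshold_secs;
}
```

An application that finds such a report on startup might, for example, skip restoring its previous state so that it doesn't crash the same way again.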

You can download the latest release here, or review the full API Documentation.

More details on a few of the big (and cool) features:

ARM64 Support

We’ve implemented baseline support for ARM64, including all the necessary assembly/architecture code changes. For this initial release — prior to the availability of actual iPhone 5S hardware — we’re providing a separate binary release that includes ARM64 support. This is intended to allow projects that depend on PLCrashReporter to experiment with integrating arm64 into their builds; applications should not be released with PLCrashReporter/ARM64 until the implementation has been validated against actual hardware.

Once we have ARM64 hardware in hand, we’ll validate our implementation via our test suite and fix any issues we find. One of the most exciting changes that we’ll be investigating after the release of the iPhone 5S is support for frame unwinding using the now-available ARM64 compact unwind and DWARF eh_frame data; this will provide the best possible stack traces on iOS, and has not been available for arm32 targets.

Mach Exception Handling

This release supports the use of optional Mach exception handling, rather than a standard POSIX signal handler. The use of Mach exception handling can be configured at runtime, or can easily be excluded/included from the entire build at compile-time.
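
For comparison, this is roughly what the standard POSIX side looks like: a handler registered with sigaction() that the kernel invokes directly on the receiving thread. It is a minimal sketch, not PLCrashReporter's handler. The names install_handler, on_signal, and last_signal are made up for the example. A real crash reporter would register for fatal signals such as SIGSEGV and SIGBUS, and would have to restrict itself to async-signal-safe operations inside the handler.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler, read by normal code. sig_atomic_t is the only
 * object type that is safe to share this way. */
static volatile sig_atomic_t last_signal = 0;

/* Invoked by the kernel on the thread that received the signal. */
static void on_signal(int signo, siginfo_t *info, void *uap)
{
    (void)info; /* would carry the fault address for SIGSEGV/SIGBUS */
    (void)uap;  /* would carry the interrupted thread's register state */
    last_signal = signo;
}

/* Register on_signal for signo. Returns 0 on success, -1 on failure. */
int install_handler(int signo)
{
    assert(signo > 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO; /* request the three-argument handler form */
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

Calling install_handler(SIGUSR1) followed by raise(SIGUSR1) is a harmless way to watch the handler fire without actually crashing anything.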

Mach exceptions differ from POSIX signals in three significant ways:

  • Exception information is delivered as a Mach message via a Mach IPC port, rather than by the kernel calling into a userspace trampoline.
  • Exception handlers may be registered by any process that has the appropriate mach port rights for the target process.
  • Exception handlers may be registered for a specific thread, a specific task (process), or for the entire host. The kernel will search for handlers in that order.

These properties can be useful for a crash reporter; they allow us to operate a reporter entirely out-of-process on platforms where this is supported (eg, Mac OS X), they allow us to create multiple tiers of crash reporting (eg, we can register a per-thread Mach exception handler that detects only crashes that occur in our own crash reporter), and they allow us to catch crashes that leave the currently executing thread in a non-viable state (such as due to a stack overflow, in which case there is no room on the target thread’s stack for the signal handler’s frame).
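
The thread, task, host search order can be modelled in a few lines. To be clear, this is a portable simulation and not the Mach API: handler_table, deliver_exception, and the two sample handlers are all hypothetical, and the rule that a handler which declines an exception lets the search continue to the next level is our assumption about the kernel's behavior.

```c
#include <assert.h>
#include <stddef.h>

/* The three levels at which a handler may be registered, in search order. */
typedef enum { LEVEL_THREAD, LEVEL_TASK, LEVEL_HOST, LEVEL_COUNT } handler_level;

/* A handler returns nonzero if it handled the exception. */
typedef int (*exc_handler)(int exception_type);

typedef struct {
    exc_handler handlers[LEVEL_COUNT]; /* NULL means nothing is registered */
} handler_table;

/* Sample handlers for experimenting with the table. */
int handle_all(int exception_type)  { (void)exception_type; return 1; }
int decline_all(int exception_type) { (void)exception_type; return 0; }

/*
 * Offer the exception to each level in order. Returns the level that
 * handled it, or -1 if no handler accepted it.
 */
int deliver_exception(const handler_table *t, int exception_type)
{
    assert(t != NULL);

    for (int level = LEVEL_THREAD; level < LEVEL_COUNT; level++) {
        exc_handler h = t->handlers[level];
        if (h != NULL && h(exception_type))
            return level;
    }
    return -1;
}
```

The "multiple tiers" idea falls out naturally: a per-thread handler installed only on the crash reporter's own thread sees that thread's faults first, while everything else falls through to the task-level handler.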

However, there are also some downsides, which is why Mach exceptions have been such a long time coming to PLCrashReporter, and why they remain optional:

  • On iOS, the APIs required to implement Mach exception handling are not fully public — more details on the implications of this may be found in the API documentation referenced below.
  • A Mach exception handler may conflict with any managed runtime that registers a BSD signal handler that can safely handle otherwise fatal signals, allowing execution to proceed. This includes products such as Xamarin for iOS.
  • Interpretation of particular fault types often requires information that is architecture/kernel specific, and either partially defined or undefined.

In some circles, Mach exception handling has been described as the “holy grail” of crash reporting. I think that’s a bit of a misnomer; I’d be tempted to call them the “holy hand grenade”; they provide some advantages, but can just as easily explode in an implementor’s hand. In the process of implementing this feature, we found (and worked around) two separate kernel bugs that resulted in an in-kernel deadlock caused by in-process use of Mach exceptions. The fact is that Apple treats Mach exceptions as a partially exposed private API, and the only truly supported consumer of Mach exceptions is Apple’s own Crash Reporting implementation.

Our general recommendation is to continue to use POSIX signal handlers on iOS; for further information, refer to PLCrashReporter’s Mach Exceptions on Mac OS X and iOS documentation. Mach exception handling may be enabled via -[PLCrashReporter initWithConfiguration:].

Client-side Symbolication

While DWARF debugging information is necessary for first-class symbolication, it’s not always available; for example, when you run an in-development copy of your code on your phone and then lose its dSYM by performing a rebuild. Traditionally, a crash report generated in such a case is useless, as you have no reasonable way of matching its addresses up to even symbol names.

To help in these instances, we’ve implemented support for client-side symbolication, which will provide basic symbol information even when the dSYM is long-gone. Our implementation goes quite a bit beyond most other systems, in that in addition to using the Mach-O symbol table (which is often stripped, or, in the case of iOS, has all of its symbol names replaced with <redacted>), Mike Ash implemented async-safe introspection of the runtime Objective-C metadata to fetch class and method names for all symbols implemented in Objective-C. As far as we know, we’re the only crash reporting implementation to do this, and we think it’s pretty neat.
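
At its core, symbolicating an address against a symbol table means finding the symbol with the greatest start address that is still less than or equal to the address in question. The sketch below shows that lookup as a binary search. It assumes a table that has already been sorted by address, which a raw Mach-O symbol table is not guaranteed to be, and the symbol struct and symbol_lookup function are hypothetical names for illustration rather than PLCrashReporter's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t address;  /* start address of the symbol */
    const char *name;  /* e.g. "main" or "-[Foo bar]" */
} symbol;

/*
 * Find the symbol that contains pc in a table sorted by ascending
 * address: the entry with the greatest address <= pc.
 * Returns NULL if pc lies before the first symbol.
 */
const symbol *symbol_lookup(const symbol *table, size_t count, uint64_t pc)
{
    assert(table != NULL || count == 0);

    /* Binary search for the first entry whose address is > pc. */
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].address <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* The entry just before that one, if any, contains pc. */
    return lo == 0 ? NULL : &table[lo - 1];
}
```

A report line can then be rendered as the symbol name plus (pc - address) as an offset. Without symbol sizes, the lookup cannot tell whether pc actually falls past the end of the matched symbol, so the result is a best guess rather than a guarantee.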

Client-side symbolication may be enabled via -[PLCrashReporter initWithConfiguration:]; since release builds should always have dSYMs, we recommend only enabling client-side symbolication for non-release builds.

Enhanced Stack Unwinding

On x86-64 and i386 (and soon, ARM64!), additional unwinding data is provided in-binary, and may be used both to produce better stack traces and to provide the state of non-volatile registers at each frame of the stack. To support this, we’ve implemented fully async-safe and portable implementations of DWARF eh_frame stack unwinding, as well as support for Apple’s Compact Frame encoding. This should significantly improve stack traces on Mac OS X and, once we have our hands on the hardware and can add support for it, on ARM64.

In the future, we will also be exposing the enhanced register state for all frames, making it even easier to dig into the state of the process at the time of the crash.


Partnership with BitStadium + HockeyApp

May 31st, 2013 / Announcements / Plausible Labs
By: landonf

Since our first release of PLCrashReporter in 2008, it has come to be relied upon by analytics companies, developer tools providers, and internal corporate crash reporting services. We believe that PLCrashReporter is unrivaled as a reliable, stable, well-tested, and carefully constructed crash reporting tool.

Almost since the beginning, HockeyApp’s developers have contributed to the development of PLCrashReporter — contributing patches, developing both open-source and commercial services around it, and ultimately, funding the development of additional open-source features.

We are very excited to announce a long-term joint partnership between our two companies. This partnership will allow us to focus development efforts on further improving and expanding the reach of PLCrashReporter, as well as developing new features, services, and improvements for HockeyApp’s server and client products. The first result of this partnership will be the release of the next generation open source PLCrashReporter. Exclusive early access for HockeyApp customers and details are coming soon!

HockeyApp and Plausible Labs share a combined vision regarding the future of PLCrashReporter. We genuinely believe that complex tools such as PLCrashReporter should be open-source, in the same way that Apple provides kernel, compiler, and library sources, so as to allow for peer review and validation of the approaches we have taken in our technical implementation. Integrators, whether they be application developers or platform providers, should be certain of the robustness of the software, and certain that there is no use of private API or poor implementation that could harm their business or their customers’ interests.

To support this vision, we are launching plcrashreporter.org, a dedicated open-source project, administered by Plausible Labs. It is our goal to ensure that PLCrashReporter remains a trustworthy, free, and open-source solution to crash reporting on iOS, Mac OS X, and — in coming months — future platforms. We will also be founding the PLCrashReporter Consortium (modeled on SQLite’s), with the goal of sharing resources to fund the ongoing open-source development of PLCrashReporter.

HockeyApp is joining the PLCrashReporter Consortium as its founding member, and we look forward to other companies that rely on PLCrashReporter joining in supporting the project’s ongoing open-source development.