AudioTee: capture system audio output on macOS
Introduction
I'm open sourcing AudioTee, a Swift command-line tool that captures your Mac's system audio output and streams it to stdout, suitable for programmatic consumption by other applications. It uses the Core Audio taps API introduced in macOS 14.2 to solve the boring problem of audio capture, which stands in the way of too many fun applications being built. My original (and so far only) use case is to pipe system audio to a Node.js process which in turn relays it to a real-time ASR service.
Background
Recording system audio on macOS has always seemed unnecessarily complicated - if not borderline impossible - without third party software like SoundFlower or BlackHole. Thankfully, with the introduction of Core Audio taps in macOS 14.2, Apple now offers a reasonable API for recording system audio. However, knowledge of the API is sparse and the documentation is lacking, leading many developers to still believe it impossible.
Core Audio taps
Core Audio taps allow you to 'tap' into the audio output of a specific device and set of running processes. In my case (and I suspect many others'), I want to tap all processes playing through the default output device, which is possible by passing an empty process list and setting the isExclusive flag to true ('exclusive' here meaning the list is inverted: tap everything except the listed processes).
Once tapped, audio can then be used as an input in an aggregate device. In the case of AudioTee, the tap is the only input into the aggregate device since it's the only audio stream we're interested in capturing. Taps can be configured to permit audio passthrough or they can be muted, allowing for any amount of post-processing before actually playing the audio out of the system speakers.
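The flow above - create a tap over everything except an (empty) exclusion list, then wrap it in a private aggregate device - can be sketched roughly as follows. This is a simplified illustration for macOS 14.2+, not AudioTee's exact implementation, and the error handling is deliberately minimal:

```swift
import CoreAudio
import Foundation

// Describe a tap over all processes EXCEPT those listed (here: none),
// i.e. everything playing on the default output device. The "GlobalTap"
// initialisers set the exclusive/inverted flag for you.
let tapDescription = CATapDescription(stereoGlobalTapButExcludeProcesses: [])
tapDescription.muteBehavior = .unmuted  // let audio pass through to the speakers

var tapID: AudioObjectID = kAudioObjectUnknown
guard AudioHardwareCreateProcessTap(tapDescription, &tapID) == noErr else {
    fatalError("Failed to create process tap")
}

// Wrap the tap in a private aggregate device so it can be read from
// like any other input device. The tap is the aggregate's only input.
let aggregateDescription: [String: Any] = [
    kAudioAggregateDeviceNameKey: "AudioTee Aggregate",
    kAudioAggregateDeviceUIDKey: UUID().uuidString,
    kAudioAggregateDeviceIsPrivateKey: true,
    kAudioAggregateDeviceTapListKey: [
        [kAudioSubTapUIDKey: tapDescription.uuid.uuidString]
    ],
]

var aggregateID: AudioObjectID = kAudioObjectUnknown
guard AudioHardwareCreateAggregateDevice(aggregateDescription as CFDictionary,
                                         &aggregateID) == noErr else {
    fatalError("Failed to create aggregate device")
}

// From here you'd attach an IO proc to the aggregate device (e.g. via
// AudioDeviceCreateIOProcIDWithBlock) and receive buffers of captured audio.
```

Setting muteBehavior to .muted instead silences the speakers while still delivering audio to the tap, which is what enables the post-processing-before-playback use case mentioned above.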
Credit where it's due: once you sit down and read the documentation, this is all actually fairly well explained. But it's equally easy to miss, and the sample Xcode application Apple provide isn't that helpful for the programmatic use case.
Design decisions
AudioTee makes several opinionated choices that reflect my original use case of streaming audio to a speech recognition service:
Mono output only
All audio is forced to mono, regardless of the source material. This isn't configurable because most ASR services expect mono input anyway, and it halves the amount of data you need to process. If you need stereo output, you'll need to modify the code or use a different tool.
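For illustration, a stereo-to-mono mixdown boils down to averaging each pair of channel samples. This sketch assumes interleaved Float32 samples ([L, R, L, R, ...]) and is illustrative rather than AudioTee's actual code:

```swift
// Average each interleaved stereo frame [L, R] down to a single
// mono sample. Output is half the length of the input.
func mixdownToMono(_ interleaved: [Float]) -> [Float] {
    stride(from: 0, to: interleaved.count - 1, by: 2).map { i in
        (interleaved[i] + interleaved[i + 1]) / 2
    }
}
```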
Chunk-based streaming
Rather than providing a continuous stream of audio samples, AudioTee groups them into chunks (200ms by default) and outputs each chunk as a discrete message. This makes it easier to process in real-time applications where you need to handle audio in manageable pieces rather than sample by sample.
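The grouping logic amounts to buffering incoming samples and emitting a fixed-size chunk whenever enough have accumulated. Here's a minimal sketch of that idea; the names (AudioChunker, onChunk) are illustrative, not AudioTee's actual API:

```swift
// Accumulates samples and emits fixed-duration chunks (200 ms by default).
struct AudioChunker {
    let samplesPerChunk: Int
    private var buffer: [Float] = []
    let onChunk: ([Float]) -> Void

    init(sampleRate: Double, chunkDuration: Double = 0.2,
         onChunk: @escaping ([Float]) -> Void) {
        self.samplesPerChunk = Int(sampleRate * chunkDuration)
        self.onChunk = onChunk
    }

    // Feed in however many samples arrived; emit complete chunks,
    // carrying any remainder over to the next call.
    mutating func append(_ samples: [Float]) {
        buffer.append(contentsOf: samples)
        while buffer.count >= samplesPerChunk {
            onChunk(Array(buffer.prefix(samplesPerChunk)))
            buffer.removeFirst(samplesPerChunk)
        }
    }
}
```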
JSON protocol with optional binary mode
By default, AudioTee outputs JSON messages with base64-encoded audio data, which is terminal-safe and easy to eyeball. When piped to another process, it automatically switches to a hybrid mode where metadata is still JSON but audio data is raw binary, reducing encoding overhead.
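Encoding a chunk in the JSON mode is straightforward with Codable. The field names below are illustrative, not AudioTee's exact wire format:

```swift
import Foundation

// One chunk of audio as a terminal-safe JSON message:
// metadata fields plus base64-encoded raw PCM bytes.
struct AudioMessage: Codable {
    let type: String
    let timestamp: Double
    let data: String  // base64-encoded PCM
}

let pcmBytes = Data([0x00, 0x01, 0x02, 0x03])
let message = AudioMessage(type: "audio",
                           timestamp: 0.0,
                           data: pcmBytes.base64EncodedString())
let json = try! JSONEncoder().encode(message)
print(String(data: json, encoding: .utf8)!)
```

A consumer simply parses the JSON and base64-decodes the data field to recover the raw samples; the hybrid binary mode skips the base64 step for the audio payload.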
These choices won't suit everyone, but they made sense for my use case and keep the implementation focused. The main classes of interest are AudioTapManager and AudioRecorder if you want to adapt the code for different requirements.
Using AudioTee
Getting started is straightforward if you have Swift installed:
git clone https://github.com/makeusabrew/audiotee.git
cd audiotee
swift run
For production use, you'll want to build a release binary:
swift build -c release
Basic usage is simple—just run it and start playing audio:
./audiotee
By default, AudioTee auto-detects whether it's running in a terminal or being piped to another process and chooses the appropriate output format. You can override this behaviour with the --format flag if needed. Be warned that binary output will make a mess of your terminal.
If you're planning to send the audio to a speech recognition service, you'll probably want to convert to a lower sample rate:
./audiotee --sample-rate 16000
While most ASRs work fine with higher sample rates, there's no benefit with voice input and some services will internally step down to 16kHz anyway. You might as well save yourself the bandwidth and them the processing overhead.
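To give a feel for what stepping down a sample rate involves, here's a naive linear-interpolation resampler. It's purely illustrative - AudioTee uses the --sample-rate flag and Core Audio's own conversion machinery, not this code:

```swift
// Naively resample by walking the output timeline at the rate ratio
// and linearly interpolating between the two nearest input samples.
// (Real converters also low-pass filter to avoid aliasing.)
func resample(_ input: [Float], from inRate: Double, to outRate: Double) -> [Float] {
    let ratio = inRate / outRate
    let outCount = Int(Double(input.count) / ratio)
    return (0..<outCount).map { i in
        let pos = Double(i) * ratio
        let idx = Int(pos)
        let frac = Float(pos - Double(idx))
        let next = min(idx + 1, input.count - 1)
        return input[idx] * (1 - frac) + input[next] * frac
    }
}
```

Going from 48kHz to 16kHz is a ratio of exactly 3, so every third sample survives; non-integer ratios fall between input samples, which is where the interpolation earns its keep.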
There are a lot more options available, including the ability to filter which processes are tapped, and to mute the tap. See the README for more details.
Acknowledgements
Apple's documentation and sample Xcode application are actually pretty good. The open source repository AudioCap was also extremely helpful for getting my bearings, and if you're interested in pre-emptively checking or requesting permissions without actually recording audio, you should check out their clever TCC probing approach. Both examples are SwiftUI applications, but they'll probably help you fill in some gaps too.
Last but not least: as with many of my projects over the last year or eighteen months, AI helped me a lot. We should get used to saying that out loud. With the combined learning curve of Swift and Apple's terse documentation, AudioTee would have taken me a lot longer without it. Thanks, Claude!