audio2score uses AI to turn recorded music into sheet music — but just how good are the results?
While headline‑grabbing so-called Artificial Intelligence (AI) technology fills our social media feeds with fantasy artwork and formulaic poetry and prose, a much more focused version of it has been powering various pieces of transcription‑oriented music software for years. Tools that scan printed music into live notation, or that extract note information from melodies or individual tracks, ranging from cheap iOS apps to Celemony's Melodyne, are commonplace now.
However, software that can ingest audio of a multi‑instrument mix and then attempt to separate out and notate all its constituent parts is still comparatively rare. For good reason: it’s really difficult to do, even for expert‑level human musicians. But it’s exactly this that the (self‑consciously lower‑case) capella audio2score pro 5.0 aims to do, on Windows (10 or later) and macOS (10.13 or later). In fact the app goes beyond mere transcription, as we’ll see in a minute, and journeys well into arrangement territory. It’s more capable and flexible than you might at first assume.
Analysis
audio2score pro 5 (which I’ll call A2S for the rest of this review, for grammatical sanity) runs in a single window on your computer, and has a pleasant immediacy about it from the off. After launch you’re prompted to open an audio file (an 8‑ or 16‑bit WAV, MP3, or WMA on Windows), or an A2S project in capasp format. Then it’s straight down to business.
Assuming you’ve fed in new audio, you’re met with a dialogue box that asks you to specify what it contains: the choices are Piano only, Classical (without vocals), Pop (without vocals), and Pop (with vocals). An analysis process then takes place, with a progress window showing the app calculating ‘spectrum’ and ‘sound tissue’, then recognising tuning, key and notes. On my M1 Mac this typically took approximately half the duration of the audio, varying according to complexity. And then, boom, a notated score appears.
It is just possible, at this stage, especially with simple musical material, that the software will have nailed it. You can gauge that visually if you’re a music reader, or to some extent via A2S’s embedded SoundFont‑based playback module. Playback controls are found in a strip at the bottom of the window, and one of them is a useful crossfader that outputs your original audio file (in mono) when set to the left, sample playback from the extracted data to the right, or a sync’ed mix of the two in between.
If all is up to snuff you’ll have already reached the ultimate point of the exercise, which is to export a PDF score, a type 0 or 1 MIDI file, or a MusicXML document for passing on to another notation application (like Sibelius). If you use capella software’s own notation application (capella) then the transcription can be transferred to that directly or via the specialised CapXML format. There is one last option, to save the A2S project itself, and that lets you return to unfinished work without having to re-run the analysis process.
Reality
Although audio2score’s initial analysis often gets some things right, the chances of complete accuracy are actually slim. And while playback of extracted data frequently sounds reasonably plausible, in a ball‑park sort of way, the notation may well be off, ranging from clunky, inelegant and scored for a different instrumentation, via strangely full of holes and misallocated voices, to downright bizarre.
The vast majority of features in the application are concerned with refining the accuracy of the data extraction, and hence the coherence of the musical content and the readability of a score. It all happens at two markedly different levels of complexity.
In what’s known as the Assistant, where you’ll begin and end your A2S journeys, you can get a lot done. Your analysis will have begun in the Start tab, and Export is the ultimate goal, but in between are several others: Recognition, Score, Barlines and Layout. There’s a point to this ordering, and tweaks you make to the relatively few parameters in each tab can have a significant bearing on the outcome.
Most fundamental of all is the so‑called ‘Scheme of recognition’. A ‘Note by note’ scheme does indeed attempt to extract and represent the maximum amount of original audio file content, without adding anything that wasn’t there to begin with. Interestingly, though, and perhaps surprisingly, A2S is by design not at all hung up on sticking with the instrumentation in your audio file, and indeed frequently merrily discards it.
After ingesting a piano piece, for example, choosing the ‘Piano’ instrumentation preset might give you a faithfully‑notated result. But you could just as easily select ‘Guitar’ to get an instant arrangement for that instrument, notated on an appropriate octave‑down single treble stave and with a playable pitch range. Conversely, you might choose to analyse a guitar performance and have A2S score it for classical organ, with a pedal part and feasible voicing and hand‑stretches for a bass and treble stave above. Eleven preset instrumentations completely disregard the instrumentation in the input file, while a further seven do attempt to recognise timbre, but only within the predefined combinations of piano, strings and wind, which I think is a bit restrictive. For vocal pop music transcription, which is a new ability in version 5 of the app, some alternative options appear here, tilted towards transcription in a lead‑sheet style. You get to specify the voice type, choose whether the vocal line also gets integrated into the accompaniment staves, and the type of accompanying instrument or ensemble.
A2S’s alternative recognition scheme is called ‘Holistic’, and it’s quite a different kettle of fish. This unashamedly does not look to make an accurate transcription, but instead takes an overview of your content as a whole and generates a novel arrangement based on it. It’ll almost always include elements that were not in the audio file, potentially very different in style, based on seven wide‑ranging preset instrumentations that start with a generic piano/chords/bass combo and end up with a nine‑stave orchestra.
Many if not all Holistic scores end up largely fictional: the underlying harmonic and rhythmic content will be there, but the instrumental parts are essentially algorithmic. Piano left‑hand parts frequently end up as rolling broken chords, and orchestral inner parts as alternating pairs of notes, regardless of whether there was any movement at all in the original. In the ‘Piano/chords/bass’ instrumentation you’ll additionally get a pair of generic staves with a single‑note bass line and block treble‑clef chords, which can be surprisingly useful as building blocks around which you could manually build other content.
The whole premise of the Holistic scheme can seem odd at first, especially if you came along expecting the whole truth and nothing but the truth. It’s immediately obvious to a notation‑reading musician, though, how much generally more coherent and legible its scored output is than the Note‑by‑note alternative. The sacrificing of accuracy for plausibility here will definitely be the better option for some material, and for many users’ needs.
To tie up this section I’ll quickly mention the two sliders that adjust the volume threshold at which notes are sensed, and the level of detail in the extraction. Crank either or both up and you get more notes in your score, which may or may not help the situation.
You’ll also find parameters and data display corresponding to musical key and reference pitch recognition, the number of likely harmonies in each measure/bar, and the presence and style of chord symbols in the score. Time signatures can be manually adjusted, tempo ranges quickly defined and restricted, and barlines moved by tapping along to the original audio file using the space bar. All of these are essential steps towards accuracy, especially when the AI has got the wrong end of the stick. Finally, some basic page layout elements can be controlled.
Augmented
A2S’s Assistant interface is quick and convenient, but much more control lurks behind the ‘To main program’ button.
The main thrust here is a score view that’s synchronised with a graphic representation of notes the software has found inside your audio. The Melodyne‑like blobs are the ‘sound tissue’, with notes optionally colour‑coded to their recognised timbre. The blobs are not editable, but the overlay of ‘note boxes’ is, creating what is for all intents and purposes a DAW‑like ‘piano‑roll’ editor. Click on a box, or a score note, and all the data associated with it appears in a panel to the right. Boxes can be deleted and also dragged to change their pitch or start and end points. And for more clarity when working in detail on a complex score, whole instrumental groups or just individual instruments can be hidden.
To the left in this same view is a list of the instruments in the score. You can add and delete staves, and existing ones may be reordered, renamed, their clefs changed, and combinations of instruments arranged into bracketed groups in the score: all of which goes a long way towards breaking out from the preset instrumentation schemes of the Assistant. There’s also support for transposing instruments here, with the underlying notation engine making the necessary changes to pitch and key signature for an instrument like a clarinet in Bb, for example.
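For anyone unsure what those 'necessary changes' amount to, here is a rough sketch of the arithmetic for a clarinet in Bb. It is purely illustrative: the function names are my own invention and nothing here is taken from A2S or any capella software. A Bb clarinet sounds a major second lower than written, so the written part sits two semitones above concert pitch and its key signature gains two sharps (or loses two flats).

```python
# Illustrative sketch only: these helper names are hypothetical and are
# not taken from audio2score or any capella software.

# A clarinet in Bb sounds a major second (2 semitones) below what is written,
# so the written part sits 2 semitones above concert pitch.
BB_CLARINET_SEMITONES = 2
# Moving up a major second adds two sharps (or removes two flats).
BB_CLARINET_FIFTHS = 2


def written_pitch(sounding_midi: int, semitones: int = BB_CLARINET_SEMITONES) -> int:
    """Return the written MIDI note number for a given sounding note."""
    return sounding_midi + semitones


def written_key(concert_fifths: int, shift: int = BB_CLARINET_FIFTHS) -> int:
    """Return the written key signature for a given concert key.

    Key signatures are counted in sharps (positive) or flats (negative).
    """
    fifths = concert_fifths + shift
    # Beyond seven sharps, respell enharmonically on the flat side.
    if fifths > 7:
        fifths -= 12
    return fifths
```

So a piece in concert Bb major (two flats, or -2) ends up written in C major (0), and a sounding middle C (MIDI note 60) is written as the D above it (62).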
The main program view adds some additional sophistication to the Assistant’s parameters. Now you get controls for quantisation (which can be overridden on a per‑instrument basis) and the tendency for individual parts to form smooth lines. In the case of chordal instruments you can specify the maximum number of notes that will exist in a chord, and the largest interval between them. Finally a ‘Postprocessing’ tab allows completely new notes and harmonies to be added into A2S’s automated transcription, via mouse and keystrokes.
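As a loose illustration of what quantisation with a per‑instrument override means in practice, the sketch below snaps note starts and lengths to a rhythmic grid measured in beats. It is an assumption‑laden toy, not a description of how A2S works internally: the names, the note format (start, length, MIDI pitch) and the default semiquaver grid are all hypothetical.

```python
# Illustrative sketch only: the names, note format and default grid are
# assumptions, not a description of audio2score's internals.

DEFAULT_GRID = 0.25  # grid size in beats (a semiquaver when the beat is a crotchet)


def quantise_part(notes, grid=DEFAULT_GRID):
    """Snap each (start, length, pitch) note to the nearest grid position."""
    quantised = []
    for start, length, pitch in notes:
        new_start = round(start / grid) * grid
        # Never let a note collapse to zero length.
        new_length = max(grid, round(length / grid) * grid)
        quantised.append((new_start, new_length, pitch))
    return quantised


def quantise_score(parts, overrides=None):
    """Quantise every part, using a per-instrument grid where one is given."""
    overrides = overrides or {}
    return {
        name: quantise_part(notes, overrides.get(name, DEFAULT_GRID))
        for name, notes in parts.items()
    }
```

Calling `quantise_score(parts, {"bass": 1.0})` would keep the default grid for every part except the one named "bass", which is forced onto whole beats. A coarser grid gives a cleaner‑looking page at the cost of rhythmic detail, which is the same trade‑off the review keeps running into.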
AI IQ
The only meaningful way to assess a product like A2S is to chuck a lot of material at it and see what comes out. That’s exactly what I did, and the results were... interesting.
At the risk of biasing the outcome of this review, incorrectly, I’ll start by describing what happens when things don’t go well. The reason being that they also reveal a lot of what A2S gets right. And in fact the situation gets rosier from then on.
Call me naïve, but the very first thing I fed into A2S was an orchestral recording: a very clear, detailed and quite acoustically dry capture of an early classical symphony. I imagined it would be a fair fight. It had an active, full string section, plus much slower‑moving four‑part winds and horn, aligning quite nicely with the Note‑by‑note ‘Piano + four strings + five winds’ instrumentation preset.
The resulting score seemed close to unusable at first sight. There was a fundamental lack of fine detail (even with sensitivity sliders cranked), and all string staves were peppered with rests and the occasional individual note clearly cut adrift from elsewhere (or completely fictional). The app could not detect repeated violin quavers in a prominent main theme, and always transcribed them as long held notes. In places the metrical basis of the music, and hence the notated rhythms, had been completely scrambled. Give these parts to session musicians, in this state, and I suspect you’d have a mutiny on your hands.
But it was also interesting to see how much was correct. That included time signature, key, a lot of the meter and barline placement, most of the bass line, the harmony (revealed by the chord symbols), and the sections where the wind came in and out. In fact the wind writing, while not the same as the original, was often entirely plausible. Oh, and the piano stave was completely empty, as it should have been.
As a ‘broad brush’ depiction of the music, it has to be said that capella's A2S was strangely successful. A lot of work would certainly still be needed to make either a usable score or MIDI data ready for playback by an orchestral sample library. However, both beat tapping and some straightforward reallocation of timbre in the main program improved things, and there are enough tools on hand to tighten things up even further if you’re so inclined.
Next up, I tried some clear, non‑distorted pop mixes with drums and vocals. Again, the outcomes were at first only partially successful, and a long way from what a pro‑level human transcriber would come up with. And yet, again, there was still a lot of useful material that could be corrected or built on further down the line. Drums and percussion were ignored as per the app’s design and didn’t seem to interfere with pitched content too much. Holistic mode always did better here than Note by Note, which suffered from the cross‑stave bleed I mentioned already, but the way it works takes some getting used to.
Take holistic piano parts for example, in the Piano/chords/bass instrumentation. You get to choose what goes in the right‑hand part: either a simplified reduction of other instrumental parts or pretty much a copy of the Chords stave. But that quasi‑arpeggiated left‑hand line: wow! From a moderate ‘Level of difficulty’ upwards it becomes relentlessly active, and can be viciously hard to play, with leaps and figuration that constantly goes outside a typical hand span, and makes no concession for key. It’s idiomatic piano writing to the extent that it sounds piano‑ish, but it has the unmistakable feel of a part an AI would write, not a skilled human composer. There’s little regard for elegance, such as in the frequent overlaps and crashes between right‑ and left‑hand material.
Other Holistic scoring aspects were better. In the orchestral and quartet presets you’re likely to see some continuous, melodic‑style parts with reasonable voice leading, and the simple but animated broken‑chord accompaniment patterns are effective, in a minimalistic, neutral sort of way. Orchestral wind and brass parts are given to padding, with held chord notes evenly distributed amongst them. The application’s repertoire of material is small though: those same figurations occur again and again, regardless of the material you feed in.
In the end, the best and most useful and accurate outcomes were with relatively simple material that included instrumental duos, trios and solo piano. Here, A2S would often manage pitch accuracy very well. Rhythmic accuracy was always less certain initially, but could be improved with the onboard tools. The application never recognised compound (shuffle/triplet) rhythmic content by itself, for me, and the resulting attempts to notate it in 4/4 led to chaos. Manually entering the time signature would fix things in the blink of an eye though.
Working at this lower level of complexity, and with a few provisos in hand, there’s lots to like. It accurately revealed for me a few jazz piano voicings in recordings I’d pondered over for years, and that in itself was very pleasing. I suspect it would do the same for much guitar writing too, though it’s then a bit of a shame there’s no tab notation option within the application itself.
Final Score
The promise of software like this, to be able to pull off transcriptions that would represent hours of painstaking work even for a professional orchestrator, is immense. It could prove to be a huge time‑saver for some jobs, but has an educational potential too, revealing harmonies, note patterns, and pointing out arrangement and orchestration strategies, in audio material that otherwise has no printed or published version.
That’s the dream, and there’s no question the app can already transcribe some material well. Its extensive arrangement, reduction, note‑generation and page layout capabilities take it to places far beyond a straight transcription tool. Various aspects of execution are excellent: it’s fast, clean, easy to use, and really well documented. It’s also markedly more sophisticated than tools like Melodyne or the built‑in audio‑to‑MIDI features in some DAWs (eg. PreSonus Studio One Professional). They might appear to be similar, but most are fundamentally simpler, built to deal with separated, track‑based, unmixed audio, and sometimes restricted to monophonic lines. A2S is attempting something in an entirely different order of complexity.
However, the limits of musical plausibility in A2S’s output are reached surprisingly quickly, and the built‑in tweaking and editing tools only go so far. Ask the app to crunch more complex mixes and it’s not at all certain what you’ll get in return. The Holistic mode keeps things coherent longest, and can at least reveal underlying harmonic building blocks, but does so only by discarding much of the original material and generating new. I don’t have many other gripes, but one concerns the lack of warning that Main Program edits can be lost if you decide to try another recognition scheme. Another is the lack of support for 24‑bit WAV files.
I instinctively applaud apps like audio2score pro, for their sheer ‘have a go’ bravery. The complexity of the underlying programming skill must be immense. I suspect though that truly accurate audio transcription and creative offshoots that equal what experienced human arrangers could achieve are still off by some decades and many orders of magnitude of processing power.
So don’t expect perfection. However, go in with the idea that A2S will get you some of the way, and if you’re willing to build on its approximations then there’s a lot of value here. If it can save hours of tedious donkey work — and for many jobs I think it will — then that should justify the reasonable asking price in no time at all for many users.
Dos & Don’ts
A2S will tackle many different kinds of music, but its transcription skills are far from comprehensive. Most notably, there’s no recognition of drums or percussion at all: it’s not even attempted, and no settings will change that. So if you’re still trying to get to the bottom of the Amen Break, look elsewhere.
Vocal recognition has improved hugely in this latest version of the app, thanks to a new fine‑tuned recognition algorithm and features purpose‑built to handle vocals. What A2S still struggles with, though, are synthetic and distorted sounds. They’re either ignored or bleed into nearby instrument staves, causing all sorts of chaos. Autechre devotees probably need not apply.
Otherwise the biggest limitations include really intricate rhythm, including tuplets other than triplets for example, and an inability to track time signature or key changes in larger musical structures. That’s almost never a deal‑breaker though, even for the most committed prog fan, with solutions ranging from feeding audio into A2S in sections, to further editing down the line in a more fully‑featured notation application.
Pros
- More than just a straight transcription tool: arrangement, reduction, editing and page layout features are built in too.
- User friendly and well documented.
- At its best, it’s a real time‑saver.
Cons
- When the AI goes wrong, it goes very wrong!
- No recognition at all of drums and percussion, and the tech is flummoxed by distortion and some synth sounds.
- Timbre‑aware scoring is subject to a limited menu of preset instrumentation only.
Summary
An easy‑to‑use app that attempts one of those notoriously difficult jobs in music: generating a musical score from a mixed, multi‑instrument audio file. It works... sometimes.