BRINGING DROIDS TO LIFE WITH THOUGHTFUL SPEECH

Michael Perl  •  Contact

With thanks to Disney's L. Towersey and M. McMaster.
All proceeds support Force for Change and FIRST.

Arrow

IconStarbirdINTRODUCTION

IconR2R2-D2

IconAstromechASTROMECH

IconGonkGONK

IconDOD-O

IconFAQFAQ

IconStarbirdINTRODUCTION

Droids in the Star Wars universe come in a fantastic variety of shapes, sizes, and colors. The visual experience of droids is widely celebrated, from fan artists to droid builders clubs to archivists. Every physical inch of every droid has been dissected and analyzed thousands of times over.

However, as is often the case in media, the auditory experience of droids has received much less attention. For as unforgettable and universally beloved as these droids' voices are, little effort has gone into studying their speech. Even less effort has gone into recreating their speech beyond basic pre-recorded soundboards.

When audio and visual experiences work in concert, that's when magic happens. Walt Disney knew this well. He didn't call them Audio-Animatronics for nothing.

The purpose of Human-Cyborg Relations (H-CR) is to fill the droid audio void. The process varies from droid to droid, but in every case the goal is the same: bringing droids to life with thoughtful speech.

Droids in the Star Wars universe generally communicate in one of two ways:

• Galactic Basic Droids, like C-3PO, speak in human language.
• Binary Droids, like R2-D2, speak in beeps, whistles, clinks, pops, and so on.

Generating these two types of speech requires two completely different approaches.

BB8BINARY DROIDS...Beeps & Whistles
GOAL

GENERATE VARIED, EMOTIONAL BINARY IN REALTIME

Binary droids are sentient. They are often capable of conversing thoughtfully and emotionally with people in the Star Wars universe. In our universe… not so much.

Traditional droid audio systems in toys, in video games, and at conventions operate like a soundboard. They simply play preset audio clips without regard for context or variety.

H-CR generates living, breathing binary droid speech one "syllable" at a time. Whether the subject is an entirely new droid, or a droid with strict canonical precedent like R2-D2, the generated speech is emotionally-specific and highly varied… as one would expect from a sentient being.

H-CR is the difference between a droid that looks real and a droid that is real.

EXAMPLE
R2-D2's original Birdsong B
New birdsongs procedurally generated by H-CR in the style of Birdsong B
USES

• THEME PARKS - An autonomous droid like Disneyland’s J4-K3 uses H-CR to converse thoughtfully with a guest. An Audio-Animatronic droid in a ride queue reacts as a guest approaches. The vocalization is different every single time.

• TOYS - An updated Galaxy’s Edge Droid Depot mobile app lets the user trigger happy, sad, scared, or angry vocalizations with infinite variety. These triggers influence an underlying "emotional state" which gets displayed in the app. This emotional state, in tandem with the droid’s personality chip, colors its vocalizations.

• SETS - A droid delivers an emotionally loaded cue for an actor to play off.

• VIDEO GAMES - A droid companion delivers a fresh reaction to every event.

3POGALACTIC BASIC DROIDS...Human Language
GOAL

TRANSFORM AN OPERATOR’S VOICE IN REALTIME

Galactic basic droids have diverse and memorable voices. Their unique vocal character comes largely from post-production effects, like D-O’s trademark stutter and low-fi speaker, or a tactical droid’s synthetic, monotone delivery.

These effects traditionally require a combination of manual editing and power-hungry, high-latency software which makes live performance impossible. Existing voice processing apps are slow and low quality. They all have the same limited feature set because they avoid the most complex (and interesting) voice effects.

H-CR applies modulation effects in realtime. The software is light and highly portable. The latency is so low that the operator can comfortably monitor the output in realtime without a long, disorienting echo.

EXAMPLE
A raw recording of J.J. Abrams
A tactical droid from The Clone Wars
J.J. modulated by H-CR to sound like a tactical droid
Leia speaking through Boushh's helmet in Episode VI
J.J. modulated by H-CR to sound like he is speaking through Boushh's helmet
USES

• LIVE PERFORMANCES - A performer provides a live voice for D-O during a panel or on a convention floor. The performer can play to the room, react, and improvise.

• VOICE-OVER RECORDINGS - An actor monitors their modulated voice in realtime. This live feedback allows them to tailor their performance to the effect.

• PROTOTYPES - A director experiments in realtime with voice effects for a unique new droid.

• TOYS - An app transforms a user’s voice to sound like a bounty hunter.

IconR2R2-D2
R2Panel
Download Mac and iOS DOWNLOAD
Download Windows DOWNLOAD
Download Android DOWNLOAD
INSTRUCTION MANUAL

If you use this vocalizer, please support STEM education by making a donation to FIRST®

Generate unique, emotional R2-D2 vocalizations with the press of a button! The R2-D2 Vocalizer is a fully functional AI powered by advanced research into the speech patterns and emotional logic of R2-D2.

Each vocalization is unique. A sophisticated algorithm assembles R2’s speech "syllable" by "syllable." Billions upon billions of rules dictate how each sound should be selected, processed, timed, and sequenced to achieve emotional and canonical accuracy.

CHAPTER 1: THE CHALLENGE

R2-D2’s voice is the most well known and deeply loved of any binary droid. The features that make his speech memorable also make it incredibly difficult to reproduce:

R2BenBurtt

• It’s organic. Ben Burtt produced many of R2’s sounds with his voice. Some are unmistakably human, like whistles. Others are animalistic, like bird calls. Convincing organic sounds are virtually impossible to synthesize.

• It’s analog. Ben used analog synthesizers to shape R2’s trademark sounds. The sounds were then recorded on tape. Analog layers create warmth and coloration that are difficult to reproduce digitally.

• It’s hand curated. The sounds we hear in the film are not first-takes. Ben describes experimenting with noises while turning knobs on his synthesizer in realtime. Only the most interesting, best sounding, and most emotionally purposeful takes made the cut.

These analog, organic, human processes are the reason R2’s voice is unforgettably lifelike. If you ask someone to "make a BB-8 sound," they’ll give you a vague, lovely sweep. If you ask someone to "make an R2-D2 sound," they’ll give you a sound-by-sound impression of an entire line with accurate pitch, timing, and timbre.

The speech generated by H-CR must be emotionally specific. It must also be infinitely varied, yet indistinguishable from R2’s original lines. The latter is critical both for listener expectations in practice, and for adherence to canon on principle. The results must be R2-D2.

Therefore, digitally synthesizing new sounds is out of the question. The only option for creating "new" sounds is to repurpose the originals: to break R2's recordings into building blocks which can be intelligently reprocessed and reassembled. As it turns out, there is precedent for this.

We found that R2 speaks 643 lines containing ~2000 sounds across the Skywalker saga. Yet of those 2000 sounds, only ~350 of them are "original." The other 82% are repeats and variations. Of the 18% which are original, 97% can be traced back to the original trilogy, and 92% can be traced back to Episode IV alone. R2 reuses a ton of sounds… especially outside of the original trilogy. This makes sense; we reuse words all the time.

But there’s another likely reason Ben Burtt and Lucasfilm didn’t produce new R2 sound libraries for every single film: you don’t mess with perfection. Instead, the bulk of R2s speech was assembled by cutting, pasting, and post-processing samples from a small batch of original recordings.

That assembly process is what makes R2, R2. It’s a pattern of choices by the designers that define what R2 feels, how those feelings change in response to stimuli, and how R2 expresses those emotions through speech. It’s R2’s sentience, codified.

This is the process H-CR replicates. And in doing so, for all intents and purposes, we may have actually created a real, sentient R2-D2.

CHAPTER 2: THE SOUNDS

We've established that H-CR can't synthesize new R2 sounds. The software must instead repurpose existing R2 sounds in order to generate new speech. In Chapter 2 we collect, break down, clean up, and organize the original R2 source recordings.

R2Recordings
STEP 1

COMPILING RECORDINGS

We began from a collection of ~200 original R2 source recordings. While the lineage of these files is unknown, the recordings are in lossless WAV format and generally sound free from data compression. The files have names like gen-1, gen-2, hc-1, hc-2. We use those names later to identify the individual sounds within each file, such as Tone gen-1 and Bump hc-2.

R2Categorizing
STEP 2

RESTORING

Some of the source recordings have quality issues like hiss, pops, distortion, and reverberant room sound. These imperfections sound unpleasant on their own, and their roughness is magnified when they are played next to other, cleaner recordings.

We use a combination of manual spectrograph editing and algorithmic restoration to surgically fix these recording problems on a case by case basis. To avoid over sanitizing, we only treat clearly damaged sounds. We preserve all human imperfections and enough tape hiss to match the cleaner recordings.

EXAMPLE
Chipmunk gen-76 source
Chipmunk gen-76 after restoration
R2Categorizing
STEP 3

CATEGORIZING

It’s time to organize. We start by cutting each track into shorter sounds. These sounds are categorized and color-coded based on their tonal character. The categories are:

• Bumps - Soft "boops"
• Tones - "Beeps"
• Whistles - Human whistles characterized by breathiness, usually with rising pitch at the start and descending pitch at the end
• Systles - Synthetic sounding versions of whistles
• Greets - Mumbled mechanical sounds with a series of fast tones
• Squelches - Muddy, flappy, bubbly
• Mumbles - Muffled "hmm"s
• Moans - Sad, wistful, muffled held "mmm"s
• Gasps - Raspy, struggling air intakes
• Reeos - Zipping, mechanical motors
• Lasers - Filter sweeps, pew-pews, differentiated from systoles by more harmonics and faster, more dramatic sweeps
• Chipmunks - Fast staccatos like a chipmunk laughing or chittering, differentiated from mumbles by rapidity
• Birdsongs - Clustered series of whistles, systles, and bumps
• Growls - Low, muffled, soft attack, often sweeping, sometimes ranging up higher into honkiness, differentiated from bumps by muffled and soft attack
• Alarms - Sweeping alarm sirens
• Arpeggios - Rapid tones, differentiated from birdsongs by having more dissonance and fewer whistles and sweeps
• Randoms - Miscellaneous sounds that don’t fit into the above categories

EXAMPLE
Sounds from the Reeo category
R2Syllables
STEP 4

DELINEATING SYLLABLES

Now that each source recording has been restored and divided into categories, we are left with smaller units of sound. For additional flexibility, one final pass is made to further divide these units on long pauses. These resulting units effectively represent words in R2’s speech.

However, for reasons that will be explained later, words are still not small enough units to reproduce all of R2’s film sounds… let alone to generate new speech. Words need to be further broken down into syllables. Syllable breaks are located at every conceivable dividing point within each word. Some words contain only one syllable, while other words contain sixteen.

Tone F is a single beep. It does not need extra syllable breaks. Reeo A contains two whirs. It becomes two syllables. Birdsong gen-32 contains twelve chirps. Each chirp becomes a syllable.

EXAMPLE
Tone F (one syllable)
Reeo A (two syllables)
Birdsong gen-32 (twelve syllables)
R2Deduplicating
STEP 5

DEDUPLICATING SYLLABLES

Sometimes the same syllable appears in multiple source recordings. Upon inspection, it might turn out that Bump gen-5 and Bump gen-27 are actually identical. Duplicate sounds must be identified for two reasons:

First, they help identify the best quality, most balanced, least noisy recordings of each sound. The purer the source, the more H-CR can do with it.

Second, they narrow down the final library of sounds. There is no need to save the same sound twice.

In the end, we are left with 835 unique syllables. These sounds take up just 14 MB uncompressed. With 128 kbps compression, the entire library comes in under 3MB.

R2Generating
STEP 6

DISSECTING SYLLABLES

There aren't always enough examples of a particular category of sound in order to create meaningful variation down the road. In these cases, H-CR goes beyond the syllabic level and adds yet another dimension of variability by dissecting the syllable recordings themselves.

Let's look at some moans. We have multiple sound recordings for what is very clearly supposed to be the same type of sound with the same stylistic expression. All of these moans are variations on the same concept.

• sad-2 - Two moans with muffled tones
• sad-3 - Moan with muffled tones
• sad-4 - Doubled moan
• sad-5 - Moan with clear tones
• sad-6 - Moan with clear tones
• wc-4 - Moan with muffled tones followed by moan without tones
• wc-19 - Moan from sad-1 time compressed with different trailing sound and different squelch
• wc-14 - Moan with clear tones
• wc-25 - Moan with muffled tones, doubled moan

EXAMPLE
Stylistically similar Moans

Every one of these words features a human sounding "hmm." Most of them also include overlaid tones... but not all of them. The ones without tones exhibit a frequency shift (not to be confused with a pitch shift), which is effectively one half of a ring modulation. Here we have the opportunity to create a new dimension of source material that is still canonically supported. If we separate the moans from the tones, we can procedurally generate both sounds separately and exponentially increase the variety of Moan vocalizations. We can even create additional toneless moans by applying realtime frequency shifting to the isolated moans.

For this task we use a spectral editor (pictured above). This lets us view the audio spectogram -- a visual representation of a sound's frequency and amplitude over time -- and edit that spectogram as if it's an image in photoshop. In this way we can isolate parts of the sound, such as the tones within a moan, while leaving the moan untouched.

R2SpectogramMoan
EXAMPLE
Original sad-2 moan
Sad-2 moan with tones removed

We use a similar trick to create infinite variations on R2's one and only electrocution sound. We dissect and reverse engineer the recording to open up tons of possibilities for controlled randomization.

R2SpectogramElectrocution
EXAMPLE
The original electrocution
Electrocution tones isolated
Electrocution swirls isolated
Electrocution upper moan isolated
Electrocution lower moan isolated

The source files we used contain all of the R2-D2 sounds from Episode IV. These account for ~92% of all vocalizations from the Skywalker saga. However, most of the remaining 8% -- the "new" sounds peppered into Episode V onward -- are not present in the source files. We were able to isolate some of these sounds from the final mixes using a combination of spectral editing, equalization, and reverb removal. But many sounds from that 8% still elude H-CR. If anyone is able to provide access to those newer R2 sounds, please contact us.

EXAMPLE
Line 35 from Episode VI
Systle 6-35_1 isolated from the film's final mix
A missing sound from Episode III
CHAPTER 3: THE LOGIC

We have extracted the syllables, which are the building blocks of R2’s speech. In Chapter 3, we develop instructions that tell H-CR what to build with those blocks. We must reverse engineer the sound designers’ thought processes to uncover the logic behind R2’s language. This logic tells H-CR what it’s "allowed" to do. How can R2’s sounds be manipulated without losing what makes R2, R2?

R2Film
STEP 7

AGGREGATING FILM LINES

We search all nine Skywalker saga films for every instance of R2 sounds. H-CR uses these versions of the films:

• Harmy’s Despecialized Editions of the original trilogy
• 2000s DVD releases of the prequel trilogy
• Disney+ versions of the sequel trilogy

643 lines are identified, time stamped, and exported to audio files. The files are labeled with the naming convention [Episode]-[ClipNumber], then imported into an audio project.

R2ReverseEngineering
STEP 8

REVERSE ENGINEERING R2'S LOGIC

Each film line is reconstructed one word at a time. This meticulous process involves wading through our database of ~350 unique words to find a match for every single one of R2’s ~2000 film sounds.

The sequence of these words in R2's lines provides the beginning of our logical framework. If the designers put Birdsong B before Reeo A in one line, but before Bump C in another line, then H-CR can also put Birdsong B before either Reeo A or Bump C and the result will still be R2.

However, as explained in Chapter 1, only 18% of R2's sounds in the films are "original." The source words do not always appear fully intact in the films. Sometimes a film sound is created by chopping up and reshuffling a word. This is where syllables come in.

Take line 157 from Episode V. This sound is a clear example of a Birdsong. Birdsongs are homogenous to begin with, so identifying the source word is already a difficult task.

EXAMPLE
Line 157 from Episode V
Every possible match from the Birdsong category

In this case, there is no exact match to be found. What was already a difficult process of cross-referencing 350 unique words suddenly turns into cross-referencing against 700 unique syllables, some of which are as brief as 0.02 seconds. After careful manual examination of every "chirp" syllable in every Birdsong, a solution appears.

The sound in line 5-157 is actually generated entirely from the frequently used word Birdsong B. However, while Birdsong B’s eleven syllables often appear verbatim, in this line the syllables have been completely rearranged. First comes syllable 11, then 10, then syllables 5-8, then 2-4, then 11 again.

EXAMPLE
The original Birdsong B
Line 157 from Episode V
Birdsong B rearranged to match Line 157

The rearranging, adding, or dropping of syllables gives us powerful new insight into the minds of the designers. These are deliberate variations designed to create diversity and convey new emotional ideas. Every one of these variations exponentially expands H-CR’s options for generating speech.

Syllabic rearrangement isn’t the only form of variation. Sometimes effects are applied. For example, the Reeo at the start of line 82 from Episode V and the Reeo from line 26 of Episode IV appear initially to be different sounds. But when the line 5-82 Reeo is time compressed without pitch compensation (thereby making the clip play faster and at a higher pitch, like holding fast forward on a tape recorder), the result turns out to be identical to the line 4-26 Reeo.

EXAMPLE
Line 82 from Episode V
Reeo H
Line 26 from Episode IV
Reeo H time/pitch adjusted to match Line 26

These effects add a third dimension to the reverse engineering process. Not only could a film sound be any of 350 words, it could also be any syllable within any of those words, in any order, with extreme time/pitch adjustment or even backwards playback. But this also means H-CR now has a third dimension in its vocalizer logic.

Without taking any creative liberties, H-CR’s logical framework now knows that Reeo H can be played at either 56% or 100% pitch/speed. With only minor creative liberties, H-CR can further deduce that Reeo H can be played at any rate between 56% and 100%, and that any similar sounding Reeo can also be played within that range.

Alarm E showcases another variation created by effects. This word is made up of three connected segments, each of which represents one syllable. Alarm E appears nine times in the films. However, in two of those appearances, the third syllable is pitched downward.

EXAMPLE
Alarm E from wc-39
Alarm E from wc-29, which has a lower final syllable

Again, without taking creative liberties, H-CR now knows that the final note of Alarm E can be played at 93% pitch/speed. With minor creative liberties, any of Alarm E’s three segments can be played at a rate between 93% and 100%, and so can any segment of any similar sounding Alarm.

Roughly half of the ~2000 film sounds contain variations. Each of these variations contributes at least one rule to H-CR’s logic. The resulting algorithm tells us how words can be arranged and when syllables can be added, removed, reorganized, reversed, or pitch shifted… and to what degrees.

This is the logic of R2’s speech. With this algorithm, H-CR accomplishes the first portion of its goal: infinitely varied speech that is indistinguishable from the original.

R2Generating
STEP 9

RESYNTHESIZING WORDS

With access to words at the syllabic level, as well as access to the individual sounds within some dissected syllables, H-CR can now resynthesize variations on R2's speech. Every single one of R2's ~350 unique words is its own puzzle. H-CR's algorithm tackles each word with a unique process for randomization and resynthesis.

Let's look at Birdsong B. In an earlier example we saw how the designers created a variation on Birdsong B for Line 157 from Episode V. H-CR can automatically replicate this process by stitching together individual chirp syllables. However, H-CR has a leg up on the original designers here because it's operating algorithmically rather than by hand. The Birdsong B variation from Line 157 is assembled by rearranging syllables from Birdsong B. The designers only needed one variation, so there was no reason to look for source audio outside of Birdsong B. But H-CR's goal is unlimited variation. Thus, H-CR can not only resynthesize Birdsong B from Birdsong B's syllables, but it can also dive into every other birdsong and extract more eligible syllables for extra variation.

Crucially, the algorithm isn't just picking those syllables at random. Because R2's birdsounds are so organic, there is a whole series of rules that must be observed to ensure that the resynthesized birdsongs retain the feel of the originals. Some chirps sound pure, some sound raspy, some sound synthetic, some sound natural. Some are high pitched, some are low pitched. Some increase in pitch, some decrease in pitch, some hold steady. Some are long, some are short. The ebb and flow of those qualities over the course of each birdsong gives them their memorable, lifelike quality.

The timing of each chirp is critical as well. Humans are highly attuned to the rhythm of organic sequences like R2's birdsongs. A gap between two particular chirps that's 0.01s too long can be the difference between a birdsong that sounds right and a birdsong that sounds very wrong.

Using the original birdsongs to seed the algorithm, H-CR carefully selects, retimes, and modifies each chirp into a sequence that conveys a particular emotion (more on that in the next chapter) while matching the logic and styling of the source material.

EXAMPLE
The original Birdsong B
New Birdsong Bs procedurally generated by H-CR

Other words require entirely different methods. R2 only has one scream recording. That recording is reused throughout the films. A scream doesn't have individual syllables, so the technique used to vary Birdsong B would not work.

The scream usually appears in the films as a direct copy/paste, but in one instance a time shift and pitch increase were applied in post production. Remember that the original trilogy pitch increases were created using the fast forward effect. This particular variation features the inverse: a pitch increase and a time expansion. This effect would have been achieved using the more advanced audio editing techniques available in the prequel era. To recreate this in realtime, H-CR uses granular synthesis -- the process of breaking a sound into many small pieces and reassembling those modified pieces in realtime -- to create longer or shorter screams without affecting pitch. The pitch can then be manipulated independently from the duration.

EXAMPLE
The original R2-D2 scream
The scream in Line 38_2 from Episode III (time expansion and pitch increase)
New R2 screams procedurally generated by H-CR
CHAPTER 4: THE EMOTIONS

H-CR can now generate an infinite number of R2-D2 sound variations. But we don't want to generate sounds at random. The goal is to generate speech that conveys specific emotions. In order to achieve this, we need to create a catalog of emotions from which R2’s logic can draw.

R2Cataloging
STEP 10

CATALOGING EMOTIONS

The first step is to decide how to categorize emotions. H-CR’s emotional classification model is based on discrete emotion theory. The software tracks a stripped down set of Paul Ekman’s core emotions: happiness, sadness, anger, and fear. These four emotions — and the gradients between them -- neatly encompass R2’s emotional experiences in film. R2 can feel sentimental by being sad and happy. He can feel sassy by being happy and angry. He can feel excited by being scared and happy, and so on.

Now that we know which core emotions to track, we catalog the context of every line from the films. For each line, H-CR tracks the surrounding events, an English interpretation of R2’s speech, and happiness/sadness/anger/fear values from 0-3.

To avoid confirmation bias or reverse engineering, the emotional judgements are made independently from the characteristics of the sounds. Just because a line sounds superficially happy in a vacuum doesn’t necessarily mean it is happy. Between the scene context, R2’s body language, and the responses of other characters (which often conveniently include literal translations of R2’s speech), the emotional values are often objectively clear.

When R2 speaks at the Naboo victory parade, he is obviously happy. When R2 flees from droidekas, gets captured, defiantly electrocutes his captors, and then gets kicked, he is obviously scared and mad. When R2 watches the Echo Base shield doors close, dooming Luke to certain death in the frozen plains, he is obviously sad and scared.

R2Ambiguity
STEP 11

INTERPRETING AMBIGUITY

Sometimes R2’s emotions are not obvious. In these cases, H-CR can identify other scenes with the same word patterns. With the help of this extra context, the emotions often become clear… sometimes with shocking results.

R2 is generally understood to be the fearless counterpoint to C-3PO’s belligerent anxiety. Analysis shows this is not the case. In reality, 75% of R2’s lines convey some degree of fear, and 50% of them convey moderate to extreme fear. R2 is almost always scared.

For example, contrary to popular belief, R2 spends almost the entirety of Episode IV’s opening on the Tantive in worry. His very first line, Tone I, Greet C, Tone E, uttered as an explosion rattles the Tantive, is a sequence heard in other contextually-confirmed nervous moments:

• R2 implores Anakin to override the autopilot as their Naboo fighter takes oncoming fire
• Luke’s X-Wing approaches Dagobah
• R2 gets separated from Luke in the halls of Cloud City

EXAMPLE
Line 1 from Episode IV
The matching source words: Tone I, Greet C, Tone E

Seconds later, when 3PO says "we’re doomed," R2’s response is neither sassy nor dismissive. In fact, the crux of this vocalization — Chipmunk A followed by Reeo K — accompanies fearful moments:

• Qui-Gon reluctantly agrees to let Padme join their ominous expedition on Tatooine
• R2 receives Obi-Wan’s message bearing news that the Trade Federation is amassing a droid army
• R2 gets raised into Luke’s X-Wing as the Battle of Hoth begins

So when 3PO says they're doomed, R2 is actually agreeing. They might be doomed.

EXAMPLE
Line 4 from Episode IV
The matching source words: Whistle B, Chipmunk A, Bump D

At the conclusion of the opening, as R2 and 3PO flee to Tatooine in their escape pod, 3PO asks, "Are you sure this thing is safe?" R2 responds with a lone Greet A, the same line from other nervous moments:

• Luke exits the X-Wing into the Dagobah swamp
• R2 gets lies in a heap after the dragonsnake spits him out
• Obi-Wan approaches R2 as a stranger on Tatooine

So, no… R2 is not sure the escape pod is safe.

EXAMPLE
Line 20 from Episode IV
The matching source word: Greet A

How can we have such a fundamental misunderstanding of R2’s nature? The original trilogy-era R2 prop that shaped the world’s understanding of the character was stilted and slow. Its stiff body language may have unintentionally conveyed a steadiness that misrepresented the R2’s intended emotional state. Because of this, audiences may have missed a subtle distinction: they mistook bravery for fearlessness.

R2 may spend 75% of his time in fear, but he is undeniably brave. 90% of R2’s lines are delivered with either happiness or fear, and 20% of them are delivered with both happiness and fear. In these situations, R2 is feeling excited. We often see his excitement in action, as he eagerly carries out his duties in the face of danger.

One particular line perfectly sums up R2’s personality and vocal character. As Luke and Leia rejoin the group to plan for the assault on the second Death Star, R2 vocalizes an unknown Birdsong similar to Birdsongs C and hc-9_1. Nearly identical sounds appear in moments of simultaneous fear and happiness:

• R2 realizes the menacing Old Ben Kenobi is actually Obi-Wan, the intended recipient of his message
• R2 begins downloading the secret plans for the Death Star
• Yoda lifts Luke’s X-Wing from the Dagobah swamp

"This sure is exciting," says R2 as they prepare for the assault. "Exciting is hardly the word I would choose," responds 3PO. R2 is excited. He's scared, but he’s also happy. This exchange encapsulates R2’s bravery and highlights the true contrast between R2 and 3PO. R2 is not 3PO’s blindly confident, one-dimensional counterpoint, as is widely believed. R2 is 3PO’s gutsy, forward-looking counterpoint. R2 spends as much time afraid as 3PO, but R2 tackles his fear with enthusiasm.

EXAMPLE
Line 66 from Episode VI
The closest matching source words: Birdsong variant from C and hc-9, Alarm F

Other results are less transformative, but still shed fascinating light on R2’s vocal quirks. Take the sequence Whistle I, Tone, Cricket, Squelch H, Tone G, Tone E. This pattern appears eight times and generally coincides with elation:

• R2 Dances with Ewoks after the Battle of Yavin
• R2 wakes up in Episode VII and tells 3PO he knows how to find Luke
• R2 reunites with 3PO in Episode IX

However, two instances of this sequence stand out:

• On Tatooine in Episode IV, R2 tells Luke that Leia’s message for Obi-Wan is "nothing… only a malfunction… old data"
• Approaching Dagobah in Episode V, R2 offers to engage autopilot after he finds out where Luke is taking him.

EXAMPLE
Whistle I, Tone, Cricket, Squelch H, Tone G, Tone E
Line 61 from Episode IV: R2 tells Luke the message is old data
Line 37 from Episode V: R2 offers to engage autopilot (time compression / pitch shifting exhibited)

These are both nervous moments. On Tatooine, the preceding lines find R2 expressing wariness as Luke pries at R2’s chassis. Mumble wc-31_1: "I don’t like this…" Approaching Dagobah, the preceding lines find R2 expressing deep anxiety about the plan. Whistle C: "Wait, we’re going to Dagobah instead of meeting up with the others…?"

Why would R2 suddenly utter a single, solitary elation sequence in the middle of these tense situations? It seems R2 is a terrible actor. In both cases, he’s trying to play it cool, hoping to manipulate Luke with a faux chipper attitude.

• On Tatooine: "Oh that silly message!? That’s old data, just a malfunction!" Luke doesn’t buy it, and R2 immediately snaps back with extreme anger. Squelch J, Alarm A, Reeo gen-28_1: "This recording is not for you!"
• Approaching Dagobah: "Oh yea, Dagobah! Or, hey, I could always just enable autopilot and take us out of here!" Luke chuckles, "That’s alright, I’d like to keep it on manual." R2 comes straight back with a notoriously nervous Whistle K: "Okay, but I don’t like this…."

One final amusing insight into the mind of R2: after meeting Luke on the moisture farm in Episode IV, R2 asks 3PO, "Do you think Luke likes me?" The pattern, Chipmunk wc-35_1, Tone wc-35_1, Systle wc-35_1, appears again in Episode V as R2 peers into Yoda’s hut to spy on Yoda and Luke. R2, who was devastated when Luke ordered him to wait alone at the camp, stares longingly at the couple. He sees Luke making a new friend. Luke ditched R2 for a new short friend. Again, Chipmunk wc-35_1, Tone wc-35_1, Systle wc-35_1. It seems R2 is wondering aloud whether he has been replaced: "Does Luke still like me?!"

EXAMPLE
Chipmunk wc-35_1, Tone wc-35_1, Systle wc-35_1
Line 69 from Episode IV: R2 asks 3PO if Luke likes him
Line 98 from Episode V: R2 spies on Luke and Yoda
R2CombiningBoth
STEP 12

COMBINING LOGIC AND EMOTIONS

With the emotional catalog complete, H-CR now has an emotional reference for every existing combination of R2’s sounds. Every word, syllable, and variation carries a quantifiable emotional message.

When this emotional catalog is paired with the logical framework, H-CR can achieve the second half if its goal: speech that is emotionally deliberate. The logic handles the variation. The emotional catalog dictates what those variations convey.

For example, the logical framework tells H-CR a particular Beep can be preceded by a Laser at either 100% of 68% pitch. The emotional framework tells H-CR the first variation conveys 1/3 fear while the second variation conveys 2/3 fear and 1/3 sadness.

Mission accomplished. H-CR can now generate infinitely variable, emotionally specific 'new' speech that is indistinguishable from the original.

CHAPTER 6: THE INTERFACE

Because our R2 doesn't have any sensors to take in data, he will be reliant on the operator to provide stimuli. We need a way to tell R2 when something happens. He can then process the stimulus, adjust his emotional state, and vocalize a response.

R2DesigningUI
STEP 13

DESIGNING THE UI

The top priority: ease of use. The operator may be on a busy set, or be simultaneously controlling a physical R2-D2's movement's with an RC remote. H-CR should offer the minimum number of buttons necessary to accurately trigger the full range of stimuli. The buttons should be intuitive and clearly marked. They should also be large and easy to press whether the software is running on a computer with a mouse and keyboard, a fat tablet, or a skinny phone. To accomplish this, the software's buttons are laid out across the bottom of the screen. The interface scales based on the aspect ratio so that the buttons are always as large as possible, and always span the full width of the screen.

The secondary priority: looking cool. The Star Wars universe is notorious for its gorgeous, retro, grimy, incandescent panels, screens, buttons, and switches. The R2 UI should tap into that aesthetic. It should look like a physical control module that an Industrial Automaton employee would use to pilot an R2-series astromech... a module that's seen better days.

H-CR offers eight main stimulus buttons. For each of R2's four core emotions, there is one mild stimulus button and one extreme stimulus button. The buttons are clearly marked with robotic emoticons indicating which button stimulates which emotion. The stimuli from these eight buttons can coherently trigger every possible R2 sound, with one exception.

A ninth button is needed to provide an electrocution stimulus. R2's electrocution vocalization is distinctive, lengthy, and contextually specific. We don't want to leave it out, and it would be inappropriate and inaccurate to simply trigger electrocution vocalizations at random times or under random emotional criteria.

For the same reason, the extreme fear button always triggers a scream sound. R2's screams are too distinctive and too contextually specific leave to chance.

A tenth button functions as a latching switch to enable "muse" mode. In this state, R2 automatically emits sounds periodically. This can be a convenient way for a busy operator to put R2 on autopilot. But, like every other function, it also has a basis in canon. We often see R2 talking absentmindedly to himself on film. For example, as he rolls through the deserts of Tatooine in search of Obi-Wan. In these instances, R2's vocalizations have a distinctly different quality. They are more subdued. This makes sense, as we don't mutter to ourselves with the same purpose or energy that we use when speaking to someone else, or when reacting to a stimulus. H-CR will only generate certain types of vocalizations in muse mode. Therefore, muse mode is not to be confused with a randomizer, and it is not a substitute for operator input when it comes to triggering stimuli.

Finally, a debug button sends the operator to the "override" panel. More on this below.

The experience is further enhanced by purely aesthetic features. H-CR simulates the effects you would see if you were handling a real, physical console. The interface is not simply a static 2D image. It is a fully three dimensional control panel rendered in realtime. When running on a device with an accelerometer, which includes virtually any modern phone or tablet, the angle of the device subtly pivots the control panel for a parallax effect. Because the panel is rendered with realtime lighting, this pivoting also affects the angle at which light hits the controls, resulting in shiny reflections and moving shadows. Other subtle effects, like flickering incandescent lamps during electrocution events, round out the faux-physical experience.

R2Feedback
STEP 14

PROVIDING VISUAL FEEDBACK

H-CR's R2 module is designed to be operated blindly. A distracted user can easily operate the software without glancing down at their device. That said, the software does offer visual cues to guide new users and to provide helpful insight into R2's logical and emotional thought process at any given moment.

Four columns of incandescent lamps reveal R2's emotional state. As R2 becomes happier, the happiness lamps light sequentially. While an experienced will recognize R2's feelings purely from the color of his speech, these lamps offer a quick reference point.

When a stimulus is triggered with a button press, the lamp inside that button lights up and flashes while R2 vocalizes his reaction. When the vocalization finishes, the lamp extinguishes and R2 is ready to vocalize another stimulus. H-CR allows the user to cue an additional stimulus before R2 has finished vocalizing the prior one. If a button is pressed while R2 is still vocalizing, the button's lamp illuminates to indicate that a new stimulus has been cued. When the prior vocalization ends, R2 automatically processes the new stimulus.

R2Overriding
STEP 15

OVERRIDING

The override panel presents options for changing H-CR's default behavior.

Two buttons toggle between "canonical" mode and "improvisational" mode. In canonical mode, H-CR only generates vocalizations that have appeared, verbatim, in a Star Wars film. In improvisational mode, H-CR puts its logic to work: every vocalization is unique.

A "personality chip override" locks R2's emotional state in place. The stimulus buttons will no longer change his mood. The operator can adjust his emotions manually by touching the lamp columns on the main panel.

A memory bank allows the operator to store and recall vocalizations. Pressing and holding on one of the memory buttons will write the most recent vocalization to that slot. Pressing and releasing a memory button normally will trigger that vocalization.

CHAPTER 7: THE AI

With the interface built, it's time to bring R2 to life.

R2AI
STEP 16

ASSEMBLING THE AI

When the software first boots up, R2 is emotionally neutral. None of his emotional state lamps are illuminated. As soon as the operator triggers a stimulus, R2's AI goes to work. Let's say R2 was previously feeling extremely happy, extremely angry, and neutrally (not at all) scared. The operator presses the extreme fear stimulus button. How does the AI decide how much to modify each of R2's four core emotions based on this stimulus? Two factors inform the decision.

The first factor is a basic model of emotional processes. If someone who is happy gets scared, they are likely to become much less happy. If they were angry, they are likely to become slightly less angry as they become consumed by fear. Finally, they are of course guaranteed to become more scared. This tells the AI in what direction the core emotions should shift, and it provides rough guidance on how drastic that shift should be on a neutral subject. With an extreme fear stimulus, a generic subject would go from very happy to barely or neutrally happy, from very angry to moderately angry, and from neutrally scared to very scared. But we aren't working with a generic subject. We're working with R2-D2.

The second factor is based on our catalog of R2's emotional history. This is where R2's sentience comes in. The basic emotional model attempts to take R2 from extremely happy to barely happy, but H-CR's emotional model intervenes. H-CR knows that moderate happiness, neutral anger, and moderate fear is one of R2's most common states. Therefore, instead of dropping all the way down to barely happy, R2 has statistically higher odds of settling on moderately happy. The same thing happens with anger. The basic model wants to take R2 from extremely angry to very angry. H-CR recognizes that high anger is one of R2's least common states, so it has a high likelihood of guiding R2's anger past very angry down to moderately angry.

This example is oversimplified for clarity. Behind the scenes, the AI is actually factoring in the frequency and pattern of every emotional combination from every film scene to analyze where R2's feelings tend to rest, and thus to decide how R2's emotions should tend to drift in response to a particular stimulus.

R2Vocalizing
STEP 17

VOCALIZING

The user provides a stimulus. R2 has an internal emotional reaction. Now we need an external vocal reaction. In the above example, R2 was extremely happy, extremely angry, and neutrally scared. He then experienced an extreme fear stimulus. The resulting speech should of course be a high-fear vocalization. But it should not be as high-fear as it would be if R2 had been extremely fearful to begin with. It should also have a moderate shade of happiness and a high shade anger, since R2 is experiencing those emotions as well.

In canon mode, H-CR searches its database for an appropriate line from the films. It then recreates that line from syllables, applying variations when necessary to replicate the film.

In improv mode, H-CR algorithmically pieces together vocalizations one syllable at a time. Say H-CR is trying to vocalize moderate fear and a tiny hint of sadness. In the earlier example, we examined a Laser conveying level 1 fear at 100% pitch/time, and level 2 fear and level 1 sadness at 68% pitch/time. The emotion we want to vocalize is more fearful than the level 1 fear the 100% pitch variant would convey, but it is also less sad than the level 1 sadness the 68% pitch variant would convey. H-CR might split the difference and settle on a pitch shift of around 90%. All that to come up with one emotionally eligible variation on one syllable. Can we even use it?

Which syllables are viable to initiate the vocalization? Which syllables are viable ending points for the vocalization? Which syllables are viable to come next, and how should their odds of selection be weighted? What categories should we select from, and with what odds? What subcategories? How should the syllables be processed? Are their neighbors part of the same word? Can we make rearrangements or substitutions? What range of times are viable between each syllable? Are those times measured from the start of the previous syllable, or from the end of the previous syllable? How long should the sequence be? Should there be breathers during the sequence and how long, when, and where should they appear? How do the answers to these questions for any one syllable affect the answers for all of the other syllables?

H-CR solves these problems in realtime. Billions upon billions of rules dictate how each sound should be selected, processed, and sequenced to achieve emotional, structural, and canonical accuracy.

CHAPTER 8: THE CONCLUSION

H-CR has now achieved its ultimate goal: it can generate speech that is infinitely varied, emotionally specific, and indistinguishable from the original.

Equally importantly, H-CR is honoring the source material by generating its speech according to canonical rules. It is simulating the exact same process that the sound designers used when manually generating R2's speech for the films.

At this point we have more than just a dataset. We know what R2-D2 feels, how often he feels it, how his feelings change in response to stimuli, and how to intelligently voice those feelings using the logic of the sound designers, i.e. the internal logic of R2.

Now, our software has obviously only captured the R2 that we see on film. We don’t have a copy of his (notoriously unreliable) memory banks. But if we assume that the events in the films are R2’s formative experiences… then what’s the difference between the R2 that is "real" in the context of the Star Wars universe, and H-CR’s algorithm?

For all intents and purposes, what we have created is the actual, sentient R2-D2 in real life.

IconAstromechASTROMECH
Astromech Panel
Download Mac and iOS DOWNLOAD
Download Windows DOWNLOAD
Download Android DOWNLOAD
INSTRUCTION MANUAL

If you use this vocalizer, please support STEM education by making a donation to FIRST®

The Astromech Vocalizer from Human-Cyborg Relations lets users design their very own astromech with a custom voice and personality. The software is built on the foundation of the R2-D2 vocalizer. It provides deep options to expand beyond the strict canonical rules that govern R2's emotions and speech patterns.

Users can customize their astromech's vocabulary to create a droid that communicates exclusively in beeps or that almost never whistles. They can adjust the overall pitch to create a soprano droid, a bass droid, or anything in between using an advanced formant-preserving pitch shifting algorithm. They can set the rate of speech and drawl to create a frenetic droid or a deliberate droid. They can tailor a custom personality that governs how the droid responds vocally and emotionally to stimuli... tendencies that will evolve as the droid learns, grows, and develops a living memory. And much, much more.

IconGonkGONK
Gonk Panel
Download Mac and iOS DOWNLOAD
Download Android DOWNLOAD
INSTRUCTION MANUAL

If you use this vocalizer, please support STEM education by making a donation to FIRST®

The Gonk Vocalizer is powered by advanced research into the speech production of Gonk droids. Gonks are famous for speaking just one word. Only a handful of variations on that word were originally recorded.

Using individual phonemes from those original recordings, the Gonk Vocalizer synthesizes brand new Gonk vocalizations in realtime. For the first time ever, hear the Gonk express itself with infinite, organic variety. No two "Gonk"s will ever sound the same.

The source audio behind the "Chat" vocalization was carefully extracted from the sandcrawler scene in Episode IV using a combination of algorithmic isolation techniques and manual spectrograph editing. It is presented here for the first time ever in its pure, isolated state.

EXAMPLE
Gonk Chat original
Gonk Chat isolated
IconDOD-O
COMING SOON

On film, D-O speaks Galactic Basic. His sounds are generated by transforming a human voice... namely, that of J.J. Abrams.

H-CR reproduces this transformation by acting as a voice modulator. In Star Wars terms, the software is basically a backwards vocabulator. H-CR uses a series of effects to turn a regular human voice into the voice of D-O.

As with the R2-D2 module, H-CR's D-O module recreates the editing process that the sound designers used. It does so in realtime, in a portable format, with no manual editing required.

The modulator includes two main parts. The first is a filter bank that matches the timbre of D-O's voice. This includes a reverse engineered "matching equalizer" that was developed by comparing the frequency content of D-O's speech with a high quality recording of J.J. Abrams' voice. The second is an adaptive delay chain that recreates D-O's stutter. H-CR is the only realtime voice modulator in the world to feature an effect like this.

EXAMPLE
A raw recording of J.J. Abrams
D-O from episode IX
J.J. modulated by H-CR to sound like D-O
IconFAQFAQ

Will you create a vocalizer for ___ droid?

I would like to! While each droid presents a unique set of challenges, H-CR contains a powerful set of generalized tools that can be applied to droids of all makes and models.

If you have a droid in mind, please feel free to reach out!

Can I use this software with my project?

Yes! As long as it's for personal use. Resale or commercial use of H-CR is not permitted.

Are you profiting from this project?

No! All proceeds support Force for Change and FIRST. The software is entirely free and has no ads or monetization of any kind. Users are instead encouraged to donate to charity. All of proceeds from collaborations with 3rd party control platforms go to charity as well.

Can you port the R2 Vocalizer to ___ platform?

Depends! General purpose microcontrollers, such as Arduino and Raspberry Pi, are not suitable targets because they lack realtime audio processing capabilities. H-CR vocalizers do not simply play one raw sound. They play and process sounds polyphonically in realtime.

A typical sound file plays at 44,100 samples per second. This is roughly 1000 times as fast as an average video source. And unlike video where dropped frames are a minor nuisance, a single dropped audio sample usually results in a loud "pop." This is why realtime sound processing requires hardware and software with specialized audio capabilities.

A port to the Teensy Audio platform is in progress. The Teensy and its accompanying audio shield are small, cheap, efficient, and widely available!

Will we be able to interface with the Teensy port of the R2 Vocalizer?

Yes! H-CR on the Teensy will offer both access to both input and output. In this arrangement, H-CR acts as the emotional and vocal brain for your project.

Your software can provide an input stimulus encoded as a good/bad/maddening/frightening event with an intensity value. This stimulus can come from a user pressing a button on an RC remote, from an advanced automonmous sensor array, or from any other system you can dream up. H-CR will then provide the appropriate emotional processing, vocalize the stimulus, and return data describing the new emotional state.

You can use the returned data to set an appropriate light color, automate a movement, and more!

IconEmpireCOMING SOON

More droids are soon to come.

Additionally, a modular voice modulation framework for Galactic Basic droids is in the works. This includes a portable, low latency, high quality formant shifter which was used to create the Bouschh and Tactical Droid examples from the Galactic Basic summary section. The framework will allow new designers and seasoned veterans alike to both prototype new droids and recreate existing droids.

"If droids could think, there would be none of us here, would there?"

Michael Perl  •  Contact