GenAI · Music · UX Research

resong

A prompt scaffolding tool that helps novice users translate auditory imagination into generative AI music prompts, built from original qualitative research in a Sound and Environment course.

21 Student groups
5 Coding dimensions
3 User types
3 Design screens
Role: UX Researcher and Designer (solo)
Methods: Qualitative content analysis, coding schema development, interaction design
Tools: Figma, HTML/CSS, Suno AI
Context: Sound and Environment, University of Florida, Fall 2024

When students in my undergraduate Sound and Environment course were asked to generate music using AI, most of them knew exactly what they wanted to hear. What they couldn't do was say it.

resong is a prompt scaffolding tool that bridges that gap, helping novice users move from vague auditory intuition to a structured, effective prompt they can paste directly into Suno AI.

Describing sound is harder than it looks.

Generative AI music tools like Suno require users to describe their desired music in clear and specific language. Unlike describing an image, describing music requires vocabulary that spans genre, texture, mood, instrumentation, and acoustic quality — domains that most people have never been asked to articulate explicitly.

In my Sound and Environment course, I assigned students a creative project: use a field recording as the emotional seed for an AI-generated piece of music. The results revealed a sharp gap between what students felt and what they could write.

The research question

What patterns emerge when novice users attempt to describe music for AI generation? And what does that reveal about where they need support?

Twenty-one groups of undergraduate students submitted written prompts alongside their AI-generated tracks. Students came from a range of backgrounds: a few had formal music training, and none had prior experience with AI music generation. The source recordings were meant to be birdsong captured on a phone.

This context gave me a unique privilege: as the course instructor and researcher, I had deep familiarity with the assignments, the recordings, and the students' expressed intentions. I could compare what they meant with what they wrote.

Coding schema

I developed a five-dimension coding schema to systematically analyze the 21 submissions. Each dimension represents a distinct layer of musical description:

Code | Dimension | Definition | Example from data
GEN | Genre | Stylistic or historical category | "jazz," "lo-fi," "mariachi," "J-pop"
SON | Sound quality | Timbral and textural descriptors | "high-pitched," "trilling," "staccato," "echoing"
EMO | Emotion / Mood | Affective or expressive qualities | "calming," "groovy," "haunting," "upbeat"
BIO | Bioacoustic / Sound source | References to the bird, habitat, or ecological context | "bird call," "sharp rasps," "cawing," "forest"
INS | Instrument | Specific instrument references | "piano," "trumpet," "acoustic guitar," "cello"

Each submission was coded for the presence, absence, and relative weight of each dimension.
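To make the idea of coding for presence and absence concrete, here is a minimal Python sketch. It is not the procedure used in the study, which relied on manual qualitative judgment. It assumes a hypothetical keyword lexicon seeded only with the example terms from the schema table, and the function names are illustrative.

```python
# Hypothetical lexicon: only the example terms listed in the schema table.
LEXICON = {
    "GEN": ["jazz", "lo-fi", "mariachi", "j-pop"],
    "SON": ["high-pitched", "trilling", "staccato", "echoing"],
    "EMO": ["calming", "groovy", "haunting", "upbeat"],
    "BIO": ["bird call", "sharp rasps", "cawing", "forest"],
    "INS": ["piano", "trumpet", "acoustic guitar", "cello"],
}


def code_prompt(prompt):
    """Count lexicon hits per dimension; 0 means the dimension is absent."""
    text = prompt.lower()
    return {
        dim: sum(text.count(term) for term in terms)
        for dim, terms in LEXICON.items()
    }


def present_dimensions(prompt):
    """List the dimensions with at least one hit, in schema order."""
    return [dim for dim, hits in code_prompt(prompt).items() if hits > 0]


example = (
    "Make an upbeat song that sounds like the feeling of walking "
    "through a park on a sunny day and incorporate bird calls."
)
print(present_dimensions(example))  # ['EMO', 'BIO']
```

Substring matching like this is crude. It flags BIO for a passing mention of bird calls and finds nothing at all in "relaxing fall music", so it cannot stand in for the relative-weight judgment a human coder makes.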

The clearest pattern in the data is that students rarely addressed all five dimensions. The typical prompt drew heavily from one or two dimensions while leaving others entirely absent.

Average prompt length: 156 characters against a 120-character limit, about 30% over on average. Students were trying to say more than the interface allowed.
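The overage arithmetic is easy to check. The sketch below assumes the rounded 156-character average and the 120-character limit quoted above. The function names are illustrative, and it keeps "percent of the limit" separate from "percent over the limit", since the two are easy to confuse.

```python
CHAR_LIMIT = 120  # prompt limit quoted in the text


def percent_of_limit(length, limit=CHAR_LIMIT):
    """Prompt length as a whole-number percentage of the limit."""
    return round(100 * length / limit)


def percent_over(length, limit=CHAR_LIMIT):
    """How far past the limit the prompt runs, as a percentage (0 if it fits)."""
    return max(0, percent_of_limit(length, limit) - 100)


print(percent_of_limit(156), percent_over(156))  # 130 30
print(percent_of_limit(626), percent_over(626))  # 522 422
```

A 626-character prompt is therefore 522% of the limit, which is the same as running 422% over it.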

EMO dominated. Nearly every student led with emotional or atmospheric description. Prompts like "calm, melodic smooth rhythm similar to the beat of a children's lullaby, slow tempo, and steady beat" (Group 7) and "relaxing fall music" (Group 15) convey a feeling but give the AI little sonic specificity to work with.

GEN was the most common structural anchor. Students frequently named a genre to organize their prompt (jazz, lo-fi, mariachi, J-pop, R&B trap), sometimes with striking specificity: "J-pop, like YOASOBI's style, with some electric keyboard and guitar in the background, no jazz or heavy metal/heavy rock" (Group 5). But genre naming alone, without sonic or emotional layering, produced generic results.

BIO appeared frequently but remained underdeveloped. Some students referenced their recording by copying Cornell Lab language almost verbatim — "high-pitched metallic chips and series of loud, sweet whistles, rock piano" (Group 10) — without translating it into musical direction.

SON and INS were most commonly absent. Timbral descriptors and instrument specifications, the dimensions that most directly shape an AI's sonic output, appeared in fewer than half of submissions.

Three distinct prompt strategies emerged

Type 01
The Feeler
"Make an upbeat song that sounds like the feeling of walking through a park on a sunny day and incorporate bird calls."
Prompts weighted entirely toward EMO. Strong affective clarity, weak technical specificity. The emotional intention is clear; the sonic path to get there is not.
Type 02
The Describer
"Mellow bird call song with high keys on the piano, use the bird call recording, add an orchestra and some Chinese classical."
Prompts weighted toward BIO and SON. Acoustically grounded, but underspecified in emotion and genre.
Type 03
The Maximizer
"The musical genre is relaxed modern jazz with some synthetic electronic sounds. One instrument sounds like a trumpet. Then 15 seconds in, a synthetic techno sound arises..."
Prompts that attempted to cover everything at once. Group 2 wrote 626 characters — 522% of the limit. The impulse was right; the execution needed scaffolding.

None of the three was wrong, but all three were incomplete in predictable, designable ways.

The research findings mapped directly to design decisions in resong.

Finding 01
Users default to emotion because it's the most natural entry point
Let them start there. The open text field at the top of resong is intentionally unconstrained. Users can write anything, in their own words. The tool meets them where they are rather than forcing a structured form from the start.
Finding 02
Users don't know what dimensions they're missing
The Prompt Coverage bar makes the gap visible. As users fill in dimensions, the coverage bar fills proportionally and changes color per dimension. The counter (2 / 5 dimensions) creates a gentle, non-judgmental signal that there's more to explore, without blocking progress.
Finding 03
Users lack vocabulary for SON and INS in particular
Each accordion dimension opens into a chip selector with pre-populated vocabulary drawn from the research data — the actual words students used successfully, plus common alternatives. Users can select chips or type their own. This removes the blank-page problem for the dimensions they find hardest.
Finding 04
The recording itself is underused as a creative resource
The "About this sound" panel surfaces contextual information about the recording, sourced from Cornell Lab for bird calls, giving users language they might not have had. A recording of a Northern Cardinal shouldn't just be "a bird." It should open into "high-pitched metallic chips and series of loud, sweet whistles" — Cornell Lab's own description, which doubles as usable prompt vocabulary.
Finding 05
The Maximizer needs a channel, not a constraint
The tip box between the link field and the generate button is contextually aware. It identifies the dimension with the least coverage and nudges the user toward it: "Try adding a genre to give the AI a stronger musical direction." Rather than cutting students off, resong redirects their energy toward the dimensions they haven't yet addressed.
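The coverage counter and the contextual tip can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the prototype's actual logic. It assumes a dimension counts as covered once it has at least one entry, that ties go to the first dimension in schema order, and that no tip is shown once every dimension is covered. Only the genre tip is quoted from the design; the other tip strings are hypothetical placeholders.

```python
DIMENSIONS = ["GEN", "SON", "EMO", "BIO", "INS"]

TIPS = {
    # Quoted from the design description.
    "GEN": "Try adding a genre to give the AI a stronger musical direction.",
    # Hypothetical placeholder wording for the remaining dimensions.
    "SON": "Try describing the sound quality.",
    "EMO": "Try adding a mood.",
    "BIO": "Try mentioning the sound source.",
    "INS": "Try naming an instrument.",
}


def coverage(selections):
    """Return (covered, total); a dimension is covered once it has any entry."""
    covered = sum(1 for dim in DIMENSIONS if selections.get(dim))
    return covered, len(DIMENSIONS)


def next_tip(selections):
    """Nudge toward the dimension with the fewest entries, or None if all are covered."""
    counts = {dim: len(selections.get(dim, [])) for dim in DIMENSIONS}
    weakest = min(DIMENSIONS, key=lambda dim: counts[dim])  # first wins on ties
    return TIPS[weakest] if counts[weakest] == 0 else None


selections = {"EMO": ["calming"], "BIO": ["bird call"]}
covered, total = coverage(selections)
print(f"{covered} / {total} dimensions")  # 2 / 5 dimensions
print(next_tip(selections))  # Try adding a genre to give the AI a stronger musical direction.
```

Returning a tip instead of an error keeps the signal advisory, in line with the goal of not blocking progress.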

The prototype below is interactive — navigate through all three screens to follow the full user flow from empty state to generated prompt.

Screen 01 Empty state

The user arrives with their recording. The interface is intentionally quiet: a large upload zone, a prompt field with a concrete example placeholder, and the five accordion dimensions visible but collapsed. Nothing is required yet. The message is: start anywhere.

Screen 02 In progress

Once a recording is uploaded, the left panel contextualizes it with a waveform card and an "About this sound" box sourced from Cornell Lab. On the right, as the user writes and selects chips, the coverage bar responds in real time. The generate button remains disabled — not as a gate, but as an invitation to keep going.

Screen 03 Prompt generated

The generated prompt synthesizes the user's free-text description with their chip selections into a structured, Suno-ready string. Dimension tags show which part of the prompt came from which dimension, making the scaffolding visible and educational. The user can copy the prompt, refine further, or open Suno AI directly.
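One way to picture the synthesis step is the Python sketch below. It is a guess at the shape of the logic, not the prototype's implementation. It assumes the free text comes first, chip selections follow in schema order, and segments are joined with commas. The "TEXT" tag, the function name, and the sample input are made up for illustration, and the 120-character check reuses the limit quoted earlier in this case study.

```python
DIMENSIONS = ["GEN", "SON", "EMO", "BIO", "INS"]
CHAR_LIMIT = 120  # limit quoted earlier in the case study


def build_prompt(free_text, chips):
    """Join free text and chip selections, keeping a dimension tag per segment."""
    segments = []
    if free_text.strip():
        segments.append(("TEXT", free_text.strip()))
    for dim in DIMENSIONS:
        for chip in chips.get(dim, []):
            segments.append((dim, chip))
    prompt = ", ".join(text for _, text in segments)
    return {
        "prompt": prompt,
        "segments": segments,  # (tag, text) pairs for the dimension tags
        "fits_limit": len(prompt) <= CHAR_LIMIT,
    }


result = build_prompt("calm morning walk", {"GEN": ["lo-fi"], "INS": ["piano"]})
print(result["prompt"])  # calm morning walk, lo-fi, piano
```

Keeping the tagged segments alongside the final string is what would let the interface show which part of the prompt came from which dimension.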

Interactive prototype — click through all three screens

resong sits at the intersection of qualitative research and interaction design in a way that is specific to my background. The five-dimension schema came directly from ethnomusicological analysis, an approach used to study music cross-culturally, applied here to understanding how novice users describe music to machines. The design then responded to that analysis with targeted scaffolding rather than generic UX patterns.

This is what differentiates resong from a typical prompt helper: it's grounded in real data about real users, and every design decision traces back to a specific finding.

What I would do next

With more time and participants, I would test whether the scaffolding actually improves prompt quality, comparing AI outputs from scaffolded vs. unscaffolded prompts, rated by listeners blind to the condition. I would also explore adaptive chip suggestions based on the recording content, using audio analysis to surface vocabulary automatically. And I would design for the Feeler, Describer, and Maximizer explicitly, offering different entry paths for each user type rather than a single linear flow.
