Introduction

Treat a neural audio codec as a granular synthesis domain: encode a target and a source pool into Encodec latents, then re-synthesize the target by replacing each latent frame with its closest match from the pool. This keeps the target’s macro-timing while transplanting timbre from the pool.
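The frame replacement step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: the function name `cosine_match` is hypothetical, and random arrays stand in for real Encodec latents, assumed here to be laid out as (frames, dimensions).

```python
import numpy as np

def cosine_match(target, pool, eps=1e-8):
    """Replace each target frame with its most cosine-similar pool frame.

    target: (T, D) latent frames; pool: (P, D) latent frames.
    Returns (matched, idx): matched is (T, D), idx is (T,) pool indices.
    """
    # Normalise rows so the dot product equals cosine similarity.
    t = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    p = pool / (np.linalg.norm(pool, axis=1, keepdims=True) + eps)
    sim = t @ p.T                # (T, P) similarity matrix
    idx = sim.argmax(axis=1)     # best pool frame for every target frame
    return pool[idx], idx

# Stand-in data (assumption): real use would encode audio with Encodec first.
rng = np.random.default_rng(0)
target = rng.standard_normal((50, 128))
pool = rng.standard_normal((200, 128))
matched, idx = cosine_match(target, pool)
```

The matched latents keep the target's frame count, which is why the macro-timing survives; decoding them back to audio is left to the codec's decoder.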

What we do

We perform granular resynthesis in Encodec’s 128‑D latent space. Each target frame is matched to the pool by cosine similarity over a local context window; output is assembled with configurable hop/stride and optional multi‑frame grains (overlap‑add) for smoother sustained content. We also evaluate pedalboard-based pool augmentation.
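One plausible reading of windowed matching is to stack each frame with its neighbours and compare the stacked vectors. The sketch below assumes that reading; the helper names, the centred window, and the edge padding are all assumptions, and stride/hop are not modelled here.

```python
import numpy as np

def stack_context(latents, window):
    """Concatenate each frame with its neighbours into one context vector.

    latents: (N, D). Returns (N, window * D). Edges are padded by repeating
    the first/last frame (an assumption of this sketch).
    """
    left = (window - 1) // 2
    right = window - 1 - left
    padded = np.pad(latents, ((left, right), (0, 0)), mode="edge")
    n = latents.shape[0]
    return np.concatenate([padded[k:k + n] for k in range(window)], axis=1)

def windowed_match(target, pool, window=3, eps=1e-8):
    """Return, per target frame, the pool index with the most similar context."""
    t = stack_context(target, window)
    p = stack_context(pool, window)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    return (t @ p.T).argmax(axis=1)

# Stand-in data: the target is a slice of the pool, so interior frames
# should recover their own positions.
rng = np.random.default_rng(1)
pool = rng.standard_normal((200, 128))
target = pool[10:20]
idx = windowed_match(target, pool, window=3)
```

With `window=1` this reduces to plain frame-wise cosine matching; larger windows make a match depend on what comes just before and after each frame.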

Motivation

Tokui-style latent granular resynthesis is compelling because it is training-free: it reuses a pre-trained codec and works immediately on arbitrary audio. What’s missing in prior work is (1) a practical map of the parameter space and (2) an interface where those parameters become audible.

Experimental setup

Categories: percussion (amen, tabla, trap hats, tribal) and instruments (flute, sax, violin, jazz piano)
Pairs: all ordered target→pool permutations within each category
Grid: grain_size ∈ {1..5}, window_size ∈ {1..5}, stride ∈ {1..3}, hop ∈ {1..3}, augmentation ∈ {none, augmented}
Metrics: MFCC‑L2 (mean MFCC distance to pool) and FAD (Fréchet Audio Distance, a distributional distance to pool embeddings)
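To get a feel for the size of this setup, the pairs and the grid can be enumerated directly. The variable names below are illustrative, and the total assumes every grid configuration is run on every pair.

```python
from itertools import permutations, product

categories = {
    "percussion": ["amen", "tabla", "trap hats", "tribal"],
    "instruments": ["flute", "sax", "violin", "jazz piano"],
}

# All ordered target -> pool pairs within each category.
pairs = [
    (category, target, pool)
    for category, names in categories.items()
    for target, pool in permutations(names, 2)
]

# Full parameter grid: grain_size, window_size, stride, hop, augmentation.
grid = list(product(
    range(1, 6),             # grain_size
    range(1, 6),             # window_size
    range(1, 4),             # stride
    range(1, 4),             # hop
    ["none", "augmented"],   # augmentation
))

print(len(pairs), len(grid), len(pairs) * len(grid))  # 24 450 10800
```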

Percussion: what moves the needle

Percussion is transient-dominated, so “frame-wise” matching can be brittle: small timing mismatches become clicks, and longer grains can smear attacks.

Simple cosine matching: works on noisy texture (kitchen pan)

Method: tokui_style_transfer_cosine. Audio examples: target (amen), pool (kitchen pan), output (cosine).

…but fails on structured percussion (amen → tabla)

Method: tokui_style_transfer_cosine. Audio examples: pool (tabla), output (cosine).

Windowed matching helps tabla: adds local context

Method: tokui_style_transfer_window (window=3). Audio example: output (windowed).

Grain size tradeoff: bigger grains smear transients

On percussion, increasing grain_size can reduce discontinuities but often softens attacks (less “snap”).
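One way to see why longer grains can soften attacks: with overlap-add, each output frame becomes a weighted average of several pool frames. The sketch below assumes a particular assembly scheme (a grain starts every `hop` target frames, is copied from consecutive pool frames, and is tapered with a Hann window); the function name and these details are assumptions, not the project's code.

```python
import numpy as np

def overlap_add_grains(pool, idx, grain_size=3, hop=1):
    """Assemble output latents from multi-frame pool grains.

    pool: (P, D) pool latents; idx: (T,) matched pool index per target frame.
    Use hop <= grain_size, otherwise some output frames receive no grain.
    """
    T, D = len(idx), pool.shape[1]
    out = np.zeros((T, D))
    weight = np.zeros(T)
    win = np.hanning(grain_size + 2)[1:-1]   # strictly positive taper
    for s in range(0, T, hop):               # grain start in the target
        for k in range(grain_size):
            t, j = s + k, idx[s] + k         # output frame, pool frame
            if t >= T or j >= len(pool):
                break
            out[t] += win[k] * pool[j]
            weight[t] += win[k]
    # Normalise by the summed window weights so overlaps average out.
    return out / np.maximum(weight, 1e-8)[:, None]

rng = np.random.default_rng(2)
pool = rng.standard_normal((200, 128))
idx = np.arange(10, 30)                      # a perfectly contiguous match
smooth = overlap_add_grains(pool, idx, grain_size=5, hop=1)
```

With `grain_size=1` and `hop=1` this collapses to plain frame replacement. When matched indices jump around, as they tend to on transients, the overlapping grains blend unrelated pool frames, which is consistent with the loss of “snap” described above.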

Method: tokui_style_transfer_window (window=3, stride=1, hop=1). Audio examples: output (grain=1), output (grain=5).

Notebook plots (amen → tabla, windowed match)

Instruments: smoother latent trajectories

Sustained, harmonic sources tend to benefit from longer grains: partials are preserved across multiple frames, and the “jitter” caused by frame-to-frame jumps between matched pool frames is reduced.