Introduction

Treat a neural audio codec as a granular synthesis domain: encode a target and a source pool into Encodec latents, then re-synthesize the target by replacing each latent frame with its closest match from the pool. This keeps the target’s macro-timing while transplanting timbre from the pool.

What we do

We perform granular resynthesis in Encodec’s 128‑D latent space using windowed cosine similarity matching. Matching uses a local context window; output is assembled with configurable hop/stride and optional multi‑frame grains (overlap‑add) for smoother sustained content. We also evaluate pedalboard-based pool augmentation.
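The matching step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the target and pool are already encoded as arrays of shape (frames, 128); the function name windowed_cosine_match and the edge-padding at the boundaries are assumptions made for the example, not the project's actual code. Encodec encoding and decoding are left out.

```python
import numpy as np

def windowed_cosine_match(target, pool, window=3):
    """For each target frame, return the index of the best-matching pool frame.

    target: (T, D) latent frames; pool: (P, D) latent frames.
    window: odd number of neighbouring frames used as context
            (window=1 is plain frame-wise cosine matching).
    """
    def stack_context(z):
        half = window // 2
        # Edge-pad in time so every frame has a full context window (assumption).
        padded = np.pad(z, ((half, half), (0, 0)), mode="edge")
        # Concatenate each frame with its neighbours: shape (N, window * D).
        ctx = np.concatenate([padded[i:i + len(z)] for i in range(window)], axis=1)
        # Unit-normalise so that a dot product equals cosine similarity.
        return ctx / (np.linalg.norm(ctx, axis=1, keepdims=True) + 1e-8)

    similarity = stack_context(target) @ stack_context(pool).T  # (T, P)
    return similarity.argmax(axis=1)
```

The matched latents are then pool[idx] for idx = windowed_cosine_match(target, pool), ready to be assembled into grains and decoded.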

Motivation

Tokui-style latent granular resynthesis is compelling because it is training-free: it reuses a pre-trained codec and works immediately on arbitrary audio. What’s missing in prior work is (1) a practical map of the parameter space and (2) an interface where those parameters become audible.

Experimental setup

Percussion: what moves the needle

Percussion is transient-dominated, so “frame-wise” matching can be brittle: small timing mismatches become clicks, and longer grains can smear attacks.

Simple cosine matching: works on noisy texture (kitchen pan)

Method
tokui_style_transfer_cosine
Target (amen)
Pool (kitchen pan)
Output (cosine)

…but fails on structured percussion (amen → tabla)

Method
tokui_style_transfer_cosine
Pool (tabla)
Output (cosine)

Windowed matching helps tabla: adds local context

Method
tokui_style_transfer_window (window=3)
Output (windowed)

Grain size tradeoff: bigger grains smear transients

On percussion, increasing grain_size can reduce discontinuities but often softens attacks (less “snap”).

Method
tokui_style_transfer_window (window=3, stride=1, hop=1)
Output (grain=1)
Output (grain=5)
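To make the grain-size tradeoff concrete, here is a hedged sketch of multi-frame grain assembly by overlap-add in latent space. It assumes each grain is centred on the matched pool frame, tapered with a Hann-like window, and that stride is the spacing between query positions in the target; the function name assemble_grains and these conventions are illustrative and may differ from the real implementation.

```python
import numpy as np

def assemble_grains(pool, match_idx, grain_size=5, stride=1):
    """Overlap-add pool grains centred on each matched frame.

    pool: (P, D) latent frames; match_idx: (T,) pool index per target frame.
    Returns (T, D) output latents, normalised by the summed window weights.
    """
    T, P = len(match_idx), len(pool)
    out = np.zeros((T, pool.shape[1]))
    weight = np.zeros((T, 1))
    # Hann taper without its zero endpoints, so grain_size=1 gives weight 1.
    win = np.hanning(grain_size + 2)[1:-1, None]
    half = grain_size // 2
    offsets = np.arange(grain_size) - half
    for t in range(0, T, stride):
        # Source frames of the grain, clamped to the pool boundaries.
        src = np.clip(match_idx[t] + offsets, 0, P - 1)
        dst = t + offsets
        keep = (dst >= 0) & (dst < T)
        out[dst[keep]] += win[keep] * pool[src[keep]]
        weight[dst[keep]] += win[keep]
    # Frames no grain reached (stride larger than grain_size) stay at zero.
    return out / np.maximum(weight, 1e-8)
```

With grain_size=1 and stride=1 this returns pool[match_idx] unchanged. Larger grains average several overlapping pool excerpts per output frame, which is one way to see why discontinuities shrink while attacks soften.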

Notebook plots (amen → tabla, windowed match)

Waveforms: the output keeps amen’s timing density but inherits more tabla-like envelope/dynamics (often lower peak range).

Amen/tabla waveforms

PCA: target and pool occupy distinct regions. Matched-pool frames and output concentrate near the pool region, which is consistent with “projecting” the target trajectory onto pool-like latent vocabulary rather than interpolating between them.

Latent PCA
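A projection like the one in this plot can be produced with a plain SVD-based PCA. The sketch below assumes the PCA basis is fitted jointly on all frame sets so they share the same axes; the helper name pca_2d is hypothetical and the notebook may use a library implementation instead.

```python
import numpy as np

def pca_2d(*frame_sets):
    """Project several (N_i, D) latent arrays onto their shared top-2 PCA axes."""
    stacked = np.concatenate(frame_sets, axis=0)
    mean = stacked.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    return [(z - mean) @ vt[:2].T for z in frame_sets]
```

Calling pca_2d(target, pool) returns one (frames, 2) array per input, which can be scattered in different colours. Note that this variant fits the axes only on the arrays passed in, so adding the output latents as a third argument changes the basis.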

Instruments: smoother latent trajectories

Sustained/harmonic sources tend to benefit from longer grains: partials are preserved over multiple frames, and the “jitter” caused by frame-to-frame jumps between matched pool frames is reduced.

Bigger grains help: violin → sax

Method
tokui_style_transfer_window (window=1, stride=1, hop=1)
Output (grain=1)
Output (grain=5)

Augmentation matches the pitch better perceptually (the pool covers more of the latent space), but the output is still not smooth, and both FAD and MFCC get worse.

Config
grain_size=5, stride=3,
Output (no aug)
Output (augmented)

The pitch- and gain-shifted pool variants broaden the pool's latent coverage, so the matcher finds closer harmonic candidates.
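As a rough sketch of how such pool variants could be generated with pedalboard: the helper names augmentation_grid and augment_pool, and the specific semitone and gain values, are illustrative assumptions rather than the settings used in these experiments. Each returned variant would then be encoded and its latent frames appended to the pool.

```python
from itertools import product

def augmentation_grid(semitones=(-2, 0, 2), gains_db=(-6.0, 0.0)):
    """All (semitone, gain_db) pairs; (0, 0.0) leaves the pool unmodified."""
    return list(product(semitones, gains_db))

def augment_pool(audio, sample_rate, grid):
    """Return one processed copy of `audio` per (semitone, gain_db) pair.

    audio: float32 array of shape (channels, samples).
    """
    # Imported lazily so the grid helper works without pedalboard installed.
    from pedalboard import Pedalboard, PitchShift, Gain

    variants = []
    for semis, gain_db in grid:
        board = Pedalboard([PitchShift(semitones=semis), Gain(gain_db=gain_db)])
        variants.append(board(audio, sample_rate))
    return variants
```

A wider grid gives the matcher more candidates but also grows the pool, and with it the cost of the similarity search, linearly in the number of variants.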

Evaluation plots

Grain size sweep. Larger grains reduce frame-to-frame jitter and tend to lower FAD (especially for instruments), but can smear percussion transients.

Grain size plot
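For reference, FAD is a Fréchet distance between Gaussian fits of two sets of audio embeddings. The sketch below shows only that distance, assuming the embeddings have already been extracted into arrays of shape (clips, dims); the embedding model is not specified here, and the function name frechet_distance is illustrative.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 (cov_a cov_b)^(1/2))
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of cov_a @ cov_b, which are real and non-negative
    # up to numerical noise for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Lower is better: identical embedding sets give a distance of zero, and a pure shift of the mean by a vector d gives the squared length of d.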

Stride sweep. Stride=1 is consistently best; higher stride reduces query density and increases artefacts.

Stride plot

Augmentation. In aggregate it lowers percussion FAD slightly but worsens MFCC‑L2; for instruments it worsens both metrics on average (pair-specific exceptions exist). More perceptually grounded metrics are needed to evaluate the true impact.

Augmentation plot

Parameter sensitivity. Grain size and stride dominate; hop and window size are secondary and often pair-dependent.

Parameter sensitivity