Canonical pitch-shifting algorithms in functional JavaScript. Frequency-domain algorithms (vocoder, phaseLock, transient, formant, paulstretch, sms, hpss) shift bins natively; time-domain algorithms (ola, wsola, psola, granular) apply their namesake stretcher from time-stretch, followed by an anti-aliased sinc resample. One unified API covers batch, streaming, and multi-channel use. Part of the audiojs ecosystem.
```sh
npm install pitch-shift
```

```js
import pitchShift, { phaseLock, transient, psola, formant, wsola } from 'pitch-shift'

// Auto-select an algorithm from content hints
let auto = pitchShift(audio, { semitones: 5, content: 'voice' })

// Batch processing
let pitched = phaseLock(audio, { ratio: 1.5 }) // pitch up by factor of 1.5

// Streaming (real-time)
let write = phaseLock({ ratio: 1.5 })
let output = write(inputBlock)
let tail = write() // flush

// Separate-channel stereo
let stereo = phaseLock([left, right], { ratio: 1.5 })
```

Each algorithm is a canonical pitch-shift implementation with its own character. Frequency-domain algorithms shift bins natively; time-domain algorithms use their namesake stretcher from time-stretch plus an anti-aliased sinc resample, which is the canonical form for time-domain pitch shifting.

| Algorithm | Domain | Form | Best for |
|---|---|---|---|
| `ola` | Time | OLA time-stretch + sinc resample. Plain overlap-add without similarity search; the baseline the others improve on. | Baseline / general |
| `vocoder` | STFT | Bin-shift phase vocoder (SMB/Bernsee). True instantaneous frequency per bin, loudest-wins scatter, synthesis phase accumulation. | Simple tonal material |
| `phaseLock` | STFT | Laroche-Dolson peak-locked vocoder. Peaks get independent frequency shift; non-peak bins preserve phase offset relative to the nearest peak. | General music |
| `transient` | STFT | Peak-locked vocoder with spectral-flux transient detection. On transient frames, synthesis phase resets to analysis phase, preserving attacks. | Music with percussion |
| `psola` | Time | PSOLA time-stretch (autocorrelation period → pitch-mark grains) + sinc resample. Formants preserved in the stretch stage. | Speech, monophonic voice |
| `wsola` | Time | WSOLA time-stretch (per-grain similarity search ±tolerance) + sinc resample. Clean time-domain shift without FFT. | Speech, low-latency |
| `granular` | Time | OLA time-stretch with small grains (1024) + sinc resample. Grain rate is audible; the texture is the point. | Creative textures |
| `formant` | STFT | Cepstral envelope preservation. Flatten spectrum by the real-cepstrum envelope, vocoder-shift the flat residual, re-impose the envelope. | Voice (preserves formants) |
| `paulstretch` | STFT | Large-frame phase randomization. Magnitudes gathered from k/ratio; phases drawn uniformly from [0, 2π). | Ambient, extreme shifts |
| `sms` | Sinusoidal | Peak-scaled Spectral Modeling Synthesis. Parabolic-interpolated peak picking → sinusoidal lobes shifted to round(f·ratio), stochastic residual preserved. | Harmonic/tonal |
| `hpss` | STFT | Fitzgerald median-filter harmonic/percussive separation. Time-axis median → harmonic estimate; freq-axis median → percussive estimate; soft mask; vocoder-shift the harmonic part, pass-through the percussive. | Mixed music (drums+tonal) |
| `sample` | Time | Playback-rate pitch shift. Hann-windowed sinc interpolation at fractional read-head stepped by ratio per output sample. No time preservation: higher pitch = shorter clip (zero-padded tail). | Sampler/tracker instrument playback |
| `hybrid` | Hybrid | Crossfade between phaseLock (frequency domain) and wsola (time domain), weighted by per-sample spectral-flux transient confidence. Tonal regions resolve via the phase vocoder; attacks resolve via WSOLA similarity search. | Mixed dynamic material |

Each algorithm preserves a different invariant and surrenders the rest. No single one wins everywhere — the reason to reach for one over another is what it keeps intact by construction and what it must give up for that. The guide below is what each canonical form trades.
ola — OLA time-stretch + sinc resample. Plain overlap-add without similarity search — the baseline the others improve on. Preserves pitch accuracy, amplitude envelope. Destroys formants (shifted by the resample), phase coherence across long spans, transients (grain-rate phase cancellation). Reach for the simplest possible pitch shift, or as a reference to compare against.
vocoder — SMB/Bernsee bin-shift. Recovers the true instantaneous frequency of each bin from the consecutive-frame phase advance, then re-accumulates synthesis phase at the shifted frequency. Preserves dominant-partial pitch and long-horizon phase for each bin independently. Destroys transients (smeared across the frame), vertical phase coherence between adjacent bins ("phasiness"), formants. Reach for simple tonal material and minimal correct spectral pitch shift.
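
The phase-advance step described for vocoder can be sketched as follows. This is a minimal illustration of the textbook phase-vocoder frequency estimate, not this package's code; the function name `instantaneousFrequency` and its parameter names are hypothetical.

```js
// Hypothetical helper: estimate the true frequency (Hz) of STFT bin `k`
// from the phases measured in two consecutive analysis frames.
function instantaneousFrequency(k, prevPhase, phase, { frameSize, hopSize, sampleRate }) {
  // Phase advance that a sinusoid exactly at the bin centre would show over one hop
  const expected = 2 * Math.PI * k * hopSize / frameSize
  // Deviation from that expectation, wrapped into [-π, π]
  let delta = phase - prevPhase - expected
  delta -= 2 * Math.PI * Math.round(delta / (2 * Math.PI))
  // Convert the deviation into a fractional bin offset, then into Hz
  const binOffset = delta * frameSize / (2 * Math.PI * hopSize)
  return (k + binOffset) * sampleRate / frameSize
}
```

A pitch shifter would then multiply this frequency by the ratio and accumulate synthesis phase at the shifted value, one hop at a time.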
phaseLock — Laroche-Dolson peak-locked vocoder. Locks non-peak bins' synthesis phase to the nearest peak's, keeping the vertical phase relationship inside each sinusoidal lobe intact. Preserves phase coherence around each peak, partial structure, pitch accuracy. Destroys transients (still smeared, less than vocoder), formants. Reach for general music — the "try this first" phase vocoder.
transient — phaseLock plus spectral-flux transient detection. On flagged frames the synthesis phase snaps back to the analysis phase so the attack shape re-emerges verbatim. Preserves everything phaseLock preserves, plus attack localization on detected transients. Destroys formants; misses quiet transients at a too-high threshold and smears them. Reach for music with percussion where phaseLock alone loses the attack.
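
A spectral-flux detector of the kind transient relies on might look like the sketch below. The default threshold of 1.5 comes from the `transientThreshold` option; everything else (the name `createTransientDetector`, the `smoothing` constant, the exact statistics) is an assumption made for illustration and may differ from the package.

```js
// Hypothetical detector: flags frames whose log spectral flux stands out
// from a running (EMA) mean by more than `threshold` standard deviations.
function createTransientDetector({ threshold = 1.5, smoothing = 0.9 } = {}) {
  let prev = null    // previous frame's magnitudes
  let mean = 0       // EMA of log-flux
  let variance = 0   // EMA of squared deviation from the mean
  return function isTransient(magnitudes) {
    let flux = 0
    if (prev) {
      for (let k = 0; k < magnitudes.length; k++) {
        const rise = magnitudes[k] - prev[k]
        if (rise > 0) flux += rise // half-wave rectified: only energy increases count
      }
    }
    prev = Float32Array.from(magnitudes)
    const logFlux = Math.log1p(flux)
    // Score the current frame against the statistics of earlier frames
    const z = (logFlux - mean) / Math.sqrt(variance + 1e-9)
    // Update the running statistics after scoring
    const deviation = logFlux - mean
    mean = smoothing * mean + (1 - smoothing) * logFlux
    variance = smoothing * variance + (1 - smoothing) * deviation * deviation
    return z > threshold
  }
}
```

On a flagged frame, the synthesis phase would be reset to the analysis phase; on all other frames the peak-locked vocoder proceeds as usual.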
psola — PSOLA time-stretch (autocorrelation period contour → pitch-synchronous grains) + sinc resample. The stretch stage copies vocal periods verbatim, preserving formant shape; the resample stage changes pitch. Preserves waveform-per-period shape, attack localization, voiced-speech naturalness. Destroys polyphony (assumes a single pitch contour), unvoiced regions (pitch-mark jitter). Reach for monophonic speech, solo voice, or a single melodic instrument with formant structure.
wsola — WSOLA time-stretch (per-grain similarity search ±tolerance) + sinc resample. The similarity search eliminates grain-rate phase cancellation, producing a clean time-domain pitch shift without FFT. Preserves local waveform shape, attack envelopes, pitch accuracy. Destroys formants (shifted by the resample), phase coherence across long spans. Reach for low-latency speech, or anywhere the phase vocoder's frame is unacceptable.
granular — OLA time-stretch with small grains (1024) + sinc resample. No similarity search, no pitch sync. The grain rate is clearly audible — the texture is the point. Preserves grain-local timbre and a characteristic textural quality. Destroys pitch accuracy on complex tones, smooth envelopes. Reach for creative/textural effects where the grain character is the point.
formant — Cepstral envelope preservation wrapping a vocoder shift. Lifter-flatten the spectrum by its real-cepstrum envelope, shift the flat residual in bin space, re-impose the envelope unchanged. Preserves formant envelope (absolute Hz), vocal-tract character. Destroys what vocoder destroys (transients smear), risks cepstral ringing on very noisy or very sparse spectra. Reach for voice shifting without the chipmunk/giant artifact.
paulstretch — Large-frame phase randomization. Magnitudes are gathered from source bins at k/ratio; phases are redrawn uniformly from [0, 2π) every frame. Preserves long-term magnitude-spectrum statistics. Destroys phase, transients, any rhythmic micro-structure — by design. Stream-vs-batch decorrelates inherently, which is why the metric is marked —. Reach for ambient/drone textures and extreme shift ratios where the smear is the aesthetic.
sms — Peak-scaled Spectral Modeling Synthesis. Parabolic-interpolated peak picking builds a small track list of (freq, mag, phase) triples; each peak's lobe is copied intact to round(f·ratio); the stochastic residual is left unshifted. Preserves formant envelope (lobes scale freely with their peaks), harmonic structure, tonal clarity. Destroys transients, noise-like textures (absorbed into the residual), polyphonic material beyond maxTracks. Reach for sustained tonal/harmonic instruments and vowels where envelope matters.
hpss — Fitzgerald 2010 median-filter harmonic/percussive separation. Time-axis median → harmonic-friendly view; freq-axis median → percussive-friendly view; soft Wiener mask at exponent p splits the spectrogram. The harmonic component is vocoder-shifted; the percussive component passes through at its original phase. Preserves percussive onset locations (unshifted) and harmonic pitch (shifted). Destroys a little signal quality to mask leakage in both directions on ambiguous material. Reach for mixed music where drums and tonal content coexist and the kit should stay stationary while the melody moves.
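
The median-filter separation can be sketched on a plain magnitude spectrogram as below. The defaults mirror the documented `hpssTimeWidth`, `hpssFreqWidth`, and `hpssPower` values; the function name `hpssMasks`, the array-of-frames layout, and the edge handling are assumptions, and the package's implementation may be organized differently.

```js
// Median of a small list (copies before sorting)
function median(values) {
  const sorted = Array.from(values).sort((a, b) => a - b)
  const mid = sorted.length >> 1
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

// Hypothetical sketch: `spectrogram` is an array of frames, each an array of bin magnitudes.
function hpssMasks(spectrogram, { timeWidth = 17, freqWidth = 17, power = 2 } = {}) {
  const frames = spectrogram.length, bins = spectrogram[0].length
  const halfT = timeWidth >> 1, halfF = freqWidth >> 1
  const harmonicMask = [], percussiveMask = []
  for (let t = 0; t < frames; t++) {
    const hRow = new Float32Array(bins), pRow = new Float32Array(bins)
    for (let k = 0; k < bins; k++) {
      // Median along time (fixed bin) favours sustained, harmonic energy
      const alongTime = []
      for (let i = Math.max(0, t - halfT); i <= Math.min(frames - 1, t + halfT); i++) alongTime.push(spectrogram[i][k])
      // Median along frequency (fixed frame) favours broadband, percussive energy
      const alongFreq = []
      for (let j = Math.max(0, k - halfF); j <= Math.min(bins - 1, k + halfF); j++) alongFreq.push(spectrogram[t][j])
      const h = Math.pow(median(alongTime), power)
      const p = Math.pow(median(alongFreq), power)
      const total = h + p + 1e-12 // guard against 0/0 on silent cells
      hRow[k] = h / total
      pRow[k] = p / total
    }
    harmonicMask.push(hRow)
    percussiveMask.push(pRow)
  }
  return { harmonicMask, percussiveMask }
}
```

Multiplying the complex spectrogram by each mask yields the two components; only the harmonic one would then go through the vocoder shift.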
sample — Playback-rate pitch shift: Hann-windowed sinc interpolation at a fractional read-head stepped by ratio per output sample. The intuition hardware samplers and tracker modules run on. Preserves waveform identity (literally the same audio, faster or slower) and formants trivially — everything scales together. Destroys time: output duration is input_length / ratio, and the tail is zero-padded to keep the unified API. Reach for instrument one-shots, ROM-sample playback, any context where "higher pitch = shorter clip" is the intended effect.
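
The read-head idea behind sample can be sketched in a few lines. This is an illustration under assumptions, not the package's code: the name `playbackRateShift` is hypothetical, the default radius follows the documented `sincRadius` of 8, and the sketch omits the kernel band-limiting a production resampler would add when ratio exceeds 1.

```js
// Hypothetical sketch of playback-rate shifting: read the input at a
// fractional position that advances by `ratio` per output sample.
function playbackRateShift(input, ratio, sincRadius = 8) {
  const output = new Float32Array(input.length) // same length as input; unused tail stays zero
  const sinc = (x) => (x === 0 ? 1 : Math.sin(Math.PI * x) / (Math.PI * x))
  for (let n = 0; n < output.length; n++) {
    const pos = n * ratio               // fractional read-head
    if (pos > input.length - 1) break   // input exhausted: leave the zero padding
    const centre = Math.floor(pos)
    let acc = 0
    for (let i = centre - sincRadius + 1; i <= centre + sincRadius; i++) {
      if (i < 0 || i >= input.length) continue
      const d = pos - i                 // distance from the read-head, in samples
      // Hann window spanning ±sincRadius around the read-head
      const w = 0.5 * (1 + Math.cos(Math.PI * d / sincRadius))
      acc += input[i] * sinc(d) * w
    }
    output[n] = acc
  }
  return output
}
```

With a ratio of 2, the first half of the output holds every second input sample and the second half is silence, which is the "higher pitch = shorter clip" behaviour described above.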
hybrid — Runs phaseLock and wsola in parallel and crossfades sample-by-sample by a transient-confidence signal from spectral flux. Tonal regions resolve via the phase vocoder; attacks resolve via WSOLA similarity search. Preserves phase coherence on tonal regions and attack shape on transients — simultaneously. Destroys CPU budget (≈2×), strict low-latency causality (the detector looks both ways), formants. Reach for mixed dynamic material where a single domain compromises the other.
Each algorithm is measured across ten canonical properties on synthetic fixtures with exact ground truth. The shift column is a direct log-magnitude distance between the algorithm output and a canonically generated shifted reference (e.g. sine(660) as the ground truth for pitchShift(sine(440), 1.5)) — no heuristic, no proxy metric. Run npm run quality for the live numbers.
| Algorithm | f0 err | THD% | alias | stream corr | cent err | onset err | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|---|---|---|---|
| `hpss` | 0.00 | 0.0 | 0.052 | 1.000 | 0.007 | 0.000 | 0.996 | 1.267 | 0.922 | 1.464 |
| `vocoder` | 0.00 | 0.0 | 0.000 | 1.000 | 0.006 | 0.000 | 0.983 | 1.343 | 0.922 | 1.491 |
| `formant` | 0.00 | 0.0 | 0.000 | 1.000 | 0.061 | 0.000 | 0.984 | 1.000 | 0.981 | 1.616 |
| `ola` | 1.00 | 0.2 | 0.005 | 1.000 | 0.003 | 0.000 | 0.995 | 2.345 | 0.869 | 1.650 |
| `wsola` | 1.00 | 0.2 | 0.005 | 1.000 | 0.003 | 0.000 | 0.995 | 2.345 | 0.869 | 1.650 |
| `sample` | 2.50 | 0.1 | 0.007 | 1.000 | 0.003 | 0.000 | 0.951 | 2.245 | — | 1.655 |
| `sms` | 0.00 | 0.0 | 0.002 | 1.000 | 0.001 | 0.000 | 0.953 | 2.028 | 0.922 | 1.761 |
| `pitchShift` (auto) | 0.00 | 0.0 | 0.000 | 1.000 | 0.012 | 0.000 | 0.985 | 1.600 | 0.993 | 1.795 |
| `transient` | 0.00 | 0.0 | 0.000 | 1.000 | 0.012 | 0.000 | 0.985 | 1.600 | 0.993 | 1.795 |
| `phaseLock` | 0.00 | 0.0 | 0.000 | 1.000 | 0.012 | 0.000 | 0.986 | 1.591 | 0.993 | 1.796 |
| `granular` | 1.00 | 0.2 | 0.005 | 1.000 | 0.003 | 0.000 | 0.995 | 2.903 | 0.946 | 1.916 |
| `hybrid` | 0.00 | 0.0 | 0.000 | 1.000 | 0.001 | 0.000 | 0.986 | 2.499 | 0.711 | 1.945 |
| `psola` | 0.66 | 0.2 | 0.005 | 1.000 | 0.003 | 0.000 | 0.941 | 2.340 | 0.998 | 1.954 |
| `paulstretch` | 1.67 | 0.3 | 0.223 | — | 0.061 | 0.000 | 0.961 | 7.449 | — | 2.241 |

Columns:
- f0 err (Hz): pitch accuracy shifting a 440 Hz sine to 660 Hz. Zero-crossing estimator over the active signal region.
- THD%: total harmonic distortion on the shifted pure sine (up to 8 harmonics).
- alias: active-region RMS of output / input when shifting a 14 kHz sine by ×2. Canonical behaviour is near zero (nothing valid above Nyquist); time-domain stride-reads fold energy back.
- stream corr: streaming vs batch correlation on the 440 Hz sine. Marked — for algorithms whose phase or grain jitter decorrelates on pure tones even when producing valid output (paulstretch randomizes phases, psola jitters pitch marks).
- cent err: spectral centroid ratio error on a 3-partial chord. Lower means the timbre shifts by exactly `ratio`.
- onset err: period error of a 100 Hz Dirac impulse train after shift. Measures how well impulse locations survive.
- attack corr: plucked-string attack envelope correlation against the input.
- formant dist: cepstral envelope distance on a synthetic vowel. Lower = formants stay put. `formant` dominates here by construction.
- phase coh: AM-envelope coherence on a 5 Hz tremolo. Goertzel-extracted modulation depth, `min(out, in) / max(out, in)`. 1.0 means the slow envelope survives the shift intact. Marked — for `paulstretch` (random phase is non-deterministic) and `sample` (time-compresses, so the modulation rate itself shifts).
- shift (log-mag): direct log-magnitude spectral distance between the algorithm output and the canonical shifted reference, averaged over four harmonic ground-truth fixtures: `sine(660)`, `sineChord(330, [1,1.25,1.5])`, `karplusStrong(330)`, and `amSine(660)`. Gain- and phase-invariant. Bold = leader. The single best "how close to the ideal pitch shift" number.

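
To make the f0 err column concrete, here is one way a zero-crossing estimator could be written. It is a sketch under assumptions: the names `sine` and `zeroCrossingF0` are hypothetical stand-ins, and the real rig in scripts/metrics.js may differ (for example in how it picks the active region).

```js
// Hypothetical fixture: a pure sine stored as Float32Array
function sine(freq, seconds, sampleRate = 44100) {
  const out = new Float32Array(Math.round(seconds * sampleRate))
  for (let i = 0; i < out.length; i++) out[i] = Math.sin(2 * Math.PI * freq * i / sampleRate)
  return out
}

// Estimate f0 from the spacing of upward zero crossings,
// refining each crossing position by linear interpolation.
function zeroCrossingF0(signal, sampleRate = 44100) {
  let first = -1, last = -1, count = 0
  for (let i = 1; i < signal.length; i++) {
    if (signal[i - 1] < 0 && signal[i] >= 0) {
      const frac = signal[i - 1] / (signal[i - 1] - signal[i])
      const pos = i - 1 + frac
      if (first < 0) first = pos
      last = pos
      count++
    }
  }
  if (count < 2) return 0 // not enough crossings to measure a period
  // (count - 1) whole periods elapse between the first and last crossing
  return (count - 1) * sampleRate / (last - first)
}
```

The f0 error for a shifted 440 Hz sine would then be `Math.abs(zeroCrossingF0(output) - 660)`.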
Notes. formant, hpss, and sms dominate formant preservation by construction. transient dominates transient preservation on drum material even though attack corr on a plucked string is close across algorithms. paulstretch stream-vs-batch is marked — because random phase synthesis decorrelates by design. See scripts/fixtures.js and scripts/metrics.js for the full rig.
All algorithms accept:
| Option | Default | Description |
|---|---|---|
| `ratio` | `1` | Pitch shift ratio (1.5 ≈ +7 semitones, 2 = +1 octave) |
| `semitones` | from `ratio` | Pitch shift in semitones |
| `content` | `music` | Auto-select hint for the default export: `music`, `voice`, `speech`, `tonal` |
| `method` | `auto` | Explicit algorithm for the default export |
| `formant` | `false` | Use formant-preserving shifting through the default export |
| `frameSize` | `2048` | Frame size in samples |
| `hopSize` | `frameSize/4` | Hop between frames |

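
The `ratio` and `semitones` options describe the same quantity in two units. In twelve-tone equal temperament the conversion is `ratio = 2^(semitones / 12)`. The helper names below are hypothetical and only illustrate the arithmetic:

```js
// Equal-temperament conversion between semitones and frequency ratio
const semitonesToRatio = (semitones) => Math.pow(2, semitones / 12)
const ratioToSemitones = (ratio) => 12 * Math.log2(ratio)

semitonesToRatio(12)   // 2 (one octave up)
semitonesToRatio(5)    // ≈ 1.335 (a perfect fourth)
ratioToSemitones(1.5)  // ≈ 7.02 (a just perfect fifth)
```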
Algorithm-specific options:
- `transient`: `transientThreshold` (default: `1.5`), z-score over log-flux EMA
- `psola`: `sampleRate`, `minFreq` (default `70`), `maxFreq` (default `600`)
- `wsola`: `tolerance` (default `frameSize/4`), similarity search radius
- `formant`: `envelopeWidth` (default `max(8, N/64)`), cepstrum lifter cutoff
- `sms`: `maxTracks` (default `Infinity`), `minMag` (default `1e-4`)
- `hpss`: `hpssTimeWidth` (default `17` frames), `hpssFreqWidth` (default `17` bins), `hpssPower` (default `2`); median window sizes and soft-mask exponent
- `sample`: `sincRadius` (default `8`), windowed-sinc half-width in samples
- `hybrid`: `hybridThreshold` (default `0.8`), spectral-flux z-score threshold for full WSOLA blend

Default export selection:
- `voice` / `speech` → `psola`
- `tonal` → `sms`
- everything else → `transient`

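
Read as code, the selection rule might look like the sketch below. The content mapping and the option defaults come from this README; the precedence between `method`, `formant`, and `content` is an assumption, and `selectAlgorithm` is a hypothetical name, not an export of the package.

```js
// Hypothetical sketch of how the default export could pick an algorithm name
function selectAlgorithm({ content = 'music', method = 'auto', formant = false } = {}) {
  if (method !== 'auto') return method   // assumption: an explicit method always wins
  if (formant) return 'formant'          // assumption: the formant flag is checked next
  if (content === 'voice' || content === 'speech') return 'psola'
  if (content === 'tonal') return 'sms'
  return 'transient'                     // music and anything unrecognized
}
```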
All frequency-domain algorithms (vocoder, phaseLock, transient, formant, sms, paulstretch, hpss) and sample accept a time-varying ratio — either a function (timeSeconds) => ratio or a Float32Array sampled uniformly across the input duration. STFT-based algorithms evaluate the ratio per frame; sample evaluates per output sample.
```js
// Sinusoidal vibrato: ±10% pitch at 5 Hz
let vibrato = phaseLock(audio, {
  ratio: (t) => 1 + 0.1 * Math.sin(2 * Math.PI * 5 * t),
  sampleRate: 44100,
})

// Glissando from unison to +1 octave across a 2-second clip
let glide = sample(audio, {
  ratio: new Float32Array([1, 1.25, 1.5, 1.75, 2]),
  ratioDuration: 2,
  sampleRate: 44100,
})
```

Time-domain algorithms (ola, wsola, psola, granular) and hybrid reject a function or array ratio: their stretch+resample form applies a single global ratio to the whole signal.

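
The three accepted ratio forms (number, function, uniformly sampled array) can be reduced to a single lookup. The sketch below is an assumption about how such a lookup might work, with a hypothetical name `ratioAt` and linear interpolation between array points; the package may interpolate differently.

```js
// Hypothetical lookup: resolve the pitch ratio at a given time
function ratioAt(ratio, timeSeconds, durationSeconds) {
  if (typeof ratio === 'number') return ratio
  if (typeof ratio === 'function') return ratio(timeSeconds)
  // Array form: points spread uniformly over [0, durationSeconds]
  const last = ratio.length - 1
  if (last === 0 || durationSeconds <= 0) return ratio[0]
  // Normalized position, clamped to the ends of the curve
  const pos = Math.min(Math.max(timeSeconds / durationSeconds, 0), 1) * last
  const i = Math.min(Math.floor(pos), last - 1)
  const frac = pos - i
  return ratio[i] * (1 - frac) + ratio[i + 1] * frac // linear interpolation (assumed)
}

const curve = new Float32Array([1, 1.25, 1.5, 1.75, 2])
ratioAt(curve, 1, 2)     // 1.5, halfway through the 2-second glide
ratioAt(curve, 0.25, 2)  // 1.125, between the first two points
```

An STFT-based algorithm would call such a lookup once per frame, while sample would call it once per output sample.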
Variable pitch enables pitch correction when combined with a pitch detector. Detect the sung pitch per frame, compute the ratio to the nearest target note, and pass the correction curve as a ratio function:
```js
import { yin } from 'pitch-detection'
import { formant } from 'pitch-shift'

// 1. Detect pitch per frame
let hop = 512, sr = 44100
let pitchFrames = []
for (let i = 0; i + 2048 <= audio.length; i += hop) {
  let r = yin(audio.subarray(i, i + 2048), { fs: sr })
  pitchFrames.push(r ? { freq: r.freq, clarity: r.clarity } : null)
}

// 2. Snap to nearest scale degree
let scale = [261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88] // C major
let snap = (f) => scale.reduce((a, b) =>
  Math.abs(Math.log2(b / f)) < Math.abs(Math.log2(a / f)) ? b : a
)

// 3. Build correction curve and apply
let corrected = formant(audio, {
  ratio: (t) => {
    let idx = Math.min(Math.round(t * sr / hop), pitchFrames.length - 1)
    let p = pitchFrames[idx]
    if (!p || p.clarity < 0.5) return 1 // unvoiced → no correction
    return snap(p.freq) / p.freq
  },
  sampleRate: sr,
})
```

formant is the natural choice for voice correction (preserves vowel character). For the hard-tune "auto-tune effect", use phaseLock with aggressive snapping. For harmonic instruments, sms preserves partial structure while following the correction curve.

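
The `snap` helper above lists a single octave of C major (C4 to B4), so a detected pitch outside that range would be pulled toward it by up to several octaves. An octave-aware variant, sketched here as an assumption rather than as part of either library, works on semitone offsets from a tonic instead. The name `snapToScale` and its options are hypothetical.

```js
// Hypothetical octave-aware snap: quantize to the nearest note of a scale
// given as semitone offsets from its tonic (default: C major from C4).
function snapToScale(freq, { tonicHz = 261.63, degrees = [0, 2, 4, 5, 7, 9, 11] } = {}) {
  const semis = 12 * Math.log2(freq / tonicHz) // signed distance from the tonic
  const octave = Math.floor(semis / 12)
  let best = Infinity
  // Candidates: every scale degree in this octave, plus the tonic one octave up
  for (const degree of [...degrees, 12]) {
    const candidate = octave * 12 + degree
    if (Math.abs(candidate - semis) < Math.abs(best - semis)) best = candidate
  }
  return tonicHz * Math.pow(2, best / 12)
}
```

The correction ratio for a detected frequency `f` is then `snapToScale(f) / f`, which stays close to 1 in any register.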
```js
import { phaseLock, transient, psola, formant, granular, wsola, sms, hpss, sample, hybrid } from 'pitch-shift'

// Music with drums
let result = transient(audio, { ratio: 1.5 })

// Mixed music (drums + tonal content) with harmonic/percussive separation
let mixed = hpss(audio, { ratio: 1.5 })

// Hybrid: transient-gated crossfade between phase vocoder and WSOLA
let dynamic = hybrid(audio, { ratio: 1.5 })

// Voice (formant-preserving)
let voice = formant(audio, { semitones: 5 })

// Speech
let speech = psola(audio, { ratio: 0.75, sampleRate: 48000 })

// Tonal/harmonic
let tonal = sms(audio, { ratio: 2 })

// Creative granular
let grainy = granular(audio, { ratio: 1.3 })

// Explicit WSOLA (renamed so it does not redeclare `speech` above)
let speechWsola = wsola(audio, { ratio: 0.85 })

// Sampler-style playback rate (instrument one-shots)
let played = sample(instrumentBuffer, { semitones: 7 })
```

All algorithms support block-by-block streaming:

```js
let write = phaseLock({ ratio: 1.5 })

// Process audio in chunks
let chunk1 = write(inputBlock1) // → Float32Array
let chunk2 = write(inputBlock2)
let tail = write() // flush remaining
```

Process channels independently:

```js
let leftOut = phaseLock(leftChannel, { ratio: 1.5 })
let rightOut = phaseLock(rightChannel, { ratio: 1.5 })

// Or pass separate channels together
let [leftShifted, rightShifted] = phaseLock([leftChannel, rightChannel], { ratio: 1.5 })
```

```sh
npm test
npm run quality
npm run bench
```

`npm run quality` reports pitch accuracy, stream-vs-batch correlation, stereo handling, and high-frequency attenuation.

- time-stretch — Time-domain stretchers (WSOLA, PSOLA) used by time-domain pitch-shift algorithms
- fourier-transform — FFT
- window-function — Hann windowing
The package name was previously held by mikolalysenko/pitch-shift (2013, frozen at v0.0.0). That package implements a single time-domain algorithm: per-frame Hann windowing → detect-pitch autocorrelation period → scalePitch linear interpolation → findMatch splice-point similarity search → overlap-add. This is the canonical WSOLA/TD-PSOLA pattern.
The same algorithm is available here as wsola (with per-grain cross-correlation search) or psola (with autocorrelation pitch marks). Both are native implementations without external pitch-detection dependencies and support batch, streaming, and multi-channel.
The old callback API:
```js
// v0.0.0 (old)
var shifter = require('pitch-shift')(onData, t => ratio, { frameSize: 2048 })
shifter.feed(float32Array)
```

New equivalent:

```js
// v1 (this package)
import { wsola } from 'pitch-shift'
let write = wsola({ ratio })
let out = write(float32Array)
let tail = write() // flush
```

- time-stretch: Time stretching
- audio-filter: Audio filters