Python: Time-stretching WAV files


Problem description

I'm doing some data augmentation on a speech dataset, and I want to stretch/squeeze each audio file in the time domain.

I found the following three ways to do that, but I'm not sure which one is the best or most efficient:

dimension = int(len(signal) * speed)
res = librosa.effects.time_stretch(signal, speed)
res = cv2.resize(signal, (1, dimension)).squeeze()
res = skimage.transform.resize(signal, (dimension, 1)).squeeze()

However, I found that librosa.effects.time_stretch adds unwanted echo (or something like that) to the signal.

So, my question is: What are the main differences between these three ways? And is there any better way to do that?

Answer

librosa.effects.time_stretch(signal, speed) (docs)

In essence, this approach transforms the signal using stft (short time Fourier transform), stretches it using a phase vocoder and uses the inverse stft to reconstruct the time domain signal. Typically, when doing it this way, one introduces a little bit of "phasiness", i.e. a metallic clang, because the phase cannot be reconstructed 100%. That's probably what you've identified as "echo."

Note that while this approach effectively stretches audio in the time domain (i.e., the input is in the time domain as well as the output), the work is actually being done in the frequency domain.
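For intuition, here is roughly that pipeline spelled out with librosa's lower-level functions; a minimal sketch, assuming `signal` and `speed` are defined as in the question (note that newer librosa versions expect the stretch factor as a keyword argument, e.g. `rate=speed`):

import librosa

# STFT -> phase vocoder -> inverse STFT, as described above.
stft = librosa.stft(signal)                         # complex spectrogram
stft_stretched = librosa.phase_vocoder(stft, rate=speed)
res = librosa.istft(stft_stretched)                 # time-domain output, ~len(signal)/speed samples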

cv2.resize(signal, (1, dimension)).squeeze() (docs)

All this approach does is interpolate the given signal using bilinear interpolation. This is suitable for images, but strikes me as unsuitable for audio signals. Have you listened to the result? Does it sound at all like the original signal, only faster/slower? I would assume that not only the tempo changes, but also the pitch, along with other artifacts.
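To make this concrete, here is a hypothetical sanity check with a pure tone (the 440 Hz frequency, the sample rate, and the 1.5x stretch factor are made up for illustration); resizing the waveform changes the pitch together with the tempo:

import numpy as np
import cv2

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)   # 1 second, 440 Hz sine

# same call shape as in the question, with an explicit column reshape
stretched = cv2.resize(tone.reshape(-1, 1), (1, int(len(tone) * 1.5))).squeeze()
# 'stretched' is 1.5x longer, but played back at the same sample rate the
# tone now sounds like roughly 440 / 1.5 ≈ 293 Hz: pitch and tempo change together.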

skimage.transform.resize(signal, (dimension, 1)).squeeze() (docs)

Again, this is meant for images, not sound. In addition to the interpolation (spline interpolation of order 1 by default), this function also applies anti-aliasing for images. Note that this has nothing to do with avoiding audio aliasing effects (Nyquist/aliasing), so you should probably turn it off by passing anti_aliasing=False. Again, I would assume that the results may not be exactly what you want (changed frequencies, other artifacts).
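If you do go the skimage route anyway, a minimal sketch of the call with the image-oriented anti-aliasing disabled might look like this (assuming `signal` and `speed` as in the question; the explicit reshape just turns the 1-D signal into a single-column image):

import numpy as np
from skimage.transform import resize

dimension = int(len(signal) * speed)
res = resize(signal.reshape(-1, 1), (dimension, 1),
             anti_aliasing=False).squeeze()   # no image anti-aliasing filter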

What to do?

IMO, you have several options.

If what you feed into your ML algorithms ends up being something like a Mel spectrogram, you could simply treat it as an image and stretch it using the skimage or opencv approach. Frequency ranges would be preserved. I have successfully used this kind of approach in this music tempo estimation paper.
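A sketch of that idea, stretching the time axis of a mel spectrogram rather than the waveform (the sample rate, number of mel bands, and stretch factor are illustrative choices, not prescriptions):

import numpy as np
import librosa
import cv2

mel = librosa.feature.melspectrogram(y=signal, sr=22050, n_mels=128)
stretch = 1.2                                   # example factor
new_frames = int(mel.shape[1] * stretch)
# cv2.resize takes (width, height) = (time frames, mel bands),
# so only the time axis is interpolated and the frequency bins stay put.
mel_stretched = cv2.resize(mel.astype(np.float32),
                           (new_frames, mel.shape[0]),
                           interpolation=cv2.INTER_LINEAR)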

Use a better time_stretch library, e.g. rubberband. librosa is great, but its current time scale modification (TSM) algorithm is not state of the art. For a review of TSM algorithms, see for example this article.
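For example, with the pyrubberband wrapper (this assumes both the pyrubberband package and the rubberband command-line tool are installed, and that `sr` is the sample rate of `signal`):

import pyrubberband as pyrb

res = pyrb.time_stretch(signal, sr, rate=1.5)   # 1.5x faster, pitch preserved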

Ignore the fact that the frequency changes and simply add zero samples to the signal on a regular basis, or drop samples from it on a regular basis (much like your image interpolation does). If you don't stretch too far, it may still work for data augmentation purposes. After all, the word content does not change just because the audio ends up at slightly higher or lower frequencies.
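A crude numpy sketch of the sample-dropping/repeating idea (the function name and semantics are made up for illustration; speed > 1 drops samples and shortens the signal, speed < 1 repeats samples and lengthens it):

import numpy as np

def naive_stretch(signal, speed):
    n_out = int(len(signal) / speed)
    # Pick evenly spaced indices into the original signal: some samples are
    # dropped (speed > 1) or repeated (speed < 1). No filtering is applied,
    # so the perceived frequencies shift along with the tempo.
    idx = np.round(np.linspace(0, len(signal) - 1, n_out)).astype(int)
    return signal[idx]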

Resample the signal to another sampling frequency, e.g. 44100 Hz -> 43000 Hz or 44100 Hz -> 46000 Hz, using a library like resampy, and then pretend that it is still 44100 Hz. This still changes the frequencies, but at least you get the benefit that resampy properly filters the result, so you avoid the aforementioned aliasing that otherwise occurs.
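A minimal sketch of that trick with resampy (the 46000 Hz target rate is just one of the example values above):

import resampy

sr_orig, sr_new = 44100, 46000
res = resampy.resample(signal, sr_orig, sr_new)
# 'res' has more samples; if you keep treating it as 44100 Hz audio it plays
# back slower and slightly lower in pitch, but without aliasing artifacts.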


