Data generators for time series

Keras TimeseriesGenerator

In this reading we'll be looking at the TimeseriesGenerator, which is used for preprocessing and generating batches of temporal data.

Examples of sequential data are audio tracks, music, books and essays. Here, the order of the notes, words and sentences carries information about the meaning.

In [3]:
import tensorflow as tf
tf.__version__
Out[3]:
'2.0.0'

The dataset

In this notebook, we'll be using the DSD100 dataset to demonstrate the use of the TimeseriesGenerator to perform various preprocessing operations. The DSD100 dataset contains 100 music tracks in different styles. Its intended use is signal separation, so it also includes the separate instrument tracks that add up to make each music track. The tracks are all stereophonic and sampled at 44.1 kHz. A sample audio mixture from the dataset is provided.

Run the cell below to load one of the sample songs and press the play button to listen to the song.

In [4]:
# Play a sample track from the DSD100 dataset

from IPython import display as ipd

ipd.display(ipd.Audio("data/055 - Angels In Amplifiers - I'm Alright/mixture.wav"))

TimeseriesGenerator

Before diving into preprocessing techniques on the DSD100 dataset, let's start by getting to grips with the operations that we can perform using the TimeseriesGenerator on simple synthetic data.

We'll begin by defining a simple time series dataset and corresponding sequence of targets.

In [5]:
# Create a simple time series dataset

import numpy as np

dummy_data = np.arange(1, 11, 1)
dummy_targets = np.arange(10, 110, 10)
print(dummy_data)
print(dummy_targets)
[ 1  2  3  4  5  6  7  8  9 10]
[ 10  20  30  40  50  60  70  80  90 100]

The TimeseriesGenerator has three required arguments: data, targets and length.

The data argument can be a list or numpy array, with the first dimension corresponding to the time steps. The documentation describes the data as 2-dimensional, although the 1-dimensional array defined above also works, as the cells below show.

The targets can be a list or numpy array whose first dimension should match that of data. These are the target values that are aligned to the time steps of data. In some cases, data and targets could be the same.

The length argument controls the length of the samples generated by the TimeseriesGenerator in terms of the number of time steps.

In [6]:
# Create a TimeseriesGenerator object

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

timeseries_gen = TimeseriesGenerator(dummy_data, dummy_targets, 4)
In [7]:
# Print the contents of the generator

print('Length:', len(timeseries_gen))
inputs, outputs = timeseries_gen[0]
print("\nData:")
print(inputs)
print("Targets:")
print(outputs)
Length: 1

Data:
[[1 2 3 4]
 [2 3 4 5]
 [3 4 5 6]
 [4 5 6 7]
 [5 6 7 8]
 [6 7 8 9]]
Targets:
[ 50  60  70  80  90 100]

We can see that the TimeseriesGenerator object has created inputs of length 4 from the data, and aligned them to the corresponding targets at the next time step. These inputs and targets have been batched together in numpy arrays.
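
To make this alignment concrete, here is a minimal plain-Python sketch of the windowing described above. The helper name `make_windows` is hypothetical, and the code illustrates the observed behaviour rather than reproducing the Keras implementation.

```python
# Illustrative sketch only: `make_windows` is a hypothetical helper,
# not part of Keras.

def make_windows(data, targets, length):
    """Pair each window of `length` consecutive values with the target
    at the time step immediately after the window."""
    samples, window_targets = [], []
    for i in range(length, len(data)):
        samples.append(list(data[i - length:i]))  # time steps i-length .. i-1
        window_targets.append(targets[i])         # target at time step i
    return samples, window_targets


data = list(range(1, 11))            # 1, 2, ..., 10
targets = list(range(10, 110, 10))   # 10, 20, ..., 100

samples, window_targets = make_windows(data, targets, length=4)
for sample, target in zip(samples, window_targets):
    print(sample, "->", target)
```

The first pair printed is `[1, 2, 3, 4] -> 50` and the last is `[6, 7, 8, 9] -> 100`, matching the batch shown above.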

Change the batch size

The TimeseriesGenerator also has a batch_size keyword argument, which defaults to 128. That is why the six samples above were all returned in a single batch. Let's create a generator with a batch size of 2.

In [8]:
# Create a TimeseriesGenerator object with length 3 and batch size 2

timeseries_gen = TimeseriesGenerator(dummy_data, dummy_targets, length=3, batch_size=2)

We can use the iter function to get an iterator over our TimeseriesGenerator object:

In [9]:
# Make the time series generator iterable

timeseries_iterator = iter(timeseries_gen)

Let's now generate some values using the iterator. Run the following cell repeatedly until the iterator raises StopIteration. You will see that 2 samples are generated at a time until no more sample/target pairs can be formed:

In [10]:
# Iterate through the dataset examples

next(timeseries_iterator)
Out[10]:
(array([[1, 2, 3],
        [2, 3, 4]]), array([40, 50]))

You may have noticed that since there were 7 input/target sequences to generate in total, and our batch size was set to 2, the last batch consisted of just one input/target pair. Keep the possible difference in the size of the last batch in mind if you are using TimeseriesGenerator for applications that require a constant batch size.
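
The batch sizes can be worked out by hand. The sketch below assumes a stride of 1, so that a series of n time steps yields n - length samples. The helper `batch_sizes` is a hypothetical name used for illustration only.

```python
import math

# Illustrative sketch only: `batch_sizes` is a hypothetical helper
# and assumes a stride of 1.

def batch_sizes(n_timesteps, length, batch_size):
    """Return the number of samples in each batch."""
    n_samples = n_timesteps - length               # one sample per valid target
    n_batches = math.ceil(n_samples / batch_size)  # the last batch may be partial
    sizes = []
    for b in range(n_batches):
        remaining = n_samples - b * batch_size
        sizes.append(min(batch_size, remaining))
    return sizes


sizes = batch_sizes(n_timesteps=10, length=3, batch_size=2)
print(sizes)  # [2, 2, 2, 1]
```

If a model needs a constant batch size, one option is to choose a batch size that divides the number of samples exactly, or to discard the final partial batch.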

Change the stride

The interval between consecutive samples can be adjusted using the stride keyword argument.

In [11]:
# Create a TimeseriesGenerator object with a stride of 2

timeseries_gen = TimeseriesGenerator(dummy_data, dummy_targets, length=3, stride=2, batch_size=1)

Above we have specified the length as 3 and the stride as 2. This means that the first sample will be (1, 2, 3), used to predict the target 40. Each subsequent sample starts 2 timesteps after the previous one since the stride is 2, meaning the next sample will be (3, 4, 5) to predict 60, and the one after that will be (5, 6, 7) to predict 80.

Had we specified the stride to be the same as the length, then we would have samples which would not overlap, i.e. the first sequence would be the same, but the second sample would be (4, 5, 6) to predict 70, and so on.

In [12]:
# Make the time series generator iterable

timeseries_iterator = iter(timeseries_gen)
In [13]:
# Iterate through the dataset examples

while True:
    try:
        print(next(timeseries_iterator))
    except StopIteration:
        break
(array([[1, 2, 3]]), array([40]))
(array([[3, 4, 5]]), array([60]))
(array([[5, 6, 7]]), array([80]))
(array([[7, 8, 9]]), array([100]))
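
The non-overlapping case mentioned earlier (stride equal to the length) can be checked with a small plain-Python sketch. `strided_windows` is a hypothetical helper written for this illustration. It is not the Keras implementation.

```python
# Illustrative sketch only: `strided_windows` is a hypothetical helper.

def strided_windows(data, targets, length, stride):
    """Build (window, target) pairs, advancing `stride` time steps
    between consecutive windows."""
    pairs = []
    for i in range(length, len(data), stride):
        pairs.append((list(data[i - length:i]), targets[i]))
    return pairs


data = list(range(1, 11))
targets = list(range(10, 110, 10))

overlapping = strided_windows(data, targets, length=3, stride=2)
non_overlapping = strided_windows(data, targets, length=3, stride=3)

print(overlapping)      # windows share one time step with their neighbour
print(non_overlapping)  # windows share no time steps
```

With the stride set to 3, the windows are (1, 2, 3), (4, 5, 6) and (7, 8, 9), with targets 40, 70 and 100.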

Reverse the time series

The reverse keyword argument reverses the order of the timesteps within each sample. However, it does not change anything about the targets. That is, although the timesteps themselves are reversed, the target is still the timestep immediately following the end of the original (unreversed) sample sequence.

In [14]:
# Create a reversed TimeseriesGenerator object

timeseries_gen = TimeseriesGenerator(dummy_data, dummy_targets, length=3, stride=1, batch_size=1, reverse=True)
timeseries_iterator = iter(timeseries_gen)

Let's generate a few samples and targets to see this in action:

In [15]:
# Iterate through the dataset examples

while True:
    try:
        print(next(timeseries_iterator))
    except StopIteration:
        break
(array([[3, 2, 1]]), array([40]))
(array([[4, 3, 2]]), array([50]))
(array([[5, 4, 3]]), array([60]))
(array([[6, 5, 4]]), array([70]))
(array([[7, 6, 5]]), array([80]))
(array([[8, 7, 6]]), array([90]))
(array([[9, 8, 7]]), array([100]))
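
The same behaviour can be sketched in plain Python: each window is flipped, while the target index is left untouched. `reversed_windows` is a hypothetical helper used only for illustration.

```python
# Illustrative sketch only: `reversed_windows` is a hypothetical helper.

def reversed_windows(data, targets, length):
    """Build (window, target) pairs with the time steps of each window
    reversed. The target is still the step after the original window."""
    pairs = []
    for i in range(length, len(data)):
        window = list(data[i - length:i])
        pairs.append((window[::-1], targets[i]))  # flip the window only
    return pairs


data = list(range(1, 11))
targets = list(range(10, 110, 10))

pairs = reversed_windows(data, targets, length=3)
print(pairs[0])   # ([3, 2, 1], 40)
print(pairs[-1])  # ([9, 8, 7], 100)
```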

Preprocess the DSD100 dataset

Load the audio files

We first need to load a track as an array that can be passed to the TimeseriesGenerator. To do this we can use the scipy package.

In [16]:
from scipy.io.wavfile import read, write

rate, song = read("data/055 - Angels In Amplifiers - I'm Alright/mixture.wav")
print("rate:", rate)
song = np.array(song)
print("song.shape:", song.shape)
rate: 44100
song.shape: (1942186, 2)

The song is a stereo signal, which is why the second dimension here is equal to 2. The rate is the sample rate of the audio file. This number sets the speed at which the file should be played, so this audio runs at 44,100 samples per second (Hz).

Create a generator for the audio time series

Now that we know how to load the audio files, we can experiment with using them with the TimeseriesGenerator.

We'll specify the length of the sequences as 200,000, and we'll also set the stride to be 200,000. Remember that the audio runs at 44.1 kHz, so 200,000 samples correspond to about 4.5 seconds of audio.

Setting the length equal to the stride means that the different sequences will have no overlap. We will specify the batch size as 1 such that only one sample is generated at a time.
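
As a quick sanity check on these numbers, the arithmetic can be written out directly. The sketch below uses the sample rate and array length printed earlier. Counting one sample per target position is an assumption about how the generator enumerates windows.

```python
rate = 44100            # samples per second, as printed when loading the file
n_timesteps = 1942186   # first dimension of song.shape
length = 200000
stride = 200000

track_seconds = n_timesteps / rate   # duration of the whole track
clip_seconds = length / rate         # duration of one generated sample

# Assumed enumeration: one window per target index
# length, length + stride, ... strictly below n_timesteps.
n_samples = len(range(length, n_timesteps, stride))

print("track: {:.2f} s".format(track_seconds))
print("clip:  {:.2f} s".format(clip_seconds))
print("number of clips:", n_samples)
```

Under these assumptions the track lasts roughly 44 seconds and yields 9 non-overlapping clips of about 4.5 seconds each.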

In [17]:
# Create a time series generator for the audio file

timeseries_gen = TimeseriesGenerator(song, targets=song, length=200000, stride=200000, batch_size=1)
timeseries_iterator = iter(timeseries_gen)

Running the cell below will generate 3 sequential samples from the iterator. If you wish to generate another 3 samples, you can run the cell again.

When run for the first time, the first sequence encompasses timesteps 1 - 200,000, the second encompasses timesteps 200,001 - 400,000, and the third 400,001 - 600,000.

For each sequence, the target is the timestep immediately after the end of the sequence, and in this case (since we have the length equal to the stride) it will also be the first timestep of the following sample.

We will generate samples and write them as wav files. You can see in each case that the generated sample is a sequential chunk from the original audio. If played one after another, they'll form a continuous section of the track.

In [18]:
# Get three samples from the audio time series generator

for i in range(3):
    sample, target = next(timeseries_iterator)
    write('example.wav', rate, sample[0])
    print('Sample {}'.format(i+1))
    ipd.display(ipd.Audio("example.wav"))
Sample 1
Sample 2
Sample 3

Change the stride

In [19]:
# Create a TimeseriesGenerator object with the stride equal to half the length

timeseries_gen = TimeseriesGenerator(song, targets=song, length=200000, stride=100000, batch_size=1)
timeseries_iterator = iter(timeseries_gen)

With the stride equal to half of the length, the samples now overlap. Each subsequent sample starts halfway through the previous sample.
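
The overlap can be made explicit by listing where each window starts and ends. `window_bounds` is a hypothetical helper, and the indices are zero-based with an exclusive end.

```python
# Illustrative sketch only: `window_bounds` is a hypothetical helper.

def window_bounds(length, stride, n_windows):
    """Return (start, end) indices of the first `n_windows` windows,
    zero-based, with `end` exclusive."""
    return [(k * stride, k * stride + length) for k in range(n_windows)]


bounds = window_bounds(length=200000, stride=100000, n_windows=3)

overlaps = []
for (start, end), (next_start, _) in zip(bounds, bounds[1:]):
    overlaps.append(max(0, end - next_start))  # shared time steps

print(bounds)    # [(0, 200000), (100000, 300000), (200000, 400000)]
print(overlaps)  # [100000, 100000]
```

Each clip therefore shares its second half with the first half of the next clip.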

In [20]:
# Get three samples from the audio time series generator

for i in range(3):
    sample, target = next(timeseries_iterator)
    write('example.wav', rate, sample[0])
    print('Sample {}'.format(i+1))
    ipd.display(ipd.Audio("example.wav"))
Sample 1
Sample 2
Sample 3

Change the sampling rate

The sampling_rate keyword argument downsamples each generated sample. Increasing the sampling_rate from 1 to 2 will result in the interval between subsequent timesteps within any one sample being increased to 2. What this means is that only every other timestep in a given sample will be included.

Note that the length argument in this case refers to the length before the sampling_rate is applied.
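
On the small synthetic series, the effect looks as follows. This is a plain-Python sketch with a hypothetical helper, `downsampled_windows`, which assumes that the window is sliced first and then subsampled.

```python
# Illustrative sketch only: `downsampled_windows` is a hypothetical helper.

def downsampled_windows(data, targets, length, sampling_rate):
    """Build (window, target) pairs, keeping every `sampling_rate`-th
    time step of each window of `length` steps."""
    pairs = []
    for i in range(length, len(data)):
        window = list(data[i - length:i:sampling_rate])
        pairs.append((window, targets[i]))
    return pairs


data = list(range(1, 11))
targets = list(range(10, 110, 10))

pairs = downsampled_windows(data, targets, length=4, sampling_rate=2)
print(pairs[0])   # ([1, 3], 50)
print(pairs[-1])  # ([6, 8], 100)

# A window of 200,000 time steps keeps 100,000 of them at sampling_rate=2
kept = len(range(0, 200000, 2))
print(kept)       # 100000
```

The window still spans 4 time steps of the original series, but only 2 values survive. The target remains the step after the full window.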

In [25]:
# Create a TimeseriesGenerator object with sampling_rate set to 2

timeseries_gen = TimeseriesGenerator(song, song, length=200000, stride=200000, batch_size=1, sampling_rate=2)
timeseries_iterator = iter(timeseries_gen)

In the following, we will write the files with the original rate of 44,100Hz.

In [26]:
# Get three samples from the audio time series generator

for i in range(3):
    sample, target = next(timeseries_iterator)
    write('example.wav', rate, sample[0])
    print('Sample {}'.format(i+1))
    ipd.display(ipd.Audio("example.wav"))
Sample 1
Sample 2
Sample 3

The above clips each contain only 100,000 samples due to the sampling_rate being set to 2. This results in the audio files sounding twice as fast.

However, we could also adjust the rate at which we write the wav files to compensate for the downsampling. This results in clips that sound similar to the originals, but at a reduced quality.

If you are interested in audio signal processing, you may have noticed that since we did not filter out high frequency components before downsampling, we may have introduced aliasing.
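
Both effects can be estimated with a little arithmetic. The sketch below compares the playback duration at the two write rates, and uses the standard frequency-folding formula to show where a pure tone would land after downsampling without a filter. The helper `alias_frequency` and the example tone frequencies are assumptions chosen for illustration.

```python
rate = 44100           # original sample rate in Hz
clip_samples = 100000  # samples per clip after sampling_rate=2

fast_seconds = clip_samples / rate                # written at the original rate
compensated_seconds = clip_samples / (rate // 2)  # written at half the rate


# Illustrative sketch only: `alias_frequency` is a hypothetical helper.
def alias_frequency(f_hz, sample_rate_hz):
    """Apparent frequency of a pure tone at f_hz once it is sampled at
    sample_rate_hz with no anti-aliasing filter."""
    return abs(f_hz - sample_rate_hz * round(f_hz / sample_rate_hz))


print("{:.2f} s at {} Hz".format(fast_seconds, rate))
print("{:.2f} s at {} Hz".format(compensated_seconds, rate // 2))
print(alias_frequency(5000, rate // 2))   # below the new Nyquist limit: unchanged
print(alias_frequency(15000, rate // 2))  # above it: folds down to 7050 Hz
```

After downsampling, the effective sample rate is 22,050 Hz, so content above 11,025 Hz can no longer be represented and folds back into the audible band.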

In [28]:
# Write the wav files at an adjusted sample rate

for i in range(3):
    sample, target = next(timeseries_iterator)
    write('example.wav', rate//2, sample[0])
    print('Sample {}'.format(i+1))
    ipd.display(ipd.Audio("example.wav"))
Sample 1
Sample 2
Sample 3

Change the start index

The TimeseriesGenerator also has start_index and end_index keyword arguments, to specify which portion of our data we want to use to generate samples. This can be useful in the case that we want to reserve part of our data for validation.

Here we specify the start_index as 400,000 which is double the length, and will effectively skip the first 2 samples that would have otherwise been generated.
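
A train/validation split on the small synthetic series could look like the sketch below. The helper `windows_in_range` is hypothetical. It assumes that start_index marks the first time step a window may use and that end_index is the last (inclusive) position a target may take.

```python
# Illustrative sketch only: `windows_in_range` is a hypothetical helper and
# the index conventions are assumptions, as described above.

def windows_in_range(data, targets, length, start_index=0, end_index=None):
    """Build (window, target) pairs using only time steps between
    start_index and end_index (inclusive)."""
    if end_index is None:
        end_index = len(data) - 1
    pairs = []
    for i in range(start_index + length, end_index + 1):
        pairs.append((list(data[i - length:i]), targets[i]))
    return pairs


data = list(range(1, 11))
targets = list(range(10, 110, 10))

train = windows_in_range(data, targets, length=3, end_index=6)
validation = windows_in_range(data, targets, length=3, start_index=4)

print(train)       # targets 40 to 70
print(validation)  # targets 80 to 100
```

In this split the target sets are disjoint, although the first validation windows reuse a few input time steps that also appear in training windows. To avoid that, start_index can be moved past the training end_index.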

In [30]:
# Create a TimeseriesGenerator object with start index set to 400,000

timeseries_gen = TimeseriesGenerator(song, song, length=200000, stride=200000, batch_size=1, start_index=400000)
timeseries_iterator = iter(timeseries_gen)
In [31]:
# Get three samples from the audio time series generator

for i in range(3):
    sample, target = next(timeseries_iterator)
    write('example.wav', rate, sample[0])
    print('Sample {}'.format(i+1))
    ipd.display(ipd.Audio("example.wav"))
Sample 1
Sample 2
Sample 3

Shuffle the samples

Setting the keyword argument shuffle to True yields the generated samples in a random order rather than chronologically.
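
The idea of shuffling at the sample level can be sketched in plain Python: the order of the windows changes, but each window stays paired with its own target. `shuffled_windows` and its `seed` parameter are hypothetical, and the sampling strategy Keras uses internally may differ from this simple permutation.

```python
import random

# Illustrative sketch only: `shuffled_windows` is a hypothetical helper and
# is not how Keras necessarily draws its shuffled samples.

def shuffled_windows(data, targets, length, seed=None):
    """Build (window, target) pairs and return them in a random order."""
    pairs = [(list(data[i - length:i]), targets[i])
             for i in range(length, len(data))]
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(pairs)         # reorder whole pairs, never individual steps
    return pairs


data = list(range(1, 11))
targets = list(range(10, 110, 10))

pairs = shuffled_windows(data, targets, length=3, seed=0)
for window, target in pairs:
    print(window, "->", target)
```

The time steps inside each window remain in chronological order. Only the order in which the windows are produced changes.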

In [32]:
# Create a shuffled TimeseriesGenerator object

timeseries_gen = TimeseriesGenerator(song, song, length=200000, stride=200000, batch_size=1, shuffle=True)
timeseries_iterator = iter(timeseries_gen)
In [33]:
# Get three samples from the audio time series generator

for i in range(3):
    sample, target = next(timeseries_iterator)
    write('example.wav', rate, sample[0])
    print('Sample {}'.format(i+1))
    ipd.display(ipd.Audio("example.wav"))
Sample 1
Sample 2
Sample 3

Reverse the audio

As a final bit of fun, let's reverse some of the samples to see what they sound like backwards! Remember that the keyword argument reverse reverses the order of the timesteps within each sample.

In [34]:
# Create a reversed TimeseriesGenerator object

timeseries_gen = TimeseriesGenerator(song, song, length=200000, stride=200000, batch_size=1, reverse=True)
timeseries_iterator = iter(timeseries_gen)

Play the samples below to hear some slightly demonic-sounding music:

In [35]:
# Get three samples from the audio time series generator

for i in range(3):
    sample, target = next(timeseries_iterator)
    write('example.wav', rate, sample[0])
    print('Sample {}'.format(i+1))
    ipd.display(ipd.Audio("example.wav"))
Sample 1
Sample 2
Sample 3