Creating Datasets from different sources¶
In this reading notebook, we will explore a few of the ways in which we can load data into a tf.data.Dataset object.
import tensorflow as tf
print(tf.__version__)
The from_tensor_slices and from_tensors methods¶
We will start by looking at the from_tensor_slices and the from_tensors methods.

Both static methods are used to create datasets from Tensors or Tensor-like objects, such as numpy arrays or python lists. We can also pass in tuples and dicts of arrays or lists. The main distinction between the from_tensor_slices function and the from_tensors function is that the from_tensor_slices method will interpret the first dimension of the input data as the number of elements in the dataset, whereas the from_tensors method always results in a Dataset with a single element, containing the Tensor or tuple of Tensors passed.
# Create a random tensor with shape (3, 2)
example_tensor = tf.random.uniform([3,2])
print(example_tensor.shape)
# Create two Datasets, using each static method
dataset1 = tf.data.Dataset.from_tensor_slices(example_tensor)
dataset2 = tf.data.Dataset.from_tensors(example_tensor)
# Print the element_spec for each
print(dataset1.element_spec)
print(dataset2.element_spec)
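To see the difference concretely, we can also iterate over both Datasets and print the shape of each element:
# dataset1 yields 3 elements of shape (2,); dataset2 yields one element of shape (3, 2)
for element in dataset1:
    print('dataset1 element shape:', element.shape)
for element in dataset2:
    print('dataset2 element shape:', element.shape)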
As seen above, creating the Dataset using the from_tensor_slices method slices the given array or Tensor along the first dimension to produce a set of elements for the Dataset.

This means that although we could pass any Tensor - or tuple of Tensors - to the from_tensors method, the same cannot be said of the from_tensor_slices method, which has the additional requirement that each Tensor in the tuple has the same size in the first dimension.
# Create three Tensors with different shapes
tensor1 = tf.random.uniform([10,2,2])
tensor2 = tf.random.uniform([10,1])
tensor3 = tf.random.uniform([9,2,2])
We cannot create a Dataset using the from_tensor_slices method from the tuple of tensor1 and tensor3, since they do not have the same size in the first dimension:
# Try to create a Dataset from tensor1 and tensor3 using from_tensor_slices - this will raise an error
dataset = tf.data.Dataset.from_tensor_slices((tensor1, tensor3))
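If you would like the notebook to continue past this cell, one option is to catch the exception instead. A minimal sketch (the exact exception type can vary between TensorFlow versions, so we catch both likely candidates):
# Catch the mismatched-shapes error so that execution can continue
try:
    tf.data.Dataset.from_tensor_slices((tensor1, tensor3))
except (ValueError, tf.errors.InvalidArgumentError) as error:
    print('from_tensor_slices failed:', error)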
However, we can of course create a Dataset from this tuple using the from_tensors method, which interprets the tuple as a single element.
# Create a Dataset from tensor1 and tensor3 using from_tensors
dataset = tf.data.Dataset.from_tensors((tensor1, tensor3))
dataset.element_spec
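We can also confirm that this Dataset contains exactly one element. A quick check, using tf.data.experimental.cardinality (recent TensorFlow versions also expose this as dataset.cardinality()):
# Confirm the from_tensors Dataset holds a single element
print(tf.data.experimental.cardinality(dataset).numpy())  # expected output: 1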
Although tensor1 and tensor2 do not have the same shape, or even the same rank (number of dimensions), we can still use the from_tensor_slices method to form a dataset from a tuple of these tensors, since they have the same size in the first dimension.
# Create a Dataset from tensor1 and tensor2
dataset = tf.data.Dataset.from_tensor_slices((tensor1, tensor2))
dataset.element_spec
In the above, the first dimension was interpreted as the number of elements in the Dataset, as expected.
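We can verify this by counting the elements, and by checking the component shapes of a single element; a short sketch:
# Count the elements, and inspect the component shapes of the first one
print(sum(1 for _ in dataset))  # expected output: 10
first_element = next(iter(dataset))
print(first_element[0].shape, first_element[1].shape)  # (2, 2) and (1,)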
Creating Datasets from numpy arrays¶
We can also use the from_tensor_slices and from_tensors methods to create Datasets from numpy arrays. In fact, behind the scenes, the numpy array is converted to a set of tf.constant operations to populate the Tensor in the TensorFlow graph.
# Create a numpy array
import numpy as np
numpy_array = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
print(numpy_array.shape)
# Create two Datasets, using each static method
dataset1 = tf.data.Dataset.from_tensor_slices(numpy_array)
dataset2 = tf.data.Dataset.from_tensors(numpy_array)
print(dataset1.element_spec)
print(dataset2.element_spec)
As before, from_tensors interprets the entire array as a single element, whereas from_tensor_slices slices the array along the first dimension to form the elements.
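We can confirm this by iterating over dataset1. Note that the elements come back as Tensors carrying the numpy array's dtype (int64 here, assuming numpy's default integer type on your platform):
# Each element is a (2, 2) Tensor with the numpy array's dtype
for element in dataset1:
    print(element.dtype, element.shape)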
Creating Datasets from pandas DataFrames¶
A pandas DataFrame can be easily converted to a Dataset using the from_tensor_slices method.
The Balloons dataset¶
A pandas DataFrame can be loaded from a CSV file. We will use the Balloons dataset to demonstrate. This dataset is stored in a CSV file, and contains a list of attributes describing instances of a balloon inflation experiment, such as the colour and size of the balloon, the age of the person who performed the attempted inflation, and the way in which they did it. Finally, there is the target column "Inflated", which is either T for True, or F for False, indicating whether or not the person managed to inflate the balloon.
# Load the CSV file into a DataFrame
import pandas as pd
pandas_dataframe = pd.read_csv('data/balloon_dataset.csv')
# Inspect the data
pandas_dataframe.head()
To convert the DataFrame to a Dataset, we first convert the DataFrame to a dictionary. By doing this, we preserve the column names as the dictionary labels.
Note: A Dataset can be formed from either a tuple or a dict of Tensors. We saw above a number of Datasets being formed from a tuple. The only distinction for a Dataset formed from a dict is that the Dataset items will be dicts accessed by key, rather than tuples accessed by index.
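As a minimal illustration of the dict case, using made-up values rather than the balloon data:
# A toy dict-based Dataset: each element is itself a dict with keys 'a' and 'b'
toy_dataset = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [3, 4]})
print(next(iter(toy_dataset)))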
# Convert the DataFrame to a dict
dataframe_dict = dict(pandas_dataframe)
print(dataframe_dict.keys())
We can now run the from_tensor_slices method on this dict and print the resulting Dataset element_spec, as well as an example element. Note that since we formed the Dataset from a dict, we see the column (dictionary) names in the element_spec.
# Create the Dataset
pandas_dataset = tf.data.Dataset.from_tensor_slices(dataframe_dict)
# View the Dataset element_spec
pandas_dataset.element_spec
# Take the first element of the Dataset
next(iter(pandas_dataset))
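Since each element is a dict, we can also loop over its keys and values to inspect a single example more readably:
# Print each column name alongside its value for the first element
element = next(iter(pandas_dataset))
for column_name, value in element.items():
    print(column_name, ':', value.numpy())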
Creating Datasets directly from CSV Files¶
The TensorFlow experimental library contains a variety of functions and classes contributed by the community that may not be ready for release into the main TensorFlow library in their immediate form, but which may be included in TensorFlow in the future. One such useful experimental function is the tf.data.experimental.make_csv_dataset function. This allows us to read CSV data from the disk directly into a Dataset object.

We will run the function on the example CSV file from disk, and specify the batch size and the name of the target column, which is used to structure the Dataset into an (input, target) tuple.

Note: Because of the ephemeral nature of the experimental package, you may well get warnings printed in the console when using a function or class contained in the package for the first time.
# Create the Dataset from the CSV file
csv_dataset = tf.data.experimental.make_csv_dataset('data/balloon_dataset.csv',
                                                    batch_size=1,
                                                    label_name='Inflated')
To check that we've loaded our Dataset correctly, let's print the element_spec:
# View the Dataset element_spec
csv_dataset.element_spec
# Take the first batch from the Dataset
next(iter(csv_dataset))
Note that in the above Dataset, the target column Inflated does not have a key, since it is uniquely accessible as the second element of the tuple, whereas the attributes, which reside as a dictionary of Tensors in the first element, retain their labels so that we can distinguish them.
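To make this structure explicit, we can unpack a single batch into its input and target parts; a short sketch:
# Unpack one batch: 'inputs' is a dict of Tensors keyed by column name,
# and 'target' holds the corresponding 'Inflated' labels
inputs, target = next(iter(csv_dataset))
print(list(inputs.keys()))
print(target)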