Creating Datasets from different sources

In this reading notebook, we will explore a few of the ways in which we can load data into a tf.data.Dataset object.

In [1]:
import tensorflow as tf
print(tf.__version__)
2.0.0

The from_tensor_slices and from_tensors methods

We will start by looking at the from_tensor_slices and the from_tensors methods.

Both static methods are used to create datasets from Tensors or Tensor-like objects, such as numpy arrays or python lists. We can also pass in tuples and dicts of arrays or lists. The main distinction between the two methods is that from_tensor_slices interprets the first dimension of the input data as the number of elements in the dataset, whereas from_tensors always results in a Dataset with a single element, containing the whole Tensor or tuple of Tensors passed in.
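
For instance, here is a minimal sketch (not one of the original cells) using a plain Python list, with variable names of our own:

# Sketch: pass the same nested list to both static methods
nested_list = [[1, 2], [3, 4], [5, 6]]

sliced_dataset = tf.data.Dataset.from_tensor_slices(nested_list)
whole_dataset = tf.data.Dataset.from_tensors(nested_list)

print(sliced_dataset.element_spec)  # TensorSpec(shape=(2,), dtype=tf.int32, name=None) - 3 elements
print(whole_dataset.element_spec)   # TensorSpec(shape=(3, 2), dtype=tf.int32, name=None) - 1 element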

In [2]:
# Create a random tensor with shape (3, 2)

example_tensor = tf.random.uniform([3,2])
print(example_tensor.shape)
(3, 2)
In [3]:
# Create two Datasets, using each static method

dataset1 = tf.data.Dataset.from_tensor_slices(example_tensor)
dataset2 = tf.data.Dataset.from_tensors(example_tensor)
In [4]:
# Print the element_spec for each

print(dataset1.element_spec)
print(dataset2.element_spec)
TensorSpec(shape=(2,), dtype=tf.float32, name=None)
TensorSpec(shape=(3, 2), dtype=tf.float32, name=None)

As seen above, creating the Dataset using the from_tensor_slices method slices the given array or Tensor along the first dimension to produce a set of elements for the Dataset.

This means that although we could pass any Tensor - or tuple of Tensors - to the from_tensors method, the same cannot be said of the from_tensor_slices method, which has the additional requirement that each Tensor passed has the same size in the first dimension.

In [5]:
# Create three Tensors with different shapes

tensor1 = tf.random.uniform([10,2,2])
tensor2 = tf.random.uniform([10,1])
tensor3 = tf.random.uniform([9,2,2])

We cannot create a Dataset using the from_tensor_slices method from a tuple of tensor1 and tensor3, since they do not have the same size in the first dimension:

In [6]:
# Try to create a Dataset from tensor1 and tensor3 using from_tensor_slices - this will raise an error

dataset = tf.data.Dataset.from_tensor_slices((tensor1, tensor3))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-8d158a73c9fd> in <module>
      1 # Try to create a Dataset from tensor1 and tensor3 using from_tensor_slices - this will raise an error
      2 
----> 3 dataset = tf.data.Dataset.from_tensor_slices((tensor1, tensor3))

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py in from_tensor_slices(tensors)
    433       Dataset: A `Dataset`.
    434     """
--> 435     return TensorSliceDataset(tensors)
    436 
    437   class _GeneratorState(object):

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py in __init__(self, element)
   2362     for t in self._tensors[1:]:
   2363       batch_dim.assert_is_compatible_with(tensor_shape.Dimension(
-> 2364           tensor_shape.dimension_value(t.get_shape()[0])))
   2365 
   2366     variant_tensor = gen_dataset_ops.tensor_slice_dataset(

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_shape.py in assert_is_compatible_with(self, other)
    273     if not self.is_compatible_with(other):
    274       raise ValueError("Dimensions %s and %s are not compatible" %
--> 275                        (self, other))
    276 
    277   def merge_with(self, other):

ValueError: Dimensions 10 and 9 are not compatible

However, we can of course create a Dataset from this tuple using the from_tensors method, which interprets the tuple as a single element.

In [ ]:
# Create a Dataset from tensor1 and tensor3 using from_tensors

dataset = tf.data.Dataset.from_tensors((tensor1, tensor3))
dataset.element_spec
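
Although this cell was left unexecuted in this copy of the notebook, the element_spec should be a tuple of two TensorSpecs, with shapes (10, 2, 2) and (9, 2, 2) respectively, confirming that the whole tuple is treated as a single element.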

Although tensor1 and tensor2 do not have the same shape, or even the same rank (number of dimensions), we can still use the from_tensor_slices method to form a Dataset from a tuple of these tensors, since they have the same size in the first dimension.

In [ ]:
# Create a Dataset from tensor1 and tensor2

dataset = tf.data.Dataset.from_tensor_slices((tensor1, tensor2))
dataset.element_spec

In the above, the first dimension was interpreted as the number of elements in the Dataset, as expected.
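
To make this concrete, the following sketch (not one of the original cells) inspects the elements of the (tensor1, tensor2) Dataset:

# Sketch: unpack the first element and count the elements
element1, element2 = next(iter(dataset))

print(element1.shape)       # (2, 2) - one slice of tensor1
print(element2.shape)       # (1,)   - one slice of tensor2
print(len(list(dataset)))   # 10, matching the size of the first dimension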

Creating Datasets from numpy arrays

We can also use the from_tensor_slices and from_tensors methods to create Datasets from numpy arrays. In fact, behind the scenes, the numpy array is converted to a set of tf.constant operations to populate the Tensor in the TensorFlow graph.

In [ ]:
# Create a numpy array dataset

import numpy as np

numpy_array = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
print(numpy_array.shape)
In [ ]:
# Create two Datasets, using each static method

dataset1 = tf.data.Dataset.from_tensor_slices(numpy_array)
dataset2 = tf.data.Dataset.from_tensors(numpy_array)

print(dataset1.element_spec)
print(dataset2.element_spec)

As before, from_tensors interprets the entire array as a single element, whereas from_tensor_slices slices the array along the first dimension to form the elements.
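
Since the cells above were left unexecuted here, the following sketch shows what to expect, assuming numpy's default integer type is 64-bit (as on most Linux systems):

# Expected element_spec values under the 64-bit integer assumption:
#   dataset1: TensorSpec(shape=(2, 2), dtype=tf.int64, name=None)    - 3 such elements
#   dataset2: TensorSpec(shape=(3, 2, 2), dtype=tf.int64, name=None) - a single element

first_element = next(iter(dataset1))
print(first_element.numpy())  # [[1 2]
                              #  [3 4]]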

Creating Datasets from pandas DataFrames

A pandas DataFrame can be easily converted to a Dataset using the from_tensor_slices method.

The Balloons dataset

A pandas DataFrame can be loaded from a CSV file. We will use the Balloons dataset to demonstrate. This dataset is stored in a CSV file, and contains a list of attributes describing instances of a balloon inflation experiment, such as the colour and size of the balloon, the age of the person who performed the attempted inflation, and the way in which they did it. Finally, there is the target column "Inflated", which is either T for True, or F for False, indicating whether or not the person managed to inflate the balloon.

In [7]:
# Load the CSV file into a Dataframe

import pandas as pd

pandas_dataframe = pd.read_csv('data/balloon_dataset.csv')
In [8]:
# Inspect the data

pandas_dataframe.head()
Out[8]:
   Colour   Size   Act      Age    Inflated
0  YELLOW   SMALL  STRETCH  ADULT  T
1  YELLOW   SMALL  STRETCH  ADULT  T
2  YELLOW   SMALL  STRETCH  CHILD  F
3  YELLOW   SMALL  DIP      ADULT  F
4  YELLOW   SMALL  DIP      CHILD  F

To convert the DataFrame to a Dataset, we first convert the DataFrame to a dictionary. By doing this, we preserve the column names as the dictionary labels.

Note: A Dataset can be formed from either a tuple or a dict of Tensors. We saw above a number of Datasets being formed from a tuple. The only distinction for a Dataset formed from a dict is that the Dataset items will be dicts accessed by key, rather than tuples accessed by index.
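
As a minimal sketch of this distinction (using made-up data rather than the Balloons columns):

# Sketch: a Dataset formed from a dict yields dict elements, accessed by key
toy_dataset = tf.data.Dataset.from_tensor_slices({'x': [1, 2, 3], 'y': [4, 5, 6]})
toy_element = next(iter(toy_dataset))
print(toy_element['x'].numpy(), toy_element['y'].numpy())  # 1 4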

In [9]:
# Convert the DataFrame to a dict

dataframe_dict = dict(pandas_dataframe)
print(dataframe_dict.keys())
dict_keys(['Colour', 'Size', 'Act', 'Age', 'Inflated'])

We can now run the from_tensor_slices method on this dict and print the resulting Dataset element_spec, as well as an example element. Note that since we formed the Dataset from a dict, we see the column (dictionary) names in the element_spec.

In [10]:
# Create the Dataset

pandas_dataset = tf.data.Dataset.from_tensor_slices(dataframe_dict)
In [11]:
# View the Dataset element_spec

pandas_dataset.element_spec
Out[11]:
{'Colour': TensorSpec(shape=(), dtype=tf.string, name=None),
 'Size': TensorSpec(shape=(), dtype=tf.string, name=None),
 'Act': TensorSpec(shape=(), dtype=tf.string, name=None),
 'Age': TensorSpec(shape=(), dtype=tf.string, name=None),
 'Inflated': TensorSpec(shape=(), dtype=tf.string, name=None)}
In [23]:
# Iterate the Dataset

next(iter(pandas_dataset))
Out[23]:
{'Colour': <tf.Tensor: id=154, shape=(), dtype=string, numpy=b'YELLOW'>,
 'Size': <tf.Tensor: id=156, shape=(), dtype=string, numpy=b'SMALL'>,
 'Act': <tf.Tensor: id=152, shape=(), dtype=string, numpy=b'STRETCH'>,
 'Age': <tf.Tensor: id=153, shape=(), dtype=string, numpy=b'ADULT'>,
 'Inflated': <tf.Tensor: id=155, shape=(), dtype=string, numpy=b'T'>}

Creating Datasets directly from CSV Files

The TensorFlow experimental library contains a variety of functions and classes contributed by the community that may not be ready for release into the main TensorFlow library in their immediate form, but which may be included in TensorFlow in the future. One such useful experimental function is the tf.data.experimental.make_csv_dataset function. This allows us to read CSV data from the disk directly into a Dataset object.

We will run the function on the example CSV file from disk, and specify the batch size and the name of the target column, which is used to structure the Dataset into an (input, target) tuple.

Note: Because of the ephemeral nature of the experimental package, you may well get warnings printed in the console when using a function or class contained in the package for the first time.

In [25]:
# Create the Dataset from the CSV file

csv_dataset = tf.data.experimental.make_csv_dataset('data/balloon_dataset.csv',
                                                    batch_size=1,
                                                    label_name='Inflated')

To check that we've loaded our Dataset correctly, let's print the element_spec:

In [26]:
# View the Dataset element_spec

csv_dataset.element_spec
Out[26]:
(OrderedDict([('Colour', TensorSpec(shape=(1,), dtype=tf.string, name=None)),
              ('Size', TensorSpec(shape=(1,), dtype=tf.string, name=None)),
              ('Act', TensorSpec(shape=(1,), dtype=tf.string, name=None)),
              ('Age', TensorSpec(shape=(1,), dtype=tf.string, name=None))]),
 TensorSpec(shape=(1,), dtype=tf.string, name=None))
In [37]:
# Iterate the Dataset

next(iter(csv_dataset))
Out[37]:
(OrderedDict([('Colour',
               <tf.Tensor: id=382, shape=(1,), dtype=string, numpy=array([b'PURPLE'], dtype=object)>),
              ('Size',
               <tf.Tensor: id=383, shape=(1,), dtype=string, numpy=array([b'LARGE'], dtype=object)>),
              ('Act',
               <tf.Tensor: id=380, shape=(1,), dtype=string, numpy=array([b'DIP'], dtype=object)>),
              ('Age',
               <tf.Tensor: id=381, shape=(1,), dtype=string, numpy=array([b'ADULT'], dtype=object)>)]),
 <tf.Tensor: id=384, shape=(1,), dtype=string, numpy=array([b'F'], dtype=object)>)

Note that in the above Dataset, the target column Inflated does not have a key, since it is uniquely accessible as the second element of the tuple, whereas the attributes, which reside as a dictionary of Tensors in the first element, retain their labels so that we can distinguish them.
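
As a final sketch (not one of the original cells), this is one way to consume a few of these (input, target) pairs, for example when feeding a training loop. By default make_csv_dataset repeats the data indefinitely, so we use take to draw a fixed number of batches:

# Take two batches from the CSV Dataset and unpack the (input, target) tuples
for features, label in csv_dataset.take(2):
    print({name: value.numpy() for name, value in features.items()})
    print('Inflated:', label.numpy())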