TensorFlow (3): Wrapping TFRecordDataset

Aug 12, 2019

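Continuing from the previous post: now that we know how to apply random preprocessing to images, we can combine data preprocessing with TFRecord and wrap the whole thing in a form that is convenient to use.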

Using TensorFlow's tf.data.Dataset involves roughly three steps:

I. Define the dataset structure:

That is, where tf.data.Dataset reads its data from. Common sources are:

(1) tf.data.Dataset.from_tensor_slices(tensor)
    # Read data from a tensor (a NumPy array passed in is converted to a tensor automatically)

text_files = ["/path/to/textfile1.txt", "/path/to/textfile2.txt"]
(2) tf.data.TextLineDataset(text_files)
    # Read lines from multiple text files

filenames = ["/var/data/file1.csv", "/var/data/file2.csv"]
(3) tf.contrib.data.CsvDataset(filenames, record_defaults)
    # Read from CSV files

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
(4) dataset = tf.data.TFRecordDataset(filenames)
    # Read from TFRecord files

Next, define a parser. We can pass the TFRecord parser to the Dataset's map function so that every record gets decoded.

# TF 1.x API
import tensorflow as tf

# Define the parser that restores the image and label from one record
def parser(record):
    img_features = tf.parse_single_example(
        record,
        features={
            'Label': tf.FixedLenFeature([], tf.int64),
            'image_raw': tf.FixedLenFeature([], tf.string),
            'height': tf.FixedLenFeature([], tf.int64),
            'width': tf.FixedLenFeature([], tf.int64),
            'channel': tf.FixedLenFeature([], tf.int64)})
    # Stored image size; not used below because the reshape assumes 224x224x3
    height = tf.cast(img_features['height'], tf.int64)
    width = tf.cast(img_features['width'], tf.int64)
    channel = tf.cast(img_features['channel'], tf.int64)
    label = tf.cast(img_features['Label'], tf.int64)

    # Decode the raw bytes to uint8 and restore the image shape
    image = tf.decode_raw(img_features['image_raw'], tf.uint8)
    image = tf.reshape(image, [224, 224, 3])

    return image, label

# Pass in the TFRecord files and map the parser to get (image, label) pairs
train_files = ["./data/Train.tfrecords"]
dataset = tf.data.TFRecordDataset(train_files)
dataset = dataset.map(parser)
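
To see why the record stores height, width and channel next to the raw bytes, here is a minimal sketch in plain Python (no TensorFlow). Decoding raw bytes only gives back a flat run of uint8 values, so the shape has to be restored separately, and the byte count must equal height x width x channel. The helper name reshape_flat and the 2x2 RGB image are made up for illustration.

```python
def reshape_flat(raw_bytes, height, width, channel):
    """Rebuild a nested [height][width][channel] list from flat uint8 bytes."""
    values = list(raw_bytes)  # each byte becomes an int in 0..255
    if len(values) != height * width * channel:
        raise ValueError("byte count does not match height * width * channel")
    image = []
    for row in range(height):
        row_pixels = []
        for col in range(width):
            start = (row * width + col) * channel
            row_pixels.append(values[start:start + channel])
        image.append(row_pixels)
    return image


# Hypothetical 2x2 RGB image stored row by row, pixel by pixel.
raw = bytes([255, 0, 0,   0, 255, 0,
             0, 0, 255,   10, 20, 30])
image = reshape_flat(raw, height=2, width=2, channel=3)
print(image[0][1])  # second pixel of the first row: [0, 255, 0]
print(image[1][1])  # second pixel of the second row: [10, 20, 30]
```

A mismatch between the stored bytes and the target shape raises an error here, which is the same kind of failure a hard-coded reshape runs into when an image was not saved at the expected size.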

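II. Apply data preprocessing:

Next we preprocess the images. Data that needs no preprocessing (such as the test dataset) can skip this step. As before, we pass the function in through map.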

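# Define the preprocessing function
def preprocess_for_train(image):
    # resize_images returns float32 values that are still in the 0..255 range
    image_data = tf.image.resize_images(
        image, [224, 224], method=tf.image.ResizeMethod.BILINEAR)
    image_data = tf.image.random_saturation(image_data, lower=0.5, upper=1.5)
    # max_delta is added as-is to float images, so 70/255 is a small shift on a
    # 0..255 scale; scale the image to [0, 1] first if about 70 levels is intended
    image_data = tf.image.random_brightness(image_data, max_delta=70. / 255.)
    image_data = tf.image.random_contrast(image_data, 0.8, 1.2)
    image_data = tf.image.random_flip_left_right(image_data)
    return image_data

# Apply the preprocessing to each image, keeping the label unchanged
dataset = dataset.map(
    lambda image, label: (preprocess_for_train(image), label))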
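
As a rough illustration of what two of these augmentations do to the pixel values, the sketch below uses plain Python lists rather than TensorFlow. It assumes a hypothetical 1x2 RGB image already scaled to [0, 1]; the function names are made up, and the real tf.image ops work on tensors and differ in detail. It also shows that max_delta = 70/255 on a [0, 1] scale corresponds to a shift of at most 70 intensity levels on a 0..255 scale.

```python
import random


def flip_left_right(image):
    """Mirror an image stored as [row][column][channel] along its width."""
    return [list(reversed(row)) for row in image]


def shift_brightness(image, delta):
    """Add the same delta to every value, as a brightness shift does."""
    return [[[v + delta for v in pixel] for pixel in row] for row in image]


max_delta = 70.0 / 255.0
rng = random.Random(0)                      # fixed seed, for repeatability only
delta = rng.uniform(-max_delta, max_delta)  # one random delta for the whole image

image_01 = [[[0.2, 0.4, 0.6], [0.8, 0.5, 0.1]]]  # hypothetical image in [0, 1]
flipped = flip_left_right(image_01)
shifted = shift_brightness(image_01, delta)

print(flipped[0][0])           # [0.8, 0.5, 0.1]
print(round(max_delta * 255))  # 70: the largest shift, in 0..255 intensity levels
```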

III. Define epochs, shuffle, and batch

This is where Dataset really shines: its API makes these operations on the data very easy:

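EPOCHS = 10
shuffle_size = 10000
Batch_Size = 128

# shuffle_size is the shuffle buffer size: with 10000, the buffer holds 10000 records.
# Each time a record is read in, the buffer outputs one record at random.
# The larger the buffer, the more thorough the shuffle.
# `dataset` is the parsed and preprocessed Dataset from the previous two steps.
dataSet = dataset.shuffle(buffer_size=shuffle_size)

# Split into batches
dataSet = dataSet.batch(Batch_Size)

# Repeat the data once per epoch
dataSet = dataSet.repeat(EPOCHS)

# Note: because repeat comes after shuffle, each epoch is shuffled on its own
# and never mixes with records from another epoch.
# That is, [1,2,3] shuffled and then repeat(2) will not become
#   epoch1: [1,1,2]  epoch2: [2,3,3]
# but only something like
#   epoch1: [1,3,2]  epoch2: [2,3,1]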
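
The buffer behaviour described in the comments can be sketched in plain Python. This is a simplified model under stated assumptions, not TensorFlow's actual implementation: the buffer is filled first, then each incoming record pushes out one randomly chosen buffered record, and whatever is left is emitted in random order at the end. The function names are made up for illustration.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill phase: nothing is emitted yet
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]              # emit a random buffered record...
        buffer[index] = item             # ...and put the new record in its slot
    rng.shuffle(buffer)                  # input exhausted: drain what is left
    yield from buffer


def shuffle_then_repeat(items, buffer_size, epochs, rng):
    """Mimic shuffle followed by repeat: every epoch is shuffled on its own."""
    return [list(buffered_shuffle(items, buffer_size, rng)) for _ in range(epochs)]


rng = random.Random(0)
for epoch in shuffle_then_repeat([1, 2, 3], buffer_size=3, epochs=2, rng=rng):
    print(epoch)  # each epoch is some permutation of [1, 2, 3]

# With a buffer of 1 there is nothing to choose from, so the order is unchanged.
print(list(buffered_shuffle(range(5), buffer_size=1, rng=rng)))  # [0, 1, 2, 3, 4]
```

In this model a record can only be emitted once it has entered the buffer, so with a small buffer the output stays close to the input order. That is why a buffer at least as large as the dataset is needed for a uniform shuffle.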

If you are using Keras from here on, you can start training right away:

model.fit(dataSet,epochs=EPOCHS)
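
Depending on the TensorFlow version, model.fit may also ask for a steps_per_epoch argument when it is fed a Dataset. A minimal sketch of how that number could be computed, assuming a hypothetical dataset of 50000 examples (a figure not taken from this article) and the batch size of 128 used above:

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Batches needed to see every example once (the last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)


num_examples = 50000  # hypothetical dataset size, not taken from the article
batch_size = 128      # same value as Batch_Size above

steps = steps_per_epoch(num_examples, batch_size)
last_batch = num_examples - (steps - 1) * batch_size
print(steps)       # 391
print(last_batch)  # 80 examples in the final, smaller batch
```

Because batch is applied before repeat in the pipeline above, every epoch ends with its own smaller batch instead of being topped up with records from the next epoch.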

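If you are using plain TensorFlow instead, one last step is needed.

IV. Define an iterator

Plain TensorFlow is a bit more work: you have to define the iterator yourself and use get_next() to obtain the tensors. There are two main kinds:

1. make_one_shot_iterator()
Use this when the Dataset is already fully defined and does not depend on a placeholder.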

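iterator = dataSet.make_one_shot_iterator()

# Tensors that yield the next batch
training_image, training_label = iterator.get_next()

# Training (x, y, keep_prob, dropout, train_step, accuracy and loss
# are assumed to be defined together with the model elsewhere)
with tf.Session() as sess:
    for i in range(10):
        # Each run returns a different (image, label) batch
        image, label = sess.run([training_image, training_label])

        # Feed the batch into training
        _, train_acc, train_loss = sess.run(
            [train_step, accuracy, loss],
            feed_dict={x: image, y: label, keep_prob: dropout})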

2. make_initializable_iterator()
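Use this when the Dataset is not fully defined up front, for example when the input files are a placeholder. Note that iterator.initializer has to be run inside the Session before reading from it.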

# input_files is a placeholder
input_files = tf.placeholder(tf.string)
dataSet = tf.data.TFRecordDataset(input_files)
dataSet = dataSet.map(parser)

# Define the iterator
iterator = dataSet.make_initializable_iterator()
training_image, training_label = iterator.get_next()

# The iterator must be initialized, passing in the file paths
with tf.Session() as sess:
    sess.run(iterator.initializer,
             feed_dict={input_files: ["path/to/file1.tfrecord",
                                      "path/to/file2.tfrecord"]})

    for i in range(10):
        image, label = sess.run([training_image, training_label])
        _, train_acc, train_loss = sess.run(
            [train_step, accuracy, loss],
            feed_dict={x: image, y: label, keep_prob: dropout})
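
The difference between the two iterator kinds can be pictured with a plain-Python analogy. This is only a sketch: the class, the reader function and the file names are hypothetical and are not part of TensorFlow. The point is that an initializable iterator cannot be read until it has been bound to concrete inputs, and that it can be re-initialized later with different files (for example, to switch from training to test data).

```python
class InitializableIterator:
    """Plain-Python stand-in for an iterator that must be initialized before use."""

    def __init__(self, read_records):
        self._read_records = read_records  # function: filenames -> iterable of records
        self._records = None

    def initialize(self, filenames):
        """Bind the input files and rewind to the first record."""
        self._records = iter(self._read_records(filenames))

    def get_next(self):
        if self._records is None:
            raise RuntimeError("iterator has not been initialized")
        return next(self._records)  # raises StopIteration when exhausted


def fake_reader(filenames):
    """Hypothetical reader: pretends every file holds two records."""
    for name in filenames:
        yield (name, 0)
        yield (name, 1)


iterator = InitializableIterator(fake_reader)
iterator.initialize(["train_a.tfrecord"])
print(iterator.get_next())  # ('train_a.tfrecord', 0)

iterator.initialize(["test_a.tfrecord"])  # re-initialize with other files
print(iterator.get_next())  # ('test_a.tfrecord', 0)
```

In TensorFlow itself, reading past the end of the data raises tf.errors.OutOfRangeError rather than StopIteration, so a training loop that runs until the data is exhausted usually catches that error.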

Written by LUFOR129
