Stream Data Processing with Apache Kafka and TensorFlow

As one of the most popular deep learning frameworks, TensorFlow has been widely adopted in production across a broad spectrum of industries. The upcoming TensorFlow 2.0, announced recently, will be released early this year with many changes. The high-level tf.keras API, eager execution, and tf.data greatly simplify usage. Here is a good example of model training in just a few lines:

```
import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
```

However, tf.data, TensorFlow's data input pipeline component, natively covers only a small set of file formats. Users across industries often face challenges integrating TensorFlow with data sources that are uncommon in the machine learning community.
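
To make that gap concrete, here is what a native tf.data pipeline typically looks like: a TFRecordDataset plus a hand-written parsing step. The file name and feature spec below are hypothetical placeholders, just to show the shape of the API:

```
import tensorflow as tf

# tf.data natively reads only a handful of formats, e.g. TFRecord files.
dataset = tf.data.TFRecordDataset(["data.tfrecord"])  # hypothetical file

# Each element is a serialized tf.Example that still has to be parsed.
feature_spec = {
    "image": tf.io.FixedLenFeature([784], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    features = tf.io.parse_single_example(serialized, feature_spec)
    return features["image"], features["label"]

dataset = dataset.map(parse).batch(32)
```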

One example is the integration of TensorFlow with Apache Kafka. Kafka is widely used for stream processing and is supported by most big data frameworks, such as Spark and Flink. For a long time, though, there was no Kafka streaming support in TensorFlow. Conversely, TensorFlow's data formats, such as TFRecord and tf.Example, are rarely seen in the big data or data science communities.

Many users are forced to stitch the two frameworks together in a very awkward way: set up additional infrastructure, read messages from Kafka, convert the messages into TFRecord format, have TensorFlow read the TFRecords from a file system, run the training or inference, and save the models or results back to the file system. This process is error-prone and hard to maintain from an infrastructure perspective.
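
As a rough illustration of that two-step pipeline (not a recommendation), the sketch below drains a Kafka topic with the third-party kafka-python client and re-serializes the messages as TFRecords for a separate TensorFlow job to read back. The topic name, broker address, and file path are hypothetical:

```
import tensorflow as tf
from kafka import KafkaConsumer  # third-party kafka-python client

consumer = KafkaConsumer("sensor-topic",
                         bootstrap_servers="localhost:9092",
                         consumer_timeout_ms=10000)  # stop when idle

# Step 1: drain Kafka messages and re-serialize them as TFRecords on disk.
with tf.io.TFRecordWriter("/tmp/sensor.tfrecord") as writer:
    for message in consumer:
        example = tf.train.Example(features=tf.train.Features(feature={
            "raw": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[message.value])),
        }))
        writer.write(example.SerializeToString())

# Step 2: a separate TensorFlow job reads the files back for training.
dataset = tf.data.TFRecordDataset(["/tmp/sensor.tfrecord"])
```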

With help from the community, Apache Kafka streaming support for TensorFlow has recently been released as part of the tensorflow-io package (https://github.com/tensorflow/io) by TensorFlow's SIG IO, a special interest group under the TensorFlow organization focused on I/O, streaming, and file format support. In addition to Apache Kafka streaming, tensorflow-io supports a very broad range of data formats and frameworks: Apache Ignite for in-memory storage and caching, Apache Parquet and Arrow for serialization, AWS Kinesis and Google Cloud Pub/Sub for streaming, and many video, audio, and image file formats. Both Python and R can be used, which is especially convenient for the data science community.

It is worth mentioning that tensorflow-io is implemented as part of the tf.data pipeline and as a natural extension of the TensorFlow 2.0 API. In other words, users can read data from Kafka and pass it directly to tf.keras; training and inference work in the same simple, succinct way as the example at the beginning of this article.
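
As a minimal sketch of what that looks like, assuming a local broker, a hypothetical "sensor-topic" whose messages are CSV lines of four float features and an integer label, and the KafkaDataset API as published in tensorflow-io at the time of writing:

```
import tensorflow as tf
import tensorflow_io.kafka as kafka_io

# Stream messages from a Kafka topic straight into a tf.data pipeline.
dataset = kafka_io.KafkaDataset(["sensor-topic"],
                                servers="localhost:9092",
                                group="training-group",
                                eof=True)  # stop at the end of the stream

# Each element is a raw message string; decode it into features and a label.
def decode(message):
    fields = tf.io.decode_csv(message, record_defaults=[[0.0]] * 4 + [[0]])
    return tf.stack(fields[:4]), fields[4]

dataset = dataset.map(decode).batch(32)

# The same succinct tf.keras flow as before, now fed directly from Kafka.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(4,)),
    tf.keras.layers.Dense(3, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(dataset, epochs=5)
```

No intermediate TFRecord files or extra infrastructure are involved; the Kafka stream is simply another tf.data source.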

By Yong Tang
