How to Train Recurrent Neural Network (RNN) Models and Serve Them in Production with TensorFlow and Flask

Scribendi has offered high-quality online editing and proofreading services for English documents since 1997. Since our inception, technology has been in our DNA, and we believe that pursuing and implementing the latest technologies is key to delivering outstanding services to our clients. As an ISO 9001:2015-certified company, we also follow rigorous quality assurance guidelines and continuous improvement processes. Drawing on extensive quality assurance data, we conduct research in natural language processing (NLP), including text extraction, grammatical error correction (GEC), genre classification, language modeling, and predictive analytics.

We are committed to demonstrating that established small- and medium-sized enterprises can leverage the latest technologies to improve their businesses. Containerized microservices, high availability, artificial intelligence, machine learning, big data analytics, and DevOps are not reserved for venture-capital-backed start-ups, multinational enterprises, or the top tech giants.

In the first year of our research partnership with the University of Waterloo, we developed new GEC algorithms, trained models, and used a blind evaluation to validate their effectiveness. Our research team considered the entire process, taking into account recent innovations and research on training and optimizing models. Our finalized models were trained using 30 million sentences. Unfortunately, there was very little information available on how to use limited resources to serve a model trained on a large amount of data in a production environment, particularly with a large number of people using the model concurrently. Nevertheless, while the models were initially challenging to deploy, we are now using them in production to support our human editors.

We recently moved from Torch to TensorFlow to develop, train, and serve our GEC models. Although there are extensive resources for implementing and training machine learning models in TensorFlow, its examples for serving (i.e., deploying and running) models in production focus on convolutional neural networks (ConvNets), such as ResNet, and not on recurrent neural networks (RNNs).

In this tutorial, we shed light on how to train a long short-term memory (LSTM)-based model and ready it for production, working through the following steps:

  1. Training an LSTM-based image classification model
  2. Saving and evaluating the model
  3. Exporting the model
  4. Serving the model in production

To get the most out of this post, you need to be familiar with:

While we recognize that state-of-the-art image classification is currently provided by ConvNets, our objective is to show you how to serve an RNN model in production. Thus, we decided to use the Modified National Institute of Standards and Technology (MNIST) dataset with LSTM as an RNN, as this dataset is the de facto standard for image classification research and is thus well known and easily available. It is small enough that you will be able to complete this tutorial on a desktop or laptop computer in a matter of minutes. Further, we believe that once readers understand how to serve a simple model, they can extend this practice to more complex models.
Now, let’s get started with the first step.

1. Training an LSTM-based image classification model

TensorFlow makes it very easy and intuitive to train an RNN model. We will use a linear activation layer on top of the LSTM layer. To facilitate exporting, we will name the input and output of the model, both of which will be useful when feeding the data during the inferencing process. As we are primarily focused on inferencing, we will keep the training simple. If you are only interested in the code, check out my GitHub repository.

Here are the steps we will follow to train the model:

  1. Loading the necessary libraries
  2. Downloading the data required to train and test the model
  3. Setting up the training and network parameters
  4. Defining the variables
  5. Defining the model
  6. Calculating the loss and optimization
  7. Training the model

In the following sections, we will go through each step, outlining the objectives and sharing the codebase necessary to complete the training. Please note that this code was tested with Python 3.6 and TensorFlow 1.8.

A. Loading the necessary libraries 

Start by loading the necessary libraries. We will use the well-known MNIST dataset from TensorFlow.

from __future__ import print_function
import os
import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn
tf.app.flags.DEFINE_integer('model_version', 1, 'version number of the model.')
FLAGS = tf.app.flags.FLAGS
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data

B. Downloading the data required to train and test the model

Download the data to a temporary location and convert the labels into one-hot vectors. To classify images using an RNN, consider every image row as one step in a sequence. Because the MNIST image shape is 28 × 28 pixels, every sample becomes a sequence of 28 timesteps, with 28 pixel values as the input at each step.

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True) 
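The row-as-timestep idea can be illustrated without downloading anything. The snippet below is a minimal sketch that uses a made-up array in place of a real MNIST image; only the shapes match the tutorial, and the variable names `flat_image` and `sequence` are illustrative assumptions.

```python
import numpy as np

# Stand-in for one flattened MNIST image (784 values). Real data would come
# from mnist.train.next_batch; these values are purely illustrative.
flat_image = np.arange(28 * 28, dtype=np.float32)

timesteps, num_input = 28, 28

# Each row of the image becomes one timestep with 28 input values.
sequence = flat_image.reshape(timesteps, num_input)

first_step = sequence[0]    # top row of the image
last_step = sequence[-1]    # bottom row of the image
print(sequence.shape)       # (28, 28)
```

The LSTM then reads the image from top to bottom, one row per step.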

C. Setting up the training and network parameters

Use stochastic gradient descent (SGD) with a low learning rate so that the model trains steadily and does not overfit the training data.

learning_rate = 0.001
training_steps = 10000
batch_size = 128
display_step = 200
# Network Parameters
num_input = 28 # MNIST data input (img shape: 28*28)
timesteps = 28 # timesteps
num_hidden = 128 # hidden layer num of features
num_classes = 10 # MNIST total classes (0-9 digits) 
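To get a feel for how much training these settings imply, the number of passes over the data can be worked out by hand. This is a rough sketch that assumes the default split of the TensorFlow MNIST loader, which holds 55,000 training images; that figure is not stated in the tutorial itself.

```python
# Values from the parameter block above.
training_steps = 10000
batch_size = 128

# Assumption: default training split of the TensorFlow MNIST loader.
num_train_examples = 55000

examples_seen = training_steps * batch_size
epochs = examples_seen / num_train_examples

print(examples_seen)      # 1280000
print(round(epochs, 1))   # 23.3
```

In other words, 10,000 steps of 128 images correspond to roughly 23 passes over the training set under that assumption.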

D. Defining the variables

It is common in TensorFlow to use a placeholder and feed_dict to feed data into the model during training, testing, and inferencing. Name the input Input_X so that it can be looked up by name when exporting the model and during inferencing.

# tf Graph input
X = tf.placeholder("float", [None, timesteps, num_input], name='Input_X')
Y = tf.placeholder("float", [None, num_classes])
# Define weights
weights = {
    'out': tf.Variable(tf.random_normal([num_hidden, num_classes]))
}
biases = {
    'out': tf.Variable(tf.random_normal([num_classes]))
}

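E. Defining the model

To define a simple LSTM-based RNN model, prepare the data shape to match the requirements of the model. Next, create an LSTM cell with BasicLSTMCell and pass it, together with the input, to static_rnn, which unrolls the network over the timesteps and creates its variables in a scope named rnn by default. The code below calls the function only once; if you need to call it more than once, wrap it in a variable scope with reuse=tf.AUTO_REUSE so that the variables are shared.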

Finally, calculate the logits by applying linear activation. In the inference graph, use this linear activation as the output layer and apply softmax during the inference for the output. To use this linear activation layer as the output, put it into a scope and assign a name to the logit operation.

def RNN(x, weights, biases):

    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, timesteps, n_input)
    # Required shape: 'timesteps' tensors list of shape (batch_size, n_input)

    x = tf.unstack(x, timesteps, 1)

    # Define a lstm cell with tensorflow
    lstm_cell = rnn.BasicLSTMCell(num_hidden, forget_bias=1.0)

    # Get lstm cell output
    outputs, _ = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)

    # Linear activation, using rnn inner loop last output
    with tf.name_scope('output_layer'):
        logit = tf.add(tf.matmul(outputs[-1], weights['out']) , biases['out'],name ='add')
    return logit
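The size of this model can be estimated with simple arithmetic. The sketch below assumes the standard LSTM layout, in which the cell holds four weight blocks (three gates plus the candidate update), each connected to both the current input and the previous hidden state; the variable names are illustrative.

```python
num_input, num_hidden, num_classes = 28, 128, 10

# Four weight blocks, each seeing the input and the previous hidden state.
lstm_kernel = (num_input + num_hidden) * 4 * num_hidden
lstm_bias = 4 * num_hidden

# Linear output layer on top of the last LSTM output.
output_layer_params = num_hidden * num_classes + num_classes

total = lstm_kernel + lstm_bias + output_layer_params
print(lstm_kernel + lstm_bias)   # 80384
print(output_layer_params)       # 1290
print(total)                     # 81674
```

At roughly 82,000 trainable parameters, the model is small, which is why it trains in minutes on a CPU.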

After obtaining the logits, apply a softmax operation for prediction. The softmax layer can also serve as the output layer for inferencing instead of the linear activation layer, so give it a name as well for future use.

logits = RNN(X, weights, biases)
prediction = tf.nn.softmax(logits,name='prediction')

F. Calculating the loss and optimization

Use softmax_cross_entropy_with_logits_v2 as the loss function to compare the predicted output to the actual label, and use an optimizer (e.g., Adam, SGD, or RMSprop) to minimize the loss. Calculate the prediction and accuracy of the model, and print accuracy and loss statistics during training.

# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=logits, labels=Y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)

# Evaluate model
correct_pred = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
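For readers who want to see what the loss computes, here is a plain-Python sketch of softmax cross-entropy for a single example. It is a hand-written illustration of the formula, not TensorFlow's implementation, and the logits and label are made-up values.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_label):
    """Negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))

logits = [2.0, 1.0, 0.1]   # illustrative scores for three classes
label = [1.0, 0.0, 0.0]    # the true class is class 0

probs = softmax(logits)
loss = cross_entropy(logits, label)
print(round(loss, 3))      # about 0.417
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1. Because softmax preserves the ordering of the scores, the predicted class is the same whether argmax is taken over the logits or over the probabilities.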

G. Training the model

To train the model and make it production ready, you must first initialize the variables, including weights and biases. Define the input and output for the exported graph by obtaining the input and output tensors from their names in the graph. The input is named Input_X and the output is named output_layer/add; add :0 to the end of both.

Each name on its own refers to an operation in the graph; adding :0 to the end of an operation name selects that operation's first output tensor.

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
output_tensor = tf.get_default_graph().get_tensor_by_name("output_layer/add:0")
input_tensor = tf.get_default_graph().get_tensor_by_name("Input_X:0")
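The naming convention can be made concrete with a small helper. The function below is a hypothetical illustration written for this article, not part of the TensorFlow API; it only splits a tensor name into the operation name and the output index.

```python
def split_tensor_name(tensor_name):
    """Split 'op_name:index' into (op_name, index).

    Hypothetical helper for illustration; raises ValueError when the
    ':index' suffix is missing, i.e. when an operation name was passed.
    """
    op_name, _, index = tensor_name.rpartition(":")
    if not op_name or not index.isdigit():
        raise ValueError("expected '<operation>:<output index>', got %r" % tensor_name)
    return op_name, int(index)

print(split_tensor_name("output_layer/add:0"))  # ('output_layer/add', 0)
print(split_tensor_name("Input_X:0"))           # ('Input_X', 0)
```

Passing the bare operation name, such as output_layer/add, raises an error here, which mirrors why the :0 suffix is needed when asking the graph for a tensor.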

Finally, define Saver so that the model can be saved as a checkpoint, which can subsequently be loaded for retraining and inferencing purposes. The resulting graph has many operations that are not necessary for inferencing, and the size of the saved model is large, which slows inferencing. Once training is complete, use the defined input and output tensors to export a smaller and faster model for inferencing.

Start training by creating a new session. After running the variable initializer, run the training loop for the number of steps defined in step C. Iterate through the training data and feed it into the model, batch by batch, to optimize the model and minimize loss.

saver = tf.train.Saver()
# start training
with tf.Session() as sess:

    # Run the initializer
    sess.run(init)

    for step in range(1, training_steps+1):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        # Reshape data to get 28 seq of 28 elements
        batch_x = batch_x.reshape((batch_size, timesteps, num_input))
        # Run optimization op (backprop)
        sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
        if step % display_step == 0 or step == 1:
            # Calculate batch loss and accuracy
            loss, acc = sess.run([loss_op, accuracy], feed_dict={X: batch_x,
                                                                 Y: batch_y})
            print("Step " + str(step) + ", Minibatch Loss= " + \
                  "{:.4f}".format(loss) + ", Training Accuracy= " + \
                  "{:.3f}".format(acc))
    # List the operations in the output_layer scope to confirm their names
    for op in tf.get_default_graph().get_operations():
        if 'output_layer' in op.name:
            print(op.name)
    print("Optimization Finished!")

As defined previously, display_step controls how often loss and accuracy are printed during training, which lets you confirm that loss decreases and accuracy increases (see the example below). This process of calculating loss is also known as the internal evaluation of the model.

Step 1, Minibatch Loss= 2.5360, Training Accuracy= 0.055
Step 1000, Minibatch Loss= 1.5179, Training Accuracy= 0.555
Step 2000, Minibatch Loss= 1.3314, Training Accuracy= 0.602
Step 3000, Minibatch Loss= 1.0988, Training Accuracy= 0.641
Step 4000, Minibatch Loss= 1.0893, Training Accuracy= 0.664
Step 5000, Minibatch Loss= 0.8246, Training Accuracy= 0.734
Step 6000, Minibatch Loss= 0.5758, Training Accuracy= 0.820
Step 7000, Minibatch Loss= 0.5413, Training Accuracy= 0.852
Step 8000, Minibatch Loss= 0.6734, Training Accuracy= 0.734
Step 9000, Minibatch Loss= 0.5125, Training Accuracy= 0.836
Step 10000, Minibatch Loss= 0.3872, Training Accuracy= 0.875

After about five minutes of training on a CPU (10,000 steps), loss has decreased significantly and accuracy approaches 88%, which is sufficient for our purposes. While a ConvNet may achieve better results, our goal is to create a simple RNN model and serve it in production.

2. Saving and evaluating the model

It is common practice to save a model at checkpoints during training to perform internal and external evaluations. An external evaluation can consist of an evaluation on a development dataset, if needed, and a test evaluation. A complete evaluation requires validation data to optimize the hyperparameters and test data to assess model accuracy. As we want to keep it simple, we will only perform a test evaluation, which requires the following two steps:

  1. Saving the model
  2. Evaluating the model

A. Saving the model

After finishing the training loop, save the checkpoint for future retraining; tf.train.Saver() makes saving simple, and it can be done in one line of code.

# save the model for retraining (run this inside the training session)
saver.save(sess,'./model.ckpt')

B. Evaluating the model

When training is complete, select a subset of test data, perform the prediction, and make sure that the test accuracy is close to the training accuracy. Select a batch size of 128 samples for this evaluation. If a large test dataset is used, consider utilizing a loop/iterator.

# Calculate accuracy for 128 mnist test images
test_len = 128
test_data = mnist.test.images[:test_len].reshape((-1, timesteps, num_input))
test_label = mnist.test.labels[:test_len]

# Softmax probabilities for the test images
test_prediction = sess.run(prediction, feed_dict={X: test_data})

print("Testing Accuracy:",
      sess.run(accuracy, feed_dict={X: test_data, Y: test_label}))
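For a larger test set, the evaluation can be run batch by batch. The following is a minimal sketch under stated assumptions: `predict_fn` is a hypothetical callable that returns one predicted class index per input (in this tutorial it could wrap a sess.run call on the argmax of prediction), and the labels are class indices rather than one-hot vectors.

```python
def batched_accuracy(predict_fn, images, labels, batch_size=128):
    """Accumulate correct predictions over batches and return the accuracy."""
    correct = 0
    for start in range(0, len(images), batch_size):
        batch_images = images[start:start + batch_size]
        batch_labels = labels[start:start + batch_size]
        predicted = predict_fn(batch_images)
        correct += sum(int(p == t) for p, t in zip(predicted, batch_labels))
    return correct / len(images)

# Toy stand-ins: ten "images" and a model that always predicts class 0.
toy_images = list(range(10))
toy_labels = [i % 2 for i in toy_images]
always_zero = lambda batch: [0 for _ in batch]

toy_accuracy = batched_accuracy(always_zero, toy_images, toy_labels, batch_size=3)
print(toy_accuracy)  # 0.5
```

Counting correct predictions and dividing once at the end keeps the result exact even when the last batch is smaller than the others.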

The test accuracy is 0.90625, or 90.6%, which is very close to the training accuracy. The model therefore does not appear to be overfitting, and it can be trained further to an accuracy of about 98%, which is very close to the accuracy achieved via state-of-the-art ConvNet/ResNet models.

3. Exporting the trained model for inference

The saved checkpoint contains metadata that are not needed for inferencing, so it is worth exporting the TensorFlow model separately. For production, all you need are the model definition and weights; exporting only these minimizes the size of the model and makes inferencing faster. Below, we explain how to export an RNN model that can be served in production.

First, export the trained model for inferencing using SavedModelBuilder, which creates the export directory and expects that it does not already exist. It is possible to add a model version here, but for the purposes of this demo, remove the previously saved model before saving another one.

Second, build tensor information for the input and output of the exported model using tf.saved_model.utils.build_tensor_info. This defines the tensor_info_x and tensor_info_y protocol buffers.

# Export the model for prediction
export_base_path = './exportmodel'
export_path = os.path.join(
    tf.compat.as_bytes(export_base_path),
    tf.compat.as_bytes(str(FLAGS.model_version)))

# Removing previously exported model (uncomment after adding `import shutil`)
# shutil.rmtree(export_path)
builder = tf.saved_model.builder.SavedModelBuilder(export_path)

tensor_info_x = tf.saved_model.utils.build_tensor_info(input_tensor)
tensor_info_y = tf.saved_model.utils.build_tensor_info(output_tensor)
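As an alternative to deleting the previous export, each export can go into a new numbered subdirectory. The helper below is a hypothetical sketch, not part of TensorFlow; it assumes that versions are stored as numeric directory names under the base path, and it returns the next path without creating it, since the builder expects a directory that does not yet exist.

```python
import os
import tempfile

def next_version_dir(base_path):
    """Return base_path/<highest existing numeric version + 1>."""
    os.makedirs(base_path, exist_ok=True)
    versions = [
        int(name) for name in os.listdir(base_path)
        if name.isdigit() and os.path.isdir(os.path.join(base_path, name))
    ]
    next_version = max(versions, default=0) + 1
    return os.path.join(base_path, str(next_version))

# Demonstration in a throwaway directory that already holds versions 1 and 2.
demo_base = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_base, "1"))
os.makedirs(os.path.join(demo_base, "2"))
next_path = next_version_dir(demo_base)
print(os.path.basename(next_path))  # 3
```

Keeping older versions side by side makes it possible to roll back by pointing the server at an earlier directory.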

Third, define the signature, which is used for prediction. Build the signature definition using key-value mapping. Name the input key x_input and map it to tensor_info_x (i.e., the protocol buffer for Input_X), name the output key y_output and map it to tensor_info_y (i.e., the protocol buffer for the logit tensor), and then set method_name to the method used for inferencing. Here, we use the predefined constant for prediction.

prediction_signature = (
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={'x_input': tensor_info_x},
        outputs={'y_output': tensor_info_y},
        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))

Fourth, add the metagraph and variables to the builder using SavedModelBuilder.add_meta_graph_and_variables() and the following arguments:

  • sess: the TensorFlow session that holds the trained model.
  • tags: the set of tags used to save the metagraph. In this case, since we intend to use the graph in serving, we can use the serve tag from the predefined SavedModel tag constants.
  • signature_def_map: a map from a user-supplied signature key to a tensorflow::SignatureDef, which is added to the metagraph. The signature specifies what type of model is being exported and which input and output tensors to bind to when running an inference.
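The role of the signature map is easier to see with a plain-Python stand-in. The dictionary below only mimics the structure for illustration; the real object is a protocol buffer, and the `resolve` function is a hypothetical helper. The two string constants are the values of DEFAULT_SERVING_SIGNATURE_DEF_KEY and PREDICT_METHOD_NAME.

```python
# Plain-Python stand-in for the signature map (the real one is a protobuf).
DEFAULT_KEY = "serving_default"  # value of DEFAULT_SERVING_SIGNATURE_DEF_KEY

signature_def_map = {
    DEFAULT_KEY: {
        "inputs": {"x_input": "Input_X:0"},
        "outputs": {"y_output": "output_layer/add:0"},
        "method_name": "tensorflow/serving/predict",
    }
}

def resolve(signature_map, signature_key, input_key, output_key):
    """Look up the tensor names that a client-facing key pair refers to."""
    signature = signature_map[signature_key]
    return signature["inputs"][input_key], signature["outputs"][output_key]

x_name, y_name = resolve(signature_def_map, DEFAULT_KEY, "x_input", "y_output")
print(x_name, y_name)  # Input_X:0 output_layer/add:0
```

Clients only need to know the keys x_input and y_output; the signature translates them into the tensor names inside the graph.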

To save the exported model, use the following code:

builder.add_meta_graph_and_variables(
    sess, [tf.saved_model.tag_constants.SERVING],
    signature_def_map={
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
            prediction_signature
    },
)
builder.save()

4. Serving the model in production

Load the exported model from the export directory, and you are ready for inferencing. Inferencing can be served in a few ways. For example, we can:

  1. Serve the model with the Python Flask API
  2. Serve the model with the TensorFlow Serving API

Here, we will show both options, side by side, to compare the performances of the models in the server. Each part has two sides: a server side and a client side.

A. Serving the model with the Python Flask API

Flask is a micro web framework written in Python and is based on the Werkzeug toolkit and Jinja2 template engine. Applications that use the Flask framework include Pinterest, LinkedIn, and the community webpage for Flask itself.

Flask is popular with the Python community because it is easy to integrate into an application. A Flask deployment has a server side and a client side: the server loads the model and waits for client requests, to which it provides responses. Flask can be downloaded from the developer’s webpage, which also features numerous examples.

Use a POST request from the client side to send image data and obtain a logit in response. Flask does not require additional applications to issue a POST request and send data back.

Use JavaScript Object Notation (JSON) for the data exchange, because NumPy arrays cannot be serialized to JSON directly. This format will allow you to send any type of data, depending on the model you trained. First, convert the image on the client side from a NumPy array to a list so that it can be sent as JSON. After receiving the image in this format, the server will take the list, convert it to a NumPy array, and reshape it according to the model’s requirements.

Once the input is ready, feed data to the model to obtain the logits and apply softmax to get a prediction in the form of a NumPy array. Reconvert it to JSON to send it back to the client.
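The conversion steps described above can be tried without a server. This is a minimal sketch that uses a random array in place of a real MNIST image; the key name data matches the one used by the client and server in this tutorial.

```python
import json
import numpy as np

# Client side: a stand-in for one flattened MNIST image.
image = np.random.RandomState(0).rand(784).astype(np.float32)
payload = json.dumps({"data": image.tolist()})   # NumPy array -> list -> JSON

# Server side: JSON -> list -> NumPy array in the shape the model expects.
received = json.loads(payload)["data"]
batch_x = np.array(received, dtype=np.float32).reshape(-1, 28, 28)

print(batch_x.shape)  # (1, 28, 28)
```

Calling json.dumps on the NumPy array itself raises a TypeError, which is why the tolist() step is needed.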

i. Flask Server
import tensorflow as tf
import numpy as np
from flask import Flask, request
import json

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    # Subtract the row maximum for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)


app = Flask(__name__)

@app.route('/', methods=['POST'])
def predict():
    # sess, x, and y are created in the __main__ block below
    from_client = request.get_json()
    inference_data = from_client['data']
    inference_data = np.array(inference_data)
    batch_x = inference_data.reshape(-1, 28, 28)

    logits = sess.run(y, feed_dict={x: batch_x})
    prediction = softmax(logits)
    json_data = json.dumps({'y': prediction.tolist()})
    return json_data

Create a session and obtain the default signature definition key. Next, assign the keys used to save the input and output of the model, and load the exported model as a metagraph using the TensorFlow SavedModel loader API. Use the signature to look up the names of the input and output tensors, retrieve those tensors from the session's graph, and assign the input of the graph as x and the logit as y. Now, we are ready for inferencing.


if __name__ == '__main__':
    tf.app.flags.DEFINE_string('model_path','./exportmodel/1/',help='model Path')
    tf.app.flags.DEFINE_string('host','0.0.0.0',help='server ip address')
    tf.app.flags.DEFINE_integer('port',5000,help='server port')
    FLAGS = tf.app.flags.FLAGS
    sess=tf.Session()

    signature_key = tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
    input_key = 'x_input'
    output_key = 'y_output'

    export_path =  FLAGS.model_path
    meta_graph_def = tf.saved_model.loader.load(
            sess,
            [tf.saved_model.tag_constants.SERVING],
            export_path)
    signature = meta_graph_def.signature_def

    x_tensor_name = signature[signature_key].inputs[input_key].name
    y_tensor_name = signature[signature_key].outputs[output_key].name

    x = sess.graph.get_tensor_by_name(x_tensor_name)
    y = sess.graph.get_tensor_by_name(y_tensor_name)
    
    app.run(host=FLAGS.host,port=FLAGS.port)

 

ii. Flask client

The client loads the data and sends a POST request to the server with a timeout. It then checks the response status, prints the response if the request failed, and otherwise calculates the accuracy, as shown below:

import numpy as np
import requests 
import json
from tensorflow.examples.tutorials.mnist import input_data
import argparse
import time

def main(args):  
    mnist = input_data.read_data_sets(args.data_dir, one_hot=True)
    counter = 0
    start_time = time.time()
    num_tests = args.num_tests
    for i in range(num_tests):
        image = mnist.test.images[i]
        data = {'data':image.tolist()}
        headers = {'content-type': 'application/json'} 
        url = args.host+':'+str(args.port)

        response = requests.post(url,json.dumps(data),headers = headers,timeout=10)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            response = json.loads(response.text)
            response = response['y']
            if np.argmax(response) ==np.argmax(mnist.test.labels[i]):
                counter +=1
            else:
                pass
        else:
            print(response)
    print("Accuracy= %0.2f"%((counter*1.0/num_tests)*100))
    print("Time takes to run the test %0.2f"%(time.time()-start_time))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', default='/tmp/data')
    parser.add_argument('--host',default='http://127.0.0.1')
    parser.add_argument('--port',default=5000)
    # type=int so that a value passed on the command line works with range()
    parser.add_argument('--num_tests',default=1,type=int)
    args = parser.parse_args()

    main(args)

We can see that the prediction accuracy is approximately the same as the test accuracy at 89%.
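In production, a timed-out or refused request is usually retried a few times before giving up. The wrapper below is a generic sketch under stated assumptions: `send` is a hypothetical zero-argument callable that performs the request, and the built-in ConnectionError and TimeoutError stand in for the exceptions of whichever HTTP library is used (with requests, you would catch requests.exceptions.RequestException instead).

```python
def call_with_retries(send, retries=3):
    """Call send() up to `retries` times and return its first successful result."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return send()
        except (ConnectionError, TimeoutError) as error:
            last_error = error
            print("attempt %d failed: %s" % (attempt, error))
    raise last_error

# Toy stand-in for a flaky server: fails twice, then answers.
attempts = {"count": 0}

def flaky_send():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("no response")
    return {"y": [[0.1, 0.9]]}

result = call_with_retries(flaky_send, retries=3)
print(result)  # {'y': [[0.1, 0.9]]}
```

If every attempt fails, the last exception is raised again so that the caller can log it or count the sample as failed.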

B. Serving the model with the TensorFlow Serving API

TensorFlow Serving requires a server and a client. Fortunately, a TensorFlow server can be started using a single command. Unlike the Flask setup, it does not need JSON: TensorFlow Serving uses gRPC with protocol buffers for communication, so no conversion between NumPy arrays and lists is required, which reduces the number of steps and increases serving speed. For installation, please follow these instructions.

Let’s set up the server, pointing --model_base_path at the directory that contains the numbered version folder of the exported model:

tensorflow_model_server --port=9000 --model_name=mnist --model_base_path=/tmp/mnist_model/

Now, set up the client side. First, prepare the communication channel by supplying the host IP address and port. Next, create a stub, which is the service object on the client side. Then, create a PredictRequest object and assign the model specs, which take a model name (model_name) and a signature name (signature_name). The signature name is the key under which prediction_signature was registered when the model was exported.

Use x_input, the input key of the prediction_signature in the exported model, to send the data as a tensor protocol buffer created with make_tensor_proto. In a production system, it is important to add a timeout to each request, along with exception handling and logging; here, the Predict call takes the request and a timeout of 10 seconds.

If there are any exceptions, they are simply printed. Otherwise, the client reads the output logits and compares their argmax to the gold label; because softmax preserves the ordering of the scores, this gives the same predicted class as the softmax probabilities.

Finally, the accuracy can be printed. Here is the full code:

import argparse
import time
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2
from tensorflow.contrib.util import make_tensor_proto

def main(args):  
    
    host, port = args.host, args.port

    mnist = input_data.read_data_sets(args.data_dir, one_hot=True)

    channel = implementations.insecure_channel(host, int(port))
    stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
    
    start_time = time.time()
    counter = 0
    num_tests = args.num_tests
    for _ in range(num_tests):
        image, label = mnist.test.next_batch(1)
        
        request = predict_pb2.PredictRequest()
        request.model_spec.name = 'mnist'
        request.model_spec.signature_name = tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
        request.inputs['x_input'].CopyFrom(
            make_tensor_proto(image[0],shape=[1,28,28])
        )
        # Predict.future returns a future, so the call can be checked for errors
        result_future = stub.Predict.future(request, 10.0)
        exception = result_future.exception()
        if exception:
            print(exception)
        else:
            response = np.array(result_future.result().outputs['y_output'].float_val)
            if np.argmax(response) ==np.argmax(label):
                counter +=1
            else:
                pass
    print("Accuracy= %0.2f"%((counter*1.0/num_tests)*100))
    print("Time takes to run the test %0.2f"%(time.time()-start_time))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', default='/tmp/data')
    parser.add_argument('--host',default='127.0.0.1')
    parser.add_argument('--port',default=9000)
    # type=int so that a value passed on the command line works with range()
    parser.add_argument('--num_tests',default=100,type=int)
    args = parser.parse_args()

    main(args)
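The float_val field of a returned tensor is a flat list of numbers, regardless of the tensor's shape. The client above sends one image per request, so the list holds a single row of 10 scores. For batched requests, the list has to be cut back into rows first. The helpers below are a hypothetical pure-Python sketch of that step, shown with made-up values and three classes for brevity.

```python
def rows_from_flat(values, num_classes):
    """Cut a flat list of scores into one row per example."""
    if len(values) % num_classes != 0:
        raise ValueError("length is not a multiple of num_classes")
    return [list(values[i:i + num_classes])
            for i in range(0, len(values), num_classes)]

def argmax(row):
    """Index of the largest score in a row."""
    return max(range(len(row)), key=row.__getitem__)

flat_scores = [0.1, 2.0, -1.0, 3.0, 0.0, 0.5]   # two examples, three classes
rows = rows_from_flat(flat_scores, num_classes=3)
predicted_classes = [argmax(row) for row in rows]
print(predicted_classes)  # [1, 0]
```

With NumPy, the same result can be obtained by reshaping the array to (-1, num_classes) and taking argmax along the last axis.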

TensorFlow Serving provides similar accuracy, but in our test, it was about twice as fast as the Flask API.
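To compare the two serving options on your own hardware, per-request latency can be measured with a small timing helper. This is a generic sketch: `call` is a hypothetical zero-argument callable that sends one request, and the stand-in workload below only exists to make the example runnable.

```python
import statistics
import time

def measure_latency(call, num_calls):
    """Return the wall-clock duration in seconds of each call."""
    latencies = []
    for _ in range(num_calls):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies

# Stand-in workload; replace with a function that sends one request.
latencies = measure_latency(lambda: sum(range(1000)), num_calls=5)
print("median latency: %.6f s" % statistics.median(latencies))
```

Running the same helper against the Flask client and the gRPC client, with the same number of requests, gives two lists that can be compared by their medians.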

Conclusion

We’ve just walked through the steps of creating an RNN model from training to inferencing. It is possible to change the RNN structure or use an RNN model for other tasks, such as language modeling or machine translation. To create a model that can be served in production, the input and output must be changed accordingly.

It is very common to use Docker images from TensorFlow to serve models in production. As a company, Scribendi is moving toward designing these tools as microservices, and we are using a Docker stack to serve them in production. This will be discussed in an upcoming Scribendi.AI blog post.

References

https://www.tensorflow.org/serving/serving_basic
https://www.tensorflow.org/api_docs/python/tf/saved_model/signature_constants
https://grpc.io/
http://flask.pocoo.org/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

About the Author

Md Asadul Islam is a machine learning engineer with a focus on developing, training, and serving GEC models that make the work of human editors easier. He has an M.Sc. in Computer Science from the University of Alberta. During his studies, he primarily worked on native language identification, Bengali character recognition, combinatorial optimization, and probabilistic graphical modeling. He loves to read about research ideas, pursue challenges, and solve problems. Apart from his work, he loves to spend time with his loving family.

Remarks: I would like to thank Scribendi for allowing me to write this article. In addition, I also want to express my appreciation to Enrico Magnani, the CEO of Scribendi, for his in-depth insight, as well as to our excellent team of editors for editing this article and providing useful suggestions. Finally, I am grateful to Terry Johnson, the former vice-president of Scribendi, for his wonderful suggestions.

