How to Boost Your TensorFlow Model Inference Performance Using Asyncio
Derrick Joseph (~derrick)
Lately, there has been a lot of interest in Deep Learning (DL), and thanks to frameworks like TensorFlow, anyone can implement DL papers and create models. Unfortunately, the deployment patterns that follow are mostly rudimentary: plain REST calls to the model, or TensorFlow Serving. That is fine while you are experimenting, but once the model is deployed and the requests start flying, such methods become a bottleneck in your architecture. There are obvious workarounds, like running multiple model instances behind a load balancer, but what if there is a much better, Pythonic way?
The Actor and CSP (Communicating Sequential Processes) patterns have been around since the 70s (1973 and 1978 respectively), but only a niche group has taken a keen look at them. Since asyncio joined the standard library in Python 3.4, and the async/await syntax followed in Python 3.5, the Python ecosystem has been opened up to these patterns in some limited but useful forms. This talk will show these patterns and how they can be used to deploy Deep Learning models the right way. (The reference to Deep Learning alone is intentional and relates to batching in TensorFlow.) As for the credibility of these patterns: the Actor model is used by Erlang and the CSP model is used by Go, and yes, we can write Python 3 code in the style of these languages.
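To make the CSP idea and the batching remark concrete, here is a minimal sketch, not the talk's actual implementation. It assumes an `asyncio.Queue` plays the role of a channel and that a single worker coroutine groups waiting requests into one batched model call. `fake_model_predict`, `batcher`, `infer`, `max_batch` and `max_wait` are hypothetical names chosen for illustration; a real deployment would call a TensorFlow model where the stand-in function is.

```python
import asyncio


def fake_model_predict(batch):
    # Hypothetical stand-in for a batched model call; it just doubles each input.
    return [x * 2 for x in batch]


async def batcher(queue, max_batch=8, max_wait=0.01):
    """Read requests from the channel and run them through the model in batches."""
    loop = asyncio.get_running_loop()
    while True:
        # Wait until at least one request arrives.
        item, fut = await queue.get()
        inputs, futures = [item], [fut]
        deadline = loop.time() + max_wait
        # Keep collecting until the batch is full or the time window closes.
        while len(inputs) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                item, fut = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            inputs.append(item)
            futures.append(fut)
        # Run the blocking model call in a thread so the event loop stays responsive.
        outputs = await loop.run_in_executor(None, fake_model_predict, inputs)
        for f, out in zip(futures, outputs):
            f.set_result(out)


async def infer(queue, x):
    """Send one request down the channel and wait for its individual result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut


async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))
    # Twenty concurrent callers share a single batching worker.
    results = await asyncio.gather(*(infer(queue, i) for i in range(20)))
    worker.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each caller sees a simple `await infer(...)`, while the model only ever sees batches. The batch size and wait window are tuning knobs that trade latency against throughput.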
This talk is not about:
- Microservices. They are good, but you cannot have thousands, if not tens of thousands, of unique microservices created on the fly and connected uniquely for each user (a unique pipeline per user). Microservices also have the downside of depending on external message-passing solutions (Redis, Celery, RabbitMQ), which add to the latency.
- Deep Learning algorithms. There are plenty of resources for those; we are only looking at the model deployment perspective, i.e. inference-time optimization.
The proposed method is implemented in Python alone, without any external dependencies, including third-party message-passing solutions, which makes it faster and lighter than microservices.
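As a rough illustration of in-process message passing with per-user pipelines, the sketch below models each user as an actor: a coroutine with private state and an `asyncio.Queue` mailbox, created on the fly. It uses only the standard library. `ConversationActor`, `ask`, `handle` and the reply format are hypothetical names invented for this example, not the API presented in the talk.

```python
import asyncio


class ConversationActor:
    """One actor per user: private state plus a mailbox handled one message at a time."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.mailbox = asyncio.Queue()
        self.turns = 0  # private state, only touched by this actor's own task
        self._task = asyncio.create_task(self._run())

    async def _run(self):
        while True:
            text, reply = await self.mailbox.get()
            self.turns += 1
            reply.set_result(f"{self.user_id} turn {self.turns}: {text}")

    async def ask(self, text):
        # Send a message to the actor and wait for its reply.
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((text, reply))
        return await reply

    def stop(self):
        self._task.cancel()


async def main():
    actors = {}

    async def handle(user_id, text):
        # Actors are created on the fly, the first time a user shows up.
        if user_id not in actors:
            actors[user_id] = ConversationActor(user_id)
        return await actors[user_id].ask(text)

    replies = await asyncio.gather(
        handle("alice", "hi"),
        handle("bob", "hello"),
        handle("alice", "bye"),
    )
    for actor in actors.values():
        actor.stop()
    return replies


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because each actor processes its mailbox sequentially, its state needs no locks, and creating a new actor costs one task and one queue rather than a new service.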
Prerequisites: a basic understanding of asyncio coroutines and, if possible, asyncio streams.
I started using Python in 2014 to quickly hack together my master's thesis, and it's been a steady relationship since then. Over the past few years, I have been working on building scalable systems and deploying data processing pipelines at scale.
Lately, I have been part of a startup that offers chatbot services, and we were facing serious scale-up issues. It was while solving these issues that I picked up the ideas for this talk. If you think of chatbots, each conversation is a unique data pipeline, with each node depending on different entities and topics and having its own state. These are difficult to model using prevalent graph traversal techniques in Python, hence the Actor/CSP model with asyncio.