
Advanced Features

This page introduces some advanced features of TensorOpera®Deploy, including:

  1. Autoscaling and Fail-over
  2. Geo-distributed Model Deployment
  3. Heterogeneous Model Deployment
  4. Multiple Return Type Support
    1. Streaming Response
    2. File Response

Autoscaling

TensorOpera®Deploy can automatically scale the number of replicas of a model deployment up and down based on QPS (queries per second).

When you deploy an endpoint, you can enable autoscaling by setting:

  1. min_replicas and max_replicas to give the autoscaler the allowed range of replicas.
  2. Concurrency per Replica before Scaling Up to set the per-replica load threshold at which the autoscaler scales out.
  3. Decision Time Window to set the time window over which QPS is calculated.
  4. Scale Down Delay to set how long the autoscaler waits before scaling replicas down. (A conceptual sketch of how these settings interact follows the figure below.)

AutoscaleConf.png
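
The sketch below is not the actual TensorOpera autoscaler; it is a minimal illustration of how the four settings above could drive a scaling decision: QPS is measured over the decision time window, compared against the per-replica concurrency threshold, and scale-down is postponed by the configured delay.

import math
import time
from collections import deque

class AutoscalePolicySketch:
    """Illustrative only; field names mirror the settings shown in the UI above."""

    def __init__(self, min_replicas, max_replicas, concurrency_per_replica,
                 decision_window_s, scale_down_delay_s):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.concurrency_per_replica = concurrency_per_replica
        self.decision_window_s = decision_window_s
        self.scale_down_delay_s = scale_down_delay_s
        self._requests = deque()            # arrival timestamps of recent requests
        self._scale_down_allowed_at = 0.0   # earliest time a scale-down may happen

    def record_request(self):
        self._requests.append(time.time())

    def desired_replicas(self, current_replicas):
        now = time.time()
        # Keep only requests inside the decision time window, then compute QPS.
        while self._requests and now - self._requests[0] > self.decision_window_s:
            self._requests.popleft()
        qps = len(self._requests) / self.decision_window_s

        # Replicas needed to keep per-replica load under the threshold,
        # clamped to [min_replicas, max_replicas].
        needed = math.ceil(qps / self.concurrency_per_replica)
        needed = max(self.min_replicas, min(self.max_replicas, needed))

        if needed > current_replicas:
            # Scale out immediately and push back the next allowed scale-down.
            self._scale_down_allowed_at = now + self.scale_down_delay_s
            return needed
        if needed < current_replicas and now >= self._scale_down_allowed_at:
            # Scale in only after the scale-down delay has elapsed.
            self._scale_down_allowed_at = now + self.scale_down_delay_s
            return needed
        return current_replicas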

Geo-distributed Model Deployment

Without building and configuring a complex Kubernetes cluster, TensorOpera®Deploy can deploy models to nodes located in multiple regions and manage the traffic routing automatically.

GeoDistributed.jpg

Heterogeneous Model Deployment

TensorOpera®Deploy can deploy models to different types of devices, such as CPU, GPU, TPU, etc. Whether it is a single MacBook or a powerful A100, devices can be connected together easily.

Heterogeneity.jpg

Multiple Return Type Support

Streaming Response

The following code example can be found at:
https://github.com/FedML-AI/FedML/tree/master/python/examples/deploy/streaming_response

import asyncio
from typing import AsyncGenerator

from fastapi.responses import StreamingResponse

# The three methods below are defined on the predictor class.

def predict(self, *args, **kwargs):
    return {"my_output": "example output"}

async def async_predict(self, *args):
    return StreamingResponse(self._async_predict(*args))

async def _async_predict(self, *args) -> AsyncGenerator[str, None]:
    # This function can also return fastapi.responses.StreamingResponse directly
    input_json = args[0]
    question = input_json.get("text", "[Empty question]")
    for i in range(5):
        yield f"Answer for {question} is: {i + 1}\n\n"
        await asyncio.sleep(1.0)

In this example, the predictor streams out a number every second. The core idea is that, in addition to the regular predict method, which returns a JSON object as usual, the predictor defines an async_predict method. If the requester includes "stream": true in the request body,
e.g. curl -XPOST xxx -d '{"text": "my input ...", "stream": true}', then TensorOpera will automatically call async_predict instead. To implement streaming, override this method; the example above returns a StreamingResponse, which takes an AsyncGenerator as input.
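
For completeness, here is a hypothetical client-side sketch that consumes such a stream with the requests library; the endpoint URL and API key are placeholders, not real values:

import requests

url = "https://your-endpoint-url/predict"            # placeholder endpoint URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}    # placeholder API key
payload = {"text": "my input ...", "stream": True}

with requests.post(url, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    # Each decoded chunk corresponds to a piece yielded by the async generator.
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)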

File Response

The following code example can be found at:
https://github.com/FedML-AI/FedML/blob/master/python/examples/deploy/stable_diffusion/src/inference_entry.py#L101

import base64

def predict(self, request: dict, header=None):
    args = self.args
    input_dict = request
    prompt: str = input_dict.get("text", "").strip()

    self.args.prompt = [prompt]

    images, paths, pipeline_time = self.run_sd_xl_inference(warmup=False, verbose=args.verbose)

    if len(prompt) == 0:
        response_text = "<received empty input; no response generated.>"
    else:
        if header == "image/png":
            # Return a local file path; the framework serves the file itself.
            return str(paths[0])
        else:
            # Return the file contents encoded as a base64 string.
            with open(paths[0], "rb") as image_file:
                encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
            return encoded_string

In this example, the predictor writes a generated file to a local path, paths[0], and there are two ways to return it to the requester. First, if the request header includes Accept: image/png, the TensorOpera framework parses the predictor's return value as a local file path and uses fastapi.responses.FileResponse to send the file back as binary content. Second, if the requester does not include Accept: image/png in the header, the framework treats the predictor's return value as a plain string rather than a file path, so the user-level code must encode the file as a base64 string and return that instead.
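
As a minimal, hypothetical client-side sketch (the endpoint URL and API key below are placeholders, and the exact response envelope may differ by deployment), the two modes could be exercised like this:

import base64

import requests

url = "https://your-endpoint-url/predict"         # placeholder endpoint URL
auth = {"Authorization": "Bearer YOUR_API_KEY"}    # placeholder API key
payload = {"text": "an astronaut riding a horse"}

# Mode 1: request with "Accept: image/png"; the framework returns the file bytes.
resp = requests.post(url, json=payload, headers={**auth, "Accept": "image/png"})
with open("output_direct.png", "wb") as f:
    f.write(resp.content)

# Mode 2: no "Accept: image/png"; the predictor returns a base64 string,
# which the client decodes back into image bytes. Depending on the deployment,
# the string may arrive wrapped in a JSON field rather than as the raw body.
resp = requests.post(url, json=payload, headers=auth)
with open("output_b64.png", "wb") as f:
    f.write(base64.b64decode(resp.text))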