TensorOpera® Launch APIs
Launch APIs
Simple launcher APIs for running any AI job across multiple public and/or decentralized GPU clouds, offering lower prices without cloud vendor lock-in, the highest GPU availability, training across distributed low-end GPUs, and user-friendly Ops to save time on environment setup.
Some APIs operate remotely (e.g. fedml.api.launch_job()). Before calling them, log in to the TensorOpera AI platform using one of the following methods:
CLI:
fedml login $api_key
API:
fedml.api.fedml_login(api_key=$api_key)
fedml.api.launch_job()
Launch a job on the TensorOpera AI platform.
fedml.api.launch_job(yaml_file, api_key=None, resource_id=None, device_server=None, device_edges=None)
Arguments
- yaml_file (str): Full path of your job yaml file.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
- resource_id (str=None): Specific resource_id to use. Typically, you won't need to specify a specific resource_id. Instead, we will match resources based on your job yaml, and then automatically launch the job using matched resources.
- device_server (str=None): device_server to use. Only needed when you want to launch a federated learning job with specific device_server and device_edges.
- device_edges (List[str]=None): List of device_edges to use. Only needed when you want to launch a federated learning job with specific device_server and device_edges.
Returns
LaunchResult object with the following attributes:
- result_code (int): API result code. 0 means success. The full list of result codes is in the Result Codes section.
- result_msg (str): API status message.
- run_id (str): Run ID of the launched job.
- project_id (str): Project ID of the launched job. A default is assigned if none is specified in your job yaml file.
- inner_id (str): Serving endpoint ID of the launched job. Only applicable for Deploy / Serve Job tasks, and will be None otherwise.
Example
import fedml

api_key = "YOUR_API_KEY"
yaml_file = "/home/fedml/train.yaml"

# Log in first; launch_job() requires an authenticated session.
login_ret = fedml.api.fedml_login(api_key)
if login_ret == 0:
    launch_result = fedml.api.launch_job(yaml_file)
    # result_code 0 means the launch request succeeded.
    if launch_result.result_code == 0:
        print("Job launched successfully")
    else:
        print("Failed to launch job")
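The attributes listed under Returns can be read off the result in one place. The following is a minimal sketch rather than part of the fedml package: summarize_launch is a hypothetical helper, and SimpleNamespace objects with made-up values stand in for a real LaunchResult so the snippet runs without logging in.

```python
from types import SimpleNamespace


def summarize_launch(result):
    """Build a one-line summary from a LaunchResult-like object.

    Hypothetical helper: it only reads the attributes documented above
    (result_code, result_msg, run_id, project_id, inner_id).
    """
    if result.result_code != 0:
        return f"launch failed (code {result.result_code}): {result.result_msg}"
    summary = f"run {result.run_id} launched in project {result.project_id}"
    # inner_id is only set for Deploy / Serve jobs; it is None otherwise.
    if result.inner_id is not None:
        summary += f", endpoint {result.inner_id}"
    return summary


# Stand-in objects with invented values; a real one comes from fedml.api.launch_job().
ok = SimpleNamespace(result_code=0, result_msg="LAUNCH_REQUEST_SUCCESS",
                     run_id="1234", project_id="42", inner_id=None)
failed = SimpleNamespace(result_code=8, result_msg="NO_RESOURCES",
                         run_id=None, project_id=None, inner_id=None)

print(summarize_launch(ok))
print(summarize_launch(failed))
```

With a real session, the same helper could be called as summarize_launch(fedml.api.launch_job(yaml_file)).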
fedml.api.launch_job_on_cluster()
Launch a job on a cluster on the TensorOpera AI platform.
fedml.api.launch_job_on_cluster(yaml_file, cluster, api_key=None, resource_id=None, device_server=None, device_edges=None)
Arguments
- yaml_file (str): Full path of your job yaml file.
- cluster (str): Cluster name to use. If a cluster with the provided name doesn't exist, one will be created.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
- resource_id (str=None): Specific resource_id to use. Typically, you won't need to specify a specific resource_id. Instead, we will match resources based on your job yaml, and then automatically launch the job using the matched resources.
- device_server (str=None): device_server to use. Only needed when you want to launch a federated learning job with specific device_server and device_edges.
- device_edges (List[str]=None): List of device_edges to use. Only needed when you want to launch a federated learning job with specific device_server and device_edges.
Returns
LaunchResult object with the following attributes:
- result_code (int): API result code. 0 means success. The full list of result codes is in the Result Codes section.
- result_msg (str): API status message.
- run_id (str): Run ID of the launched job.
- project_id (str): Project ID of the launched job.
- inner_id (str): Serving endpoint ID of the launched job. Only applicable for Deploy / Serve Job tasks, None otherwise.
Example
import fedml

api_key = "YOUR_API_KEY"
yaml_file = "/home/fedml/train.yaml"

# Log in first; launch_job_on_cluster() requires an authenticated session.
login_ret = fedml.api.fedml_login(api_key)
if login_ret == 0:
    # The cluster is created if no cluster named "my_cluster" exists yet.
    launch_result = fedml.api.launch_job_on_cluster(yaml_file, cluster="my_cluster")
    if launch_result.result_code == 0:
        print("Job launched successfully on cluster")
    else:
        print("Failed to launch job on cluster")
Run APIs
fedml.api.run_stop()
Stop a run on TensorOpera AI platform.
fedml.api.run_stop(run_id, platform="falcon", api_key=None)
Arguments
- run_id (str): ID of the run to stop. Each run has a unique identifier, which is returned in the LaunchResult after launching a job and can also be found on the Runs page of the TensorOpera AI Platform.
- platform (str=falcon): The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch; default is falcon).
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the run was successfully stopped or not.
fedml.api.run_list()
List runs on TensorOpera AI platform.
fedml.api.run_list(run_name, run_id=None, platform="falcon", api_key=None)
Arguments
- run_name (str): Name of the run. This can also be found on the Runs page of the TensorOpera AI Platform.
- run_id (str=None): ID of the run to list (only required if run_name is not provided). Each run has a unique identifier, which is returned in the LaunchResult after launching a job and can also be found on the Runs page of the TensorOpera AI Platform.
- platform (str=falcon): The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch; default is falcon).
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
FedMLRunModelList object which is a list of FedMLRunModel objects with attributes like status, running_time, cost, run_url etc.
fedml.api.run_status()
Get the status of a run on TensorOpera AI platform.
fedml.api.run_status(run_name, run_id, platform: str = "falcon", api_key: str = None)
Arguments
- run_name (str): Name of the run. This can also be found on the Runs page of the TensorOpera AI Platform.
- run_id (str): ID of the run to get the status of (only required if run_name is not provided). Each run has a unique identifier, which is returned in the LaunchResult after launching a job and can also be found on the Runs page of the TensorOpera AI Platform.
- platform (str=falcon): The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch; default is falcon).
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Tuple of FedMLRunModelList and status (str) denoting status of the run.
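Because run_status() reports a point-in-time status, callers often poll it until the run settles. The sketch below shows one way to structure such a loop. wait_for_status, get_status and terminal_statuses are hypothetical names, and the status strings are placeholders: this document does not list the actual status values, so the caller has to supply them.

```python
import time


def wait_for_status(get_status, terminal_statuses, poll_seconds=5.0, max_polls=60,
                    sleep=time.sleep):
    """Poll get_status() until it returns one of terminal_statuses.

    get_status is any zero-argument callable returning a status string.
    Returns the last status seen, which may be non-terminal if max_polls
    is exhausted.
    """
    status = None
    for attempt in range(max_polls):
        status = get_status()
        if status in terminal_statuses:
            break
        # Skip the final sleep: there is no further poll after it.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return status


# Simulated status sequence; the strings are placeholders, not documented values.
fake_statuses = iter(["QUEUED", "RUNNING", "RUNNING", "FINISHED"])
final = wait_for_status(lambda: next(fake_statuses),
                        terminal_statuses={"FINISHED", "FAILED"},
                        sleep=lambda seconds: None)
print(final)
```

Against the real API, get_status could be a lambda that calls fedml.api.run_status(run_name, run_id) and returns the status element of the tuple described above. Injecting sleep keeps the loop testable without real delays.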
fedml.api.run_logs()
Fetch logs of a run from TensorOpera AI platform.
fedml.api.run_logs(run_id, page_num=1, page_size=10, need_all_logs=False, platform="falcon", api_key=None)
Arguments
- run_id (str): ID of the run to fetch logs of. Each run has a unique identifier, which is returned in the LaunchResult after launching a job and can also be found on the Runs page of the TensorOpera AI Platform.
- page_num (int): Page number of logs to fetch. Defaults to 1.
- page_size (int): Page size of logs to fetch. Defaults to 10.
- platform (str=falcon): The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch; default is falcon).
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
RunLogResult object with the following attributes:
- run_status (str): Status of the run.
- total_log_lines (int): Total number of log lines.
- total_log_pages (int): Total number of log pages.
- log_line_list (List[str]): Full list of log lines.
- run_logs (FedMLRunLogModelList): Object with attributes like log_lines, log_full_url and log_devices etc.
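The page_num, page_size and total_log_pages fields suggest a simple pagination loop. The sketch below is an illustration under stated assumptions: collect_log_lines and fetch_page are hypothetical names, the line-list attribute is assumed to be spelled log_line_list, and each call is assumed to return only the lines of the requested page. An in-memory fake stands in for the platform so the code runs offline.

```python
from types import SimpleNamespace


def collect_log_lines(fetch_page, page_size=10):
    """Gather log lines from every page.

    fetch_page(page_num, page_size) must return an object exposing
    total_log_pages and log_line_list (assumed attribute names).
    """
    lines = []
    page_num = 1
    while True:
        result = fetch_page(page_num, page_size)
        lines.extend(result.log_line_list or [])
        # Stop once the last reported page has been read.
        if page_num >= (result.total_log_pages or 0):
            break
        page_num += 1
    return lines


# In-memory stand-in for the platform; 25 invented log lines.
ALL_LINES = [f"line {i}" for i in range(1, 26)]


def fake_fetch_page(page_num, page_size):
    start = (page_num - 1) * page_size
    total_pages = -(-len(ALL_LINES) // page_size)  # ceiling division
    return SimpleNamespace(total_log_pages=total_pages,
                           log_line_list=ALL_LINES[start:start + page_size])


collected = collect_log_lines(fake_fetch_page, page_size=10)
print(len(collected))
```

With a real session, fetch_page could wrap fedml.api.run_logs(run_id, page_num=page_num, page_size=page_size). The signature above also shows a need_all_logs flag, which may make such a loop unnecessary; its behavior is not described in this section.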
Cluster APIs
fedml.api.cluster_list()
List clusters associated with your account on TensorOpera AI platform.
fedml.api.cluster_list(cluster_names=(), api_key=None)
Arguments
- cluster_names (Tuple[str]): Tuple of cluster names. Defaults to empty, which means all clusters will be listed.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
FedMLClusterModelList object with the following attributes:
- cluster_list (FedMLClusterModel): Object with the following attributes:
  - cluster_name (str): Name of the cluster.
  - cluster_id (str): Id of the cluster.
  - status (str): Status of the cluster.
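As a hedged illustration of consuming this return value, the sketch below counts clusters per status. It assumes cluster_list can be iterated to yield objects carrying the attributes listed above; count_by_status is a hypothetical helper, and the names, ids and status strings in the stand-in data are invented.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_status(cluster_model_list):
    """Count clusters per status from a FedMLClusterModelList-like object."""
    return Counter(cluster.status for cluster in cluster_model_list.cluster_list)


# Stand-in data; a real object would come from fedml.api.cluster_list().
fake = SimpleNamespace(cluster_list=[
    SimpleNamespace(cluster_name="train-a", cluster_id="c1", status="RUNNING"),
    SimpleNamespace(cluster_name="train-b", cluster_id="c2", status="STOPPED"),
    SimpleNamespace(cluster_name="serve-a", cluster_id="c3", status="RUNNING"),
])

counts = count_by_status(fake)
for status, n in sorted(counts.items()):
    print(f"{status}: {n}")
```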
fedml.api.cluster_exists()
Check whether a cluster with the provided name exists on your account on TensorOpera AI platform.
fedml.api.cluster_exists(cluster_name, api_key=None)
Arguments
- cluster_name (str): Name of the cluster.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the cluster with provided name exists or not.
fedml.api.cluster_status()
Check status of your cluster on TensorOpera AI platform.
fedml.api.cluster_status(cluster_name, api_key=None)
Arguments
- cluster_name (str): Name of the cluster.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Tuple (str(status), FedMLClusterModelList). FedMLClusterModelList is described under fedml.api.cluster_list().
fedml.api.cluster_start()
Start selected clusters on TensorOpera AI platform.
fedml.api.cluster_start(cluster_names: Tuple[str], api_key=None)
Arguments
- cluster_names (Tuple[str]): Tuple of cluster names to start.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the clusters were successfully started or not.
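The cluster_names argument is typed as Tuple[str], and Python's single-element tuple syntax is easy to get wrong: ("my_cluster") is a plain string, while ("my_cluster",) is a tuple. Any code that loops over a bare string visits it one character at a time. The helper below, as_name_tuple, is a hypothetical convenience and not part of fedml; it normalizes either form before a call such as cluster_start().

```python
def as_name_tuple(names):
    """Normalize a single name or an iterable of names into a tuple of str."""
    if isinstance(names, str):
        return (names,)
    return tuple(names)


print(as_name_tuple("my_cluster"))    # ('my_cluster',)
print(as_name_tuple(["a", "b"]))      # ('a', 'b')
print(type(("my_cluster")).__name__)  # str: parentheses alone do not make a tuple
```

The same normalization applies to the other functions on this page that take cluster_names.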
fedml.api.cluster_startall()
Start all existing clusters on your account on TensorOpera AI platform.
fedml.api.cluster_startall(api_key=None)
Arguments
api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the clusters were successfully started or not.
fedml.api.cluster_stop()
Stop selected clusters on TensorOpera AI platform.
fedml.api.cluster_stop(cluster_names: Tuple[str], api_key=None)
Arguments
- cluster_names (Tuple[str]): Tuple of cluster names to stop.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the clusters were successfully stopped or not.
fedml.api.cluster_stopall()
Stop all existing clusters on your account on TensorOpera AI platform.
fedml.api.cluster_stopall(api_key=None)
Arguments
api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the clusters were successfully stopped or not.
fedml.api.cluster_kill()
Kill (Tear Down) selected clusters on TensorOpera AI platform.
NOTE: Kill is different from stop. Once killed, a cluster cannot be restarted.
fedml.api.cluster_kill(cluster_names: Tuple[str], api_key=None)
Arguments
- cluster_names (Tuple[str]): Tuple of cluster names to kill.
- api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the clusters were successfully killed or not.
fedml.api.cluster_killall()
Kill (Tear Down) all existing clusters on your account on TensorOpera AI platform.
NOTE: Kill is different from stop. Once killed, a cluster cannot be restarted.
fedml.api.cluster_killall(api_key=None)
Arguments
api_key (str=None): Your API key from TensorOpera AI platform (if not configured already).
Returns
Boolean indicating whether the clusters were successfully killed or not.
Result Codes
| Code | Name | Message |
|---|---|---|
| 0 | LAUNCH_JOB_STATUS_REQUEST_SUCCESS | LAUNCH_REQUEST_SUCCESS |
| 1 | RESOURCE_MATCHED_STATUS_MATCHED | MATCHED |
| 2 | RESOURCE_MATCHED_STATUS_JOB_URL_ERROR | ERROR_JOB_URL |
| 3 | RESOURCE_MATCHED_STATUS_INVALID_PARAMS | INVALID_PARAMS |
| 4 | RESOURCE_MATCHED_STATUS_BLOCKED | BLOCKED |
| 5 | RESOURCE_MATCHED_STATUS_QUEUED | QUEUED |
| 6 | RESOURCE_MATCHED_STATUS_BIND_CREDIT_CARD_FIRST | BIND_CREDIT_CARD_FIRST |
| 7 | RESOURCE_MATCHED_STATUS_QUERY_CREDIT_CARD_BINDING_STATUS_FAILED | QUERY_CREDIT_CARD_BINDING_STATUS_FAILED |
| 8 | RESOURCE_MATCHED_STATUS_NO_RESOURCES | NO_RESOURCES |
| 9 | RESOURCE_MATCHED_STATUS_REQUEST_FAILED | REQUEST_FAILED |
| 10 | LAUNCH_JOB_STATUS_REQUEST_FAILED | LAUNCH_REQUEST_FAILED |
| 11 | LAUNCH_JOB_STATUS_JOB_URL_ERROR | LAUNCH_ERROR_JOB_URL |
| 12 | LAUNCH_JOB_STATUS_JOB_CANCELED | LAUNCH_ERROR_JOB_CANCELED |
| 13 | LAUNCH_JOB_STATUS_NO_JOBS | LAUNCH_ERROR_NO_JOBS |
| 14 | RESOURCE_MATCHED_STATUS_QUEUE_CANCELED | QUEUE_CANCELED |
| 15 | CLUSTER_CONFIRM_FAILED | CLUSTER_CONFIRM_FAILED |
| 16 | CLUSTER_CREATION_FAILED | CLUSTER_CREATION_FAILED |
| 17 | LAUNCH_JOB_STATUS_INVALID | LAUNCH_JOB_STATUS_INVALID |
| 18 | LAUNCH_JOB_STATUS_BLOCKED | LAUNCH_JOB_STATUS_BLOCKED |
| 19 | APP_UPDATE_FAILED | APP_UPDATE_FAILED |
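
For logging or error handling, the table above can be mirrored as a lookup. This is a sketch transcribed from the Code and Name columns; RESULT_CODES and result_code_name are hypothetical names, and the fedml package may already expose equivalent constants.

```python
# Code -> Name, transcribed from the Result Codes table.
RESULT_CODES = {
    0: "LAUNCH_JOB_STATUS_REQUEST_SUCCESS",
    1: "RESOURCE_MATCHED_STATUS_MATCHED",
    2: "RESOURCE_MATCHED_STATUS_JOB_URL_ERROR",
    3: "RESOURCE_MATCHED_STATUS_INVALID_PARAMS",
    4: "RESOURCE_MATCHED_STATUS_BLOCKED",
    5: "RESOURCE_MATCHED_STATUS_QUEUED",
    6: "RESOURCE_MATCHED_STATUS_BIND_CREDIT_CARD_FIRST",
    7: "RESOURCE_MATCHED_STATUS_QUERY_CREDIT_CARD_BINDING_STATUS_FAILED",
    8: "RESOURCE_MATCHED_STATUS_NO_RESOURCES",
    9: "RESOURCE_MATCHED_STATUS_REQUEST_FAILED",
    10: "LAUNCH_JOB_STATUS_REQUEST_FAILED",
    11: "LAUNCH_JOB_STATUS_JOB_URL_ERROR",
    12: "LAUNCH_JOB_STATUS_JOB_CANCELED",
    13: "LAUNCH_JOB_STATUS_NO_JOBS",
    14: "RESOURCE_MATCHED_STATUS_QUEUE_CANCELED",
    15: "CLUSTER_CONFIRM_FAILED",
    16: "CLUSTER_CREATION_FAILED",
    17: "LAUNCH_JOB_STATUS_INVALID",
    18: "LAUNCH_JOB_STATUS_BLOCKED",
    19: "APP_UPDATE_FAILED",
}


def result_code_name(code):
    """Return the symbolic name for a result code, or a marker for unknown codes."""
    return RESULT_CODES.get(code, f"UNKNOWN_RESULT_CODE_{code}")


print(result_code_name(0))
print(result_code_name(8))
print(result_code_name(99))
```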