Here is the list of features that we intend to deliver progressively up to our v1:
During the beta phase, we will not charge you anything beyond the cost of the GPUs you rent through us. Once the beta has ended, we intend to charge customers for our value add: we will charge based on the savings our customers realize through our caching technique. This model ensures that while you enjoy more GPU power for the same spend, we also have an incentive to keep improving our techniques. During the beta, you will be shown what those savings are, but you will not be charged for them.

The formula we will use for pricing once the product reaches v1 is as follows:

GPU = Number of GPU hours consumed
GPH = Price per GPU hour for the chosen provider
EST = Estimated saving based on the cache hit rate reported

Pricing: (GPU * GPH) + (GPU * GPH * EST * 0.3)
Baseline: (GPU * GPH) + (GPU * GPH * EST)

where Baseline is the estimated cost if you served the same workload yourself, renting the GPU servers directly from the cloud.

Example:
GPU = 100h
GPH = $2
EST = 60%
Customer pays: 200 + ( 200 * 0.6 * 0.3 ) = $236
Baseline: 200 + ( 200 * 0.6 ) = $320
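The pricing and baseline formulas above can be sketched in a few lines of Python. This is an illustrative calculation only; the function names are ours, not part of any Tensormesh API.

```python
def pricing(gpu_hours: float, gph: float, est: float) -> float:
    """Amount the customer pays: raw GPU cost plus 30% of the estimated savings."""
    base_cost = gpu_hours * gph
    return base_cost + base_cost * est * 0.3

def baseline(gpu_hours: float, gph: float, est: float) -> float:
    """Estimated cost of serving the same workload yourself, without caching."""
    base_cost = gpu_hours * gph
    return base_cost + base_cost * est

# Worked example from the text: 100 GPU hours at $2/hour, 60% estimated saving.
print(pricing(100, 2, 0.6))   # 236.0
print(baseline(100, 2, 0.6))  # 320.0
```

With the example numbers, the customer pays $236 instead of a $320 baseline, i.e. keeps 70% of the $120 saved by caching.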
We estimate the cost saving based on the cache hit rate. Every time the cache is hit, it is counted as GPU time saved.
We consider the cache to be hit when cache stored outside of the GPU VRAM is pulled back into the GPU.
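As a rough sketch of how a reported cache hit rate could map to the EST figure used in the pricing formula, each hit counts as GPU time saved, so EST is the fraction of lookups served from cache. The counter names and accounting here are illustrative assumptions, not Tensormesh's actual implementation.

```python
def estimated_saving(cache_hits: int, total_lookups: int) -> float:
    """EST: fraction of GPU time saved, counting each cache hit as time saved."""
    if total_lookups == 0:
        return 0.0  # no traffic yet, so no measurable saving
    return cache_hits / total_lookups

# e.g. 60 hits out of 100 lookups gives the 60% EST used in the example above
print(estimated_saving(60, 100))  # 0.6
```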
We’re gradually onboarding new users in batches to ensure the smoothest possible experience as we scale. You’ll receive an email as soon as it’s your turn to join.
Please let us know on Discord or via the feedback tool which GPU provider you would like to see added.
We plan to offer an on-prem version of Tensormesh shortly after our v1 is released.
The first beta version does not do this, but it is on our roadmap to deliver it before our v1.
Yes, Tensormesh includes cache-aware routing.
No, at this time Tensormesh is a full inference-stack experience. You can still use LMCache with its supported list of inference servers, but that is outside our product offering.
This functionality will be added to the beta before we reach the final release.
Tensormesh v1 will be available on-prem. Feel free to contact us for more details.