How Did Sealos Choose a Gateway for Palworld Private Servers?
Fang Haitao
|
Jan 29, 2024
|
Author introduction: Fang Haitao, founder of Sealos and CEO of Huanjie Cloud Computing
Sealos Public Cloud _(https://cloud.sealos.io)_ has tried nearly every mainstream open-source gateway on the market. This article should serve as a useful reference to help you avoid pitfalls when selecting a gateway.
Complex Scenarios of Sealos Cloud
Since the launch of Sealos Public Cloud, user growth has been explosive, with 87,000 registered users to date. Each user creates applications, and each application needs its own access entry point, producing a massive number of routes across the cluster. The gateway must therefore scale to hundreds of thousands of Ingress entries.
In addition, providing shared-cluster services on the public network places extremely high demands on multi-tenancy: tenants' routes must not interfere with one another, which requires excellent isolation and traffic-control capabilities.
The attack surface of a public cloud is very large. Attackers target not only the user applications running on the cloud but also the platform's egress network directly, presenting significant security challenges.
The controller also faces high performance and stability requirements: with a large number of routes, resource consumption can spike, even triggering OOM kills that crash the gateway.
Exclusion of Nginx Ingress
We initially used Nginx Ingress but eventually found several core issues we could not resolve:
The reload issue: every Ingress change triggers a reload that briefly drops connections. As user clusters grow, Ingress creation and changes become frequent, often leading to network instability.
Unstable long connections: the same reloads frequently drop long connections that are in active use.
Poor performance: slow propagation of changes and high resource consumption.
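The reload problem above can be sketched with a toy model. This is purely illustrative, not how either proxy is implemented: a full-reload gateway (Nginx-style) replaces its workers on every config change, dropping in-flight long connections, while an incremental-update gateway (Envoy xDS-style) applies deltas and leaves connections untouched.

```python
# Toy model: full config reload vs. incremental route updates.
class ReloadingGateway:
    """Applies every route change by reloading the whole config."""
    def __init__(self):
        self.routes = {}
        self.open_connections = set()
        self.dropped = 0

    def connect(self, conn_id):
        self.open_connections.add(conn_id)

    def add_route(self, host, backend):
        self.routes[host] = backend
        # Full reload: old workers retire, so long connections drop.
        self.dropped += len(self.open_connections)
        self.open_connections.clear()

class IncrementalGateway(ReloadingGateway):
    """Applies route changes as deltas; connections survive."""
    def add_route(self, host, backend):
        self.routes[host] = backend  # no reload, nothing is dropped

nginx_like = ReloadingGateway()
envoy_like = IncrementalGateway()
for gw in (nginx_like, envoy_like):
    gw.connect("ws-1")
    gw.connect("ws-2")
    gw.add_route("app.example.com", "10.0.0.1:8080")

print(nginx_like.dropped, len(nginx_like.open_connections))  # 2 0
print(envoy_like.dropped, len(envoy_like.open_connections))  # 0 2
```

With frequent Ingress changes, the "dropped" counter in the first model keeps climbing, which is exactly the instability we observed.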
This effectively ruled out most gateways built on Nginx. Our testing showed that Envoy-based gateways perform far better, with much lower resource consumption on both the control plane and the data plane.
Envoy's resource usage: (screenshot omitted)
Nginx's resource usage: (screenshot omitted)
The difference is dramatic, so we ruled out the Nginx family and fully embraced Envoy.
About APISIX
APISIX is an excellent project that solves Nginx's reload problems, so we used it early on for Laf. Unfortunately, the APISIX Ingress Controller is not very stable: the control plane crashed multiple times, causing serious outages, and we also ran into controller OOM issues. We genuinely wanted to keep using it but were ultimately forced to move away. The APISIX community is actively following up on these problems, and we hope they will be resolved.
In summary: APISIX itself is quite stable, but the controller still needs substantial optimization and stability work. Community support is strong, but the production issues we faced were too urgent to wait for the community's iteration pace, so for now we switched to other gateways.
Cilium Gateway
Sealos switched its CNI to Cilium early on, and it is indeed powerful, so we considered using Cilium for the gateway as well, but reality proved challenging.
Cilium Gateway only supports LB mode, which depends heavily on the cloud vendor's load balancer; since we also have on-premises deployments, we want to avoid that coupling. With a large number of routes, we also hit very slow Ingress propagation, on the order of minutes, which makes for a poor user experience; we consider anything within 5 seconds acceptable. Our conclusion: wait for the project to mature.
Envoy Gateway
As Kubernetes standards evolve, the ecosystem will gradually shift from Ingress to the Gateway API. Since our underlying implementation leans toward Envoy, Envoy Gateway looked like a very good choice. However, when we evaluated it, the project was still in its early stages, and we hit unstable bugs: OOM, path policies not taking effect, and some features not working in merge-gateway mode. We are actively helping resolve these issues and contributing suggestions and patches upstream, hoping it reaches production readiness in the future.
The Gateway standard: impressive on paper, less practical in use
The Gateway API's situation is awkward. It seems to me that its designers have not really operated in multi-tenant scenarios. When multiple tenants share a cluster, administrator and user permissions must be clearly separated, and the Gateway's original design did not fully consider this. For example:
Configurations such as listening ports should belong to cluster administrators, while TLS certificate configuration belongs to specific applications: administrators may configure certificates, but primarily each user should configure their own. The Gateway resource does not separate these permissions, so users also need permission to configure the Gateway, which forces the controller to implement many fine-grained permission controls, such as port whitelists and conflict detection.
I personally think a more elegant design would sink the tenant-level fields into HTTPRoute or a separate CRD, drawing a much clearer line between ordinary users and super administrators. The current approach can work, but it mixes the two concerns.
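The permission split argued for above can be sketched as a small admission check: the administrator owns the Gateway (listeners and ports), tenants only create HTTPRoutes, and the controller validates each route against a listener whitelist and a hostname-ownership table. The resource shapes loosely follow the Gateway API, but the field and function names here are illustrative, not any real controller's implementation.

```python
# Sketch: admin owns the Gateway; tenants submit HTTPRoutes, which are
# validated against listener whitelists and hostname ownership.
ADMIN_GATEWAY = {
    "kind": "Gateway",
    "listeners": [{"name": "https", "port": 443, "protocol": "HTTPS"}],
}

claimed_hostnames = {}  # hostname -> owning tenant, for conflict detection

def admit_httproute(route, tenant):
    """Reject routes that target unknown listeners or claimed hostnames."""
    listener_names = {l["name"] for l in ADMIN_GATEWAY["listeners"]}
    if route["parentListener"] not in listener_names:
        return False, "unknown listener"
    for host in route["hostnames"]:
        owner = claimed_hostnames.setdefault(host, tenant)
        if owner != tenant:
            return False, f"hostname {host} already claimed"
    return True, "ok"

ok, _ = admit_httproute(
    {"parentListener": "https", "hostnames": ["a.cloud.sealos.io"]}, "user-a")
print(ok)  # True
ok, reason = admit_httproute(
    {"parentListener": "https", "hostnames": ["a.cloud.sealos.io"]}, "user-b")
print(ok, reason)  # False hostname a.cloud.sealos.io already claimed
```

With this split, ordinary users never need write access to the Gateway resource itself, and the tenant-facing surface reduces to one CRD plus a validation webhook.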
Ultimately, Higress wins
In addition to the major projects mentioned above, we've tested many other projects, but I won’t list them all here. Sealos ultimately chose Higress.
Our logic for choosing a gateway is straightforward: meet the functional requirements while being stable enough. The final choice of Higress was essentially a process of elimination.
Stability comes first. In our scenario, Higress is currently the only gateway that reaches production usability. Some problems did arise in practice, but the Higress community has been very responsive and resolved them quickly. The major issues were:
1. Slow Ingress propagation: with many routes, new routes could take over 2 minutes to take effect. The community eventually optimized this to around 3 seconds, which is close to the practical limit, since it is already faster than a container's readiness time. Higress uses an incremental configuration-loading mechanism, which keeps performance impressive even with a vast number of routes.
2. Controller OOM: without incremental loading, resource consumption could be high enough to trigger OOM. These problems have all been resolved.
3. Timeout issues: onDemandRDS, a configuration parameter that further reduces loading latency, occasionally causes request timeouts in our main cluster. We currently have it turned off while investigating the cause; the issue has not appeared in our other clusters.
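The value of incremental loading in point 1 can be shown with back-of-the-envelope arithmetic: with N existing routes, regenerating the full configuration on every change does O(N) work, while pushing only the delta does O(1). The numbers below are operation counts in a toy model, not real gateway measurements.

```python
# Toy cost model: full config regeneration vs. incremental delta pushes.
N_EXISTING_ROUTES = 100_000  # on the order of Sealos's route count
N_CHANGES = 10               # a small burst of Ingress updates

full_reload_ops = 0
incremental_ops = 0
for _ in range(N_CHANGES):
    full_reload_ops += N_EXISTING_ROUTES  # regenerate every route entry
    incremental_ops += 1                  # push only the changed entry

print(full_reload_ops)  # 1000000
print(incremental_ops)  # 10
```

At hundreds of thousands of routes, that five-orders-of-magnitude gap is the difference between minute-level and second-level propagation.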
On security: many of our outages are ultimately performance problems, most commonly excessive traffic overloading the gateway, so gateway performance is critical. Testing shows Envoy is significantly more robust, and the quality of the controller implementation matters just as much. Higress performs exceptionally well here:
(Screenshot of Higress resource usage omitted.)
Even with our massive route count and extremely high concurrency, the resources required are surprisingly small.
Higress is also compatible with Nginx Ingress syntax, mainly the common annotations. Since our earlier code used Ingress, the migration cost was nearly zero, and the upgrade takes just a few minutes.
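To illustrate why the migration cost is near zero: an existing Ingress can keep its `nginx.ingress.kubernetes.io/*` annotations and only change the ingress class. The manifest below is modeled as a Python dict for illustration; the hostnames and service names are made up, and support for any specific annotation should be checked against the Higress documentation.

```python
# An Ingress written for Nginx Ingress, migrated by swapping the class.
ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {
        "name": "my-app",
        "annotations": {
            # Annotations originally written for Nginx Ingress, unchanged:
            "nginx.ingress.kubernetes.io/rewrite-target": "/",
            "nginx.ingress.kubernetes.io/ssl-redirect": "true",
        },
    },
    "spec": {
        "ingressClassName": "higress",  # the only field that changes
        "rules": [{
            "host": "my-app.cloud.sealos.io",
            "http": {"paths": [{
                "path": "/",
                "pathType": "Prefix",
                "backend": {"service": {
                    "name": "my-app", "port": {"number": 80}}},
            }]},
        }],
    },
}

print(ingress["spec"]["ingressClassName"])  # higress
```

Because the annotations are interpreted rather than rewritten, rolling back to Nginx Ingress is equally cheap, which de-risks the switch.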
Likewise, to help the community develop, we also gave Higress some feedback:
Better support for the Gateway standard: although v1 is supported, its capabilities are not yet fully on par with the Ingress implementation.
Open up more of the powerful features, such as security and circuit breaking, so that the open-source and commercial offerings combine more closely. We do not oppose paid features, but as the platform grows we need stronger capabilities.
Extend peripheral functionality through the plugin mechanism, keeping the core more cohesive and easier to depend on.
Conclusion
The gateway is a core component for clouds and applications. As Sealos continues to scale, new challenges will keep arising. We hope to cooperate closely with upstream and downstream communities so that open-source gateways can develop better and benefit more developers.
Many of the gateways above are excellent; that Sealos didn't adopt them doesn't mean they aren't impressive. Our scenario is complex and unusual, and few gateways can support multi-tenancy on the public internet, so readers should weigh their own context; our selection is only a reference. Sealos will keep an open mind and continue following the development of other gateways.
Finally, we sincerely thank the Higress open-source community for their strong support, and we applaud the Alibaba Cloud-native team for open-sourcing such an excellent project for the benefit of the wider community.