How to pay off gateway technical debt? Layering, plugins, and a unified stack: the three-piece set.

higress | Jan 7, 2025

When teams and personnel change, a lot of technical debt inevitably accumulates. Standardization is one way to address it, and layered architecture, plugins, and a unified technology stack are relatively easy paths to implement.

Like most enterprises, Zhengcaiyun carries a lot of gateway technical debt for various historical reasons: the container gateway has a huge number of configurations and diverse configuration methods, creating significant operational pressure; five different open-source gateways coexist, raising collaboration and iteration costs; and continuous business demands keep piling business logic onto the gateway, making risk hard to manage.

01.

Project Background of the Business Gateway

Due to historical reasons, the Zhengcaiyun platform has run into several issues in gateway construction:

  • The container gateway has a large number of configurations and diverse configuration methods, leading to high operational pressure

The abundance of configurations comes from the fact that container gateway configuration spans several routing types: service routing, construction-type routing, and return/rewrite routing. The microservice architecture produces a large number of services, and the technical approach of the construction-type platforms produces many subdomains, so the complexity of gateway configuration grows to M x N (where M is the number of services and N is the number of domain names). For example, with about 400 subdomains and over 500 services, the total configuration volume reaches roughly 200,000+. Construction-type platform subdomains each define root-path forwarding independently, and the publishing path of each page can be filled in arbitrarily, so the gateway configuration has to carry the mapping from every constructed page to its path. The return/rewrite logic is usually generated during migrations of business services, and the old requests have never been phased out. Some gateway rules are managed per service while others are managed per domain, so routing for a given service is not unique and gateway rules often fail to take effect. Operations then has to investigate which conflicting configuration is responsible, and a single change can also affect whether other routing rules take effect.

  • The technology stack for gateways is relatively diverse, making operational maintenance difficult and collaboration harder during iterations

The traffic gateway uses OpenResty, the container gateway uses Istio Ingress Gateway, the new open platform/business gateway uses Spring Cloud Gateway, the old open platform uses Kong, the construction gateway is built on Node.js Express, APISIX is used for the cross-network gateway, and so on. Although different scenarios belong to different development teams, the variety of technology-stack choices increases the daily maintenance workload and creates many problems for future capability building.

  • Continuous business demands pile up more and more logic on the gateway, making risk management impossible

As the business develops, business development teams have growing demands for extensible gateway capabilities such as authentication, security control, and request-body modification, which causes more and more business logic to pile up on the gateway. Some of it is coupled directly into the code, while other parts take the form of plugins, introducing many potential risks.

Based on the problems above, at the beginning of this year the Infrastructure Group of Zhengcaiyun Co., Ltd. initiated the business gateway project. In the project's initial phase, the infrastructure engineers conducted a requirements survey, which ultimately produced the following requirements:

1) Govern gateway configuration. Gateway configuration governance needs to address the difficulty and high risk of configuration changes.

2) Unify the gateway technology stack as much as possible. Against the backdrop of cost reduction and efficiency improvement, a unified technology stack benefits operational efficiency. A unified gateway stack must cover the business scenarios of most existing services, which today include request forwarding, request rewriting, protocol conversion, and plugin extension (for example, the previous business gateway integrated the Alibaba Cloud human-machine verification plugin), and must also support sustainable evolution toward capabilities such as security control, traffic governance, and traffic coloring.

After a month of demand investigation and solution testing, the infrastructure team engineers finally proposed the following solutions:

  • Governance of gateway configuration

  • Strengthen specifications. Based on current business scenarios and potential future scenarios, we optimized the domain usage specifications and gateway rule configuration specifications, and integrated them into the daily change process to reinforce their enforcement.

  • Configure gateway rules in layers. Gateway configuration is split into a domain-name layer and a service layer according to the responsibilities of different engineers: operations engineers handle the domain-name layer, while development engineers are responsible for the service-layer configuration.

  • Manage gateway rules around applications. Applications already own resources and a lifecycle (going online and offline), so service-layer gateway rules are bound to applications and become an attribute of the application in the metadata center. The lifecycle of service-layer gateway rules is aligned with that of the application.

  • Unify the gateway technology stack, which comes down to the selection of a gateway service.


02.

Selection of Gateway Services

Unifying the technology stack is a goal to be achieved gradually, but given the urgency of the issues at hand, the container gateway and the business gateway needed to be merged first so that a single technology stack could solve the problems of gateway configuration governance and capability extension. After comparing dimensions such as community activity and capabilities, we shortlisted three gateway services: APISIX, Higress, and Istio (Ingress Gateway).

Higress vs. Istio (Ingress Gateway)

Higress builds higher-level encapsulations on top of Istio's capabilities to improve ease of use and cover more business scenarios. With Higress, we gain the following benefits:

  • Stronger plugin capabilities. Higress encapsulates the open-source Wasm plugin SDK, making it more convenient for developers to write plugins and to control the scope in which a plugin takes effect, and the many plugins available in the community can be used directly.

  • Extended functionality. Higress integrates multiple data sources, including the mainstream registry centers, so these capabilities can be reused without rebuilding them; it also provides http2rpc capability based on Envoy filters and encapsulates a dedicated CRD to simplify http2dubbo configuration.


Higress vs. APISIX

APISIX is an excellent cloud-native gateway built on OpenResty, and it is already used in the company's expressway project for cross-network request forwarding. However, the most heavily used gateway scenario today is the container gateway, which is configured through VirtualService resources, and our internal systems have already built solutions around VirtualService. Native support for VirtualService reduces the migration workload, so from the standpoint of migration convenience we chose Higress.

Why Use Istio CRD

During technical selection, we kept coming back to one question: why continue using Istio CRD instead of Kubernetes Ingress or the Gateway API as the configuration scheme for the business gateway? Would choosing a standard specification simplify ongoing capability building for the gateway?

Kubernetes Ingress provides basic routing capability, but it is too simple for our usage scenarios. If we replicated our gateway configuration into Kubernetes Ingress, a large amount of behaviour would have to be expressed through annotations, and the overall configuration would inflate many times over: in our usage pattern each host ends up with its own Ingress resource and its own annotation block, so multiple hosts mean multiple Ingress resources. Within Istio CRD, by contrast, a single VirtualService resource can cover the same hosts with a simple, highly readable configuration.
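To make the inflation concrete, here is a rough sketch, assuming the one-Ingress-per-host pattern and made-up host and service names; it is only an illustration of how the resource counts diverge, not how the actual comparison was done:

```go
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPerHostIngresses illustrates why the Ingress model inflates in our usage
// pattern: every subdomain gets its own Ingress object, each carrying its own
// annotation block, so the resource count grows with the number of hosts.
func buildPerHostIngresses(hosts []string, service string) []networkingv1.Ingress {
	pathType := networkingv1.PathTypePrefix
	out := make([]networkingv1.Ingress, 0, len(hosts))
	for i, host := range hosts {
		out = append(out, networkingv1.Ingress{
			ObjectMeta: metav1.ObjectMeta{
				Name: fmt.Sprintf("route-%d", i),
				// Rewrite/return behaviour would have to live in
				// controller-specific annotations, duplicated per resource.
				Annotations: map[string]string{},
			},
			Spec: networkingv1.IngressSpec{
				Rules: []networkingv1.IngressRule{{
					Host: host,
					IngressRuleValue: networkingv1.IngressRuleValue{
						HTTP: &networkingv1.HTTPIngressRuleValue{
							Paths: []networkingv1.HTTPIngressPath{{
								Path:     "/",
								PathType: &pathType,
								Backend: networkingv1.IngressBackend{
									Service: &networkingv1.IngressServiceBackend{
										Name: service,
										Port: networkingv1.ServiceBackendPort{Number: 80},
									},
								},
							}},
						},
					},
				}},
			},
		})
	}
	return out
}

func main() {
	hosts := make([]string, 400) // ~400 construction-platform subdomains
	for i := range hosts {
		hosts[i] = fmt.Sprintf("site-%d.example.com", i)
	}
	ingresses := buildPerHostIngresses(hosts, "frontend-svc")
	// With Istio CRD, a single VirtualService could list all of these hosts in
	// its hosts field and keep the shared route list in one readable place.
	fmt.Printf("Ingress objects needed: %d; VirtualService objects needed: 1\n", len(ingresses))
}
```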

As the business develops, the capability demands on the gateway side are increasing, such as traffic governance, plugin expansion capabilities, and mesh capabilities, which Ingress cannot satisfy.

The Kubernetes community defined the Gateway API to address the problems with Ingress. Although the Gateway API has reached version 1.1, its capabilities are still not as rich as Istio CRD, and it has compatibility issues with older Kubernetes versions.

Ultimately, we continue to choose Istio CRD as the resource for gateway configuration.

03.

Community Support

[Figure: Higress architecture diagram]

As the Higress architecture diagram shows, the Higress data plane still uses Envoy for request forwarding and the Istio Pilot component for configuration discovery. Since Istio CRD is a resource type that Istio supports by default, using Higress + Istio CRD should, in theory, go smoothly.

During the migration process, we encountered some issues, and I'm sharing these for reference:

  • The domain-name aggregation feature did not take effect. To speed up configuration changes, Higress enables enableSRDS by default, which scopes Envoy configuration by domain name so that changes take effect in isolation. Our scenario has a particularly large number of subdomains pointing at the same routes, and the previous gateway configuration relied on merging by domain name, so switching to Higress caused a significant increase in configuration volume; turning off enableSRDS resolved the issue.

  • Configuration propagation was particularly slow. With a large volume of gateway configuration, it could take minutes for changes to take effect in Higress. The Higress community team investigated and optimized this logic, and configuration propagation in Higress eventually came down to seconds.

  • Conflicting gateway rules took effect in an unpredictable order in Istio. Istio versions below 1.22.0 have an issue where, when conflicting gateway rules are pushed, the result can become inconsistent under specific conditions. Fortunately, I discovered this while migrating gateway rules; when I reported the bug to the Istio community, they had just fixed it in version 1.22.0. The Higress community team quickly helped merge the fix commit and released it in Higress 1.4.1.

  • Under the same configuration, Higress used more memory. According to monitoring, with identical configuration the memory usage of the Higress gateway was noticeably higher than that of istio-ingressgateway. After the Higress community team refined the Higress gateway's monitoring metrics, the two reached rough parity in resource consumption.


Throughout these problems, the Higress community responded proactively, generally providing a solution on the same day a problem could be reliably reproduced. The slow configuration propagation was specific to our scenario and hard for them to reproduce locally, yet they still persisted and resolved it that same day. With this strong support from the Higress community, all of the problems were resolved smoothly, which shows that replacing Istio with Higress + Istio CRD is a migration whose overall progress is controllable.

Higress community members also respond very efficiently to the questions we raise. While reading the Higress code and learning plugin development, we often post questions in the Higress community group, and they are answered promptly.

Besides introducing Higress to replace the ingress gateway as the business gateway, the business gateway project has also focused on gateway rule governance. Below, I'll introduce our layered approach to gateway rules, which is closely tied to our business, for reference.

04.

Layered Solution Ideas for Gateways

For gateway rule governance, in line with the layered configuration requirement, we defined a CRD for application routing (appvirtualservice, abbreviated as avs) and split routing configuration into two layers: the domain-name layer and the service layer. The domain-name layer corresponds to the original virtualservice resources, while the service layer corresponds to avs resources; an avs carries no domain-name information and is decoupled from domains. The gateway rules that developers operate on are avs resources, which ultimately take effect in the virtualservice resources.

Example of avs resource:
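A minimal sketch of what the avs definition might look like, written as Go types in the Kubernetes CRD style; every field name here (AppName, PriorityClass, Routes, and so on) is an illustrative assumption rather than the actual schema:

```go
// Hypothetical sketch of the AppVirtualService (avs) CRD types; field names
// are illustrative assumptions, not the actual schema used in the project.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// AppVirtualService is a service-layer routing rule bound to an application.
// It carries no host/domain information; domains live in the domain-name
// layer (the original virtualservice resources managed by operations).
type AppVirtualService struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              AppVirtualServiceSpec `json:"spec"`
}

// AppVirtualServiceSpec holds the service-layer routes for one application.
type AppVirtualServiceSpec struct {
	// AppName binds the rule to an application in the metadata center, so the
	// rule's lifecycle follows the application's (online/offline).
	AppName string `json:"appName"`
	// PriorityClass controls the order in which rules take effect:
	// "high", "medium" (default), or "low".
	PriorityClass string `json:"priorityClass,omitempty"`
	// Routes are host-agnostic HTTP routes for this application.
	Routes []AppRoute `json:"routes"`
}

// AppRoute forwards a path prefix to a backend service, optionally rewriting
// the URI, without referencing any domain name.
type AppRoute struct {
	PathPrefix  string `json:"pathPrefix"`        // e.g. "/item/"
	Destination string `json:"destination"`       // target service name
	Port        uint32 `json:"port,omitempty"`    // target service port
	Rewrite     string `json:"rewrite,omitempty"` // optional URI rewrite
}
```

In this model, the service-layer routes are ultimately merged into the domain-name-layer virtualservice resources, ordered by PriorityClass.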



Based on practical usage scenarios, the avs definition also includes a PriorityClass field (three levels: high, medium, and low, with medium as the default) that determines the order in which rules take effect, which alleviates the previous lack of control caused by rule order depending on creation time.

Layering the gateway required systematically reorganizing the nearly 2,000 virtualservice configurations that had previously been maintained by hand, which has a significant impact on the overall routing logic. To make the whole process repeatable, we developed tooling to split the original virtualservice logic into the two-layer virtualservice + avs structure.

Based on the original routing rules, we divided routes into default domain routes, service routes (conflicting and non-conflicting), and rr routes (return/rewrite routes). Default domain routes, rr routes, and conflicting service routes are tied to domain names and are maintained by gateway operations staff, while non-conflicting service routes (about 90% of the total) are converted into avs resources and handled by developers through self-service pages. By this estimate, the maintenance cost of gateway rules (troubleshooting effort and error probability) dropped by roughly 90% after layering.

With layering solved, two questions remain: how do we verify data consistency before and after the change, and how do we execute the change?

Data Consistency Verification

To verify data consistency during the routing change, we collect logs of the original traffic forwarding, replay the traffic into a test environment running the layered gateway, collect the forwarding logs there, compare the two, and output a difference report. If the same request (same domain name + URL) is forwarded to different upstream clusters, the layering changed the original forwarding logic, and either the layering logic needs optimization or the case needs manual handling.
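As a rough illustration of this comparison step, the sketch below assumes JSON-lines log files and field names (key, domain, url, upstream) that are not the actual tooling's format; it maps each unique request key to its upstream cluster in both log sets and reports mismatches:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// accessLogEntry mirrors the fields assumed to be collected per request:
// the unique key, domain, URL, and the upstream cluster the gateway chose.
type accessLogEntry struct {
	Key      string `json:"key"`      // md5(domain + URL without params)
	Domain   string `json:"domain"`
	URL      string `json:"url"`
	Upstream string `json:"upstream"` // backend service/cluster
}

// loadUpstreams reads a JSON-lines log file and maps unique key -> entry.
func loadUpstreams(path string) (map[string]accessLogEntry, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	m := make(map[string]accessLogEntry)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var e accessLogEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // skip malformed lines
		}
		m[e.Key] = e
	}
	return m, sc.Err()
}

func main() {
	before, _ := loadUpstreams("before.jsonl") // logs from the original gateway
	after, _ := loadUpstreams("after.jsonl")   // logs replayed against the layered gateway

	// Any key whose upstream changed means the layering altered the forwarding
	// logic and needs manual handling or a fix in the splitting tool.
	for key, old := range before {
		if repl, ok := after[key]; ok && repl.Upstream != old.Upstream {
			fmt.Printf("DIFF %s %s%s: %s -> %s\n",
				key, old.Domain, old.URL, old.Upstream, repl.Upstream)
		}
	}
}
```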

Two points need attention when developing these tools (a rough sketch follows the list):

1) Deduplicated traffic collection. Traffic collection uses the md5 of domain name + URL (excluding parameters) as a unique key. Request URLs that embed variable ID paths, such as /item/123445, are normalized by regular expression into a single request. The detailed traffic is written to log files with fields such as domain, URL, backend service, and requestId for traffic replay and difference verification.

2) Replay traffic with HEAD requests. The purpose of replay is to verify that the forwarding logic is correct. To avoid affecting production traffic, every request is converted to a HEAD request, with parameters and request bodies removed, before replay.
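A minimal sketch of the two points above; the ID-normalization pattern, key derivation, and test gateway address are illustrative assumptions rather than the real tooling:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/url"
	"regexp"
)

// idPath collapses numeric ID segments such as /item/123445 into /item/{id},
// so requests that differ only by ID map to the same unique key.
var idPath = regexp.MustCompile(`/\d+(/|$)`)

// uniqueKey builds the md5 of domain + URL path (query parameters excluded).
func uniqueKey(domain, rawURL string) string {
	path := rawURL
	if u, err := url.Parse(rawURL); err == nil {
		path = u.Path // drop query parameters
	}
	normalized := idPath.ReplaceAllString(path, "/{id}$1")
	sum := md5.Sum([]byte(domain + normalized))
	return hex.EncodeToString(sum[:])
}

// replayHead re-issues the request as HEAD against the test environment, with
// no parameters or body, so forwarding logic can be checked without side effects.
func replayHead(testGateway, domain, path string) (int, error) {
	req, err := http.NewRequest(http.MethodHead, testGateway+path, nil)
	if err != nil {
		return 0, err
	}
	req.Host = domain // preserve the original Host header for routing
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	key := uniqueKey("www.example.com", "/item/123445?from=search")
	fmt.Println("unique key:", key)
	if code, err := replayHead("http://test-gateway.internal", "www.example.com", "/item/123445"); err == nil {
		fmt.Println("replay status:", code)
	}
}
```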

Change Strategy

We create a new gateway instance and apply the layered gateway rules to it. On the load-balancer side, traffic to the new gateway instance is ramped up gradually: 1% -> 5% -> 10% -> 20% -> 50% -> 100%.

05.

Initial Construction Achievements

The initial outcomes of the business gateway project are quite remarkable:

  • Seamless Experience. The business gateway changes were essentially transparent to business development teams; at most they were aware that the changes were taking place.

  • Gateway Rule Governance Goals Achieved. Gateway rules are layered into a domain-name layer and a service layer: the domain-name layer holds the basic configuration, while the service layer is managed around applications, meeting the overall goals of gateway rule governance.

  • Unified Technology Stack. The business gateway (Higress) has replaced the container gateway (Istio Ingress Gateway), the old open platform (Kong), the cross-network gateway (APISIX), and the old business gateway (Spring Cloud Gateway), freeing the resources those services occupied and reducing ongoing maintenance costs.


Introducing Higress is only the beginning; next we will build out the open platform (authentication and authorization, HTTP to Dubbo, etc.) and integrate unified authentication capabilities, ultimately reaching the goal of a unified gateway technology stack.

About Zhengcaiyun: Zhengcaiyun Co., Ltd. was established on August 8, 2016. The company leverages leading global digital technologies, such as cloud computing, big data, and artificial intelligence, to create China's first government procurement cloud service platform—the Zhengcaiyun platform, which has become an integrated procurement cloud service platform with the widest service range, the most users, and the most active transactions across regions, layers, and fields in the industry.

Contact

Follow and engage with us through the following channels to stay updated on the latest developments from higress.ai.
