Monday, October 9, 2017

Flex5: A New Scalable Backend

Reposting this from our main company blog at:

Hello, everyone! This is Roger Diller, Technical Lead here at Flex. Let’s discuss Flex5 and its new scalable backend. This aspect is perhaps more important than the new UI as it lays down the foundation for Flex5.

We have been methodically crafting the Flex5 strategy for over a year now. It all began with the knowledge that Flash was on its way out and we needed to rewrite the UI in order to stay relevant. After quite a bit of discussion, we settled on a “mobile first” strategy for two main reasons.

Adopting the mobile first strategy

First, we didn’t have a large mobile presence outside of our warehouse iOS app, so gaining a stronger mobile presence was important to us. Second, we knew that if we started with a less complex mobile UI, we could more easily focus on our design there and then scale up to desktop, rather than the other way around. We decided to start with a tablet UI and later move on to phone and desktop UIs. It took a lot of work to come up with a tablet UI design that would transition well to phone. By late spring of 2016, we had completed the bulk of the mobile design strategy. We began experimental coding of a tablet app and started a new REST API inside the existing Flex4 backend.

After the success of our tablet UI design and app proof of concept, we still had concerns about the Flex4 backend. Could it stand up as a backend for the long term? We weren’t sure. We thought we could possibly refactor our way to a faster and more scalable Flex4 backend. In the fall of 2016, we made substantial improvements to Flex4 performance, but we could see we weren’t going to be able to reach the level of performance necessary to support future growth.

Rewriting the Flex backend architecture

It took some time to figure out a solid approach for rewriting the backend. How could we accomplish that without a risky all-or-nothing rewrite? In late 2016, we came up with a proof of concept Flex5 backend that would coexist with the Flex4 backend.

Let me take a moment to explain the key difference between the Flex4 and Flex5 backends. Flex4 was designed to run as one process per customer. This means we don’t have any way to run a second process for a customer to provide service redundancy in case one process goes down.

The Flex5 backend, on the other hand, is a cluster of at least two Flex5 processes that work together via a load balancer to provide the Flex service. This means one process can fail and the other process will still be there to provide the service without the user knowing anything happened.

Improved reliability and performance

With this new architecture, we will be able to horizontally scale the Flex5 service. If demand goes up, we can add new application servers to the cluster to handle the load. This is huge and will allow us to support the high demand that Flex5 is going to bring. 

By early 2017, we had gained confidence that the new backend was the way forward. We began to incrementally build new APIs in the Flex5 backend while continuing to call “hard to rebuild” APIs in Flex4 (such as search and availability) until our schedule allows us to build them in the new backend.

We didn’t know it at the time, but later realized we were following the “strangler” rewrite pattern. It sounds kind of strange, but basically it means the new application grows beside or around an existing one, and over time it takes over more and more of the work until eventually the old system is not used at all. This dramatically reduces risk and allows access to the new system much sooner and throughout the migration process. An important point I want to emphasize is that you will be able to use Flex4 and Flex5 side by side until the feature migration is complete. This means both systems point to the same database, so changes in one system are visible in the other.
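As a rough illustration of the strangler pattern, the Python sketch below puts a small facade in front of a legacy backend. Endpoints that have been rebuilt are served by the new handlers, and everything else still falls through to the old system. The class name, endpoint names, and response shapes are all hypothetical and chosen only for this example.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


class StranglerFacade:
    """Routes each call to the new backend if migrated, else to the legacy one."""

    def __init__(self, legacy: Callable[[str, dict], dict]):
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, endpoint: str, handler: Handler) -> None:
        # Each migrated endpoint takes one more piece of work from the old system.
        self._migrated[endpoint] = handler

    def call(self, endpoint: str, payload: dict) -> dict:
        handler = self._migrated.get(endpoint)
        if handler is not None:
            return handler(payload)
        # Not rebuilt yet: delegate to the legacy backend unchanged.
        return self._legacy(endpoint, payload)


def legacy_backend(endpoint: str, payload: dict) -> dict:
    # Hypothetical stand-in for the old system.
    return {"served_by": "legacy", "endpoint": endpoint}


def new_contacts_handler(payload: dict) -> dict:
    # Hypothetical stand-in for an endpoint rebuilt in the new backend.
    return {"served_by": "new", "endpoint": "contacts"}
```

A caller only ever talks to the facade. Before `migrate("contacts", new_contacts_handler)` is called, a `contacts` request is answered by the legacy function; afterwards it is answered by the new handler, while an endpoint such as `search` keeps going to the legacy backend until it is rebuilt.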

This approach was working well, but we were still missing one piece. We needed a way to coordinate events between Flex4 and Flex5. For example, if Flex5 saved a new inventory model, Flex4 was completely unaware that it had been inserted into the database, so its caches and search index went stale. In the spring of 2017, we found an inter-process communication tool to solve this problem. The tool keeps each system aware of events happening in the other system and enables each system to respond to remote events.
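The stale-cache problem and its fix can be sketched in a few lines of Python. This is not the inter-process tool mentioned above, which the post does not name. It is an in-process publish/subscribe stand-in, and the class names, the topic name `inventory_model.saved`, and the dict used as a shared database are all assumptions made for illustration.

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for an inter-process messaging tool."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in list(self._subscribers[topic]):
            callback(event)


class LegacyCache:
    """Hypothetical read cache on the legacy side, backed by the shared database."""

    def __init__(self, bus, database):
        self._database = database
        self._cache = {}
        # React to saves made by the other system.
        bus.subscribe("inventory_model.saved", self._on_saved)

    def get(self, model_id):
        if model_id not in self._cache:
            self._cache[model_id] = self._database.get(model_id)
        return self._cache[model_id]

    def _on_saved(self, event):
        # Drop the cached entry so the next read goes back to the database.
        self._cache.pop(event["id"], None)


def save_inventory_model(bus, database, model_id, data):
    """Hypothetical save on the new side: write the row, then announce it."""
    database[model_id] = data
    bus.publish("inventory_model.saved", {"id": model_id})
```

If a row is written to the shared database without publishing an event, the legacy cache keeps returning whatever it read earlier. Going through `save_inventory_model` instead evicts the cached entry, so the next read picks up the new row.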

The road ahead for Flex

In summary, all of the key pieces for shipping the tablet version of Flex5 are in place. We are still rounding out some less critical pieces, but we are getting closer and closer. We expect Flex5 to continue to gain momentum the rest of this year with 2018 being a year of heavy code lifting. It’s very exciting!
