Some of you may remember that XX Network mainnet was originally scheduled to start this week, but it was pushed to next week.
That turned out to be helpful: it gave me time to complete some architecture work, planning, and testing related to mainnet, and I’m now almost ready.
Implemented
My current status as of today (November 10, 2021):
- Analysis of performance, application resource requirements, and database requirements (including backup & restore) was completed this week
- XX Chain, Node, and Gateway all ready to run in the cloud (all OS, apps & configuration files in place)
- XX Node is running in the cloud
- XX Gateway is still running on-premises
I’ve just finished migrating the Node, and I may still move the Gateway to the cloud before mainnet.
To be implemented before mainnet
- Implement advanced performance monitoring (improve on the current approach that I have on-premises)
- Document all settings
- Set up remote backup for the two Postgres databases (see the backup sketch after this list); already tested with the larger (Gateway) DB, and a one-off attempt with the smaller (Node) database also worked fine
- Set up backup for XX chain (already tested)
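To make the backup item above more concrete, here is a minimal sketch of the kind of script I have in mind, assuming the dumps are made with pg_dump and shipped to a remote host with rsync. The database names, paths, and remote target below are placeholders, not the actual production values.

```python
#!/usr/bin/env python3
"""Minimal sketch of a remote Postgres backup job.

Database names, the remote host, and paths are placeholders;
they would need to match the real Node/Gateway configuration.
"""
import datetime
import pathlib
import subprocess

DATABASES = ["cmix_gateway", "cmix_node"]                # placeholder database names
BACKUP_DIR = pathlib.Path("/var/backups/postgres")       # placeholder local staging dir
REMOTE = "backup@backup-host.example:/srv/xx-backups/"   # placeholder remote target


def dump_database(name: str) -> pathlib.Path:
    """Dump one database to a timestamped, compressed custom-format file."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    out_file = BACKUP_DIR / f"{name}-{stamp}.dump"
    # pg_dump custom format (-Fc) is compressed and restorable with pg_restore
    subprocess.run(["pg_dump", "-Fc", "--file", str(out_file), name], check=True)
    return out_file


def ship_to_remote(files: list[pathlib.Path]) -> None:
    """Copy the dump files to the remote backup host over SSH."""
    subprocess.run(["rsync", "-a", *[str(f) for f in files], REMOTE], check=True)


if __name__ == "__main__":
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    dumps = [dump_database(db) for db in DATABASES]
    ship_to_remote(dumps)
```

Restores would go the other way, through pg_restore into a freshly created database.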
To be implemented by early December 2021
- Centralize log collection (improve on the current approach that I have on-premises)
- Write custom service-checking scripts and set up application-aware monitoring (see the service-check sketch after this list)
- Implement canarynet on-prem and use it to conduct failover and failback testing between on-prem and the cloud
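For the service-checking item above, the sketch below shows the general shape I have in mind: ask systemd whether each service is active, and confirm that the chain’s RPC port answers. The unit names and the port are placeholder assumptions, not necessarily what the final setup will use.

```python
#!/usr/bin/env python3
"""Minimal sketch of an application-aware service check.

Service unit names and the RPC endpoint are placeholders; adjust to the real setup.
"""
import socket
import subprocess
import sys

SERVICES = ["xxnetwork-chain", "xxnetwork-node", "xxnetwork-gateway"]  # placeholder unit names
RPC_HOST, RPC_PORT = "127.0.0.1", 9933                                 # placeholder RPC endpoint


def service_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    failures = [u for u in SERVICES if not service_active(u)]
    if not port_open(RPC_HOST, RPC_PORT):
        failures.append(f"rpc {RPC_HOST}:{RPC_PORT}")
    if failures:
        print("CHECK FAILED:", ", ".join(failures))
        sys.exit(1)  # non-zero exit so the monitoring agent can alert
    print("all checks passed")
```

Because a check like this only communicates through its exit code and a one-line message, it can be plugged into whatever monitoring agent ends up doing the alerting.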
Summary
It took me around three hours to move the Node service from on-premises to the cloud, including time lost to some unexpected problems (firewall, etc.).
The first big drop (protonet era 741) happened at switch-over; after that, one more era was missed (spent waiting).
Node after the move:
What does that mean for folks who nominate us?
- Backup and restore of both the settings and the data (XX chain, Postgres cmix_node database) was successful
- Permissioning-related steps were handled without problems
- Failback to on-premises should work the same way, only in the opposite direction
Once I complete all of the steps outlined above, I believe I should be able to recover from most unplanned site failures within 12 hours, and to move nodes around within only two to three hours.