XX Network Notes

MainNet

No downtime or any issues except for the usual Internet slowdowns which degrade performance on certain days.

Maximum commission for Team Multiplier-backed nodes has been increased to 22%. I haven’t adjusted commission yet, so as of April 8, 2023, PRIDE is still at 17.9%.

There’s a community-run “rewards payout” script that pays out rewards automatically - at least for the time being, as certain donors have chipped in - so payouts for UNITED-VALIDATORS have been very regular.
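
Payouts like these boil down to calling the Staking pallet’s payout_stakers for each completed era. This is not the community script itself - just a minimal sketch of what such automation could look like, assuming the substrate-interface Python library and a Substrate-style Staking pallet on the xx chain; the RPC endpoint, signing seed and stash address are placeholders.

```python
# Minimal payout sketch (NOT the community script). Assumes a Substrate-based
# chain such as xx and the substrate-interface library; endpoint, seed and
# stash address are placeholders.
from substrateinterface import SubstrateInterface, Keypair

substrate = SubstrateInterface(url="wss://rpc.example.invalid")   # placeholder RPC endpoint
signer = Keypair.create_from_uri("//PayoutBot")                   # placeholder signing key
stash = "6ExampleValidatorStashAddress"                           # placeholder validator stash

# Pay out the last three completed eras for this stash. A real script would
# first check which eras are still unclaimed instead of submitting blindly.
active_era = substrate.query("Staking", "ActiveEra").value["index"]
for era in range(active_era - 3, active_era):
    call = substrate.compose_call(
        call_module="Staking",
        call_function="payout_stakers",
        call_params={"validator_stash": stash, "era": era},
    )
    extrinsic = substrate.create_signed_extrinsic(call=call, keypair=signer)
    receipt = substrate.submit_extrinsic(extrinsic, wait_for_inclusion=True)
    print(f"era {era}: success={receipt.is_success}")
```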

More than one week of poor network performance across the board. Really annoying, but it’s completely outside of my control.

There was no downtime due to maintenance or other issues.

There were days of poor network performance.

There was no downtime due to maintenance or other issues.

Nothing unusual to report: as usual, there were 1-2 days of poor network performance after which I cut commission for 2 eras.

The xx gateway code was updated to add TLS encryption, and I happened to be reasonably alert to that schedule, so I prevented significant downtime from a failed gateway update that was pushed out.

Other than that, the quality of nodes on the xx network has improved as the changes to the Team Multiplier calculation created better economic incentives, so recent months haven’t been nearly as frustrating as before.

Having moved PRIDE to the cloud, I used the opportunity to complete some overdue to-do’s from my monitoring wish-list.

I still don’t use any alerts, but I now monitor across all nodes - and not just TCP/IP ports, but the actual services.

After this month’s gateway code updates, none of the three nodes have had issues, so I haven’t felt a pressing need to do anything about alerts.

The benefit of the new monitoring is that I no longer need to log in to every machine or check the status in the XX Dashboard or Wallet - I just watch one page which has everything I need and nothing more than that. To that last point: I built the monitoring partly by writing my own monitoring scripts, which use very few resources (less than 20 MB).
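
To illustrate the difference between a bare port check and a service-level check, here is a minimal sketch of the idea (not my actual scripts): it probes the TCP ports and also asks the chain node how it is doing via the standard Substrate system_health RPC. The hostnames and port numbers are placeholders.

```python
# Sketch of a lightweight, service-level monitor (not my actual scripts).
# Assumes the chain node exposes the standard Substrate JSON-RPC on port 9933;
# hostnames and the other port numbers are placeholders.
import json
import socket
import urllib.request

NODES = {
    "PRIDE": {"host": "pride.example.net", "chain_rpc": 9933, "tcp_ports": [11420, 22840]},
    "ENVY":  {"host": "envy.example.net",  "chain_rpc": 9933, "tcp_ports": [11420, 22840]},
    "SLOTH": {"host": "sloth.example.net", "chain_rpc": 9933, "tcp_ports": [11420, 22840]},
}

def tcp_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Plain reachability check: can the TCP port be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def chain_healthy(host: str, port: int) -> bool:
    """Service-level check: ask the chain node itself via the system_health RPC."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": "system_health", "params": []}
    req = urllib.request.Request(
        f"http://{host}:{port}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            health = json.load(resp)["result"]
        return health.get("peers", 0) > 0 and not health.get("isSyncing", False)
    except (OSError, KeyError, ValueError):
        return False

for name, cfg in NODES.items():
    ports_ok = all(tcp_open(cfg["host"], p) for p in cfg["tcp_ports"])
    chain_ok = chain_healthy(cfg["host"], cfg["chain_rpc"])
    print(f"{name}: ports={'OK' if ports_ok else 'FAIL'} chain={'OK' if chain_ok else 'FAIL'}")
```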

I’m updating November notes in advance (on November 26), but everything has been stable and smooth this month.

Regarding ENVY and SLOTH, the big news is that I closed them to nominators. As before, nominators smaller than 20K were left on board, although I don’t need their nominations (I have more than 20K in extra nominations on each node).

I had a big problem with UNITED-VALIDATORS/PRIDE in October: first it suffered hours of unplanned downtime due to a terrestrial cable cut caused by construction crews (era 331). Hours later they fixed it and it seemed like everything was fine. But it wasn’t.

Two weeks later they arranged scheduled downtime to fully fix the problem, and after that PRIDE’s performance went down big time.

I don’t know the details, but my theory is that the first cut made them reroute traffic to a spare line, potentially one reserved for commercial customers, resulting in no impact except for the downtime. But when they fixed the original line, it wasn’t done properly, and that made the connection useless for running a node. A few days later PRIDE dropped out of the validator set.

So with more than one year left on my high-speed subscription plan, I had to either move the node or give up on running it. I moved it to the cloud in era 362 so now ENVY, SLOTH and PRIDE are all running in the cloud.

There haven’t been noteworthy events this month.

It’s hard to tell why September was trouble-free. If I had to pick a reason I’d say it’s because some of the recent Ubuntu updates made systems perform well, and I haven’t had big network problems either.

It should also be pointed out that the increased competition that appeared thanks to the changes in TM rules I advocated for has benefited the network - it runs better and processes more MTPS.

I voted against validator slot expansion because it’s not necessary at this time.

There haven’t been noteworthy events this month.

Having spent almost a month running XXV4 (“ENVY”) and XXV5 (“SLOTH”) out of Germany due to the geo-multiplier reset, I moved the nodes back during the third week of July because the new geo-multiplier took effect that week.

I’ve experienced several PostgreSQL database timeouts on gateways, but they didn’t last long enough to make the node drop out of the validator set. Unlike in June, I did not lower my commission after those events because the downtime was not caused by me.

The referendum for geo-multiplier reset unexpectedly passed, and all regions were set to geo-multiplier 1.00.

This made me move my TM-backed node (PRIDE aka XXV3) to less expensive hardware, as the drop in xx earnings may not be sufficient to cover the higher expenses (we’ll find out when xx coin becomes tradeable). Meanwhile I deployed two new nodes in Western Europe (ENVY and SLOTH), so now I have three nodes.

The geo-multiplier may return in July, after which node locations and configurations will be evaluated and possibly changed.

I experimented with XXV3 (my TM-backed node) to maximize its performance and to try to figure out whether the additional spending can be justified. The problem is, we still don’t know the price of xx coin, so I still have no answer to this.
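
The calculation itself is trivial - what’s missing is the price. A back-of-the-envelope sketch with made-up numbers (none of them come from these notes):

```python
# Break-even check for the extra spending, with placeholder figures only.
extra_cost_per_month_usd = 60.0   # hypothetical: better hardware minus the cheaper setup
extra_xx_per_month = 1500.0       # hypothetical: additional xx earned on that hardware

# The upgrade only pays off if the coin trades above this price:
break_even_price = extra_cost_per_month_usd / extra_xx_per_month
print(f"break-even xx price: ${break_even_price:.3f}")   # $0.040 with these numbers
```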

Issues with Team Multiplier and Phragmen made me take the second, independent node (XXV2) offline and use the wallet for nominating. Such are the effects of TM socialism (Atlas Shrugged, etc.)

The entire network is heavily impacted by the poor performance of nodes in Ukraine and Russia.

I’ve had XXV2 Gateway die once - probably due to too many network errors - but I spotted it early enough.

XXV2 dropped out of the active validator pool twice during the month of March:

Now I nominate both XXV3 and XXV2 and so far that’s been enough. XXV3 also has a Team Multiplier, so Phragmen usually allocates most of my own coins to XXV2.

My plan for April is to wait for xx coin launch and potentially stand up another node.

The first half of February hasn’t been great. I’m especially unhappy about the poor performance of XXV2.

If you look at the performance chart above or in the Explorer, it’s easy to see that day-to-day variations are very large. Hardware resources are effectively dedicated to the node, so it’s easy to tell it’s either the network or the peers, in combination with unlucky grouping.

The second half was better, even though I didn’t make any changes. This shows that factors impacting the nodes’ performance are indeed external.

I haven’t done much infrastructure-wise, besides some “internal” work. One reason is that I was busy; the second is that I’m waiting to see how the forthcoming changes impact the network (as a reminder, we’re awaiting two big changes: one is a change in geo-multipliers, and the other is the ability of Team-backed validators to stake both their own and other nodes).

My plan for March is to move XXV2 to another hosting provider and then resume work on infrastructure improvements, but I will make that decision after the above-mentioned changes.

Highlights:

Infrastructure:

Uptime and performance:

Highlights:

Rewards in December 2021

Let’s look at one of those mystery cMix (or network?) events, this time in era 40 earlier this week.

Late last week and early this week I had another 24 hour period of strange failure spikes.

It started on its own in the last hours of last week (the week of Dec 20) and also affected the first day of this week (the week of Dec 27).

Why do I think it wasn’t me?

Next time something similar happens I’ll just let it be and watch it for 48 hours to see if it fixes itself on its own.

The XX chain is a lottery, so I just look at the weekly cMix stats from the MainNet dashboard: successful cMix rounds, plus precomp and realtime failures and averages.

| Week | cMix rounds | PC timeouts | RT timeouts | PC % | RT % | PC avg (s) | RT avg (s) | Comment |
|---|---|---|---|---|---|---|---|---|
| 2021/11/15 | 15870 | 490 | 188 | 1.85 | 1.15 | 11.34 | 2.54 | Partial week 1 (cld) |
| 2021/11/22 | 24179 | 673 | 69 | 2.43 | 0.28 | 11.84 | 2.64 | Self-inflicted interruption (cld) |
| 2021/11/29 | 19843 | 413 | 48 | 0.24 | 1.80 | 11.02 | 2.74 | cMix (cld) |
| 2021/12/06 | 24951 | 440 | 63 | 0.25 | 1.48 | 10.96 | 2.76 | Stable (cld) |
| 2021/12/13 | 24914 | 431 | 77 | 0.30 | 1.40 | 11.01 | 2.73 | Stable (cld) |
| 2021/12/20 | 25661 | 548 | 67 | 0.26 | 1.87 | 10.48 | 2.68 | cMix (cld, rsd) |
| 2021/12/27 | TBD | TBD | TBD | 0.27* | 1.96* | 10.62* | 2.76* | cMix (rsd) |
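
Roughly speaking, a failure percentage is just timeouts over total rounds for the week, although the dashboard may define its columns somewhat differently; the counts below are placeholders, not a row from the table.

```python
# Rough arithmetic behind a weekly failure percentage (placeholder counts).
rounds = 25000       # successful cMix rounds in a week
pc_timeouts = 500    # precomp timeouts
rt_timeouts = 75     # realtime timeouts

print(f"PC %: {100 * pc_timeouts / rounds:.2f}")   # 2.00
print(f"RT %: {100 * rt_timeouts / rounds:.2f}")   # 0.30
```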

The persistently high precomp averages in early December are what prompted me to move the node on Dec 22. From ProtoNet I knew sub-11s precomp was possible on my hardware, and I didn’t want to wait for another cMix incident. The migration was uneventful (about 1 hour of planned downtime) and I managed to avoid missing an era.

Earnings are still unstable, albeit higher (which is expected given my location), and I’m back to “manual” HA.

Points earned in December 2021

The week of Dec 27 isn’t done yet (Fri Dec 31, 2021), but so far both precomp and realtime failure rates have been good (apart from that mystery spike on Dec 26/27), and the average precomp duration is 10.62 seconds, which is better than what I had in the cloud. Realtime failures and realtime duration are both great - close to or better than what I had in the cloud over the past five-plus weeks - and very encouraging compared with my mediocre ProtoNet performance on the same hardware: 1% realtime and 3% precomp failure rates, and just 20,000 cMix rounds per week.

Now I’m close to the best performers in my region (Oceania).

25,000 rounds per week was a nice round goal for this hardware configuration. I’ve finally achieved it and intend to maintain it while I attempt to launch a second node (“XXV2”) in January 2022.

Due to the frequent incidents and five relocations in less than 50 days, all other plans have been progressing slowly. Here’s the current status, in order of perceived importance:

After a mysterious network cock-up in era 13, which I suspect wasn’t caused by me (I think it was something about the network, but I’ve never found any proof), I had to restart validation and miss some 0.7 days’ worth of earnings. The problem resulted in long series (15-20 rounds) of quick precomp failures followed by 2-3 good rounds. Really weird; I’d never seen that before.

After that it’s been two to three weeks of unremarkable but stable performance, ranging from 48-52K points per day. This node is still closed to nominators, but the next one won’t be. Whether it will manage to get elected is another question.

After days of part-time preparations, I’ve successfully adjusted my configuration so that it performs faster and better.

The changes cost me (and the nominators) two hours of downtime yesterday (Era 12) and around 30 minutes today (Era 13), but should pay back in days and weeks to come.

If this configuration indeed performs better I will revisit other items (monitoring, etc.) which have been delayed by attending to performance and stability difficulties during the first 12 eras.

Now that adjustments have been made I plan to just monitor everything for at least one week.

Some things have been going according to the plan, some haven’t. The bad stuff first.

I had one service disruption on Nov 22 due to a Node VM misconfiguration that was my fault. And of course I was away when it happened. Once I did discover it, I fixed it quickly.

Partial service disruption on Nov 22

Actions taken:

With that corrective action - the problem was only noticed in the second era after the incident - the two nominators remained on board. In a week or two they should recover the missed earnings.

The chain remained online during that time, so some points were still earned (just fewer) during the cMix service disruption.

The other problem happened due to external factors outside of my direct control, and I spotted it fairly quickly. But I didn’t do anything to cause it, and there was not much I could do to fix it either. The cMix service simply started crashing every second or third round. After a while I tried restarting the cMix service, then even rebooting, then restarting the Gateway, too.

Service degradation on Nov 26

I’m not sure if any of those actions helped, but eventually precomp failures returned to normal. During a period that lasted approximately two hours the Node failed 30% of the rounds.

The impact on nominators was minimal (mainly due to small impact on points earned, but also thanks to their tiny share in my node compared to my own).
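
To put “tiny share” in perspective: in Substrate-style staking the validator takes its commission off the top and the rest of the era reward is split pro rata by stake, so a small nominator absorbs only a proportionally small slice of any dip. The numbers below are made up purely for illustration.

```python
# Illustrative payout split (made-up numbers, Substrate-style pro-rata payout).
era_reward_xx = 100.0                      # hypothetical era reward for the validator
commission = 0.18                          # hypothetical 18% commission
own_stake, nominator_stake = 50_000.0, 2_000.0
total_stake = own_stake + nominator_stake

after_commission = era_reward_xx * (1 - commission)
nominator_reward = after_commission * nominator_stake / total_stake
print(f"nominator share: {nominator_reward:.2f} xx of {era_reward_xx:.2f}")   # ~3.15 xx

# A 10% dip in points for that era costs this nominator only ~0.32 xx.
```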

I suspect this problem is related to environment issues, as I had seen similar behavior before (on ProtoNet); it just didn’t last as long. Who knows, maybe my rebooting and restarting delayed cMix recovery… I also wonder whether, had I not restarted and rebooted, cMix would have eventually crashed and stayed offline like it once did on ProtoNet.

This is the good news. There’s a ton of good validators, so it’s not like people don’t have choices, but I want to explain why I’ve temporarily stopped accepting nominations:

The good news is I’m making improvements and plan to resume in coming weeks.

Hours ago we started to cMix on MainNet. Current block height is 15,694, cMix round number 10870, and the time is Thu 18 Nov 2021 09:18:48 AM UTC.

Thankfully, I’ve had no issues at the open since validation began.

Current status report after close to 11,000 rounds on MainNet:

As you probably know, we had a false start and had to click the undo button.

But I did not know that when I spotted my MainNet setup in a state of complete disarray, so I changed a bunch of things in the hour prior to the MainNet v1 shutdown.

It took me two hours to update, undo changes and recheck everything. Tomorrow we’ll know if I have missed something.

Three hours until ~cMixing kicks off~ the validator election process begins!

The boxes (gateway & node) are ready to start validating.

I’m almost sure there will be problems with networking because I made some changes in the last hours of ProtoNet and they caused frequent timeouts in incoming connections. While I suspect and hope I saw a bug in permissioning behavior, if I see the same problem on MainNet, I’ll simply abandon that approach and revert to the old approach from BetaNet and ProtoNet.

ProtoNet

We’re going to MainNet tomorrow, so although I’m going to keep the nodes up for another 12-24 hours, I’ll summarize this month’s run today:

As of now, Mon Nov 15 04:19:22 UTC 2021, these are my results for the weeks 1 and 2 of November, followed by the averages (unless indicated otherwise):

Highlights and observations:

I want to add a few more lines regarding the low number of successful rounds:

No significant impact on rewards

I plan to revisit the performance in my next XX Network update (early December 2021).

This month my node successfully participated in a lot more rounds than in September. I suspect the update from late September decreased the number of cMix crashes which means the node spent much less time recovering from fatal errors (crashes). It still crashed a lot, though.

The other thing is I spent more time offline compared to September:

I don’t rely on an online uptime-checking service, but these failures reminded me that I should write custom service checkers for the chain, cMix and gateway services.

September was quite a disappointing month as far as my XX Network statistics are concerned.

The last cMix update, in August 2021, was the main reason: it introduced new bugs and made existing bugs worse. I’m talking about the one that causes the cMix process to crash and makes the node effectively spend more than 5% of its time recovering from crashes.

The update from late September (v3.1.0) improved things a lot (in the second week of September my precomp timeouts were 15%), but the rest of the problems remain.

Yesterday they pushed another update (v3.3.0), and because it happened over a weekend I spent some time looking at the latest behavior of the node and - with the help of an XX Network team member - confirmed the long-held suspicion that the comparatively higher failure rates (in East Asia) could be related to the fact that my ISP routes traffic to the scheduling server in Frankfurt via the United States rather than over Eurasian links. The bottom line is that the 350 ms latency to the scheduling service is very bad.
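
Timing a handful of TCP handshakes is enough to spot a ~350 ms path like that. A quick sketch - the hostname and port are placeholders, since these notes don’t name the actual scheduling endpoint:

```python
# Quick latency probe: time a few TCP handshakes to the (placeholder) host.
import socket
import statistics
import time

HOST, PORT = "scheduling.example.invalid", 443   # placeholders

samples = []
for _ in range(5):
    start = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            samples.append((time.monotonic() - start) * 1000)
    except OSError:
        pass

if samples:
    print(f"median connect latency: {statistics.median(samples):.0f} ms ({len(samples)} samples)")
else:
    print("host unreachable")
```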

Combined with other bugs - such as cMix crashing when one of the five nodes in a round is unreachable - this makes the node check in with scheduling after each such round, which causes it to miss the next round as well; the end result is more errors and fewer successful rounds.

I could make changes to my network to mitigate part of the problem, but MainNet is coming in 30 days and that’s when Scheduling is going away. So I’ll just let it be.

(Edit: later I learned scheduling and permissioning will not be removed on Day 1 of MainNet.)

| Item | XXV Node | XXV GW | XXV2 Node | XXV2 GW | Comment |
|---|---|---|---|---|---|
| Storage mon | OK | OK | OK | OK | mon = monitor |
| Simple cMix mon | OK | - | OK | - | simple port check |
| Simple Gtwy mon | - | TODO | - | TODO | TODO |
| Chain mon | TODO | TODO | TODO | TODO | TODO |
| Adv cMix/GW mon | TODO | TODO | TODO | TODO | TODO |
| DB backup | TODO | OK | TODO | OK | OK |
| Chain backup | OK | OK | OK | OK | |
| DB mon | TODO | TODO | TODO | TODO | TODO |
| Adv. logging | OK | OK | OK | OK | TODO: reports |

Team Multiplier and Phragmen make the life of independent validators difficult, and I don’t want to improve operations beyond the minimum, so I’ve also stopped updating this section.