I’m fully aware that you are already oversubscribed with your other projects, so I very much appreciate you taking the time to reply. Your answer definitely helped and provided some interesting insight. As you probably expected, though, I do have a few comments.
Let me get a few obvious things out of the way first: the Coordicide WP does not (to the best of my knowledge) really mention any of the other ‘fundamental changes’ you mentioned (e.g. UTXO, signature schemes, reusable addresses). Similarly, the inner workings and performance characteristics of the coordinator are not well published (again, to the best of my knowledge). It is therefore a bit hard to reason about the effects of individual changes. The best I can go with is a “big O” intuition, which I still believe generally holds independent of the optimizations.
Let me state a few assumptions/observations.
(1.) IoT devices in scope have an approximate capacity of a Raspberry Pi 3 / 4.
The power usage of an RPi 4 under load is about 5-10 Watts, which is a reasonable upper bound both on what an end-user might be willing to spend and on what could still be considered “green”. As more compute power requires more electrical power, capping compute at Raspberry-Pi levels seems like a sound assumption. Note: in the rest of this text, I’m using the RPi (4) as an example of ‘what 5-10 Watts of compute buys you’ - independent of whether this is 100% CPU on an RPi 4, or 10% on a Core i7.
(2.) The majority of IOTA full-nodes will run on IoT devices.
This seems pretty intuitive given IOTA’s focus on IoT and the marketing material (“smart city mesh networks”). Mesh networks would not be reasonable if light nodes were the majority of the network (and there would be no advantage in using a DLT anymore - esp. for data transactions, AWS IoT would be cheaper and faster). I found few explicit statements on this (“Low-power devices are able to issue transactions and take part in the consensus.” from IOTA 2.0 DEVNET - Nectar Release being the closest), so it might make sense to make this an assumption.
(3.) The maximum throughput of the network is limited by the speed of some absolute majority (or even supermajority) of nodes.
This is due to the need to build consensus, and the visibility requirements described in my initial post.
So, assuming that the assumptions above are mostly accurate (please correct me if/where I’m wrong), let me attempt to explain why I seriously doubt the projected post-Coordicide TPS numbers.
(a.) Using Avalanche as the baseline
I know Avalanche is not IOTA, but there are significant structural similarities, and Avalanche has somewhat detailed benchmarks, making it a decent starting point.
According to their benchmarks, Avalanche achieves about 3500 TPS in their geo-replicated scenario (without attackers). A Raspberry Pi is significantly slower than the c5.large AWS instance they use. Assuming a factor of 10 (see the math below; it’s likely worse), we’re already at 350 TPS. Now, Avalanche is conceptually significantly simpler than IOTA - maintaining mana, autopeering incl. hash chains, and the DRNG for FPC all need to be accounted for. I think it is safe to assume that this will cost at least another 50%, leaving ~175 TPS. This still doesn’t account for misbehaving nodes (on purpose or accidentally), high-mana node overloads, timeouts and crashes, the fact that real-world applications need a synchronous persistence layer, etc. - and now you’re likely at 100 TPS or below.
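For what it’s worth, the estimate above is just a chain of multiplications; here it is as a quick Python sanity check (the slowdown and overhead factors are my assumptions, as explained above):

```python
# Back-of-envelope: scale Avalanche's published throughput down to
# Raspberry-Pi-class hardware. All scaling factors are assumptions.
avalanche_tps = 3500      # geo-replicated benchmark, no attackers
rpi_slowdown = 10         # c5.large vs RPi 4 (see the math at the end)
protocol_factor = 0.5     # assumed cost of mana, autopeering, DRNG/FPC

estimated_tps = avalanche_tps / rpi_slowdown * protocol_factor
print(estimated_tps)  # 175.0
```

With any additional real-world losses (crashes, overloads, persistence), this drops further, toward the ~100 TPS mentioned above.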
(b.) Using existing software & benchmarks on IoT devices as baseline.
This is a weaker/fuzzier point, but:
The Tangle will need some kind of persistence layer for UTXO / transactions. Postgres / PGbench R/W will get you something like 400 read/write TPS on a Raspberry Pi: Pi4 ARM 64 Vs 32 Benchmarks - OpenBenchmarking.org.
On a Raspberry Pi 4, I get about 40 TLS (HTTPS) connection setups a second. The official nginx blog reports about 450 per CPU on a server CPU, which mirrors the ~10x observation above (www dot nginx dot com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/ - sorry, link limit for new users).
I think it is a safe assumption that the persistence layer performance will look similar to the Postgres TPS, that the load introduced by network/CPU/crypto and maintaining the data structures will be at least in the same order of magnitude as a TLS connection setup, and that the upper bound(!) for Tangle TPS is therefore somewhere between 40 and 400.
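To make the “somewhere between 40 and 400” claim a bit more concrete, here is a tiny cost model - the model itself is my assumption, the component rates are the RPi 4 benchmark figures quoted above. Each transaction is charged one persistence write plus some fraction f of a TLS-handshake-equivalent of crypto/network work:

```python
# Component rates from the RPi 4 benchmarks quoted above;
# the serial cost model itself is an assumption.
PERSIST_TPS = 400   # Postgres/PGbench read-write TPS on an RPi 4
TLS_SETUPS = 40     # TLS handshake setups per second on an RPi 4

def tps_bound(f):
    """Upper bound on TPS if each TX costs one DB write plus
    a fraction f of a TLS-handshake-equivalent of work."""
    return 1.0 / (1.0 / PERSIST_TPS + f / TLS_SETUPS)

print(round(tps_bound(0.0)))  # 400 - crypto/network cost negligible
print(round(tps_bound(1.0)))  # 36  - a full handshake-equivalent per TX
```

So depending on how heavy the per-transaction crypto/network work really is, the bound slides between the two benchmark figures.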
(c.) Bandwidth constraints
As others have pointed out, a 100 MBit line can carry about 7500 transactions per second at current transaction size. But that’s maxing out the 100 MBit, and doesn’t account for gossiping - for fast transaction propagation, you need to send each TX to plenty of neighbors (asking “do you have this?” first creates significant overhead too - it is likely more efficient to over-broadcast).
So let’s assume you gossip each transaction to 8 neighbors; then you get ~1k TPS on a 100 MBit uplink. If we assume 10 MBit upload, we get 100 TPS; if we assume that we want to keep half of that bandwidth for Netflix etc., we get 50 TPS. Add the autopeering (list of nodes, reputation, salt/hash chains) and mana overhead, and you’ll end up at significantly less than that.
You can argue “not every node needs to broadcast that much”, but you still need at least the same order of magnitude in fan-out for the ‘average’ node. Also, a 10 MBit uplink is not even common - my “high-speed” hotel-room internet in downtown Sydney gives me 7 MBit down / 1 MBit up, my apartment here 2 years ago gave me < 12 MBit / 500 kBit, and I lived in Manhattan last year with 12/2 MBit.
(Also, some connections are metered - if your contract is for 200 GB/month, at 1000 TPS you’d fill that up in about a day.)
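The bandwidth arithmetic above, spelled out in Python (the per-transaction size is inferred from the “7500 TPS on 100 MBit” figure; the fan-out of 8 is my assumption):

```python
# Gossip bandwidth back-of-envelope. Transaction size is derived from
# the quoted "7500 TPS saturate 100 MBit"; fan-out of 8 is assumed.
link_bps = 100e6                 # 100 MBit/s uplink
tx_bits = link_bps / 7500        # ~13.3 kbit ~ 1.7 KB per transaction
fanout = 8                       # copies sent per transaction (assumed)

tps_100mbit = link_bps / (tx_bits * fanout)   # ~940, i.e. ~1k TPS
tps_10mbit = tps_100mbit / 10                 # ~94,  i.e. ~100 TPS

# Metered connections: GB per day just to *receive* 1000 TPS once each.
gb_per_day = 1000 * (tx_bits / 8) * 86400 / 1e9   # ~144 GB/day
print(round(tps_100mbit), round(tps_10mbit), round(gb_per_day))
```

Note that the metered-connection figure is receive-only; with gossip fan-out on top, the monthly cap is blown even faster.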
(d.) Using the current system performance as a comparison
Now, I’ve stated before that I can’t know the performance characteristics of the coordinator, so this is more of a question to you than an assessment on my part. You mention improvements in the signature scheme and reusable addresses, but neither of these sounds like it could give (orders-of-magnitude) performance boosts. I am not aware of how the account-based model works in detail, so I can’t say much about that.
(i) I would, however, be surprised if the savings through UTXO and the algorithmic improvements you mentioned were not completely eaten up by the complexity overhead of FPC/autopeering/mana/etc.
(ii) I am actually also surprised by the “250” number given for IRI, as the network currently does not seem to exceed 15 TPS, even under pressure/spam. Is there another bottleneck somewhere?
Note that even if assumptions (1.) and (2.) were not true, points (c.) and (d.) still hold. Also, if you got this far, thanks for reading!
I hope this helps explain the doubts that I have. Or maybe I missed something? In any case, if you have the time (and motivation ;)) to reply, I am very much looking forward to it - but I am aware you are quite busy at the moment, so please don’t feel pressured.
In any case, thanks again!
Math for Raspberry Pi vs c5.large
Math: Avalanche uses c5.large, which is a dual-core Xeon Skylake at 3.6 GHz; a (hopefully conservative) equivalent on Phoronix is the Xeon E3 1220 v5 with results divided by 2 (because quad → dual core).
Highlight: x264 encoding is ~20x faster on the Xeon (200/2 vs 5) and ~17.5x faster on the i5 (175/2 vs 5).
vs the i5: LAME MP3: 125 secs (RPi 4) vs 10 secs (Pi 3: 600 seconds). FLAC: 100 vs 5.
It should be noted that I’ve assumed a Raspberry Pi 4 as the “standard IoT device”, but an RPi 4 is already on the high end of ARM SBCs - an RPi 3, for example, is already 3-5 times slower.
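For transparency, here are the speed ratios computed from the Phoronix figures quoted above (server/desktop throughput halved to approximate the dual-core c5.large):

```python
# Speed ratios, RPi 4 vs server/desktop CPUs, from the quoted benchmarks.
# x264 numbers are fps (higher is better); LAME/FLAC are seconds (lower
# is better). Server fps is halved: quad-core benchmark -> dual-core c5.
x264_vs_xeon = (200 / 2) / 5   # 20x faster than the RPi 4
x264_vs_i5 = (175 / 2) / 5     # 17.5x
lame_vs_i5 = 125 / 10          # 12.5x (RPi 4: 125 s, i5: 10 s)
flac_vs_i5 = 100 / 5           # 20x
print(x264_vs_xeon, x264_vs_i5, lame_vs_i5, flac_vs_i5)
# 20.0 17.5 12.5 20.0
```

So the factor of 10 used in (a.) is on the friendly side, especially if the “standard IoT device” is closer to an RPi 3.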
Raspberry Pi benchmarks: www dot phoronix dot com/scan.php?page=article&item=raspberry-pi4-benchmarks (the link limit strikes again)
Xeon Skylake benchmarks: www dot phoronix dot com/scan.php?page=article&item=intel-skylake-xeon9
Skylake i5: www dot phoronix dot com/scan.php?page=article&item=intel-6600k-linux&num=4