Part 2: How We Actually Built The Scalability Test

Radix DLT
12th June 2019

This is the second part of a two-part series on how we built and deployed a test that pushed the entire transaction history of Bitcoin across the Radix ledger in less than 15 minutes.

Part 1: A primer on the test and Radix (non-technical)

The Code

We decided to put the 1M TPS test code into a shared GitHub repository. This repo also contained a bunch of scripts for creating/deleting the test infrastructure in Google Cloud and to run the tests themselves. The code is now available for anyone to view and test.

To effectively feed such a large dataset into the ledger, we also had to create an extra component called the “Atom Pump” to RadixCore. This component is included with the test code, but ist not currently open-source to prevent people using it to spam our test networks! This component is detailed in its own chapter below.

Finally, DevOps had to hack the Docker entrypoint in the node-runner project for introducing custom RadixCore configurations and integration with the new “Shard Allocator” microservice (below).

The Setup

The following RadixCore Node subsystems are not part of the data flow (validation):

  1. Client API (incl. TLS Term in NGINX)
  2. AtomSync

Test Scenario: The Explorer node and the Core nodes that it aggregates TPS from

 

Modified RadixCore with AtomPump Thread(s)

Lots of concerns were raised about the scope of the 1M TPS test scenario. The biggest technical compromise we had to make (due to time limit) is to exclude TLS termination and API-layer (WS/REST) in RadixCore from the test. In other words, we decided to focus on testing the Node → Node interface and NOT Client → Node interface in v1. In more detail, NGINX is excluded from the data path and so are the HTTP/Undertow components of RadixCore. The responsibility of the API layer is to perform basic HTTP/WS validation and deserialize incoming requests into the Radix Atom Model. We estimate that – apart from time – this saves us <20% CPU time. So our numbers will be approx. <20% lower when we test through the API interface in the future.

As for the Atom Sync, this component is responsible for bulk-syncing Atoms from a peer in the same shard space. It is typically used when new nodes come on-line with a clean state (empty ledger) and want to quickly catch up with the rest of the nodes in the same shard space. In this test scenario, the Atom Sync hash no role because data is directly pumped into the system.

The Test Method

We have automated most steps detailed here (see Infrastructure-as-Code) so that they can be reproduced easily.

  1. The centralized Shard Allocator microservice is reconfigured
  2. Boot (first) node is started
  3. Rest of the Core Nodes are started
    1. All Core Nodes (incl. Boot) get their shard configuration from the Shard Allocator at start
    2. All Core Nodes (incl. Boot) load their share of Atoms to Pump
  4. All Core Nodes (incl. Boot) wait until the scheduled test time
  5. All Core Nodes (incl. Boot) start pumping in Atoms at max capacity
  6. The test is finished when the Core Nodes (incl. Boot) have validated & gossiped their respective share of Atoms.

During the test, the aggregated (Global) TPS count can be observed from the Explorer UI.

To measure TPS in the explorer UI each Core Node has a /api/system endpoint which serves JSON encoded metrics.

The relevant metric (counter) for the calculation is { “ledger”: { “storedPerShard”: xxx }}.

This metric counts the number of Atoms stored normalized over a number of shards that each Atom involves. E.g.:

Atom #1: Input Shards {X}, Output Shard {X} => storedPerShard is increased by 1
Atom #2: Input Shards {X}, Output Shards {Y} => storedPerShard is increased by 0.5 (½)
Atom #3: Input Shards {X}, Output Shards {X, Y, Z} => storedPerShard is increased by 0.33 (⅓)

The Explorer service samples this metric from 10% of the Nodes (Shards) periodically and extrapolates it. Extrapolation is necessary because it was found too challenging to sample 1,000 nodes every 4 seconds from a single Java web-service in this short time. We have cross-checked the extrapolated value with the Prometheus metrics (that involve all Nodes) and the error is relatively small.

The BTC transaction history is not evenly distributed (some wallets have way more transactions than others). As a consequence of this, nodes in some shards will finish sooner than others. The TPS drop-off occurs usually after 390M BTC Atoms have been validated.

TPS drop-off ~390M BTC Atoms.

Causal dependencies between BTC Atoms are kept and honored during the test. Pumped atoms with missing dependencies are buffered until the relevant Atom arrives over Gossip. As you can see a few Nodes are initially blocked. This is the reason why the final 5% of the test runs so slow.  

Some Nodes start late due to missing dependencies.

The Millionaire Dataset Preparator

https://github.com/radixdlt/mtps/tree/master/millionaire-dataset-preparator

The millionaire-dataset-preparator takes standard blk*.dat files which bitcoind generates and converts those into corresponding Atoms storing these in a format that can be quickly deserialized by the Atom Pump process at a later time.

Multiple Inputs/Outputs in a Transaction are supported by the Radix Atom Model, which means that 1 BTC Transaction == 1 Atom. For example  “fat” BTC transactions like this are converted into a single Atom:

https://www.blockchain.com/btc/tx/9cffd36218e50d94bb45eefe4e60128dd04207a3fd2d3e2df009440c86f6b52f

The tool supports incremental conversion, which is an essential feature because it takes days to do a full conversion of the BTC Blockchain.

The output of this tool is a single proprietary datafile called atoms, which is around 250GB in size. The Zstandard compressed file size is approx 160GB.

The Atom Pump

To achieve the efficient pushing of BTC transaction data into the Radix ledger, we added a Java Thread into the RadixCore Application. This is what the threads do:

  1. On startup looks for the atoms.zst file on disk.
  2. It loads all atoms that are in its own Shard range from atoms.zst into memory.
  3. Waits for a START_PUMP file to appear in the file system.
  4. Pumps all Atoms as fast as possible into the underlying validation pipeline.

Obviously (2) requires a lot of RAM (30GB Java heap), which is why we have several data sets. We use the 6GB dataset with 10 shard test networks. The 48GB dataset works well with 100 shard test networks, and the full dataset works with 400+ shards.

The Shard Allocator

Normally a node.key file (secp256k1 private key) is randomly generated when a node starts for the first time. From the randomly generated `node.key` we deterministically derive an “Anchor Point” in the Shard space. The anchor point marks the center of the “Shard Range”, which a Core Node is configured to validating (the range is configurable). Depending on anchor point & range we get more or less “Shard Overlap” (redundancy factor) between nodes. Obviously, we want the 1M TPS test to be deterministic, which is why we need to be able to allocate the Anchor Points and shard ranges very precisely.

The Shard Allocator

The role of the Shard Allocator was to allocate shard ranges evenly across the full 2^64 Shard space.

We developed a simple CLI tool called “generate_node_key”, brute forces a node.key into the given range. Depending on the accuracy the tool takes more or less time to execute. With 10% overlap (0.1 x shard range) the tool finishes in <1s.

Although serving a similar purpose (secure identity, signature, etc.) the node key identifies a Core Node, and the keys generated by Client Applications (e.g. Wallet) are not the same. Unlike the Client Application key, the Node’s key is not PoW checked.

To learn more about how this function works, please see Dan’s explanation here: https://youtu.be/EOisSoQ9Oa4

The image above illustrates how the first version of the Shard Allocator performed allocating 20 nodes to 10 shards. You can see that it starts off nicely allocating exactly two nodes to the same shard range. You can also see the 10% shard overlap if you look closely, but you can also see that some shard ranges only have 1 node while others have 3 nodes redundancy. Finally, it is also possible to spot that the upper node range overlaps too much.

All these problems have now been fixed, and the shard allocation is even.

Prometheus

Checking the metrics of a single node network is no problem. But when you add 800 more nodes then you quickly realize that you need a tool that samples all nodes periodically and efficiently and presents the aggregated metrics for you on a single page. Essentially the aggregated metrics presented by Prometheus match the output of the Explorer service, which the public can view, but it contains a lot more detail and allows you to drill-down on a node level.

Prometheus is a good fit for our test networks because we can dynamically configure a central (global) instance to detect new nodes entering the network. The nodes that are serving metrics over the /metrics endpoint will be automatically detected and aggregated.

The Cloud Platform

We chose Google Cloud Platform for the 1M TPS project because we got approved for their startup program and wanted to evaluate their platform at scale. We want to stop and give a big thank you to Google and all their help in making this possible.

A lot of time was spent on finding the cost-optimal and trivial central storage for our huge (160+GB) atoms.zst file that needs to be processed by each node (VM) that boots on the network. Since all nodes (~1000) start at the same time we put a lot of pressure on the data store. Typically SANs are designed for this purpose and Google Storage seemed to be the best fit for this, however, we could not find a turn-key ready solution that takes care of global replication for us. We were looking for a place (server) to drop the file on, which would automatically replicate it to all of Google’s Regions. The best they could do is to geo-replicate within a continent (Multi-Regional Storage):

https://cloud.google.com/storage/docs/locations#location-mr

So the solution that we went with was to create a custom 200GB Disk Image containing the atoms.zst file(s) and then use the `–metadata-as-file` option get the files onto the filesystem of each node at boot time:

https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#–metadata-from-file

This saved us a lot of headache with keeping the data geo-replicated.

Regarding Quotas; we learned the hard way that it easy to get the resources to start a 10s of n1-standard-8 machines in any location (DC). However, it is very hard to get resources for starting 100s of nodes in the same location. Most scarce resources are:

  1. Local SSDs
  2. Public IPv4 addresses

After some iterations and lots of goodwill from Google engineers, we managed to get access to resources to start 1000 n1-standard-8 nodes in 10+ different locations worldwide. Here are the locations we typically use:

  • us-central1 – Iowa
  • us-west1 – Oregon
  • us-east1 – South Carolina
  • us-east4 – West Virginia
  • northamerica-northeast1 – Montreal
  • europe-west1 – Belgium
  • europe-west4 – Netherlands
  • europe-north1 – Finland
  • asia-northeast1 – Tokyo
  • asia-southeast1 – Singapore
  • asia-east1 – Mumbai

Bitcoin Transaction to Radix Atom Conversion

The Bitcoin and Radix address formats are different totally different. So how can we validate bitcoin transactions involving Bitcoin addresses on the Radix DLT?

The trick we did is to use the Bitcoin address as a seed when hashing Radix Addresses. This way we get a one-to-one mapping between Bitcoin → Radix addresses:

https://github.com/radixdlt/mtps/blob/master/millionaire-dataset-preparator/src/main/java/org/radixdlt/millionaire/BlockchainParser.java#L702

The next trick was to convert the Bitcoin transaction history to the corresponding Radix Addresses. In plain English, we rewrote the Bitcoin history into a format that the RadixCore understands while maintaining transaction history. The only transactions we ignored are those that have zero outputs.

But what about block rewards?

The Radix DLT is not organized into blocks so we could not model that to anything meaningful.

Infrastructure-as-Code

You are welcome to run the tests yourself. We have the tooling to help you set up a test network in Google Cloud:
https://github.com/radixdlt/mtps

The Explorer

This is the community’s user interface toward 1M TPS test runs. It provides:

  • Information about scheduled test runs
  • Aggregates TPS during test runs (real-time)
  • Allows the community to search for BTC transactions (real-time)

Part 1: A primer on the test and Radix (non-technical)

Lastly, if you have any questions, or just want to follow us as we go towards the mainnet, please join the conversation with our community on Telegram or use the #1MTPS on Twitter to share your opinions.