The Factorio Architecture Part 2
Making the equations
TL;DR
Now that we have learned about ways math could be used to make better software, it is time to actually start making the equations so we can do the math!
Introduction
In part 1 of the Factorio Architecture I went over how Factorio relates to software and since it is a game that also is about scaling factories it also relates to the foundations of Chemical Engineering, specifically unit operations. Now that we have gone over how unit operations relates to software, I can now go over how to use them to our advantage and make equations to predict the performance of our code.
All the code used in this blog post can be found on GitHub.
Estimating Performance
Since there are no equations that will help predict the performance of unit operations, we will need to derive them. As I mentioned in part 1, if this way of thinking becomes mainstream we, as an industry, will have plenty of published equations to use at our disposal so deriving equations would be only necessary in unique circumstances.
In order to determine the equations for our unit operations, we will need to first come up with experiments that generate reproducible results. Designing experiments is always a challenging task to accomplish in science, but is the only way to produce reliable equations.
Once we conduct our experiments we will simply use trend lines to create equations
that describe the performance of our program. While it might be more satisfying to
have elegant equations like V = IR
, F = ma
, or PV = nRT
, you have
to keep in mind that generalized equations often take decades or even centuries
to derive. The ideal gas law, for
example, comprises the work of several scientists over a 200-year period.
Our industry might end up being luckier and advance faster than other
engineering disciplines, but in the meantime we can still use simple empirical
results to make equations and know that it is both helpful to us now and helpful
in creating more generalized equations in the future.
Something to keep in mind is that our equations do not need to work for all conditions. It is perfectly acceptable in science to do this. Going back to the ideal gas law, the equation only works on, you guessed it, ideal gases. There are other equations such as the Van der Waals equation which can describe a “real” gas. The ideal gas law itself is an empirical equation, meaning an equation based on applying trend lines to experimental data. Since the ideal gas law is empirical, it also has a “magic number”, the gas constant, which shows that not every variable in an equation needs to make sense from a theoretical prospective. Finally, the ideal gas law, while useful, is still just an estimate. Predicting the exact conditions of a gas is not only difficult, but, in most cases, completely unnecessary since the error is small enough not to make a difference in your approach. The only real requirement for good equations is that they reasonably predict an outcome based on the inputs you provide.
Hopefully academic researchers will be able to publish results so that for most unit operations you will not need to conduct the experiments yourself.
Methodology
The Algorithm
I wrote code in Go, C, C#, Java, JavaScript, and Python to do the following:
- Read a CSV file with fake people’s name and age (I/O)
- Validate each row’s data (Validate)
- Perform quick sort on all people by age (Sort)
- Print the people to the console (I/O)
- Free all the used memory (I/O)
You can see the algorithm visually in Figure 1. I did my best to keep the code as identical as possible to keep the results as fair as possible. As you look at these results your takeaway should not be something like “Go is so much faster than Python! You would have to be an idiot to choose Python!”. You should instead adopt an engineering mindset “When quick sorting under 1 megabyte of data python is only taking 10,000 more cycles than go, which is an acceptable performance hit for our use case. If we were to exceed 1 gigabyte of data, then the cycle difference is great enough to where we would need to choose a different language in order to still run on the same processor.”
The Computers
I ran the algorithm on the following machines
- An Ubuntu desktop with a 13th Gen Intel(R) Core(TM) i9-13900K processor
- A Raspberry PI 3 Model B
The CSV Files
For all programming languages and computers used, these test CSV files were used.
Measuring CPU Usage
I created a utility command to measure the cycle count at any given moment. All the programs made system calls to print these results to standard output during each run. Only cycles were measured for CPU performance and not time. Any reference to time is derived by dividing the number of cycles by an assumed clock speed.
i9 Cycles
The i9 processor runs were done using rdtsc to measure the cycles.
Raspberry Pi Cycles
The Raspberry Pi runs were done using pmccntr_el0 to measure the cycles.
Running the Tests
I used this script to perform the test. I forced the script to remain on the same core using taskset to help with the accuracy of the cycle measurements.
taskset 0x1 ./run-performance-tests.sh
Results
Whole Program
The most common way to measure performance tends to be by time. While measuring by time is not entirely a bad way to get a sense of performance issues, it does not give us the whole picture. Notably run time is greatly impacted by the clock speed of the processor. When an individual developer is looking to fix a performance issue there is nothing wrong with only using time as a measurement since the relative changes in run time are often good enough to identify if a performance problem has been addressed. With that in mind, let us take a look at the results for our program written in C.
Total Records vs Time (ms)
Total Records | PI 3 Model B@1.2 GHz | i9@5.0 GHz |
---|---|---|
0 | 11.7 | 0.6 |
10 | 12.2 | 0.6 |
1000 | 19.2 | 0.9 |
2500 | 29.8 | 1.2 |
5000 | 46.9 | 2.0 |
7500 | 66.1 | 2.7 |
10000 | 86.6 | 3.5 |
25000 | 232.9 | 8.2 |
50000 | 560.3 | 17.8 |
So what does this graph tell us? It tells us the performance hit that happens when picking the Raspberry Pi with actual numbers. While it is pretty obvious that a processor with a higher clock speed is more likely to be faster and that a Raspberry Pi cannot compete, in terms of speed, with an i9, having actual data means we know how much the difference is and if the cost of the i9 is worth it.
We can make our equations too by simply adding a trend line to our data and generating an equation which describes the line! Excel and its equivalents support this, so you do not even need to use a fancy python library!
PI 3 Model B@1.2 GHz:
y = 8.63e-08x^2 + 6.65e-03x + 1.20e+01, R^2 = 1.0000
i9@5.0 GHz:
y = 1.53e-09x^2 + 2.67e-04x + 6.04e-01, R^2 = 0.9999
I will mostly avoid showing you the equations for trend lines for the rest of this blog post when analyzing other graphs as it is the same process to create the equations. It is worth emphasizing again that having equations based on experiments is the foundation of engineering. Having a good objective understanding of the capabilities of your system can save you a lot of pain in the long run.
Let us take a look at another dimension to this problem by comparing different languages on the same processor.
Total Records vs Time (ms)
Total Records | c | c# | go | java | javascript | python |
---|---|---|---|---|---|---|
0 | 11.7 | 404.6 | 17.8 | 242.7 | 108.7 | 38.0 |
10 | 12.2 | 400.1 | 18.5 | 482.5 | 112.3 | 37.4 |
1000 | 19.2 | 428.2 | 27.6 | 816.7 | 268.0 | 65.5 |
2500 | 29.8 | 454.8 | 42.6 | 1108.4 | 561.2 | 115.2 |
5000 | 46.9 | 512.4 | 68.5 | 1345.9 | 753.9 | 223.7 |
7500 | 66.1 | 597.5 | 96.6 | 1633.3 | 913.7 | 365.9 |
10000 | 86.6 | 748.1 | 125.0 | 1852.0 | 1056.3 | 542.0 |
25000 | 232.9 | 1318.0 | 303.7 | 2654.9 | 1941.0 | 2172.6 |
50000 | 560.3 | 2887.7 | 680.4 | 3890.7 | 3516.4 | 7207.6 |
As we can see Go and C maintain pretty high performance while other languages quickly start being significantly slower which is not a surprise. There is an interesting result, however, where Python’s run time performance is near C’s for small data sets and get slower faster than all the other languages. I do not know enough about Python to know why this could be happening, but let us assume that there is not anything wrong with my code. This is an example of a good trade-off using hard data in action. Instead of having a subjective argument with your co-workers about which language you prefer or in general which language is faster, you can now talk about which language with be “good enough” for your specific use case. This is why doing the math is needed. You probably will not find very many surprising results, but you will have concrete data to know the exact trade-off you are making. Let us look at all the tested languages with the i9 now.
Total Records vs Time (ms)
Total Records | c | c# | go | java | javascript | python |
---|---|---|---|---|---|---|
0 | 0.6 | 12.4 | 0.8 | 13.2 | 4.9 | 1.3 |
10 | 0.6 | 12.9 | 0.8 | 18.0 | 5.0 | 1.3 |
1000 | 0.9 | 13.2 | 1.2 | 30.2 | 8.2 | 2.2 |
2500 | 1.2 | 14.2 | 1.8 | 43.3 | 18.1 | 3.4 |
5000 | 2.0 | 16.3 | 2.8 | 55.5 | 24.2 | 5.8 |
7500 | 2.7 | 17.7 | 3.7 | 66.8 | 28.5 | 9.0 |
10000 | 3.5 | 22.9 | 4.4 | 75.9 | 32.1 | 13.1 |
25000 | 8.2 | 38.5 | 10.6 | 120.3 | 52.9 | 46.7 |
50000 | 17.8 | 77.3 | 22.5 | 169.4 | 88.0 | 147.0 |
Of course the i9 is significantly faster, but we do get another interesting result. The relative performance of all the languages is about the same. The only notable exception is Python. Unlike on the Raspberry Pi, Python is not slower than Java at 50,000 records, but it still has the same characteristic of performing near C and then quickly slowing down relative to other languages. This seemingly uninteresting fact has a useful implication. If the languages have similar performance characteristics across different processors, then it means that you can do something like run performance tests locally and then have an ability to predict the performance on another processor. More research needs to be done to say this for sure, but, if true, creates a powerful tool for picking which processors would theoretically meet the performance requirements for a given project. I could see this being especially useful in a domain like PC gaming where there is a wide variety of machines your program will run on. Keep in mind that predicting the performance on another machine does not have to be 100% accurate. It only needs to indicate to a developer if their solution is viable. There is no substitute for running on the real hardware.
As we can see, time does give us some useful information, but let us look at another way to see the same problem by measuring CPU cycles.
Total Records vs Cycles
Total Records | c | c# | go | java | javascript | python |
---|---|---|---|---|---|---|
0 | 14049436 | 485547668 | 21379826 | 291228532 | 130455582 | 45654207 |
10 | 14673178 | 480137892 | 22151392 | 579027841 | 134769978 | 44894099 |
1000 | 23070695 | 513780196 | 33105671 | 980007552 | 321642420 | 78654262 |
2500 | 35755022 | 545718792 | 51157895 | 1330116181 | 673467684 | 138203850 |
5000 | 56255947 | 614851366 | 82142513 | 1615104160 | 904631965 | 268447346 |
7500 | 79357338 | 717013256 | 115972913 | 1959954743 | 1096460380 | 439043576 |
10000 | 103882221 | 897774909 | 150006710 | 2222374502 | 1267586764 | 650366166 |
25000 | 279428372 | 1581585683 | 364426834 | 3185905300 | 2329240783 | 2607152428 |
50000 | 672324581 | 3465236218 | 816512131 | 4668892689 | 4219671377 | 8649140063 |
As you can see this graph looks nearly identical to the time based graph seen earlier. We will only look at the Raspberry Pi since I only measured the cycles for my experiments, which means the graphs will look identical in shape when compared to their time equivalent graphs. So if they are practically the same in these experiments, then why look at the cycles? The cycles give us a proxy for the number of instructions being used to execute a program. This is useful for understanding if your series of unit operations is as efficient as it can be without needing to know the clock speed of the CPU. If we make the assumption that the code for each language I wrote is as efficient as can be and there is no way to beat the performance of C, then we now know how much it costs us to choose particular languages. Other people could try to implement my series of unit operations on their own and would know how efficient their implementation is in comparison to the “ideal” version. So for example, if their program in C# is 99% as efficient as my C# implementation then there is very little, in terms of “performance tricks”, that can be done. This is great! You do not have to waste time seeing if you can get more performance out of your C# implementation, you now know you need to pick a different series of unit operations, programming language, or processor. This is what engineering is all about; having concrete data to help you make good decisions in the way you invest your resources.
While all this information is useful, it still is not the best for convincing management to side with you. In my experience, the best language to use to talk to nontechnical people is a language everyone understands, money. How do we do that? If we know the number of watts our processor consumes and the cost of the watts in dollars per kilowatt-hour then we can multiply them both divided by 1,000 to get dollars per hour like so:
W * 1 kW / 1000 W * $/kWh = $/h
You can see the assumed power consumption and cost here. Since we already know the time taken to run our algorithm we simply need to multiply that time by the dollars per hour. As you can see below the results mirror the time and CPU cycle graphs because we multiplied all the results by the same constant to determine the cost.
Total Records vs Dollars
Total Records | c | c# | go | java | javascript | python |
---|---|---|---|---|---|---|
0 | 1.54e-09 | 5.31e-08 | 2.34e-09 | 3.18e-08 | 1.43e-08 | 4.99e-09 |
10 | 1.60e-09 | 5.25e-08 | 2.42e-09 | 6.33e-08 | 1.47e-08 | 4.91e-09 |
1000 | 2.52e-09 | 5.61e-08 | 3.62e-09 | 1.07e-07 | 3.51e-08 | 8.60e-09 |
2500 | 3.91e-09 | 5.96e-08 | 5.59e-09 | 1.45e-07 | 7.36e-08 | 1.51e-08 |
5000 | 6.15e-09 | 6.72e-08 | 8.98e-09 | 1.77e-07 | 9.89e-08 | 2.93e-08 |
7500 | 8.67e-09 | 7.84e-08 | 1.27e-08 | 2.14e-07 | 1.20e-07 | 4.80e-08 |
10000 | 1.14e-08 | 9.81e-08 | 1.64e-08 | 2.43e-07 | 1.39e-07 | 7.11e-08 |
25000 | 3.05e-08 | 1.73e-07 | 3.98e-08 | 3.48e-07 | 2.55e-07 | 2.85e-07 |
50000 | 7.35e-08 | 3.79e-07 | 8.92e-08 | 5.10e-07 | 4.61e-07 | 9.45e-07 |
Another unsurprising result, it turns out it is very cheap to run a computer for a few seconds. It is difficult to really get a sense of what this all means. Specifically, how do we convey the message to our nontechnical people about what trade-offs they are making when picking a processor and a language? Often in chemical engineering, the best way to illustrate the cost is in terms of the product you manufacture. In the world of software, our product is data. We can easily calculate this by taking the size of the file processed divided by the cost to run the whole algorithm to give us Gigabytes per Dollar which gives us the following results.
Total Records vs Gigabyte Per Dollar
Total Records | c | c# | go | java | javascript | python |
---|---|---|---|---|---|---|
0 | 6.48e+00 | 0.00e+00 | 4.32e+00 | 0.00e+00 | 0.00e+00 | 2.16e+00 |
10 | 1.14e+02 | 4.32e+00 | 7.56e+01 | 2.16e+00 | 1.30e+01 | 3.67e+01 |
1000 | 6.94e+03 | 3.11e+02 | 4.84e+03 | 1.64e+02 | 4.99e+02 | 2.04e+03 |
2500 | 1.12e+04 | 7.34e+02 | 7.82e+03 | 3.00e+02 | 5.94e+02 | 2.90e+03 |
5000 | 1.42e+04 | 1.30e+03 | 9.73e+03 | 4.94e+02 | 8.83e+02 | 2.98e+03 |
7500 | 1.52e+04 | 1.68e+03 | 1.04e+04 | 6.13e+02 | 1.10e+03 | 2.74e+03 |
10000 | 1.54e+04 | 1.79e+03 | 1.07e+04 | 7.21e+02 | 1.27e+03 | 2.47e+03 |
25000 | 1.44e+04 | 2.54e+03 | 1.10e+04 | 1.26e+03 | 1.72e+03 | 1.54e+03 |
50000 | 1.19e+04 | 2.31e+03 | 9.80e+03 | 1.71e+03 | 1.90e+03 | 9.26e+02 |
This is a very useful metric for business people to have, especially early on in a project, because it gives them an idea of how much money they would need to charge in order to make a significant profit. As you can see in my experiments, Java was generally the least profitable language to use and C was the most profitable to use. This gives us an ability to actually give more than a passing thought about the language picked. You can see that if we are not running our computer for very long or have a low volume of data then picking the “fastest” language does not give us much of a benefit, but if the only way to make the project profitable is to use C, then the nontechnical people can easily see that. Something that I often hear is a fear to change programming languages, because they could be difficult to hire for. If your company stands to make millions more dollars per year by simply hiring an expensive C developer or training their existing team in C programming, barring not having enough capital to fund the project, it is hard to imagine they would reject such an idea outright.
Let us now take a look at the cost of C running on different processors.
Total Records vs Gigabyte Per Dollar
Total Records | PI 3 Model B@1.2 | i9@4.7 | i9@5.0 |
---|---|---|---|
0 | 6.48e+00 | 1.29e+00 | 1.12e+00 |
10 | 1.14e+02 | 2.58e+01 | 2.25e+01 |
1000 | 6.94e+03 | 1.69e+03 | 1.47e+03 |
2500 | 1.12e+04 | 2.94e+03 | 2.56e+03 |
5000 | 1.42e+04 | 3.68e+03 | 3.21e+03 |
7500 | 1.52e+04 | 4.12e+03 | 3.59e+03 |
10000 | 1.54e+04 | 4.23e+03 | 3.69e+03 |
25000 | 1.44e+04 | 4.52e+03 | 3.94e+03 |
50000 | 1.19e+04 | 4.13e+03 | 3.60e+03 |
We finally now see a situation where the Raspberry Pi shines, cost to run! If data is low volume and is allowed to be slow to produce, then we now know that the Raspberry Pi is a perfect solution to our problem. This is what engineers do! They use experiments and mathematical models to guide them to a solution which avoids costly and unnecessary trial and error in production.
Unit Operations
For the unit operations section we are just going to take a look at the cycle count for the sorting section of our program since all the same cost and performance analysis can be done for the other unit operations.
Total Records vs Cycles
Total Records | c | c# | go | java | javascript | python |
---|---|---|---|---|---|---|
0 | 2792159 | 5114654 | 3688675 | 10687175 | 10681266 | 3956866 |
10 | 2832413 | 6012382 | 3514230 | 9621183 | 10935949 | 3872077 |
1000 | 3563015 | 9436473 | 4348165 | 41001187 | 45432890 | 20673766 |
2500 | 5113914 | 16487233 | 4628367 | 47206292 | 67640476 | 53269002 |
5000 | 8396329 | 37741884 | 6353428 | 72961195 | 79474389 | 136594279 |
7500 | 13346105 | 64668119 | 9374607 | 131656583 | 97613712 | 260292514 |
10000 | 19307134 | 142073658 | 12793925 | 172160720 | 110865507 | 426776896 |
25000 | 80935209 | 466851246 | 50487404 | 438295255 | 237565658 | 2103350194 |
50000 | 280796513 | 1475048036 | 175304622 | 722500522 | 668473923 | 7676206450 |
As you can see the languages perform about the same relative to each other.
Below are the results of deriving equations for our trend lines. Note I used
a x^2
instead of a x * log(x)
trend line because I could not get sklearn
to create a good equation for x * log(x)
. Since quick sort’s performance is
O(n * log(n))
then x * log(x)
should be a better fit, but this will be good
enough for our purposes.
c:
y = 9.75e-02x^2 + 6.85e+02x + 2.75e+06, R^2 = 1.0000
c#:
y = 4.31e-01x^2 + 8.01e+03x + -2.28e+06, R^2 = 0.9995
go:
y = 6.26e-02x^2 + 3.06e+02x + 3.56e+06, R^2 = 1.0000
java:
y = -9.11e-02x^2 + 1.90e+04x + 1.73e+06, R^2 = 0.9967
javascript:
y = 1.56e-01x^2 + 4.66e+03x + 4.12e+07, R^2 = 0.9945
python:
y = 2.78e+00x^2 + 1.45e+04x + 3.15e+05, R^2 = 1.0000
In a world where there are not readily available equations to use, focusing on performance critical unit operations is probably a good compromise. You can use your intuition to know which unit operations could potentially cause issues, and then you could test different algorithms/programming languages/hardware to see which approach will be the best for your situation. Having equations for individual unit operations also helps with doing “what if” scenarios with a tighter feedback loop than experimenting with performance in production. For example, you can see what different algorithms could improve the performance of your unit operation (e.g. use merge sort instead of quick sort) or use a different sequence of unit operations altogether.
Conclusion
As you can see the results generally showed that picking certain languages and hardware has a cost to it. This is unsurprising as I think most developers would agree that there is a trade-off in making decisions with regard to languages and hardware. What is key is that we actually graphed and derived equations describing the behaviors of the program. We no longer are making decisions based on gut feelings and hoping they are profitable. We are proving it using equations derived from experimentation. Welcome to the world of engineering!