
Data Transport Phenomena

Creating Equations for Software Engineering

TL;DR

Transport Phenomena turned Chemical Engineering into a proper engineering discipline. If we follow the same roadmap, we too, as software engineers, can turn our profession into a proper engineering discipline. I am calling this new approach Data Transport Phenomena.


Introduction

In the Factorio Architecture posts, we explored how Factorio and Chemical Engineering have closely related design problems to solve. I will assume you have already read those posts, so I will keep this refresher brief.

As a reminder, Factorio and Chemical Engineering deal with the manufacturing of chemicals, while Software Engineering deals with manufacturing data instead. There is some dispute about when Chemical Engineering became a proper engineering discipline, but many consider the introduction of Transport Phenomena as the moment Chemical Engineers became “real” engineers. This is largely because they were no longer dependent on empirical tests to predict system performance. Instead, they could use mathematics to avoid performing experiments on nearly every piece of equipment.

What is Transport Phenomena?

Transport Phenomena is the study of how mass, energy, and momentum are transferred within a system. It encompasses many equations that describe how each of these three quantities flows through a system. Transport Phenomena serves as a foundational concept in Chemical, Mechanical, and Aerospace Engineering.

To understand how Transport Phenomena equations are used by engineers, let us examine an equation that describes energy flow: Fourier’s Law.

Fourier’s Law

Fourier’s Law describes how heat flows from high temperature to low temperature. To visualize this, examine the figure below.

[Figure: a wall of area A and thickness x, with heat Q flowing from the inside temperature T (inside) to the outside temperature T (outside).]

Here you can see a wall insulating a building. We will assume it is a cold day and the outside temperature is lower than the inside temperature, meaning heat will flow from inside to outside. We can describe this system using Fourier’s Law below.
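
$$Q = \frac{k A \left(T_{\text{inside}} - T_{\text{outside}}\right)}{x}$$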

  • Q is the rate at which energy flows from the inside temperature to the outside temperature.
  • A is the area of the wall.
  • k is the thermal conductivity. This is an experimentally discovered value that depends on the wall material. Higher values mean heat travels more quickly through the material. For example, glass has higher thermal conductivity than wood.
  • T outside is the outside temperature.
  • T inside is the inside temperature.
  • x is the wall thickness.

Fourier’s Law Limitations

When discussing the need for mathematical equations with other software developers, I often hear the critique that equations cannot be used in software development because software is too complex to predict. While it is difficult to predict performance down to the exact clock cycle, this does not mean there is no value in having good approximations.

Fourier’s Law is a good example of how chemical engineers face this same problem. As you can see, Fourier’s Law is very simple and intuitive, and that simplicity does not capture the full reality of heat transfer if you want maximum accuracy. Heat transfer is really the result of billions of molecules bouncing off each other and exchanging energy. To be completely accurate, you would have to simulate all the possible paths those molecules could take, which would give you a distribution of values for the heat flow rate. That is impractical.

Instead, engineers simplify the world and make reasonable assumptions. In Fourier’s Law, some assumptions include constant temperatures and uniform thermal conductivity throughout the wall. In the real world, these assumptions are always wrong, but they don’t always matter. With thermal conductivity, even though materials never have truly uniform properties, the differences are often so small they’re not worth accounting for.

Regarding constant temperatures, they do change significantly from hour to hour. However, you can examine a “snapshot” of energy flow rate, which is useful for determining if a furnace can produce enough heat to maintain the desired temperature during the coldest expected nights. If it cannot, you know you need either a larger furnace or different wall materials. This is valuable because chemical engineers don’t need to conduct numerous experiments to determine energy loss on a cold day. Instead, they can do quick calculations and know the results will be close enough for design purposes.
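
As a rough worked example (the wall dimensions and the brick conductivity of roughly 0.7 W/(m·K) are assumptions for illustration, not measurements): consider a single 20 m² brick wall, 0.2 m thick, on a night when it is 20 °C inside and −10 °C outside.

$$Q = \frac{k A \left(T_{\text{inside}} - T_{\text{outside}}\right)}{x} = \frac{0.7\ \tfrac{\text{W}}{\text{m}\cdot\text{K}} \times 20\ \text{m}^2 \times 30\ \text{K}}{0.2\ \text{m}} = 2100\ \text{W}$$

The furnace would need to supply roughly 2.1 kW just to offset the losses through that one wall, and the answer comes from a one-line calculation rather than an experiment.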

What Does Thermal Conductivity Even Mean?

The other variables in Fourier’s Law measure easily identifiable real-world quantities such as temperature, wall area, and heat. However, thermal conductivity doesn’t relate to anything tangible in the real world. It is what we might call in computer science a magic number.

Thermal conductivity is experimentally derived by scientists and published for engineers to use, so engineers don’t need to run experiments themselves. This magic number avoids the need for complex simulations at the cost of requiring laboratory experiments. We too, as software developers, could have the same pipeline of experimentally derived magic numbers to speed up software design, but we lack standard equations.

The Discovery of Fourier’s Law

While researching for this blog post, I found an interesting paper: Reconstructing the early history of the theory of heat through Fourier’s experiments, which discusses the history of experiments used to prove Fourier’s ideas. I encourage you to read it if you want to know more. Interestingly, this paper acknowledges that the discovery of seemingly simple phenomena never follows a linear path.

[Adopting an approach aimed at following how the historical path leading to how given results developed] is evidently very useful especially for students in their understanding of given topics, when they can usually rely only on college or university textbooks that, even without trivializing the topic, nevertheless simplify it by linearizing a historical path that—as history teaches—is never at all linear.

For this post’s purposes, delving into Fourier’s Law history isn’t necessary. Please refer to the papers I’ve referenced if you want more details. The main takeaway is that discovering equations and novel theories is quite difficult, let alone proving their validity to an industry. It often takes many small incremental steps to build into a significantly new idea.

In fact, Fourier first presented his ideas to the French Academy in December 1807, but his heat transfer theories were not well received by his peers. He eventually published them in a book titled Théorie analytique de la chaleur, or The Analytical Theory of Heat, at which point his ideas began gaining wider acceptance.

If you want a more technical paper on the history, see Fourier’s heat conduction equation: History, influence, and connections.

Discovering Data Transport Phenomena

Creating all the equations needed to fully describe how data flows through software is too large a task for any one person. It will take many people many years to figure them all out. I believe the best approach is to make small incremental changes that slowly build toward a unified theory, just like other scientific and engineering disciplines have done in the past. The question is: how do we make progress? I will demonstrate how we might achieve this using sorting algorithms. You can see the code and experimental results here on GitHub.

Conducting Experiments

To create equations, we need reliable, reproducible results that isolate variables, making it obvious how an equation should be designed based on those results. I decided to accomplish this by timing bubble sort, merge sort, and quick sort on various array sizes using the following processors: AMD Ryzen 5 Microsoft Surface (R) Edition, AMD Custom APU 0932, i5-8600K, i9-13900K, and a Raspberry Pi 3 Model B.
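
As a minimal sketch of what such a benchmark might look like (the array sizes, element type, and structure here are my assumptions for illustration; the actual code is in the linked repository):

```python
# Minimal benchmark sketch: time bubble sort across a range of array sizes.
# Element type and sizes are illustrative assumptions, not the exact setup
# from the linked repository.
import random
import time

def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break  # already sorted: this early exit is what makes sorted input O(n)

def time_sort_ns(sort_fn, size_bytes, element_size=8):
    # Map "array size in bytes" onto element counts by assuming 8-byte elements.
    data = [random.random() for _ in range(size_bytes // element_size)]
    start = time.perf_counter_ns()
    sort_fn(data)
    return time.perf_counter_ns() - start

for size_bytes in [2**p for p in range(10, 18)]:  # 1 KB up to 128 KB
    print(size_bytes, time_sort_ns(bubble_sort, size_bytes))
```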

Let’s examine the performance of bubble sort on the i5-8600K.

Second Order Algorithm

We can see the expected O(n²) response. Now it’s time to create an equation. This is a trial-and-error process, but we can start with something basic. We can then look for patterns in related data and try to create a more general equation from there (sounds like programming, doesn’t it?).
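
$$t = k \, b^2$$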

  • t is the runtime of the algorithm (ns).
  • b is the size of the array being processed (B).
  • k is what I’m calling the data conductivity as a nod to Fourier. Plus it sounds formal and scientific. (ns/B²)

I will refer to this equation as the second order data rate equation. It is second order because we raise b to the second power.

Through the magic of linear regression, we can estimate a value for k just like scientists do for Fourier’s Law. When we do that, we get a value of about 6.67 ns/B².
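
A sketch of that fit, using a zero-intercept least-squares estimate (the runtimes below are placeholder values roughly consistent with the reported k, not the actual measurements):

```python
# Fit t = k * b**2 with a zero-intercept least-squares estimate of k,
# then report R². The runtimes below are placeholders roughly consistent
# with k ≈ 6.67 ns/B², not the real measurements.
import numpy as np

sizes_bytes = np.array([1024.0, 2048.0, 4096.0, 8192.0])   # b, array size in bytes
runtimes_ns = np.array([7.0e6, 2.8e7, 1.1e8, 4.5e8])       # t, measured runtime in ns

x = sizes_bytes**2
k = np.sum(x * runtimes_ns) / np.sum(x * x)                 # closed-form zero-intercept fit

residuals = runtimes_ns - k * x
r_squared = 1 - np.sum(residuals**2) / np.sum((runtimes_ns - runtimes_ns.mean())**2)
print(f"k ≈ {k:.2f} ns/B², R² ≈ {r_squared:.3f}")
```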

Why did I choose this equation form? I eyeballed it and thought a line with b² as the dominant term would fit. I included no other terms because I wanted to ensure the equation result is zero when there’s no data to sort. After using linear regression to solve for k, I saw the R² was an amazing 0.999 (1 means the model is always accurate, 0 means you’d get better results asking a newborn to predict the stock market). Since the R² was so high, I decided it was a good equation.

Yes, I know overfitting is a risk, but this is sufficient for a blog post. While this method is quite arbitrary, it’s effective at developing equations. Let’s see how this equation performs against the experimental results.

As you can see, there’s a near-perfect match between the equation’s time prediction and the actual runtime. Now, when I performed a similar process to determine k for the remaining algorithms, I found that k was different for all of them. Having to calculate k for all processors by running experiments is annoying, to say the least. It would be better if we could predict with reasonable accuracy without always running tests. So let’s see if we can do it!

After creating graph after graph, I discovered there was, unsurprisingly, a relationship between processor clock speed and the k value.

As you can see, it’s not perfect, but we do get an R² value of 0.93, which might be good enough. The biggest error at 1.8 GHz is from the Raspberry Pi, which, as you probably know, uses an ARM processor. My guess is that this is the main reason for the error. It might be better to have separate equations for different architectures. Below is the equation I used to model the data. I will refer to this as the exponential k estimator.

  • k is the data conductivity. (ns/B²)
  • s is the clock speed. (GHz)
  • α is a constant I haven’t given a fancy name for. (unitless)
  • β is a constant I haven’t given a fancy name for. ((ns·GHz^α)/B²)

In this case, α is about 2.6 and β is about 206 ((ns·GHz^α)/B²). So why did I choose this equation? I used some eyeballing and intuition. I made a few assumptions:

  1. It is impossible for a k value to be negative. If it were negative, it would imply that runtime can be negative.
  2. It is impossible for CPU clock speed to ever be negative.
  3. Sorting will get huge performance gains with lower clock speeds and smaller gains when clock speeds are higher. This seems to match how the data worked out, but I need to test more CPUs to be confident this is actually true.

To meet these assumptions, I needed an equation that produces positive k values for all positive clock speeds, creates exponentially larger k values for smaller clock speeds, and exponentially smaller values for higher clock speeds. A 1/x equation gives us these exact characteristics. I added α and β to fine-tune the equation to better represent the data.
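
One way to fit α and β is to take logarithms, which turns the estimator into a straight line, ln k = ln β − α·ln s, and then run ordinary linear regression. A sketch (the clock speeds and k values below are placeholders roughly consistent with the fitted constants, not the measured data):

```python
# Fit k = beta / s**alpha by linear regression in log-log space:
# ln(k) = ln(beta) - alpha * ln(s). The data points are placeholders
# roughly consistent with alpha ≈ 2.6 and beta ≈ 206, not real measurements.
import numpy as np

clock_ghz = np.array([1.2, 1.8, 2.4, 3.6, 4.3])        # s, clock speed in GHz
k_values  = np.array([128.0, 45.0, 21.0, 7.4, 4.7])    # k in ns/B²

slope, intercept = np.polyfit(np.log(clock_ghz), np.log(k_values), 1)
alpha = -slope
beta = np.exp(intercept)
print(f"alpha ≈ {alpha:.2f}, beta ≈ {beta:.0f} (ns·GHz^alpha)/B²")
```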

Combining the exponential k estimator with our second order algorithm equation, we get the following:
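
$$t = \frac{\beta \, b^2}{s^{\alpha}}$$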

If we assume that α and β are inherent to the algorithm and independent of hardware, then all we need to predict bubble sort performance is the clock speed and the amount of data being processed. Both are things any software developer should know when building a system. That’s it! The burden of upfront planning drops dramatically: we can understand the hardware requirements and the cost of running our software before we build it!
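
For example, plugging in a 64 KB unsorted array (b = 65,536 B) on a 3.6 GHz core, using the fitted constants α ≈ 2.6 and β ≈ 206, gives a ballpark estimate:

$$t = \frac{206 \times 65{,}536^2}{3.6^{2.6}} \approx 3.2 \times 10^{10}\ \text{ns} \approx 32\ \text{s}$$

Treat the number as an illustration of the workflow rather than a benchmark claim; it simply follows from the constants fitted above.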

Let’s examine what happens when you try to sort an array that’s already sorted.

First Order Algorithm

As you can see, bubble sort goes from O(n²) to O(n), which is not surprising. However, this is problematic for our equation, since how sorted the array already is clearly has a huge impact on performance.
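
$$t = k \, b$$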

  • t is the runtime of the algorithm (ns).
  • b is the size of the array being processed (B).
  • k is the data conductivity. (ns/B)

Here I created a first order algorithm equation, which is the same as our second order equation except b is raised to the first power. I found that the same exponential k estimator equation worked for defining k, giving us the following combined first order equation:
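
$$t = \frac{\beta \, b}{s^{\alpha}}$$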

It would be nice to combine both the sorted and unsorted bubble sort equations into one so you don’t need two equations to describe the same algorithm. I need to figure out a reliable way to estimate how sorted an array is so I can update the k estimator equation to account for this, but I haven’t found a good method yet.

With that in mind, let’s compare bubble sort to quick sort when the array is sorted.

Picking the Right Algorithm for the Job

Here you can see that bubble sort actually outperforms quick sort when an array is completely sorted. If we as an industry had well-defined mathematical equations, we would be able to do the math and pick the best algorithm to meet our specific requirements.

You might have noticed that quick sort behaves like an O(n) algorithm even though it should be O(n log(n)). I found the same results for merge sort. Both merge sort and quick sort were still O(n) when the arrays were not sorted. I suspect the data needs to be more shuffled, but I’m not sure how to achieve that reliably. For this blog post, I’ll assume they are first order algorithms, though that’s obviously not what they actually are.

You might have also noticed a jump in runtime around the 64 MB mark, which I suspect is because the array no longer fits in the processor’s cache. It would be interesting to see how accounting for different element sizes, cache sizes, and memory speeds affects the accuracy of the sorting algorithm equations, but this blog post is long enough as it is.

Final Equations

Here are the α and β values for all the sorting algorithms I tested:

  • Bubble Sort
    • Unsorted (Second Order Algorithm)
      • α = 2.6
      • β = 206 (ns·GHz^α)/B²
    • Sorted (First Order Algorithm)
      • α = 2.63
      • β = 140 (ns·GHz^α)/B
  • Merge Sort
    • Unsorted (First Order Algorithm)
      • α = 2.28
      • β = 133,545 (ns·GHz^α)/B
    • Sorted (First Order Algorithm)
      • α = 2.3
      • β = 133,829 (ns·GHz^α)/B
  • Quick Sort
    • Unsorted (First Order Algorithm)
      • α = 2.16
      • β = 2,765 (ns·GHz^α)/B
    • Sorted (First Order Algorithm)
      • α = 2.46
      • β = 3,285 (ns·GHz^α)/B
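
As a usage sketch, these constants can be packed into a small helper for back-of-the-envelope predictions (the helper and its naming are mine; only the α, β, and order values come from the table above, and the first order treatment of merge sort and quick sort carries the caveat discussed earlier):

```python
# Back-of-the-envelope runtime predictions (ns) from the fitted constants.
# Only the (alpha, beta, order) values come from the table above; the
# helper itself is an illustrative sketch.
CONSTANTS = {
    ("bubble", "unsorted"): (2.60, 206.0, 2),
    ("bubble", "sorted"):   (2.63, 140.0, 1),
    ("merge",  "unsorted"): (2.28, 133_545.0, 1),
    ("merge",  "sorted"):   (2.30, 133_829.0, 1),
    ("quick",  "unsorted"): (2.16, 2_765.0, 1),
    ("quick",  "sorted"):   (2.46, 3_285.0, 1),
}

def predict_ns(algorithm: str, presortedness: str, array_bytes: float, clock_ghz: float) -> float:
    alpha, beta, order = CONSTANTS[(algorithm, presortedness)]
    return beta * array_bytes**order / clock_ghz**alpha

# Example: compare bubble sort and quick sort on an already sorted 1 MB array at 3.6 GHz.
for algo in ("bubble", "quick"):
    print(algo, round(predict_ns(algo, "sorted", 1_048_576, 3.6) / 1e6, 1), "ms (approx.)")
```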

Conclusion

While the path to creating equations in software engineering is neither clear nor easy, I hope I have demonstrated that it is not only possible but also valuable. Simply having equations for sorting algorithms alone won’t be very useful, but I hope that with time, more people will feel inspired to make their own small contributions to the available equations for every kind of algorithm in our industry.

Eventually, we should be able to predict software performance and hardware requirements before the software is even built. As with other engineering disciplines, there will still be a need for experimentation, but hopefully that will only be necessary for truly novel approaches to software, not for approaches that many other developers have already implemented.

I encourage you, especially if you know you can do better than me, to publish your own equations and present them! Mathematics is the best tool we can use to help make software less costly and more scalable! It is now time for the era of Data Transport Phenomena!