How did we learn to solve complex challenges?
Let's begin with a question: which is more difficult, building a bridge across the Gulf of Corinth or performing heart surgery? Building a bridge requires a team of a hundred people and many months of work. Heart surgery requires several experts and a few hours. And yet, building the bridge is the simpler task. Why? Because "difficulty" depends on the domain in which the challenge is located.
Building a bridge (or a house, or a skyscraper) takes place in a complicated domain. This means that the domain, or system, can be divided into smaller units, each of which can be solved individually. If the ground is unstable, we install piles. Then we lay the foundations, then build floor by floor, and so on, down to the finest details.
Heart surgery, on the other hand, takes place in a complex domain. A complex domain is a network of interactions that cannot be observed (or solved) separately, only as a whole system. So it is essential that all aspects of the operation work together, and simultaneously.
We are always in a complex domain
United Cloud's development centers, where our products are built, are located in several countries in the region and across Europe. Developing something means innovating. This means that there is no pre-established process or ready-made solution for the problem we are solving. Even when we know what we want the output to be (say, a feature that already exists on some other streaming platform), the process by which we arrive at that output is always a story in itself.
When you operate in a complex domain, it means that not everything will go smoothly. The launch of EON TV is one of the stories we often recount over lunch at United Cloud.
Every beginning is complex
In the beginning, we needed to develop EON TV with full functionality very quickly. The first challenge was the streaming platform. Although we already had one, we had to rework it heavily to fit our product. Another challenge was the Metadata servers, which provide the client application with data such as the EPG, channel list and event descriptions, but also take care of client packages, policies, credentials, etc. (in short, everything the platform needs in order to function). We also had to design and set up the infrastructure. However, the biggest challenge with EON TV is that the platform needs to be available 24 hours a day, every day of the year, from absolutely every device!
We spent most of 2017 developing the EON TV platform, with the goal of launching it on September 5th. That date was ideal because it was the beginning of the European Basketball Championship. We launched EON TV simultaneously in Serbia, Slovenia, Montenegro and Bosnia. The start of the championship seemed to us like a great way to immediately attract viewers to EON TV. Due to tight deadlines, we kept developing the platform until the last day. We ran all the tests we could at the time, and everything worked great, until…
It didn't work! Unpredictability reigns in complex domains. What works perfectly in the test phase sometimes doesn't work in the real world. So at the very start, we were met with several serious problems. The first looked like a Distributed Denial of Service (DDoS) attack, that is, a distributed (hacking) attack intended to overload the infrastructure and make it stop working. While we were trying to find out who was doing this to us... in the end, it turned out that our own iOS application needed further optimization!
The next day was a different matter. Serbia was playing, and the number of devices tuning in to the match kept growing. The game was supposed to start at 20:00 and everything was working perfectly... until 19:58, when everyone suddenly started to log in. Then we realized that our landing page carried too much information. It was too "heavy": a lot of processing and memory was needed to serve it. Under the flood of logins, servers went down, and we brought them back up. And all this in real time, while the match was going on, so that there would not be even the slightest interruption in the transmission! We didn't sleep much in those days.
During matches the platform was under heavier load, and outside of matches we had less traffic and more room to maneuver. During the championship, we joked that the only thing missing was Serbia and Slovenia playing in the final. A few days later, when Serbia and Slovenia both made it to the final, it wasn't so funny!
Now it's a different story
Because of such a challenging start, we solved some complex problems very early on. From 10,000 devices at the very beginning, we have reached more than half a million. Back then we were sending a few hundred megabits per second, and now we send 1.5 terabits per second. And all that without anyone having to stay awake at night!
Now we do everything differently. We have established a process that is our response to working in a complex domain. We don't allow ourselves such large deployments, and we think especially carefully about the moment at which we go into production. Much more importantly, the entire process, from the beginning of the implementation to its end, is very different.
The basic step for implementing a quality process is the quality of the code base:
Code base quality
We try to build quality into every step. So, first of all, we take care of the quality of our code base. We have code standards, as well as code reviews in which we discuss the complexity of the implementation. At this stage, we ask ourselves the following questions:
a. Can something be coded more simply?
b. Is it simple enough to maintain?
c. Is it a robust enough solution?
d. Did we introduce any security problems at the start?
e. Should we update our code standards?
This first step is extremely important. Besides ensuring quality, it also ensures that all new team members use the same work practices. That way, they fit into the team faster, working in a similar way and with the same mindset.
Static analysis is an integral part of code-base quality
Through static analysis, we check the quality of our code automatically. These are relatively complex algorithms that check the state of the code base: whether we have added some "technical debt", and whether the application itself is robust enough (resistant to problems). We get feedback immediately and in an automated way, without anyone having to ask for it. For example, we are told that a class should be changed because it is too long (say, 300 lines of code) and is often touched by more than one developer, which means it is a potential source of problems. We even get suggested solutions, e.g. to split that particular class into several smaller ones or to refactor the code.
The next important step in the implementation of the process is the deployment pipeline, which we have been optimizing successfully over the past five years. We wrote a separate text about it because we know that no one has the attention span to follow a text longer than three pages!
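To make the "class is too long" example concrete, here is a minimal sketch in Python. It is an illustration only, not the tool we actually use: the function name `find_long_classes` and the 300-line default are assumptions made for this example, and it looks only at class length, not at how many developers touch the class.

```python
import ast
from typing import List, Tuple

# Assumed threshold for this illustration; real tools make it configurable.
MAX_CLASS_LINES = 300


def find_long_classes(source: str, max_lines: int = MAX_CLASS_LINES) -> List[Tuple[str, int]]:
    """Return (class_name, line_count) for every class longer than max_lines.

    Hypothetical helper: it parses the source without running it,
    which is what makes the analysis "static".
    """
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # end_lineno is available on AST nodes since Python 3.8.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                findings.append((node.name, length))
    return findings


if __name__ == "__main__":
    sample = "class Big:\n" + "    x = 1\n" * 5 + "class Small:\n    pass\n"
    # A tiny threshold is used here only so the sample triggers a finding.
    for name, length in find_long_classes(sample, max_lines=3):
        print(f"{name}: {length} lines, consider splitting it into smaller classes")
```

A real static-analysis tool combines many such rules and, in our example, would also need version-control history to know how many developers change the same class.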
And finally, if you also code in a complex domain, contact us, and let's talk (more) about good practices! See our open positions on United Cloud's Joberty page and join us.
Author: Igor Tanacković, Chief System Architect