LEAVE THE HARDWARE ALONE
Having worked in IT for 10 years, I have realized that relying on hardware is simply unrealistic. It tends to break down. Very often. Unexpectedly.
Luckily, the world keeps spinning, and more and more High Availability solutions appear each year. Building a good HA architecture is not easy. Trying to balance quality and price, I used virtualization, DRBD, Pacemaker, nginx and HAProxy, and built various high-availability database clusters (PostgreSQL, MySQL) with automatic failover, and what not.
Naturally, building it all manually is interesting only so many times. Later these solutions were automated with Puppet, which was replaced by SaltStack a year later.
Everything seemed to work just fine.
However, things were more complicated than they seemed:
- It’s difficult to expand resources in case of rapid traffic growth – data centers don’t always distribute them promptly.
- Most of the time the resources are idle, and they still need to be paid for.
- No matter how well the manifests for Puppet (or any other automation system) are written, there is always a risk that something in the environment is different. Then we have to spend time hunting for an intermittent issue across a bunch of servers.
- It’s difficult to run applications that need different versions of the same libraries on the same hardware. For example, an old site is on PHP 5, while a new one is on PHP 7.
Undoubtedly, virtualization solves most of these issues. I opted for KVM back then and set up all the new environments on top of it. A large number of virtual machines needs to be managed, and platforms such as CloudStack, OpenStack and some others cope excellently with this task.
However, such solutions are effective only for medium-sized and large-scale projects. If a project is small and starts growing dramatically, we have a problem. We either need to overpay and set up a complicated and bulky architecture from the start, or start moving to the cloud when there already is a lot of production traffic. And, as you all know, moving production to a different architecture with minimal downtime is a tall order.
Then I started looking at containers. FreeBSD was my favorite (to tell the truth, I almost became a FreeBSD fan). It offered a fast and effortless way to run a container with the necessary set of software in an isolated environment, and to control and clone it easily. I loved working with Jail and ZFS, especially ZFS, so much that I moved a Python/Django hosting project over to them.
The only disadvantage was that using FreeBSD everywhere was difficult: people, especially developers, were used to Linux. It was understandable; Linux had pushed BSD out of the hosting market long ago.
Then in 2013, while chilling somewhere along the Vietnamese coast, I came across an open-source project called Docker. The technology was so inspiring that I couldn’t help writing a one-page application titled ‘Launch your site in 1 click’.
The idea was simple: you clicked a big button and got a domain, a git repository and an SSH port. You could push your Python code to the git repository and see it on the site, as well as get root access to the container if necessary.
Next morning I posted the link to that site in a Python-focused Google+ group and took my motorbike to the lighthouse on the south coast. When I was coming back to my village, my phone picked up an Internet connection, and I saw that my Google+ account was bursting with messages. As a matter of fact, hundreds of people had rushed to test the service!
The sad thing, though, was that someone had written, ‘I’ve downed your server.’ I started looking into it as soon as I got back. It turned out that the SSD on the server had run out of space. Someone had simply executed:
while true; do date >> ~/1.bin; done
That person identified himself in the comments. He said that Docker was a bad technology, that it stood no chance in this universe, and that OpenVZ was the ultimate way.
Frankly speaking, I was frustrated by this turn of events. There were indeed no limits on disk space for a Docker container. I had gotten used to quotas in ZFS, and I couldn’t see how something similar could be done in Docker.
All things said, that person shook my enthusiasm for Docker and I stopped developing in that direction. Not for long, though.
In 2014 I was doing freelance jobs only, and I had a lot of young projects with very similar requirements. They all had similar business needs and called for similar architecture solutions built on standard technologies:
- We want our developers to have the same development environment as in testing, staging and production.
- We want our developers’ code to be delivered quickly.
- We need to know about the problems with the code before it gets into testing!
Then I remembered Docker. At about the same time, I learnt about GitLab (before that I used to set up gitolite + Redmine, which wasn’t very convenient).
In the end, I came up with a solution that connected Docker with GitLab. A Git hook triggered an image build after each push to a repository, sent the image to a private Docker registry, ran the tests in a container and, if everything was OK, told the servers of the different environments to recreate the container. It worked very smoothly.
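The flow described above can be sketched roughly as follows. This is only an illustration, not the original hook scripts: the registry host, project name, container name and test command are hypothetical, and the functions merely assemble standard Docker CLI calls (build, push, run, pull, rm) instead of executing them.

```python
import shlex
from typing import List


def pipeline_commands(registry: str, project: str, commit: str,
                      test_cmd: List[str]) -> List[List[str]]:
    """Docker CLI calls triggered by one push: build, push, then test."""
    image = f"{registry}/{project}:{commit}"  # hypothetical naming scheme
    return [
        ["docker", "build", "-t", image, "."],        # build image from the repo
        ["docker", "push", image],                    # send it to the private registry
        ["docker", "run", "--rm", image, *test_cmd],  # run tests in a throwaway container
    ]


def redeploy_commands(image: str, container: str) -> List[List[str]]:
    """Calls an environment server would run to recreate the container."""
    return [
        ["docker", "pull", image],
        ["docker", "rm", "-f", container],  # drop the old container (fails if none exists yet)
        ["docker", "run", "-d", "--name", container, image],
    ]


if __name__ == "__main__":
    # All values below are made up for the example.
    steps = pipeline_commands("registry.example.com", "shop", "abc123", ["pytest"])
    steps += redeploy_commands("registry.example.com/shop:abc123", "shop")
    for step in steps:
        print(shlex.join(step))
```

In a real hook each command would be executed in order, and the redeploy step would run only if the test container exited successfully.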
By that time I had learnt really well how to work with Docker volumes and Docker networks, and was building up my own collection of Dockerfiles covering the most typical images used in projects.
Later I added docker-compose files, which let me quickly put together a setup for local development.
Everything was going swell until the project started getting bigger. Brute-force Bash scripting wasn’t enough anymore, and the question of container orchestration came up. Then I learnt about Tutum (now it’s https://cloud.docker.com/).
Tutum was a dream. It made it possible to move a very dynamic project based on microservice architecture (http://checkio.org) into containers without any trouble. The only drawback was its requirement to store the images on its servers. Not every client would agree to that.
With each month, Docker was becoming more stable and full-featured. Within a year, I even stopped comparing it to FreeBSD Jail! It went on like this till the fall of 2016, when it became necessary to build a high-availability cluster with orchestration and scaling for a betting company.
The important requirements were fast deployment of new environments, modularity and hosting on the company’s own dedicated servers. Containers fit well into that paradigm, but something else was needed to make them convenient to manage in a multi-service architecture.
At the time there were:
- Docker Swarm
- and some other lesser-known projects.
I started my research with Docker Swarm since, at the time, it was more stable than the alternatives and closer to Docker itself. Swarm performed quite well during the tests, but its lack of flexibility and rather limited feature set made me keep searching.
Google’s Kubernetes proved to be a very big and promising project with flexible facilities and a convenient YAML-based manifest format.
I quickly built a prototype out of one master and several minion nodes and started experimenting. During the first month, I found so many bugs that I spent all my time checking them against GitHub issues and, if a bug was new, reporting it. The code was being fixed and developed at a fantastic speed!
I remember reporting an installation bug and then heading to the railway station. While I was in the cab (about 40 minutes), the bug was fixed and merged into master!
For a couple of months my experience with Kubernetes was similar to making my way through thorny shrubs somewhere in the mountains.
Every day Kubernetes was becoming more and more stable, while my skills in working with it were improving.
Kubernetes went into production with minimal downtime (a couple of minutes to redirect queries to the database), and there it was, working splendidly.
Deployment with Rolling Update (more about it later) did away with downtime in the majority of subsequent deployments, and describing the architecture in Kubernetes YAML files made it possible to deploy new environments super fast. GitLab integration let the developers forget about library incompatibilities between different environments.
Later we set up resource limits that kept things running efficiently when one of the microservices had a memory leak, and even made the issue hardly noticeable to users until it was fixed in the code.
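A rolling update strategy and a memory limit of the kind mentioned above can both be expressed in a single Deployment manifest. The sketch below is an assumption-laden illustration rather than our actual configuration: the service name, image, replica count and limit value are made up, and it uses the current apps/v1 API, which is newer than what was available when this story took place. The manifest is built as a Python dict and printed as JSON.

```python
import json
from typing import Any, Dict


def deployment_manifest(name: str, image: str, replicas: int,
                        memory_limit: str) -> Dict[str, Any]:
    """Build a Deployment with a rolling update strategy and a memory limit."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                # Start one new pod at a time and never go below the
                # desired replica count while old pods are replaced.
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # A container exceeding this limit is killed and
                        # restarted, so a leak cannot exhaust the node.
                        "resources": {"limits": {"memory": memory_limit}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical service name, image and limit.
    manifest = deployment_manifest("api", "registry.example.com/api:1.0", 3, "256Mi")
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output could be saved to a file and applied with kubectl apply -f.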
Monitoring helped us collect metrics per service group and understand how changes in the code affected latency and resource consumption.
For the last 6 months, our team has been specializing in projects that deal with cryptocurrencies. It’s an extremely interesting and dynamic market and, importantly, our product and expertise match the requirements of the field: high availability, security and autoscaling. So far, we have developed several ready-made architecture solutions for typical server-side projects that work with blockchain:
- failsafe BTCD
- fault-tolerant Ethereum nodes
- monitoring of the cluster and its components, with alert notifications in Slack
- a service for receiving financial logs and records from AWS Elasticsearch
Now we are working on a REST service that simplifies working with cryptocurrency blockchains: creating addresses, carrying out money transfers and automatically sending the money to a remote wallet.
We are also working on an automatic installer for HA Kubernetes on AWS with a web interface, which will introduce you to some of our solutions and give you a chance to test them.
So, we’ve got lots of discoveries in store, which my team and I will be sharing with you.
The journey is on.
Daniel Yavorovych, Arilot Co-Founder