The cloud computing landscape is a vast and heterogeneous collection of technologies, services and providers. Building elastic distributed systems is fun, but navigating such an environment can be quite challenging.
Sharing some do's and don'ts of cloud computing, Suman Kumari and Wamika Singh from ThoughtWorks recounted their experience migrating a monolithic application to the cloud at Codemotion Milan 2018.
An Italian manufacturing company tasked ThoughtWorks with re-engineering a multi-factory, global-scale application on an elastic infrastructure, delivering a proof of concept as soon as possible while keeping operational and maintenance costs low. The main objective of the project was to gain better insight into the production process by collecting, streaming and aggregating the data produced at each plant.
Another project requirement was to design a cloud-agnostic application, allowing the integration of different cloud providers if needed.
Kumari and Singh described their journey by talking about four main topics: the overall system infrastructure, the data streaming architecture, the data retrieval system they used and the DevOps procedures they adopted.
To run and deploy their services, ThoughtWorks went with the well-known, industry-acclaimed approach of containerisation. Containers are portable, safe and cost-effective, and have quickly become the de facto standard for cloud applications. In particular, ThoughtWorks decided to host their containers on a Kubernetes cluster to benefit from features such as auto-scaling, automated roll-outs and roll-backs, service discovery, etc.
The infrastructure was created with Terraform on AWS and the Kubernetes cluster was provisioned using Kops.
A few YAML files later, the cluster was up-and-running.
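One of those YAML files could be a Kubernetes deployment descriptor along these lines. This is a minimal sketch: the service name, image and port are hypothetical, not taken from the talk.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telemetry-service        # hypothetical service name
spec:
  replicas: 3                    # let Kubernetes keep three pods running
  selector:
    matchLabels:
      app: telemetry-service
  template:
    metadata:
      labels:
        app: telemetry-service
    spec:
      containers:
        - name: telemetry-service
          image: registry.example.com/telemetry-service:1.0  # hypothetical image
          ports:
            - containerPort: 8080
```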
Several services were evaluated for the data streaming infrastructure, notably Amazon SQS and Kinesis, before the team decided to go with Apache Kafka. Running a self-managed streaming platform rather than a hosted one kept operational costs low without sacrificing performance.
Kafka was deployed on the Kubernetes cluster using Confluent's official Docker images.
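A service publishing plant data to such a Kafka cluster could look like the following sketch, using the kafka-python client. The message format, topic name and field names are assumptions for illustration, not details from the talk.

```python
import json


def encode_reading(plant_id, sensor_id, value):
    """Serialise one plant sensor reading as a JSON-encoded Kafka message value."""
    return json.dumps({
        "plant_id": plant_id,
        "sensor_id": sensor_id,
        "value": value,
    }).encode("utf-8")


def publish_reading(producer, topic, plant_id, sensor_id, value):
    """Send a reading through an already-configured Kafka producer.

    `producer` is expected to expose the kafka-python KafkaProducer API,
    e.g. KafkaProducer(bootstrap_servers="kafka:9092") with a hypothetical
    broker address.
    """
    producer.send(topic, encode_reading(plant_id, sensor_id, value))
    producer.flush()  # block until the message is actually delivered
```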
For the querying service, ThoughtWorks went with Amazon Athena. Athena supports a variety of data formats, including CSV, Parquet and JSON. It is based on the Presto engine and requires no extra ETL steps, since it queries data stored directly in S3 buckets. As with many other serverless services, its infrastructure cost is low: users pay only for the queries they run.
The application, written in Python, queries Athena through the PyAthena interface library.
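Using PyAthena, such a query could be issued as in the sketch below. The table and column names, S3 staging directory and aggregation are hypothetical; only the PyAthena `connect`/cursor API is the library's actual interface.

```python
def hourly_average_query(table, sensor_id):
    """Build an Athena SQL query averaging a sensor's readings per hour.

    Table and column names are assumptions; in a real application the
    sensor_id should be validated rather than interpolated directly.
    """
    return (
        "SELECT date_trunc('hour', ts) AS hour, avg(value) AS avg_value "
        f"FROM {table} WHERE sensor_id = '{sensor_id}' "
        "GROUP BY 1 ORDER BY 1"
    )


def run_query(sql, staging_dir, region):
    """Execute the query through PyAthena (needs AWS credentials and network)."""
    from pyathena import connect  # pip install pyathena
    cursor = connect(s3_staging_dir=staging_dir, region_name=region).cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```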
Adopting a continuous integration and deployment model is almost mandatory when maintaining cloud applications, as it effectively improves the development team's productivity.
ThoughtWorks evaluated two hosted CI/CD solutions for their application, comparing Travis CI and CircleCI. The latter was ultimately chosen for its lower starting cost for enterprises.
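A minimal CircleCI 2.0 pipeline for a Python service deployed to Kubernetes might look like the following sketch. The build image, commands and manifest path are assumptions, not the project's actual configuration.

```yaml
version: 2
jobs:
  build:
    docker:
      - image: circleci/python:3.6     # build image is an assumption
    steps:
      - checkout
      - run: pip install -r requirements.txt
      - run: python -m pytest          # run the test suite on every push
  deploy:
    docker:
      - image: circleci/python:3.6
    steps:
      - checkout
      - run: kubectl apply -f k8s/     # simplified deploy step, assumed layout
workflows:
  version: 2
  build-and-deploy:
    jobs:
      - build
      - deploy:
          requires:
            - build                    # deploy only after a green build
```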
Although Athena was initially chosen to develop a proof of concept, it showed its limits when used as a frequently accessed service: Athena does not handle highly concurrent loads well, and since it is designed as a non-ETL service it does not cache data, which can be inefficient for certain workloads. ThoughtWorks ultimately decided to drop Athena in favour of Amazon RDS, developing a custom Kafka-to-RDS interface that performs some pre-processing before dumping the raw data into RDS. As the reader may expect, moving from many small Parquet files to a relational database brought a significant performance improvement.
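The pre-processing step in such a Kafka-to-RDS pipeline might, for instance, aggregate raw readings before insertion, as in this sketch. The message fields, table layout and averaging logic are assumptions; `conn` stands for any DB-API connection (e.g. psycopg2 against PostgreSQL on RDS).

```python
import json


def aggregate_batch(raw_messages):
    """Pre-process a batch of raw JSON Kafka messages before writing to RDS:
    group readings by (plant, sensor) and average their values.
    Field names are assumptions for illustration."""
    sums, counts = {}, {}
    for msg in raw_messages:
        reading = json.loads(msg)
        key = (reading["plant_id"], reading["sensor_id"])
        sums[key] = sums.get(key, 0.0) + reading["value"]
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def write_aggregates(conn, aggregates):
    """Insert aggregated readings into a relational table.

    The table name and columns are hypothetical; `conn` is any DB-API
    connection, such as one from psycopg2 pointed at an RDS instance.
    """
    with conn.cursor() as cur:
        for (plant, sensor), avg_value in aggregates.items():
            cur.execute(
                "INSERT INTO sensor_aggregates (plant_id, sensor_id, avg_value) "
                "VALUES (%s, %s, %s)",
                (plant, sensor, avg_value),
            )
    conn.commit()
```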
Once again, serverless services are great and powerful, but choosing the one that fits a specific use case takes both experience and thorough testing.