This blog post accompanies a talk titled “Scaling Microservices in Go” presented at HighLoad++ on 31st October 2014 in Moscow, which followed Hailo’s journey from a monolithic architecture to a Go-based microservice platform. If you’d like to read more about Hailo’s approach to technology, check out more posts on their developer blog, where this post is also available.
Starting Small
Hailo, like many startups, started small; small enough that our offices were below deck on a boat in central London — the HMS President.
The HMS President by Roger Marks
Working on a boat as a small focused team, we built out our original apps and APIs using tried and tested technologies, including Java, PHP, MySQL and Redis, all running on Amazon’s EC2 platform. We built two PHP APIs (one for our customers, and one for our drivers) and a Java backend which did the heavy lifting: real-time position tracking and geospatial searching.
After we launched in London, and then Dublin, we expanded from one continent to two, and then three; launching first in North America, and then in Asia. This posed a number of challenges — the main one being locality of customer data.
At this point we were running our infrastructure in one AWS region; if a customer used our app in London and then flew to Ireland, no problem — their data was still close enough.
Our customers flying to Osaka, however, created a more challenging problem, as the latency from Japan to Europe is too high to support a realtime experience. To give our customers and drivers the experience we wanted, we needed to place their data closer to them, with lower latency. For our drivers this was simpler: as they usually only hold a taxi licence in one city, we could home them to the nearest Amazon region. But for customers we needed their data to be accessible from multiple locations around the world.
Going Global
To accomplish this we would need to make our customer facing data available simultaneously from our three data centres. Eric Brewer’s CAP theorem shows that it is impossible for a distributed system to simultaneously provide Consistency, Availability and Partition Tolerance guarantees, and that only two of these can be chosen. However, Partition Tolerance cannot be sacrificed, and as we wanted our service to optimise for availability as much as possible, the only option was to move to an eventually consistent data store. Some of our team had prior experience with Cassandra, and with its masterless architecture and excellent feature set, this was a logical choice for us. But, this wasn’t the only challenge we had — our APIs in particular were monolithic, and complex, so we couldn’t make a straight switch.
Additionally we wanted to launch quickly, so we couldn’t change everything in one go. But, as our drivers were based in a single city, we could take a short cut and leave our driver-facing API largely unchanged; cloning the infrastructure for each city we launched in, and deploying these to the region closest to the city. This allowed us to continue expanding, and defer refactoring this section until later.
As a result we refactored our customer-facing API, moving core functionality out into a number of stateless HTTP-based services which could run in all three regions, backed by Cassandra, and written in either PHP or Java.
This was a big step forward, as we could serve requests to customers with low latency, account for movement, and have an increased degree of fault tolerance — both benefitting from Cassandra’s masterless architecture, and having the ability to route traffic to alternative regions in the case of failures.
However, this was merely the first step on our journey.
“My God! It’s full of stars!”
Having dealt with one monolith, we had a long journey ahead to deal with the next. In the process we had improved the reliability and scalability of our customer-facing systems significantly, but there were a number of areas which were causing us problems:
- Our driver-facing infrastructure was still deployed on a per-city basis, so expansion to new cities was complex, slow, and expensive.
- The per-city architecture had some single points of failure. Individually these were very reliable, but they were slow to fail over or recover if there was a problem.
- Compounding this, we had a lack of automation, so infrastructure builds and failovers usually required manual intervention.
- Our services were larger than perhaps they should have been, and were often tightly coupled. Crucially, while on the surface they provided a rough area of functionality, they didn’t have clearly defined responsibilities. This meant that changes to features often required modifications to several components.
A good example of the final point is amending logic around our payment flow, which often required changes in both APIs, a PHP service and a Java service, with a correspondingly complex deployment.
Changing Gears
These difficulties made us realise we needed to radically shift the way we worked: to support the growth of our customer base and our engineering team, and to increase the speed of our product and feature development.
Working on a small number of large code bases meant that we had a lot of features in play at once, and this made scaling up our team difficult — communication, keeping track of branches, and testing these, took up progressively more time (due to Brooks’s Law).
Some of these problems could possibly have been solved with alternative development strategies such as continuous integration into trunk and flagging features on and off, but fundamentally having a small number of projects made scaling more difficult. Increasing team size meant more people working on the same project with a corresponding increase in communication overhead; and increasing traffic often meant the only option was to inefficiently scale whole applications, when only one small section needed to be scaled up.
Our first forays into a service oriented architecture had largely been a success, and based on these experiences, and those coming from other companies such as Netflix and Twitter, we wanted to continue down this path. But, with most of our developers not having experience of the JVM (which would have allowed us to use parts of the brilliant Netflix OSS), we would need to experiment.
A Cloudy Beginning
From the start there were a number of guiding principles we wanted to instil into our new platform. One of these was taking a Cloud Native approach, pioneered by Netflix. Adrian Cockcroft has spoken about the approach Netflix took, with their aim during this process being to:
“Construct a highly agile and highly available service from ephemeral and assumed broken components”
Infrastructure as a service obviously makes scaling much easier, and in Hailo’s case we use a wide range of Amazon’s products, allowing us to match customer demand and react quickly to changing markets, while keeping costs low.
However, regardless of hosting provider, any large distributed system will have components that are failing or degraded at any point in time. This is something we all continue to strive to fix — trying to achieve a utopia of perfect hardware systems, running perfect apps, connected by a perfect network. Unfortunately this isn’t really attainable, and instead we are stuck with a dystopia of buggy apps, running on hardware which often fails, or disappears. This isn’t necessarily a bad thing though — it forces us to face these problems head on, and design software which can benefit from a rapidly changing environment.
This concept of antifragility was popularised by Nassim Nicholas Taleb and is central to becoming cloud native. Most systems become weaker when under stress; however, by deliberately introducing stressors into our systems (or chaos, in Netflix’s case), we can identify these issues and design out these weaknesses, vastly improving the reliability of our service.
With these concepts in mind we tried out several different prototypes, and eventually settled on a service oriented architecture which supported both Go and Java services, communicating Protocol Buffer formatted messages over a RabbitMQ message bus. This allowed interesting routing patterns, such as routing to specific versions of a given service, or point-to-point messaging. Cassandra remained our primary data store, given that it was working so well and fitted in perfectly with our requirements.
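To make this concrete, here is a rough sketch (not Hailo’s actual platform code) of what publishing a Protocol Buffer formatted message over RabbitMQ looks like with the streadway/amqp client; the exchange name and routing key are invented for illustration.

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// Connect to the RabbitMQ message bus.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// In a real service the body would be a Protocol Buffer message,
	// marshalled with proto.Marshal() on a generated Go type.
	body := []byte("...protobuf-encoded request...")

	// Publish to a topic exchange; the routing key identifies the
	// destination service (both names here are hypothetical).
	err = ch.Publish(
		"platform",                    // exchange
		"com.hailocab.service.driver", // routing key
		false,                         // mandatory
		false,                         // immediate
		amqp.Publishing{
			ContentType: "application/octet-stream",
			Body:        body,
		},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```

Because delivery is driven by exchanges and routing keys rather than fixed addresses, the bus can implement the routing patterns mentioned above (version-specific delivery, point-to-point replies) purely by changing binding rules.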
So, why Go? Previously we had been using PHP and Java, and while we wanted to retain the ability to write Java on our platform, we were less keen to keep PHP. Go is a small language, and is easy to learn. Being compiled, it is mind-bogglingly fast (especially when you are moving from an interpreted language like PHP), and features such as its type system and fantastic interface support make writing code fast, giving us improvements in both development and compute time. Go also has excellent concurrency primitives — something of particular importance for us, as our infrastructure would require a lot of inter-service communication. Finally, it’s fun to write, and there is an amazing community! So not only did our developers enjoy writing new services on our platform, but we could recruit from an incredibly talented pool of developers in London and across Europe.
But, what even is a Service?
The definition of a service, microservice, macroservice, mediumservice or even a moderatelylargeservice seems to vary wildly depending on who you talk to. As we previously discussed in Hailo’s Webapps as microservices post, Martin Fowler defines the microservice architecture as such:
“The microservice architectural style is an approach to developing a single application as a suite of small services … these services are built around business capabilities and are independently deployable … there is a bare minimum of centralized management of these services, which may be written in different programming languages”
In our case we defined a service as a small, distinct unit of responsibility — something that did one job, and did it well. These ranged widely, from some very small services with maybe a hundred lines of logic (excluding libraries), to a few which were much larger, such as the system which tracks our drivers in realtime. Regardless, the guiding principle remained the same — that these services should each have clearly defined responsibilities.
Secondary to the responsibilities of the service was the interface it provided to the outside world. In our case we chose Protocol Buffers, which gave us well defined, strongly typed messages to be passed between our services. Each handler (or endpoint) on a service defined a Request and Response envelope message which it would accept and reply with. As Protobuf is extensible, these could be changed and added to during the lifetime of the service, while still supporting older versions of clients.
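The post doesn’t show a real message definition, but a hypothetical endpoint’s envelopes might look roughly like the simplified Go stand-ins below; in practice these would be generated from a .proto file, and the endpoint, fields and names here are invented.

```go
// Hypothetical, simplified stand-ins for the protobuf-generated Go types a
// "nearest-driver" endpoint might define; the shapes are illustrative only.
package nearestdriver

// Request is the envelope a client sends to the endpoint.
type Request struct {
	Lat float64 // pick-up latitude
	Lng float64 // pick-up longitude
}

// Response is the envelope the endpoint replies with. Because Protocol
// Buffers are extensible, new fields can be added later without breaking
// older clients, which simply ignore fields they do not recognise.
type Response struct {
	DriverIds []string // nearby drivers, closest first
	Etas      []int32  // estimated times of arrival, in seconds
}

// Handler implements the endpoint: it accepts a Request envelope and
// returns a Response envelope (the actual geospatial lookup is elided).
func Handler(req *Request) (*Response, error) {
	return &Response{}, nil
}
```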
Finally there was the internal implementation of the service. Arguably this was the least important part of the service, as most services were small, and could easily be rewritten or replaced if necessary. In comparison, changing the interface or responsibilities of the service would usually require changes to other services, with the corresponding cross-team communication overheads.
Developers, Developers, Developers
Having established what a service should look like, we now needed to provide tooling to make it extremely easy for developers to build services, so we could get on with solving real-world problems!
Services were based on our ‘platform-layer’ library, which abstracted away the message bus transport layer and delivered messages to and from registered message handlers within the service. This also provided the framework for inter-service RPC calls, service discovery, monitoring information, authentication and authorisation, and A/B testing.
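The platform-layer library itself isn’t shown in the post; the sketch below is only a guess at the kind of API such a library might expose to a service author, and every name in it is hypothetical.

```go
// Sketch of the kind of API a platform-layer-style library might expose.
// The post describes the capabilities (handler registration, RPC, service
// discovery) but not the interface, so everything here is hypothetical.
package example

// Service is a hypothetical handle provided by the platform library.
type Service interface {
	// RegisterHandler wires an endpoint name to a handler function; the
	// library owns the RabbitMQ transport and envelope encoding/decoding.
	RegisterHandler(endpoint string, h func(req []byte) ([]byte, error))

	// Call performs an inter-service RPC over the message bus.
	Call(service, endpoint string, req []byte) ([]byte, error)

	// Run connects to the bus, registers with service discovery, and
	// then delivers incoming messages to the registered handlers.
	Run() error
}

// Setup shows how a service author might use such a library: register a
// handler and, inside it, make an RPC call to another (hypothetical) service.
func Setup(svc Service) error {
	svc.RegisterHandler("list", func(req []byte) ([]byte, error) {
		return svc.Call("com.hailocab.service.driver", "nearby", req)
	})
	return svc.Run()
}
```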
In addition to this, our ‘service-layer’ libraries provided abstraction layers over most of the third-party services we ran internally, such as Cassandra, Memcache, Zookeeper and NSQ, adding convenience methods, host discovery, automatic configuration, and logic to refresh configuration dynamically as it changed.
To increase productivity we created templates for services, and added Hubot integration so we could quickly provision a new project with a single command.
Setting up Continuous Integration for a service was also as simple as asking Hubot. This used Janky to set up the project in Jenkins, and add post receive hooks to our Github repositories.
Finally, the end result of a successful build was a deployable artifact stored on Amazon’s S3. This could then be provisioned to any of our environments using our internal deployment tools.
Always be Shipping
Once developers had built their services, the next step was deploying them. Docker was around version 0.4 when we started working on our platform, and although it was tempting to try using it in production, we didn’t want to spend our time debugging issues until it was more stable. We also wanted something fairly lightweight, so we could swap it out in the future, and it had to be simple to use.
Go made this very easy, as the output of our build was a statically linked binary which could be uploaded to S3 and then downloaded and executed on any machine. Couple this with a command line tool for the platform, along with a snazzy web dashboard, and we had a deployment system.
Provisioning a service through either the provisioning dashboard or the platform command line communicates with our provisioning manager service (a scheduler which itself runs on the platform). This decides where the service instances should be run. A provisioning service daemon, which runs locally on every machine, polls this and identifies when it needs to run a new service. It then pulls the service binary down from Amazon’s S3 and manages execution of the binary during its lifetime.
As Docker stabilised we added it into our build chain, so that services could be run within containers. This process is almost identical, with the exception that we pull down a Docker image and request that the Docker daemon execute the container.
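As a hedged sketch of the poll, download and execute cycle described above (the manager endpoint, assignment format and paths are all invented, and the real daemon also supervises running processes and handles the Docker case), the daemon’s core loop might look something like this:

```go
// Rough sketch of a provisioning daemon's poll/download/execute loop.
// URLs, paths and the assignment format are hypothetical.
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
	"os/exec"
	"time"
)

// assignment describes a service this machine should run (hypothetical shape).
type assignment struct {
	Name      string `json:"name"`
	Version   string `json:"version"`
	BinaryURL string `json:"binary_url"` // S3 location of the built artifact
}

func main() {
	running := map[string]bool{}
	for {
		// Ask the provisioning manager what should be running on this host.
		resp, err := http.Get("http://provisioning-manager.internal/assignments?host=self")
		if err != nil {
			log.Println("poll failed:", err)
			time.Sleep(10 * time.Second)
			continue
		}

		var todo []assignment
		if err := json.NewDecoder(resp.Body).Decode(&todo); err != nil {
			log.Println("bad response:", err)
		}
		resp.Body.Close()

		for _, a := range todo {
			key := a.Name + "-" + a.Version
			if running[key] {
				continue // this version is already running
			}
			if err := fetchAndRun(a); err != nil {
				log.Println("failed to start", a.Name, err)
				continue
			}
			running[key] = true
		}
		time.Sleep(10 * time.Second)
	}
}

// fetchAndRun downloads the statically linked binary from S3 and executes it.
// A real daemon would also track the process and restart it on failure.
func fetchAndRun(a assignment) error {
	resp, err := http.Get(a.BinaryURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	path := "/opt/services/" + a.Name + "-" + a.Version
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		f.Close()
		return err
	}
	f.Close()

	return exec.Command(path).Start()
}
```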
Once a service starts running it automatically discovers and connects to RabbitMQ, and publishes its existence, registering with our service discovery system. The Binding Service then sets up the correct queue binding rules within RabbitMQ, connecting the delivery queues to this new service instance. This supports some advanced traffic routing features, including per-version weighting, so particular versions of a service can receive a weighted amount of traffic.
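The Binding Service achieves this weighting through RabbitMQ queue binding rules rather than application code, but the underlying idea of per-version weighting can be illustrated with a simple weighted random pick:

```go
// Illustration of per-version traffic weighting: given service versions and
// weights, choose one in proportion to its weight. This is only to show the
// concept; it is not how Hailo's Binding Service is implemented.
package example

import "math/rand"

// pickVersion returns a version chosen at random, in proportion to the
// weights supplied (e.g. {"20141031.1": 90, "20141031.2": 10} sends roughly
// 10% of traffic to the newer build).
func pickVersion(weights map[string]int) string {
	total := 0
	for _, w := range weights {
		total += w
	}
	if total == 0 {
		return ""
	}
	n := rand.Intn(total)
	for v, w := range weights {
		if n < w {
			return v
		}
		n -= w
	}
	return ""
}
```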
Dealing with Complexity
Decomposing our infrastructure into a large number of small, simple services, each with a single responsibility, has benefitted us hugely. It allows us to more easily understand each component and its behaviour, to scale both our software and teams far more easily, and to develop new functionality extremely quickly.
However, all this speed has drawbacks. At the time of writing Hailo have over 200 services in production, running across three continents, each with three availability zones. This is clearly a massive increase in moving parts, and as such, complexity. Rationalising the behaviour of each individual component may be simpler, but understanding the behaviour of the whole system, and ensuring correctness, is more difficult. Dealing with this complexity was our next challenge.
Rise of the Machines
Testing a complex system for correctness is clearly a good starting point, and like everyone else we built suites of unit tests and integration tests which tested functional behaviour. But if we wanted a high percentage of uptime, and a fault tolerant system, we needed to do a lot more.
Testing our systems under load, with failed or failing components, and with degradation, was the important next step, and we built an integration framework around this. Simulating cities with ‘robot’ customers booking ‘robot’ taxis for journeys allowed us to put significant load through our systems. Running tools similar to Netflix’s Simian Army during these tests, and ensuring correctness, identified a lot of issues, both in our code and in third party libraries; and fixing these massively increased the resilience of our systems.
Developing this system further, we began running the simulations against our production infrastructure, continuously, in a real-world ‘city’. This identified problems which would directly affect customers or drivers using our service, and did so extremely quickly; so much so that we now use it as one of our primary monitoring tools.
Monitoring
In addition to using our robot drivers and customers, we built more conventional monitoring systems. Each service automatically published a number of healthchecks via Pub/Sub over RabbitMQ, which were collected by our monitoring service. These included some built-in healthchecks which our platform-layer library added, so all services automatically included them, giving the service author health indicators right off the bat.
Built-in healthchecks included service-level system metrics, such as whether configuration had correctly loaded, or whether the service had enough capacity to serve its current request volume. In addition, our service-layer libraries registered healthchecks for the appropriate third-party services each service utilised (such as connection status to Cassandra), handlers were compared against their performance expectations, and custom healthchecks could be registered by the service’s author. This information was aggregated and displayed in our monitoring dashboard (seen below), which provided auto-generated dashboards for all services discovered by our service discovery mechanism.
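The healthcheck API isn’t shown in the post; purely as an illustration, registering and sampling custom healthchecks might look something like the hypothetical sketch below, with the publication over RabbitMQ left out.

```go
// Hypothetical sketch of healthcheck registration and collection. In the
// platform described above, built-in and service-layer checks are added
// automatically, and results are published to the monitoring service via
// Pub/Sub over RabbitMQ.
package example

import (
	"sync"
	"time"
)

// HealthCheck reports nil when healthy, or an error describing the problem.
type HealthCheck func() error

var (
	mu     sync.Mutex
	checks = map[string]HealthCheck{}
)

// Register adds a named healthcheck; a service author would call this for
// custom checks, alongside the ones the libraries add automatically.
func Register(name string, hc HealthCheck) {
	mu.Lock()
	defer mu.Unlock()
	checks[name] = hc
}

// Result is one healthcheck sample, ready to be published to the bus.
type Result struct {
	Name    string
	Healthy bool
	Error   string
	At      time.Time
}

// RunAll executes every registered check and returns the samples; a real
// implementation would publish these over RabbitMQ on an interval.
func RunAll() []Result {
	mu.Lock()
	defer mu.Unlock()

	var out []Result
	for name, hc := range checks {
		r := Result{Name: name, Healthy: true, At: time.Now()}
		if err := hc(); err != nil {
			r.Healthy = false
			r.Error = err.Error()
		}
		out = append(out, r)
	}
	return out
}
```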
We also take measuring everything very seriously at Hailo, and are all paid-up members of the Church of Graphs. Instrumenting timing data into Graphite via statsd is almost free, and we built this into all of our internal libraries. This means that all of our services have a huge amount of performance information available when necessary, and we can provide dashboards for all services automatically.
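To show just how cheap this instrumentation is: a timing sample is nothing more than a small UDP datagram in the statsd line protocol. The helper below is a simplified stand-in for what an internal library might wrap (the statsd host name is invented).

```go
// Minimal illustration of statsd timing instrumentation: one UDP datagram
// per sample, using the plain statsd line protocol. A real library would
// wrap this so every handler is timed automatically.
package main

import (
	"fmt"
	"net"
	"time"
)

// timing sends a statsd timing metric ("key:value|ms") over UDP. Errors are
// ignored deliberately: instrumentation must never take the service down,
// and UDP keeps it fire-and-forget.
func timing(conn net.Conn, key string, d time.Duration) {
	fmt.Fprintf(conn, "%s:%d|ms", key, d.Nanoseconds()/int64(time.Millisecond))
}

func main() {
	conn, err := net.Dial("udp", "statsd.internal:8125") // hypothetical host
	if err != nil {
		return
	}
	defer conn.Close()

	start := time.Now()
	// ... handle a request ...
	timing(conn, "example.handler.list.duration", time.Since(start))
}
```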

Observability
Finally, distributed tracing tools, such as Twitter’s Zipkin or Google’s Dapper, are invaluable for determining issues in production systems, as they enable the tracing of requests as they traverse disparate systems. Our tracing infrastructure was built into the RPC library fairly early on, and gives developers tracing for free, without any additional code. This was then augmented with a number of web applications which let developers dig into their tracing information.
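The tracing support inside the RPC library isn’t shown, but the core idea (every call carries a trace ID, and each hop records timing annotations against it) can be sketched roughly as follows; all of the names here are hypothetical.

```go
// Hypothetical sketch of trace propagation: each RPC carries trace metadata,
// and each hop emits annotations so the whole request can be reassembled
// later. This is not Hailo's RPC library.
package example

import (
	"crypto/rand"
	"encoding/hex"
	"time"
)

// Trace is the metadata carried on every inter-service message.
type Trace struct {
	TraceID  string // constant for the whole request
	ParentID string // the span that issued this call
	SpanID   string // this hop
}

// NewTrace starts a trace at the edge of the system (e.g. the API layer).
func NewTrace() Trace {
	return Trace{TraceID: newID(), SpanID: newID()}
}

// Child derives the trace metadata to attach to an outgoing RPC.
func (t Trace) Child() Trace {
	return Trace{TraceID: t.TraceID, ParentID: t.SpanID, SpanID: newID()}
}

// Annotation is one timing event; these would be published asynchronously to
// a trace collector and stitched together by the tracing web apps.
type Annotation struct {
	Trace   Trace
	Service string
	Event   string // e.g. "client send", "server receive"
	At      time.Time
}

func newID() string {
	b := make([]byte, 8)
	rand.Read(b) // error ignored in this sketch
	return hex.EncodeToString(b)
}
```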
A good example of this is the diagram below. This was taken from our production environment when we were debugging performance issues on a particular endpoint after new features had been added. Looking at the web sequence diagram we can see that we are calling a number of services, but some of these calls are happening sequentially — this is likely the cause of the performance issues.
Having investigated this, it turned out we could refactor the API endpoint to make a number of these calls in parallel, and aggregate the responses once they had returned.
This reduced the response time of the endpoint from 120ms to under 70ms, and overall from nearly 500ms when running through the previous PHP-based API.
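The endpoint itself isn’t shown, but the refactoring amounts to a standard Go fan-out: issue the downstream calls concurrently and aggregate the results once both return. A minimal sketch, with the fetch functions as hypothetical stand-ins for the RPC calls:

```go
// Fan-out/aggregate pattern: call downstream services concurrently instead
// of sequentially, then combine the results.
package example

import "sync"

type profile struct{ /* ... */ }
type orders struct{ /* ... */ }

// Hypothetical stand-ins for inter-service RPC calls.
func fetchProfile(customerID string) (profile, error) { return profile{}, nil }
func fetchOrders(customerID string) (orders, error)   { return orders{}, nil }

// customerSummary fans out to both services in parallel and waits for both
// responses before building the reply.
func customerSummary(customerID string) (profile, orders, error) {
	var (
		wg         sync.WaitGroup
		p          profile
		o          orders
		pErr, oErr error
	)

	wg.Add(2)
	go func() {
		defer wg.Done()
		p, pErr = fetchProfile(customerID)
	}()
	go func() {
		defer wg.Done()
		o, oErr = fetchOrders(customerID)
	}()
	wg.Wait()

	if pErr != nil {
		return profile{}, orders{}, pErr
	}
	if oErr != nil {
		return profile{}, orders{}, oErr
	}
	return p, o, nil
}
```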
In addition, we trace (but do not persist to storage) a percentage of internal inter-service requests. This allows us to aggregate performance and success information in memory using Richard Crowley’s go-metrics library, giving us 1, 5 and 15 minute rates and system health, which we can then visualise.
This diagram, taken from late 2014, illustrates the interactions between services running on Hailo’s platform.
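For reference, here is a minimal example (with invented metric names) of the kind of in-memory aggregation the go-metrics library provides: meters expose 1, 5 and 15 minute rates, and timers add latency statistics.

```go
// Minimal go-metrics example: a meter for success rates and a timer for
// call latency, aggregated in memory. Metric names are hypothetical.
package main

import (
	"fmt"
	"time"

	"github.com/rcrowley/go-metrics"
)

func main() {
	registry := metrics.NewRegistry()

	// Rate and latency for a (hypothetical) inter-service call.
	success := metrics.GetOrRegisterMeter("driver.nearby.success", registry)
	timer := metrics.GetOrRegisterTimer("driver.nearby.duration", registry)

	// Record a sample as if an RPC had just completed.
	start := time.Now()
	// ... make the call ...
	timer.UpdateSince(start)
	success.Mark(1)

	// The rates that feed the visualisation mentioned above.
	fmt.Printf("1m=%.2f 5m=%.2f 15m=%.2f mean=%v\n",
		success.Rate1(), success.Rate5(), success.Rate15(),
		time.Duration(timer.Mean()))
}
```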
Conclusions, aka TL;DR.
During the process of migrating to our new microservice platform we have completely changed the way we build software as a company, enabling us to become significantly more agile, and develop features much faster than before.
Building our platform up first, with a small, specific use case, allowed us to test it and gain valuable experience running our new systems in production. We could then expand the scope, gradually replacing areas of functionality and API endpoints with zero downtime; and by picking off specific use cases, and continuously shipping to production, we avoided the common pitfall of the never-ending rewrite.
Moving to a microservice architecture is not a silver bullet, and the increased complexity means there are a lot of areas which need to be carefully considered. However, we have found that there are a huge number of benefits which vastly outweigh any disadvantages.
Hailo’s infrastructure is decomposed into a large number of very simple pieces of software — each of which is independently deployed and monitored, and can easily be reasoned about. Tooling and automation simplify operational burdens, and by adopting a cloud native approach with antifragility as a core concept, we have significantly increased the availability of our service.
Crucially, with a well developed toolchain it’s extremely easy to create new services, which has led to emergent behaviour, with unexpected novel use cases and features. Developers are freed up to take features from inception to production in hours rather than weeks or months, which is completely game changing (the current record is 14 minutes to staging, and 25 minutes to production); and this ability allows experimentation — our Go-based websocket server Virtue is an example of a side project which would likely not have happened otherwise.
Tackling large projects, especially rewrites or replatforms, is always a daunting prospect, and we would never have succeeded without the involvement and support of everyone in the business, and for that we are all truly grateful.
— — — —
Further reading
- Microservices — Martin Fowler
- Cloud Native at Netflix — Adrian Cockcroft
- Building Products at Soundcloud — Parts 1, 2, 3
- Microservices, Not a free lunch — Benjamin Wootton
- Microservices: Next steps — Boyan Dimitrov
- Moving to Microservices — Boyan Dimitrov