Thanks for taking the time to stop by! I try to post every few weeks on topics ranging from niche aspects of computer programming to organizational and business challenges. By all means subscribe to the RSS feed!



CribML, Part 1

As a parent of an infant, one thing you quickly learn is that they wriggle all the time, even when they are supposedly asleep. My little one likes to do laps around his crib when he’s fast asleep, regularly doing 360-degree turns as he goes. Like many parents, we have a baby monitor to keep an eye on him whilst we are in another room and he’s sleeping (or not, as the case may be). What’s annoying about these products is that they usually don’t tell you anything more than “there’s motion or noise!”, which makes the notifications tedious, so you end up essentially watching him on the monitor all the time anyway.

While on paternity leave I found this annoying, so I started working on this project: CribML. The idea is that we can have a system learn what the baby is doing, so the parents are only notified when there is some actionable behavior that needs assistance. To achieve this, I decided to give Apple’s relatively recent suite of ML tools a go instead of the more broadly used PyTorch / TensorFlow, as I’d be running this application on my phone and / or on my Mac laptop. Specifically, I’m using CreateML here for training and exploration of the model, which then gets exported into a model file that can be used with Apple’s CoreML, which I plan to drive from SwiftUI.

Having never used these tools (but being familiar with similar ones) I have to say, it was very easy to get up and running. In order to do training, I needed at least 50 examples, so I went about collecting frames from the historical camera footage and annotating them appropriately. For the annotation step, I used RectLabel Lite and just worked through the training set adding annotations. To make it easy to start, I decided to just tackle sleeping positions (front, back and side), with the intention of adding various awake behaviors later (e.g. scratching, awake, crying). The nature of babies means it’s far easier to find footage of them sleeping than of them unhappy or awake!
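To make that concrete, the annotations file that RectLabel Lite exports for CreateML looks something like the sketch below. The filenames, labels and coordinate values here are purely illustrative, and the layout is from memory - the x / y values refer to the centre of each bounding box in pixels - so treat it as a rough guide rather than a spec:

```json
[
  {
    "image": "nap-0001.jpg",
    "annotations": [
      {
        "label": "sleeping-back",
        "coordinates": { "x": 212, "y": 160, "width": 140, "height": 220 }
      }
    ]
  },
  {
    "image": "nap-0002.jpg",
    "annotations": [
      {
        "label": "sleeping-side",
        "coordinates": { "x": 198, "y": 171, "width": 150, "height": 205 }
      }
    ]
  }
]
```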

With CreateML you only need to throw all your images into a single folder, along with a JSON file describing the labels and the x/y positions of the bounding boxes (the format shown above), and you’re off to the races. RectLabel Lite can export the relevant JSON file for you from your annotated set. After doing this, you conduct training:

After approximately an hour (on a 2021 MacBook Pro M1) the model converged with relatively good accuracy. Here are a couple of output examples (child’s face redacted for privacy):

For 67 examples, this is pretty good, and as yet we’ve written no code to speak of. As it turns out, the network being used under the hood is YOLOv2, which is a pretty capable object detection network by default, so we were able to get to something functional quickly.

In the next part of this series I’ll be using CoreML to load and apply this model dynamically to content coming from the camera, wrapped up in a small application UI.


Thesis for the Future of Work

2020 was a rough year for many people and businesses globally; America is experiencing record levels of unemployment, and for many, the very nature of work during the pandemic has changed fundamentally. For those whose profession allows for it, the daily commute has been replaced with working from home, trying to be productive and deliver their work as best as possible. Compounded by the prolonged lockdowns, many people decided to relocate away from historical economic centers in order to seek living conditions more conducive to long-term working from home. For example, San Francisco saw a 26% drop in rental prices through 2020 - for a market that was one of the hottest in 2019, this sudden drop is unprecedented. Such changes have led many in the industry to ask whether these kinds of changes are here to stay, or whether this is simply a transient situation arising from the pandemic. In this post I’d like to present what I think might happen going forward in our industry, and discuss some of the lasting changes that the pandemic might have on the nature of work.

Everyone Will Work Remotely

In my view, all-remote-all-the-time is a fantasy. The notion that everyone will work from home is simply not feasible, for a variety of practical reasons (not least, high variance between individuals’ home-life situations). For example, electrical engineers often require a suite of specialized equipment on their workbench in order to complete their work. This is often expensive and intractable for staff to have at home; just think about equipment costs for the business and power usage for the homeowner!

Any company that was geographically distributed pre-COVID is familiar with the challenges of remote work across major timezones. For example, if you are based in San Francisco, working with a team out of Taipei, your effective collaboration time in a week is an hour every afternoon, four days a week (as it’s already tomorrow there!). Similar challenges exist between PST locales and Central Europe - except the overlap is an hour or two in the morning. Remote work isn’t hard because of COVID; it has always had challenges - it’s just that everyone got to experience those challenges first-hand in the largest workplace experiment ever conducted.

With this frame, I’d like to propose the following spectrum of remote work viability.

  • Remote Infeasible: This is probably obvious, but as mentioned above, roles that involve working with hardware (and other such constraints) are going to make remote infeasible in the nominal case. These positions will remain on-site, and industries that require these skills will continue to geographically cluster. Aerospace is probably the clearest example of this branch: you can’t build a plane in your kitchen!
  • Fully Remote: The opposite of Remote Infeasible is being entirely remote, 100% of the time. Software-only businesses are super amenable to this model, and for many people this has essentially been their life through lockdown.
  • Hybrid Remote: For some roles - even software roles - there can be benefits of being physically present and colocated with your colleagues. In a hybrid remote model, it seems plausible to imagine working a period remote, and a period in the office. For some workplaces it might make sense to have this on a weekly cadence, such as 3 days at home, 2 days at the office. For other workplaces, it might be less frequent: 3 weeks at home, 1 week at the office for example.

Remote Infeasible is the least interesting category for the purposes of the future of work (as it’s the status quo), so the following section explores what fully remote and hybrid remote mean for recruiting, geography and local markets.

How Remote is Remote?

It is my view that the future of work is, at the very least, partially remote for many industries and roles. This presents an interesting problem for growing businesses: what are your recruiting bounds? Historically, when recruiting for a given role it is common to find companies looking for a candidate within a specific locale, as the role is attached to a certain office. In future working models things change significantly: instead of being attached to a city, I believe that roles will be attached to a timezone. Consider the timezone distribution in the continental United States:

If you attach a role to, for example, Pacific Time (PST), then it would not matter whether you lived in Seattle or San Diego; likewise for Eastern Time (EST), New York or Miami. Companies that choose to operate timezone-based recruiting stand to gain significant benefits: they get access to hiring pools in multiple large metros. This could also be extended to Timezone +/- 1. For example, if you attached a role to PST +/- one timezone, then you also include everything west of Denver, CO - an absolutely massive hiring pool.

This approach has a few benefits (and drawbacks):

  • Flight durations within a timezone (and the +/- approach) are bounded at around two hours. This is short enough that people can travel to a particular metro for in-person gatherings as needed, with relative ease.

  • Distribution of employment opportunities will in the mid-to-long term cause a rebalancing of compensation and local economic norms across a wider area. In short, a redistribution of wealth at what will likely be a lower cost-basis to employers (it turns out offices, catering, heating etc are all large overheads).

  • Being geographically separate but timezone-aligned keeps day-to-day logistics feasible, and requires less fundamental change to how your business operates. As mentioned in the introduction, working internationally has always been challenging, and many company cultures won’t (can’t?) adapt to a fully written and asynchronous culture. Don’t put a square peg in a round hole.

  • The most glaring downside is around administration of benefit programs and taxation. With different systems in different states (different countries are even worse), I am aware this presents real challenges. With that said, there are firms that have been working on solving these problems for a number of years now, so I believe it will become entirely tenable at large scale in the very near future (if it’s not already).

If you are a person who fits into the hybrid remote or fully remote working buckets, ask yourself this: within your timezone (+/- one timezone), where would you love to live? What lifestyle would you want, and what is it worth to you? There are over 1.19 million square miles in the western states alone - our country is massive, with some incredible places to live. In my view, this makes the future of work super exciting and could lead to better lives for all our families.


Nelson integrates Kubernetes

I was thrilled earlier this week to receive a pull request from Target that added support for Kubernetes to Nelson - the open-source continuous delivery system. Whilst this support is a work in progress, it demonstrates several really important (and validating) aspects, which we will discuss in this article. Before we do that, however, a little bit of context:

In recent years the battle to become predominant (or even at all popular) within the cluster scheduling space has really heated up. Mesos, Nomad and Kubernetes are some of the more popular options, each bringing something slightly different to the table. For example, Mesos is at one end of the spectrum, providing a low-level toolkit for building custom two-phase schedulers. Kubernetes is at the other end of the spectrum, with a monolithic scheduler and many of the ancillary bells and whistles bundled right into the project (discovery, routing etc). This leaves Nomad somewhere in the middle between Mesos and Kubernetes, providing a kick-ass monolithic scheduler, but little in the way of prescriptive choices higher up the stack.

Whilst these systems all carry a very different set of trade-offs and operational experiences, they are often operated in a similar manner and all equally suffer from several distinct drawbacks:

  • Scheduling systems typically democratize access to compute resources within an organization, and increase development iteration velocity significantly. Such improvements are a boon for the organization as a whole, but they introduce a slew of additional complexities that are seldom considered ahead of time. One such complexity that is highly problematic is garbage collection, and the associated lifecycle management. Stated simply, if you previously deployed your monolithic application once a week but you are now deploying microservices 100 times a day, then you have 499 deployments a week that are simply wasting resources or serving customers with old/buggy code revisions. Engineering staff seldom spend time going back to figure out which unnecessary revisions need cleaning up - frankly it is not a good use of engineering time to have them doing that, especially when the robots can do a better job (more on this in the following section).

  • More often than not, operators of cluster schedulers end up with multiple distinct clusters. This is often an artifact of Conway’s law (very prevalent in large companies), but more broadly stems from historical operational thinking where implementors had hard separation between “environments” and look for an analog (with many operators not currently trusting micro-segmentation of the network, or application-layer TLS, alone). Another common cause of multiple distinct clusters is a desire for global distribution; having separate clusters for East Coast America versus West Coast America, for example. Whatever the cause, the result is the addition of swaths of incidental complexity: having many control planes hampers operational use cases when considering the organization at large. For example, how can an operator quickly assess, for a given application, which clusters or datacenter domains it is deployed in, and discern which of those are active? Often the answer is that this is not possible, or an operator will pull out some janky bash script to scrape the result from every available cluster sequentially.

  • Scheduling systems often provide a great deal of control over low-level runtime parameters, sandboxing configurations, networking, security and so forth. A powerful tool, to be sure. However, this power and flexibility comes with cyclomatic and cognitive complexity - is this a complexity cost that you wish every single developer or user of your cluster to pay? Typically this cost is too high, and instead we as operators look for the minimally powerful tools which we can distribute to a wider engineering organization. For example, in most organizations each and every developer is not deciding how they will manage ingress edge traffic, service-to-service traffic, or secure introduction (the act of provisioning credentials or secrets which should not be known by the majority of - or any - staff). These are typically defined by a central group, or a cross-functional set of staff who decide on these policies for everybody - often such structures are required to ensure compliance or governance - which results in everybody else simply copying these configurations into their projects verbatim. Over time this broadens the security and maintenance surface area significantly, rather than decreasing it, making evolution and improvements ever more difficult. For example, consider needing to update thousands of project repositories simply because the preferred TLS cipher list needs to account for another cipher being compromised.

Not only are these challenges not new, they are extremely widespread. At one point or another, any team operating a scheduling system will run into one or more of these problems. During my tenure running infrastructure engineering at Verizon, my group set about building a solution to these problems. That solution is Nelson.

Kubernetes Support

First and foremost I’d like to reiterate how awesome it is to be receiving community contributions for major features (just look how little code is needed). This is a testament to how easy Nelson is to extend, and to how its pure functional composition of algebras cleanly demarcates areas of functionality. From a more practical perspective I have a few goals with the Kubernetes support:

  • Nelson itself should be deployable either standalone, or via Kubernetes itself. This should be near-zero cost to make happen, but it is an explicit goal as there are users out there who want to “kubernetes everything”.
  • Vault support (and automatic policy management) should work just as it does for the Nomad-based Magnetar workflow. For the unfamiliar reader, this essentially means that Nelson generates a policy on the fly for use by the deployed pod(s), which at runtime determines what credentials are supplied to the runtime containers.
  • When using Kubernetes, Nelson will have its routing control plane disabled. Istio is already becoming the de facto routing system for Kubernetes, and as such we will simply make the Nelson workflow integrate with the Istio pilot APIs. The net effect here is that users of Nelson can still specify traffic-shifting policies, but they will be implemented via Istio at runtime.
  • Cleanup works exactly as-is for Kubernetes and is first-class, just like any other scheduler integration. Nelson’s graph pruning and logical lifecycle management systems will work across all scheduling domains Nelson is aware of (i.e. multiple datacenters, clusters etc).
  • The addition of a health-checking algebra to Nelson, such that we can remove the last hard dependency on Consul and provide a pluggable interface. Whilst a key tenet of Nelson is that it is not in the runtime hot path, the health checking (or delegation to some health-aware system) is required for Nelson to know if an application successfully warmed up and indicated it was ready to receive traffic. Without this, applications could fail and Nelson would erroneously report said application as “ready”.

Future work

Whilst we will make a concerted effort to make the initial Kubernetes support broadly functional and reliable, I’m certain there are going to be areas of friction given the much more prescriptive nature of the Nelson interface (which is constrained by design). Additionally, I would love to think that a single Kubernetes workflow will suffice, but in all probability there will be a variety of needs. If this becomes an intractable problem then the project could revisit earlier exploration around a mechanism to externalize workflow definitions (an eDSL for our internal workflow algebra). As such, I would really welcome feedback from users - or potential users - about these trade-offs. Striking the best balance between minimally powerful tools and sufficient flexibility is a perennial challenge in software engineering.

That’s about all for now. If you’re interested in learning more about Nelson, please visit the documentation or check out a talk I gave earlier in the year. If you prefer something more interactive, we have a Gitter channel that is relatively active.


Envoy with Nomad and Consul

The past couple of years of my professional life have been spent working in, on and around datacenter and platform infrastructure. This ranges from the mundane activities like log shipping, through to more exciting areas like cluster scheduling and dynamic traffic routing. It’s certainly fair to say that the ecosystem of scheduling, service mesh and component discovery - along with all the associated tools - has absolutely blossomed in the past few years, and it continues to do so at breakneck speed.

This pace of development can, in my opinion, largely be attributed to the desire to build, evolve and maintain increasingly large systems in parallel within a given organization. If we look back to perhaps the start of the last decade, monolithic applications were the norm: deploying your EJB EAR to your Tomcat application server was just how business got done. Applications were typically composed of multiple components from different teams and then bundled up together - the schedules, features and release processes were tightly coupled to the operational deployment details. In recent years, organizations have - overwhelmingly - moved to adopt processes and technologies that enable teams to produce services and projects in parallel over time; velocity of delivery can massively affect the time to market for a broader product, which in many domains has a very tangible value.

The layers in this new stack change the roles and responsibilities of system building quite significantly; consider the diagram below, outlining these elements, and annotated with their associated typical domain of responsibility.

Had I been diagramming this a decade ago, it would have been all yellow except for the engineering related to the specific business product (shown here in red). Instead, what we see here is an effective and practical commoditization of those intermediate components: operations staff are largely removed from the picture, freed up to solve hard problems elsewhere, whilst platform-minded engineering staff provide a consistent set of tools for the wider product engineering groups - everyone wins!

In this article, I’ll be covering three of the hottest projects that are helping usher in these organizational changes, enabling teams to ship faster and build larger systems out of small building blocks, often solving long-standing problems in infrastructure engineering:

  • Nomad
  • Consul
  • Envoy

The next few sub-sections review these tools at a high-level - feel free to skip these if you’re already familiar or don’t want the background.

Nomad

Nomad hit the scene in the middle of 2015, and for the most part has been quietly improving without the fanfare or marketing of other solutions in the scheduling space over the last two years. For the unfamiliar reader, Nomad is a scheduler that allows you to place tasks you want to run onto a computer cluster - that is, a selection of machines running the Nomad agent. Unlike other scheduling systems in the ecosystem you may be familiar with, Nomad is not a prescriptive PaaS, nor is it a low-level resource manager where you need to provide your own scheduler. Instead, Nomad provides a monolithic scheduler and resource manager (see the Large-scale cluster management at Google with Borg paper for a nice discussion on monolithic schedulers) which supports, out of the box, the handful of common use cases most users would want.

For the purpose of this blog post, the exact runtime setup of Nomad doesn’t really matter that much, but I highly encourage you to read the docs and play with it yourself. One feature I will point out, which I think is awesome: out-of-the-box integration with Vault. If you want dynamic generation of certificates and other credentials for your tasks, this is incredibly useful, and it’s nice to have a solid, automated story that your security team can actually be happy signing off on.

Consul

Once you start running more than one system on Nomad, those discrete systems will need a way to locate and call each other. This is where Consul comes in. Consul has been around since early 2014, and sees deployments at major companies all around the world. Consul offers several functional elements:

  • Service Catalog
  • DNS Provider
  • Key Value storage
  • Serf-based failure detector

Reportedly, there exist Consul meshes in the field that run into the tens of thousands of nodes! At this point the project is battle hardened and more than ready for production usage. The feature we’re most interested in for the purpose of this article is the service catalog, so that we can register deployed systems, and have some way to look them up later.

In order to look up services in the catalog, using DNS is a no-brainer for most systems, as DNS is as old as the internet and practically every application already supports it. Generally speaking, I’d recommend setting up your Consul cluster with a delegated subdomain for DNS, such that Consul “owns” a subdomain of your main company domain. This ends up being super convenient as you can reference any service with a simple DNS A-record (e.g. foo.service.yourdatacenter.yourcompany.com), which lets you integrate all manner of different systems - even systems that have no idea about Consul - with zero extra effort.

When you deploy a system with Nomad you have the option for it to be automatically registered with Consul. Typically, when your container exposes ports you wish to access at runtime, some re-mapping is required as - for example - two containers cannot both occupy port 8080 on a given host. In order to avoid port collisions, Nomad can automatically remap the ports for you so the ports bound on the host are dynamically allocated; for example, 65387 on the host maps to 8080 inside the container. This quickly becomes problematic because each and every container instance will have a different remapping depending on which Nomad worker it lands on. By having Nomad automatically register with Consul, you can easily look up all the instances for a service from the catalog. This works incredibly well because, as a caller, you don’t need any a-priori information about the IP:PORT combinations… it’s just a DNS query or HTTP request.
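As a minimal sketch of that catalog lookup, the snippet below (Python, standard library only) asks the local Consul agent’s health endpoint for the passing instances of a service and prints their dynamically allocated address:port pairs. The service name is a placeholder; everything else is the stock Consul HTTP API:

```python
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"   # local Consul agent
SERVICE = "my-web-service"         # placeholder service name


def healthy_instances(service):
    """Return (address, port) pairs for passing instances of `service`."""
    url = f"{CONSUL}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url) as resp:
        entries = json.load(resp)
    instances = []
    for entry in entries:
        svc = entry["Service"]
        # Nomad registers the dynamically remapped host port here, so callers
        # never need to know which port the process binds inside the container.
        address = svc["Address"] or entry["Node"]["Address"]
        instances.append((address, svc["Port"]))
    return instances


if __name__ == "__main__":
    for address, port in healthy_instances(SERVICE):
        print(f"{SERVICE} -> {address}:{port}")
```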

Envoy

In September 2016, Lyft open-sourced Envoy. On the face of it, Envoy may appear to be a competitor to something like Nginx - Envoy, however, does much more than a simple proxy. Envoy fills an emerging market segment known as the service mesh, or service fabric. Every inbound and outbound call your application makes - regardless of whether you run it containerized on a scheduler or on bare metal - is routed via an Envoy instance.

This might seem counter-intuitive - applications have traditionally handled everything related to making requests, retries, failures and so on… and this has largely worked. However, if the application itself is handling all the retry, monitoring, tracing and other infrastructure plumbing required to make a viable distributed system, then as an organization you have a tricky trade-off to make:

  1. Preclude a polyglot ecosystem because of the cost of re-implementing all that critical system insight in every language, or:
  2. Pay a high operational cost by having to support these intricate systems in many different languages, and have to retain staff expert enough in all these languages to solve problems over time.

Envoy alleviates this problem by providing a hardened sidecar application that handles retries, circuit breaking, monitoring and tracing - your applications just make “dumb” requests. You then have only one way to operationally deal with the ecosystem across your entire distributed system. Requests are retried, traffic can be shaped and shifted transparently to the caller, throttling can be put in place during an outage without modifying applications… the list goes on. Even organizations that have strived for homogeneity in their software ecosystem inevitably find that other technologies are going to creep in: are you going to build your marketing website using Rust and re-implement the world needed to render JavaScript to a browser? No. You’ll likely end up using node.js, or - god forbid - some PHP… shudder. But you get the point, dear reader: it’s inevitable even for those with the best of intentions, and in this frame Envoy quickly becomes attractive.

Usage

For clarity, I’m going to start out with the following reasonable assumptions so that we don’t have to waste time discussing them later:

  1. Consul and Nomad are clustered with a minimum of five nodes each. This allows you to conduct rolling upgrades without outages or split brains.

  2. You set up Consul using DNS forwarding so you can just blindly use Consul as your local DNS server without having to futz with /etc/resolv.conf or the like (which can get hairy in containerized setups).

  3. The Nomad agent and Consul agent run on every host. They have full, unadulterated network line of sight to their relevant servers on all ports (assuming a perimeter security model, without micro-segmentation of the network).

These designs are suggestions, and there are potentially awkward trade-offs with any design you choose to implement in a system. Before copying anything you see here, make sure you understand the trade-offs and security considerations.

The first thing to consider is which elements of the infrastructure you want to install on the underlying host: do you want to run monitoring from the host, or embed it? Do you want a per-host Envoy or an embedded one? Honestly, there are no slam-dunk solutions, and as mentioned, all come with their own particular flavor of downsides, so we’ll walk through both a host-based model and an embedded model for Envoy.

Clusters and Discovery

Envoy has the concept of “clusters”… essentially anything Envoy can route to. In order to “discover” these clusters, Envoy has several modes by which it can learn about the world. The most basic is a static configuration file, which requires you to know the cluster definitions in advance. This tends to work well for proxies, slow-moving external vendors and the like, but is a poor choice for a dynamic, scheduled environment. On the opposite end of the spectrum, Envoy supports HTTP APIs that it will reach out to periodically to perform several operations:

  1. Learn about all the available clusters - this is called CDS.
  2. Given a cluster name, resolve a set of IP addresses, with an optional set of zone weightings so Envoy can bias routing towards the most local providers first - this is called SDS.
  3. Fetch the route configuration, which can alter the way a certain cluster or host is handled for circuit breaking, traffic shifting or a variety of other conditions - this is called RDS.

Envoy provides these API hooks so that it is inherently non-specific about how discovery information is provided in a given environment. Envoy never announces itself - it is instead a passive observer of the world around it - and with a few lines of code we can provide the APIs needed to make a thin intermediate system that converts the Consul API into the Envoy discovery protocol.
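To give a feel for how thin that intermediate layer can be, here is a toy sketch (Python, standard library only) that answers Envoy’s original v1 REST SDS calls by proxying Consul’s health API. The response shape - `GET /v1/registration/<service>` returning a `hosts` array of `ip_address`/`port` records - is written from memory of the v1 protocol, so treat the field names as an approximation and check them against the Envoy docs for the version you run:

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CONSUL = "http://127.0.0.1:8500"   # local Consul agent


def consul_hosts(service):
    """Translate Consul health entries into SDS-style host records."""
    url = f"{CONSUL}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url) as resp:
        entries = json.load(resp)
    return [
        {
            "ip_address": e["Service"]["Address"] or e["Node"]["Address"],
            "port": e["Service"]["Port"],
        }
        for e in entries
    ]


class SdsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Envoy's v1 SDS asks for /v1/registration/<service-name>.
        prefix = "/v1/registration/"
        if not self.path.startswith(prefix):
            self.send_error(404)
            return
        service = self.path[len(prefix):]
        body = json.dumps({"hosts": consul_hosts(service)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SdsHandler).serve_forever()
```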

Regardless of whether you supply your own xDS implementation or use the off-the-shelf one provided by Lyft (be aware that there is a more principled gRPC protocol in the works with envoy-api), the design for how you’re going to run your containers with Envoy on Nomad is probably more interesting. The next few subsections consider the various alternative designs, with a short discussion of the pros and cons of each.

Embedded Envoy

The most obvious way to deploy Envoy would be to simply have it embedded inside your application container and run a so-called “fat container”, with multiple active processes spawned from a supervising process such as runit or supervisord.

Let us consider the upsides:

  • Lazy ease. This is the simplest approach to implement as it requires very little operations work. No special Nomad job specs etc… just bind some extra ports and you’re done.

  • SSL can be terminated inside the exact same container the application is running in, meaning traffic is secure all the way until the loopback interface.

  • Inbound and outbound clusters are typically known a priori (i.e. who Envoy will route to), so this could be configured ahead of time with a static configuration.

The downsides:

  • Upgrading across a large fleet of applications may take some time as you would have to get all users to upgrade independently. Whilst this probably isn’t a problem for many organizations, in exceedingly large teams this could be critical.

  • Application owners can potentially modify the Envoy configuration without any oversight from operations, making maintenance over time difficult if your fleet ends up with a variety of diverging configurations and ways of handling critical details like SSL termination.

There are a variety of reasons that many people do not favor running multi-process containers operationally, but nonetheless it is still common. This tends to be the easiest approach for users who are transitioning from a VM-based infrastructure.

Host-based Envoy

As Envoy can be considered a universal way of handling network operations in the cluster, it might be tempting to consider deploying Envoy on every host and then having containers route from their private address space to the host and out to the other nodes via a single “common” Envoy per-host.

The upsides here are:

  • Fewer moving parts operationally: you only have as many Envoy instances as you have hosts.

  • Potential for more connection re-use. If each host in the cluster has a single Envoy, and there’s more than a single application on each node, then there is a higher chance of SSL keep-alive and connection re-use, which - potentially - could reduce latency if your request profile is quite bursty, as you would not be constantly paying SSL session establishment costs.

The downsides are:

  • SSL termination is not happening next to the application. In an environment that is in any way non-private (perhaps even across teams within a large organization) it might be undesirable - or indeed, too risky, depending on your domain - to terminate the SSL of a request “early” and continue routing locally in plain text.

  • Potentially negative blast radius. If you run larger host machines for your Nomad workers, then they can each accommodate more work. In the event you lose the Envoy for a given host (perhaps it crashed, for example) then every application on that host loses its ability to fulfill requests. Depending on your application domain and hardware trade-offs, this might be acceptable, or it might be unthinkable.

  • Maintenance can be difficult. Patching such a critical and high-throughput part of the system without an outage or affecting traffic in any way is going to be exceedingly difficult. Unlike the Nomad worker, which can be taken offline at runtime and then updated, allowing it to pick up where it left off, Envoy has active connections to real callers.

Whilst I’ve never seen this design in the field, it is not dissimilar to how Kubernetes runs kube-proxy. If you have a completely trusted cluster, the security concerns could be put aside and, potentially, this design could work well as it is operationally simpler. It does however come with some unknowns, as Envoy expects to be told the node address and the cluster it logically represents at the time it boots.

Task Group Envoy

In Nomad parlance, the job specification defines two tasks to be spawned within the same task group: your application container, and an Envoy container. This pattern is often used with logging sidecars, but can happily be adapted for other purposes. In short, being in the same task group means Nomad will place those tasks on the same host, and then propagate some environment variables into each task about the location (TCP ports, as needed) of the other container. Some readers might draw a parallel here to the Kubernetes Pod concept.
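As a rough sketch of registering such a job through Nomad’s HTTP API (Python, standard library only): the image names are placeholders, and the JSON field names - including the `Leader` flag on the application task - are approximated from memory of Nomad’s job representation, so treat the Nomad documentation, not this snippet, as the authoritative reference:

```python
import json
import urllib.request

NOMAD = "http://127.0.0.1:4646"   # local Nomad agent

# Approximate shape of a two-task group: the application task is marked as the
# leader so the Envoy sidecar is torn down when the application stops.
job = {
    "Job": {
        "ID": "my-service",
        "Name": "my-service",
        "Type": "service",
        "Datacenters": ["dc1"],
        "TaskGroups": [
            {
                "Name": "my-service",
                "Count": 1,
                "Tasks": [
                    {
                        "Name": "app",
                        "Driver": "docker",
                        "Leader": True,  # companion tasks get cleaned up with this one
                        "Config": {"image": "example/my-service:1.2.3"},
                        "Resources": {"CPU": 500, "MemoryMB": 512},
                    },
                    {
                        "Name": "envoy",
                        "Driver": "docker",
                        "Config": {"image": "example/envoy-sidecar:latest"},
                        "Resources": {"CPU": 100, "MemoryMB": 128},
                    },
                ],
            }
        ],
    }
}

request = urllib.request.Request(
    f"{NOMAD}/v1/jobs",
    data=json.dumps(job).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(request) as resp:
    print(resp.read().decode())
```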

The upsides here are:

  • Global maintenance is easy. If you want to modify the location of your xDS systems, or the SSL configuration, you simply update the Envoy container and you’re done - without having to engage application development teams.

  • Mostly the same runtime properties as the embedded design.

The downsides are:

  • Operationally a little more complicated, as there are important details that must be paid attention to. For example, when submitting the task group, setting the application task as the “leader” process so that the companion containers get cleaned up is really important. Without this you will leak containers over time and not realize.

This is perhaps the most interesting design within the article, as it represents an interesting trade off between host-based and embedded deployment models. For many users this could work well.

Conclusions

In this article we’ve discussed how Envoy, Nomad and Consul can be used to deploy containerized applications. Frankly, I think this is one of the most exciting areas of infrastructure development available today. Being able to craft a solution out of generic pieces which are each awesome at just one thing goes to the very heart of the Unix philosophy.

Whilst the solutions covered in this article are not zero-cost (I don’t believe that solution will ever exist!), they do represent the ability to enable fast application development and evolution, whilst lowering overall operational expenditure by providing a converged infrastructure runtime. Moreover, the advent of broadly available commodity cloud computing has forced a refresh in the way we approach systems tooling; traditional methodologies and assumptions such as hardware failing infrequently no longer hold true. Applications need to be resilient, dynamically recovering from a variety of failures and error modes, and infrastructure systems must rapidly be improved to build platforms that development teams want to use: Nomad, Vault, Consul and Envoy represent - in my opinion - the building blocks for these kinds of improvements.

If you liked this article, but are perhaps interested in or committed to alternative components for the various items listed here, then consider these options:

Schedulers

Coordination

Routing

Thanks for reading. Please feel free to leave a comment below.


Frameworks are fundamentally broken

This post was originally written in 2012, and later revised in 2014. At the time, I refrained from posting it due to concerns about how certain topics were articulated, and how it might be received in the community. After accidentally publishing this in 2016, the positive feedback I received encouraged me to release it officially. The article is an opinion piece that I hope resonates with other functional programmers. Certainly, not everyone will agree with what’s written here, and that’s absolutely fine with me. All I ask is that you read the article for what it is, and receive it with the good intentions with which it was written.

I’ve been thinking about writing this post for a while - several years in fact - but it’s reached a point where I have to get this out of my head and onto the screen: Frameworks are the worst thing to happen to software development in the last decade (and possibly ever).

For the purpose of this article, I shall define a Software Framework as this: one or more pieces of software designed to work in tight unison, with the aim of smoothing / easing / hastening / or otherwise “improving” the development cycle of an application in a particular domain. The software in question is typically bundled together, and binary modules of the project are typically not used outside the intended framework usage or the framework itself. Examples of software frameworks include AngularJS, Play!, Ruby on Rails etc. At this point in my software engineering career, I have used a wide range of software frameworks, have been involved in writing more than one, and even wrote a book about Lift. With this frame, good reader, please appreciate that one does not come lightly to the decision to criticise frameworks as a programming paradigm. The following sub-sections outline what I see as the primary issues that make frameworks fundamentally flawed.

Lack of Powerful Abstraction

Business domains are often inherently complex, and this impacts the engineering that needs to take place to solve a problem within that business domain in a very fundamental way. In this regard, frameworks tend to be intrinsically limiting because they were written by another human being without your exact, complex business requirements in mind - you are programming inside someone else’s constraints and technical trade-offs. More often than not, those trade-offs are not documented explicitly or encoded formally, which means users encounter these limitations through trial-and-error usage in the field.

Many framework authors take the approach that they are solving a general problem in a given engineering sector (web development, messaging, etc), but typically they end up solving the problem(s) at hand in a monolithic way. Specifically, authors take an “outside in” approach to design, where they allow “plugin points” for users of the framework to write their own application logic… the canonical example here can be found in MVC-style web framework controllers. In all but the most trivial applications this is a totally broken approach, as one often observes users either writing all their domain logic directly in the controller (i.e. inside the framework constraints), or alternatively, parts of the domain logic or behaviour “leaking” into the controller. Whilst it could be argued that this is simply an education problem with users, I would disagree and argue that it takes a high degree of discipline from users to do the right thing… the easy thing is most certainly not the right thing. Instead of the root cause being an education issue, I would propose that a fundamental problem exists with the mindset of the frameworks themselves - which often encourage this kind of poor user behaviour - in short, frameworks do not compose. Frameworks make composition of system components difficult or impossible, and without composition of system components there can never be any truly powerful abstraction… which is absolutely required to build reasonable systems. To clarify, the lack of composability exists at both the micro and macro levels of a system; components should plug together like lego bricks, irrespective of which lego pack those bricks came from (here’s hoping you follow that tenuous analogy, good reader). Users don’t wish to extend some magic class and be beholden to some bullshit class hierarchy and all its effects; users wish to provide a function that is opaque to the caller, provided the types line up, naturally. When frameworks do not allow this, it’s a fundamental issue with the design of these software tools.

An obvious supposition might be that these kinds of monolithic, uncomposable designs occur because framework authors are trying to optimise for certain cases - more often than not, a case high on the list is making the framework “easy” to get started with. An interesting side-effect of this is that authors usually assume that users won’t know too much about what they are using, and that the code they write needs to be minimal. Whilst I’m all for writing less code, assuming that users won’t know how to use framework APIs only applies when the system is not based on any formal or well-known abstractions. The ramification of this lack of formalisation is twofold:

  1. An enhanced burden on the framework author(s), as the lack of formalisation requires them to “teach” the user how to do everything from scratch. In practice this means writing more documentation, more tests and examples, and more time spent on the community mailing lists helping users - ad infinitum.

  2. Users have to invest their time fairly aggressively in a technology without truly understanding it. This typically means getting up to speed with all the framework-specific terminology (e.g. “bean factory”, “interceptor”, “router”) and programming idioms. As an interesting side-note, I believe this aggressive investment without understanding is actually what gives rise to a lot of “fanboism” in technology communities at large: people get invested quickly and feel the need to evangelise to others simply because they invested so much time themselves, and subsequently need to ensure that the tool they selected gains critical mass and long-term viability / credibility… that is no doubt a subject for another article though.

Let’s consider for a moment what would happen if a framework component were implemented in terms of a formally known concept… For example, if one knows that a given component is a Functor, then one can immediately know how to reason about the operations and semantics of that component, because they are encoded formally as part of the functor laws. This immediately frees framework authors and users from the burdens listed in points one and two above. However, what if framework users don’t know what a Functor is, and are not familiar with the relevant laws? Well, there is no denying that learning many of the formal constructs will require effort on the part of users, but critically, what they learn is fundamentally useful when it comes to reasoning about problems in any domain. This is wildly more beneficial than learning how to operate in one particular solution space inside one particular framework. They will have learnt something fundamental about the nature of solving problems - something that will serve them well for the rest of their careers.
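To ground that a little: a Functor is just something you can map over, and the two functor laws pin down exactly how that map must behave. Here is a tiny sketch in Python (the `Box` type is invented purely for illustration) with the laws written as executable checks:

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Box(Generic[A]):
    """A trivially simple container that is a lawful Functor."""
    value: A

    def map(self, f: Callable[[A], B]) -> "Box[B]":
        return Box(f(self.value))


# Functor law 1 (identity): mapping the identity function changes nothing.
assert Box(42).map(lambda x: x) == Box(42)

# Functor law 2 (composition): mapping f then g equals mapping their composition.
f = lambda x: x + 1
g = lambda x: x * 2
assert Box(10).map(f).map(g) == Box(10).map(lambda x: g(f(x)))
```

The point is not the container itself, but that anyone who knows these two laws already knows how `map` must behave on any lawful Functor, without reading framework-specific documentation.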

Concepts such as Functor should not be scary. Many engineers in our industry suffer from a kind of dualism where theory and practice are somehow separate, and formal concepts like Functor, Monad and Applicative (to name a few) are often considered to “not work in practice”, while users of such academic vernacular are accused of being ivory tower elitists. Another possible explanation might be that engineers (regardless of their training, formal or otherwise) are simply unaware of the amazing things that have been discovered in computer science to date, and proceed to poorly reinvent the wheel many times over. In fact, I would wager that nearly every hard problem the majority of engineers will encounter in the field has had its generalised case be the subject of at least one study or paper… the tools we need already exist; it’s our job as good computer scientists to research our own field, edify ourselves on historical discoveries, and take best advantage of the work done by those who went before us.

Short-term Gain

All software is optimised for something; sometimes it’s raw performance, sometimes it’s type-checked APIs, and sometimes it’s other things entirely. Whatever your tools are optimised for, some trade-offs have been made to achieve said optimisation. Many, many frameworks include phrases like those listed below:

  • “Increased productivity”
  • “Get started quickly!”
  • “Includes everything you need!” for XYZ domain

These kinds of benefits usually indicate the software is optimised for short-term gain. Users are hooked by the initial experience of building “TODO” applications. More often than not, these users later become frustrated when they hit a wall where the framework cannot do exactly what the business needs, and they have to spend time wading through the framework internals to figure out a gnarly work-around to solve their particular problem.

The real irony here is that optimising for the initial experience is such a wildly huge failure: the majority of engineers will not spend their time writing new applications; rather, they will be maintaining existing applications and having to - in many cases - reason about code that was not written by them. On large software projects there are usually a myriad of technologies being employed to deliver the overall product, and having each and every software tool offer similar concepts with different names and annoying edge cases is frankly untenable. Once again, the lack of formalisation or composability causes havoc in many areas (lest we forget, taking the time to figure out work-arounds is usually painful and time-expensive).

Community fragmentation

The vast majority of frameworks have a particular coding or API style, or a set of conventions users need to know in order to produce something - disastrously, these conventions are often not enforced or encoded formally with types. Whilst these conventions are probably obvious to the authors of the framework, they make moving from one framework to another a total mind-fuck for users - essentially giving users (and ergo, companies) long-term vendor lock-in. Whilst vendor lock-in is clearly undesirable, there is another, more important aspect: frameworks create social islands in our programming communities. How many StackOverflow questions have you seen with a title along the lines of “what’s the Rails way to do XYZ operation?”, or “How does AngularJS do ABC internally?”. Software is written by people, for people, and it must always be considered in that frame. Fragmenting a given language community with various ways to achieve the same thing (with no formal laws) just creates groups with arbitrary divisions that really make no sense; these dividing lines usually end up being taste in API design, or familiarity with a given practice.

Whilst the argument could be made that branching, competing and later merging of software projects is beneficial, when it comes to the people and the soft elements related to a technical project, the mental fallout from the fork/compete/merge cycle is extremely heavy, and usually the merge never occurs (if it does, it usually takes years). Moreover, if a given framework community island fails, it’s incredibly hard on the engineers involved. I have both experienced this personally and witnessed it happening in multiple other communities - which is a worrying trend (again, plenty of material for a later post to lament about that).

Looking forward

It is imperative to understand that the need for composability in our software tools is an absolute requirement. If we as an industry have any hope of not repeating ourselves time and time again, we have to change our ways. In conclusion, dear reader, if you’re wondering what you can do to make the industry a better place going forward: study the past and read as many relevant academic papers as you can reasonably consume… be curious and continually ask questions. Demand lawful programs and excellent tools. Engage in software communities in a meaningful and positive way, and always look to improve the world around you :-)
