The data landscape: ELT, Event Streaming, and ESB

A sign reads "These woods open to all" at the entrance to a forest trail I hiked last weekend in Shoreview Park, Shoreview, WA.

Every company will deal with data, and if you're a SaaS company, the data is coming in from all sides like a strong storm front blowing in snow. You'll be buried in data, more data than you'll ever use.

How you use that data can vary. On the operations side, you're going to want to collect, instrument, and tag data to understand your performance through logs and metrics. Maybe you end up using Datadog (perhaps the spammiest company I've ever dealt with), Splunk, New Relic, or Prometheus. There your goals tend to be minimizing noise, building useful data visualizations, and creating actionable alerts.
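To make the instrument-and-tag part concrete, here's a minimal sketch using the official Python prometheus_client library. The metric names, labels, endpoint, and port are my own illustrative choices, not anything your stack prescribes.

```python
# A minimal sketch of app-side instrumentation with prometheus_client.
# Metric names, labels, and port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Tag (label) requests by endpoint and status so alerts and dashboards
# can be scoped to what's actionable instead of firing on global noise.
REQUESTS = Counter("app_requests_total", "Requests served", ["endpoint", "status"])
LATENCY = Histogram("app_request_seconds", "Request latency", ["endpoint"])

def handle_checkout():
    with LATENCY.labels(endpoint="/checkout").time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
        REQUESTS.labels(endpoint="/checkout", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_checkout()
```

The labels are what let you later slice noise out per endpoint rather than drowning in one global counter.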

On the analytics side, you'll want to collect and manage user data, both the activities users are engaging in and your core personalization data, to build useful reports. segment.io is a leader here, but there are literally hundreds of choices in this space. I usually recommend companies keep their own 30-day rolling log of their analytics data in an S3 bucket or some other file store, to make switching vendors or recovery less painful down the road.
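As a sketch of that rolling-log idea, here's roughly what it looks like with boto3: archive each event under a date-stamped key and let an S3 lifecycle rule expire anything older than 30 days. The bucket name and key prefix are hypothetical.

```python
# A minimal sketch of a 30-day rolling analytics archive in S3.
# Bucket name and key prefix are hypothetical.
import datetime
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-analytics-archive"  # hypothetical bucket

# One-time setup: S3 expires anything under analytics/ after 30 days,
# which is what makes the log "rolling" with no cron job on your side.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "30-day-rolling-analytics",
            "Status": "Enabled",
            "Filter": {"Prefix": "analytics/"},
            "Expiration": {"Days": 30},
        }]
    },
)

def archive_event(event: dict) -> None:
    """Write each analytics event under a date-stamped key."""
    now = datetime.datetime.now(datetime.timezone.utc)
    key = f"analytics/{now:%Y/%m/%d}/{now.timestamp()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode())
```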

But on the data processing side for the running of your application, that's where things can get interesting. Traditionally you're building either a monolithic application with internal calls, running off some sort of relational database with a fast key-value store in front for caching (à la Redis), or, if you're going down the microservice route, a set of services with internal and external APIs, often REST or GraphQL. If you're hyper-optimizing, you might also be generating static resources directly as partials on disk and loading them in with the same rules from a reverse proxy.
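For reference, that database-plus-Redis arrangement is usually the classic cache-aside pattern. Here's a minimal sketch with redis-py, where fetch_user_from_db is a hypothetical stand-in for your real SQL layer:

```python
# Cache-aside sketch: relational database as source of truth, Redis in
# front for speed. fetch_user_from_db is a hypothetical stand-in.
import json

import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_user_from_db(user_id: int) -> dict:
    # Stand-in for a real SELECT against your relational database.
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    cached = r.get(f"user:{user_id}")
    if cached is not None:
        return json.loads(cached)          # cache hit: skip the database
    user = fetch_user_from_db(user_id)     # cache miss: hit the database
    r.set(f"user:{user_id}", json.dumps(user), ex=300)  # keep for 5 minutes
    return user
```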

Where things get really interesting is when you start wanting to build integrations between your app and other tools, especially if those other tools don't have a strong API system themselves.

I spent some time researching a few approaches, and wanted to share what I found. This isn't a well-thought-out "here's exactly how to carry out x" article from me, but more of an overview of what's out there and maybe a way to get some creative juices flowing.

On the simplest side of the equation is Meltano. Meltano focuses on ELT (Extract, Load, Transform) operations and lets you define programmatically what data you want to extract from a source (like a relational database, a flat file, or other sources they support), load it into another system, and then transform it there to normalize it. It's designed to fit easily into a CI/CD workflow for writing and testing your transformations, and it's largely focused on doing large-scale batch operations.

I can see Meltano having been helpful in a past life where I was dealing with nightly account reconciliation data arriving as massive flat files from banking institutions. Having a way to express those operations, as well as build and test them in a more traditional CI/CD flow, would have made my life a lot easier. It's worth checking out if you have some large data sets you regularly need to do something with.
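Meltano itself is configured declaratively (you wire an extractor to a loader and run the pipeline from its CLI), so the sketch below isn't Meltano's API. It's plain Python showing the shape of that nightly reconciliation job, with the file path, table names, and CSV layout all hypothetical.

```python
# A plain-Python sketch of an extract-load-transform batch job, the kind
# of work Meltano manages declaratively. All names here are hypothetical.
import csv
import sqlite3

def extract(path: str):
    """Extract: read raw rows from the nightly flat file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def load(rows, db: sqlite3.Connection):
    """Load: land the raw rows untouched, so you can always re-transform."""
    db.execute("CREATE TABLE IF NOT EXISTS raw_txns (account TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_txns VALUES (:account, :amount)", list(rows))

def transform(db: sqlite3.Connection):
    """Transform: normalize after loading (the L-then-T in ELT)."""
    db.execute("DROP TABLE IF EXISTS txns")
    db.execute(
        "CREATE TABLE txns AS "
        "SELECT TRIM(account) AS account, CAST(amount AS REAL) AS amount "
        "FROM raw_txns"
    )

if __name__ == "__main__":
    with sqlite3.connect("reconciliation.db") as db:
        load(extract("nightly_recon.csv"), db)
        transform(db)
```

The point of landing raw data before transforming is that a bad transformation becomes a re-run, not a re-fetch from the bank.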

Next on my exploration train was Mulesoft, which used to call itself an Enterprise Service Bus but seems to have mostly shrugged off the label. I used it 7-8 years ago, before it was bought by Salesforce. I have to say, Salesforce has put a lot of effort into it, but they've also obviously tried to make it less accessible to individual developers, instead focusing on sales pages and driving you to book a call with a solutions partner to actually get the tooling set up. They also do their best to hide that Mulesoft is open source and free to download. Salesforce being Salesforce.

With a bit of digging you can find the developer documentation for the Anypoint platform. Mulesoft is a bit of an odd combination of abilities in that it lets you make your own API endpoints and then set up scripting for what those APIs will return. It can proxy other APIs internally while adding logging and access control, in which case it runs like a heavyweight Apigee or APIcast. Or it can generate APIs from scratch by pulling directly from data sources (like a relational database). There's also pub/sub and webhook support for more asynchronous operations.
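This isn't Mulesoft itself (Anypoint apps are built with its own tooling), but to make the proxy-with-logging-and-access-control idea concrete, here's the core of that pattern as a minimal Flask sketch; the upstream URL and API key are hypothetical.

```python
# Not Mulesoft -- a minimal Flask sketch of the core API-proxy pattern:
# front an upstream API, add logging, and enforce a crude access check.
import logging

import requests
from flask import Flask, Response, abort, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
UPSTREAM = "https://internal-api.example.com"  # hypothetical backend
API_KEYS = {"demo-key-123"}                    # hypothetical access control

@app.route("/api/<path:path>")
def proxy(path):
    if request.headers.get("X-Api-Key") not in API_KEYS:
        abort(403)  # access control in front of the real service
    upstream = requests.get(f"{UPSTREAM}/{path}", params=request.args)
    logging.info("proxied %s -> %s", path, upstream.status_code)  # logging
    return Response(upstream.content, status=upstream.status_code,
                    content_type=upstream.headers.get("Content-Type"))

if __name__ == "__main__":
    app.run(port=8080)
```

Mulesoft's pitch is essentially this pattern plus scripting, data-source connectors, and enterprise management layered on top.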

The main downside here is that it's a fairly heavy app. Salesforce seems to have only made it heavier, so it's a large system to support. The upside is that it's relatively well known and supported in the industry, and partly because of that it's often trusted: Mulesoft is heavily used inside government, finance, and healthcare. It's also unlikely to go away anytime soon, with a company like Salesforce supporting it and depending on it heavily for their own integrations.

I could definitely see Mulesoft being useful for connecting the many services a large enterprise runs. I could also see an org using it to stand up simple public APIs. Beyond that, though, it feels like too heavy a tool.

Last on the tools I reviewed was Kafka. I had the extreme pleasure of reviewing an animated picture book of how Kafka works, and I get the pleasure of sharing it with you now. Get a cup of tea and go here; even if you know what Kafka is, it's worth the time, and it does a good job of explaining event streams and processing. It's worth noting that Kafka is an underlying tool: you're going to need to build the services that emit and consume these event streams, including getting new data back to the user where it's needed. If you've used something like RabbitMQ in the past, this review of the differences will also be helpful, and may save you from using Kafka when a lighter tool would work as well with less maintenance upkeep.
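To show what those emitter and consumer services look like in practice, here's a minimal sketch using the kafka-python client, assuming a broker on localhost:9092; the topic and consumer-group names are hypothetical.

```python
# A minimal emitter/consumer pair around Kafka using kafka-python.
# Broker address, topic, and group names are illustrative assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

# Emitter: some service publishes events to a topic as they happen.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("user-signups", {"user_id": 42, "plan": "free"})
producer.flush()

# Consumer: a separate service reads the stream at its own pace.
consumer = KafkaConsumer(
    "user-signups",
    bootstrap_servers="localhost:9092",
    group_id="welcome-emailer",       # consumer groups share the work
    auto_offset_reset="earliest",     # replay from the start if new
    value_deserializer=lambda v: json.loads(v),
)
for message in consumer:
    print(f"send welcome email to user {message.value['user_id']}")
```

The decoupling is the point: the producer doesn't know or care how many consumers exist, which is exactly what makes "getting new data back to the user" a design problem you have to solve yourself.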

Ultimately, software design should always aim for the minimal set of tools that meets your expected needs, so you don't spend a lot of time supporting a large toolchain you're only minimally using. "We're only using 10% of x" is usually a good sign of a tool you should replace with something simpler. If you're already using Kafka for your analytics and other processing tasks, though, extending that use to asynchronous processing wouldn't be much of a leap.

This is a bit of a rambling collection of my thoughts from playing with these tools over the holiday here in the US. Hopefully you found it valuable, and feel free to share tools or uses you've found helpful in the comments.

Make room to be awesome,
://Robin

PS: I've moved my social presence over to ActivityPub, and am now on Mastodon at @[email protected]. Feel free to follow me from wherever you feel at home.