Two industry-wide trends landed squarely in data teams’ laps in 2023:
AI and LLMs: Teams are being encouraged by leadership to experiment with LLMs to find opportunities for both growth (new products) and efficiency (through automation and access).
AI and LLMs are squarely in a domain where the data team is expected to carry the torch.
The tech recession: staff were cut, and data teams were asked to do more with less, consolidate tools, and simplify their data stacks.
Data teams, which are often not revenue-generating, were cut disproportionately.
These trends reflect the feast and famine we saw across the broader tech and startup ecosystems earlier this year: AI startups were still raising $50 million (or more) pre-seed, pre-product rounds, while SaaS companies with millions in revenue were struggling to raise a Series A.
So data teams are at a crossroads: looking for best practices and efficiencies from the Modern Data Stack (the last wave of tools) just as a new tech cycle seems to be bearing down on us.
LLMs have had a huge impact on the efficiency of engineers, data scientists, and other technical staff. They can help engineers write unit tests, fill in boilerplate code, and get a great first draft of a SQL query. They are incredibly strong at synthesis and generation, which really levels up technical work.
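To make that last point concrete, here is a hedged illustration of the kind of first draft you might get back. The prompt, table, and column names are all hypothetical, and the query still needs human review before it runs against anything real.

```sql
-- Prompt to an LLM (hypothetical): "Which five customers spent the most in the last 90 days?"
-- A plausible first-draft response against a made-up orders table:
select
    customer_id,
    sum(amount) as total_spend
from orders
where order_date >= current_date - interval '90 days'
group by customer_id
order by total_spend desc
limit 5;
```

It’s a decent starting point, but someone still has to check the filter logic, the grain, and whether orders is even the right table, which is roughly where these tools sit today.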
Leaders, though, are demanding LLM / AI use cases beyond just technical staff. To meet these demands, we’re seeing a few broad strategies from companies trying to sell natural language interfaces:
New products are pitching LLM-centric, text-to-SQL approaches to analytics workflows and positioning themselves as AI-centered from the ground up. It’s clearly too early to write any of these off, but my gut is that these tools will end up having to build out fairly complete analytics solutions.
Integrated approaches: a chatbot that can query your existing BI and analytics tools, answer questions, and link you back to those solutions.
Existing analytics and BI tools are trying to meet the demand for AI-enabled workflows by building chat interfaces on top of their current products.
Hashboard built two NLP features this year. The first was an AI assistant / chatbot for analytics: the demo was pretty slick, and you could ask natural language questions and Hashboard would build charts and analyses. We didn’t see a huge amount of traction with the AI assistant and have since pivoted to a more structured data search; read our full take on the development process here.
The TLDR on the year is that cool demos of self-service analytics and text-to-SQL for business users have not yet translated into measurable impact on how the broader organization consumes data. The CEO isn’t doing novel data science with a chatbot yet.
In 2020 and 2021 we got a flood of new tools with the rise of the cloud data warehouse and the “Modern Data Stack”. We could store infinite data in Snowflake. So… we probably did store too much data in Snowflake.
Tools came out to take advantage of cloud-based warehouses.
These tools (Fivetran et al) allowed you to move raw data right into storage that was fast to query.
You now had lots of raw data (and could get more). Cleaning up data became a lot easier, so we did much more of it.
dbt allowed us to scale and compose our data cleaning (a quick sketch of the pattern follows below).
So why change anything upstream when you could get all this raw data and just clean it later?
So we did more cleaning, more transferring of data, and more cleaning.
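Here is a rough sketch of that pattern, with made-up source and column names: the raw table lands untouched via the loader, and “cleaning” is just another SQL model that everything downstream can build on.

```sql
-- A minimal dbt-style staging model: clean the raw data after it lands.
-- The source and columns here are illustrative, not from any real project.
with source as (
    select * from {{ source('raw_shop', 'orders') }}
),
cleaned as (
    select
        cast(id as bigint)            as order_id,
        lower(trim(customer_email))   as customer_email,
        cast(amount as numeric(12,2)) as amount,
        cast(created_at as timestamp) as created_at
    from source
    where id is not null
)
select * from cleaned
```

Each model like this is cheap to add, which is exactly how projects quietly grow into the sprawling DAGs described below.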
All of this was exciting, because it allowed analysts to take on more scope: they could help build products, automate workflows, and take on an increasingly important role in their organizations.
Eventually, though, we ended up with DAGs with thousands or tens of thousands of nodes.
I empathize with where we ended up. I worked at Flatiron Health from 2013 to 2019 and we went through a similar experience. We created a SQL-oriented pipelining tool that was accessible to data scientists in the organization. That DAG and data pipeline scaled to three hundred software engineers, data engineers, and data scientists contributing to a single codebase.
An accessible, SQL-native DAG-building tool turned out to be incredibly delightful for small data projects and incredibly painful for scaled, multi-team projects.
We are victims of all of the scope that we took on. It was exciting to take on more scope, but now we’re left wondering how to make it less painful to maintain. The data world is getting too complex. dbt projects are too complex. And some of the efficiency gains of the modern data stack are being eroded by the increased complexity.
We’re still figuring out how to scale these tools, potentially with less staff. For the first time, we don’t really need a new constellation of tools; we need to simplify our approaches to the ones we already have.
I was excited that dbt Labs seemed to acknowledge this core challenge at Coalesce. ‘dbt mesh’ is an architecture and a set of best practices with two critical features (sketched briefly below):
It allows you to access data across projects, so you can encapsulate logic the way you would in a software module.
It allows you to mark models as public or private.
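As a rough sketch of what those two features look like in practice (project, model, and column names here are all made up, and, as noted below, cross-project references currently resolve only in dbt Cloud): the upstream team marks a model public, and a downstream project references it with dbt’s two-argument ref.

```sql
-- Upstream, the owning team marks dim_customers as `access: public` in its
-- model YAML; everything else in that project stays private to it.

-- Downstream, another project builds on the public model without touching
-- the upstream internals:
select
    c.customer_id,
    c.customer_segment,
    sum(i.invoice_amount) as lifetime_revenue
from {{ ref('core_analytics', 'dim_customers') }} as c   -- cross-project ref: (project, model)
join {{ ref('stg_invoices') }} as i
    on i.customer_id = c.customer_id
group by 1, 2
```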
I actually think these are critically important improvements. One big downside is that you can only really implement dbt mesh in dbt Cloud. This seems a bit tricky to me: it solves a critical, core challenge of dbt as a technology, but it’s only available in a subset of runtimes. In the long term I think this might actually open dbt up to competition, since the idea is good enough to copy and open source.
All of this to say: we adopted a bunch of tools and were in awe of our new power to suck up tons of data, store infinite amounts of it, and clean up anything anyone had done upstream.
Some old ideas could help us manage the rising complexity. Data Contracts, Metrics Layers, and Data Observability could have appeared on this trends list in any of the past few years, but they now seem to be getting real traction, with varying levels of success. I’ll run through a quick take on each and the progress I’ve seen.
Metrics layers: Semantic layers as a separate product category still generate a lot of chatter, but I personally have not seen them meaningfully adopted in the wild: their adoption for analytics and BI workflows is still pretty close to zero. AtScale might be the notable exception, with real enterprise adoption, but it hasn’t been widely picked up by modern teams.
There is more effort and energy than ever towards these tools. Cube and the dbt semantic layer are certainly promising.
Data Observability: Data observability has seen the most widespread adoption of the three “old trends” on this list. It is the first step toward helping data teams monitor, test, and improve their data quality over time. Tools like Metaplane, Datafold, and Great Expectations have seen increasing usage as teams look to add basic quality checks and monitoring.
Data Contracts: Data contracts have emerged as a practical mechanism for ensuring quality and consistency in data exchanges between different systems and teams. These contracts function like agreements on data structure, format, and quality, providing a clear framework for data production and consumption. I definitely see teams starting to adopt data contracts, so it’s further along than the Metrics Layer. The current challenges are the typical ones of operationalizing CI systems. If the data team just adds some data contract validations to the application team’s CI process and blocks application builds, the business probably won’t tolerate the slowdown in product velocity. The data team is looking to push their failures upstream, to “shift left”, but this only works if the data team is on call and quickly unblocks the application teams from merging and pushing the product forward.
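For a sense of what the mechanical part looks like, here is a minimal, hypothetical contract check that could run in an application team’s CI. The table, columns, and types are made up, and the information_schema query is dialect-dependent; any rows returned mean the schema has drifted from the agreement, and the build fails.

```sql
-- Hypothetical contract for an orders table, expressed as a drift check.
-- Returns zero rows when the actual schema matches the agreed one.
with expected (column_name, data_type) as (
    values
        ('order_id',    'bigint'),
        ('customer_id', 'bigint'),
        ('amount',      'numeric'),
        ('created_at',  'timestamp')
),
actual as (
    select column_name, data_type
    from information_schema.columns
    where table_name = 'orders'
)
select
    e.column_name,
    e.data_type as expected_type,
    a.data_type as actual_type
from expected e
left join actual a on a.column_name = e.column_name
where a.column_name is null                     -- column missing entirely
   or a.data_type is distinct from e.data_type  -- column present but wrong type
```

The harder part is the organizational piece described above: someone has to be on call to fix or relax the contract quickly when it blocks a legitimate application change.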
One of the bigger trends I’ve seen in 2023 is a bit of a vibe shift: the conversation is moving away from Twitter/X and data community Slack channels. During the rise of the “Modern Data Stack” there was a coherent set of debates and conversations happening around data and data infrastructure. There was a set of new and up-and-coming tools, and they were getting debated on Twitter and in those Slack channels.
Twitter is not filled with eager new data scientists anymore. Maybe they are burned out by the challenges of scaling our old tools.
We used to get excited about new tools and releases, and it seems a lot of the excitement and energy around the data space has quieted over the past year.
Data doesn’t always have to be complicated.
People were pretty amped about DuckDB this year, amped in the way they used to be about new tools in the data space: the way we used to be amped about Hex or dbt or the newest, fastest cloud data warehouse. It was honestly refreshing to see some giddy excitement in the data space.
I think the excitement about DuckDB is really an excitement about additional constraints: don’t move all your data around, move only the data you need. You don’t need infinite data; maybe you actually just need one node to process it.
I’m not saying DuckDB is always the right tool for analytics workloads... But I do think that’s what data teams were excited about: an opinionated tool that rolled back to the constraints of a previous generation of tools.
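Part of the appeal is how little ceremony that workflow involves. A small sketch (the file path and columns are hypothetical): point DuckDB at local Parquet files and query them in place, with no warehouse, loader, or pipeline in between.

```sql
-- Query local Parquet exports directly with DuckDB: one node, only the data you need.
select
    event_type,
    count(*)                as events,
    count(distinct user_id) as users
from 'exports/events_*.parquet'
where event_date >= date '2023-01-01'
group by event_type
order by events desc;
```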
Data used to have immense inertia and constraints: it was hard to move data, once you got it somewhere it had to stay there and you were scared of deleting it. We were constantly battling inertia.
This inertia had some benefits though: we had to be thoughtful every time we moved data. We had to design and think about how it would be consumed. We would sketch and design data models. We had to think about access patterns.
As the inertia went away and the marginal cost of restructuring, re-modeling, and copying data went to zero, the obvious happened: we didn’t have to be as thoughtful.
I think the data analytics industry broadly is looking for more constraints. Not just tips and best-practices for smart ways to model data, but modeling constraints that don’t let you make as many mistakes.
This is broadly the philosophy of Hashboard: be opinionated and be constrained.
Looking forward, I’m excited for us to spend more time on what we used to spend time on: careful curation, governance and education. Deleting and retiring as much as you create.
The next few years will be about bespoke data instead of letting the entire organization freely surf the entirety of the data lake.