
Hashboard blog

Keep up to date with what's new

BI demo in a box with DuckDB and dbt


I can go into the Figma Community, look at a set of icons or a design system, and feel confident that I can import it into my own project and start playing around with it. Design assets are portable and easily pluggable into different environments. Same with software packages: I see a live demo of a Python package and I'm pretty sure I can pip install the package and use it myself.

Business Intelligence tools give me the exact opposite vibe. When I see a pretty demo on a marketing website, I have a sneaking suspicion I can't use any of it or couldn't get it working myself. Looker has a whole library of Looker Blocks, but you need to bring your own data to play with the code. Which, if I'm curious about learning about HubSpot data but don't want to connect my own data, isn't particularly interesting or useful. If you want evidence that BI tools aren't particularly shareable or reproducible across organizations, just look at their public examples: they are very hard to reproduce!

BI is hard to make reproducible

Business Intelligence demos are hard to make reproducible because they live at the end of a long pipeline of tools and processes: extraction and loading tools, transformation tools, and a warehouse. Historically, BI tools have always carried a ton of external dependencies. If you want to try them out, you need data and you need a database to store that data.

Using DuckDB for the Pizza Bytes retail demo

I'm excited about DuckDB because it reduces your dependence on external tools and resources, which makes it great for small prototypes and demos. When I started developing the Pizza Bytes demo for the launch of Hashboard (fka Glean), I wanted to build example code people could easily pick up and reproduce in their own projects. To make that possible, I pulled inspiration from the rewritten and improved Jaffle Shop, which is self-contained and follows the MDS-in-a-box pattern: parquet files serve as intermediates instead of a data warehouse.

MDS BI demo in a Box

The Modern Data Stack in a Box introduces a single-node analytics approach that uses DuckDB, Meltano and dbt to run a whole data processing pipeline on a single machine. This is really useful for testing… and demos! In contrast with the complexities often associated with scaling data pipelines, this method emphasizes creating a complete analytics environment locally. A typical Modern Data Stack (MDS) incorporates various tools to handle the full data lifecycle. MDS-in-a-box promises developers simplified operations, reduced costs and easy deployment options when you are running tests or other experiments. The stack is especially useful for open-source contributors and SaaS vendors, as it provides a contained example that's easy to share and replicate. This is a great way to experiment, build demos and prototype solutions. It's still TBD whether DuckDB is useful for production analytical workloads, but I'm excited to see MotherDuck and other entrants try to pull it off!

Running the Pizza Bytes demo (in-a-box)

Pizza Bytes follows this same paradigm: use a single machine to run all the parts of your data processing pipeline, with parquet files as intermediates and no external dependencies. Here are the steps to building our Pizza Bytes demo on a single node. Note that the makefile will automatically run the extract, transform and dashboard creation scripts listed below.

Step-by-step instructions (instead of the make script)
1. Extract and transform synthetic data

Data is generated synthetically with Python scripts and written to parquet files. The scripts were mostly written with ChatGPT and generate synthetic trends in the data.

2. Transform data with dbt

Just run dbt: dbt will operate on the parquet files directly, and the output will also be parquet files. This is enabled in the profiles.yml file, which tells dbt to output parquet files (see the sketch at the end of this post). The example also maps our Python-generated parquet files as a dbt source in _generated_sources.yml, so that dbt models can reference the parquet files.

3. Deploy BI with Hashboard

The metadata to produce Hashboard models is embedded right alongside the dbt models in this example, but it's also possible to separate this config into separate config files. We also have a few other resources (like a color palette) configured in the project. When a Hashboard model is attached to a dbt model, it looks for the corresponding "tables" in the data warehouse attached to the dbt models. Since there isn't a data warehouse, Hashboard expects to see appropriately named parquet files. We'll upload the parquet files that are generated in the dbt pipeline as a first step to make sure the deployment works properly. The magic that makes this work is configured in the dbt_project.yml.

Note that you could build this whole demo without dbt as well. You would just reference the parquet files explicitly in the Hashboard configuration spec, and you could even embed the logic and SQL queries in the Hashboard models themselves. See the Hashboard CLI quickstart to get started with or without dbt.

The promise of Dashboard Templates

Templates (like Looker Blocks) that could be used across orgs come with their own set of challenges. Most pronounced is the issue of creating unified standards around terminology, semantics and definitions across different organizations. What is "churn"? What is a "New User"? What is an "Antineoplastic drug"? We're now seeing some standard models for pulling in data: for example, Fivetran manages a wealth of standard dbt pipelines for processing data from sources like Stripe or ERP systems. This process is straightforward because the inputs are relatively well constrained for source systems like Stripe. Things get harder when you move past the source layer and dive into the semantics of an organization. It's not just about handling Stripe data; it's about synthesizing Stripe data in a way that works with your particular business model and the specific way that your org defines and structures data. Because it's still so difficult to share data across organizations, we aren't yet seeing the deep level of collaboration that can solve these common semantic problems. My hope is that our synthetic example data is a modest starting point for sharing examples between organizations and building mutual knowledge.
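For reference, here is a minimal sketch of what the configuration mentioned in step 2 can look like with the dbt-duckdb adapter. This illustrates the pattern rather than reproducing the exact files from the Pizza Bytes repo; the profile name, paths and table names are placeholders.

```yaml
# profiles.yml (sketch): dbt runs against an in-process DuckDB database
pizza_bytes:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: pizza_bytes.duckdb
      external_root: "./data/out"   # where materialized parquet files are written

# dbt_project.yml (excerpt, sketch): materialize models as external parquet files
models:
  pizza_bytes:
    +materialized: external

# _generated_sources.yml (sketch): map the Python-generated parquet files as a
# dbt source so models can select from {{ source('generated', 'orders') }}
version: 2
sources:
  - name: generated
    meta:
      external_location: "data/generated/{name}.parquet"
    tables:
      - name: orders
      - name: stores
```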

Carlos Aguilar
September 20th, 2023
Introducing Hashboard!


When we launched in March of 2022 under the name Glean.io, we were a small team on a mission to simplify data exploration. Over the past year and a half our team has doubled in size, shipped a lot of product and learned more than we could have imagined from our customers. Today, we're excited to announce a new name, a new brand, and the start of a new chapter. As Hashboard, we're doubling down on our commitment to simplify data exploration and help teams everywhere fall back in love with data. While we're thrilled about our new identity, there's more to celebrate: our rebrand also coincides with the launch of our most substantial feature releases to date! 🚀

Introducing Hashboard v1.0

Today we're launching:
- Quick explore, to make data more accessible and explorable
- A metrics page, to keep your whole team on goal and aligned
- Model join, to link related models and build better metrics
- A new explore experience to quickly navigate measures, attributes and joined models
- Also check out our dbt integration that we launched recently

Hashboard is an opinionated business intelligence tool that aims to redefine data exploration and bring teams closer to the reality of self-service. Hashboard helps organizations love data again by simplifying data exploration and enabling teams to publish data in a consistent, high-quality and beautiful way. With Hashboard you can define metrics on your data warehouse, allowing your entire organization to effortlessly search, explore and uncover insights that propel business growth.

Who Hashboard is for

Hashboard is for people that are both data and product oriented: folks that are focused not just on building dashboards and charts, but on sharing knowledge and understanding. Check out our approach here. Data engineers have the flexibility to manage all of their resources in code and version control. Non-technical people are able to easily explore data via a friendly UI with visualization best practices baked in. Executives can check our metrics page to see how different KPIs are trending towards goals in one place, or get automatic reports straight to their email or Slack.

Why our customers love us

We've had the chance to work with amazing users over the last couple of years! They've taught us so much and we've really enjoyed building together to better solve their data challenges. The themes we've heard from our early customers on why they love Hashboard have remained consistent. They love the product because of:

Simplified Exploration: Hashboard has an intuitive UI that empowers all types of users (technical and non-technical) to effortlessly explore data. This ease of use has been key for teams to foster a culture of curiosity and work towards higher adoption of self-service.

Faster Insights: Hashboard has intelligent defaults, automatic profiling and hyperspeed queries, which make it easier for folks to discover insights in their data. This has helped teams respond faster to requests and last-minute asks and has been essential to prevent bottlenecks previously caused by the data team's capacity.

Deeper Collaboration: Hashboard functions as a single semantic layer with standardized metrics. Finally, teams can have a unified source of truth for their data! Teams have been able to collaborate across departments that were typically siloed from data access, empowering more folks to truly become data-driven and better aligned with org-wide goals.
Code-Driven Flexibility: Hashboard empowers data people to maintain control over mission-critical analytics via the Command Line Interface (CLI), Git integration, and seamless connection to dbt. These integrations supercharge data teams' workflows, ensuring that breaking changes are identified and addressed before they impact production.

With these significant investments in our product and our team, it was the right time for us to refresh our brand identity. This new chapter of Hashboard brings an intensified focus on empowering scrappy data teams and their stakeholders to easily explore data and discover insights.

Get Started

We've officially opened up self-service! Try out Hashboard for free today. We can't wait to hear your feedback; if you have questions or would like to know more, get in touch with us.

Carlos Aguilar
September 7th, 2023
Monitoring BigQuery costs in Hashboard with dbt and GitHub Actions


DataOps for an orderly, collaborative data culture

At Hashboard, we write a lot of code, review every change, and deploy releases at (frequent) intervals after running a suite of continuous integration tests. A disciplined culture of DevOps increases our developer productivity, allowing us to ship new features faster while spending less time tracking down regressions.

The core premise of DataOps, or BI as Code, is to bring the same discipline of DevOps into the business intelligence space. That means writing data transformations as code which are checked into a version control system, reviewing changes, and running continuous integration to catch problems early.

At Hashboard, we use dbt to transform our raw data into structured tables through a fully automated data transformation pipeline. We have developed a DataOps workflow that runs, tests, and deploys dbt transformations alongside our Hashboard project. This helps us identify any breaking changes to Hashboard resources and validate our entire data pipeline. Our dbt models and Hashboard resource configs are stored in the same GitHub repository. To make changes to dbt, we open a new pull request in that repository. Then, a GitHub Action creates a new BigQuery dataset, runs our dbt transformations, and posts a comment on the pull request with a link to a Hashboard preview (assuming everything succeeds!). The entire process takes only a few minutes and provides rapid feedback when issues arise.

A worked example of DataOps at Hashboard

Recently, I noticed that we were spending quite a bit of money on BigQuery jobs in Google Cloud Platform. To dig a bit deeper into the problem and identify usage patterns, it would be useful to analyze the BigQuery jobs our systems are running and estimate their costs. Fortunately, BigQuery makes this data available through a special view called INFORMATION_SCHEMA.JOBS.

Creating a new dbt model

I started by creating a new dbt model from the BigQuery view. Here's what that query looked like:

```sql
-- models/gcp/bigquery_jobs.sql
SELECT
  creation_time,
  project_id,
  user_email,
  job_id,
  transaction_id,
  job_type,
  statement_type,
  start_time,
  end_time,
  state,
  total_slot_ms,
  total_bytes_processed,
  total_bytes_billed,
  cache_hit,
  ARRAY(
    SELECT project_id || '.' || dataset_id || '.' || table_id
    FROM UNNEST(referenced_tables)
  ) AS referenced_tables,
  (error_result is not NULL) as is_error,
  ((total_bytes_billed / POW(10, 12)) * 6.25) as compute_cost
FROM `region-us`.INFORMATION_SCHEMA.JOBS
```

I calculated the cost of a job by multiplying the terabytes billed by the GCP listed price per TiB, which at the time of writing is $6.25.

```yaml
# models/gcp/schema.yml
version: 2
models:
  - name: bigquery_jobs
    meta:
      hashboard:
        hbVersion: "1.0"
        name: BigQuery Jobs
        description: >
          Jobs executed in BigQuery for the hashboard-analytics Google Cloud
          project in the last 180 days.
        # Hashboard metrics we want to add to our model
        cols:
          - id: total_slot_ms
            name: total_slot_ms
            type: metric
            aggregate: sum
            physicalName: total_slot_ms
            description: Sum of the slot milliseconds for the jobs over their entire durations.
          - id: total_bytes_billed
            name: total_bytes_billed
            type: metric
            aggregate: sum
            physicalName: total_bytes_billed
            description: |
              If the project is configured to use on-demand pricing, then this field
              contains the total bytes billed for the job. If the project is configured
              to use flat-rate pricing, then you are not billed for bytes and this field
              is informational only.
              Note: This column's values are empty for queries that read from tables
              with row-level access policies.
          - id: total_bytes_processed
            name: total_bytes_processed
            type: metric
            aggregate: sum
            physicalName: total_bytes_processed
            description: Total bytes processed by jobs.
          - id: total_compute_cost
            name: total_compute_cost
            type: metric
            aggregate: sum
            physicalName: compute_cost
            formattingOptions: { formatAsDollars: true, fixedDecimals: 2 }
            description: The total cost (in dollars) of compute based on the total bytes billed.
    # dbt model attributes
    columns:
      - name: creation_time
        type: TIMESTAMP
        description: Time the job was created.
        meta: { hashboard: { primaryDate: true } }
      - name: project_id
        type: STRING
        description: The ID of the project.
      - name: user_email
        type: STRING
        description: Email address or service account of the user who ran the job.
      - name: job_id
        type: STRING
        description: The ID of the job. For example, bquxjob_1234.
        meta: { hashboard: { primaryKey: true } }
      - name: transaction_id
        type: STRING
        description: ID of the transaction in which this job ran, if any.
      - name: job_type
        type: STRING
        description: >
          The type of the job. Can be QUERY, LOAD, EXTRACT, COPY, or NULL. A NULL value
          indicates an internal job, such as a script job statement evaluation or a
          materialized view refresh.
        tests:
          - accepted_values:
              values: ["QUERY", "LOAD", "EXTRACT", "COPY"]
      - name: statement_type
        type: STRING
        description: >
          The type of query statement. For example, DELETE, INSERT, SCRIPT, SELECT, or UPDATE.
          See [QueryStatementType](https://cloud.google.com/bigquery/docs/reference/auditlogs/rest/Shared.Types/BigQueryAuditMetadata.QueryStatementType)
          for the list of valid values.
      - name: start_time
        type: TIMESTAMP
        description: Start time of this job.
      - name: end_time
        type: TIMESTAMP
        description: End time of this job.
      - name: state
        type: STRING
        description: Running state of the job. Valid states include PENDING, RUNNING, and DONE.
        tests:
          - accepted_values:
              values: ["PENDING", "RUNNING", "DONE"]
      - name: cache_hit
        type: BOOLEAN
        description: >
          Whether the query results of this job were from a cache. If you have a
          multi-query statement job, cache_hit for your parent query is NULL.
      - name: referenced_tables
        type: STRING
        description: Array of tables referenced by the job. Only populated for query jobs.
      - name: is_error
        type: BOOLEAN
        description: Whether the query resulted in an error.
```

Reviewing and previewing the change

Once I was satisfied with the model, I opened a pull request with my changes in our dataops repository. I tagged some colleagues (thanks, Dan & Anna!) for code review. A few minutes later, I got a notification that our continuous integration had succeeded. We use GitHub Actions to automatically run our dbt workflow and create a Hashboard Preview build for open pull requests. Each pull request gets its own BigQuery dataset, so all of the preview links continue to work simultaneously.

The preview looks good, but during review, Anna spotted an issue with the model: Google Cloud Platform charges per tebibyte (TiB), not terabyte (TB). That means we should be dividing the number of bytes billed by POW(1024, 4), not POW(10, 12). Whoops!

If I had made a mistake in the dbt syntax, the GitHub Action would fail and I would be able to see all of the logs and error messages with a single click.
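The fix itself is a one-line change to the cost expression: divide by the number of bytes in a tebibyte rather than a terabyte, keeping the same $6.25 on-demand list price.

```sql
-- POW(1024, 4) bytes per TiB, instead of POW(10, 12) bytes per TB
((total_bytes_billed / POW(1024, 4)) * 6.25) as compute_cost
```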
Deploying to production

Once I pushed a fix to the model to the pull request, the CI ran again and gave me another green checkmark. With Anna's stamp of approval, I merged the pull request into our main branch, where it will forever live (and be documented) until the end of time. Once the pull request was merged, it triggered yet another GitHub Action to deploy the changes to production.

And that's it! Now, my team can start exploring the data to understand which kinds of queries are expensive, and to optimize our pipelines accordingly. Woah... certainly, there is some room for improvement here!

I've simplified the story a bit for the sake of illustration. There were actually several rounds of review (and even follow-up pull requests) to converge on this model. Even now, we're making improvements to our metrics and updating our documentation, all without breaking dashboards or presenting inconsistent views to our stakeholders.

What could DataOps mean for you?

At Hashboard, we're focused on building a collaborative data culture. An ergonomic BI as Code workflow with automated builds, previews, and deployments is central to making that data culture feel good. People really want to use data to answer questions and make better business decisions when its presentation is clear, intuitive, and, above all, consistent. But the converse is also true: without discipline and processes, data becomes noise and using it becomes a chore. We're getting ready to share our automations with the world in the hopes that others might find them useful or draw inspiration. I hope this example helps illustrate the value of DataOps, even if you end up using different tools or a different workflow.
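If you want to set up something similar, here is a minimal sketch of the kind of pull-request workflow described above. The file name, dataset naming convention and CI profile are assumptions for illustration, not our exact pipeline, and the Hashboard preview step is left as a placeholder.

```yaml
# .github/workflows/dbt-preview.yml (illustrative sketch)
name: dbt preview
on: pull_request

jobs:
  preview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-bigquery

      # Build the models into a BigQuery dataset dedicated to this pull request.
      # We assume a CI profile that reads the dataset name from an env var;
      # GCP authentication is omitted here.
      - name: Run dbt against a per-PR dataset
        run: dbt build --profiles-dir ./ci
        env:
          BQ_DATASET: analytics_pr_${{ github.event.number }}

      # A further step would create the Hashboard preview build and post the
      # link back to the pull request as a comment.
```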

Meyer Zinn
July 25th, 2023
Choosing the Right BI Tool to Grow a Data-Driven Culture: Comparing Hashboard, Looker, Metabase, and Tableau


Empowering all levels of an organization to make high-tempo, data-driven decisions is crucial for driving innovation, improving performance, and building a shared understanding of your business. Building a collaborative data culture requires accessible and collaborative tools. Business Intelligence (BI) tools have emerged as the cornerstone of a data-driven organization, providing the means to transform complex data into actionable insights. Yet, with a myriad of BI tools available, selecting the one that best aligns with your organization's aspirations can be challenging.

Why you should trust me (and why you shouldn't)

I've built and used data products at one of the biggest tech companies in the world, at a successful startup right as it got acquired, and at a seed-stage startup. Along the way, I've used and advised on a broad range of commercial and internally developed BI solutions. Notably, I work at Hashboard as a Technical Product Manager, but I've also seen other tools succeed and want to highlight when they're the right choice depending on your needs.

What does self-service BI even mean?

When I talk about "Self-Service BI", I'm referring to tools that allow users, regardless of their technical expertise, to independently access and answer their questions using data. Ever been in a position where you're reliant on a data analyst to make sense of data? Traditional BI tools create a divide: data specialists answer questions, and business users consume these answers. This setup can lead to bottlenecks, with users dependent on data teams for every single question. The data team often doesn't have the capacity to answer everyone's questions, forcing business users to make decisions based entirely on their intuition. Another common result is that data teams waste effort building resources no one uses.

Self-service BI, on the other hand, integrates data exploration into everyone's workflow, promoting rapid, confident decision-making. This independence is a key driver for a data-driven culture, empowering all within an organization to make data-informed decisions swiftly and confidently. So, what key features facilitate this?

Governed Metrics and Dimensions

Centralizing metrics ensures accuracy across all explorations, reducing the risk of errors and misunderstandings. Similarly, clearly defining dimensions and attributes, such as what qualifies as an active account, is crucial. Ideally, metric definitions are organized in a way that mirrors your organization, promoting accountability and alignment. This arrangement allows users to understand not only their own metrics, but also the broader organizational ones, and how these metrics interrelate.

Exploratory Data Analysis

When a metric exhibits an unexpected shift, business users need to be able to identify why. This involves drilling down into the data, breaking it out by various dimensions to pinpoint what's driving the change. For instance, if shipping volume dips, a breakout by carrier can reveal if a specific carrier's performance is the cause. Identifying the root cause usually takes more than one level of questioning. Users must be able to quickly test different breakouts and combinations of breakouts. This necessitates flexible questioning beyond what a static dashboard can provide.

Context and Collaboration

Sometimes, data doesn't tell the whole story, and contextual information is needed. A national holiday might explain a drop in orders, but it's not captured in the raw data.
Self-service BI should allow for the incorporation of such knowledge, enabling users to make sense of the data in its fuller context. Practical features for this are commenting and mentioning, but even descriptions can help.

Comparing Hashboard, Metabase, Looker, and Tableau

Hashboard

Who Hashboard is for: Data teams building tools for non-technical users to answer their own questions with as little overhead as possible, using modern version control and CI/CD pipelines.

Governed Metrics and Dimensions: Hashboard has a metrics-focused, lightweight semantic layer. It supports model joining, and it is safe for non-technical users to add breakouts and change the granularity of metrics.

Exploratory Data Analysis: Hashboard automatically builds a visual data explorer as an interface for users to analyze metrics without SQL. Users can easily slice and group data via visualizations while metrics are safely recalculated based on the semantic layer.

Contextual Knowledge: There is no commenting or mentioning system. Since most users are able to create their own dashboards, they can use markdown blocks as a way to store context. Description fields for metrics and attributes show up consistently in the app.

Pricing: $4.2k per year for up to 10 users. https://hashboard.com/pricing

Metabase

Who Metabase is for: Early teams where everyone knows SQL and is looking for a self-hosted open source solution.

Governed Metrics and Dimensions: Metabase has an extremely limited data modeling feature. Users cannot use Metabase data models to break out and drill into metrics.

Exploratory Data Analysis: Metabase best serves users who already know SQL. It is primarily a GUI for building SQL queries. Non-technical users aren't able to easily play with and test different cuts of data and metrics.

Contextual Knowledge: There is no commenting or mentioning system, but there is an interesting events and timelines feature where users can record events to surface in charts (https://www.metabase.com/docs/latest/exploration-and-organization/events-and-timelines).

Pricing: Metabase is open source and can be self-hosted. There is also a commercially hosted option, but I've generally seen teams stick with the self-hosted option. https://www.metabase.com/pricing/

Looker

Who Looker is for: Data-driven organizations that need a comprehensive, feature-rich dashboard tool with strict governance and can manage high operational costs.

Governed Metrics and Dimensions: Looker has a very sophisticated system of data modeling and joining which can be very powerful, but it is also so complicated that business users can get confused about which slices of metrics are safe to calculate.

Exploratory Data Analysis: Looker has a data explorer where users can create their own visualizations based on the semantic layer defined by the data team. Users are able to create breakouts and build their own charts and tables. Metrics are generally safely recalculated based on the semantic layer, but this can sometimes be dangerous if the data team has allowed many-to-many joins.

Contextual Knowledge: There is no commenting or mentioning system, and end users aren't able to make dashboards to store notes. Description fields for metrics and attributes show up consistently in the Explore interface.

Pricing: $60k per year minimum. You need to pay both a platform cost and for user seats.
https://cloud.google.com/looker/pricing

Tableau

Who Tableau is for: Less technical data teams and organizations that are willing to invest in training everyone on a specific tool. Primarily for creating dashboards, not interactive exploration.

Governed Metrics and Dimensions: Tableau doesn't have centrally defined metrics and dimensions; all definitions live only within a workbook. Over time, upstream changes in the data warehouse can lead to a lot of dead and broken dashboards.

Exploratory Data Analysis: Tableau can be a great unlock for less technical users who need to create their own dashboards, and it has a very expressive visualization system. However, once those dashboards are uploaded, there's no way for other users to continue to explore the data.

Contextual Knowledge: There is no commenting or mentioning system, and most users cannot create their own dashboards to store notes.

Pricing: Seat-based pricing. $840 per year for the first user; additional users are cheaper. https://www.tableau.com/pricing/tableau-server

Summary

While all four tools strive to make data accessible, your organization's chances of successfully driving data-driven decision making will vary depending on the technical knowledge of the users. Hashboard and Looker are explicitly designed for self-service data exploration as a solution managed by the data team. Metabase is an open source option for SQL users. Tableau enables non-technical users to make dashboards, but isn't a real self-service solution.

Appendix: other options and why I didn't include them

Hex: Very cool notebooks for technical analysts and data teams. I would've loved to use this in any of my data roles. So far, I've mostly seen it work within data teams, not as a general BI tool for the rest of the organization. It requires SQL or Python knowledge (ideally both) to explore data effectively.

Mode: Expects everyone to learn SQL. I've seen Mode used at large companies and small ones, but the bet you need to make here is that everyone will eventually learn SQL. This doesn't meet my definition of self-service.

Andrew Lee
July 7th, 2023
Five Signs It’s Time to Adopt a Data Visualization Tool


Hi! I'm Sarah. I spend my days at Hashboard obsessing over our customers and digging into how our team can create a delightful product that helps teams build data-driven cultures. My background is in consulting, so I'm always asking "why" and excited to peel back the layers on complex problems. I joined Hashboard in January because I personally related to the struggle of being a non-technical user, desperate to get my hands on data but not having a great way to do so. I couldn't resist the opportunity to help build a team and product that is working to bridge the gap between data teams and the business. If you have any questions about what we're building (and WHY we're building it), please reach out to me; I'd love to connect and hear from you!

When it comes to building a data-driven culture at your organization, there are endless opinions and tools to help teams embark on that journey. In this post we're starting at the very beginning. If your organization or team is early stage, you'll inevitably find yourself asking, "Is it time for our team to graduate from visualizations in Excel?!" The guide below will help you determine if it's time to answer "yes" and level up your business intelligence.

1. Your team is starting to get overwhelmed with increasing amounts of data

When you're early, analyzing data doesn't make sense… because there isn't enough data! You have five customers, you know them by name and you pretty much know everything that they're doing. But at a certain scale it's hard to manage things without data. That point comes much sooner for particular business models, ad tech and consumer for example. Another factor is growth: as your business grows, you'll be collecting more and more data about your product usage, customer information and internal activity. The data gets larger and more complex, and if it's not maintained it will become useless. As you grow, fast monitoring becomes essential, since you're going to need to easily access and understand this data in order to scale.

Pro Tip: If you don't already have a data warehouse set up, we would suggest putting it on the to-do list! Here's a guide on how to set up a data warehouse in 30 minutes or less. If setting up a data warehouse is out of the question for your org, there are still options! For small to moderate data sets you can use DuckDB + Hashboard to upload CSVs and visualize your data.

2. You can't afford to wait on your "data person" to surface key insights

As your business grows, each person on your team is going to need to make decisions more quickly than before. This may look like customer success teams that are under pressure to deliver answers faster to customers. Or it could be that your product team needs fresh and dynamic data to solve critical product issues or to inform major roadmap decisions. If this entire process is stuck in a bottleneck because you've been relying on one or two data people, ring the alarm bells! Having a self-serve data visualization tool in place will help all folks in your org quickly analyze large amounts of data, identify important insights and unblock themselves. This shift will free up your data folks' valuable time for high-leverage questions and projects.

3. Top-line KPIs haven't been evangelized to your entire team

If your organization's KPIs are keeping you up at night, but you have a sinking suspicion that half of your company has no idea what they are, it's time to get everyone on board.
Visualizing and visibly tracking progress on your company's critical KPIs is the best way to get every single person engaged: gone are the barriers of "I don't know SQL" or "I don't have access to our data warehouse." Creating a simple dashboard to track KPIs is a great starting point.

Pro Tip: Current Glean customers leverage our Slack integration to share org-wide goals with every single employee weekly. Choose what you want to share and how frequently with our automated reporting feature. If your team has already done the work to define and track KPIs, sharing should be the easy part!

4. Difficulty interpreting data due to legacy knowledge

As your team grows, you'll quickly realize that the way data was stored and interpreted in the early days may cause mass confusion for newcomers. When you're frequently hearing questions like "How exactly do we define an account?" or "Why are there marketing metrics mixed in with product analytics?", it's time to start clearly defining your metrics and documenting everything along the way. Visualizing these core metrics with documentation side by side gives your users an easy workflow to absorb both the data and, arguably more important, the context that makes the data meaningful! Glean's explore links also make sharing context alongside your data incredibly easy.

5. Need to communicate insights to external stakeholders

If you're considering raising another round or providing your customers with analytics, things are bound to get complicated. Sending a spreadsheet plus a word doc with context plus an email with additional commentary is not an effective way to communicate. It will likely result in either 1) no one ever looking at your data or 2) the floodgates opening with questions because the data is static. Communicating insights in a clear and visually appealing way will guarantee higher engagement and happier customers. Don't let a subpar data visualization be a distraction from the amazing results you have to show off!

It absolutely can be challenging to invest up front in cleaning up your data and implementing a new tool. However, the tradeoff is that you'll be able to free up your data team's valuable time to work on more impactful projects.

Sarah Davidson
May 19th, 2023
Security and Compliance at Hashboard


Being a BI product means that our users often trust us with their most sensitive data. With that in mind, we're building Hashboard with careful consideration for security and privacy every step of the way. So what does that look like in practice?

For starters, we're SOC 2 Type II certified, with no noted exceptions in our most recent audit report

Developed by the AICPA, SOC 2 Type II is an extensive auditing procedure that ensures a company is handling customer data securely and in a manner that protects the organization as well as the privacy of its customers. SOC 2 is designed for service providers storing customer data in the cloud.

We're also HIPAA compliant

With many folks on our founding team coming from Flatiron Health, we've designed Hashboard with healthcare orgs in mind. We support Health Insurance Portability and Accountability Act (HIPAA) compliance and also sign Business Associate Agreements (BAAs). There is no extra charge in Hashboard for HIPAA compliance or BAAs.

Storing Data

Hashboard operates by issuing queries to your existing data warehouse. We do not ingest and store the full underlying data of your tables. To enable fast interactive data exploration, Hashboard caches aggregated query results within our infrastructure and in your local browser session. Users can customize Hashboard's cache usage to meet specific performance or data freshness requirements.

If any users or prospects need more information about our security procedures and/or would like to request a review of our SOC 2 report, please reach out to support@hashboard.com and we'd be happy to provide the necessary documentation.

Sarah Davidson
May 19th, 2023
Why I Stopped Worrying and Learned to Love Denormalized Tables


Hey, I'm Andrew and I'm a product manager at Hashboard. Previously, I've worked on analytics in a variety of roles: analyst, data engineer, visualization specialist, and data scientist.

Normalized tables: tables designed to avoid repeating information, keeping data organized and easy to maintain.

Denormalized tables: tables that have repeated information, but make data retrieval faster and simpler to understand. Also known as One Big Table (OBT).

After years of meticulously modeling data in relational databases and avoiding duplication at all costs, I've come full circle on the power of wide, flat tables for analytics. In this post, I'll share the journey of how I ended up embracing denormalized tables, how to use tools like dbt to bridge normalized and denormalized tables, and how denormalized data accelerates exploratory data analysis and self-service analytics.

Quickly answering questions with spreadsheets and dataframes

Starting out as an analyst, I first learned to analyze data in Excel, then in Jupyter notebooks and RStudio. I spent a lot of time in live Q&A sessions with stakeholders, answering questions about the metrics that mattered to them and figuring out why, why, and why. Questions like "Are the metrics going up or down?" or "Why did revenue dip this day?" or "How does this segment compare to this one?" were common. I could usually start off with a simple SQL query, but as we layered on business logic, the number of joins would inevitably explode. I'd have to pause the live Q&A and follow up after I untangled the query I had generated.

I quickly learned that writing one giant query with a bunch of joins, or even a bunch of Python helper functions, could get me stuck. My transformation functions weren't flexible enough, or my joins were too complicated to answer the endless variety of questions thrown my way while keeping the numbers correct. Instead, the easiest way to be fast, nimble, and able to answer all the unexpected questions was to prepare a giant table or dataframe and limit myself to it. As long as I understood the table's contents, it was harder to make mistakes. I could group by and aggregate on the fly with confidence. I also found this made transporting data across a variety of tools really convenient: I could play with the same table in Excel pivot tables or a Pandas notebook, since I could export it to a single CSV.

Learning to model data

As my technical skills grew, I took on more responsibilities such as data ingestion, ETL pipelines, and application building. Along the way, I learned about data modeling and schema design to maintain organized and easily maintainable databases. select name from users just makes sense! I was a firm believer in the standard practices of normalization, and I began my projects thinking about foreign key relationships and indexing. I believed that duplicate data would lead to inefficiencies and errors, so I avoided it at all costs.

However, as I gained more experience and began building tools for exploratory data analysis, I discovered a surprise: those big, flat tables I used to make as an analyst were still incredibly powerful and valuable for exploring data. Without the flat table in the warehouse, I'd end up having the same big query or sequence of Pandas spaghetti in a dozen different places. Inevitably, the logic would drift and I couldn't rely on older resources. I started baking the join and filter logic into database views so I could have freshly prepared data with a consistent structure.
I could run ETL into a normalized schema that made it easy to validate data hygiene and quality, but still quickly query data using the denormalized view. It turned out normalizing my data was helping me denormalize my data.

The Aha Moment: Leveraging dbt and modern warehouses to unlock the power of denormalization

Transformation tools such as dbt (Data Build Tool) have revolutionized the management and maintenance of denormalized tables. With dbt, we can establish clear relationships between table abstractions, create denormalized analytics datasets on top of them, and ensure data integrity and consistency with tests. Modern warehouses such as Snowflake and BigQuery, combined with ELT (extract, load, transform) patterns, allowed me to simplify my pipeline code into a bunch of SQL and generally never have to worry about the volume of data I was processing. The excellent query performance of denormalized tables in modern warehouses also makes it quicker to run analyses. I could now safely build an analytics table of millions of records and build many different visualizations and analyses with simple filters and aggregations, no complex joins or CTEs required. The many columns ensured I could flexibly create the series and aggregations I needed. If there were any data quality issues, I could easily rebuild my tables with dbt and the new data would flow into my charts and dashboards.

Conclusion

Denormalized tables prioritize performance and simplicity, allowing data redundancy and duplicate info for faster queries. By embracing denormalization, we can create efficient, maintainable data models that promote insightful analysis. So, why not give denormalized tables a chance? You might just find yourself wondering why you ever worried in the first place.

If you're looking for a BI tool that fully takes advantage of denormalized tables and the rest of the modern data stack, we're building one at Hashboard! We're built for quick and agile metrics exploration, with powerful data visualizations and tables. Come check us out!
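To make the pattern concrete, here is a sketch of what a denormalized "one big table" model can look like in dbt. The staging models and columns are invented for illustration, not taken from a real project.

```sql
-- models/marts/order_items_wide.sql (illustrative sketch)
-- One row per order item, with customer and product attributes repeated
-- so downstream analysis needs only simple filters and aggregations, not joins.
select
    oi.order_item_id,
    o.order_id,
    o.ordered_at,
    c.customer_id,
    c.customer_segment,
    p.product_id,
    p.product_category,
    oi.quantity,
    oi.quantity * p.unit_price as line_revenue
from {{ ref('stg_order_items') }} oi
join {{ ref('stg_orders') }} o on o.order_id = oi.order_id
join {{ ref('stg_customers') }} c on c.customer_id = o.customer_id
join {{ ref('stg_products') }} p on p.product_id = oi.product_id
```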

Andrew Lee
May 11th, 2023
Making Custom Colors “Just Work” in Hashboard


It comes in as a simple request: "we want custom colors for our charts in Hashboard". It seems simple, but doing it well is more nuanced than you may expect.

Just Picking Colors

When you build charts in Hashboard, you're not just building a single chart, you're building a jumping-off point for future exploration. Picking individual color values is okay if you are building something one-off, but it becomes cumbersome when exploring data. As the series change, it becomes very time consuming to keep the colors up to date, and this often results in repeating the same color twice (making the chart unreadable).

(A chart with repeated colors makes it difficult to understand which marks correspond to which series.)

So Hashboard's approach to colors needs to be more dynamic. If changing the series broke all the colors, that wouldn't be a fluid experience as you explore, it'd be an interruptive one. We want to allow people to focus on their data, their questions, and their answers. We don't want people to bounce into formatting options in the middle of their exploration. Colors need to "just work".

(A corrected chart with distinct colors.)

Color Palettes

Instead of associating a color with each series, a chart has a color palette. This palette is a bundle of colors that fit your brand, theme, or particular use case.

(Hashboard's default palette: "Twilight".)

Whenever the charts need colors, the visualization requests a distribution of distinct colors from the palette which maximize contrast. If the chart needs two colors from a three-color palette, we'll pick the two most distinct colors. If the chart needs all three, we'll grab all three. But what if you need 4, or 5, or 100 colors? We could repeat the provided colors (and we have an option for that), but generally we want distinct colors so that all the series are uniquely identifiable.

Interpolation

When the chart needs 4 colors from a 3-color palette, we need to generate new colors. To do so, we need to interpolate:

1. Map the 3 colors in the palette into a shared color space
2. Draw connections between the colors
3. Sample colors along the connection
4. Form a gradient from the samples
5. Evenly divide the gradient to produce the final colors

Voila! Four distinct colors (two of them brand new), which fit our theme and maximize contrast. If we needed five colors, we don't even need to recalculate our gradient, we just need to divide it into five segments.

Color Spaces

This process can have vastly different results depending on how you map the colors into a color space. Using a different color space, we could produce a very different set of colors. The results from using these 2D color spaces are okay, but if we want a better interpolation from yellow to teal, we need a more nuanced color space. There are actually hundreds of ways to map colors into a space for interpolation. Many of them are 3D, and a few are 4D!

(Some commonly used color spaces in computer graphics. Image courtesy of the University of Kansas School of Engineering.)

Depending on the color space, we get slightly different results. In the end, we decided to use the HCL color space, as we found it works well for a variety of palettes.

Colorful Charts

So that's the ballgame. Simple requests sometimes need careful implementations to preserve Hashboard's first-class user experience. Now… you didn't think you'd leave empty handed, did you? You read a whole blog post about colors!
Here are a couple of great-looking color palettes, on us, alongside their DataOps config (Hashboard lets you control your visualizations as code if you desire). You can import these directly into your own Hashboard project.
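For the curious, here is a toy sketch of the "sample along a gradient" idea described above. It interpolates in plain RGB for simplicity; Hashboard's actual implementation works in the HCL color space, which requires a proper color library.

```python
def interpolate_palette(palette, n):
    """Return n colors spread evenly along the gradient through `palette`."""
    if n <= len(palette):
        # Enough colors already: spread picks across the palette to maximize contrast.
        step = (len(palette) - 1) / max(n - 1, 1)
        return [palette[round(i * step)] for i in range(n)]

    def lerp(a, b, t):
        # Linear interpolation between two RGB tuples.
        return tuple(round(a[i] + (b[i] - a[i]) * t) for i in range(3))

    colors = []
    for i in range(n):
        t = i / (n - 1)                  # position along the whole gradient, 0..1
        scaled = t * (len(palette) - 1)  # which palette segment this position falls in
        lo = min(int(scaled), len(palette) - 2)
        colors.append(lerp(palette[lo], palette[lo + 1], scaled - lo))
    return colors

# Stretch a 3-color palette to 4 distinct colors (two of them brand new).
print(interpolate_palette([(250, 220, 94), (72, 169, 166), (36, 64, 110)], 4))
```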

Charlie Imhoff
April 4th, 2023
How to do Version Control for Business Intelligence


Charts, dashboards, and other business intelligence assets are some of the highest-leverage resources at a company. They provide visibility into the business, track goals, and identify issues that need to be solved. If you own resources at this layer of the data stack, you'll want to think about your version control strategy. Just like with any other important engineering system, it's important to have a record of what changes are being applied to your BI layer, tracking what exactly changed, when it changed, and who made the change. Your approach for tracking these changes will vary depending on the needs of your team and the capabilities of your BI tool. Here are some options.

Keep a changelog

The simplest way to track BI changes is to keep a log of what changes are made. At a minimum, this should include a timestamp, who made the change, and a short description of what changed. This approach works for only the simplest of setups, where you have a small number of dashboards and a limited number of people who have the ability to make changes. Otherwise, you'll have trouble keeping the changelog up to date. Once you've moved beyond a handful of charts or have important decisions that depend on your dashboards, it's a good idea to start treating your BI like production by introducing some tooling.

Save immutable copies of each version

Save a copy whenever you make a change and store it somewhere. This is a bit old-fashioned but might be the only viable approach when using a desktop-based tool like Tableau. Make sure you have a clear naming pattern or you'll end up with notoriously confusing files like okr_dashboard_v2_new_final. You'll also want some sort of deployment strategy to ensure users are only using the most recent version. This only works for resources that do not change often and are not edited simultaneously by multiple people. Even then, the number of files will quickly become unwieldy; you don't want to end up with different people looking at different versions of your dashboard, which can be worse than having no version control at all!

Use your BI tool's built-in change history

Some BI tools natively track the changes to your resources, and it might be enough to just rely on these built-in features for simple version control. In Glean, each resource has a change history showing who has recently made changes. Metabase and Sisense also have this feature, though Apache Superset / Preset and Microsoft Power BI do not. The main downside to relying only on native change histories is that they don't allow you to correlate BI changes with other changes happening upstream. Often, a dashboard is updated as a result of a schema change in the database that it reads from. If you want to track this relationship, you'll end up needing a separate changelog again.

Use code to define and deploy your BI resources

The most thorough approach to version control is to use code that is committed to git, usually alongside your data pipeline code. This allows you to capture the full context and intent of your changes. There are a few different variants here. In a mature BI setup, there are typically two different types of resources:

- A small set of models that define core concepts and metric definitions
- A larger set of views and dashboards that represent summaries and ad hoc explorations on top of your models

Among these layers, models probably don't change very frequently, but when they do, they typically have upstream dependencies, and changes here can have a large impact on downstream views.
On the other hand, your dashboards are probably changing often as business users iterate on what's useful. A common approach we see among Glean users is to start by prototyping models in the web UI and then commit the configuration into git once it is stable, while leaving downstream dashboards and explorations to be managed via the UI. This usually strikes a good balance between keeping tight control over the foundational bits of logic in your reporting stack and not requiring business users to deal with code workflows to make changes. With Looker, there's no prototyping phase: your model needs to be written in LookML to start, so version controlling it just involves checking that code into git.

Once particular dashboards become more mature, you can incrementally migrate them into code over time, although not all BI tools support defining the visualizations themselves as code. The tooling will also have a big impact on how easy it is to define and deploy visualizations with a code-based workflow. The best workflows have a very tight feedback loop between the code and the visualization; an inline editor in the browser helps a lot.

(Glean's web-based code editor and CLI workflows help to shorten the iteration cycle when developing BI as code.)

It's surprisingly common to not have any version control for BI assets. If you don't have it, it's a good idea to get it in place sooner rather than later. This post covers the basics; in future posts we'll do some deep dives into some of these techniques.
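To make the code-based approach more concrete, here is one possible repository layout for keeping transformation code and BI configuration side by side. The directory names are illustrative, not a required structure.

```
analytics-repo/
├── dbt/
│   ├── models/
│   │   ├── staging/
│   │   └── marts/            # core concepts and metric definitions
│   └── dbt_project.yml
├── bi/
│   ├── models/               # committed to git once prototyped in the UI
│   └── dashboards/           # optionally migrated into code as they mature
└── .github/
    └── workflows/
        └── ci.yml            # validate models and BI config on each pull request
```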

Dan Eisenberg
March 16th, 2023
Using DuckDB for not-so-big data in Hashboard


Big Data is cool, and so are infinitely scalable data warehouses. But sometimes you just have not-so-big data. Or maybe you have a .csv file and you don't even know what a data warehouse is, or don't care about the "Modern Data Stack" (like, what is that?). That shouldn't stop you from exploring your data in an intuitive, visual way in Hashboard.

There's lots of hype about DuckDB, and for good reason: it's a fun tool! But it also has some tangible benefits for the data ecosystem. For Hashboard, there are a few features that made it the obvious choice for small data:

- Improved accessibility: no external database required
- Isolation and testing
- Portable computation

For those that don't know, DuckDB is a columnar in-process database engine that can read a variety of data formats and works especially well with Apache Arrow, which makes it nice for quickly writing fast SQL queries on medium-sized data.

Making data tools more accessible

All the technical features are neat-o for engineers and data nerds, but more importantly for us, we saw DuckDB as an opportunity to improve the accessibility of Hashboard. Before we integrated DuckDB, you had to have a database server to use Hashboard (something like Snowflake or Postgres). This meant that if you had data you wanted to analyze, you also had to figure out how to get it INTO that database. With our DuckDB integration, you can now upload CSV files (or JSON, TSV, parquet files, etc.) and don't need any external database. Hashboard's whole reason for existing is to make data more accessible and useful, and allowing file uploads via our DuckDB integration fits in with that perfectly.

To our surprise, DuckDB even increased our own usage of Hashboard. I didn't realize how often I had a loose spreadsheet that I wanted to quickly pivot and trellis in Hashboard but found myself using Excel instead. Excel is great too, but it was more fun to track bug-bash progress in Hashboard without having to use Fivetran and the rest of the Modern Data Stack™. One final note: DuckDB has a pretty awesome SQL dialect that is designed for analysis, so it has been fun to use with random lightweight datasets (and public datasets from Kaggle).

Isolation and testing

More on the developer / internal side: DuckDB has made Hashboard higher quality by improving our system's isolation in testing. Hashboard is built on the Modern Data Stack™; we buy into having lots of standardized logic in a database like Snowflake. Fun! Except this sorta sucked when you don't want to depend on an external service. Previously, all integration tests would run against databases that were external to Hashboard. Since we run DuckDB ourselves, we can now run more integration tests without relying on an external process or database.

Portable computation

The first iteration of our DuckDB integration just uses a process / service in our backend to process files in DuckDB and run SQL on them. Hashboard natively uses Apache Arrow as a serialization format so that we can do data processing on both the backend and frontend natively. This lends itself well to potentially bringing DuckDB into our client as a next iteration, and not having to ship data to our servers at all in order to explore it.
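As a quick illustration (a toy example, not Hashboard's internals): DuckDB runs in-process and can query a local CSV directly with SQL, no database server involved. The file name and columns here are made up.

```python
import duckdb

# In-process, in-memory database; nothing to stand up or connect to.
con = duckdb.connect()

# Query a loose CSV file directly with SQL.
rows = con.execute("""
    SELECT assignee, count(*) AS open_bugs
    FROM read_csv_auto('bug_bash.csv')
    WHERE status != 'closed'
    GROUP BY assignee
    ORDER BY open_bugs DESC
""").fetchall()

for assignee, open_bugs in rows:
    print(assignee, open_bugs)
```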

Carlos Aguilar
February 13th, 2023
Building in Public


Have you ever wondered if that feature you provided feedback on ever made it onto the roadmap? Or maybe you noticed something that looked or felt slightly different in Hashboard?! If so, you're not alone! This is because our team is always busy shipping new features and improvements.

We have to admit, up until this point we haven't been totally consistent about publishing all of the new features and improvements we release. We heard you, which is why we're thrilled to announce the launch of our public Changelog and Product Roadmap! These two resources will provide an up-to-date overview of all product changes, as well as a preview of items our team will be tackling in the future. We'll update these at least monthly, but will look to increase the cadence of updates in the future.

Check out the Changelog and Product Roadmap today! We're always curious to hear your feedback. Let us know if we're missing anything by reaching out to us in your project Slack channel or drop us a note at product@hashboard.com.

Nathaniel Stokoe
February 8th, 2023
Set up your data warehouse on Postgres in 30 minutes


So you want a data warehouse, but you don't have a ton of time to get it all set up. If you're on Postgres and using GCP, AWS or Heroku, we can get the infrastructure for your data warehouse set up in about 30 minutes. If you're managing your own Postgres database, the only part I won't cover here is setting up your replica. I'll just take you through setting up the pipes to get things started; then you can get started with the real work of analyzing and modeling your data :)

What we'll cover:
- What is a data warehouse and why is it useful?
- When is Postgres a bad idea?
- Step 1: set up replication of production data
- Step 2: create a new database and enable querying prod data
- Step 3: start analyzing and modeling
- Step 4: (optional) integrate other data sources
- Step 5: (optional) visualize

(Layout of what we'll cover in steps 1 through 4.)

What is a data warehouse and why is it useful?

A data warehouse is a database that holds consistent data representations of your business concepts. Having a single source of truth for important concepts will unlock your ability to start driving improvements in your business and help you get consistent insights from your data. When everyone is just jamming on random production SQL queries, things are going to get confusing and your organization won't have a shared vocabulary about what is going on. It's like having ten versions of a spreadsheet (should I be looking at customers or customers_final_v3.xls?), except worse, because everyone is looking at different cuts of the same SQL queries.

Good things about a data warehouse strategy:

- Safely query your data: a data warehouse separates analytics queries from your production database. It lives in a separate database from your application database, so you can hammer on it with complex queries without worrying about interfering with production systems.
- Model your data for analytics: a data warehouse allows you to come up with standard views of data so that you can make future analytics queries consistent and easy. This process of creating derivative, canonical views is called data modeling. Maybe product information or clinical patient information is scattered across 10 different tables that you query in a similar way over and over again. In your data warehouse, you are going to create one consistent table (or view) to represent patients, or orders, or whatever you want to analyze over and over again. Tools like dbt, or even just scheduled queries, can help you maintain your consistent views.
- Combine disparate datasets: a data warehouse strategy gives you a common place to start syncing data from other tools, like your accounting, CRM or marketing systems.

When is a Postgres data warehouse a bad idea?

Okay, so building a data warehouse in Postgres might not be a great idea. If you expect to have tens or hundreds of millions of objects / events / users to analyze pretty soon, I would invest in a more scalable solution like BigQuery or Snowflake. See step 4 for some ideas on ways to pretty easily implement BigQuery or Snowflake. I personally love BigQuery and recommend it most of the time: it's really easy to manage and has a powerful SQL dialect. But sometimes you just have data in a production Postgres database and you want to start analyzing it quickly. This guide is for the scrappy founder / PM / engineer trying to hack something together quickly.
Step 1: create a database replica

Replicating your data allows you to safely query it. You shouldn’t analyze data in your application’s database directly, because you might write a bad query that uses all of the database’s resources. If you know what you’re doing and you don’t have any customers yet, this might be okay - but if you’re analyzing data, it’s likely you do have customers. It’s also really easy to mitigate this risk by creating a replica of your production data.

The first step to setting up your Postgres data warehouse is to create a read-only replica of your application database. Streaming replication copies your data more or less in real time to a different database, which means the replicated data is safe to query without impacting your original db server. If you’re using AWS Relational Database Service (like I am in this tutorial), see the AWS instructions; GCP and Heroku have equivalently easy steps for setting up a replica. Let me know if it would be helpful to list instructions for self-managed streaming replication - it’s a bit less magic / straightforward than the rest of this guide.

Step 2: create the data warehouse and add foreign data wrappers

This step is what lets you mutate data and create models. Now you can safely query data, but we also want to be able to write data. There are two cases where we may want to write data:

We want to create new, cleaned-up versions of data with CREATE TABLE and CREATE VIEW statements.
We want to import data from other systems - our CRM, Segment, or marketing tools - into our data warehouse.

First, we are going to create a second (and final) database, which we’ll call data_warehouse. This requires creating a new database in AWS RDS, GCP, etc. Our new database is going to have three schemas, created for three distinct purposes over the next few steps of this guide:

production_replica_fdw: just below, we’ll create this schema and populate it with the data from production.
analytics: we’ll create tables and views for analytics here - this is where the action happens in step 3.
source_data: we’ll drop in new sources here, like CRM data, in step 4.

To get our production data into our Postgres data warehouse, we’ll use a magical Postgres extension called Foreign Data Wrappers (fdw). Foreign Data Wrappers create virtualized resources inside your database that let you query data outside of your database directly, with normal SQL syntax. ✨magic✨ - it’s a pretty flexible and cool feature; in this case we’re simply going to use fdw to query our replicated Postgres data. Set up a Foreign Data Wrapper to get prod data into the warehouse (a sketch of the SQL appears after step 3, below).

Step 3: create a new schema where you can modify data

This is where you model your data with scripts or dbt. Now that we have production Postgres set up, the last bit of plumbing is to create some analytics resources for your org inside the data_warehouse database. We’ll create our analytics schema and start building out some analytics tables and views - sketches of both the step 2 foreign data wrapper setup and the step 3 analytics schema follow below.
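Here is a minimal sketch of the foreign data wrapper setup from step 2, run inside the data_warehouse database. The host, database name, and credentials are placeholders you would swap for your replica’s connection details; postgres_fdw ships with Postgres as a contrib extension.

```sql
-- Step 2 (sketch): expose the read replica inside the data_warehouse database
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER production_replica
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'your-replica-host', port '5432', dbname 'your_app_db');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER production_replica
    OPTIONS (user 'readonly_user', password 'your_password');

CREATE SCHEMA production_replica_fdw;

-- Pull in the replica's public schema as foreign tables
IMPORT FOREIGN SCHEMA public
    FROM SERVER production_replica
    INTO production_replica_fdw;
```

After this, tables from your replica show up as foreign tables under production_replica_fdw and can be queried with ordinary SQL.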
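And a sketch of step 3: create the analytics schema and define a first canonical view on top of the foreign tables. The orders table and its columns here are hypothetical stand-ins for whatever actually lives in your production schema.

```sql
-- Step 3 (sketch): a home for modeled, analytics-ready data
CREATE SCHEMA analytics;

-- Example canonical view built on the foreign tables from step 2
-- (production_replica_fdw.orders and its columns are hypothetical)
CREATE VIEW analytics.daily_orders AS
SELECT
    date_trunc('day', o.created_at) AS order_date,
    count(*)                        AS orders,
    sum(o.total_cents) / 100.0      AS revenue
FROM production_replica_fdw.orders o
GROUP BY 1;
```

You can maintain views like this by hand, with scheduled CREATE TABLE AS refreshes, or with dbt once you have more than a handful of them.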
Step 4: integrate other data sources

So we have our application data in our warehouse, but what about all the other tools that are collecting important marketing, sales, and other domain data? The way most people do this now is with an “ELT” (Extract and Load) strategy, with the Transform happening later in the warehouse. So you’re going to just extract and load data into Postgres without manipulating it. In the good old days (say five to ten years ago), it was more common to filter data and clean it up before putting it into the data warehouse, to avoid duplication and reduce the footprint in the warehouse. Since databases have gotten more scalable and faster, people have started copying data over exactly as it exists in other systems, which makes the transport code easier to reason about.

There are a few different services that will sync data for you so you don’t have to worry about it. For this demo we’ll use Airbyte, since they have a free-to-start service that you can sign up for. Fivetran and Stitch are two other commonly used cloud services. If you want free / open source tools: Airbyte also has an open source version, and Meltano is another option. Now go through the Airbyte onboarding and specify your data warehouse credentials (a sketch of what you can do with the loaded data appears at the end of this post).

A couple of thoughts here: since we’re just using Postgres, there is a limit to how far the ELT strategy will scale. And since we’re going to use a service like Airbyte or Fivetran anyway, it may be worth setting up your Postgres database as a source and transferring it into BigQuery instead.

Step 5: visualize your data

Once you’ve standardized some of your data, you’ll probably want to start visualizing it as well. Hashboard helps you visualize data in a consistent way by developing data models. There are also cheap tools like Looker Studio that can get you up and running quickly.
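To round out step 4: once a tool like Airbyte has landed raw tables in the warehouse (assuming you point its destination at the source_data schema), the "T" in ELT is just more SQL in your analytics schema. All table and column names below are hypothetical - a sketch of joining loaded CRM data with your modeled application data.

```sql
-- Step 4 (sketch): transform raw loaded data alongside application data
CREATE SCHEMA IF NOT EXISTS source_data;  -- raw tables from the EL tool land here

CREATE VIEW analytics.accounts_with_crm AS
SELECT
    a.account_id,
    a.signup_date,
    c.owner_email     AS crm_owner,
    c.lifecycle_stage
FROM analytics.accounts a
LEFT JOIN source_data.crm_companies c
       ON c.domain = a.company_domain;
```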

Carlos Aguilar
May 18th, 2022

Your dashboard is probably broken

Your dashboard is probably broken right now. How could I know that? “It’s not broken!” Go check it. It’s broken.

Interactive dashboards are complicated by nature. They have an essentially unlimited number of possible states, and they usually sit on top of many disparate systems that are constantly changing. Eventually, something is going to become misaligned. The world devolves into chaos if left unchecked. The beautiful thing about well-conceived Business Intelligence and dashboarding tools is that they allow for fast iteration and insights - a single time. The challenge is that optimizing for free-wheeling experimentation can be counter to the goal of keeping something maintained, stable, and high quality.

When ad hoc suddenly becomes production

Often, a data visualization is built out to answer a point-in-time question for the business. But every so often, you hit gold. Stakeholders get so much value from a dashboard that they start visiting it every day. It gets added to checklists and integrated into their workflows. Suddenly, a dashboard that was initially thrown together in a few hours has become mission critical.

Eventually, something about that dashboard is going to need to change. Maybe an upstream data column is being renamed, or users want to add an additional filter control. And this is where things start to go wrong. Somebody spends time clicking around in a WYSIWYG editor until things look right - except there are a few edge cases that aren’t handled correctly, which isn’t discovered until after a user makes a bad decision with incorrect data. Or the dashboard has become so critical that nobody is confident enough to make a change without breaking it, so nothing gets changed at all. The dashboard becomes stale, and eventually people just stop using it. This vicious cycle leads to a breakdown of trust. Every broken dashboard hurts your team’s credibility and causes the organization to drift away from data-driven decisions.

DataOps: Bringing change management to the BI layer

Allowing users to explore and experiment with your data is crucial for building a data-driven culture. But when a report has become a production system - a real product with real users that depend on it - it’s important to start treating it more like an application. We can look at other production systems for inspiration. When writing code, software engineers leverage a host of best practices to manage change and keep things from breaking. In the past few years this discipline has extended to the data pipeline layer as well, through tools like Airflow, dbt, and Terraform. It’s long overdue that we start applying new "DataOps" best practices to BI and create a true Data Application layer. Here are some ways you can apply DataOps principles to your visualizations:

Version control

Important visualizations and dashboards ought to be version controlled, just like the rest of the software stack. Changes should be tracked over time and attributable to individual authors, so developers and users feel free to make changes with confidence. Rolling back a broken change should be just as easy as (or easier than!) making the change in the first place. Version history also helps identify who the experts are for a given dashboard and gives a signal on whether the resource is up to date.
Code review

Code review for business intelligence is important for the same reasons it is important for application code:

Ensure that a change actually accomplishes its intent
Get feedback on implementation, structure, and usability
Use a second pair of eyes to catch bugs and unforeseen issues
Spread knowledge so that the context of a change is not limited to a single person

Unit tests

There should be a fast and easy way to validate that a change to your data will not break your visualization layer, and vice versa. Major issues should be surfaced to the author of a change before it gets in front of other users. Unit tests also document the expectations of a system and clarify what is expected (or not expected) to stay the same over time.

Continuous integration

Dashboards don’t live in isolation. They sit on top of a sometimes large and complex set of systems that populate data into the data warehouse. When any part of this system changes, your BI tool should validate that all your tests pass and that your dashboards are valid. Catching integration issues quickly makes it easy to identify the root cause before other changes are committed on top of it.

Deployment environments

You would never deploy a risky change to a user-facing application without trying it out first in a staging environment. And yet, this happens all the time with many business intelligence tools. Changes to visualizations and dashboards should flow through a standard deployment lifecycle, and then be shipped to production using a process that is fully under your control.

At Hashboard we see production-quality Data Applications as a missing and critical layer of the modern data stack. We have customers using Hashboard to build trust in their production dashboards, and we are starting to see how it changes how teams operate. If you want to play around with DataOps or have other ideas about how we should be thinking about code-based analytics, get in touch! Read the Glean DataOps documentation
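As one concrete illustration of the "unit tests" idea above (not Hashboard's specific mechanism - just a common pattern, similar to dbt-style data tests): an assertion query that returns zero rows when the expectation holds, so a CI job can fail a change whenever any rows come back. Table and column names here are hypothetical.

```sql
-- Assertion (sketch): every order feeding the revenue dashboard has a
-- non-null total and appears exactly once. Any returned row is a failure.
SELECT order_id, count(*) AS occurrences
FROM analytics.orders
GROUP BY order_id
HAVING count(*) > 1

UNION ALL

SELECT order_id, 0
FROM analytics.orders
WHERE total_cents IS NULL;
```

A scheduler or CI step can run assertions like this after every change and block deployment when they return rows.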

Dan Eisenberg
April 13th, 2022

From the archives: Introducing Glean

TLDR: Glean is building a new way to make data exploration and visualization accessible to everyone. Half of our founding team spent five years building data products at Flatiron Health as early employees. We’ve raised $7M from some great investors to try to reinvent this old and crowded space.

Data visualization and exploration are broken

It’s never been easier to collect data and drop it in a database so you can analyze it. If you’re an early team, analytics-ready databases like Snowflake and Google’s BigQuery allow you to quickly collect data about your entire business. But what are you going to do with it? It can still feel surprisingly hard to start visualizing your data and get to real insights. We founded Glean to fix that. It’s easy enough to write a single SQL query, but if you want to build a culture around data, you’re going to need to teach your team to be curious about data and give them the tools to answer their own questions.

Start exploring early

My team of 25 data scientists, data analysts, and data engineers at Flatiron Health spent the majority of our time building data products - organizing millions of cancer patient records so that our internal and external customers could make a positive impact on cancer care. We built data tooling to support patient programs like financial assistance, clinical trials, revenue cycle management, and countless other initiatives as part of Flatiron’s cancer center and cancer research products.

It’s hard to get people into data. It was hard at Flatiron. It took training, good alignment, and data tools to empower people to dive in. From our early days we were committed to getting the whole team into the data. Building a data-driven culture is possible - just start looking at your data earlier. Begin with simple counts and measures and develop an intuition for what more you need to collect. Then take the next step: set up a tool that enables your entire organization to explore your data.

Introducing Glean

Glean is the easy way to start visualizing and exploring data with your team. Data visualization and data presentation are hard, and we want to get you going in the right direction in minutes - not hours, weeks, or months. We think you’ll be surprised by how much you can find out about your data and your business in the first ten minutes of using Glean. We built Glean with a few principles in mind, formed from our own experience with good - and very bad - data visualization over the course of our careers.

Principles:

Ten minutes to insights: Our design goal is that you can start getting insights from your data within ten minutes of logging in for the first time. Glean always has a visual entry point into data and never presents a blank canvas or a data grid with a steep learning curve. There’s no run query button - you can dive right in and interact with your data.

Exploration for all teams: We want to teach everyone who’s interested how to be an analyst. Setting up Glean requires some basic SQL knowledge, but exploration after that can be code-free. Our objective is to increase access and have ways for everyone to be able to explore data in an error-minimizing way.

Built for engineers: Technical teams are our top-level customers. Glean ships with a git integration, CLI, native build tool, code review, and tests built right in.

Fights the entropy: BI and reporting tools almost always end with inconsistent answers and too many dashboards that go unchecked.
We aim to battle dashboard cruft by making it obvious for every user what they should be looking at, through standard, consistent metrics.

During our closed beta we’ve found that Glean is especially useful for early teams that are trying to standardize and track metrics. We’d love your feedback on how we’re thinking about the problem - please give us a shout or sign up for our list to hear about our progress.

Who we are and fundraising

We’re excited to share Glean with the world and announce our $7M seed round led by Ilya Sukhar at Matrix Partners. Ilya brings great expertise in the data space - he was the first institutional investor in Fivetran and sits on their board. We also have participation from great product angels like Elad Gil, Dylan Field (Figma), Shana Fisher, Scott Belsky (Behance), and Cristina Cordova, and data angels like DJ Patil (former US Chief Data Scientist) and Anthony Goldbloom (Kaggle).

Half of our founding team worked at Flatiron Health, where we built data systems and data products to empower cancer centers and cancer researchers. Every member of our team has worked in healthcare, EdTech, or FinTech, so we feel grounded in real-world problems and are excited about how data can make an impact in society more broadly. We have seen firsthand how data can materially improve people’s lives and are excited to share data curiosity with the world.

Get started

We’d love to get you into the product - please request access or get in touch with us if you want to learn more. We’re also hiring; check out our open roles!

Carlos Aguilar
March 30th, 2022