August 24th, 2024

Pipe Syntax in SQL

GoogleSQL enhances SQL by introducing a piped data flow syntax, improving usability and learning while maintaining compatibility with existing systems, allowing for gradual adoption of new features.

Read original articleLink Icon
Pipe Syntax in SQL

SQL, while widely used as the standard language for data manipulation, has inherent design flaws that complicate its learning, usage, and extensibility. Despite numerous attempts to create alternative languages, none have successfully replaced SQL due to the challenges of adoption. In response, GoogleSQL has introduced a novel solution by extending SQL rather than replacing it. This extension incorporates a piped data flow syntax, which aligns with successful patterns found in other modern data languages. The implementation of this syntax transforms SQL into a more flexible and user-friendly language, facilitating easier learning and usage while maintaining compatibility with the existing SQL ecosystem. This approach allows users to gradually adopt new features without the need for extensive migrations or the steep learning curve associated with entirely new languages, ultimately enhancing productivity in data management tasks.

- SQL has significant design problems that hinder its usability and extensibility.

- Alternative languages have failed to displace SQL due to adoption challenges.

- GoogleSQL extends SQL by adding piped data flow syntax.

- The new syntax makes SQL easier to learn and use while preserving its existing ecosystem.

- Incremental adoption of new features is possible without major migrations or learning new languages.

Link Icon 43 comments
By @tmoertel - 8 months
I'm glad to see pipe syntax moving forward! In my time at Google, I wrote a lot of SQL queries that required a long sequence of intermixed filtering, grouping, aggregation, and analytic window operations. But the ordering of those operations rarely matched SQL's inside-out syntax, so I commonly had to chop up my logic and glue it into multiple chained CTEs. This busywork made my code's logic harder to follow and was frustrating, especially since I was familiar with alternative coding models, such as R's dplyr, that would have allowed me to express my logic as a simple linear pipeline.

I hope SQL pipes become commonplace so that I can use them anywhere I have to write SQL queries.

By @Ciantic - 8 months
Here is one example from the PDF:

    FROM r JOIN s USING (id)
    |> WHERE r.c < 15
    |> AGGREGATE sum(r.e) AS s GROUP BY r.d
    |> WHERE s > 3
    |> ORDER BY d
    |> SELECT d, s, rank() OVER (order by d)
Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable.
By @simonw - 8 months
To see how well it works, I uploaded the PDF to Google AI Studio and ran the prompt "Convert this document to neatly styled semantic HTML" against the gemini-1.5-pro-exp-0801 model - the result is actually pretty good!

https://static.simonwillison.net/static/2024/Pipe-Syntax-In-...

By @ku1ik - 8 months
I would welcome this! I’ve been using SQL for over 20 years and I still sometimes have trouble expressing certain queries, which in my head are well understood. I typically get there but I wish the amount of trial and error was shorter, and the process of turning what I have in my head inside out disappeared.
By @simonw - 8 months
Google's open source ZetaSQL project added documentation covering pipe syntax last week, which I'm finding easier to follow than this paper:

https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-s...

By @darksaints - 8 months
I've got plenty of complaints about SQL, but as one of the most useful programming languages to have ever been invented, I have to say that syntax complaints are one of lowest items I would have on my list of things to be prioritized for a change. Sure, the syntax could be better, but why do we care so much about it over the dozens of other problems with it?

How about we get a SQL successor with algebraic data types, true boolean logic (as opposed to SQL's ternary logic), or functional composition? Null values are the bane of any query writer's existence, and we should have a reasonable solution by now...we've already done it with other programming languages.

By @snidane - 8 months
This style is familiar to those writing dataframe logic in df libraries with sql semantics - spark, polars or duckdb relational (https://duckdb.org/docs/api/python/relational_api.html).

It definitely makes things easier to follow, but only for linear, ie. single table, transformations. The moment joins of multiple tables come into the picture things become hairy quick and then you actually start to appreciate the plain old sql which accounts for exactly this and allows you to specify column aliases in the entire cte clause. With this piping you lose scope of the table aliases and then you have to use weird hacks like mangling names of the joined in table in polars.

For single table processing the pipes are nice though. Especially eliminating the need for multiple different keywords for filter based on the order of execution (where, having, qualify (and pre-join filter which is missing)).

A missed opportunity here is the redundant [AGGREGATE sum(x) GROUP BY y]. Unless you need to specify rollups, [AGGREGATE y, sum(x)] is a sufficient syntax for group bys and duckdb folks got it right in the relational api.

By @solidsnack9000 - 8 months
This may be the most practical way to make things better for developers, analysts, data scientists...

There have been so many attempts to alleviate the difficulty of writing SQL -- from ORMs, to alternate syntaxes, to alternate databases -- and none have been successful. The authors identify many reasons for this. The approach the authors have taken is incremental and makes only slight demands one people already familiar with elementary SQL -- and even people familiar with advanced SQL will likely find this approach to be easier to use for advanced queries.

By @10000truths - 8 months
The pipeline syntax as presented is nicer than the status quo, but I'd prefer a syntax that models query execution as a directed graph of operations. Doing so would not only make some of the more complex SQL query constructs much more straightforward to represent:

* Joins can be modelled as a "cross-referencing" operation that consume two (or more) data streams and produce a single data stream

* CTEs can be modelled as producing multiple data streams

* Recursive CTEs can be modelled as cycles in the execution graph

By @slt2021 - 8 months
reminds me PRQL https://prql-lang.org

and SPL from Splunk

By @misiek08 - 8 months
Looks like from Elixir. If old SQL syntax will be supported then why not - but this one will introduce a lot less readable queries when multiple JOINs, subqueries and aggs are involved - it is very easy in plain SQL and here they will probably look bad.
By @mingodad - 8 months
I've just update the EBNF for rarilroad diagram here https://github.com/google/zetasql/issues/134#issuecomment-14... with the full grammar.
By @hintymad - 8 months
Piping syntax is particularly useful when querying time series too, especially when we need to write complex queries in a one-liner to share with others.
By @kmoser - 8 months
I understand the desire to reorder clauses in a way that makes sense to the user, but they seem to discount the value in the formality of enforcing a particular order to the clauses: you're assured that if the query contains, say an ORDER BY clause, it will always appear at the end, no need to hunt for it.

Also, why the need to match semantic evaluation order when there are far more important things happening under the hood that affect the execution plan (indexes, etc.)?

> Side-effects at a distance...The same columns are cross-referenced in SELECT, GROUP BY and ORDER BY, and corresponding edits are often required in three places

Can't this be avoided by using aliases?

By @jwilber - 8 months
This is almost exactly how R’s dplyr library works, and it’s honestly a pretty ergonomic, intuitive way to write queries.
By @DaiPlusPlus - 8 months
The first-page of the paper has 13 co-authors listed - but all with the same affiliation ("Google, Inc") - so this is ultimately a single-vendor making a unilateral proposal to break with the past - which means I'm confident this proposal won't be gracing the pages of the ISO/IEC 9075 (ISO SQL) standards in my lifetime - no matter how badly we all need QoL improvements to SQL.

...okay, if I dial-back my feelings of resignation to mediocrity, then I'll admit that Google probably does have enough clout to make this go somewhere - but they'd need to add this to all their database offerings (BigQuery, Spanner, Firebase's SQL mode) and contribute patches to Postgres and MySQL/Maria - maybe after Microsoft relents a decade later to add it to MSSQL we'll maybe start to see Oracle's people refer to it vaguely as a nice-to-have they'll implement only after they start losing more blue-chip customers[1].

Also, it's giving me M (Excel PowerQuery) vibes too.

-------

[1]For context, Oracle's DB lacked a `bit`/`bool` column type for the past 40 years until last year. People had to use `char(1)` columns with CHECK constraints to store '0'/'1' - or worse: 'T'/'F' or 'Y'/'N' (see https://stackoverflow.com/a/3726846/159145 )

By @yazaldefilimone - 8 months
I built a compiler based on the ideas from this paper: https://github.com/yazaldefilimone/spipe.
By @imperfect_light - 8 months
>But SQL is an old language with significant design problems, making it difficult to learn, difficult to use, and difficult to extend.

It's difficult to learn and use? Compared to what? The "difficult to extend" might be accurate, but I really question the initial premise here.

By @r1b - 8 months
It’s strange to me that the interop story here only considers compatibility with GoogleSQL.

These days, we have tooling that can generate an IR for SQL and transpile it to any dialect (see: sqlglot).

Doesn’t coupling SQL syntax to dialect-specific SQL semantics throw a wrench in this?

By @saltcured - 8 months
I don't have time to read this closely or ponder the grammar right now. They have a section on "complex queries" that acknowledges my first concern, but doesn't seem to really address it. Namely, that SQL allows tree-like composition of queries which is more general than a linear composition.

Has anybody figured out whether they are proposing this pipeline syntax to be mixed with regular compositional forms like CTEs and subqueries? Or is it another limited, top-level syntax, similar to how some DB engines do not allow nesting of CTE syntax?

By @chuckadams - 8 months
I wonder if the pipe operator will land in SQL before we see it in JS.
By @gfody - 8 months
I seriously wonder if the people who are so adament that sql is flawed have spent as much time using at as they have trying to "fix" it. After 20 years of sequeling I have come to believe that this language is so far ahead of its time that we're only just beginning to see what a proper tooling for it looks like. Azure Data Studio w/Copilot makes starting queries with "select" the most natural thing in the world and this pipe syntax is barbaric in contrast.
By @nsonha - 8 months
What's wrong with single character pipe (|)? Not everyone uses ligatures even if they like that particular symbol. Just keep things simple please
By @mocamoca - 8 months
Question for people writing highly complex SQL queries.

Why not write simple SQL queries and use another language to do the transformations?

Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries? Doesn't working without reusable blocks / tests / logs make development harder?

Syntax is one thing, but actual performance (and safety/maintenance) is another deal?

By @sklivvz1971 - 8 months
SQL doesn't have problems. It's arguably one of the oldest and most successful languages of all times.

Some developers have problems, because they don't understand its abstraction. Periodically they try to re-implement feature to "fix" SQL. Every time it's an absolute disaster.

I realize this is how the world works, but it's just a waste of time.

By @sgarland - 8 months
I personally find the syntactical flow of SQL to be quite logical: you SELECT tuples FROM a table, JOINing others as needed, filtering WHERE necessary. It’s much like list comprehensions in Python, which I love, so maybe that's biasing me.
By @rawgabbit - 8 months
It looks great. When is Google going to create an IDE or extension that I can use?
By @sitkack - 8 months
Reminds me of Pig. Or a pathological stack of CTEs.

Doesn't look compelling for the lockin.

By @scop - 8 months
Reminds me of Elixir’s Ecto library.

https://hexdocs.pm/ecto/crud.html#pipe-based-queries

By @typedef_struct - 8 months
You can tell the authors realized this was a bad idea when they had to add the 'OVER' keyword, which isn't documented and hardly mentioned in the paper.
By @vendiddy - 8 months
Are there any gentle intros to how a simple SQL engine implemented under the hood?

I vaguely remember someone telling me that it has mathematical underpinnings.

By @ram_rar - 8 months
I appreciated the paper's effort to highlight data analysis techniques, but I was hoping for a more forward-thinking perspective.

The inclusion of pipe syntax or data processing using pipe-structured data flow syntax is not very novel at all. Splunk's SPL and similar languages have been a longstanding industry practice.

I wish the paper had provided more extensive references to the existing body of work in this domain, as it would have added depth and context to the discussion.

By @croes - 8 months
Can the first example not simply be done with a window function?
By @mrjin - 8 months
Starting with SELECT was a mistake in SQL as you need to know where before what to select from. What we really need is something like

FROM r JOIN s on r.Id = s.Id ORDER BY r.Id SELECT *

But the thing is, such changes will break pretty much all existing code, so the author added |> to distinguish, but why not use | instead? Don't make people typing one more character please.

By @brikym - 8 months
It's been done before: Kusto Query Language
By @singularity2001 - 8 months
on iPad FF when I tried to view the PDF it downloads binary gibbous instead
By @chucke1992 - 8 months
Fix it by piping
By @ccorcos - 8 months
Now we just need a healthy SQL transpiler ecosystem so we can all live in the future!
By @flusteredBias - 8 months
... so dplyr.
By @thom - 8 months
Of all the problems with SQL, this solves one I absolutely don’t care about at all. Kill SQL or don’t kill SQL, but don’t add pointless syntax sugar to make normal queries look alien.
By @cryptonector - 8 months
This strikes me as a) yet another FROM-clause-first variant of SQL, b) syntax for LinkQ.
By @simonw - 8 months
Google: you are a web company. Please learn to publish your research papers as web pages.