There was a stretch of time ~6 months where I was pretty well versed in the ancient art of SQL. It was, unsurprisingly, because I had a class that relied on our ability to use it. CGS 2545 - Database Concepts. This class taught me how to construct queries to get data that you need, what a primary key was, what a foreign key was, structured procedures etc. These skills have now atrophied, so I've decided to learn them again. There also happens to be a company that sent me a skill assessment for an internship involving SQL, so there's another big motivator.
The DISTINCT and GROUP BY keywords
There is some common wisdom about teaching being the benchmark for truly understanding a concept well, so I'm going to try my hand at explaining something that initially confused me. I was trying to solve this problem from sql-practice.com:
That bit at the end of the problem statement seemed redundant to me at first. The "which only occurs once in the list". To me that's just the definition of unique and the clarification and the bottom is even more strange. They really want you to know that unique means there's only 1 copy. At least that's what I thought, so I just used the DISTINCT keyword.
SELECT DISTINCT first_name FROM patients;
This was my first intuition.
It was wrong. Why? I looked at the hint and this is when I was re-introduced to the GROUP BY keyword.
I didn't remember anything about it, so like any self-respecting IT professional I turned to an expert, ChatGPT. We had a long and thorough conversation.
The original conversation is gone, but you get the idea. ChatGPT told me that the DISTINCT keyword is meant to return unique rows and that the GROUP BY keyword grouped records with the same values in specified columns into summary rows. I still didn't get it. I asked ChatGPT to give me an example and explain it to me like I was a 10-year-old. That's when it clicked. Here's my rendition of the example:
Let's say you have the following table
If you submit this query, SELECT DISTINCT FIRST_NAME from Students;
SQL will return something like this
But that's weird, there are two Clarks in the database... and that's when it hit me. I thought that when ChatGPT said unique (and therefore DISTINCT) that meant SQL would be returning the names that occurred only once in the database as a whole. That's what I misunderstood. Incidentally, this is also when I figured out the differences (and similarities) between DISTINCT and GROUP BY. Both kind of perform the same operation in this context as evidenced by the screenshots below, but GROUP BY is used with aggregate functions like SUM or AVG and does some sorting.
This was a conceptual failure. A failure to understand exactly what SQL is doing when you submit a query. When you query a database with SQL, you are asking SQL to construct a view of the database that might be useful for your purposes. This view might give you information directly about the database, but you should be careful and understand the keywords you're using. When you use the DISTINCT keyword, SQL is not showing you all the unique records (with respect to the specified values) in the database as a whole. It is constructing a view of the database that will not include duplicates. There could be other records with the same values in the database, but SQL will only be including one.
If you're thinking, "This was a roundabout and convoluted way to explain something super obvious.", you might be right. But I figured it out in the end, and hopefully, this helps someone else figure it out too.
Happy querying.