Which Of The Following Keywords Can Be Included In A Select Statement To Suppress Duplicate Data?

The go-to solution for removing duplicate rows from your result sets is to include the distinct keyword in your SELECT statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique. Did you know that the group by clause can also be used to remove duplicates? If not, read on to find out what the main differences are between them and which to use to produce a desired result.

The Distinct and Distinctrow Keywords

The distinct keyword comes directly after the SELECT in the query statement and replaces the optional all keyword, which is the default. Distinctrow is an alias for distinct and produces exactly the same results:

SELECT [ALL | DISTINCT | DISTINCTROW]
    select_expr
    [FROM table_references
    [WHERE where_condition]]

To illustrate how it works, let's select some information from the following table, which contains a list of fruits and their colors:

name	color
apple	red
apple	green
apple	yellow
banana	yellow
banana	green
grape	red
grape	white

The following query will retrieve all the fruit names from the table and list them in alphabetical order:

SELECT name FROM fruits ORDER BY name;

Without the color information, we have multiples of each fruit type:

name
apple
apple
apple
banana
banana
grape
grape

Now let's try the query again with the distinct keyword:

SELECT DISTINCT name FROM fruits ORDER BY name;

As expected, we now have only one instance of each fruit type:

name
apple
banana
grape
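The effect of distinct on the fruits table can be reproduced with a small script. This is a minimal sketch, assuming an in-memory SQLite database (through Python's sqlite3 module) as a stand-in for the MySQL server used in this article; the table and column names follow the fruits example above.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (name TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO fruits (name, color) VALUES (?, ?)",
    [
        ("apple", "red"), ("apple", "green"), ("apple", "yellow"),
        ("banana", "yellow"), ("banana", "green"),
        ("grape", "red"), ("grape", "white"),
    ],
)

# Without DISTINCT: one result row per table row, so names repeat.
all_names = [
    row[0] for row in conn.execute("SELECT name FROM fruits ORDER BY name")
]

# With DISTINCT: repeated names collapse into a single row each.
distinct_names = [
    row[0]
    for row in conn.execute("SELECT DISTINCT name FROM fruits ORDER BY name")
]

print(all_names)
print(distinct_names)
```

The explicit ORDER BY is there because neither query guarantees a row order on its own.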

If only it were always that easy! A quick Internet search on the phrase "sql eliminating duplicates" shows that there's more to removing duplicate values than inserting the distinct keyword into your SELECT statements.

When are Duplicate Rows Not Duplicate Rows

One problem that the distinct keyword does nothing to solve is that sometimes removing duplicates creates misleading results. Observe the following scenario:

The client wants to generate a listing of their employees to generate some statistics. Here's some SQL to do that:

SELECT name,
       gender,
       salary
FROM   employees
ORDER BY name;

Strangely, this produces duplicate rows for "Kirsten Ruegg":

name	gender	salary
Allan Smithie	m	4900
Barbara Breitenmoser	f	(NULL)
Jon Simpson	m	4500
Kirsten Ruegg	f	5600
Kirsten Ruegg	f	5600
Peter Jonson	m	5200
Ralph Teller	m	5100

The customer responds that they don't want duplicates, so the developer adds the trusty distinct keyword to the SELECT statement. This produces the desired results, except for one small detail: there are two employees with the same name! Adding the distinct keyword created incorrect results by removing a valid row. Including the unique emp_id_number in the field list confirms that there are indeed two Kirsten Rueggs:

SELECT name,
       gender,
       salary,
       emp_id_number
FROM   employees
ORDER BY name;

Here's the data in question showing the unique emp_id_numbers:

name	gender	salary	emp_id_number
Kirsten Ruegg	f	5600	3462
Kirsten Ruegg	f	5600	2223

The moral of the story is this: When using the distinct keyword, be sure that you aren't inadvertently removing valid data!
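The pitfall is easy to reproduce. The sketch below assumes an in-memory SQLite database (via Python's sqlite3 module) in place of MySQL and loads only three of the employee rows, which is enough to show distinct silently dropping one of the two Kirsten Rueggs until the unique emp_id_number is selected as well.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "name TEXT, gender TEXT, salary INTEGER, emp_id_number INTEGER)"
)
# A three-row subset of the article's employee data.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Jon Simpson", "m", 4500, 1234),
        ("Kirsten Ruegg", "f", 5600, 3462),
        ("Kirsten Ruegg", "f", 5600, 2223),
    ],
)

# DISTINCT over name, gender and salary merges the two different employees.
without_id = conn.execute(
    "SELECT DISTINCT name, gender, salary FROM employees ORDER BY name"
).fetchall()

# Adding the unique identifier keeps both people in the result.
with_id = conn.execute(
    "SELECT DISTINCT name, gender, salary, emp_id_number "
    "FROM employees ORDER BY name, emp_id_number"
).fetchall()

print(len(without_id), "rows without the id")
print(len(with_id), "rows with the id")
```

Selecting a column that is unique per row makes every row distinct, so distinct then has nothing to remove.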

Comparing Distinct to Group By

Using distinct is logically equivalent to using group by on all selected columns with no aggregate function. For such a query, group by simply produces a list of distinct grouping values. When displaying and grouping by a single column, the query produces the distinct values in that column. However, if you display and group by multiple columns, the query produces the distinct combinations of values in those columns. For example, the following query produces the same set of rows as our first SELECT distinct did:

SELECT name
FROM   fruits
GROUP BY name;

Similarly, the following statement, which groups by every selected column, produces the same results as our SELECT distinct did on the employees table:

SELECT name,
       gender,
       salary
FROM   employees
GROUP BY name,
         gender,
         salary;

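A difference between distinct and group by is that, in MySQL versions before 8.0, group by also sorted the rows implicitly. MySQL 8.0 removed that implicit sorting, so add an explicit ORDER BY whenever the order matters. On those older versions:

SELECT name,
       gender,
       salary
FROM   employees
GROUP BY name,
         gender,
         salary;

…is the same as:

SELECT DISTINCT name,
       gender,
       salary
FROM   employees
ORDER BY name;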
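The equivalence between distinct and a group by over every selected column can be checked directly. This is a sketch under the assumption that an in-memory SQLite database (via Python's sqlite3 module) behaves like MySQL for these two queries; both carry an explicit ORDER BY so the comparison does not depend on any implicit sorting.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, gender TEXT, salary INTEGER)")
# name, gender and salary columns of the article's full employee table.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Jon Simpson", "m", 4500),
        ("Barbara Breitenmoser", "f", None),
        ("Kirsten Ruegg", "f", 5600),
        ("Ralph Teller", "m", 5100),
        ("Peter Jonson", "m", 5200),
        ("Allan Smithie", "m", 4900),
        ("Kirsten Ruegg", "f", 5600),
        ("Kirsten Ruegg", "f", 4400),
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT name, gender, salary "
    "FROM employees ORDER BY name, salary"
).fetchall()

# Grouping by every selected column, with no aggregate function.
grouped_rows = conn.execute(
    "SELECT name, gender, salary FROM employees "
    "GROUP BY name, gender, salary ORDER BY name, salary"
).fetchall()

print(distinct_rows == grouped_rows)
print(len(distinct_rows), "distinct rows out of 8")
```

Grouping by name alone would not be equivalent: it would fold the Kirsten Ruegg earning 4400 into the same group as the ones earning 5600.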

Counting Duplicates

Distinct can be used with the COUNT() function to count how many distinct values a column contains. COUNT(distinct expression) counts the number of distinct (unique) non-NULL values of the given expression. The expression can be a column name to count the number of distinct non-NULL values in the column.

Here's the full employee table data:

id	dept_id	gender	name	salary	emp_id_number
1	2	m	Jon Simpson	4500	1234
2	4	f	Barbara Breitenmoser	(NULL)	9999
3	3	f	Kirsten Ruegg	5600	3462
4	1	m	Ralph Teller	5100	6543
5	2	m	Peter Jonson	5200	9747
6	2	m	Allan Smithie	4900	6853
7	4	f	Kirsten Ruegg	5600	2223
8	3	f	Kirsten Ruegg	4400	2765

Applying the COUNT(DISTINCT) function to the name field produces 6 unique names:

SELECT COUNT(DISTINCT name) FROM employees;

It's also possible to give a list of expressions separated by commas. In this case, COUNT() returns the number of distinct combinations of values that contain no NULL values. The following query counts the number of distinct rows for which neither the name nor salary is NULL:

SELECT COUNT(DISTINCT name, salary) FROM employees;

COUNT(DISTINCT name, salary)
6
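The multi-expression form of COUNT(DISTINCT ...) is a MySQL feature; SQLite, for one, accepts only a single expression there. The sketch below, which assumes an in-memory SQLite database via Python's sqlite3 module, counts the distinct names directly and then counts the distinct name and salary pairs by way of a derived table, a formulation that should carry over to most engines.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
# name and salary columns of the article's full employee table.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Jon Simpson", 4500),
        ("Barbara Breitenmoser", None),
        ("Kirsten Ruegg", 5600),
        ("Ralph Teller", 5100),
        ("Peter Jonson", 5200),
        ("Allan Smithie", 4900),
        ("Kirsten Ruegg", 5600),
        ("Kirsten Ruegg", 4400),
    ],
)

# Single-column form: the number of distinct non-NULL names.
name_count = conn.execute(
    "SELECT COUNT(DISTINCT name) FROM employees"
).fetchone()[0]

# Multi-column form rewritten as a derived table: distinct (name, salary)
# pairs in which neither value is NULL.
pair_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT DISTINCT name, salary FROM employees"
    "  WHERE name IS NOT NULL AND salary IS NOT NULL"
    ") AS pairs"
).fetchone()[0]

print(name_count, pair_count)
```

Both counts come to 6 here for different reasons: the name count collapses the three Kirsten Rueggs into one, while the pair count excludes the row with a NULL salary and collapses only the two identical Kirsten Ruegg rows.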

You can also count duplicates per group using a bit of math in conjunction with the group by clause. Here's a query to count duplicated names for each department:

SELECT dept_id,
       COUNT(*) - COUNT(DISTINCT name) AS 'duplicate names'
FROM   employees
GROUP BY dept_id;

dept_id	duplicate names
1	0
2	0
3	1
4	0

These queries help you characterize the extent of duplicates, but don't show you which values are duplicated. To see which names are duplicated in the employees table, use a summary query that displays the non-unique values along with the counts:

SELECT dept_id,
       name,
       COUNT(name) AS name_count
FROM   employees
GROUP BY name,
         dept_id;

dept_id	name	name_count
2	Allan Smithie	1
4	Barbara Breitenmoser	1
2	Jon Simpson	1
3	Kirsten Ruegg	2
4	Kirsten Ruegg	1
2	Peter Jonson	1
1	Ralph Teller	1

Since we're only interested in duplicates, we can filter out everything else using the HAVING clause. It's like a WHERE clause, except that it's used with group by to narrow down the results:

SELECT dept_id,
       name,
       COUNT(name) AS name_count
FROM   employees
GROUP BY name,
         dept_id
HAVING name_count > 1;

Now we can see which names are duplicated, as well as how many there are:

dept_id	name	name_count
3	Kirsten Ruegg	2
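Both duplicate-counting queries can be run against the sample data as follows. This is a sketch that assumes an in-memory SQLite database (via Python's sqlite3 module) in place of MySQL; the HAVING clause repeats the COUNT(name) expression rather than the name_count alias, since not every engine accepts a column alias there.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept_id INTEGER, name TEXT)")
# dept_id and name columns of the article's full employee table.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        (2, "Jon Simpson"), (4, "Barbara Breitenmoser"), (3, "Kirsten Ruegg"),
        (1, "Ralph Teller"), (2, "Peter Jonson"), (2, "Allan Smithie"),
        (4, "Kirsten Ruegg"), (3, "Kirsten Ruegg"),
    ],
)

# Rows per department minus distinct names per department gives the
# number of surplus (duplicated) names in each department.
per_dept = conn.execute(
    "SELECT dept_id, COUNT(*) - COUNT(DISTINCT name) AS duplicate_names "
    "FROM employees GROUP BY dept_id ORDER BY dept_id"
).fetchall()

# HAVING filters the groups after aggregation, keeping only names that
# occur more than once within a department.
duplicated = conn.execute(
    "SELECT dept_id, name, COUNT(name) AS name_count "
    "FROM employees GROUP BY name, dept_id "
    "HAVING COUNT(name) > 1"
).fetchall()

print(per_dept)
print(duplicated)
```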

Displaying Per-Group Minimum or Maximum Values in Duplicated Rows

As we saw in the last example, the group by clause causes aggregate functions to be applied for each unique value in the field list. You should be aware that columns that are not in the group by field list do not necessarily belong to the same row as the aggregated values! An example is definitely in order here. The following query displays the highest salary for each department:

SELECT dept_id,
       name,
       gender,
       MAX(salary) AS max_salary
FROM   employees
GROUP BY dept_id;

The intention is to also display information about the individual who earns the highest salary. However, that is not what is returned here:

dept_id	name	gender	max_salary
1	Ralph Teller	m	5100
2	Jon Simpson	m	5200
3	Kirsten Ruegg	f	5600
4	Barbara Breitenmoser	f	5600

The problem is that the salary is the only aggregated field because the MAX() aggregate function is applied to it. Consequently, the name and gender values displayed come from an arbitrary row in each group, typically the first one encountered. Looking at the table, you'll see that, while Ralph Teller is the only member of department 1, Jon Simpson only earned $4500. Peter Jonson is really the owner of that distinction, but the query engine selected the first name and gender that it came across having a dept_id of 2. Note that MySQL accepts this query only when the ONLY_FULL_GROUP_BY SQL mode is disabled; with that mode enabled, which is the default since MySQL 5.7.5, the query is rejected with an error.

The solution is to join the GROUP BY results with the original table using the grouped field and the aggregated value. In this example, that means matching on both the dept_id and the salary, so that a salary from one department cannot match an employee in another:

SELECT emp2.dept_id,
       emp1.name,
       emp1.gender,
       emp2.max_salary
FROM (
   SELECT dept_id,
          MAX(salary) AS max_salary
   FROM   employees
   GROUP BY dept_id
) AS emp2
JOIN employees AS emp1
  ON  emp1.dept_id = emp2.dept_id
  AND emp1.salary  = emp2.max_salary
ORDER BY emp2.dept_id;

Now the name and gender fields belong to the earner of the greatest salary:

dept_id	name	gender	max_salary
1	Ralph Teller	m	5100
2	Peter Jonson	m	5200
3	Kirsten Ruegg	f	5600
4	Kirsten Ruegg	f	5600
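The per-department maximum can be paired with its earner as sketched below. The example assumes an in-memory SQLite database (via Python's sqlite3 module) in place of MySQL, and it joins the grouped results back to the table on both dept_id and salary, so a salary from one department cannot match an employee in another.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "dept_id INTEGER, gender TEXT, name TEXT, salary INTEGER)"
)
# Relevant columns of the article's full employee table.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (2, "m", "Jon Simpson", 4500),
        (4, "f", "Barbara Breitenmoser", None),
        (3, "f", "Kirsten Ruegg", 5600),
        (1, "m", "Ralph Teller", 5100),
        (2, "m", "Peter Jonson", 5200),
        (2, "m", "Allan Smithie", 4900),
        (4, "f", "Kirsten Ruegg", 5600),
        (3, "f", "Kirsten Ruegg", 4400),
    ],
)

# The derived table holds one maximum per department; joining it back to
# employees picks up the row(s) that actually carry that salary.
top_earners = conn.execute(
    "SELECT emp2.dept_id, emp1.name, emp1.gender, emp2.max_salary "
    "FROM ("
    "  SELECT dept_id, MAX(salary) AS max_salary"
    "  FROM employees GROUP BY dept_id"
    ") AS emp2 "
    "JOIN employees AS emp1 "
    "  ON emp1.dept_id = emp2.dept_id AND emp1.salary = emp2.max_salary "
    "ORDER BY emp2.dept_id"
).fetchall()

for row in top_earners:
    print(row)
```

If two employees in the same department tie for the highest salary, this form returns both of them, which is usually preferable to silently picking one.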

There are other techniques that were not covered, such as the use of temporary tables and dynamic SQL. Here is more in-depth information on removing duplicate records. This article discusses the group by and HAVING clauses in more detail.

Robert Gravelle

Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.