Development is ongoing, so no optimization tip is reliable for the long term. The following list provides some interesting tricks that you might want to play with:
Use subquery clauses that affect the number or order of the rows in the subquery. For example:
SELECT * FROM t1 WHERE t1.column1 IN (SELECT column1 FROM t2 ORDER BY column1); SELECT * FROM t1 WHERE t1.column1 IN (SELECT DISTINCT column1 FROM t2); SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 LIMIT 1);
Replace a join with a subquery. For example, try this:
SELECT DISTINCT column1 FROM t1 WHERE t1.column1 IN ( SELECT column1 FROM t2);
Instead of this:
SELECT DISTINCT t1.column1 FROM t1, t2 WHERE t1.column1 = t2.column1;
Some subqueries can be transformed to joins for compatibility with older versions of MySQL that do not support subqueries. However, in some cases, converting a subquery to a join may improve performance. See Section 12.2.9.11, “Rewriting Subqueries as Joins”.
Move clauses from outside to inside the subquery. For example, use this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1 UNION ALL SELECT s1 FROM t2);
Instead of this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1) OR s1 IN (SELECT s1 FROM t2);
For another example, use this query:
SELECT (SELECT column1 + 5 FROM t1) FROM t2;
Instead of this query:
SELECT (SELECT column1 FROM t1) + 5 FROM t2;
Use a row subquery instead of a correlated subquery. For example, use this query:
SELECT * FROM t1 WHERE (column1,column2) IN (SELECT column1,column2 FROM t2);
Instead of this query:
SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE t2.column1=t1.column1 AND t2.column2=t1.column2);
            Use NOT (a = ANY (...)) rather than
            a <> ALL (...).
          
            Use x = ANY ( rather than table containing
            (1,2))x=1 OR
            x=2.
          
            Use = ANY rather than
            EXISTS.
          
            For uncorrelated subqueries that always return one row,
            IN is always slower than
            =. For example, use this query:
          
SELECT * FROM t1 WHERE t1.col_name= (SELECT a FROM t2 WHERE b =some_const);
Instead of this query:
SELECT * FROM t1 WHERE t1.col_nameIN (SELECT a FROM t2 WHERE b =some_const);
        These tricks might cause programs to go faster or slower. Using
        MySQL facilities like the
        BENCHMARK() function, you can get
        an idea about what helps in your own situation. See
        Section 11.11.3, “Information Functions”.
      
Some optimizations that MySQL itself makes are:
            MySQL executes uncorrelated subqueries only once. Use
            EXPLAIN to make sure that a
            given subquery really is uncorrelated.
          
            MySQL rewrites IN,
            ALL, ANY, and
            SOME subqueries in an attempt to take
            advantage of the possibility that the select-list columns in
            the subquery are indexed.
          
            MySQL replaces subqueries of the following form with an
            index-lookup function, which
            EXPLAIN describes as a
            special join type
            (unique_subquery or
            index_subquery):
          
... IN (SELECTindexed_columnFROMsingle_table...)
            MySQL enhances expressions of the following form with an
            expression involving MIN() or
            MAX(), unless
            NULL values or empty sets are involved:
          
value{ALL|ANY|SOME} {> | < | >= | <=} (uncorrelated subquery)
            For example, this WHERE clause:
          
WHERE 5 > ALL (SELECT x FROM t)
might be treated by the optimizer like this:
WHERE 5 > (SELECT MAX(x) FROM t)
See also the MySQL Internals Manual chapter How MySQL Transforms Subqueries.


User Comments
Some tests I did show that using the join syntax (for a right join to exclude certain rows from the results set based on the values found in another table by using the WHERE KEY IS NULL tric) is slightly (but consistently) faster than using the sub query syntax (with a WHERE NOT IN).
Here are the two queries...
-- Faster
SELECT COUNT(*) FROM (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
) AS X;
-- Slower
SELECT COUNT(*) FROM (
SELECT
region.PDB,
region.CHAIN
FROM
region
WHERE
SUNID NOT IN (
SELECT
SUNID
FROM
split_domain
)
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
) AS X;
Just in case you got confused, the results of both queries are identical, but the former is significantly faster.
I found using a 'virtual table' instead of a ROW sub query is *much* faster on my table. It seems that the row subquery isn't optimized, where the join over the 'virtual table' is optimized.
3 rows in set (0.04 sec)Below are the queryies and the 'EXPLAIN' returned for educational purposes.
-- Query using ROW subquery
EXPLAIN
SELECT
*
FROM
region
WHERE
ROW (PDB,CHAIN) IN (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
)
LIMIT
10
;
I can't get any results out of the above (takes too long) - Perhaps the limit clause isn't kicking in?
-- Query using joined virtual table
EXPLAIN
SELECT
*
FROM
region
INNER JOIN (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
) AS x
ON
region.PDB = x.PDB
AND
region.CHAIN = x.CHAIN
LIMIT
10
;
4 rows in set (1.02 sec)
The above returns ...
10 results in about 1 second
100 results in about 1 second
1000 results in about 1.5 seconds
Full set (20437) in about 2 seconds
The former query dosn't return in 5 mins (even with limit 10).
I hope this is useful to anyone designing (or trying to optimize) complex subqueries, and that the precice details of the data are not necessary to convey the results presented here.
As an example of factoring out correllated subqueries:
In at least 4.1, the following query is very slow:
-- retreive rows with duplicate usernames
SELECT id, username, companyname
FROM users
WHERE id IN
(SELECT id FROM users GROUP BY username HAVING COUNT(username) > 2)
This takes about 4'40" on a 2.8GHz machine with well-indexed tables (index on username)
This rewrite is very odd looking (to my eyes), but much faster:
SELECT id, username, companyname
FROM users AS u,
(SELECT id
FROM users
GROUP BY username
HAVING COUNT(username) > 2
) AS dups
WHERE u.id = dups.id
(executes in a few ms)
I thought maybe rewriting the first query to use two different implied table names would help, but no dice:
SELECT u.id, u.username, u.companyname
FROM users AS u
WHERE u.id IN
(SELECT d.id FROM users as d GROUP BY d.username HAVING COUNT(d.username) > 2)
Add your own comment.