12 Joining Data in SQL
https://learn.datacamp.com/courses/joining-data-in-postgresql
12.1 Introduction to joins
Left join : When we want to join everything on the first data frame and the ones which match the first data frame from the second data frame.
Inner Join : When we want to join everything thats present in both the data frames.
Example Codes:
SELECT *
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;
-- 1. Select name fields (with alias) and region
SELECT cities.name AS city, countries.name AS country, region
FROM cities
INNER JOIN countries
ON cities.country_code = countries.code;
Sometimes it’s easier to write SQL code out of order: you write the SELECT statement after you’ve done the JOIN.
-- 3. Select fields with aliases
SELECT c.code AS country_code, name, year, inflation_rate
FROM countries AS c
-- 1. Join to economies (alias e)
INNER JOIN economies AS e
-- 2. Match on code
ON c.code = e.code;
We can join multiple tables using the following code:
SELECT *
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id
INNER JOIN another_table
ON left_table.id = another_table.id;
Prompt :
Inner join countries (left) and populations (right) on the code and country_code fields respectively.
Alias countries AS c and populations AS p.
Select code, name, and region from countries and also select year and fertility_rate from populations (5 fields in total).
Add an additional inner join with economies to your previous query by joining on code.
Include the unemployment_rate column that became available through joining with economies.
Note that year appears in both populations and economies, so you have to explicitly use e.year instead of year as you did before.
Code:
6. Select fields
SELECT c.code, name, region, e.year, fertility_rate, unemployment_rate
-- 1. From countries (alias as c)
FROM countries AS c
-- 2. Join to populations (as p)
INNER JOIN populations AS p
-- 3. Match on country code
ON c.code = p.country_code
-- 4. Join to economies (as e)
INNER JOIN economies AS e
-- 5. Match on country code and year
ON c.code = e.code AND e.year = p.year;
USING:
If we have the same key variable name for both the dataframes we can use the using verb instead of ON.
-- 4. Select fields
SELECT c.name AS country, continent, l.name AS language, official
-- 1. From countries (alias as c)
FROM countries AS c
-- 2. Join to languages (as l)
INNER JOIN languages AS l
-- 3. Match using code
USING(code);
Self-join
Join populations with itself ON country_code. Select the country_code from p1 and the size field from both p1 and p2. SQL won’t allow same-named fields, so alias p1.size as size2010 and p2.size as size2015.
-- 4. Select fields with aliases
SELECT p1.country_code,
p1.size AS size2010,
p2.size AS size2015
-- 1. From populations (alias as p1)
FROM populations AS p1
-- 2. Join to itself (alias as p2)
INNER JOIN populations AS p2
-- 3. Match on country code
ON p1.country_code = p2.country_code;
Extend the ON in your query to include only those records where the p1.year (2010) matches with p2.year - 5 (2015 - 5 = 2010). This will omit the three entries per country_code that you aren’t interested in.
-- 5. Select fields with aliases
SELECT p1.country_code,
p1.size AS size2010,
p2.size AS size2015
-- 1. From populations (alias as p1)
FROM populations AS p1
-- 2. Join to itself (alias as p2)
INNER JOIN populations AS p2
-- 3. Match on country code
ON p1.country_code = p2.country_code
-- 4. and year (with calculation)
AND p1.year = p2.year - 5;
With two numeric fields and , the percentage growth from to can be calculated as .
Add a new field to SELECT, aliased as growth_perc, that calculates the percentage population growth from 2010 to 2015 for each country, using p2.size and p1.size.
SELECT p1.country_code,
p1.size AS size2010,
p2.size AS size2015,
-- 1. calculate growth_perc
((p2.size - p1.size)/p1.size * 100.0) AS growth_perc
-- 2. From populations (alias as p1)
FROM populations AS p1
-- 3. Join to itself (alias as p2)
INNER JOIN populations AS p2
-- 4. Match on country code
ON p1.country_code = p2.country_code
-- 5. and year (with calculation)
AND p1.year = p2.year - 5;
Case when and then:
Often it’s useful to look at a numerical field not as raw data, but instead as being in different categories or groups.
You can use CASE with WHEN, THEN, ELSE, and END to define a new grouping field.
SELECT name, continent, code, surface_area,
-- 1. First case
CASE WHEN surface_area > 2000000 THEN 'large'
-- 2. Second case
WHEN surface_area > 350000 THEN 'medium'
-- 3. Else clause + end
ELSE 'small' END
-- 4. Alias name
AS geosize_group
-- 5. From table
FROM countries;
SELECT country_code, size,
-- 1. First case
CASE WHEN size > 50000000 THEN 'large'
-- 2. Second case
WHEN size > 1000000 THEN 'medium'
-- 3. Else clause + end
ELSE 'small' END
-- 4. Alias name (popsize_group)
AS popsize_group
-- 5. From table
FROM populations
-- 6. Focus on 2015
WHERE year = 2015;
-- 2. Select all columns of pop_plus
SELECT * FROM pop_plus;
-- 5. Select fields
SELECT name, continent, geosize_group, popsize_group
-- 1. From countries_plus (alias as c)
FROM countries_plus AS c
-- 2. Join to pop_plus (alias as p)
INNER JOIN pop_plus AS p
-- 3. Match on country code
ON c.code = p.country_code
-- 4. Order the table
ORDER BY geosize_group;
12.2 Outer joins and cross joins
Left Join
When we want to join everything on the first data frame and the ones which match the first data frame from the second data frame.
SELECT c1.name AS city, code, c2.name AS country,
region, city_proper_pop
FROM cities AS c1
-- 1. Join right table (with alias)
LEFT JOIN countries AS c2
-- 2. Match on country code
ON c1.country_code = c2.code
-- 3. Order by descending country code
ORDER BY code DESC;
/*
5. Select country name AS country, the country's local name,
the language name AS language, and
the percent of the language spoken in the country
*/
SELECT c.name AS country, local_name, l.name AS language, percent
-- 1. From left table (alias as c)
FROM countries AS c
-- 2. Join to right table (alias as l)
LEFT JOIN languages AS l
-- 3. Match on fields
ON c.code = l.code
-- 4. Order by descending country
ORDER BY country DESC;
-- Select fields
SELECT region, AVG(gdp_percapita) AS avg_gdp
-- From countries (alias as c)
FROM countries AS c
-- Left join with economies (alias as e)
LEFT JOIN economies AS e
-- Match on code fields
ON c.code = e.code
-- Focus on 2010
WHERE year = 2010
-- Group by region
GROUP BY region
-- Order by descending avg_gdp
ORDER BY avg_gdp DESC;
Right Join : When we want to join everything on the second data frame and the ones which match the first data frame from the first data frame.
-- convert this code to use RIGHT JOINs instead of LEFT JOINs
/*
SELECT cities.name AS city, urbanarea_pop, countries.name AS country,
indep_year, languages.name AS language, percent
FROM cities
LEFT JOIN countries
ON cities.country_code = countries.code
LEFT JOIN languages
ON countries.code = languages.code
ORDER BY city, language;
*/
SELECT cities.name AS city, urbanarea_pop, countries.name AS country,
indep_year, languages.name AS language, percent
FROM languages
RIGHT JOIN countries
ON languages.code = countries.code
RIGHT JOIN cities
ON countries.code = cities.country_code
ORDER BY city, language;
Full Join:
We use full join when we want everything in both the data frames.
SELECT name AS country, code, region, basic_unit
FROM countries
-- 1. Join to currencies
INNER JOIN currencies
USING (code)
-- 2. Where region is North America or null
WHERE region = 'North America' OR region IS NULL
-- 3. Order by region
ORDER BY region;
-- 7. Select fields (with aliases)
SELECT c1.name AS country, region, l.name AS language,
basic_unit, frac_unit
-- 1. From countries (alias as c1)
FROM countries AS c1
-- 2. Join with languages (alias as l)
FULL JOIN languages AS l
-- 3. Match on code
USING (code)
-- 4. Join with currencies (alias as c2)
FULL JOIN currencies AS c2
-- 5. Match on code
USING (code)
-- 6. Where region like Melanesia and Micronesia
WHERE region LIKE 'M%esia';
Cross Join:
The CROSS JOIN is used to generate a paired combination of each row of the first table with each row of the second table.
-- 4. Select fields
SELECT c.name AS city, l.name AS language
-- 1. From cities (alias as c)
FROM cities AS c
-- 2. Join to languages (alias as l)
CROSS JOIN languages AS l
-- 3. Where c.name like Hyderabad
WHERE c.name LIKE 'Hyder%';
12.3 Set theory clauses
The UNION, EXCEPT and INTERSECT operators of SQL enable you to combine more than one SELECT statement to form a single result set. The UNION operator returns all rows. The INTERSECT operator returns all rows that are in both result sets. The EXCEPT operator returns the rows that are only in the first result set but not in the second.
Union
Union represents merging two data frames which include all records in both tables without duplicates.
-- Select fields from 2010 table
SELECT *
-- From 2010 table
FROM economies2010
-- Set theory clause
UNION
-- Select fields from 2015 table
SELECT *
-- From 2015 table
FROM economies2015
-- Order by code and year
ORDER BY code, year;
-- Select field
SELECT country_code
-- From cities
FROM cities
-- Set theory clause
UNION
-- Select field
SELECT code
-- From currencies
FROM currencies
-- Order by country_code
ORDER BY country_code;
Union All
Union represents merging two data frames which include all records in both tables with duplicates.
-- Select fields
SELECT code, year
-- From economies
FROM economies
-- Set theory clause
UNION ALL
-- Select fields
SELECT country_code, year
-- From populations
FROM populations
-- Order by code, year
ORDER BY code, year;
As you think about major world cities and their corresponding country, you may ask which countries also have a city with the same name as their country name?
-- Select fields
SELECT name
-- From countries
FROM countries
-- Set theory clause
INTERSECT
-- Select fields
SELECT name
-- From cities
FROM cities;
Except
We use except to select some variables from one data frame and then merge it with another dataframe except some varibale from the second data frame.
-- Select field
SELECT name
-- From cities
FROM cities
-- Set theory clause
EXCEPT
-- Select field
SELECT capital
-- From countries
FROM countries
-- Order by result
ORDER BY name;
-- Select field
SELECT capital
-- From countries
FROM countries
-- Set theory clause
EXCEPT
-- Select field
SELECT name
-- From cities
FROM cities
-- Order by ascending capital
ORDER BY capital;
Semi Join and Anti Join
Semi Join : A semi-join is a join between two sets of data (tables) where rows from the first set are returned, based on the presence or absence of at least one matching row in the other set.
/*
SELECT code
FROM countries
WHERE region = 'Middle East';
*/
-- Select field
SELECT DISTINCT name
-- From languages
FROM languages
- Where in statement
WHERE code IN
-- Subquery
(SELECT code
FROM countries
WHERE region = 'Middle East')
-- Order by name
ORDER BY name;
Anti Join : Anti-join between two tables returns rows from the first table where no matches are found in the second table. It is opposite of a semi-join. An anti-join returns one copy of each row in the first table for which no match is found. Anti-joins are written using the NOT EXISTS or NOT IN constructs.
-- 3. Select fields
SELECT code, name
-- 4. From Countries
FROM countries
-- 5. Where continent is Oceania
WHERE continent = 'Oceania'
-- 1. And code not in
AND code NOT IN
-- 2. Subquery
(SELECT code
FROM currencies);
Summary problem :
-- Select the city name
SELECT name
-- Alias the table where city name resides
FROM cities AS c1
-- Choose only records matching the result of multiple set theory clauses
WHERE country_code IN
(
-- Select appropriate field from economies AS e
SELECT e.code
FROM economies AS e
-- Get all additional (unique) values of the field from currencies AS c2
UNION
SELECT c2.code
FROM currencies AS c2
-- Exclude those appearing in populations AS p
EXCEPT
SELECT p.country_code
FROM populations AS p
);
12.4 Subqueries
Subqueries are queries inside queries or nested queries.
Subqueries inside SELECT:
-- Select fields
SELECT *
-- From populations
FROM populations
-- Where life_expectancy is greater than
WHERE life_expectancy >
-- 1.15 * subquery
1.15 * (SELECT AVG(life_expectancy)
FROM populations
WHERE year = 2015) AND
year = 2015;
SELECT countries.name AS country, COUNT(*) AS cities_num
FROM cities
INNER JOIN countries
ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;
/*
SELECT ___ AS ___,
(SELECT ___
FROM ___
WHERE countries.code = cities.country_code) AS cities_num
FROM ___
ORDER BY ___ ___, ___
LIMIT 9;
*/
Subqueries inside FROM:
SELECT local_name, subquery.lang_num
FROM countries,
(SELECT code, COUNT(*) AS lang_num
FROM languages
GROUP BY code) AS subquery
WHERE countries.code = subquery.code
ORDER BY lang_num DESC;
-- Select fields
SELECT name, continent, inflation_rate
-- From countries
FROM countries
-- Join to economies
INNER JOIN economies
-- Match on code
ON countries.code = economies.code
-- Where year is 2015
WHERE year = 2015
-- And inflation rate in subquery (alias as subquery)
AND inflation_rate IN (
SELECT MAX(inflation_rate) AS max_inf
FROM (
SELECT name, continent, inflation_rate
FROM countries
INNER JOIN economies
ON countries.code = economies.code
WHERE year = 2015) AS subquery
-- Group by continent
GROUP BY continent);
Challenge Excercises:
Select unique country names. Also select the total investment and imports fields. Use a left join with countries on the left. (An inner join would also work, but please use a left join here.) Match on code in the two tables AND use a subquery inside of ON to choose the appropriate languages records. Order by country name ascending. *Use table aliasing but not field aliasing in this exercise.
-- Select fields
SELECT DISTINCT name, total_investment, imports
-- From table (with alias)
FROM countries AS c
-- Join with table (with alias)
LEFT JOIN economies AS e
-- Match on code
ON (c.code = e.code
-- and code in Subquery
AND c.code IN (
SELECT l.code
FROM languages AS l
WHERE official = 'true'
) )
-- Where region and year are correct
WHERE region = 'Central America' AND year = 2015
-- Order by field
ORDER BY name;
Select the city name, country code, city proper population, and metro area population. Calculate the percentage of metro area population composed of city proper population for each city in cities, aliased as city_perc. Focus only on capital cities in Europe and the Americas in a subquery. Make sure to exclude records with missing data on metro area population. Order the result by city_perc descending. Then determine the top 10 capital cities in Europe and the Americas in terms of this city_perc percentage.
-- Select fields
SELECT name, country_code, city_proper_pop, metroarea_pop,
-- Calculate city_perc
city_proper_pop / metroarea_pop * 100 AS city_perc
-- From appropriate table
FROM cities
-- Where
WHERE name IN
-- Subquery
(SELECT capital
FROM countries
WHERE (continent = 'Europe'
OR continent LIKE '%America'))
AND metroarea_pop IS NOT NULL
-- Order appropriately
ORDER BY city_perc DESC
-- Limit amount
LIMIT 10;