11 Introduction to SQL
https://learn.datacamp.com/courses/introduction-to-sql
SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.
You can think of a relational database as a collection of tables. A table is just a set of rows and columns, like a spreadsheet, which represents exactly one type of entity. For example, a table might represent employees in a company or purchases made, but not both.
Each row, or record, of a table contains information about a single entity. For example, in a table representing employees, each row represents a single person. Each column, or field, of a table contains a single attribute for all rows in the table. For example, in a table representing employees, we might have a column containing first and last names for all employees.
11.1 Selecting columns
A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.
In SQL, you can select data from a table using a SELECT statement. For example, the following query selects the name column from the people table:
SELECT name
FROM people;
In this query, SELECT and FROM are called keywords. In SQL, keywords are not case-sensitive, which means you can write the same query as:
select name
from people;
That said, it's good practice to make SQL keywords uppercase to distinguish them from other parts of your query, like column and table names.
It’s also good practice (but not necessary for the exercises in this course) to include a semicolon at the end of your query. This tells SQL where the end of your query is!
To select multiple columns :
To select multiple columns from a table, simply separate the column names with commas!
For example, this query selects two columns, name and birthdate, from the people table:
SELECT name, birthdate
FROM people;
Sometimes, you may want to select all columns from a table. Typing out every column name would be a pain, so there's a handy shortcut:
SELECT *
FROM people;
If you only want to return a certain number of results, you can use the LIMIT keyword to limit the number of rows returned:
SELECT *
FROM people
LIMIT 10;
SELECT DISTINCT:
Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword.
This might be useful if, for example, you’re interested in knowing which languages are represented in the films table:
SELECT DISTINCT language
FROM films;
What if you want to count the number of employees in your employees table? The COUNT statement lets you do this by returning the number of rows in one or more columns.
For example, this code gives the number of rows in the people table:
SELECT COUNT(*)
FROM people;
It’s also common to combine COUNT with DISTINCT to count the number of distinct values in a column.
For example, this query counts the number of distinct birth dates contained in the people table:
SELECT COUNT(DISTINCT birthdate)
FROM people;
11.2 Filtering rows
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use.
For example, you can filter text records such as title. The following code returns all films with the title ‘Metropolis’:
SELECT title
FROM films
WHERE title = 'Metropolis';
SELECT(title, release_year)
FROM films
WHERE release_year > 2000;
WHERE AND
Often, you’ll want to select data based on multiple conditions. You can build up your WHERE queries by combining multiple conditions with the AND keyword.
For example,
SELECT title
FROM films
WHERE release_year > 1994
AND release_year < 2000;
This gives you the titles of films released between 1994 and 2000.
More Ex:
SELECT *
FROM films
WHERE release_year > 2000
AND release_year < 2010
AND language = 'Spanish';
WHERE AND OR What if you want to select rows based on multiple conditions where some but not all of the conditions need to be met? For this, SQL has the OR operator.
For example, the following returns all films released in either 1994 or 2000:
SELECT title
FROM films
WHERE release_year = 1994
OR release_year = 2000;
## Aggregate Functions
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
AND (certification = 'PG' OR certification = 'R');
BETWEEN
In SQL the BETWEEN keyword provides a useful shorthand for filtering values within a specified range.
EXAMPLE:
SELECT title
FROM films
WHERE release_year
BETWEEN 1994 AND 2000;
It’s important to remember that BETWEEN is inclusive, meaning the beginning and end values are included in the results!
WHERE IN As you’ve seen, WHERE is very useful for filtering results. However, if you want to filter based on many conditions, WHERE can get unwieldy. For example:
SELECT name
FROM kids
WHERE age = 2
OR age = 4
OR age = 6
OR age = 8
OR age = 10;
Enter the IN operator! The IN operator allows you to specify multiple values in a WHERE clause, making it easier and quicker to specify multiple OR conditions! Neat, right?
So, the above example would become simply:
SELECT name
FROM kids
WHERE age IN (2, 4, 6, 8, 10);
In SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL. For example, to count the number of missing birth dates in the people table:
SELECT COUNT(*)
FROM people
WHERE birthdate IS NULL;
SELECT name
FROM people
WHERE birthdate IS NOT NULL;
LIKE and NOT LIKE
In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:
The % wildcard will match zero, one, or many characters in text. For example, the following query matches companies like ‘Data’, ‘DataC’ ‘DataCamp’, ‘DataMind’, and so on:
SELECT name
FROM companies
WHERE name LIKE 'Data%';
The _ wildcard will match a single character. For example, the following query matches companies like ‘DataCamp’, ‘DataComp’, and so on:
SELECT name
FROM companies
WHERE name LIKE 'DataC_mp';
You can also use the NOT LIKE operator to find records that don’t match the pattern you specify.
11.3 Aggregate Functions
SQL provides a few functions, called aggregate functions, to help to perform some calculation on the data in a database.
SELECT AVG(budget)
FROM films;
SELECT MAX(budget)
FROM films;
SELECT SUM(budget)
FROM films;
Combining aggregate functions with WHERE
Aggregate functions can be combined with the WHERE clause to gain further insights from your data.
For example, to get the total budget of movies made in the year 2010 or later:
SELECT SUM(budget)
FROM films
WHERE release_year >= 2010;
Get the average amount grossed by all films whose titles start with the letter ‘A’.
SELECT AVG(gross)
FROM films
where title LIKE 'A%';
Get the amount grossed by the best performing film between 2000 and 2012, inclusive.
SELECT MAX(gross)
FROM films
WHERE release_year BETWEEN 2000 AND 2012;
Aliasing:
SQL allows you to do something called aliasing. Aliasing simply means you assign a temporary name to something. To alias, you use the AS keyword, which you’ve already seen earlier in this course.
For example, in the above example we could use aliases to make the result clearer:
SELECT MAX(budget) AS max_budget,
MAX(duration) AS max_duration
FROM films;
Get the title and net profit (the amount a film grossed, minus its budget) for all films. Alias the net profit as net_profit.
SELECT title, gross - budget AS net_profit
FROM films;
Get the percentage of people who are no longer alive. Alias the result as percentage_dead. Remember to use 100.0 and not 100!
-- get the count(deathdate) and multiply by 100.0
-- then divide by count(*)
SELECT COUNT(deathdate) * 100.0 / COUNT(*) AS percentage_dead
FROM people;
ORDER BY :
In SQL, the ORDER BY keyword is used to sort results in ascending or descending order according to the values of one or more columns.
Get the title and gross earnings for movies which begin with the letter ‘M’ and order the results alphabetically.
SELECT title, gross
FROM films
WHERE title LIKE 'M%'
ORDER BY title;
By default ORDER BY will sort in ascending order. If you want to sort the results in descending order, you can use the DESC keyword. For example,
`SELECT title FROM films ORDER BY release_year DESC;
This gives you the titles of films sorted by release year, from newest to oldest.
GROUP BY
In SQL, GROUP BY allows you to group a result by one or more columns, like so:
SELECT sex, count(*)
FROM employees
GROUP BY sex;
Get the release year, country, and highest budget spent making a film for each year, for each country. Sort your results by release year and country.
SELECT release_year, country, MAX(budget)
FROM films
GROUP BY release_year, country
ORDER BY release_year, country;
HAVING
In SQL, aggregate functions can’t be used in WHERE clauses. For example, the following query is invalid:
SELECT release_year
FROM films
GROUP BY release_year
WHERE COUNT(title) > 10;
This means that if you want to filter based on the result of an aggregate function, you need another way! That’s where the HAVING clause comes in. For example,
SELECT release_year
FROM films
GROUP BY release_year
HAVING COUNT(title) > 10;
shows only those years in which more than 10 films were released.
Get the country, average budget, and average gross take of countries that have made more than 10 films. Order the result by country name, and limit the number of results displayed to 5. You should alias the averages as avg_budget and avg_gross respectively.
-- select country, average budget,
-- and average gross
SELECT country, AVG(budget) AS avg_budget,
AVG(gross) AS avg_gross
-- from the films table
FROM films
-- group by country
GROUP BY country
-- where the country has more than 10 titles
HAVING COUNT(title) > 10
-- order by country
ORDER BY country
-- limit to only show 5 results
LIMIT 5;