443-970-2353
[email protected]
CV Resume
%load_ext sql
%sql mysql://studentuser:[email protected]/dognitiondb
%sql USE dognitiondb
Let's say we want to know which Dognition customers received access to Dognition's first four tests for free. These customers have a 1 in the "free_start_user" column of the users table. The syntax you would use to select the data for these customers would be:
%%sql
SELECT user_guid, free_start_user
FROM users
WHERE free_start_user=1
LIMIT 10;
Question 1 : How would you select the Dog IDs for the dogs in the Dognition data set that were DNA tested (these should have a 1 in the dna_tested field of the dogs table)? Try it below (if you do not limit your output, your query should output data from 1433 dogs):
%%sql
SELECT dog_guid
FROM dogs
WHERE dna_tested=1
LIMIT 10;
The SELECT statement can be used to interact with all data types, and there are many operators and functions that allow you to interact with the data in different ways. Here are some resources that describe these operators and functions:
http://dev.mysql.com/doc/refman/5.7/en/func-op-summary-ref.html
http://www.w3resource.com/mysql/mysql-functions-and-operators.php
Some of the most common operators include: =,<,>,<=, and >=. If you want to select something that is NOT a specific value, use != or <>. You can also use logical operators, such as AND and OR.
Let's start by examining how operators can be used with numerical data.
If you wanted to examine the Dog IDs of dogs who weighed between 10 and 50 pounds, you could query:
SELECT dog_guid, weight
FROM dogs
WHERE weight BETWEEN 10 AND 50;
The above query provided an example of how to use the BETWEEN operator (described in the links provided above), as well as an example of how AND can be used to specify multiple criteria. If you wanted to examine the Dog IDs of dogs who were "fixed" (neutered) OR DNA tested, you could use OR in the following query:
SELECT dog_guid, dog_fixed, dna_tested
FROM dogs
WHERE dog_fixed=1 OR dna_tested=1;
If you wanted to examine the Dog IDs of dogs who were fixed but NOT DNA tested, you could query:
SELECT dog_guid, dog_fixed, dna_tested
FROM dogs
WHERE dog_fixed=1 AND dna_tested!=1;
Question 2: How would you query the User IDs of customers who bought annual subscriptions, indicated by a "2" in the membership_type field of the users table? (If you do not limit the output of this query, your output should contain 4919 rows.)
%%sql
SELECT user_guid
FROM users
WHERE membership_type=2
LIMIT 10;
Now let's try using the WHERE statement to interact with text data (called "strings").
Strings need to be surrounded by quotation marks in SQL. MySQL accepts both double and single quotation marks, but some database systems only accept single quotation marks. Whenever a string contains an SQL keyword, the string must be enclosed in backticks instead of quotation marks.
'the marks that surrounds this phrase are single quotation marks'
"the marks that surrounds this phrase are double quotation marks"
`the marks that surround this phrase are backticks`
Strings enclosed in quotation or backticks can be used with many of the same operators as numerical data. For example, imagine that you only wanted to look at data from dogs of the breed "Golden Retrievers." You could query (note that double quotation marks could have been used in this example is well):
SELECT dog_guid, breed
FROM dogs
WHERE breed='golden retriever';
The IN operator allows you to specify multiple values in a WHERE clause. Each of these values must be separated by a comma from the other values, and the entire list of values should be enclosed in parentheses. If you wanted to look at all the data from Golden Retrievers and Poodles, you could certainly use the OR operator, but the IN operator would be even more efficient (note that single quotation marks could have been used in this example, too):
SELECT dog_guid, breed
FROM dogs
WHERE breed IN ("golden retriever","poodle");
The LIKE operator allows you to specify a pattern that the textual data you query has to match. For example, if you wanted to look at all the data from breeds whose names started with "s", you could query:
SELECT dog_guid, breed
FROM dogs
WHERE breed LIKE ("s%");
In this syntax, the percent sign indicates a wild card. Wild cards represent unlimited numbers of missing letters. This is how the placement of the percent sign would affect the results of the query:
Question 3: How would you query all the data from customers located in the state of North Carolina (abbreviated "NC") or New York (abbreviated "NY")? If you do not limit the output of this query, your output should contain 1333 rows.
%%sql
SELECT *
FROM users
WHERE state IN ("NC","NY")
LIMIT 10;
Next, let's try using the WHERE statement to interact with datetime data. Time-related data is a little more complicated to work with than other types of data, because it must have a very specific format. MySQL comes with the following data types for storing a date or a date/time value in the database:
DATE - format YYYY-MM-DD
DATETIME - format: YYYY-MM-DD HH:MI:SS
TIMESTAMP - format: YYYY-MM-DD HH:MI:SS
YEAR - format YYYY or YY
One of the interesting things about time-related data is that SQL has commands to break the data into different "time parts" or "date parts" as described here:
http://www.tutorialspoint.com/mysql/mysql-date-time-functions.htm
A time stamp stored in one row of data might look like this:
2013-02-07 02:50:52
The year part of that entry would be 2013, the month part would be "02" or "February" (depending on the requested format), the seconds part would be "52", and so on. SQL functions easily allow you to convert those parts into formats you might need for specific analyses. For example, imagine you wanted to know how many tests Dognition customers complete on different days of the week. To complete this analysis, you would need to convert the time stamps of each completed test to a variable that outputted the correct day of the week for that date. DAYNAME is a function that will do this for you. You can combine DAYNAME with WHERE to select data from only a single day of the week:
SELECT dog_guid, created_at
FROM complete_tests
WHERE DAYNAME(created_at)="Tuesday"
You can also use common operators like =,<,>,<=,>=,!=, or <> with dates just like you would with other types of data, but whether you refer to the date as a number or text will depend on whether you are selecting individual date parts or treating the date/time entry as a single clause. For example, you could select all the Dog IDs and time stamps of tests completed after the 15 of every month with this command that extracts the "DAY" date part out of each time stamp:
SELECT dog_guid, created_at
FROM complete_tests
WHERE DAY(created_at) > 15
You could also select all the Dog IDs and time stamps of completed tests from after February 4, 2014 by treating date entries as text clauses with the following query:
SELECT dog_guid, created_at
FROM complete_tests
WHERE created_at > '2014-02-04'
Note that you have to use a different set of functions than you would use for regular numerical data to add or subtract time from any values in these datetime formats. For example, instead of using a minus sign to find the difference in time between two time stamps or dates, you would use the TIMEDIFF or DATEDIFF function. See the references provided above for a list of these functions.
Question 4: Now that you have seen how datetime data can be used to impose criteria on the data you select, how would you select all the Dog IDs and time stamps of Dognition tests completed before October 15, 2015 (your output should have 193,246 rows)?
%%sql
SELECT dog_guid,created_at
FROM complete_tests
WHERE created_at < '2015-101-5'
LIMIT 10;
Last, let's use the WHERE statement in combination with two very important operators: IS NULL and IS NOT NULL. IS NULL will indicate rows of data that have null values. IS NOT NULL will indicate rows that do not have null values. We saw in previous exercises that many of the entries in the free_start_user field of the user table in the Dognition data set had NULL values. To select only the rows that have non-null data you could query:
SELECT user_guid
FROM users
WHERE free_start_user IS NOT NULL;
To select only the rows that have null data so that you can examine if these rows share something else in common, you could query:
SELECT user_guid
FROM users
WHERE free_start_user IS NULL;
Question 5: How would you select all the User IDs of customers who do not have null values in the State field of their demographic information (if you do not limit the output, you should get 17,985 from this query -- there are a lot of null values in the state field!)?
%%sql
SELECT membership_id
FROM users
WHERE state IS NOT NULL
LIMIT 10;
%%sql
SELECT dog_guid,subcategory_name,test_name
FROM complete_tests
WHERE YEAR(created_at)=2014
LIMIT 10;
Question 7: How would you select all of the User IDs of customers who have female dogs whose breed includes the word "terrier" somewhere in its name (if you don't limit your output, you should have 1771 rows in your output)?
%%sql
SELECT user_guid
FROM dogs
WHERE gender ="female" AND breed LIKE ("%terrier%")
LIMIT 10;
Question 8: How would you select the Dog ID, test name, and subcategory associated with each completed test for the first 100 tests entered in October, 2014?
%%sql
SELECT dog_guid,test_name,subcategory_name
FROM complete_tests
WHERE created_at LIKE ("2014-10-%")
LIMIT 100;
We can also use this:
%%sql
SELECT dog_guid,test_name,subcategory_name
FROM complete_tests
WHERE YEAR(created_at)=2014 AND MONTH(created_at)=10
LIMIT 100;