SQL "Wrapped" in Python: A Movie Database

Download this file, which is an sql database for movies. It contains information about the the movies as well as reviews for the movies.

Open the file in a text editor and spend a few minutes examining its structure.

Make sure to save the file to an appropriate folder: perhaps 'class14_files'

Open the file in python and spend a few minutes examining its structure.

First, take a quick look at this documentation on using sqlite3 in python.

Next, create an empty database called movie_db.

In [2]:
import sqlite3

# connect to a database (or create it, if it doesn't already exist)
con = sqlite3.connect('movie_db.db')

#generate a 'live' cursor that can execute commands as if you were entering them in the sqlite terminal
cur = con.cursor()

If you just created an empty database, you can load an .sql datafile into it by reading and executing the string contained in the file in sqlite3.

First, you can load the .sql file into a string.

In [3]:
#run the following only once for your .db to load the .sql data into it
with open("movie_db.sql") as f: 
    sql_data = f.read()
print(sql_data[0:400],10) 
/* Delete the tables if they already exist */
drop table if exists Movie;
drop table if exists Reviewer;
drop table if exists Rating;

/* Create the schema for our tables */
create table Reviewer(raterID int UNIQUE, name text);
create table Movie(movieID int UNIQUE, title text, year int, director text);
create table Rating(raterID int, movieID int, stars int, ratingDate date);

/* Populate the ta 10

Next, you can execute the string in sqlite, which parses the .sql file into a database. Note that the .sql file essentially just contains code for sql. In other words, you could enter the lines in the .sql file, line by line, into the sqlite> prompt, and it would create alse create the correct database. Theres's no need to do that here. We can exucute all the commands from the .sql file with one command:

In [4]:
cur.executescript(sql_data);

Now we can create and run queries using an easier interface.

In [ ]:
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")

What happened? You need to handle the output to view it. Use fetchall()

In [ ]:
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
cur.fetchall()

As you can see. There are 3 tables. Let's use pandas to make the viewing of these tables much nicer!

In [22]:
import pandas
pandas.set_option('display.max_rows',6)
def sql_request(q):
    cur.execute(q)
    return pandas.DataFrame(cur.fetchall(),columns=[d[0] for d in cur.description])  # just using Pandas for nice tabular display of results, nothing more

Now we can just sql_request() to make request and have a nice display

In [23]:
sql_request("SELECT name FROM sqlite_master WHERE type='table';")
Out[23]:
name
0 Reviewer
1 Movie
2 Rating
In [24]:
sql_request("SELECT * FROM Movie;")
Out[24]:
movieID title year director
0 56 Captain America-The First Avenger 2011 Joe Johnson
1 1 Birdman 2014 Alejandro González Iñárritu
2 2 Pulp Fiction 1994 Quentin Tarantino
... ... ... ... ...
43 466 The legend of 1900 1998 Giuseppe Tornatore
44 72 The Village 2004 M. Night Shyamalan
45 555 Gattaca 1997 Andrew Niccol

46 rows × 4 columns

In [25]:
sql_request("SELECT * FROM Reviewer;")
Out[25]:
raterID name
0 184 Jennifer
1 100 Jenna
2 419 Ted
... ... ...
26 410 Bobby
27 120 Eddie
28 555 Rosanna

29 rows × 2 columns

In [26]:
sql_request("SELECT * FROM Rating;")
Out[26]:
raterID movieID stars ratingDate
0 17 56 4.8 2017-03-07
1 100 1 5.0 2017-03-07
2 666 3003 5.0 2017-03-07
... ... ... ... ...
50 67 96 5.0 2017-03-07
51 67 5 2.0 2017-03-07
52 444 466 5.0 2017-03-07

53 rows × 4 columns

Now let's consider a more difficult query:

In [12]:
query = "select * from movie,rating,reviewer on movie.movieID=rating.movieID;"
sql_request(query)
Out[12]:
movieID title year director raterID movieID stars ratingDate raterID name
0 56 Captain America-The First Avenger 2011 Joe Johnson 17 56 4.8 2017-03-07 184 Jennifer
1 56 Captain America-The First Avenger 2011 Joe Johnson 17 56 4.8 2017-03-07 100 Jenna
2 56 Captain America-The First Avenger 2011 Joe Johnson 17 56 4.8 2017-03-07 419 Ted
3 56 Captain America-The First Avenger 2011 Joe Johnson 17 56 4.8 2017-03-07 101 Doug
4 56 Captain America-The First Avenger 2011 Joe Johnson 17 56 4.8 2017-03-07 132 Batman
... ... ... ... ... ... ... ... ... ... ...
1532 466 The legend of 1900 1998 Giuseppe Tornatore 444 466 5.0 2017-03-07 466 Mr. Ed
1533 466 The legend of 1900 1998 Giuseppe Tornatore 444 466 5.0 2017-03-07 115 Lee
1534 466 The legend of 1900 1998 Giuseppe Tornatore 444 466 5.0 2017-03-07 410 Bobby
1535 466 The legend of 1900 1998 Giuseppe Tornatore 444 466 5.0 2017-03-07 120 Eddie
1536 466 The legend of 1900 1998 Giuseppe Tornatore 444 466 5.0 2017-03-07 555 Rosanna

1537 rows × 10 columns

Note that the rater ids are not lined up. Lets now force this:

In [ ]:
query = "select * from movie,rating,reviewer on movie.movieid=rating.movieid and rating.raterID=reviewer.raterID"
sql_request(query)

We can sort these results by year

In [ ]:
query = "select * from movie,rating,reviewer on movie.movieid=rating.movieid and rating.raterID=reviewer.raterID order by year"
sql_request(query)

We can sort these results by descending year

In [ ]:
query = "select * from movie,rating,reviewer on movie.movieid=rating.movieid and rating.raterID=reviewer.raterID order by year desc"
sql_request(query)

We can sort the results by year descending, and then by title:

In [ ]:
query = "select * from movie,rating,reviewer on movie.movieid=rating.movieid and rating.raterid=reviewer.raterid order by year desc, title"
sql_request(query)

Extended questions

1. Write a query that returns movies released after 2015.
**Hint, you should find 6 of them.
In [ ]:
query1 = .....
sql_request(query1)
2. Write a query that returns information about the movies and ratings in which the rating was 5 stars. 
**Hint 1: You should find that there are 28 such ratings.
**Hint 2: Make sure the movie ids match.
In [ ]:
query2 = .....
sql_request(query2)
3. Write a query that returns all movies related to Harry Potter, and order the results by title. 

*Hint: You should find 8 of them.
In [ ]:
query3 = ....
sql_request(query3)
4. Write a query that returns all movies related to Harry Potter, and order the results by year (in descending order). 
In [ ]:
query4 = ....
sql_request(query4)
5. Write a query that returns for only the Harry Potter movies, their (a) movie title, (b) rating in stars, and (c) reviewer name.       

*Hint 1: The raterIDs and movieIDs should all match. 
**Hint 2: Your request should return only 2 reviews.
***Hint 3: I recommend using ON and placing all constraints in parantheses (). 
In [ ]:
query5 = ...
sql_request(query5)

More questions

  1. What is the average movie rating by year?
  2. What movies were released before 1990?
  3. Does the movie involve dinosaurs? (search for Jurassic)
  4. How many different directors are there?
  5. Movies between 2008 and 2013?
  6. How many movies after 1993 made the list?
  7. What is the average rating of all of the movies?
  8. What is the average title length?
  9. Which 90s movies had greater than 4 stars?
  10. What is the lowest rated movie?
  11. What are the range of years of movies reviewed?
  12. How many movies did each director direct?
  13. Who reviewed the most movies?
  14. Which years have the most reviews?
  15. Check if a reviewer didnt review anything.