Think Summer: Introduction — 2024
Submission
Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.
- 
A Jupyter notebook (a .ipynbfile).
We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.
| When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click . Next, to export your notebook (your  | 
Project
Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.
Context: This project is specially designed for Purdue University’s Think Summer program, and is coordinated by The Data Mine.
Scope: SQL, SQL in R
Dataset
The following questions will use the imdb database found in Anvil, our computing cluster.
This database has 6 tables, namely:
akas, crew, episodes, people, ratings, and titles.
You have a variety of options to connect with, and run queries on our database:
- 
Run SQL queries directly within a Jupyter Lab cell. 
First, you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.
- 
Open a browser and navigate to ondemand.anvil.rcac.purdue.edu, and login using your ACCESS credentials. 
- 
Click on "My Interactive Sessions". 
- 
Click on Jupyter Notebook in the left-hand menu under "The Data Mine" section (near the bottom of the screen). Select the following settings: - 
Allocation: cis220051 
- 
Queue: shared 
- 
Time in Hours: 3 
- 
Cores: 1 
- 
Use Jupyter Lab instead of Jupyter Notebook: Checked 
 
- 
- 
When satisfied, click Launch, and wait for a minute. In a few moments, you should get a note indicating that your session is ready to run. 
- 
When you are ready, click Connect to Jupyter. A new browser tab will launch. 
- 
Under the "Notebook" menu, please select the seminar (look for the big "S"; we do not want seminar-r). You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Anvil cluster. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations! 
To run queries directly in a Jupyter Lab cell, please do the following.
- 
In the first cell, run the following code. This code establishes a connection to the imdb.dbdatabase, which allows you to directly run SQL queries in a cell as long as that cell has%%sqlat the top of the cell.%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
- 
After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with %%sqlin the first line. For example.%%sql SELECT * FROM titles LIMIT 5;