final-project: Final Project

Ready: true
Assigned: Fri 05/31 01:00PM
Due: Fri 06/14 11:59PM

Update (6/10): Project Deadline has been extended to June 14.

INT15 Final Project

The guidelines for the project are in the int15-final-project.ipynb notebook.


Below is a list of vetted datasets along with a brief description. We’ve also included some questions that you may want to answer. You are not required to answer these questions and you are encouraged to propose (and answer!) additional questions. If you can find additional data related to one of the projects below, you should feel free to use it, making sure to cite and link to the source of the data. You are not constrained to use only the data provided.

If none of the datasets below is of interest and you have your own dataset that you are excited to analyze, please check with Prof. K or Prof. Franks before Wednesday, June 3 to get approval.

Firearm permits and background checks

The data in this repository comes from the FBI’s National Instant Criminal Background Check System.

Mandated by the Brady Handgun Violence Prevention Act of 1993 and launched by the FBI on November 30, 1998, NICS is used by Federal Firearms Licensees (FFLs) to instantly determine whether a prospective buyer is eligible to buy firearms or explosives. Before ringing up the sale, cashiers call in a check to the FBI or to other designated agencies to ensure that each customer does not have a criminal record or isn’t otherwise ineligible to make a purchase. More than 100 million such checks have been made in the last decade, leading to more than 700,000 denials. The FBI provides data on the number of firearm checks by month, state, and type — but as a PDF. The code in this GitHub repository downloads that PDF, parses it, and produces a spreadsheet/CSV of the data. Click here to download the data, which currently covers November 1998 – April 2019.
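
Once you have the CSV, a natural first step is to total the checks per year. The sketch below is one way this might look using only the Python standard library. The column names `month` and `totals`, the "YYYY-MM" month format, and the sample rows are assumptions made for illustration (the numbers are toy values, not real NICS figures), so check the header of the file you actually download.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the downloaded CSV. Column names and numbers are
# illustrative assumptions, not real NICS figures.
SAMPLE_CSV = """month,state,totals
2019-04,California,100
2019-03,California,120
2018-12,California,90
2019-04,Texas,80
"""

def checks_per_year(csv_file):
    """Sum the `totals` column by calendar year across all states."""
    per_year = defaultdict(int)
    for row in csv.DictReader(csv_file):
        year = int(row["month"][:4])  # assumes "YYYY-MM" month strings
        per_year[year] += int(row["totals"])
    return dict(per_year)

yearly = checks_per_year(io.StringIO(SAMPLE_CSV))
print(yearly)
```

To run this on the real file, pass an opened file instead of the `io.StringIO` sample, for example `open(path, newline="")`, where `path` is wherever you saved the download.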

Analyzing OkCupid Data

Take an in-depth look at the world of online dating through this dataset! It is a set of anonymized data from the popular online dating site OkCupid.

Suggested directions:

Baseball Data

These datasets cover batting and pitching stats for baseball from 1871 to 2018, for both players and teams, with plenty of metadata. Possible areas of exploration include comparing player performance and visualizing player success over time. Measures of success include home runs, wins, salaries, etc.

The data is split into several files by theme, so you will need to look through the files and decide which to analyze and which topic to focus on. The data lends itself well to regression analysis.
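
As a rough illustration of the kind of regression this data supports, the sketch below fits a straight line by ordinary least squares using plain Python. The variable names (`home_runs`, `salary_millions`) and all of the values are made up for illustration and do not come from the dataset. In your notebook you would pull the two columns from whichever files you choose.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy values, invented for illustration only.
home_runs = [10, 20, 30, 40]
salary_millions = [1.0, 2.0, 3.0, 4.0]

slope, intercept = fit_line(home_runs, salary_millions)
print(f"salary ~ {slope:.3f} * home_runs + {intercept:.3f}")
```

Real data will not fall on a perfect line like this toy example, so it is worth plotting the points and the fitted line together before interpreting the slope.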

Spotify & Pitchfork Reviews

(Warning: large files!)

Spotify is a leading music streaming service that collects a lot of data about the music it has access to and its users’ listening habits. Take advantage of this data-driven approach by looking at the EchoNest information about some of the top tracks on the Billboard 200. You could even use the EchoNest API to see what kind of information is available about your favorite songs! More information about working with EchoNest is available here:

Spotify collaborates with EchoNest to understand music at a deeper level by breaking songs down into “acoustic” information. The spotify_echonest DataFrame shows what that data consists of: it ranges from attributes like “danceability” and “energy” to more conventional music theory ideas like tempo and key. These numerical features make it easy to analyze music quantitatively. We can use them to see what kind of music tends to be successful and which trends to aim for when making the next summer playlist.

You can also integrate Pitchfork music reviews.

Possible directions:

Crime Statistics and Police Data

Possible questions:

Final Project Rubric