Capstone Data Acquisition and Cleaning

For those not following along at home very closely: in my last blog post I was struck by the muse of inspiration via perspiration, and after some mental gymnastics came up with a serviceable idea for my Capstone project. In this post I’ll walk through some of the steps I took, from Data Acquisition through Cleaning, to paint a portrait of what a typical Data Science process looks like.

STEP 1 — Data Acquisition

It can’t be a Data Science project without any data, so I had to get my scavenger on and find data points I could use. I had my project idea; I just needed to find the movies (editor’s note: FX doesn’t REALLY have the movies). In prior labs we used web scraping, so I did what most people these days do when trying to find something: consult Google!

I’ve used APIs before in a project, so I looked into using them again, hoping to cash in on all the time I’ve spent (*not wasted*) reading random trivia tidbits about Quentin Tarantino movies (follow-up: when WILL Kill Bea get filmed so Vernita Green’s daughter can get her revenge?). That’s when I ran into my first snag: the response from my API.

API stands for Application Programming Interface, which is essentially just something that allows applications to communicate with one another. In other words, APIs allow us to get data from outside sources.

Many large tech companies, government organizations, fan sites, and social media companies have APIs that allow us common folk to access their data. IMDb has one that is relatively straightforward. After importing it and initializing a class, all you have to do is plug in the IMDb ID and you can get a bevy of information. E.g., when I plugged in the ID of my gun-to-my-head favorite movie ever, I could get this data.

NOT PICTURED: Honey Bunny, Pumpkin, Jack Rabbit Slim’s Waitress, Vincent Vega, Jules Winnfield
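For anyone curious what "getting data from an API" looks like once the response lands, here is a minimal sketch. It is not the IMDb library I used: the payload, its field names, and the `summarize_movie` helper are all made up for illustration, and the example runs offline on a canned JSON string rather than calling a real service.

```python
import json

# Made-up response body for illustration only; a real API defines its own fields.
SAMPLE_RESPONSE = (
    '{"id": "tt0000000", "title": "Example Movie", '
    '"year": 1994, "cast": ["Actor A", "Actor B"]}'
)


def summarize_movie(raw_json):
    """Parse a JSON response body and keep only the fields we care about."""
    payload = json.loads(raw_json)
    return {
        "id": payload["id"],                        # required field
        "title": payload.get("title", "unknown"),   # optional, with a fallback
        "year": payload.get("year"),                # None if the key is missing
        "n_cast": len(payload.get("cast", [])),     # size of the cast list
    }


summary = summarize_movie(SAMPLE_RESPONSE)
print(summary)
```

The point is just that an API hands you structured text (very often JSON), and your job is to pull out the pieces you want and handle the ones that might be missing.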

MovieLens had what they dubbed “The Movies Dataset”, and upon downloading it and reading it into Python I was happy to see it should be what I need for my Capstone. With over 45,000 movies and 15 columns, including budget and revenue, release dates, and plot summaries, it had everything I believed I would need for a Capstone America.

PICTURED: Your data vs the data your girl told you not to worry about

So, data in hand, it’s time to move on to the 2nd phase: Data Cleaning

STEP 2— Data Cleaning

Kind of like real life: everyone likes their data/stuff to be clean, but not many people actually like doing the cleaning. But also like real life, people won’t like you if you don’t!

So I loaded in my dataset, and luckily it was pretty much optimized for Python analysis. You only get one chance to make a first impression, and when I see something like this it’s good news for me…

PICTURED: 90’s Classics

When I see something like this a few things immediately jump to mind. Beyond missing Robin Williams and wondering how many more IP remakes Hollywood can possibly make starring The Rock, I’m glad to see that I will likely have enough features available to run a model. Checking df.columns and df.dtypes shows me all my column names and data types, and when I have Budget, Revenue, Vote Count, Popularity Count, Runtime, and Release Date I know I’ll have enough numeric values to AT LEAST run a multiple linear regression.
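Here is a rough sketch of that first look at the data. The tiny DataFrame below is a stand-in with invented values and lowercase column names I picked for the example, not the real dataset. One thing worth knowing: in pandas, `columns` and `dtypes` are attributes, so they take no parentheses.

```python
import pandas as pd

# Stand-in frame with placeholder values; the real dataset is much larger.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "budget": [1000000, 2500000],
    "revenue": [5000000.0, 1200000.0],
    "runtime": [95.0, 120.0],
    "release_date": ["1995-10-30", "1995-12-15"],
})

print(df.columns)  # column names (attribute, not a method call)
print(df.dtypes)   # data type of each column

# Keep only the numeric columns, i.e. the candidates for regression features.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Notice that the release date comes in as text, so it does not count as numeric until it is converted.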

Data cleaning involves the little things though, and just like most of life, things often take multiple steps before they’re actually working. In this case I had to rename all my column titles, convert my release date to a month using the datetime functionality, then create a new column for season released and map the months to it. Then came my first real challenge: how to isolate the genres listed in a way I could mathematically analyze.
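Those three steps (rename, date to month, month to season) can be sketched roughly like this. The column names, the sample rows, and especially the month-to-season mapping are my assumptions for the example; the season boundaries in my actual project may have been drawn differently.

```python
import pandas as pd

# Placeholder rows standing in for the real dataset.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_date": ["1995-10-30", "1995-12-15", "1996-07-04"],
})

# Step 1: rename columns to friendlier titles.
df = df.rename(columns={"title": "Title", "release_date": "Release Date"})

# Step 2: convert the text dates to datetimes, then pull out the month number.
df["Release Date"] = pd.to_datetime(df["Release Date"])
df["Release Month"] = df["Release Date"].dt.month

# Step 3: map each month to a season (assumed Northern Hemisphere grouping).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
df["Release Season"] = df["Release Month"].map(SEASON_BY_MONTH)

print(df[["Title", "Release Month", "Release Season"]])
```

Using a dictionary with `map` keeps the season logic in one readable place instead of a long chain of if/else checks.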

Knowing that genre should have a HUGE impact on revenue (see: box office of every Marvel movie vs every Wes Anderson movie) I knew I needed to find a way to get the genre of every movie in my dataset. The problem was that every movie in my dataset had multiple genres listed. For example, Toy Story is an Animation film, but also a Family film and a Comedy film. Along those lines Transformers would be a Bad film, but also a Terrible film and a Waste of Time film. But herein lies the rub: when I attempted to list them, what I got back was a list of three dictionaries, each with key-value pairs for the corresponding genre. Luckily, the problem solving I wrote about could apply here.
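To make the "dictionaries in a list" problem concrete, here is a small sketch of what one genre cell might look like and how to get the names out. I am assuming the cell arrives as a string and that each dictionary has `id` and `name` keys; the id numbers below are placeholders, not real values.

```python
import ast

# One genre cell as it might appear after reading the file: a string that
# looks like a Python list of dictionaries (ids here are placeholders).
raw_genres = (
    "[{'id': 1, 'name': 'Animation'}, "
    "{'id': 2, 'name': 'Family'}, "
    "{'id': 3, 'name': 'Comedy'}]"
)

# literal_eval safely turns the string into real Python objects.
genre_dicts = ast.literal_eval(raw_genres)

# Pull just the genre name out of each dictionary.
genre_names = [g["name"] for g in genre_dicts]
print(genre_names)
```

`ast.literal_eval` only evaluates plain literals (lists, dicts, strings, numbers), which makes it a safer choice than `eval` for data read from a file.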

Luckily I at least knew what I should do: a FOR loop. For the uninitiated, a for loop is a versatile Python feature with which you can do almost anything! The possibilities are almost limitless. Some examples of what one could do with a FOR loop: go through a list of numbers and return the primes, go through a list and print out certain values, or even…go through a DataFrame column of genres, split them up into strings, create two new columns for Primary Genre and All Genres, and populate them with the correct values.

PICTURED: Something that took me days and TA help to figure out and get right.

Some things to understand about for loops: first off, like I mentioned before, they are iterative. Essentially this means they will keep running over whatever they are given until they run out of items or are told to stop. Above was a nested FOR loop, in that I needed a FOR loop WITHIN a FOR loop plus an IF statement (see: Inception), because I needed to tell my FOR loop to go through my list of genres once, create a list if there was more than one value, and then for each of those take the first genre and put it in my primary genre column and put the entirety of the list in my all genres column.
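Since the screenshot of my loop may not come through, here is a simplified reconstruction of the idea rather than my exact code. The sample rows, the `genres` column name, the `"Unknown"` fallback for movies with no genres, and the choice to store All Genres as a comma-separated string are all assumptions made for this sketch.

```python
import ast
import pandas as pd

# Placeholder rows; each genres cell is a string holding a list of dicts.
df = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        "[{'id': 1, 'name': 'Animation'}, {'id': 2, 'name': 'Family'}, "
        "{'id': 3, 'name': 'Comedy'}]",
        "[{'id': 4, 'name': 'Drama'}]",
        "[]",
    ],
})

primary_genres = []
all_genres = []

# Outer loop: one pass per movie.
for raw in df["genres"]:
    names = []
    # Inner loop: one pass per genre dictionary for that movie.
    for entry in ast.literal_eval(raw):
        names.append(entry["name"])

    # IF statement: handle movies with genres vs. movies with none.
    if names:
        primary_genres.append(names[0])       # first genre listed
        all_genres.append(", ".join(names))   # every genre, as one string
    else:
        primary_genres.append("Unknown")
        all_genres.append("Unknown")

df["Primary Genre"] = primary_genres
df["All Genres"] = all_genres
print(df[["Title", "Primary Genre", "All Genres"]])
```

Building two plain lists and assigning them as columns at the end keeps the loop easy to follow, which mattered more to me here than speed.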

So in summary I hope this post helps show the importance of data acquisition and cleaning in the Data Science process. A project can only be as good as the data you’re working with, and no matter how complex or involved a model is, it won’t have any impactful results unless the data is correctly formatted and cleaned. Join me for my final blog post later this week, where I’ll recap my experience at DSI boot camp and some of the stuff I learned!
