Project 6

Movie Ratings Project

In this project, I used classification modeling to examine what factors lead to certain ratings for movies, based on a dataset I collected from IMDB (www.imdb.com).

Assumptions

Before we go to the results, here are the three major assumptions I made for this project.

  1. Data collection: I collected data from two sources
    • Top 250 movies on IMDB
    • The 50 most popular movies by genre (action, adventure, animation, biography, comedy, crime, documentary, drama, family, fantasy, history, horror, musical, mystery, romance, sport, thriller, war, western).
  2. Labeling: the IMDb rating is my dependent variable. The lowest IMDb rating among the top 250 movies was 8.0, so I labeled
    • 1 for movies with an IMDb rating of 8.0 or above
    • 0 for movies with an IMDb rating below 8.0
  3. Imputing: I filled NaN values in the BoxOffice and Metascore columns with the column means

Summary

I applied Random Forest and Extra Trees in the modeling section. Both models ranked the number of IMDb votes, movie age (derived from the release year) and Metascore among their top 5 features.

A greater number of IMDb votes indicates that more people have watched the movie. This can be one factor in judging whether a movie is successful.

Older movies are more likely to be rated higher. Since I collected the data from the best and the most popular movies, old movies that are still popular are usually the classic, great ones.

Metascore has a positive relationship with the IMDb rating, which means it carries some information about my dependent variable.

Random Forest had a higher accuracy score on the test dataset than Extra Trees (0.78 vs 0.72).

Project Outline

  • Get data
  • Clean data
  • Add features
  • Visualization
  • Random Forest
  • Extra trees

Next Steps

As next steps, I would like to revisit my assumptions and extend the modeling.

  1. Data collection: get a list of randomly selected movies since 1900 and webscrape more data
  2. Apply KNN imputer for NaN values
  3. Examine outliers in the visualizations
  4. Try more models, e.g. bagging, boosting and logistic regression.
import requests
import pandas as pd
from imdbpie import Imdb
import numpy as np

Get data

# Get top250 movies from imdbpie
imdb = Imdb(anonymize=True)
top_250 = imdb.top_250()
df_top_250 = pd.DataFrame(top_250)
df_top_250.head()
can_rate image num_votes rating tconst title type year
0 True {u'url': u'https://images-na.ssl-images-amazon... 1808376 9.3 tt0111161 The Shawshank Redemption feature 1994
1 True {u'url': u'https://images-na.ssl-images-amazon... 1234863 9.2 tt0068646 The Godfather feature 1972
2 True {u'url': u'https://images-na.ssl-images-amazon... 849748 9.0 tt0071562 The Godfather: Part II feature 1974
3 True {u'url': u'https://images-na.ssl-images-amazon... 1788444 9.0 tt0468569 The Dark Knight feature 2008
4 True {u'url': u'https://images-na.ssl-images-amazon... 489687 8.9 tt0050083 12 Angry Men feature 1957
id_list_1 = df_top_250['tconst'].tolist()
# Get 50 popular movies for each genre
import requests
from bs4 import BeautifulSoup
genre = ['action','adventure','animation','biography','comedy','crime', 'documentary','drama','family', 'fantasy'\
         ,'history', 'horror', 'musical', 'mystery', 'romance','sport','thriller','war', 'western']
id_list_2 = []

for i in genre:
    url = 'http://www.imdb.com/search/title?genres={}&title_type=feature&sort=moviemeter,asc&page=1&ref_=adv_nxt'.format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content,'lxml')
    div = soup.findAll('div',class_='ribbonize')
    for k in div:
        id_number = k.get('data-tconst')
        id_list_2.append(id_number)
len(id_list_2)
900
id_list = id_list_1 + id_list_2
len(id_list)
1150
# Search imdb ID on omdbapi

df_origin = pd.DataFrame()

for i in id_list:
    r = requests.get('http://www.omdbapi.com/?i={}&plot=short&r=json'.format(i))
    df_i=pd.DataFrame(r.json())
    df_origin = df_origin.append(df_i)

Clean data

1. Drop duplicates

df = df_origin.drop_duplicates(subset =['imdbID'], keep='first').reset_index(drop=True)

2. Drop columns

df.columns
Index([u'Actors', u'Awards', u'BoxOffice', u'Country', u'DVD', u'Director',
       u'Genre', u'Language', u'Metascore', u'Plot', u'Poster', u'Production',
       u'Rated', u'Ratings', u'Released', u'Response', u'Runtime', u'Title',
       u'Type', u'Website', u'Writer', u'Year', u'imdbID', u'imdbRating',
       u'imdbVotes'],
      dtype='object')
df = df.drop(['Poster','Website','Type','Response','Ratings','Production'],axis=1)

3. Replace ‘N/A’ with NaN

df = df.replace('N/A',np.nan)
df = df.dropna(subset=['imdbRating'])

4. Clean Country, Genre and Language

def clean_orders(column_name):
    df[column_name] = df[column_name].apply(lambda x: x.split(', '))
    df[column_name] = df[column_name].apply(lambda x: sorted(x))
    df[column_name] = df[column_name].apply(lambda x: ', '.join(x))
clean_orders('Country')
clean_orders('Genre')
df.loc[df['Language'].isnull(), 'Language'] = 'English'
clean_orders('Language')

5. Clean Runtime, Year, imdbRating and imdbVotes

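# Fill missing runtimes with '91 min', then remove the ' min' suffix before converting to int
df.loc[df['Runtime'].isnull(), 'Runtime'] = '91 min'
df['Runtime'] = df['Runtime'].apply(lambda x: int(str(x).replace(' min', '')))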
df['Year'] = df['Year'].apply(lambda x: int(x))
df['imdbRating'] = df['imdbRating'].apply(lambda x: float(x))
df['imdbVotes'] = df['imdbVotes'].apply(lambda x: x.replace(',',''))
df['imdbVotes'] = df['imdbVotes'].apply(lambda x: int(x))
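The string cleanup in this step can be sketched with plain Python. The helper names `parse_runtime` and `parse_votes` are hypothetical, and the sketch assumes runtimes look like '91 min' and vote counts like '1,808,376'. It removes the suffix explicitly because `str.strip(' min')` treats its argument as a set of characters to remove from both ends, not as a literal suffix.

```python
def parse_runtime(text):
    """Convert a runtime such as '91 min' to an integer number of minutes."""
    suffix = ' min'
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    return int(text)


def parse_votes(text):
    """Convert a vote count such as '1,808,376' to an integer."""
    return int(text.replace(',', ''))


# str.strip removes any of the characters ' ', 'm', 'i', 'n' from both ends
stripped = 'nim 10 min'.strip(' min')

print(parse_runtime('91 min'))
print(parse_votes('1,808,376'))
print(repr(stripped))
```

For well-formed values such as '91 min' both approaches give the same number; the explicit suffix check only matters for unexpected inputs.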

6. Clean DVD and Released dates

df['DVD'] = pd.to_datetime(df['DVD'])
df['Released'] = pd.to_datetime(df['Released'])
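# Caution: filling a datetime column with the integer year stores the year as nanoseconds
# after 1970-01-01 rather than as a date in that year. Both columns are dropped before modeling.
df['DVD'].fillna(df['Year'],inplace=True)
df['Released'].fillna(df['Year'],inplace=True)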

7. Clean Rated

df['Rated'] = df['Rated'].apply(lambda x: str(x).replace('NOT RATED','UNRATED'))
df['Rated'] = df['Rated'].apply(lambda x: str(x).replace('PASSED','APPROVED'))
df['Rated'] = df['Rated'].apply(lambda x: str(x).replace('nan','UNRATED'))

8. Clean Awards

# expand=False makes str.extract return a Series that fits a single column
df['Nominations'] = df['Awards'].str.extract('([0-9]+ nominations)', expand=False)
df['Nominations'] = df['Nominations'].apply(lambda x: float(str(x).replace(' nominations','')))
df['Nominations'] = df['Nominations'].fillna(0)
df['Wins'] = df['Awards'].str.extract('([0-9]+ wins)', expand=False)
df['Wins'] = df['Wins'].apply(lambda x: float(str(x).replace(' wins','')))
df['Wins'] = df['Wins'].fillna(0)
df['Oscars'] = df['Awards'].str.extract('([0-9]+ Oscars)', expand=False)
df['Oscars'] = df['Oscars'].apply(lambda x: float(str(x).replace(' Oscars','')))
df['Oscars'] = df['Oscars'].fillna(0)
df = df.drop(['Awards'],axis=1)
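The awards extraction above can be illustrated on a single string with the standard `re` module. This is a minimal sketch: the helper `count_in_awards` is hypothetical, and the example string is an assumed format, since the real OMDb wording may differ.

```python
import re


def count_in_awards(awards, unit):
    """Return the number that precedes `unit` in an awards string, or 0 if absent."""
    if awards is None:
        return 0
    match = re.search(r'([0-9]+) ' + re.escape(unit), awards)
    return int(match.group(1)) if match else 0


# Hypothetical example string; the real OMDb wording may differ
example = 'Won 3 Oscars. Another 23 wins & 30 nominations.'

wins = count_in_awards(example, 'wins')
nominations = count_in_awards(example, 'nominations')
oscars = count_in_awards(example, 'Oscars')
print(wins, nominations, oscars)
```

Two properties of these patterns are worth keeping in mind: a string such as 'Nominated for 7 Oscars' also matches the Oscars pattern, and a singular '1 win' does not match the 'wins' pattern.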
# Clean BoxOffice: drop '$', ',' and '.', then divide by 100
# (this assumes every value carries two decimal places)
df['BoxOffice']=df['BoxOffice'].apply(lambda x: str(x).replace('$',''))
df['BoxOffice']=df['BoxOffice'].apply(lambda x: str(x).replace(',',''))
df['BoxOffice']=df['BoxOffice'].apply(lambda x: str(x).replace('.',''))
df['BoxOffice']=df['BoxOffice'].apply(lambda x: float(x)/100)
# Convert Metascore to float
df['Metascore']=df['Metascore'].apply(lambda x: float(x))
df.head(3)
Actors BoxOffice Country DVD Director Genre Language Metascore Plot Rated ... Labels rated__APPROVED rated__G rated__GP rated__PG rated__PG-13 rated__R rated__TV-14 rated__TV-MA rated__UNRATED
0 Tim Robbins, Morgan Freeman, Bob Gunton, Willi... NaN USA 1970-01-01 00:00:00.000001994 Frank Darabont Crime, Drama English 80.0 Two imprisoned men bond over a number of years... R ... 1 0 0 0 0 0 1 0 0 0
1 Marlon Brando, Al Pacino, James Caan, Richard ... NaN USA 2001-10-09 00:00:00.000000000 Francis Ford Coppola Crime, Drama English, Italian, Latin 100.0 The aging patriarch of an organized crime dyna... R ... 1 0 0 0 0 0 1 0 0 0
2 Al Pacino, Robert Duvall, Diane Keaton, Robert... NaN USA 1970-01-01 00:00:00.000001974 Francis Ford Coppola Crime, Drama English, Italian, Latin, Sicilian, Spanish 80.0 The early life and career of Vito Corleone in ... R ... 1 0 0 0 0 0 1 0 0 0

3 rows × 32 columns
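As an alternative to the BoxOffice cleanup above, a currency string can be converted without touching the decimal point, so values with and without cents end up on the same scale. This is only a sketch; `parse_box_office` is a hypothetical helper and the input formats are assumptions.

```python
def parse_box_office(text):
    """Convert a string such as '$1,234,567' or '$1,234,567.89' to a float."""
    if text is None:
        return float('nan')
    return float(text.replace('$', '').replace(',', ''))


whole_dollars = parse_box_office('$1,234,567')
with_cents = parse_box_office('$1,234,567.89')
missing = parse_box_office(None)
print(whole_dollars, with_cents, missing)
```

Because tree-based models split on thresholds, a constant rescaling of the whole column would not change their predictions; mixing two scales within one column would.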

# Keep an independent snapshot of the cleaned data
df_clean = df.copy()
df = df_clean.copy()

Adding Features

1. Add a new feature Age

df['Age'] = 2017 - df['Year']

2. Label y based on imdb ratings

# The top 250 movies all have ratings of 8.0 or above, so treat ratings >= 8.0 as high
df['Labels'] = df['imdbRating'].apply(lambda x: 1 if x >= 8.0 else 0)
df_high = df[df['Labels']==1]
df_low = df[df['Labels']==0]

Visualization

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

1. Histogram on imdb ratings

The histogram shows that these movies have a mean IMDb rating of about 7.5.

Most of the ratings fall between 6 and 8.5.

fig = plt.figure(figsize=(11,5))
df['imdbRating'].hist()
plt.axvline(df['imdbRating'].mean(),color='b', linestyle='dashed', linewidth=2)
plt.title('imdb rating histogram',fontsize = 16)

[Figure: IMDb rating histogram with the mean shown as a dashed line]

2. Heatmap

Age and Year have a very strong relationship because I calculated Age by subtracting Year from 2017.

Wins and Nominations also have a strong relationship, as do Metascore and imdbRating.

fig=plt.figure(figsize=(8,6))
# Plot squared correlations, which show the strength of each relationship but not its sign
sns.heatmap(df.corr()**2)
plt.title('Correlation Heatmap', fontsize=15);

[Figure: correlation heatmap]
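The near-perfect relationship between Age and Year can be checked by hand. The sketch below computes the Pearson correlation in plain Python for the five release years shown in the top-250 preview; `pearson_r` is a hypothetical helper, not part of the notebook. Since Age is a decreasing linear function of Year, the correlation is negative, which a squared-correlation heatmap cannot show.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


years = [1994, 1972, 1974, 2008, 1957]   # release years from the top-250 preview
ages = [2017 - y for y in years]         # same definition as the Age feature

r = pearson_r(years, ages)
print(round(r, 6), round(r ** 2, 6))
```

This is also why only one of the two columns is needed as a model input.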

3. imdbvotes histogram

I plotted IMDb votes by label (high score or low score).

Low-scored movies are more likely to have fewer votes.

The vote counts of high-scored movies are spread over a wider range than those of low-scored movies.

df_high.columns
Index([     u'Actors',   u'BoxOffice',     u'Country',         u'DVD',
          u'Director',       u'Genre',    u'Language',   u'Metascore',
              u'Plot',       u'Rated',    u'Released',     u'Runtime',
             u'Title',      u'Writer',        u'Year',      u'imdbID',
        u'imdbRating',   u'imdbVotes', u'Nominations',        u'Wins',
            u'Oscars',         u'Age',      u'Labels'],
      dtype='object')
fig = plt.figure(figsize=(16,6))
sns.distplot(df_high['imdbVotes'],color='lightblue',label='high score',bins=5)
sns.distplot(df_low['imdbVotes'],color='lightgreen',label='low score',bins=5)
plt.xlim(0)
plt.title('Imdb Votes Histogram', fontsize=15)
plt.legend();

[Figure: IMDb votes histogram for high-score and low-score movies]

4. Year histogram

Low-scored movies are more likely to be recent movies released after 2000.

High-scored movies are more evenly distributed from 1920 to 2017, although the number of movies has been increasing over the years.

fig = plt.figure(figsize=(16,6))
sns.distplot(df_high['Year'],color='lightblue',label='high score',bins=5)
sns.distplot(df_low['Year'],color='lightgreen',label='low score',bins=5)
plt.xlim(1900,2017)
plt.title('Movie Year Histogram', fontsize=15)
plt.legend();

[Figure: movie year histogram for high-score and low-score movies]

5. Scatter for Nominations and Wins

The graph suggests that Nominations and Wins have a positive linear relationship.

High-scored movies are more likely to have more than 50 wins and more than 50 nominations.

mask_high= df_high[df_high['Year']!=2017]
mask_low= df_low[df_low['Year']!=2017]
fig = plt.figure(figsize=(12,7))
ax = fig.gca()
plt.scatter(mask_high['Nominations'],mask_high['Wins'],c='lightblue',s=80)
plt.scatter(mask_low['Nominations'],mask_low['Wins'],c='lightgreen',s=80)

ax.set_xlabel('Nominations',fontsize=14)
ax.set_ylabel('Wins',fontsize=14)

plt.legend(['high score', 'low score']);

[Figure: scatter plot of Nominations against Wins for high-score and low-score movies]

Feature engineering

1. Get dummy variables for rated

df= df.join(pd.get_dummies(df['Rated'],prefix='rated_'))

2. Train Test split

from sklearn.model_selection import train_test_split
X = df.drop(['imdbRating','imdbID','Labels', 'Rated','DVD','Released'],axis=1)
y = df['Labels']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 42, test_size=.33)
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)

y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
X_train.head(3)
Actors BoxOffice Country Director Genre Language Metascore Plot Runtime Title ... Age rated__APPROVED rated__G rated__GP rated__PG rated__PG-13 rated__R rated__TV-14 rated__TV-MA rated__UNRATED
0 Mia Goth, Martin McCann, Andrew Simpson, Barry... NaN UK Stephen Fingleton Drama, Sci-Fi, Thriller English NaN In a time of starvation, a survivalist lives o... 104 The Survivalist ... 2 0 0 0 0 0 0 0 0 1
1 Charlie Hunnam, Sienna Miller, Tom Holland, Ro... 66320.68 USA James Gray Action, Adventure, Biography English, German, Portuguese, Spanish 84.0 A true-life drama, centering on British explor... 141 The Lost City of Z ... 1 0 0 0 0 1 0 0 0 0
2 Charles Chaplin, Paulette Goddard, Henry Bergm... NaN USA Charles Chaplin Comedy, Drama, Family English 96.0 The Tramp struggles to live in modern industri... 87 Modern Times ... 81 0 1 0 0 0 0 0 0 0

3 rows × 26 columns

3. Create count features from the text columns with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
def cvec(df,columns,n_range,max_f):
    cvec = CountVectorizer(ngram_range=n_range,max_features=max_f)
    cvec.fit(df[columns])
    return pd.DataFrame(cvec.transform(df[columns]).todense(),
             columns=cvec.get_feature_names())
actors = cvec(X_train,'Actors',(2,3),20)
country = cvec(X_train,'Country',(1,1),20)
genre = cvec(X_train,'Genre',(1,1),20)
language = cvec(X_train,'Language',(1,1),20)
writer = cvec(X_train,'Writer',(2,3),20)
title = cvec(X_train,'Title',(1,1),20)
director = cvec(X_train,'Director',(2,3),20)
# Use a separate name so the cvec() helper defined above is not overwritten
plot_cvec = CountVectorizer(ngram_range=(1,1),max_features=20,stop_words='english')
plot_cvec.fit(X_train['Plot'])
plot = pd.DataFrame(plot_cvec.transform(X_train['Plot']).todense(),
         columns=plot_cvec.get_feature_names())
dict_join = {'actors':actors, 'country': country, 'genre': genre, \
             'language': language, 'title': title, 'writer': writer, 'director':director, 'plot': plot}
for k in dict_join:
    X_train = X_train.join(dict_join[k],rsuffix=('_'+k))
from sklearn.feature_extraction.text import CountVectorizer
def cvec(df,columns,n_range,max_f):
    cvec = CountVectorizer(ngram_range=n_range,max_features=max_f)
    cvec.fit(df[columns])
    return pd.DataFrame(cvec.transform(df[columns]).todense(),
             columns=cvec.get_feature_names())
actors_test = cvec(X_test,'Actors',(2,3),20)
country_test = cvec(X_test,'Country',(1,1),20)
genre_test = cvec(X_test,'Genre',(1,1),20)
language_test = cvec(X_test,'Language',(1,1),20)
writer_test = cvec(X_test,'Writer',(2,3),20)
title_test = cvec(X_test,'Title',(1,1),20)
director_test = cvec(X_test,'Director',(2,3),20)
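# Use a separate name so the cvec() helper defined above is not overwritten
plot_cvec_test = CountVectorizer(ngram_range=(1,1),max_features=20,stop_words='english')
plot_cvec_test.fit(X_test['Plot'])
plot_test = pd.DataFrame(plot_cvec_test.transform(X_test['Plot']).todense(),
         columns=plot_cvec_test.get_feature_names())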
dict_join_test = {'actors':actors_test, 'country': country_test, 'genre': genre_test, \
             'language': language_test, 'title': title_test, 'writer': writer_test, \
                  'director':director_test, 'plot': plot_test}
for k in dict_join_test:
    X_test = X_test.join(dict_join_test[k],rsuffix=('_'+k))
X_train = X_train.drop(['Actors','Country','Director','Genre','Language','Plot','Title','Writer','Year'],axis=1)
X_test = X_test.drop(['Actors','Country','Director','Genre','Language','Plot','Title','Writer','Year'],axis=1)
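The code above fits separate vectorizers on X_train and X_test, so the two sets can end up with different vocabularies, and therefore different columns, even when the column counts match. A common alternative is to learn the vocabulary from the training text only and reuse it for the test text; with scikit-learn that means calling `fit` on the training column and `transform` on both. The sketch below shows the idea with a simplified pure-Python stand-in for CountVectorizer; `fit_vocabulary` and `transform` are hypothetical helpers, and the test genres are made up for illustration.

```python
from collections import Counter


def fit_vocabulary(texts, max_features):
    """Learn the most frequent lowercase tokens from the training texts only."""
    counts = Counter(tok for text in texts for tok in text.lower().split(', '))
    # Sort by frequency, then alphabetically, so the result is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:max_features]]


def transform(texts, vocabulary):
    """Count vocabulary tokens per text; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split(', '))
        rows.append([counts[tok] for tok in vocabulary])
    return rows


train_genres = ['Crime, Drama', 'Comedy, Drama, Family', 'Drama, Sci-Fi, Thriller']
test_genres = ['Action, Drama', 'Comedy, Western']   # illustrative values

vocab = fit_vocabulary(train_genres, max_features=3)
train_matrix = transform(train_genres, vocab)
test_matrix = transform(test_genres, vocab)   # same columns as the training matrix
print(vocab)
print(test_matrix)
```

With a shared vocabulary, each column means the same thing in both matrices, which is what the scaler and the models assume.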

4. Impute the mean for BoxOffice and Metascore

from sklearn.preprocessing import Imputer
 
imputer_m = Imputer(strategy='mean',axis=0).fit(X_train[['Metascore']])
imputer_b = Imputer(strategy='mean',axis=0).fit(X_train[['BoxOffice']])
X_train['Metascore'] = imputer_m.transform(X_train[['Metascore']])
X_test['Metascore'] = imputer_m.transform(X_test[['Metascore']])
X_train['BoxOffice'] = imputer_b.transform(X_train[['BoxOffice']])
X_test['BoxOffice'] = imputer_b.transform(X_test[['BoxOffice']])
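The imputers above are fitted on the training data and then applied to both sets, so the test set never influences the fill value. The same idea in plain Python, as a minimal sketch with hypothetical helpers and made-up values:

```python
def fit_mean(values):
    """Mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return sum(observed) / float(len(observed))


def fill_missing(values, fill_value):
    """Replace None entries with fill_value."""
    return [fill_value if v is None else v for v in values]


train_metascore = [80.0, 100.0, None, 84.0]   # illustrative values
test_metascore = [None, 96.0]

train_mean = fit_mean(train_metascore)        # learned from training data only
train_filled = fill_missing(train_metascore, train_mean)
test_filled = fill_missing(test_metascore, train_mean)
print(train_mean, test_filled)
```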

5. Check for missing values and scale features

X_train[X_train.isnull().any(axis=1)]
BoxOffice Metascore Runtime imdbVotes Nominations Wins Oscars Age rated__APPROVED rated__G ... original story original story by screen story screenplay by story by story development story material the book the novel the novel by

0 rows × 177 columns

X_test[X_test.isnull().any(axis=1)]
BoxOffice Metascore Runtime imdbVotes Nominations Wins Oscars Age rated__APPROVED rated__G ... on the on the novel original story original story by screen story screenplay christopher stanley kubrick_writer story by story material the novel

0 rows × 177 columns

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

Random Forest

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
cv = StratifiedKFold(n_splits=5 , shuffle = True, random_state = 0)

for i in [1,2,3,4,5,None]:
    print 'max depth: {}'.format(i)
    clf = DecisionTreeClassifier(max_depth=i)
    print "DT Score:\t", cross_val_score(clf, X_train_s, y_train, cv=cv, n_jobs=1).mean()
max depth: 1
DT Score:	0.700420168067
max depth: 2
DT Score:	0.763977591036
max depth: 3
DT Score:	0.830140056022
max depth: 4
DT Score:	0.851484593838
max depth: 5
DT Score:	0.849103641457
max depth: None
DT Score:	0.825518207283
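The models below use a maximum depth of 4, which matches the best cross-validated score in the output above. The selection can be written out explicitly; the scores here are copied from that output and rounded to four decimals.

```python
# Mean 5-fold accuracies copied from the output above (rounded to 4 decimals)
cv_scores = {
    1: 0.7004,
    2: 0.7640,
    3: 0.8301,
    4: 0.8515,
    5: 0.8491,
    None: 0.8255,   # None means the tree grows without a depth limit
}

# Pick the depth whose mean cross-validated accuracy is highest
best_depth = max(cv_scores, key=cv_scores.get)
print(best_depth, cv_scores[best_depth])
```

The gap between depth 4 and depth 5 is small, so the choice is not sharply determined by these scores alone.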
grid = {
    'n_estimators': [10, 20, 30, 50, 100],
    'max_features': [1,2,3,4,5,6,'auto'],
    'criterion': ['gini','entropy'],
    'class_weight': ["balanced","balanced_subsample",None]
}

cv = StratifiedKFold(n_splits=5 , shuffle = True, random_state = 0)


# A random forest takes tree settings directly, so pass max_depth instead of a tree object
rf = RandomForestClassifier(max_depth=4)
gs = GridSearchCV(rf, grid, cv=cv)

model_rf_gs = gs.fit(X_train_s, y_train)
gs.best_params_
{'class_weight': None,
 'criterion': 'entropy',
 'max_features': 'auto',
 'n_estimators': 100}
rf = RandomForestClassifier(max_depth=4, max_features=gs.best_params_['max_features'], n_estimators=gs.best_params_['n_estimators'],\
                            criterion=gs.best_params_['criterion'],class_weight=gs.best_params_['class_weight'])
model_rf = rf.fit(X_train_s, y_train)
y_pred_train = model_rf.predict(X_train_s)
y_pred_test = model_rf.predict(X_test_s)

# Confusion matrix on test data
pd.DataFrame(confusion_matrix(y_test,y_pred_test,labels=[1,0]),\
            columns=['predicted_high','predicted_low'], index=['is_high','is_low'])
predicted_high predicted_low
is_high 87 4
is_low 41 77
# Accuracy score
print 'accuracy score on training data:', accuracy_score(y_train,y_pred_train)
print 'accuracy score on test data:', accuracy_score(y_test,y_pred_test)
accuracy score on training data: 0.929245283019
accuracy score on test data: 0.784688995215
# Get impurity-based feature importances
feature_importances = pd.DataFrame(model_rf.feature_importances_, 
                                   index = X_train.columns, columns=['importance'])
feature_importances[feature_importances['importance']!=0].sort_values(by='importance', ascending=False)
importance
imdbVotes 0.163171
Metascore 0.130159
Age 0.113337
Wins 0.088582
english 0.063368
Oscars 0.035528
Runtime 0.027717
Nominations 0.027712
usa 0.023858
BoxOffice 0.022196
italian 0.015489
rated__PG-13 0.014189
italy 0.013644
musical 0.013157
hindi 0.010673
sport 0.010616
canada 0.009466
alfred hitchcock 0.008164
based on 0.007927
for 0.006920
rated__APPROVED 0.006887
horror 0.006730
comedy 0.006597
based on the 0.005929
on the 0.005812
rated__UNRATED 0.005732
french 0.005157
adventure 0.004849
war 0.004746
harrison ford 0.004523
... ...
new 0.000495
ethan coen 0.000493
arabic 0.000490
joel coen 0.000485
man 0.000478
created by 0.000455
with 0.000451
latin 0.000448
cantonese 0.000448
in 0.000410
tom hardy 0.000399
sweden 0.000389
american 0.000370
on 0.000358
to 0.000354
quentin tarantino 0.000313
additional story 0.000304
chris pratt 0.000294
ben affleck 0.000273
life 0.000243
street 0.000196
story material 0.000194
germany 0.000180
chinese 0.000176
story 0.000159
original story 0.000147
matthew mcconaughey 0.000129
book by 0.000063
johnny depp 0.000059
francis ford 0.000014

140 rows × 1 columns

Extra trees

grid = {
    'n_estimators': [10, 20, 30, 50, 100],
    'max_features': [1,2,3,4,5,6,'auto'],
    'criterion': ['gini','entropy'],
    'class_weight': ["balanced", "balanced_subsample"]
}

cv = StratifiedKFold(n_splits=5 , shuffle = True, random_state = 0)


# Extra trees take tree settings directly, so pass max_depth instead of a tree object
et = ExtraTreesClassifier(max_depth=4, n_jobs=1)
gs_es = GridSearchCV(et, grid, cv=cv)

model_et_gs = gs_es.fit(X_train_s, y_train)
gs_es.best_params_
{'class_weight': 'balanced',
 'criterion': 'entropy',
 'max_features': 'auto',
 'n_estimators': 100}
es = ExtraTreesClassifier(max_depth=4, max_features=gs_es.best_params_['max_features'], n_estimators=gs_es.best_params_['n_estimators'],\
                            criterion=gs_es.best_params_['criterion'],class_weight=gs_es.best_params_['class_weight'])
model_es = es.fit(X_train_s, y_train)
y_pred_train = model_es.predict(X_train_s)
y_pred_test = model_es.predict(X_test_s)

# Confusion matrix on test data
pd.DataFrame(confusion_matrix(y_test,y_pred_test,labels=[1,0]),\
            columns=['predicted_high','predicted_low'], index=['is_high','is_low'])
predicted_high predicted_low
is_high 86 5
is_low 53 65
# Accuracy score
print 'accuracy score on training data:', accuracy_score(y_train,y_pred_train)
print 'accuracy score on test data:', accuracy_score(y_test,y_pred_test)
accuracy score on training data: 0.926886792453
accuracy score on test data: 0.722488038278
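The test accuracies can be re-derived from the two confusion matrices above, along with precision and recall for the high-rated class. This is a small sketch in plain Python; `summarize` is a hypothetical helper, and the counts are read directly from the matrices.

```python
def summarize(tp, fn, fp, tn):
    """Accuracy, precision and recall for the 'high' class from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        'accuracy': (tp + tn) / float(total),
        'precision': tp / float(tp + fp),
        'recall': tp / float(tp + fn),
    }


# Counts read from the two test-set confusion matrices above
random_forest = summarize(tp=87, fn=4, fp=41, tn=77)
extra_trees = summarize(tp=86, fn=5, fp=53, tn=65)

for name, scores in [('random forest', random_forest), ('extra trees', extra_trees)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Both models miss very few high-rated movies (recall of about 0.96 and 0.95), but they label many low-rated movies as high, giving a precision of about 0.68 for Random Forest and 0.62 for Extra Trees.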
# Get impurity-based feature importances
feature_importances = pd.DataFrame(model_es.feature_importances_, 
                                   index = X_train.columns, columns=['importance'])
feature_importances[feature_importances['importance']!=0].sort_values(by='importance', ascending=False)
importance
Age 0.108544
Metascore 0.079075
usa 0.074534
imdbVotes 0.070593
Oscars 0.063962
english 0.055273
rated__PG-13 0.045749
musical 0.036524
italy 0.033469
canada 0.028664
Wins 0.023811
italian 0.023613
hindi 0.023553
comedy 0.021585
harrison ford 0.017797
horror 0.015362
alfred hitchcock 0.014934
sport 0.014587
rated__UNRATED 0.011067
based on the 0.011026
rated__APPROVED 0.009953
christopher nolan 0.009481
Nominations 0.008576
sergio leone 0.008012
based on 0.007023
french 0.006956
uk 0.006466
Runtime 0.006441
family_genre 0.006251
the 0.005942
... ...
wife 0.000550
australia 0.000543
clint eastwood 0.000542
john huston 0.000529
help 0.000528
rated__PG 0.000509
tom hardy 0.000473
wars 0.000453
ben affleck 0.000439
son 0.000433
joel coen 0.000428
ewan mcgregor 0.000413
star 0.000402
life 0.000381
john musker 0.000373
animation 0.000358
back 0.000355
the book 0.000336
hungarian 0.000321
the novel 0.000255
cantonese 0.000243
ridley scott 0.000226
tom hanks 0.000193
natalie portman 0.000192
tim burton 0.000191
thriller 0.000187
germany 0.000154
gaelic 0.000135
biography 0.000117
novel by 0.000035

134 rows × 1 columns
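The features shared by both models' top five can be read off the two importance tables above. A short check in plain Python, with the names copied in the order they were printed:

```python
# Five most important features per model, in the order printed above
rf_top5 = ['imdbVotes', 'Metascore', 'Age', 'Wins', 'english']
et_top5 = ['Age', 'Metascore', 'usa', 'imdbVotes', 'Oscars']

# Features that appear in both top-five lists
shared = sorted(set(rf_top5) & set(et_top5))
print(shared)
```

The overlap is the number of IMDb votes, Metascore and Age, which is the basis for the summary at the top of this post.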

Written on May 14, 2017