Outbrain Click Modeling

Summary

This is the modeling part of the Outbrain Click Prediction project.

After finishing the EDA part (see my previous post: Outbrain Click Prediction EDA), which really helped me understand the data and its structure, I moved on to the feature engineering stage: combining data into an aggregated table, creating new features, scaling the data and so on.

That left me with a well-prepared dataset to build models on. In this project, I used Logistic Regression, Decision Trees and a Keras neural network for the prediction.

Assumptions

Before I go into the details of my steps, I want to introduce two assumptions for the modeling part.

1. Sample Modeling

Due to the large amount of data (87 million rows) and the need to make everything work on my local computer, I used a sample modeling approach to train my models.

I kept the most recent data as a fixed test dataset. From the rest of the data, I randomly selected 3 train datasets, df_train_1, df_train_2 and df_train_3, for my modeling. This notebook is an example for df_train_1; a minimal sketch of the sampling scheme follows the list below.

  • train: 0.8 million rows
  • test: 0.32 million rows (the test set is 0.4 × the train size)
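A minimal sketch, assuming df is the joined 87-million-row table and that it carries the event timestamp (the real split was done with SQL; the quantile cutoff below is just a stand-in for "most recent"):

# hold out the most recent events as a fixed test set, sample the rest for training
cutoff = df['timestamp'].quantile(0.9)
df_fixed_test = df[df['timestamp'] > cutoff]
df_train_1 = df[df['timestamp'] <= cutoff].sample(n=800000, random_state=1)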

2. NaN Values

Again, due to the large amount of data and very few NaN values, I chose to drop them instead of using imputation methods.

Steps

1. Combining data with SQL

I think the best way to combine large tables like the datasets in this project is SQL - it doesn't eat up my local computer's memory. Before I started the analysis in this notebook, I used PostgreSQL queries to aggregate the tables. My outputs are 3 train samples and a test sample that combine all the information from the click, pageview, event and promoted content tables. A sketch of the kind of join involved is shown below.
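Table and column names here follow the Kaggle Outbrain schema; the exact query is an assumption:

query = """
SELECT c.display_id, c.ad_id, c.clicked,
       p.document_id AS ad_document_id,
       e.document_id, e.timestamp, e.platform, e.geo_location
FROM clicks_train c
JOIN events e           ON c.display_id = e.display_id
JOIN promoted_content p ON c.ad_id = p.ad_id;
"""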

2. Read & Join data

Read data from SQL server and join the train / test data with the document tables.

3. Feature Engineering

  • Create new features: Day, Hour, State, Category difference, Topic difference
  • Get dummified variables for categorical columns (Platform and Hour; Day and State are binarized with a Count Vectorizer)
  • Scale the data

4. Modeling

A. Decision Trees: the accuracy score is very close to the baseline because this model predicts far more zeros than ones, so it is not very helpful for future predictions. - Avg. accuracy rate on test data: 0.814 - Important features: Category difference, Mobile users, Ad category 1503, Ad topic 37

B. Logistic Regression: it gives us the probability of an ad being clicked on. Additionally, it predicts a reasonable number of ones. - Avg. accuracy rate on test data: 0.747 - Important features: Ad topic 37 (negative)

C. Keras: a simple feed-forward neural network. - Avg. accuracy rate on test data: 0.776 - The results are hard to interpret

Next Steps:

  1. Run PCA in order to reduce the # of features
  2. More complex models: random forest, deep learning
import numpy as np
import pandas as pd

import datetime

from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import psycopg2

try:
    conn = psycopg2.connect("dbname='KatieJi' host='localhost'")
except:
    print "I am unable to connect to the database"

Read Data

  • Read train sample data 1 as df_train_1
  • Read test sample data as df_test
  • Read the train / test document-information tables as train_1_doc, train_1_ad_doc, test_doc and test_ad_doc
c = conn.cursor()
# the entire df_train has 87,141,731 rows
df_train_1 = pd.read_sql('SELECT * from df_train_random_1',con=conn)
df_test = pd.read_sql('SELECT * from df_test',con=conn)
df_train_1.dropna(inplace=True)
df_test.dropna(inplace=True)
df_test = df_test.drop(['row_number'],axis=1)
len(df_train_1)
4999934
len(df_test)
1999953
df_train_1.head()
display_id ad_id clicked ad_document_id document_id timestamp platform geo_location
0 9327736 465857 0 2286920 2381779 635179742 2 IN>16
1 7925247 211592 0 1535449 2113032 550512614 1 US>MO>616
2 9205593 436995 1 2109662 2330651 626029929 3 CA>ON
3 3970408 386928 0 1346973 1585713 252743480 1 US>NC>560
4 11278824 87269 0 1110172 1402366 749943363 2 CA>BC
train_1_doc = pd.read_pickle('train_1_doc_cleaned.pkl')
test_doc = pd.read_pickle('test_doc_cleaned.pkl')
len(train_1_doc.columns)
402
len(test_doc.columns)
402
train_1_ad_doc = pd.read_pickle('train_1_ad_doc_cleaned.pkl')
test_ad_doc = pd.read_pickle('test_ad_doc_cleaned.pkl')
len(train_1_ad_doc.columns)
401
len(test_ad_doc.columns)
401
df_train_1 = df_train_1.merge(train_1_doc, on = 'document_id',how='inner' )
train_1_ad_doc.columns = ['ad_'+str(col) for col in train_1_ad_doc.columns]
train_1_ad_doc.head()
ad_document_id ad_1000 ad_1100 ad_1200 ad_1202 ad_1203 ad_1204 ad_1205 ad_1206 ad_1207 ... ad_290 ad_291 ad_292 ad_293 ad_294 ad_295 ad_296 ad_297 ad_298 ad_299
1 3173 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0
8 6399 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.159124 0.0 0.0 0.0
9 7692 0.0 0.0 0.0 0.0 0.07 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0
10 11671 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0
13 12668 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.92 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0

5 rows × 401 columns

Join document tables and train / test tables

  • Add document information into train sample data 1
  • Add document information into test sample data
df_train_1 = df_train_1.merge(train_1_ad_doc, on = 'ad_document_id',how='inner' )
df_train_1.dropna(inplace=True)
len(df_train_1)
1317750
df_train_1.head()
display_id ad_id clicked ad_document_id document_id timestamp platform geo_location 1000 1100 ... ad_290 ad_291 ad_292 ad_293 ad_294 ad_295 ad_296 ad_297 ad_298 ad_299
0 9568038 145924 0 1148812 2381779 647954043 2 US 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 9723207 142732 0 1148812 2381779 655487928 2 US>OH>535 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 10049392 145924 0 1148812 2381779 671399752 2 US>IL>602 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 10370992 189810 1 1148812 2353654 691096022 2 US 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 9456525 189812 0 1148812 2353654 642108520 2 US>CA>807 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 809 columns

df_test = df_test.merge(test_doc, on = 'document_id',how='inner' )
test_ad_doc.columns = ['ad_'+str(col) for col in test_ad_doc.columns]
df_test = df_test.merge(test_ad_doc, on = 'ad_document_id',how='inner' )
df_test.dropna(inplace=True)
len(df_test)
445025
df_test.head()
display_id ad_id clicked ad_document_id document_id timestamp platform geo_location row_number 1000 ... ad_290 ad_291 ad_292 ad_293 ad_294 ad_295 ad_296 ad_297 ad_298 ad_299
0 16871271 547924 0 2823769 300181 1122973454 2 US>MN>613 3 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 16458028 546572 1 2803551 300181 1097299961 2 US>FL>539 3873 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 16462654 546572 0 2803551 2794894 1097582191 2 US>NM>790 3857 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 16447853 546572 0 2803551 65310 1096695884 2 US>TX 3859 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 16458527 546572 0 2803551 65310 1097331155 2 US>CA>807 3867 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 810 columns

Feature Engineering Part I

  • Extracted Hour and Day data from timestamp
  • Extracted State data from geo_location - if the traffic came from outside the US, use ‘Outside US’ for the State column
  • Clean data format for Platform column

1. Extracted Hour and Day data from timestamp (train data)

# dataset timestamps are milliseconds since an epoch offset of 1465876799998
# (i.e. 2016-06-14), per the competition's data description
df_train_1['datetime'] = df_train_1.timestamp.apply(lambda x: datetime.datetime.fromtimestamp((int(x)+1465876799998)/1000.0).\
                             strftime('%Y-%m-%d %H:%M:%S.%f'))
df_train_1['hour'] = df_train_1['datetime'].apply(lambda x: x[11:13])
df_train_1['day'] = df_train_1['datetime'].apply(lambda x: x[:10])
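An equivalent vectorized version (a sketch; note that pd.to_datetime returns UTC while fromtimestamp above uses local time, so the extracted hours can differ by the local offset):

ts = pd.to_datetime(df_train_1['timestamp'].astype('int64') + 1465876799998, unit='ms')
df_train_1['hour'] = ts.dt.strftime('%H')
df_train_1['day'] = ts.dt.strftime('%Y-%m-%d')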

2. Extracted State data from geo_location (train data)

def extract_state(x):
    # geo_location is formatted like 'US>CA>807' (country>state>DMA);
    # all non-US traffic is bucketed into a single 'Outside US' value
    if str(x)[0:2] == 'US':
        try:
            y = str(x)[3:5]
        except:
            y = np.nan
    else:
        y = 'Outside US'
    return y
df_train_1['country'] = df_train_1.geo_location.apply(lambda x: str(x)[0:2])   # first segment is the country code
df_train_1['state'] = df_train_1.geo_location.apply(lambda x: extract_state(x))
df_train_1.head()
display_id ad_id clicked ad_document_id document_id timestamp platform geo_location 1000 1100 ... ad_295 ad_296 ad_297 ad_298 ad_299 datetime hour day country state
0 9568038 145924 0 1148812 2381779 647954043 2 US 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-21 11:59:14.041000 11 2016-06-21 US
1 9723207 142732 0 1148812 2381779 655487928 2 US>OH>535 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-21 14:04:47.926000 14 2016-06-21 US OH
2 10049392 145924 0 1148812 2381779 671399752 2 US>IL>602 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-21 18:29:59.750000 18 2016-06-21 US IL
3 10370992 189810 1 1148812 2353654 691096022 2 US 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-21 23:58:16.020000 23 2016-06-21 US
4 9456525 189812 0 1148812 2353654 642108520 2 US>CA>807 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-21 10:21:48.518000 10 2016-06-21 US CA

5 rows × 814 columns

df_train_1 = df_train_1.drop(['timestamp','datetime','geo_location'],axis=1)
df_train_1.head()
display_id ad_id clicked ad_document_id document_id platform 1000 1100 1200 1202 ... ad_294 ad_295 ad_296 ad_297 ad_298 ad_299 hour day country state
0 9568038 145924 0 1148812 2381779 2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 11 2016-06-21 US
1 9723207 142732 0 1148812 2381779 2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 14 2016-06-21 US OH
2 10049392 145924 0 1148812 2381779 2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 18 2016-06-21 US IL
3 10370992 189810 1 1148812 2353654 2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 23 2016-06-21 US
4 9456525 189812 0 1148812 2353654 2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 10 2016-06-21 US CA

5 rows × 811 columns

len(df_train_1.day.unique())
13
len(df_train_1.hour.unique())
24
len(df_train_1.state.unique())
56
len(df_train_1.country.unique())
221
len(df_train_1.platform.unique())
3
len(df_test.day.unique())
6
len(df_test.hour.unique())
24
len(df_test.state.unique())
55
len(df_test.country.unique())
215
df_test.platform.unique()
array(['2', '1', '3'], dtype=object)

3. Extracted Hour and Day data from timestamp (test data)

df_test['datetime'] = df_test.timestamp.apply(lambda x: datetime.datetime.fromtimestamp((int(x)+1465876799998)/1000.0).\
                             strftime('%Y-%m-%d %H:%M:%S.%f'))
df_test['hour'] = df_test['datetime'].apply(lambda x: x[11:13])
df_test['day'] = df_test['datetime'].apply(lambda x: x[:10])

4. Extracted State data from geo_location (test data)

df_test['country'] = df_test.geo_location.apply(lambda x: str(x)[0:2])   # first segment is the country code
df_test['state'] = df_test.geo_location.apply(lambda x: extract_state(x))
df_test.head()
display_id ad_id clicked ad_document_id document_id timestamp platform geo_location row_number 1000 ... ad_295 ad_296 ad_297 ad_298 ad_299 datetime hour day country state
0 16871271 547924 0 2823769 300181 1122973454 2 US>MN>613 3 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-26 23:56:13.452000 23 2016-06-26 US MN
1 16458028 546572 1 2803551 300181 1097299961 2 US>FL>539 3873 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-26 16:48:19.959000 16 2016-06-26 US FL
2 16462654 546572 0 2803551 2794894 1097582191 2 US>NM>790 3857 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-26 16:53:02.189000 16 2016-06-26 US NM
3 16447853 546572 0 2803551 65310 1096695884 2 US>TX 3859 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-26 16:38:15.882000 16 2016-06-26 US TX
4 16458527 546572 0 2803551 65310 1097331155 2 US>CA>807 3867 0.0 ... 0.0 0.0 0.0 0.0 0.0 2016-06-26 16:48:51.153000 16 2016-06-26 US CA

5 rows × 815 columns

df_test = df_test.drop(['timestamp','datetime','geo_location'],axis=1)
df_test.head()
display_id ad_id clicked ad_document_id document_id platform row_number 1000 1100 1200 ... ad_294 ad_295 ad_296 ad_297 ad_298 ad_299 hour day country state
0 16871271 547924 0 2823769 300181 2 3 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 23 2016-06-26 US MN
1 16458028 546572 1 2803551 300181 2 3873 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 16 2016-06-26 US FL
2 16462654 546572 0 2803551 2794894 2 3857 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 16 2016-06-26 US NM
3 16447853 546572 0 2803551 65310 2 3859 0.0 0.058407 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 16 2016-06-26 US TX
4 16458527 546572 0 2803551 65310 2 3867 0.0 0.058407 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 16 2016-06-26 US CA

5 rows × 812 columns

5. Clean data format for Platform column (test data)

df_test.platform = df_test.platform.map(lambda x: '1' if x == '\\N' else x)   # '\N' marks NULLs in the SQL export; map them to platform '1'

Feature Engineering Part II

  • Get dummies for Platform and Hour columns
  • Use Count Vectorizer for Day and State columns
  • Create X and y variables
  • Create cat_diff and topic_diff columns
  • Data scaling

1. Get dummies for Platform and Hour columns

df_test = pd.get_dummies(df_test,columns=['platform','hour'],drop_first=True)
df_train_1 = pd.get_dummies(df_train_1,columns=['platform','hour'],drop_first=True)
# cut both sets down to the sample sizes stated in the assumptions
df_test = df_test.head(320000)
# drop identifier and raw document-metadata columns that shouldn't be used as features
df_test = df_test.drop(['display_id','ad_id','document_id','ad_document_id'],axis=1)
df_test = df_test.drop(['publisher_id','publish_time','source_id','ad_source_id','ad_publisher_id','ad_publish_time'],axis=1)
df_train_1 = df_train_1.head(800000)
df_train_1 = df_train_1.drop(['display_id','ad_id','document_id','ad_document_id'],axis=1)
df_train_1 = df_train_1.drop(['publisher_id','publish_time','source_id','ad_source_id','ad_publisher_id','ad_publish_time'],axis=1)

2. Use Count Vectorizer for Day and State columns

Note that CountVectorizer tokenizes the strings: a day like ‘2016-06-21’ becomes the tokens 2016, 06 and 21 (the year and month are constant over the sample, so only the day-of-month tokens vary), and the state codes are lowercased, with ‘Outside US’ split into the tokens outside and us.

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(binary=True)

v = v.fit(df_train_1['day'])
df_train_1_day = v.transform(df_train_1['day']).todense()
df_train_1_day = pd.DataFrame(df_train_1_day, columns=v.get_feature_names())
df_test_day = v.transform(df_test['day']).todense()
df_test_day = pd.DataFrame(df_test_day, columns=v.get_feature_names())
df_test_day.head()
06 14 15 16 17 18 19 20 2016 21 22 23 24 25 26
0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
2 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
3 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
4 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
df_train_1_day.head()
06 14 15 16 17 18 19 20 2016 21 22 23 24 25 26
0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
2 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
3 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
4 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(binary=True)

v = v.fit(df_train_1['state'])
df_train_1_state = v.transform(df_train_1['state']).todense()
df_train_1_state = pd.DataFrame(df_train_1_state, columns=v.get_feature_names())
df_test_state = v.transform(df_test['state']).todense()
df_test_state = pd.DataFrame(df_test_state, columns=v.get_feature_names())
df_train_1_state.head()
aa ae ak al ap ar az ca co ct ... tn tx us ut va vt wa wi wv wy
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 56 columns

df_test_state.head()
aa ae ak al ap ar az ca co ct ... tn tx us ut va vt wa wi wv wy
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 56 columns

df_train_1 = df_train_1.drop(['day','state'],axis=1)
df_test = df_test.drop(['day','state'],axis=1)
df_train_1 = pd.concat([df_train_1,df_train_1_day,df_train_1_state],axis=1)
df_test = pd.concat([df_test,df_test_day,df_test_state],axis=1)
df_train_1.dropna(inplace=True)
df_test.dropna(inplace=True)

3. Create X and y variables

y_train_1 = df_train_1.clicked
y_test = df_test.clicked
X_train_1 = df_train_1.drop(['clicked'],axis=1)
X_test = df_test.drop(['clicked'],axis=1)
len(X_train_1.columns)
891
len(X_test.columns)
892
len(X_train_1)
800000
len(y_train_1)
800000
X_train_1.shape
(799997, 891)
X_test.shape
(319989, 891)
X_train_1.head()
1000 1100 1200 1202 1203 1204 1205 1206 1207 1208 ... tn tx us ut va vt wa wi wv wy
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 891 columns

4. Create cat_diff and topic_diff columns

  • cat_diff is the sum of absolute differences between the document category scores and the ad category scores, so the bigger cat_diff is, the larger the difference between the document and the ad in terms of category (a vectorized sketch follows the loops below)
  • the same rule applies to topic_diff
doc_col = train_1_doc.columns
ad_col = train_1_ad_doc.columns
# columns 1-97 hold the category scores and columns 101-400 the topic scores
doc_col[400]
299
doc_col[101]
0
X_train_1['cat_diff'] =0
for i in range(1,98):
    X_train_1['cat_diff'] = X_train_1['cat_diff'] + abs(X_train_1[doc_col[i]]- X_train_1[ad_col[i]])
X_train_1['topic_diff'] =0
for i in range(101,401):
    X_train_1['topic_diff'] = X_train_1['topic_diff'] + abs(X_train_1[doc_col[i]]- X_train_1[ad_col[i]])
X_test['cat_diff'] =0
for i in range(1,98):
    X_test['cat_diff'] = X_test['cat_diff'] + abs(X_test[doc_col[i]]- X_test[ad_col[i]])
X_test['topic_diff'] =0
for i in range(101,401):
    X_test['topic_diff'] = X_test['topic_diff'] + abs(X_test[doc_col[i]]- X_test[ad_col[i]])
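These Python-level loops over hundreds of columns are slow; a vectorized sketch of the same computation (it assumes, as the loops do, that doc_col[i] and ad_col[i] refer to the same category/topic id):

X_train_1['cat_diff'] = np.abs(
    X_train_1[doc_col[1:98]].values - X_train_1[ad_col[1:98]].values).sum(axis=1)
X_train_1['topic_diff'] = np.abs(
    X_train_1[doc_col[101:401]].values - X_train_1[ad_col[101:401]].values).sum(axis=1)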
X_train_1.dropna(inplace=True)
X_test.dropna(inplace=True)

5. Scale the data with MinMaxScaler (fit on train only, then applied to both sets)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler = scaler.fit(X_train_1)
X_train_1_s = scaler.transform(X_train_1)
X_test_s = scaler.transform(X_test)

6. Baseline accuracy score (always predicting the majority class, ‘not clicked’)

y_zeros = np.zeros(len(y_test))
print 'accuracy score on zeros:', accuracy_score(y_test,y_zeros)
accuracy score on zeros: 0.814337367847
X_train_1.head()
1000 1100 1200 1202 1203 1204 1205 1206 1207 1208 ... us ut va vt wa wi wv wy cat_diff topic_diff
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.219692 0.419998
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.219692 0.419998
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.219692 0.419998
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.224484 0.360472
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.224484 0.360472

5 rows × 893 columns

Modeling

  • Decision trees
  • Logistic Regression
  • Keras

1. Decision trees

cv = StratifiedKFold(n_splits=3, random_state=21, shuffle=True)
for i in [4,5,6,7,8,10]:
    print 'max depth: {}'.format(i)
    clf = DecisionTreeClassifier(max_depth=i)
    print "DT Score:\t", cross_val_score(clf, X_train_1_s, y_train_1, cv=cv, n_jobs=1).mean()
max depth: 4
DT Score:	0.822213750082
max depth: 5
DT Score:	0.822554999951
max depth: 6
DT Score:	0.823117499829
max depth: 7
DT Score:	0.82309874989
max depth: 8
DT Score:	0.823462500124
max depth: 10
DT Score:	0.824014999054
dt = DecisionTreeClassifier(max_depth=8)
model_dt = dt.fit(X_train_1_s,y_train_1)
len(X_train_1_s)
799997
y_pred_train = model_dt.predict(X_train_1_s)
y_pred_test = model_dt.predict(X_test_s)

print 'accuracy score on training data:', accuracy_score(y_train_1,y_pred_train)
print 'accuracy score on test data:', accuracy_score(y_test,y_pred_test)
accuracy score on training data: 0.823893089599
accuracy score on test data: 0.813246705355
feature_importances = pd.DataFrame(model_dt.feature_importances_, 
                                   index = X_train_1.columns, columns=['importance'])
feature_importances[feature_importances['importance']!=0].sort_values(by='importance', ascending=False)
importance
cat_diff 0.132341
ad_1503 0.094225
ad_37 0.088036
ad_16 0.058429
platform_2 0.057949
ad_234 0.047426
ad_105 0.043944
ad_1403 0.043020
ad_1000 0.042945
ad_1609 0.039736
ad_183 0.036692
ad_145 0.034606
ad_258 0.029758
ad_242 0.027735
ad_243 0.026088
ad_1702 0.022672
ad_36 0.019138
ad_1510 0.017814
ad_10 0.015826
ad_1515 0.014082
1806 0.012370
ad_1610 0.009848
count 0.007107
ad_74 0.006592
ad_292 0.005249
ad_100 0.004939
ad_285 0.004251
ad_138 0.003731
1510 0.003723
22 0.002326
... ...
1611 0.000220
mo 0.000218
173 0.000207
269 0.000207
wa 0.000200
1408 0.000200
249 0.000198
hour_09 0.000182
19 0.000176
1604 0.000167
172 0.000156
hour_23 0.000156
118 0.000156
hour_16 0.000149
93 0.000144
247 0.000142
20 0.000142
56 0.000137
tn 0.000132
1610 0.000130
ny 0.000129
co 0.000121
ga 0.000120
21 0.000118
257 0.000114
24 0.000101
140 0.000091
nj 0.000085
mi 0.000071
sc 0.000065

133 rows × 1 columns

y_pred_test.sum()
1147.0
y_test.sum()
59410.0
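The confusion matrix and classification report (both imported at the top) make the imbalance explicit - the tree predicts roughly 1,100 positives against roughly 59,000 actual clicks in the test sample:

print confusion_matrix(y_test, y_pred_test)
print classification_report(y_test, y_pred_test)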

2. Logistic Regression

from sklearn.linear_model import LogisticRegression
grid = {
    'C': [5,50],
    'penalty': ['l1','l2']
}


lr = LogisticRegression()
gs = GridSearchCV(lr, grid)

model_lr_gs = gs.fit(X_train_1_s, y_train_1)
gs.best_params_
{'C': 5, 'penalty': 'l2'}
lr = LogisticRegression(penalty='l1',C=1)   # the final model uses l1 / C=1 rather than the grid-search best params
model_lr = lr.fit(X_train_1_s,y_train_1)
y_pred_lr = model_lr.predict(X_test_s)
y_pred_lr_train = model_lr.predict(X_train_1_s)
y_pred_lr.sum()
43466.0
model_lr.predict_proba(X_test_s)
array([[ 0.59788266,  0.40211734],
       [ 0.82896689,  0.17103311],
       [ 0.82739831,  0.17260169],
       ..., 
       [ 0.57050048,  0.42949952],
       [ 0.71194759,  0.28805241],
       [ 0.67568248,  0.32431752]])
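The second predict_proba column is the probability of clicked = 1; that is what you would keep for ranking the ads within each display (the competition scores by MAP@12):

click_proba = model_lr.predict_proba(X_test_s)[:,1]   # P(clicked) per impression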
print 'accuracy score on test:', accuracy_score(y_test,y_pred_lr)
accuracy score on test: 0.733647094119
print 'accuracy score on train:', accuracy_score(y_train_1,y_pred_lr_train)
accuracy score on train: 0.822663084987
pd.DataFrame(model_lr.coef_.T,index=X_train_1.columns,columns=['features']).sort_values(by='features')
features
ad_170 -13.240516
ad_91 -10.625985
ad_280 -9.088172
ad_219 -8.487405
ad_124 -7.855157
ad_62 -7.640861
ad_74 -6.192028
ad_189 -5.046144
ad_263 -4.192241
ad_186 -4.142643
ad_77 -3.936205
ad_1607 -3.787026
ad_230 -3.662474
ad_135 -2.998066
ad_65 -2.562998
ad_1306 -2.483390
ad_226 -2.447212
ad_245 -2.352690
ad_1305 -2.342052
ad_276 -2.322265
ad_192 -2.113407
ad_281 -2.103492
ad_191 -1.978850
ad_47 -1.921278
ad_162 -1.876128
ad_139 -1.843883
ad_165 -1.740234
ad_36 -1.712579
ad_116 -1.668522
1709 -1.629266
... ...
ad_212 1.558198
ad_253 1.558306
ad_1911 1.573074
ad_1912 1.628766
ad_90 1.696874
ad_1915 1.698822
ad_141 1.710795
ad_1705 1.756154
ad_222 1.765685
ad_1807 1.777869
ad_1507 1.873391
ad_108 1.955805
ad_2002 2.039848
ad_1909 2.048332
ad_7 2.096812
ad_38 2.201761
ad_13 2.214839
ad_1000 2.300956
ad_1806 2.315986
ad_1307 2.362928
35 2.459311
ad_17 2.794878
ad_1707 2.831282
ad_1302 2.851958
ad_202 3.692022
ad_287 3.844629
ad_266 4.023591
ad_1512 4.113135
ad_214 5.226553
ad_23 7.908640

893 rows × 1 columns

3. Keras

X_train_1_s.shape
(799997, 893)
from keras.layers import Dense, Dropout
from keras.models import Sequential

model = Sequential()
Using TensorFlow backend.
# three fully-connected ReLU hidden layers with dropout; the sigmoid output is the click probability
model.add(Dense(512,input_dim=893,kernel_initializer='uniform',activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(200,kernel_initializer='uniform',activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(100,kernel_initializer='uniform',activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model.fit(X_train_1_s, y_train_1, batch_size = 1000, epochs =10, verbose =1, validation_data=(X_test_s, y_test))
Train on 799997 samples, validate on 319989 samples
Epoch 1/10
799997/799997 [==============================] - 385s - loss: 0.4401 - acc: 0.8213 - val_loss: 0.5727 - val_acc: 0.8107
Epoch 2/10
799997/799997 [==============================] - 409s - loss: 0.4235 - acc: 0.8249 - val_loss: 0.5893 - val_acc: 0.7970
Epoch 3/10
799997/799997 [==============================] - 423s - loss: 0.4166 - acc: 0.8274 - val_loss: 0.6057 - val_acc: 0.7924
Epoch 4/10
799997/799997 [==============================] - 659s - loss: 0.4105 - acc: 0.8299 - val_loss: 0.6211 - val_acc: 0.7931
Epoch 5/10
799997/799997 [==============================] - 602s - loss: 0.4043 - acc: 0.8326 - val_loss: 0.6380 - val_acc: 0.7858
Epoch 6/10
799997/799997 [==============================] - 577s - loss: 0.3975 - acc: 0.8354 - val_loss: 0.6720 - val_acc: 0.7850
Epoch 7/10
799997/799997 [==============================] - 759s - loss: 0.3905 - acc: 0.8385 - val_loss: 0.6929 - val_acc: 0.7816
Epoch 8/10
799997/799997 [==============================] - 687s - loss: 0.3835 - acc: 0.8409 - val_loss: 0.7014 - val_acc: 0.7776
Epoch 9/10
799997/799997 [==============================] - 891s - loss: 0.3771 - acc: 0.8440 - val_loss: 0.7312 - val_acc: 0.7741
Epoch 10/10
799997/799997 [==============================] - 369s - loss: 0.3705 - acc: 0.8464 - val_loss: 0.7575 - val_acc: 0.7766





<keras.callbacks.History at 0x1092dd290>
score = model.evaluate(X_test_s, y_test,batch_size=1000)
319989/319989 [==============================] - 19s    
y_pred_test = model.predict_classes(X_test_s, batch_size=1000)
319989/319989 [==============================] - 19s    
y_pred_test.sum()
24849
y_test.sum()
59410.0
score
[0.75753827237961135, 0.77660169742361829]
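val_loss climbs from the very first epoch while the training loss keeps falling - a classic overfitting pattern. A sketch of cutting training off early with Keras’ built-in callback:

from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(X_train_1_s, y_train_1, batch_size=1000, epochs=10, verbose=1,
          validation_data=(X_test_s, y_test), callbacks=[early_stop])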

Next Steps: use AWS to run

  • Random Forest
  • PCA

1. Random Forest

grid = {
    'n_estimators': [5],
    'max_features': [1/3.0,'auto'],
    'criterion': ['gini','entropy'],
    'class_weight': ["balanced",None]
}

rf = RandomForestClassifier(max_depth=8)
gs = GridSearchCV(rf, grid)

model_rf_gs = gs.fit(X_train_1_s, y_train_1)
gs.best_params_

2. PCA

pca_train_1_doc = X_train_1[X_train_1.columns[:397]]
pca_train_1_ad = X_train_1[X_train_1.columns[398:795]]
from sklearn.decomposition import PCA
pca = PCA().fit(pca_train_1_doc)
pca_df = pd.DataFrame(pca.transform(pca_train_1_doc),columns=['pca_doc_' + str(i) for i in range(1,398)])
pca_df.head()
pca_doc_1 pca_doc_2 pca_doc_3 pca_doc_4 pca_doc_5 pca_doc_6 pca_doc_7 pca_doc_8 pca_doc_9 pca_doc_10 ... pca_doc_388 pca_doc_389 pca_doc_390 pca_doc_391 pca_doc_392 pca_doc_393 pca_doc_394 pca_doc_395 pca_doc_396 pca_doc_397
0 0.814322 0.159102 0.04568 0.014588 -0.016929 -0.038994 -0.015696 -0.009949 0.009235 0.006185 ... 5.689065e-34 -5.648666e-34 3.880497e-34 3.828385e-34 -3.041575e-49 -4.497263e-50 0.0 0.0 1.231052e-16 -4.002307e-17
1 0.814322 0.159102 0.04568 0.014588 -0.016929 -0.038994 -0.015696 -0.009949 0.009235 0.006185 ... 5.689065e-34 -5.648666e-34 3.880497e-34 3.828385e-34 -3.041575e-49 -4.497263e-50 0.0 0.0 1.231052e-16 -4.002307e-17
2 0.814322 0.159102 0.04568 0.014588 -0.016929 -0.038994 -0.015696 -0.009949 0.009235 0.006185 ... 5.689065e-34 -5.648666e-34 3.880497e-34 3.828385e-34 -3.041575e-49 -4.497263e-50 0.0 0.0 1.231052e-16 -4.002307e-17
3 0.609322 0.103691 0.02265 0.006223 -0.005915 -0.011015 -0.006361 0.003796 0.002637 0.005794 ... -1.777675e-34 3.444746e-35 1.138304e-34 -1.654121e-34 2.140248e-49 1.407010e-49 0.0 0.0 1.946151e-16 -1.535390e-16
4 0.609322 0.103691 0.02265 0.006223 -0.005915 -0.011015 -0.006361 0.003796 0.002637 0.005794 ... -1.777675e-34 3.444746e-35 1.138304e-34 -1.654121e-34 2.140248e-49 1.407010e-49 0.0 0.0 1.946151e-16 -1.535390e-16

5 rows × 397 columns

pca.explained_variance_ratio_
array([  1.71977408e-01,   1.21430998e-01,   8.13294670e-02, ...,
         9.20868969e-34,   3.14895564e-34,   3.59429841e-35])
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(range(1,398), pca.explained_variance_ratio_, lw=2)
ax.scatter(range(1, 398), pca.explained_variance_ratio_, s=100)
ax.set_title('explained variance of components')
ax.set_xlabel('principal component')
ax.set_ylabel('explained variance');

[figure: explained variance of components]

cum_var_exp = np.cumsum(pca.explained_variance_ratio_)*100
plt.figure(figsize=(9,7))
component_number = range(1,398)
plt.plot(component_number, cum_var_exp, lw=7)
plt.axhline(y=0, linewidth=5, color='grey', ls='dashed')
plt.axhline(y=100, linewidth=3, color='grey', ls='dashed')
ax = plt.gca()
ax.set_xlim([1,398])
ax.set_ylim([-5,105])
ax.set_ylabel('cumulative variance explained', fontsize=16)
ax.set_xlabel('component', fontsize=16)  
ax.set_title('component vs cumulative variance explained\n', fontsize=20);

[figure: component vs cumulative variance explained]
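Given the cumulative curve, a small sketch for choosing how many components to keep - e.g. the smallest number that explains 90% of the variance:

n_keep = np.argmax(cum_var_exp >= 90) + 1   # cum_var_exp is in percent
print 'components for 90% of variance:', n_keep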

pca.components_[0]
array([ -4.27775135e-04,  -1.97595238e-03,   2.22044605e-16, ...,
        -1.00226103e-03,  -0.00000000e+00,  -0.00000000e+00])
weight_list =[]
pca1_evec = pca.components_[0]
for weight, event in zip(pca1_evec, pca_train_1_doc.columns):
    print event, weight
    weight_list.append(weight)
1000 -2.15749224858e-10
1100 -6.85213441542e-10
1200 -0.0
1202 -0.0
1203 -1.5477100268e-10
...
298 -0.0
299 -0.0
count 1.0

(output truncated: nearly every category/topic weight is within rounding error of zero, while the count column dominates the first component)
# for reference: the official Kaggle clicks_test.csv file
df_test = pd.read_csv("datasets/clicks_test.csv")
len(df_test)
32225162
Written on July 4, 2017