Project 2 Post

Summary

This project is an EDA on Top 100 Billboard Tracks in 2000. The raw data have the following features with 317 rows

track names
artists
genre
date-entered
date-peaked
ranks by week

Hypothesis

My goal is to test two hypothesis on a 95% confidence level, which focused on the Genre column

Hypothesis 1: Country genre has has no difference with overall music in terms of number of days between the tracked entered and peaked on Billboard

Hypothesis 2: Rock genre has no difference with R&B genre in terms of the number of weeks the track stayed on Billboard

Project steps

Here are the steps I took:

Get data:

Took a quick look at the data
Identified assumptions and problems
- I found mistakes on the data entry. For example, Breathe by Faith Hill was considered as Rap, while I would consider it as pop music.
- I assumed the peaked date was when the track had the highest rank on Billboard, even if some tracks were entered before 2000.

Clean Data

Before I went further to my analysis, I spent quite an amount of time on cleaning data. As I was cleaning I revealed new problems. Over and over, the dataset became easier and nicer to work on.

Create a function to convert * to NaN and removed columns after ‘x65th.week’ because those columns only contains NaN values
Clean ‘time’, ‘date.entered’ and ‘date.peaked’ columns
- Use functions like str.replace, .strip, to_datetime.dt.time and tp_datetime
Clean week columns to float
Clean column ‘genre’
- Combine same values in different format, i.e. ‘R&B’ and ‘R & B’, ‘Rock’ and ‘Rock’n’roll’

Analyze Data

Since I was interested in the number of weeks that tracks stayed on Billboard, as well as the time difference between the tracked entered and peaked on Billboard, I created 2 new features.

Created ‘diff.between.peaked.and.enetered’ column
- Substracted ‘date.entered’ from ‘date.peaked’ and converted them into integer
Create ‘no.of.weeks.on.billboard’ column
- Used a for loop to count # of weeks
- Transformed the list into pandas series
Visualizations
- Days between entered and peaked histogram
- Linear regression scatter plot

Test Results

Hypothesis 1: Country genre has has no difference with overall music in terms of number of days between the tracked entered and peaked on Billboard

P-value is much less than 0.05. so we do not accept H0. We believe there is a difference between country music and overall music in terms of number of days between the track peaked and entered

Hypothesis 2: Rock genre has no difference with R&B genre in terms of the number of weeks the track stayed on Billboard

P-value is much less than 0.05. so we do not accept H0. We believe there is a difference between R&B and Rock music in terms of Billboard length

Next Steps

In my next steps, I would like to

Fixed the data entry mistakes in order to get more accurate resutls
Built a function calculate pvalues for each pair of genre combination

Get data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
%matplotlib inline

df = pd.read_csv("assets/billboard.csv")

df.head()

	year	artist.inverted	track	time	genre	date.entered	date.peaked	x1st.week	x2nd.week	x3rd.week	...	x67th.week	x68th.week	x69th.week	x70th.week	x71st.week	x72nd.week	x73rd.week	x74th.week	x75th.week	x76th.week
0	2000	Destiny's Child	Independent Women Part I	3,38,00 AM	Rock	September 23, 2000	November 18, 2000	78	63	49	...	*	*	*	*	*	*	*	*	*	*
1	2000	Santana	Maria, Maria	4,18,00 AM	Rock	February 12, 2000	April 8, 2000	15	8	6	...	*	*	*	*	*	*	*	*	*	*
2	2000	Savage Garden	I Knew I Loved You	4,07,00 AM	Rock	October 23, 1999	January 29, 2000	71	48	43	...	*	*	*	*	*	*	*	*	*	*
3	2000	Madonna	Music	3,45,00 AM	Rock	August 12, 2000	September 16, 2000	41	23	18	...	*	*	*	*	*	*	*	*	*	*
4	2000	Aguilera, Christina	Come On Over Baby (All I Want Is You)	3,38,00 AM	Rock	August 5, 2000	October 14, 2000	57	47	45	...	*	*	*	*	*	*	*	*	*	*

5 rows × 83 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 83 columns):
year               317 non-null int64
artist.inverted    317 non-null object
track              317 non-null object
time               317 non-null object
genre              317 non-null object
date.entered       317 non-null object
date.peaked        317 non-null object
x1st.week          317 non-null int64
x2nd.week          317 non-null object
x3rd.week          317 non-null object
x4th.week          317 non-null object
x5th.week          317 non-null object
x6th.week          317 non-null object
x7th.week          317 non-null object
x8th.week          317 non-null object
x9th.week          317 non-null object
x10th.week         317 non-null object
x11th.week         317 non-null object
x12th.week         317 non-null object
x13th.week         317 non-null object
x14th.week         317 non-null object
x15th.week         317 non-null object
x16th.week         317 non-null object
x17th.week         317 non-null object
x18th.week         317 non-null object
x19th.week         317 non-null object
x20th.week         317 non-null object
x21st.week         317 non-null object
x22nd.week         317 non-null object
x23rd.week         317 non-null object
x24th.week         317 non-null object
x25th.week         39 non-null object
x26th.week         37 non-null object
x27th.week         30 non-null object
x28th.week         317 non-null object
x29th.week         317 non-null object
x30th.week         317 non-null object
x31st.week         317 non-null object
x32nd.week         317 non-null object
x33rd.week         317 non-null object
x34th.week         317 non-null object
x35th.week         317 non-null object
x36th.week         317 non-null object
x37th.week         317 non-null object
x38th.week         317 non-null object
x39th.week         317 non-null object
x40th.week         317 non-null object
x41st.week         317 non-null object
x42nd.week         317 non-null object
x43rd.week         317 non-null object
x44th.week         317 non-null object
x45th.week         317 non-null object
x46th.week         317 non-null object
x47th.week         317 non-null object
x48th.week         317 non-null object
x49th.week         317 non-null object
x50th.week         317 non-null object
x51st.week         317 non-null object
x52nd.week         317 non-null object
x53rd.week         317 non-null object
x54th.week         317 non-null object
x55th.week         317 non-null object
x56th.week         317 non-null object
x57th.week         317 non-null object
x58th.week         317 non-null object
x59th.week         317 non-null object
x60th.week         317 non-null object
x61st.week         317 non-null object
x62nd.week         317 non-null object
x63rd.week         317 non-null object
x64th.week         317 non-null object
x65th.week         317 non-null object
x66th.week         317 non-null object
x67th.week         317 non-null object
x68th.week         317 non-null object
x69th.week         317 non-null object
x70th.week         317 non-null object
x71st.week         317 non-null object
x72nd.week         317 non-null object
x73rd.week         317 non-null object
x74th.week         317 non-null object
x75th.week         317 non-null object
x76th.week         317 non-null object
dtypes: int64(2), object(81)
memory usage: 205.6+ KB

Clean Data

Replace * to NaN
Remove NaN value columns
Convert time column to datetime
Convert rank column to float
Clean genre column

# replace * to NaN
def _convert_nan(value):
    if value == '*':
        return np.nan
    else:
        return value
df = df.applymap(_convert_nan)

# Remove column 'year' and columns after 'x66th.week' - these columns only contains year 2000 or Nan
df= df.loc[:,'artist.inverted':'x65th.week']

# Convert time to the correct format
df['time'] = df['time'].str.replace(",",":")
df['time'] = df['time'].str.strip(' AM')

df['time'] = pd.to_datetime(df['time']).dt.time

# Convert date to datetime type
df['date.entered'] = pd.to_datetime(df['date.entered'])
df['date.peaked'] = pd.to_datetime(df['date.peaked'])

# Convert rank to float
df.loc[:,'x1st.week':'x65th.week'] = df.loc[:,'x1st.week':'x65th.week'].applymap(lambda x: float(x))

# Replace 'R & B' with 'R&B and 'Rock'n'roll' with 'Rock' due to typo
df['genre'] = df['genre'].apply(lambda x: x.replace('R & B','R&B'))
df['genre'] = df['genre'].apply(lambda x: x.replace("Rock'n'roll","Rock"))

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 71 columns):
artist.inverted    317 non-null object
track              317 non-null object
time               317 non-null object
genre              317 non-null object
date.entered       317 non-null datetime64[ns]
date.peaked        317 non-null datetime64[ns]
x1st.week          317 non-null float64
x2nd.week          312 non-null float64
x3rd.week          307 non-null float64
x4th.week          300 non-null float64
x5th.week          292 non-null float64
x6th.week          280 non-null float64
x7th.week          269 non-null float64
x8th.week          260 non-null float64
x9th.week          253 non-null float64
x10th.week         244 non-null float64
x11th.week         236 non-null float64
x12th.week         222 non-null float64
x13th.week         210 non-null float64
x14th.week         204 non-null float64
x15th.week         197 non-null float64
x16th.week         182 non-null float64
x17th.week         177 non-null float64
x18th.week         166 non-null float64
x19th.week         156 non-null float64
x20th.week         146 non-null float64
x21st.week         65 non-null float64
x22nd.week         55 non-null float64
x23rd.week         48 non-null float64
x24th.week         46 non-null float64
x25th.week         38 non-null float64
x26th.week         36 non-null float64
x27th.week         29 non-null float64
x28th.week         24 non-null float64
x29th.week         20 non-null float64
x30th.week         20 non-null float64
x31st.week         19 non-null float64
x32nd.week         18 non-null float64
x33rd.week         12 non-null float64
x34th.week         10 non-null float64
x35th.week         9 non-null float64
x36th.week         9 non-null float64
x37th.week         9 non-null float64
x38th.week         8 non-null float64
x39th.week         8 non-null float64
x40th.week         7 non-null float64
x41st.week         7 non-null float64
x42nd.week         6 non-null float64
x43rd.week         6 non-null float64
x44th.week         6 non-null float64
x45th.week         5 non-null float64
x46th.week         5 non-null float64
x47th.week         5 non-null float64
x48th.week         4 non-null float64
x49th.week         4 non-null float64
x50th.week         4 non-null float64
x51st.week         4 non-null float64
x52nd.week         4 non-null float64
x53rd.week         4 non-null float64
x54th.week         2 non-null float64
x55th.week         2 non-null float64
x56th.week         2 non-null float64
x57th.week         2 non-null float64
x58th.week         2 non-null float64
x59th.week         2 non-null float64
x60th.week         2 non-null float64
x61st.week         2 non-null float64
x62nd.week         2 non-null float64
x63rd.week         2 non-null float64
x64th.week         2 non-null float64
x65th.week         1 non-null float64
dtypes: datetime64[ns](2), float64(65), object(4)
memory usage: 175.9+ KB

Analysis

Create new columns ‘diff. between date peaked and entered’ and ‘no. of weeks on billboard’
Create histograms for new columns
Create scatter plot for ‘diff. between date peaked and entered’ and ‘no. of weeks on billboard’
Pivot table on ‘genre’

# Create a new column of diff. between date peaked and entered
df['diff.between.peaked.and.enetered'] = (df['date.peaked']-df['date.entered']).astype(str)

# Convert the column to integer
df['diff.between.peaked.and.enetered'] = df['diff.between.peaked.and.enetered'].apply(lambda x: int(x[0:2]))

# Create a column for no. of weeks on billboard
# Slice data into Week on billboard only
week_df = df.loc[:,'x1st.week':'x65th.week']

# Create a list of no. of weeks on billboard for each song
# Add the column to the dataframe
no_week_list=[]
for i in range(len(week_df)):
    no_week_list.append(week_df.iloc[i,:].count())
df['no.of.weeks.on.billboard'] = pd.Series(no_week_list)

# Create histograms for new columns
fig, axes = plt.subplots(2,1, figsize = (10,7))
fig.subplots_adjust(wspace = 0.25, hspace = 0.5)

axes[0].hist(df['diff.between.peaked.and.enetered'])
axes[0].set_title('Days between peaked and entered')
axes[1].hist(df['no.of.weeks.on.billboard'])
axes[1].set_title('Days on Billboard');

png

# Create scatter plot and regression for 'diff. between date peaked and entered' and 'no. of weeks on billboard'

regr = LinearRegression()
X = df[['diff.between.peaked.and.enetered']]
y = df['no.of.weeks.on.billboard']
model = regr.fit(X,y)

fig, ax = plt.subplots(figsize=(12,6))

ax.scatter(X,y,c='k')
ax.plot(X,regr.predict(X),color='r');
print model.score(X,y)

/Users/KatieJi/anaconda/lib/python2.7/site-packages/scipy/linalg/basic.py:1018: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)


0.220245120868

png

df.head()

	artist.inverted	track	time	genre	date.entered	date.peaked	x1st.week	x2nd.week	x3rd.week	x4th.week	...	x58th.week	x59th.week	x60th.week	x61st.week	x62nd.week	x63rd.week	x64th.week	x65th.week	diff.between.peaked.and.enetered	no.of.weeks.on.billboard
0	Destiny's Child	Independent Women Part I	03:38:00	Rock	2000-09-23	2000-11-18	78.0	63.0	49.0	33.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	56	28
1	Santana	Maria, Maria	04:18:00	Rock	2000-02-12	2000-04-08	15.0	8.0	6.0	5.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	56	26
2	Savage Garden	I Knew I Loved You	04:07:00	Rock	1999-10-23	2000-01-29	71.0	48.0	43.0	31.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	98	33
3	Madonna	Music	03:45:00	Rock	2000-08-12	2000-09-16	41.0	23.0	18.0	14.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	35	24
4	Aguilera, Christina	Come On Over Baby (All I Want Is You)	03:38:00	Rock	2000-08-05	2000-10-14	57.0	47.0	45.0	29.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	70	21

5 rows × 73 columns

# See value counts on 'genre'
df['genre'].value_counts()

Rock           137
Country         74
Rap             58
R&B             23
Pop              9
Latin            9
Electronica      4
Gospel           1
Jazz             1
Reggae           1
Name: genre, dtype: int64

# Pivot table on genre
pd.pivot_table(df,index = 'genre', values = ['diff.between.peaked.and.enetered','no.of.weeks.on.billboard'],aggfunc = (np.mean, np.std))

	diff.between.peaked.and.enetered		no.of.weeks.on.billboard
	mean	std	mean	std
genre
Country	52.472973	25.048568	16.216216	7.270263
Electronica	61.250000	10.500000	18.000000	4.000000
Gospel	56.000000	NaN	20.000000	NaN
Jazz	14.000000	NaN	5.000000	NaN
Latin	51.222222	32.976423	19.222222	7.293452
Pop	47.444444	25.666667	15.222222	8.120618
R&B	28.608696	24.332431	11.347826	6.176265
Rap	34.862069	24.914907	14.431034	8.764186
Reggae	42.000000	NaN	15.000000	NaN
Rock	41.343066	28.954436	18.883212	10.195202

Evaluate Statements

Statement 1: Country music has no difference with overall music in terms of the no. of days between peaked and entered
H0: country musice mean = overall mean H1: country musice mean != overall mean
Statement 2: Rock music has no difference with R&B music in terms of the no. of weeks on billboard
H0: rock musice mean = R&B music mean H1: rock musice mean > R&B music mean

# Statement 1 Test, set alpha = 0.05
print stats.ttest_1samp(df[df['genre']=='Country']['diff.between.peaked.and.enetered'],overall_mean)

Ttest_1sampResult(statistic=3.4255181858814843, pvalue=0.0010106239460667809)

Pvalue is less than 0.05. We do not accept H0. There is a difference between country music and overall music in terms of no. of days between peaked and entered

# Statement 2 Test, set aplha = 0.05
print stats.ttest_ind(df[df['genre']=='Rock']['no.of.weeks.on.billboard'],df[df['genre']=='R&B']['no.of.weeks.on.billboard'])

Ttest_indResult(statistic=3.4348656996446518, pvalue=0.00075739756196112828)

Pvalue is less than 0.05. We do not accept H0. Rock music siganificantly stays longer than R&B music on billboard

Written on April 4, 2017