Logo

Sikai Wang

LinkedIn
Resume
GitHub


from scipy.stats import pearsonr
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df2 = pd.read_pickle('df2.pickle')
RET = pd.read_pickle('RET.pickle')

Summary section

For this assignment here, I mainly focused on building ‘build_sample’ and ‘download_text_files’. In ‘download_text_files’, I downloaded 499 out of 503 firms. The reason would be that some of the S&P 500 firms do not hold or upload a 10-K file. For ‘build_sample’, there are two important thing I did. One is to calculate the sentiment measurement for 10-K files and the other is to get the returns around the 10-K dates. Besides the provided 4 sentiment, I tried to find risk-related word, crime-related word and people-related word, which has its own positive and negative aspect.


Data section





df2
filling_id Symbol cleaned_html BHR_positive BHR_negative LM_positive LM_negative risk_positive risk_negative crime_positive crime_negative people_positive people_negative
0 0000066740-22-000010 MMM mmm 20211231table of contentsunited statessec... 0.025683 0.031662 0.003977 0.023249 0.001753 0.003807 0.000798 0.000079 0.000000 0.000000
1 0000091142-22-000028 AOS aos 20211231united states securities and exch... 0.024460 0.023602 0.003756 0.012984 0.001686 0.003727 0.000414 0.000118 0.000000 0.000000
2 0001104659-22-025141 ABT 0000001800 12 312021fyfalsehttp fasb org us ga... 0.021590 0.024394 0.003726 0.012793 0.002497 0.002094 0.000615 0.000173 0.000000 0.000038
3 0001551152-22-000007 ABBV abbv 20211231united statessecurities and exch... 0.019753 0.022645 0.006481 0.015448 0.003168 0.001852 0.000763 0.000179 0.000000 0.000000
4 0001467373-22-000295 ACN acn 20220831table of contentsunited states se... 0.027968 0.023964 0.008642 0.016861 0.003388 0.002233 0.000654 0.000135 0.000038 0.000038
... ... ... ... ... ... ... ... ... ... ... ... ... ...
491 0001564590-22-007640 YUM yumc 10k_20211231 htm united states securitie... 0.025014 0.023482 0.006078 0.016549 0.002541 0.001425 0.000641 0.000178 0.000012 0.000036
492 0000877212-22-000026 ZBRA zbra 20211231table of contentsunited statesse... 0.028396 0.026842 0.006258 0.014964 0.001916 0.001809 0.000617 0.000234 0.000000 0.000000
493 0001564590-22-007160 ZBH zbh 10k_20211231 htm i have united states sec... 0.021506 0.026759 0.004591 0.021783 0.001833 0.002912 0.000786 0.000262 0.000000 0.000015
494 0001213900-22-013250 ZION united states securities and exchange commiss... 0.016075 0.016980 0.003070 0.013458 0.001680 0.002230 0.000372 0.000065 0.000048 0.000000
495 0001555280-22-000078 ZTS zts 20211231table of contentsunited states se... 0.021790 0.033508 0.005036 0.019980 0.002996 0.001760 0.000612 0.000166 0.000026 0.000013

496 rows × 13 columns

df2.describe()
BHR_positive BHR_negative LM_positive LM_negative risk_positive risk_negative crime_positive crime_negative people_positive people_negative
count 496.000000 496.000000 496.000000 496.000000 496.000000 496.000000 496.000000 496.000000 496.000000 496.000000
mean 0.023510 0.025476 0.004929 0.015821 0.002051 0.002397 0.000690 0.000149 0.000017 0.000010
std 0.004108 0.003692 0.001378 0.003886 0.000853 0.001002 0.000454 0.000114 0.000028 0.000019
min 0.003530 0.008953 0.000272 0.002541 0.000569 0.000000 0.000104 0.000000 0.000000 0.000000
25% 0.021520 0.023554 0.004011 0.013240 0.001473 0.001841 0.000403 0.000077 0.000000 0.000000
50% 0.023888 0.025689 0.004845 0.015546 0.001949 0.002252 0.000579 0.000116 0.000009 0.000000
75% 0.025936 0.027566 0.005651 0.017897 0.002471 0.002707 0.000835 0.000193 0.000024 0.000015
max 0.037982 0.038030 0.010899 0.035088 0.006665 0.009605 0.005736 0.000870 0.000178 0.000188

RET
Symbol RET_0_2 RET_3_10
0 A 0.008788 0.004322
1 AAL -0.050923 -0.139741
2 AAP -0.017255 -0.067329
3 AAPL 0.075553 -0.106518
4 ABBV -0.006484 0.025967
... ... ... ...
488 ZBH 0.026077 -0.060145
489 ZBRA -0.127031 -0.048513
490 ZION -0.014667 0.047605
491 ZTS -0.017105 0.008494
492 FAST NaN 0.016783

493 rows × 3 columns

RET.describe()
RET_0_2 RET_3_10
count 492.000000 492.000000
mean 0.007006 -0.003495
std 0.045087 0.062729
min -0.279230 -0.542923
25% -0.016821 -0.033890
50% 0.003450 -0.001494
75% 0.030423 0.026293
max 0.348567 0.449406


Results

final = pd.merge(RET,df2,on = 'Symbol', how = 'outer')
final
Symbol RET_0_2 RET_3_10 filling_id cleaned_html BHR_positive BHR_negative LM_positive LM_negative risk_positive risk_negative crime_positive crime_negative people_positive people_negative
0 A 0.008788 0.004322 0000046619-22-000066 hei 20221031united statessecurities and excha... 0.025263 0.018774 0.004013 0.008651 0.002096 0.001293 0.000580 0.000134 0.000067 0.000045
1 AAL -0.050923 -0.139741 0000006201-22-000026 aal 20211231united states securities and exch... 0.017957 0.023794 0.002809 0.013274 0.000687 0.002156 0.001099 0.000048 0.000014 0.000041
2 AAP -0.017255 -0.067329 0001158449-22-000037 aap 20220101table of contents united statesse... 0.025962 0.025993 0.005076 0.017441 0.001353 0.001846 0.000554 0.000185 0.000062 0.000000
3 AAPL 0.075553 -0.106518 0000320193-22-000108 aapl 20220924united statessecurities and exch... 0.020384 0.026536 0.003438 0.019268 0.001327 0.001839 0.000814 0.000181 0.000030 0.000121
4 ABBV -0.006484 0.025967 0001551152-22-000007 abbv 20211231united statessecurities and exch... 0.019753 0.022645 0.006481 0.015448 0.003168 0.001852 0.000763 0.000179 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
491 ZTS -0.017105 0.008494 0001555280-22-000078 zts 20211231table of contentsunited states se... 0.021790 0.033508 0.005036 0.019980 0.002996 0.001760 0.000612 0.000166 0.000026 0.000013
492 FAST NaN 0.016783 0001193125-22-106571 10 k falsefy0 250 8330 833yesfast radius inc ... 0.017014 0.022257 0.005885 0.021064 0.003343 0.001874 0.000970 0.000210 0.000026 0.000000
493 BALL NaN NaN 0001558370-22-001251 http fasb org us gaap 2021 01 31 otherassetsno... 0.022260 0.023719 0.004586 0.011245 0.002207 0.003205 0.000365 0.000096 0.000000 0.000000
494 META NaN NaN 0000950170-22-002601 10 k 12 310 5unlimitedfalse00014319590 50 5fy... 0.017594 0.018674 0.003788 0.013731 0.002860 0.003561 0.001193 0.000114 0.000038 0.000038
495 WELL NaN NaN 0000950170-22-005071 10 k 00016703490 2857142857no0 2857142857fals... 0.021229 0.026493 0.003515 0.017825 0.001308 0.002411 0.000567 0.000095 0.000158 0.000000

496 rows × 15 columns

new_df = final.iloc[:, -14:].drop(columns=['filling_id', 'cleaned_html'])
new_df
RET_0_2 RET_3_10 BHR_positive BHR_negative LM_positive LM_negative risk_positive risk_negative crime_positive crime_negative people_positive people_negative
0 0.008788 0.004322 0.025263 0.018774 0.004013 0.008651 0.002096 0.001293 0.000580 0.000134 0.000067 0.000045
1 -0.050923 -0.139741 0.017957 0.023794 0.002809 0.013274 0.000687 0.002156 0.001099 0.000048 0.000014 0.000041
2 -0.017255 -0.067329 0.025962 0.025993 0.005076 0.017441 0.001353 0.001846 0.000554 0.000185 0.000062 0.000000
3 0.075553 -0.106518 0.020384 0.026536 0.003438 0.019268 0.001327 0.001839 0.000814 0.000181 0.000030 0.000121
4 -0.006484 0.025967 0.019753 0.022645 0.006481 0.015448 0.003168 0.001852 0.000763 0.000179 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ...
491 -0.017105 0.008494 0.021790 0.033508 0.005036 0.019980 0.002996 0.001760 0.000612 0.000166 0.000026 0.000013
492 NaN 0.016783 0.017014 0.022257 0.005885 0.021064 0.003343 0.001874 0.000970 0.000210 0.000026 0.000000
493 NaN NaN 0.022260 0.023719 0.004586 0.011245 0.002207 0.003205 0.000365 0.000096 0.000000 0.000000
494 NaN NaN 0.017594 0.018674 0.003788 0.013731 0.002860 0.003561 0.001193 0.000114 0.000038 0.000038
495 NaN NaN 0.021229 0.026493 0.003515 0.017825 0.001308 0.002411 0.000567 0.000095 0.000158 0.000000

496 rows × 12 columns

cols_to_compare = ['RET_0_2', 'RET_3_10']

corr_matrix = new_df.corr()

corr_values = corr_matrix[cols_to_compare].iloc[2:]

corr_df = pd.DataFrame({'RET_0_2': corr_values['RET_0_2'], 'RET_3_10': corr_values['RET_3_10']})

print(corr_df)
                  RET_0_2  RET_3_10
BHR_positive     0.054644 -0.072754
BHR_negative     0.024339  0.003410
LM_positive     -0.069196 -0.086451
LM_negative     -0.012568 -0.112504
risk_positive   -0.047148 -0.022635
risk_negative   -0.002934 -0.009863
crime_positive   0.005598 -0.107710
crime_negative   0.110497 -0.061581
people_positive -0.032323  0.025609
people_negative  0.029462 -0.041455



# Set sentiment and return measure columns
sent_cols = ['BHR_positive','BHR_negative','LM_positive','LM_negative','risk_positive','risk_negative','crime_positive','crime_negative','people_positive','people_negative']
ret_cols = ['RET_0_2', 'RET_3_10']

for col in sent_cols:
    for ret_col in ret_cols:
        sns.scatterplot(data=new_df, x=ret_col, y=col)
        plt.title(f'{col} vs {ret_col}')
        plt.xlabel(ret_col)
        plt.ylabel(col)
        plt.show()

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

Discussion

corr_df
RET_0_2 RET_3_10
BHR_positive 0.054644 -0.072754
BHR_negative 0.024339 0.003410
LM_positive -0.069196 -0.086451
LM_negative -0.012568 -0.112504
risk_positive -0.047148 -0.022635
risk_negative -0.002934 -0.009863
crime_positive 0.005598 -0.107710
crime_negative 0.110497 -0.061581
people_positive -0.032323 0.025609
people_negative 0.029462 -0.041455