from scipy.stats import pearsonr
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df2 = pd.read_pickle('df2.pickle')
RET = pd.read_pickle('RET.pickle')
For this assignment here, I mainly focused on building ‘build_sample’ and ‘download_text_files’. In ‘download_text_files’, I downloaded 499 out of 503 firms. The reason would be that some of the S&P 500 firms do not hold or upload a 10-K file. For ‘build_sample’, there are two important thing I did. One is to calculate the sentiment measurement for 10-K files and the other is to get the returns around the 10-K dates. Besides the provided 4 sentiment, I tried to find risk-related word, crime-related word and people-related word, which has its own positive and negative aspect.
HTMLSession()
to get the filing data for each 10-K files from ‘www.sec.gov’. Using inspect to get r.html.find('div.formGrouping:nth-child(1) > div:nth-child(2)'
as their filing date. I merged this dataframe I just get with ret I mentioned before to get newest_sp500_2022 which includes filing dates for each 10-K. Finally, I used for loop along with iterrows and ifelse statement
to get ‘ret_0_2’ and ‘ret_3_10’ which represent for “buy and hold” around the 10-K date.
with ZipFile('10k_files/10k_files.zip','r') as zipfolder
, I make a dataframe for these 10-K files and named them files
. Next, I applied df = file.copy()
to make a copy of file and the code from class to make a for loop
here to make every 10-K file into a dataframe with their ‘accession_number’ and ‘symbol’. I defined a function called ‘sentiment_analysis’ to get all firms’ 10 sentiment scores, which is df2.
df2
filling_id | Symbol | cleaned_html | BHR_positive | BHR_negative | LM_positive | LM_negative | risk_positive | risk_negative | crime_positive | crime_negative | people_positive | people_negative | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000066740-22-000010 | MMM | mmm 20211231table of contentsunited statessec... | 0.025683 | 0.031662 | 0.003977 | 0.023249 | 0.001753 | 0.003807 | 0.000798 | 0.000079 | 0.000000 | 0.000000 |
1 | 0000091142-22-000028 | AOS | aos 20211231united states securities and exch... | 0.024460 | 0.023602 | 0.003756 | 0.012984 | 0.001686 | 0.003727 | 0.000414 | 0.000118 | 0.000000 | 0.000000 |
2 | 0001104659-22-025141 | ABT | 0000001800 12 312021fyfalsehttp fasb org us ga... | 0.021590 | 0.024394 | 0.003726 | 0.012793 | 0.002497 | 0.002094 | 0.000615 | 0.000173 | 0.000000 | 0.000038 |
3 | 0001551152-22-000007 | ABBV | abbv 20211231united statessecurities and exch... | 0.019753 | 0.022645 | 0.006481 | 0.015448 | 0.003168 | 0.001852 | 0.000763 | 0.000179 | 0.000000 | 0.000000 |
4 | 0001467373-22-000295 | ACN | acn 20220831table of contentsunited states se... | 0.027968 | 0.023964 | 0.008642 | 0.016861 | 0.003388 | 0.002233 | 0.000654 | 0.000135 | 0.000038 | 0.000038 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
491 | 0001564590-22-007640 | YUM | yumc 10k_20211231 htm united states securitie... | 0.025014 | 0.023482 | 0.006078 | 0.016549 | 0.002541 | 0.001425 | 0.000641 | 0.000178 | 0.000012 | 0.000036 |
492 | 0000877212-22-000026 | ZBRA | zbra 20211231table of contentsunited statesse... | 0.028396 | 0.026842 | 0.006258 | 0.014964 | 0.001916 | 0.001809 | 0.000617 | 0.000234 | 0.000000 | 0.000000 |
493 | 0001564590-22-007160 | ZBH | zbh 10k_20211231 htm i have united states sec... | 0.021506 | 0.026759 | 0.004591 | 0.021783 | 0.001833 | 0.002912 | 0.000786 | 0.000262 | 0.000000 | 0.000015 |
494 | 0001213900-22-013250 | ZION | united states securities and exchange commiss... | 0.016075 | 0.016980 | 0.003070 | 0.013458 | 0.001680 | 0.002230 | 0.000372 | 0.000065 | 0.000048 | 0.000000 |
495 | 0001555280-22-000078 | ZTS | zts 20211231table of contentsunited states se... | 0.021790 | 0.033508 | 0.005036 | 0.019980 | 0.002996 | 0.001760 | 0.000612 | 0.000166 | 0.000026 | 0.000013 |
496 rows × 13 columns
df2.describe()
BHR_positive | BHR_negative | LM_positive | LM_negative | risk_positive | risk_negative | crime_positive | crime_negative | people_positive | people_negative | |
---|---|---|---|---|---|---|---|---|---|---|
count | 496.000000 | 496.000000 | 496.000000 | 496.000000 | 496.000000 | 496.000000 | 496.000000 | 496.000000 | 496.000000 | 496.000000 |
mean | 0.023510 | 0.025476 | 0.004929 | 0.015821 | 0.002051 | 0.002397 | 0.000690 | 0.000149 | 0.000017 | 0.000010 |
std | 0.004108 | 0.003692 | 0.001378 | 0.003886 | 0.000853 | 0.001002 | 0.000454 | 0.000114 | 0.000028 | 0.000019 |
min | 0.003530 | 0.008953 | 0.000272 | 0.002541 | 0.000569 | 0.000000 | 0.000104 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.021520 | 0.023554 | 0.004011 | 0.013240 | 0.001473 | 0.001841 | 0.000403 | 0.000077 | 0.000000 | 0.000000 |
50% | 0.023888 | 0.025689 | 0.004845 | 0.015546 | 0.001949 | 0.002252 | 0.000579 | 0.000116 | 0.000009 | 0.000000 |
75% | 0.025936 | 0.027566 | 0.005651 | 0.017897 | 0.002471 | 0.002707 | 0.000835 | 0.000193 | 0.000024 | 0.000015 |
max | 0.037982 | 0.038030 | 0.010899 | 0.035088 | 0.006665 | 0.009605 | 0.005736 | 0.000870 | 0.000178 | 0.000188 |
RET
Symbol | RET_0_2 | RET_3_10 | |
---|---|---|---|
0 | A | 0.008788 | 0.004322 |
1 | AAL | -0.050923 | -0.139741 |
2 | AAP | -0.017255 | -0.067329 |
3 | AAPL | 0.075553 | -0.106518 |
4 | ABBV | -0.006484 | 0.025967 |
... | ... | ... | ... |
488 | ZBH | 0.026077 | -0.060145 |
489 | ZBRA | -0.127031 | -0.048513 |
490 | ZION | -0.014667 | 0.047605 |
491 | ZTS | -0.017105 | 0.008494 |
492 | FAST | NaN | 0.016783 |
493 rows × 3 columns
RET.describe()
RET_0_2 | RET_3_10 | |
---|---|---|
count | 492.000000 | 492.000000 |
mean | 0.007006 | -0.003495 |
std | 0.045087 | 0.062729 |
min | -0.279230 | -0.542923 |
25% | -0.016821 | -0.033890 |
50% | 0.003450 | -0.001494 |
75% | 0.030423 | 0.026293 |
max | 0.348567 | 0.449406 |
for loop
is partially correct. The other is that the firm I interested in all showed a positive tone for my sentiment.
final = pd.merge(RET,df2,on = 'Symbol', how = 'outer')
final
Symbol | RET_0_2 | RET_3_10 | filling_id | cleaned_html | BHR_positive | BHR_negative | LM_positive | LM_negative | risk_positive | risk_negative | crime_positive | crime_negative | people_positive | people_negative | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A | 0.008788 | 0.004322 | 0000046619-22-000066 | hei 20221031united statessecurities and excha... | 0.025263 | 0.018774 | 0.004013 | 0.008651 | 0.002096 | 0.001293 | 0.000580 | 0.000134 | 0.000067 | 0.000045 |
1 | AAL | -0.050923 | -0.139741 | 0000006201-22-000026 | aal 20211231united states securities and exch... | 0.017957 | 0.023794 | 0.002809 | 0.013274 | 0.000687 | 0.002156 | 0.001099 | 0.000048 | 0.000014 | 0.000041 |
2 | AAP | -0.017255 | -0.067329 | 0001158449-22-000037 | aap 20220101table of contents united statesse... | 0.025962 | 0.025993 | 0.005076 | 0.017441 | 0.001353 | 0.001846 | 0.000554 | 0.000185 | 0.000062 | 0.000000 |
3 | AAPL | 0.075553 | -0.106518 | 0000320193-22-000108 | aapl 20220924united statessecurities and exch... | 0.020384 | 0.026536 | 0.003438 | 0.019268 | 0.001327 | 0.001839 | 0.000814 | 0.000181 | 0.000030 | 0.000121 |
4 | ABBV | -0.006484 | 0.025967 | 0001551152-22-000007 | abbv 20211231united statessecurities and exch... | 0.019753 | 0.022645 | 0.006481 | 0.015448 | 0.003168 | 0.001852 | 0.000763 | 0.000179 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
491 | ZTS | -0.017105 | 0.008494 | 0001555280-22-000078 | zts 20211231table of contentsunited states se... | 0.021790 | 0.033508 | 0.005036 | 0.019980 | 0.002996 | 0.001760 | 0.000612 | 0.000166 | 0.000026 | 0.000013 |
492 | FAST | NaN | 0.016783 | 0001193125-22-106571 | 10 k falsefy0 250 8330 833yesfast radius inc ... | 0.017014 | 0.022257 | 0.005885 | 0.021064 | 0.003343 | 0.001874 | 0.000970 | 0.000210 | 0.000026 | 0.000000 |
493 | BALL | NaN | NaN | 0001558370-22-001251 | http fasb org us gaap 2021 01 31 otherassetsno... | 0.022260 | 0.023719 | 0.004586 | 0.011245 | 0.002207 | 0.003205 | 0.000365 | 0.000096 | 0.000000 | 0.000000 |
494 | META | NaN | NaN | 0000950170-22-002601 | 10 k 12 310 5unlimitedfalse00014319590 50 5fy... | 0.017594 | 0.018674 | 0.003788 | 0.013731 | 0.002860 | 0.003561 | 0.001193 | 0.000114 | 0.000038 | 0.000038 |
495 | WELL | NaN | NaN | 0000950170-22-005071 | 10 k 00016703490 2857142857no0 2857142857fals... | 0.021229 | 0.026493 | 0.003515 | 0.017825 | 0.001308 | 0.002411 | 0.000567 | 0.000095 | 0.000158 | 0.000000 |
496 rows × 15 columns
new_df = final.iloc[:, -14:].drop(columns=['filling_id', 'cleaned_html'])
new_df
RET_0_2 | RET_3_10 | BHR_positive | BHR_negative | LM_positive | LM_negative | risk_positive | risk_negative | crime_positive | crime_negative | people_positive | people_negative | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.008788 | 0.004322 | 0.025263 | 0.018774 | 0.004013 | 0.008651 | 0.002096 | 0.001293 | 0.000580 | 0.000134 | 0.000067 | 0.000045 |
1 | -0.050923 | -0.139741 | 0.017957 | 0.023794 | 0.002809 | 0.013274 | 0.000687 | 0.002156 | 0.001099 | 0.000048 | 0.000014 | 0.000041 |
2 | -0.017255 | -0.067329 | 0.025962 | 0.025993 | 0.005076 | 0.017441 | 0.001353 | 0.001846 | 0.000554 | 0.000185 | 0.000062 | 0.000000 |
3 | 0.075553 | -0.106518 | 0.020384 | 0.026536 | 0.003438 | 0.019268 | 0.001327 | 0.001839 | 0.000814 | 0.000181 | 0.000030 | 0.000121 |
4 | -0.006484 | 0.025967 | 0.019753 | 0.022645 | 0.006481 | 0.015448 | 0.003168 | 0.001852 | 0.000763 | 0.000179 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
491 | -0.017105 | 0.008494 | 0.021790 | 0.033508 | 0.005036 | 0.019980 | 0.002996 | 0.001760 | 0.000612 | 0.000166 | 0.000026 | 0.000013 |
492 | NaN | 0.016783 | 0.017014 | 0.022257 | 0.005885 | 0.021064 | 0.003343 | 0.001874 | 0.000970 | 0.000210 | 0.000026 | 0.000000 |
493 | NaN | NaN | 0.022260 | 0.023719 | 0.004586 | 0.011245 | 0.002207 | 0.003205 | 0.000365 | 0.000096 | 0.000000 | 0.000000 |
494 | NaN | NaN | 0.017594 | 0.018674 | 0.003788 | 0.013731 | 0.002860 | 0.003561 | 0.001193 | 0.000114 | 0.000038 | 0.000038 |
495 | NaN | NaN | 0.021229 | 0.026493 | 0.003515 | 0.017825 | 0.001308 | 0.002411 | 0.000567 | 0.000095 | 0.000158 | 0.000000 |
496 rows × 12 columns
cols_to_compare = ['RET_0_2', 'RET_3_10']
corr_matrix = new_df.corr()
corr_values = corr_matrix[cols_to_compare].iloc[2:]
corr_df = pd.DataFrame({'RET_0_2': corr_values['RET_0_2'], 'RET_3_10': corr_values['RET_3_10']})
print(corr_df)
RET_0_2 RET_3_10
BHR_positive 0.054644 -0.072754
BHR_negative 0.024339 0.003410
LM_positive -0.069196 -0.086451
LM_negative -0.012568 -0.112504
risk_positive -0.047148 -0.022635
risk_negative -0.002934 -0.009863
crime_positive 0.005598 -0.107710
crime_negative 0.110497 -0.061581
people_positive -0.032323 0.025609
people_negative 0.029462 -0.041455
# Set sentiment and return measure columns
sent_cols = ['BHR_positive','BHR_negative','LM_positive','LM_negative','risk_positive','risk_negative','crime_positive','crime_negative','people_positive','people_negative']
ret_cols = ['RET_0_2', 'RET_3_10']
for col in sent_cols:
for ret_col in ret_cols:
sns.scatterplot(data=new_df, x=ret_col, y=col)
plt.title(f'{col} vs {ret_col}')
plt.xlabel(ret_col)
plt.ylabel(col)
plt.show()
corr_df
, I noticed that the relationship between LM and RET is negatively related. They get 4 negative correlation regardless of its version of return and side of LM. But for ML sentiment, it is almost the opposite. ML sentiment get 75% of its correlation positive. Especially for both negative side, they have completely the opposite outcome. But when we look at the magnitudes of the correlation, I find out that every correlation between sentiment and return is weak. There is no strong correlation even no natural correlation. It is not quite accurate to assure that if it is positive or negative, the return will be the one we want to see.corr_df
RET_0_2 | RET_3_10 | |
---|---|---|
BHR_positive | 0.054644 | -0.072754 |
BHR_negative | 0.024339 | 0.003410 |
LM_positive | -0.069196 | -0.086451 |
LM_negative | -0.012568 | -0.112504 |
risk_positive | -0.047148 | -0.022635 |
risk_negative | -0.002934 | -0.009863 |
crime_positive | 0.005598 | -0.107710 |
crime_negative | 0.110497 | -0.061581 |
people_positive | -0.032323 | 0.025609 |
people_negative | 0.029462 | -0.041455 |
There are four main reasons that may influence the result here. Firstly, they could use much more company than we did. Since S&P 500 only represent the top 500 firms. There are much more firms all over the world. The sample is too limited. Secondly, 2022 is the year right after pandemic year. Everything was not the same as before. World may changed since they may use the historical data, which is not that representative. Thirdly, the word number could also affect the result. Finally, it could also affect by the way I approached to this result. I may make some mistakes during the process although I think I tried my best to get correct.
(3) I don’t think the 6 sentiment I choose has such high correlation with the stock price. Most of them are weak. But it could be greater if I throw more and more words into this dictionary. Although the magnitude is small, it still has its room for improvement. I only included 10 words for each sentiment. The more sample we have, the more accurate the result will be. This such small sample cannot reflect the whole idea. At least, some sentiment here like crime_negative has already showed a weak correlation, which means they are related.