---
license: apache-2.0
language:
- tr
library_name: bertopic
tags:
- finance
metrics:
- accuracy
---

# Financial Sentiment Analysis with BERT for Borsa Istanbul (BIST100)

- Hello! This is the repository for our CS210 project, in which we trained a financial sentiment analysis model using BERT for Borsa Istanbul. Since live alternatives cost more than 750 TL per month, this project will continue to be developed as a live, open-source alternative to its rivals. Please star the project to support further upgrades. :)

- Here you can find the code for data gathering, parsing, model training, and visualization in separate folders. All the documentation is public and open-source. This README will be updated, but for the moment, let's get a good grade. :)

-----------------------------------------------

# Links to resources

- The data and the trained model are shared through Google Drive, since they are significantly large in size and quantity.

------------------------------------------------------------
## Link to the Repository
- The repository includes every file we implemented during the process
- The model reaches an accuracy > 0.90 on the test data
- [Repository Link](https://github.com/onerayhan/cs210_trend_followers)
------------------------------------------------------------
## Link to the Prelabeled Data
- The data is prelabeled with the daily return value of the BIST-100 to get a first insight
- The files can be found in classified neg/pos subfolders
- [Pre-Labeled Data Link](https://drive.google.com/drive/folders/1NYB9wBx8yt31drdczAB_ll5s31I1dcN4?usp=sharing)
-------------------------------------------------------------
## Link to the True Labeled Data
- The data was then labeled a second time with keyword searching, and more than 200 files changed directory from neg to pos or vice versa
- [Labeled Data Link](https://drive.google.com/drive/folders/1sn4JtCZ44wH2FO60Opm3FKXQwYLMtwGY?usp=sharing)

# Additional Notes

## Data Gathering
- We gathered daily brokerage reviews, daily news, and tweets for training, but did not use the tweet data when training the model because of its limitations and high spam percentage
- We set the data collection period from 01.01.2021 to 25.05.2023, except for tweets, which we could not access further back than one month
- We gathered the data through libraries such as Selenium, Requests, BeautifulSoup, and snscrape

## Data Preprocessing
- We used built-in Python libraries, pypdfium2, Pandas, NumPy, Transformers, and BertTokenizer for preprocessing

## Model Training
- We fine-tuned a Turkish cased BERT on the data with Transformers (a minimal training sketch follows below)
- [Link to Untrained Model](https://huggingface.co/dbmdz/bert-base-turkish-cased)

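For reference, here is a minimal sketch of how such a fine-tuning run can be set up with Transformers. The sample text, label encoding, dataset wiring, and hyperparameters are illustrative assumptions, not the exact code in our bert_train file.

```python
# Minimal sketch: fine-tune dbmdz/bert-base-turkish-cased for binary
# sentiment classification. Texts/labels below are illustrative placeholders.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

texts = ["BIST-100 güne yükselişle başladı."]  # illustrative sample
labels = [1]                                   # assumption: 1 = positive, 0 = negative

class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_out", num_train_epochs=3),
    train_dataset=SentimentDataset(texts, labels),
)
trainer.train()
```
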
## Visualization
- We used Matplotlib and Seaborn for visualizations

-----------------------------------------------

# Quick File Explanations in the Repository

Below are quick explanations of what each script does; the workings of the Python code can be understood in more detail from the comments in each file. These explanations can also be found in our repository.

## Downloading Links

- akbank_link_download.py

Downloads links to PDFs from the specified URL back to 04.01.2021, using Selenium to traverse the interactive pages on the Akbank website.

- gedik_link_download.py

Downloads links to PDFs from the specified URL, using Requests and BeautifulSoup to sequentially collect links from the Gedik website (a minimal sketch of this pattern follows the list).

- download_links_yk_garan

Downloads links to PDFs from the specified URLs, using Requests, BeautifulSoup, and Selenium to sequentially collect links from the Garanti and YapıKredi websites.

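As an illustration of the Requests + BeautifulSoup pattern mentioned above, here is a minimal sketch; the base URL, paging scheme, and output filename are hypothetical placeholders, not the actual values in gedik_link_download.py.

```python
# Hypothetical sketch of sequentially collecting PDF links from paged HTML.
# BASE_URL and the output filename are placeholders.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example.com/reports?page={}"  # placeholder URL

def collect_pdf_links(max_pages: int = 10) -> list[str]:
    links = []
    for page in range(1, max_pages + 1):
        response = requests.get(BASE_URL.format(page), timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Keep every anchor that points to a PDF file
        for anchor in soup.find_all("a", href=True):
            if anchor["href"].lower().endswith(".pdf"):
                links.append(anchor["href"])
    return links

if __name__ == "__main__":
    with open("gedik_links.txt", "w") as f:
        f.write("\n".join(collect_pdf_links()))
```
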
## Downloading PDFs

- akbank_PDF_download.py

Reads a .txt file of links, downloads the PDFs, and saves them to /data/akbank_PDF; the folders need to be created beforehand (a minimal sketch of this shared pattern follows the list).

- garanti_PDF_download.py

Reads a .txt file of links, downloads the PDFs, and saves them to /data/garanti_PDF; the folders need to be created beforehand.

- gedik_PDF_download.py

Reads a .txt file of links, downloads the PDFs, and saves them to /data/gedik_PDF; the folders need to be created beforehand.

- yapikredi_PDF_download.py

Reads a .txt file of links, downloads the PDFs, and saves them to /data/yapikredi_PDF; the folders need to be created beforehand.

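All four download scripts follow the same pattern, sketched minimally below; the input filename, output folder, and naming scheme are illustrative assumptions, not the repository's actual values.

```python
# Hypothetical sketch: read a .txt file of links and save each PDF into a
# pre-created data folder. Paths and filenames are placeholders.
import os
import requests

LINKS_FILE = "akbank_links.txt"   # placeholder input file
OUTPUT_DIR = "data/akbank_PDF"    # folder must exist beforehand

with open(LINKS_FILE) as f:
    links = [line.strip() for line in f if line.strip()]

for i, url in enumerate(links):
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    # Save each PDF with a sequential filename
    with open(os.path.join(OUTPUT_DIR, f"report_{i}.pdf"), "wb") as out:
        out.write(response.content)
```
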
## Extracting Text

- pypdfium2_akbank.py

Uses pypdfium2 to extract the necessary text from the Akbank PDFs located in data/akbank_PDF, puts all extracted text into a list of dictionaries with date, count, and paragraph as keys, and writes the combined dictionaries to a .json file (a minimal sketch of this pattern follows the list).

- pypdfium2_garanti.py

Uses pypdfium2 to extract the necessary text from the Garanti PDFs located in data/garanti_PDF, puts all extracted text into a list of dictionaries with date, count, and paragraph as keys, and writes the combined dictionaries to a .json file.

- pypdfium2_gedik.py

Uses pypdfium2 to extract the necessary text from the Gedik PDFs located in data/gedik_PDF, puts all extracted text into a list of dictionaries with date, monthAgo, count, and paragraph as keys, and writes the combined dictionaries to a .json file.

- pypdfium2_yapikredi.py

Uses pypdfium2 to extract the necessary text from the YapıKredi PDFs located in data/yapikredi_PDF, puts all extracted text into a list of dictionaries with date, count, and paragraph as keys, and writes the combined dictionaries to a .json file.

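For illustration, here is a minimal sketch of this extraction step with pypdfium2; the input folder, the date-from-filename assumption, and the output filename are placeholders, not the repository's actual logic.

```python
# Hypothetical sketch: extract text from every PDF in a folder and store it
# as a list of dicts with date, count, and paragraph keys.
import json
import os
import pypdfium2 as pdfium

INPUT_DIR = "data/akbank_PDF"  # placeholder folder
records = []

pdf_files = sorted(f for f in os.listdir(INPUT_DIR) if f.lower().endswith(".pdf"))
for count, filename in enumerate(pdf_files):
    pdf = pdfium.PdfDocument(os.path.join(INPUT_DIR, filename))
    # Concatenate the text of every page in the document
    pages_text = [pdf[i].get_textpage().get_text_range() for i in range(len(pdf))]
    records.append({
        "date": filename[:10],  # assumption: date encoded at the start of the filename
        "count": count,
        "paragraph": "\n".join(pages_text),
    })

with open("akbank_texts.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)
```
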
## JSON Labeling

After text extraction, the output .json files were processed by splitting them on BIST-100 values: if a text was published while the BIST-100 had a negative change, the processed text was put into the negative folder; otherwise it was put into the positive folder. These folders serve as the labeled data for our machine learning model (a minimal sketch of this logic follows the list).

- json_sorter.py

The input is a .json file containing a list of dictionaries with the keys date, count, and paragraph. The date of each element is looked up in the XU100 Excel sheet, where the changes in the BIST-100 value are located, and the element is sorted into the negative folder if the value is negative or into the positive folder if it is positive.

- json_sort_haberler.py

Does the same for the news data: the input is a .json file containing a list of dictionaries with the keys date, count, and paragraph, and each element is sorted into the negative or positive folder according to the BIST-100 change on its date.

- json_sort_tweet.py

The input is a .json file containing a dictionary of dictionaries with tweet ids as keys and date, tweet, and views as values. The date of each element is looked up in the XU100 Excel sheet and the element is sorted into the negative folder if the value is negative or into the positive folder if it is positive.

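A minimal sketch of this sorting logic, assuming an XU100 sheet with Date and Change columns (illustrative names, not necessarily those in the actual sheet), and assuming the date formats in the JSON and the sheet match:

```python
# Hypothetical sketch: look up each document's date in the XU100 sheet and
# copy its text into a negative or positive folder based on the daily change.
import json
import os
import pandas as pd

xu100 = pd.read_excel("XU100.xlsx")  # placeholder filename and column names
change_by_date = dict(zip(xu100["Date"].astype(str).str[:10], xu100["Change"]))

with open("akbank_texts.json", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    change = change_by_date.get(record["date"])
    if change is None:
        continue  # skip dates with no BIST-100 quote (weekends, holidays)
    folder = "negative" if change < 0 else "positive"
    os.makedirs(folder, exist_ok=True)
    out_path = os.path.join(folder, f"{record['count']}.txt")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(record["paragraph"])
```
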
## True Labeling

- parse_keywords.py

Checks the keywords of each file assigned to neg or pos and moves falsely labeled files to the other folder (a minimal sketch of this idea follows below).

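For illustration, here is a minimal sketch of such keyword-based relabeling; the keyword lists and the scoring rule are invented placeholders, not the actual keywords or logic in parse_keywords.py.

```python
# Hypothetical sketch: score each labeled file by sentiment keywords and
# move it to the other folder if the score contradicts its current label.
import os
import shutil

POSITIVE_WORDS = {"yükseliş", "alım", "pozitif"}  # illustrative keywords
NEGATIVE_WORDS = {"düşüş", "satış", "negatif"}

def keyword_score(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)

for folder, other in [("negative", "positive"), ("positive", "negative")]:
    for filename in os.listdir(folder):
        path = os.path.join(folder, filename)
        with open(path, encoding="utf-8") as f:
            score = keyword_score(f.read())
        # Move the file if its keyword score contradicts its current label
        if (folder == "negative" and score > 0) or (folder == "positive" and score < 0):
            shutil.move(path, os.path.join(other, filename))
```
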
## Model Training

- bert_train

Trains the BERT model on the data and checks the results; BertTokenizer is also used to further preprocess the data. To see the results and scores of the model, please check this file.

## Visualizations

- Visualizations.ipynb

Shows the performance of the model on the whole dataset and visualizes the sentiments extracted from the brokerages and the news.

- CS210Visualization.pptx

- sentiment_of_broker_sites.ipynb

Plots the sentiment of the four different broker sites as percentage comparisons, with positive and negative sentiments as categories.
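
As an illustration of this kind of plot, here is a minimal Seaborn sketch; the percentages are invented placeholders, not the project's actual results.

```python
# Hypothetical sketch: grouped bar chart of positive vs. negative sentiment
# percentages per broker site. All values below are illustrative.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data = pd.DataFrame({
    "broker": ["Akbank", "Garanti", "Gedik", "YapıKredi"] * 2,
    "sentiment": ["positive"] * 4 + ["negative"] * 4,
    "percentage": [55, 60, 48, 52, 45, 40, 52, 48],  # invented values
})

sns.barplot(data=data, x="broker", y="percentage", hue="sentiment")
plt.title("Sentiment of Broker Sites (%)")
plt.tight_layout()
plt.show()
```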
|