dict_keys(['ExtractedText', 'linkToArchive', 'newsNER', 'newsProbability', 'newsSentiment', 'newsSource', 'tstamp'])
InfoMosaic
124348
| companies | aliases | news | keywords |
|---|---|---|---|
Banco Comercial Português | [Banco Comercial Português, BCP] | [{'ExtractedText': 'DN 13 de Setembro de 200... | {'03 Mar': {'count': 2.0, 'date': {'201503': 2... |
Galp Energia | [Galp Energia, GALP] | [{'ExtractedText': 'RTP Galp reforça posição n... | {'00h00': {'count': 7.0, 'date': {'201004': 1.... |
EDP | [EDP, Energias de Portugal, Electricidade de P... | [{'ExtractedText': 'DN-Sinteses Negocios 9 de ... | {'00h00': {'count': 4.0, 'date': {'201004': No... |
Sonae | [Sonae, SON] | [{'ExtractedText': 'DN-Sinteses 5 de Março de ... | {'00h00': {'count': 3.0, 'date': {'201004': No... |
Mota-Engil | [Mota-Engil, EGL] | [{'ExtractedText': 'RTP Lucro da Mota-Engil so... | {'15h30': {'count': 2.0, 'date': {'201509': 1.... |
dict_keys(['ExtractedText', 'linkToArchive', 'newsNER', 'newsProbability', 'newsSentiment', 'newsSource', 'tstamp'])
dict_keys(['count', 'date', 'filter', 'news', 'sentiment', 'source', 'type', 'weight'])
Data organization was optimized by saving each cell as a separate JSON file, which improves loading speed and flexibility.
news_{company}.json
'https://arquivo.pt/noFrame/replay/20010913052557/http://www.dn.pt/int/13p4x.htm'
dict_keys(['keywords', 'probability', 'sentiment', 'source', 'tstamp'])
kwrd_{company}.json
['03 Mar', '10 Nov', '100 Segundos de Ciência']
dict_keys(['count', 'date', 'filter', 'news', 'sentiment', 'source', 'type', 'weight'])
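The split into per-company files can be sketched as follows. This is a minimal illustration assuming the structures shown above; the `save_company_files` helper and the example data are hypothetical, but the `news_{company}.json` / `kwrd_{company}.json` naming follows the source.

```python
import json
import os
import tempfile

def save_company_files(table, out_dir):
    """Write each cell of the company table to its own JSON file.

    `table` maps company -> {"news": [...], "keywords": {...}},
    matching the per-cell structures printed above."""
    for company, cells in table.items():
        slug = company.replace(" ", "_")
        with open(os.path.join(out_dir, f"news_{slug}.json"), "w", encoding="utf-8") as f:
            json.dump(cells["news"], f, ensure_ascii=False)
        with open(os.path.join(out_dir, f"kwrd_{slug}.json"), "w", encoding="utf-8") as f:
            json.dump(cells["keywords"], f, ensure_ascii=False)

# Minimal example with a single company (made-up values)
table = {
    "Galp Energia": {
        "news": [{"keywords": {"Galp": 3}, "probability": 0.9,
                  "sentiment": 0.1, "source": "RTP", "tstamp": "20100401"}],
        "keywords": {"00h00": {"count": 7.0}},
    }
}
out_dir = tempfile.mkdtemp()
save_company_files(table, out_dir)
```

Loading a single topic then requires reading only two small files instead of the whole table.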
`TfidfVectorizer` and `cosine_similarity` from scikit-learn were used to compute the similarity between news articles.
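A minimal sketch of that similarity computation, using made-up headlines in place of the extracted news texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (made-up headlines); the project uses the extracted news texts
news = [
    "Galp reforça posição no mercado",
    "Lucro da Mota-Engil sobe",
    "Galp aumenta produção no mercado ibérico",
]
tfidf = TfidfVectorizer().fit_transform(news)            # sparse TF-IDF matrix
similarity = cosine_similarity(tfidf[0], tfidf).flatten()  # row of similarities to article 0
```

Article 0 is maximally similar to itself, shares no terms with article 1, and overlaps with article 2, so the scores decrease accordingly.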
iteration = 0
ratings = np.array([3.0] * len(news))

def news_recommendation(ratings):
    global iteration
    iteration += 1
    # The exponential (softmax) assigns higher probabilities to larger ratings
    weights = np.exp(np.array(ratings))
    weights /= weights.sum()
    # Select the next suggestion based on the weights
    news_i = np.random.choice(len(news), p=weights)
    return news_i

def update_ratings(news_i, user_rating):
    global ratings
    global iteration
    # Decay the learning rate as iterations progress
    learning_rate = 0.999 ** iteration
    # Compute similarity to all other texts
    similarity_scores = cosine_similarity(tfidf[news_i], tfidf).flatten()
    # Update the ratings of all news proportionally to their similarity
    ratings += (user_rating - ratings) * similarity_scores * learning_rate
    # Exclude the recommended article from future suggestions
    ratings[news_i] = -1000
A web application was developed using Flask, integrating various tools for analyzing topics (e.g., companies) based on news articles. It consolidates the visualizations and processes created throughout the project into a unified platform, organized into the following sections:
Explorer
Topic Map
Topic Insights
Word Duel
Word Cloud
https://hugover.pythonanywhere.com
Inspired by The Higher Lower Game and Noticioso.
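The overall layout of such an app can be sketched as below. This is only an illustrative skeleton: the route names and page bodies are hypothetical, not the actual implementation behind hugover.pythonanywhere.com.

```python
from flask import Flask

app = Flask(__name__)

# Hypothetical URL slugs mirroring the five sections listed above
SECTIONS = ["explorer", "topic-map", "topic-insights", "word-duel", "word-cloud"]

@app.route("/")
def index():
    # Landing page linking to every section
    links = "".join(f'<a href="/{s}">{s}</a> ' for s in SECTIONS)
    return f"<h1>InfoMosaic</h1>{links}"

def make_view(name):
    def view():
        return f"<h2>{name}</h2>"
    return view

# Register one placeholder view per section
for s in SECTIONS:
    app.add_url_rule(f"/{s}", endpoint=s, view_func=make_view(s))
```

In the real application each view would render a template that embeds the corresponding visualization.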
Because extracting the news articles related to a topic, analyzing their keywords, and assessing the sentiment of each one is a time-consuming and computationally expensive process, it is preferable to extract all the news articles from arquivo.pt and analyze them in advance, avoiding redundant operations. This requires changing some of the methodologies used so far.
To achieve this, the CDX Server API from arquivo.pt is used, which returns preserved pages whose URLs begin with a given prefix. Over 100 URLs were selected, including:
[('https://www.rtp.pt/', 'RTP'),
('https://www.rtp.pt/noticias/', 'RTP'),
('https://www.rtp.pt/noticias/pais/', 'RTP'),
('https://www.rtp.pt/noticias/mundo/', 'RTP'),
('https://www.rtp.pt/noticias/politica/', 'RTP'),
('https://www.rtp.pt/noticias/economia/', 'RTP'), ...]
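A prefix query against the CDX API can be built as below. The endpoint path and parameter names (`matchType=prefix`, `output=json`) follow the standard CDX Server API conventions and are assumptions here, not taken from the project code.

```python
from urllib.parse import urlencode

# arquivo.pt CDX Server API endpoint (assumed standard CDX conventions)
CDX_ENDPOINT = "https://arquivo.pt/wayback/cdx"

def build_cdx_query(prefix_url, limit=100):
    """Build a CDX query returning captures whose URL starts with prefix_url."""
    params = {
        "url": prefix_url,
        "matchType": "prefix",  # match all pages beginning with the URL
        "output": "json",       # one JSON object per result line
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

Iterating this over the selected URL list yields the full set of candidate captures to process.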
To process the 3 056 418 results from the CDX API, tools and methods such as Apache Spark, Bloom filters, logistic regression, and probabilistic counters are being used. The approach for extracting sentiment and keywords from each news result is very similar to the one used so far.
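Of the tools named above, a Bloom filter is the simplest to illustrate: it answers "have I seen this URL before?" in constant space, at the cost of a small false-positive rate. The sketch below is a pure-Python stand-in with illustrative sizing, not the project's actual implementation.

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership with a tunable false-positive rate."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions by salting a SHA-256 hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Deduplicating millions of capture URLs this way avoids holding every URL string in memory.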
root
|-- timestamp: integer (nullable = true)
|-- source: string (nullable = true)
|-- archive: string (nullable = true)
|-- id: integer (nullable = true)
|-- probability: float (nullable = true)
|-- keywords: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
|-- sentiment: float (nullable = true)
A new machine learning model was developed, specifically a logistic regression, to distinguish between news and non-news articles. TF-IDF was used for feature extraction, and the hyperparameters were optimized to maximize the recall of the “news” class.
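The classifier described above can be sketched with scikit-learn as follows. The tiny labelled corpus and the `C` grid are illustrative only; the real model is trained on labelled arquivo.pt captures with its own hyperparameter search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny illustrative sample: 1 = news, 0 = non-news (made-up texts)
texts = [
    "Lucro da Galp sobe no primeiro trimestre",
    "Governo aprova novo orçamento do estado",
    "EDP investe em energias renováveis",
    "BCP anuncia resultados anuais",
    "Contactos e mapa do site",
    "Política de privacidade e cookies",
    "Registe-se para receber a newsletter",
    "Página não encontrada erro 404",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),          # TF-IDF feature extraction
    ("clf", LogisticRegression(max_iter=1000)),
])
# Optimize hyperparameters for recall of the "news" (positive) class
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]},
                    scoring="recall", cv=2)
grid.fit(texts, labels)
```

Scoring on recall biases the search toward catching every news article, accepting some non-news false positives in exchange.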
Process the remaining results from the CDX API.
Store the processed data in a database, such as MongoDB.
Design and implement an algorithm to efficiently identify news articles relevant to the user’s search topic and convert the data into the required input format for the web application.
Implement the solution into the web application, ensuring it is optimized for computational performance.
Improve the user interface.