dict_keys(['ExtractedText', 'linkToArchive', 'newsNER', 'newsProbability', 'newsSentiment', 'newsSource', 'tstamp'])
InfoMosaic
124348
| companies | aliases | news | keywords |
|---|---|---|---|
Banco Comercial Português | [Banco Comercial Português, BCP] | [{'ExtractedText': 'DN 13 de Setembro de 200... | {'03 Mar': {'count': 2.0, 'date': {'201503': 2... |
Galp Energia | [Galp Energia, GALP] | [{'ExtractedText': 'RTP Galp reforça posição n... | {'00h00': {'count': 7.0, 'date': {'201004': 1.... |
EDP | [EDP, Energias de Portugal, Electricidade de P... | [{'ExtractedText': 'DN-Sinteses Negocios 9 de ... | {'00h00': {'count': 4.0, 'date': {'201004': No... |
Sonae | [Sonae, SON] | [{'ExtractedText': 'DN-Sinteses 5 de Março de ... | {'00h00': {'count': 3.0, 'date': {'201004': No... |
Mota-Engil | [Mota-Engil, EGL] | [{'ExtractedText': 'RTP Lucro da Mota-Engil so... | {'15h30': {'count': 2.0, 'date': {'201509': 1.... |
dict_keys(['ExtractedText', 'linkToArchive', 'newsNER', 'newsProbability', 'newsSentiment', 'newsSource', 'tstamp'])
dict_keys(['count', 'date', 'filter', 'news', 'sentiment', 'source', 'type', 'weight'])
Data organization was optimized by saving each cell as a separate JSON file, which improves loading speed and flexibility.
news_{company}.json
'https://arquivo.pt/noFrame/replay/20010913052557/http://www.dn.pt/int/13p4x.htm'
dict_keys(['keywords', 'probability', 'sentiment', 'source', 'tstamp'])
kwrd_{company}.json
['03 Mar', '10 Nov', '100 Segundos de Ciência']
dict_keys(['count', 'date', 'filter', 'news', 'sentiment', 'source', 'type', 'weight'])
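The split into per-company files can be sketched as follows. This is a minimal illustration assuming the structures shown above; the `save_company_files` helper and the example data are hypothetical, but the `news_{company}.json` / `kwrd_{company}.json` naming follows the source.

```python
import json
import os
import tempfile

def save_company_files(table, out_dir):
    """Write each cell of the company table to its own JSON file.

    `table` maps company -> {"news": [...], "keywords": {...}},
    matching the per-cell structures printed above."""
    for company, cells in table.items():
        slug = company.replace(" ", "_")
        with open(os.path.join(out_dir, f"news_{slug}.json"), "w", encoding="utf-8") as f:
            json.dump(cells["news"], f, ensure_ascii=False)
        with open(os.path.join(out_dir, f"kwrd_{slug}.json"), "w", encoding="utf-8") as f:
            json.dump(cells["keywords"], f, ensure_ascii=False)

# Minimal example with a single company (made-up values)
table = {
    "Galp Energia": {
        "news": [{"keywords": {"Galp": 3}, "probability": 0.9,
                  "sentiment": 0.1, "source": "RTP", "tstamp": "20100401"}],
        "keywords": {"00h00": {"count": 7.0}},
    }
}
out_dir = tempfile.mkdtemp()
save_company_files(table, out_dir)
```

Loading a single topic then requires reading only two small files instead of the whole table.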
`TfidfVectorizer` and `cosine_similarity` from scikit-learn were used to compute the similarity between news articles.
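A minimal sketch of that similarity computation, using made-up headlines in place of the extracted news texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (made-up headlines); the project uses the extracted news texts
news = [
    "Galp reforça posição no mercado",
    "Lucro da Mota-Engil sobe",
    "Galp aumenta produção no mercado ibérico",
]
tfidf = TfidfVectorizer().fit_transform(news)            # sparse TF-IDF matrix
similarity = cosine_similarity(tfidf[0], tfidf).flatten()  # row of similarities to article 0
```

Article 0 is maximally similar to itself, shares no terms with article 1, and overlaps with article 2, so the scores decrease accordingly.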
iteration = 0
ratings = np.array([3.0] * len(news))

def news_recommendation(ratings):
    global iteration
    iteration += 1
    # The exponential (softmax) assigns higher probabilities to larger ratings
    weights = np.exp(np.array(ratings))
    weights /= weights.sum()
    # Select the next suggestion based on the weights
    news_i = np.random.choice(len(news), p=weights)
    return news_i

def update_ratings(news_i, user_rating):
    global ratings
    global iteration
    # Decay the learning rate as iterations progress
    learning_rate = 0.999 ** iteration
    # Compute similarity to all other texts
    similarity_scores = cosine_similarity(tfidf[news_i], tfidf).flatten()
    # Update the ratings of all news proportionally to their similarity
    ratings += (user_rating - ratings) * similarity_scores * learning_rate
    # Exclude the recommended article from future suggestions
    ratings[news_i] = -1000
A web application was developed using Flask, integrating various tools for analyzing topics (e.g., companies) based on news articles. It consolidates the visualizations and processes created throughout the project into a unified platform, organized into the following sections:
Explorer
Topic Map
Topic Insights
Word Duel
Word Cloud
https://hugover.pythonanywhere.com
Inspired by The Higher Lower Game and Noticioso.
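The overall layout of such an app can be sketched as below. This is only an illustrative skeleton: the route names and page bodies are hypothetical, not the actual implementation behind hugover.pythonanywhere.com.

```python
from flask import Flask

app = Flask(__name__)

# Hypothetical URL slugs mirroring the five sections listed above
SECTIONS = ["explorer", "topic-map", "topic-insights", "word-duel", "word-cloud"]

@app.route("/")
def index():
    # Landing page linking to every section
    links = "".join(f'<a href="/{s}">{s}</a> ' for s in SECTIONS)
    return f"<h1>InfoMosaic</h1>{links}"

def make_view(name):
    def view():
        return f"<h2>{name}</h2>"
    return view

# Register one placeholder view per section
for s in SECTIONS:
    app.add_url_rule(f"/{s}", endpoint=s, view_func=make_view(s))
```

In the real application each view would render a template that embeds the corresponding visualization.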
Because extracting the news articles related to a topic, analyzing their keywords, and assessing the sentiment of each one is a time-consuming and computationally expensive process, it is preferable to extract all the news articles from arquivo.pt and analyze them in advance, avoiding redundant operations. This requires changing some of the methodologies used so far.
To achieve this, the CDX Server API from arquivo.pt is used, which returns preserved pages whose URLs begin with a given prefix. Over 100 URLs were selected, including:
[('https://www.rtp.pt/', 'RTP'),
('https://www.rtp.pt/noticias/', 'RTP'),
('https://www.rtp.pt/noticias/pais/', 'RTP'),
('https://www.rtp.pt/noticias/mundo/', 'RTP'),
('https://www.rtp.pt/noticias/politica/', 'RTP'),
('https://www.rtp.pt/noticias/economia/', 'RTP'), ...]
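A prefix query against the CDX API can be built as below. The endpoint path and parameter names (`matchType=prefix`, `output=json`) follow the standard CDX Server API conventions and are assumptions here, not taken from the project code.

```python
from urllib.parse import urlencode

# arquivo.pt CDX Server API endpoint (assumed standard CDX conventions)
CDX_ENDPOINT = "https://arquivo.pt/wayback/cdx"

def build_cdx_query(prefix_url, limit=100):
    """Build a CDX query returning captures whose URL starts with prefix_url."""
    params = {
        "url": prefix_url,
        "matchType": "prefix",  # match all pages beginning with the URL
        "output": "json",       # one JSON object per result line
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

Iterating this over the selected URL list yields the full set of candidate captures to process.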
To process the 3 056 418 results from the CDX API, tools and methods such as Apache Spark, Bloom filters, logistic regression, and probabilistic counters are being used. The approach for extracting sentiment and keywords from each news result is very similar to the one used so far.
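Of the tools named above, a Bloom filter is the simplest to illustrate: it answers "have I seen this URL before?" in constant space, at the cost of a small false-positive rate. The sketch below is a pure-Python stand-in with illustrative sizing, not the project's actual implementation.

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership with a tunable false-positive rate."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions by salting a SHA-256 hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Deduplicating millions of capture URLs this way avoids holding every URL string in memory.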
root
|-- timestamp: integer (nullable = true)
|-- source: string (nullable = true)
|-- archive: string (nullable = true)
|-- id: integer (nullable = true)
|-- probability: float (nullable = true)
|-- keywords: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
|-- sentiment: float (nullable = true)
A new machine learning model was developed, specifically a logistic regression, to distinguish between news and non-news articles. TF-IDF was used for feature extraction, and the hyperparameters were optimized to maximize the recall of the “news” class.
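The classifier described above can be sketched with scikit-learn as follows. The tiny labelled corpus and the `C` grid are illustrative only; the real model is trained on labelled arquivo.pt captures with its own hyperparameter search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny illustrative sample: 1 = news, 0 = non-news (made-up texts)
texts = [
    "Lucro da Galp sobe no primeiro trimestre",
    "Governo aprova novo orçamento do estado",
    "EDP investe em energias renováveis",
    "BCP anuncia resultados anuais",
    "Contactos e mapa do site",
    "Política de privacidade e cookies",
    "Registe-se para receber a newsletter",
    "Página não encontrada erro 404",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),          # TF-IDF feature extraction
    ("clf", LogisticRegression(max_iter=1000)),
])
# Optimize hyperparameters for recall of the "news" (positive) class
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]},
                    scoring="recall", cv=2)
grid.fit(texts, labels)
```

Scoring on recall biases the search toward catching every news article, accepting some non-news false positives in exchange.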
Process the remaining results from the CDX API.
Store the processed data in a database, such as MongoDB.
Design and implement an algorithm to efficiently identify news articles relevant to the user’s search topic and convert the data into the required input format for the web application.
Implement the solution into the web application, ensuring it is optimized for computational performance.
Improve the user interface.