Insights from News and Public Coverage

InfoMosaic

Hugo Veríssimo

124348

data05.parquet

                           aliases                                            news                                               keywords
companies
Banco Comercial Português  [Banco Comercial Português, BCP]                   [{'ExtractedText': 'DN   13 de Setembro de 200...  {'03 Mar': {'count': 2.0, 'date': {'201503': 2...
Galp Energia               [Galp Energia, GALP]                               [{'ExtractedText': 'RTP Galp reforça posição n...  {'00h00': {'count': 7.0, 'date': {'201004': 1....
EDP                        [EDP, Energias de Portugal, Electricidade de P...  [{'ExtractedText': 'DN-Sinteses Negocios 9 de ...  {'00h00': {'count': 4.0, 'date': {'201004': No...
Sonae                      [Sonae, SON]                                       [{'ExtractedText': 'DN-Sinteses 5 de Março de ...  {'00h00': {'count': 3.0, 'date': {'201004': No...
Mota-Engil                 [Mota-Engil, EGL]                                  [{'ExtractedText': 'RTP Lucro da Mota-Engil so...  {'15h30': {'count': 2.0, 'date': {'201509': 1....
pd.read_parquet("data05.parquet")["news"].iloc[0][0].keys()
dict_keys(['ExtractedText', 'linkToArchive', 'newsNER', 'newsProbability', 'newsSentiment', 'newsSource', 'tstamp'])
 
pd.read_parquet("data05.parquet")["keywords"].iloc[0]['03 Mar'].keys()
dict_keys(['count', 'date', 'filter', 'news', 'sentiment', 'source', 'type', 'weight'])

Optimized Data Storage

Each DataFrame cell was saved as a separate JSON file, which improves loading speed and flexibility.

news_{company}.json

news = json.load(open("news_bcp.json")); list(news.keys())[0]; list(news.values())[0].keys()
'https://arquivo.pt/noFrame/replay/20010913052557/http://www.dn.pt/int/13p4x.htm'
dict_keys(['keywords', 'probability', 'sentiment', 'source', 'tstamp'])
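
The step from the parquet cell to the per-company file can be sketched as follows. This is a minimal sketch, not the project's actual conversion code: the function names `reshape_news` and `save_news` are hypothetical, and the mapping of `newsNER` to `keywords` (and the dropping of `ExtractedText`) is an assumption inferred from the key names shown above.

```python
import json
from pathlib import Path


def reshape_news(records):
    """Turn a list of article dicts into a dict keyed by archive URL.

    Assumed mapping (inferred from the key names, not confirmed by the source):
    newsNER -> keywords, newsProbability -> probability,
    newsSentiment -> sentiment, newsSource -> source.
    """
    reshaped = {}
    for record in records:
        reshaped[record["linkToArchive"]] = {
            "keywords": record["newsNER"],
            "probability": record["newsProbability"],
            "sentiment": record["newsSentiment"],
            "source": record["newsSource"],
            "tstamp": record["tstamp"],
        }
    return reshaped


def save_news(records, company_slug, out_dir="."):
    """Write news_{company_slug}.json and return its path."""
    path = Path(out_dir) / f"news_{company_slug}.json"
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(reshape_news(records), fh, ensure_ascii=False)
    return path
```

Keying by URL makes a single article retrievable without scanning a list. Values read from parquet may be NumPy arrays and would need converting to plain lists before `json.dump` accepts them.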
 

kwrd_{company}.json

kwrd = json.load(open("kwrd_bcp.json")); list(kwrd.keys())[:3]; list(kwrd.values())[0].keys()
['03 Mar', '10 Nov', '100 Segundos de Ciência']
dict_keys(['count', 'date', 'filter', 'news', 'sentiment', 'source', 'type', 'weight'])

News Recommendations

News Similarity

TfidfVectorizer() and cosine_similarity from scikit-learn were used to compute the similarity between news articles.

news clusters
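
The similarity computation described in this section can be illustrated with a few toy headlines. This is a minimal sketch under stated assumptions: the sample texts and the helper `most_similar` are made up for illustration, and default `TfidfVectorizer()` settings are used because the slides do not mention any others.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy headlines standing in for the archived news texts (illustrative only).
news = [
    "Galp reports higher profit",
    "Galp profit rises again",
    "Sonae opens new stores",
]

# Each row of `tfidf` is the TF-IDF vector of one article.
tfidf = TfidfVectorizer().fit_transform(news)

# similarity[i, j] is the cosine similarity between articles i and j.
similarity = cosine_similarity(tfidf)


def most_similar(index, top_n=1):
    """Return the indices of the top_n articles most similar to `index`, excluding itself."""
    ranked = similarity[index].argsort()[::-1]
    return [int(j) for j in ranked if j != index][:top_n]
```

Articles that share no vocabulary get a similarity of zero, so the two Galp headlines end up closer to each other than either is to the Sonae one.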

Rating News

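import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# `news` (the article texts) and `tfidf` (their TF-IDF matrix) are assumed to be defined beforehand.
iteration = 0
ratings = np.array([3.0] * len(news))

def news_recommendation(ratings):
    global iteration
    iteration += 1

    # The exponential assigns higher probabilities to larger ratings (softmax)
    weights = np.exp(np.array(ratings))
    weights /= weights.sum()

    # Select the next suggestion based on the weights
    news_i = np.random.choice(len(news), p=weights)

    return news_i

def update_ratings(news_i, user_rating):
    global ratings
    # The learning rate decays slowly as more articles are rated
    learning_rate = 0.999 ** iteration

    # Compute similarity to all other texts
    similarity_scores = cosine_similarity(tfidf[news_i], tfidf).flatten()

    # Move every rating toward the user's rating, in proportion to similarity
    ratings += (user_rating - ratings) * similarity_scores * learning_rate
    # Exclude the rated article from the next suggestions (exp(-1000) underflows to 0)
    ratings[news_i] = -1000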
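
Written out, the two functions above amount to softmax sampling followed by a similarity-weighted update. The notation is introduced here for illustration and does not appear in the original code: \(r_j\) is the current rating of article \(j\), \(u\) the rating the user gives to the suggested article \(i\), \(s_{ij}\) the cosine similarity between articles \(i\) and \(j\), and \(\eta_t\) the learning rate at iteration \(t\).

```latex
p_j = \frac{e^{r_j}}{\sum_{k} e^{r_k}},
\qquad
r_j \leftarrow r_j + \eta_t \, s_{ij} \, (u - r_j) \quad \text{for every article } j .
```

After the update, \(r_i\) is set to \(-1000\), so \(e^{r_i}\) underflows to zero and article \(i\) is not drawn in the next round.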

News Recommendation System

news rating

Web Application

A web application was developed using Flask, integrating various tools for analyzing topics (e.g., companies) based on news articles. It consolidates visualizations and processes created throughout the project into a unified platform, organized into the following sections:

  • Explorer

  • Topic Map

  • Topic Insights

  • Word Duel

  • Word Cloud

https://hugover.pythonanywhere.com

Explorer

explorer page

Topic Map

graph page

Topic Insights

read page

Word Duel

Inspired by The Higher Lower Game and Noticioso.

duel page

Word Cloud

wcloud page

Further Improvements

Processing CDX API Results

The 3 056 418 results from the CDX API are being processed with tools and methods such as Apache Spark, Bloom filters, logistic regression and probabilistic counters. Sentiment and keywords are extracted from each news result in much the same way as in the earlier stages.


root
 |-- timestamp: integer (nullable = true)
 |-- source: string (nullable = true)
 |-- archive: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- probability: float (nullable = true)
 |-- keywords: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- sentiment: float (nullable = true)
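
As an illustration of one of the techniques named in this section, a Bloom filter can cheaply flag URLs that have already been seen before any heavier processing is done. This is a minimal pure-Python sketch, not the Spark pipeline used in the project: the class, the helper `drop_seen`, the filter size and the hashing scheme are all assumptions, and deduplication is only one plausible use of the filter.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing: h1 + i * h2).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def drop_seen(urls, seen=None):
    """Yield URLs not flagged as seen; a rare false positive may drop a new URL."""
    seen = seen if seen is not None else BloomFilter()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```

The trade-off is memory for certainty: the filter uses a fixed number of bits regardless of how many URLs pass through it, at the cost of occasionally treating a new URL as already seen.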
 

News Detection

A logistic regression model was trained to distinguish news from non-news articles. TF-IDF was used for feature extraction, and the hyperparameters were tuned to maximize recall on the “news” class.

Figure: Evaluation metrics for logistic regression classification of the 'news' class.
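
The training setup described in this section can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the tiny dataset is invented, and the parameter grid is hypothetical because the slides do not say which hyperparameters were tuned or over which values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny illustrative dataset; 1 = news, 0 = non-news. Not the project's data.
texts = [
    "Galp profit rises in the third quarter",
    "EDP announces new wind farm investment",
    "BCP shares fall after results",
    "Sonae reports record sales this year",
    "Login to your account to continue",
    "Subscribe to our newsletter today",
    "Contact us for advertising rates",
    "Page not found, return to homepage",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: the tuned hyperparameters are not listed in the slides.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

# scoring="recall" scores the positive class (label 1, "news") in binary problems.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=2)
search.fit(texts, labels)

predictions = search.predict(["Mota-Engil profit rises this year"])
```

Optimizing for recall favours catching as many real news articles as possible, accepting that some non-news pages will be let through.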

Remaining Tasks

  • Process the remaining results from the CDX API.

  • Store the processed data in a database, such as MongoDB.

  • Design and implement an algorithm to efficiently identify news articles relevant to the user’s search topic and convert the data into the required input format for the web application.

  • Integrate the solution into the web application, ensuring it is computationally efficient.

  • Improve the user interface.