CDSS, Statistics @ UC Berkeley

News Article Political Bias Classifier (NLP Project)

End-to-end NLP data pipeline for detecting political bias in news articles using transformer-based deep learning

Duration: Sep 2025 - Dec 2025
Technologies:
PyTorch Transformers RoBERTa Pandas NewsAPI Trafilatura Scikit-learn Python

End-to-end NLP pipeline that processes 24,500+ news articles to automatically detect political bias (Left/Right) using fine-tuned RoBERTa transformers, achieving 69.7% classification accuracy.

Objectives

This project aims to:

  1. Detect political bias in news articles: Build an automated NLP system to classify news articles as Left-leaning or Right-leaning with high accuracy

  2. Uncover hidden biases: Help readers identify whether specific news sources or authors consistently lean toward one political side

  3. Promote media literacy: Provide an objective, data-driven tool for understanding the political slant of internet news sources

Overview

This project was developed as the final project for UC Berkeley’s DATA 198 course (Fall 2025) to address a critical challenge in modern media: identifying political bias in news coverage. With media polarization running high, we built an automated NLP system that processes 24,500+ articles to objectively detect Left vs. Right editorial slant, achieving 69.7% classification accuracy (19.7 percentage points above random guessing).

Our goal was to help readers understand the political orientation of their news sources through data, not opinion.

Key Features

  • Automated Data Pipeline: End-to-end ETL workflow from raw HTML to model-ready features
  • Large-Scale Processing: Handles 24,500+ articles spanning 13 years (2012-2025)
  • High Accuracy: Achieves 69.7% classification accuracy (19.7pp above baseline)
  • Topic Analysis: Performs best on Coronavirus coverage (73.8% accuracy)
  • Quality Assurance: 99.5% data quality validation across all pipeline stages

Technical Implementation

Data Pipeline Architecture

Built an automated pipeline with 5 distinct stages:

# Pipeline flow
Kaggle Dataset (8,478 rows) 
  → Web Scraping (NewsAPI + Trafilatura)
  → Text Cleaning (normalization + deduplication)
  → Format Transform (wide → long: 24,505 articles)
  → Feature Extraction (RoBERTa embeddings)
  → Neural Network Classification
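
The feature-extraction stage turns each article into a fixed-size 768-dimensional RoBERTa embedding before classification. A minimal sketch of that step, assuming the standard roberta-base checkpoint from HuggingFace; the 512-token truncation limit is RoBERTa's maximum input length, and the batching is illustrative rather than the project's exact setting:

# Sketch: extract 768-dim [CLS] embeddings with roberta-base (HuggingFace Transformers).
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base").eval()

def embed_articles(texts, max_length=512):
    """Return an (N, 768) tensor of [CLS] embeddings for a list of article texts."""
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = roberta(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    return out.last_hidden_state[:, 0, :]  # first token acts as the CLS embedding

# Example: embeddings = embed_articles(df_long["text"].tolist()[:8])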

Data Collection & Preprocessing

Implemented automated web scraping and ETL workflow:

# Wide-to-long format transformation
import pandas as pd

df_long = df_wide.melt(
    id_vars=['Topics', 'Date'],
    value_vars=['left_story_text', 'center_story_text', 'right_story_text'],
    var_name='bias_label',
    value_name='text'
)
# Reduce 'left_story_text' / 'center_story_text' / 'right_story_text' to 'left' / 'center' / 'right'
df_long['bias_label'] = df_long['bias_label'].str.replace('_story_text', '', regex=False)

# Data quality validation
def validate_article(article):
    checks = [
        len(article['text'].split()) >= 10,          # Minimum length
        article['bias_label'] in ['left', 'right'],  # Valid label (Center excluded)
        not is_duplicate(article),                   # No duplicates (helper defined elsewhere)
    ]
    return all(checks)
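
The scraping step referenced above isn't shown; here is a minimal sketch of how NewsAPI can supply article URLs and Trafilatura can pull the main text. The query, date range, and page size are illustrative, and NEWSAPI_KEY is a placeholder:

# Sketch: discover article URLs via NewsAPI, then extract clean text with Trafilatura.
import trafilatura
from newsapi import NewsApiClient

newsapi = NewsApiClient(api_key="NEWSAPI_KEY")  # placeholder key

def scrape_articles(query, from_date, to_date):
    """Yield dicts with the URL and extracted main text for articles matching the query."""
    response = newsapi.get_everything(
        q=query, from_param=from_date, to=to_date, language="en", page_size=100
    )
    for item in response["articles"]:
        downloaded = trafilatura.fetch_url(item["url"])
        text = trafilatura.extract(downloaded) if downloaded else None
        if text:
            yield {"url": item["url"], "text": text}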

Model Architecture

Fine-tuned RoBERTa-base with custom classification head:

import torch.nn as nn

class BiasClassifier(nn.Module):
    def __init__(self, roberta_model):
        super().__init__()
        self.roberta = roberta_model
        self.classifier = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2)  # Binary: Left vs Right
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token embedding (768-dim)
        return self.classifier(embeddings)
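
Fine-tuning followed a standard PyTorch training loop. A minimal sketch, assuming DataLoaders that yield (input_ids, attention_mask, labels) batches; the optimizer choice and learning rate are illustrative, not the project's exact settings:

# Sketch: fine-tuning loop for the classifier above (hyperparameters are illustrative).
import torch
import torch.nn as nn
from transformers import RobertaModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BiasClassifier(RobertaModel.from_pretrained("roberta-base")).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

def train_epoch(loader):
    """One pass over a DataLoader yielding (input_ids, attention_mask, labels) batches."""
    model.train()
    for input_ids, attention_mask, labels in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()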

Challenges and Solutions

Challenge 1: Data Format Complexity

  • Problem: Original dataset stored 3 articles per row (wide format), making analysis difficult.
  • Solution: Engineered an ETL workflow that transformed 8,478 wide-format rows into 24,505 normalized articles, nearly tripling (≈2.9×) the number of usable records and enabling standard ML processing.

Challenge 2: Data Quality at Scale

  • Problem: Web-scraped articles contained HTML artifacts, duplicates, and inconsistent formatting.
  • Solution: Built a 5-stage validation pipeline with automated cleaning, achieving a 99.5% validation pass rate; removed 2,147 duplicates and standardized text formatting (a dedup sketch follows below).
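
A minimal sketch of the kind of hash-based deduplication the cleaning stage relies on; exact matching after lowercasing and whitespace collapsing is an assumption about the dedup rule, not the project's exact logic:

# Sketch: drop near-identical articles by hashing normalized text.
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't block matches."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(articles):
    """Keep the first occurrence of each normalized article text."""
    seen, unique = set(), []
    for article in articles:
        digest = hashlib.md5(normalize(article["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique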

Challenge 3: Class Imbalance

  • Problem: Initial concern about uneven Left/Center/Right distribution affecting model performance.
  • Solution: Analysis revealed a nearly balanced distribution (34.4% / 31.4% / 34.2%). Excluded Center articles for a cleaner binary classification task, resulting in 69.7% accuracy (see the sketch below).
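
A small sketch of that distribution check and the Center exclusion; the column and label names follow the melt example above and are assumptions:

# Sketch: check the label distribution, then keep only Left/Right for binary classification.
label_share = df_long["bias_label"].value_counts(normalize=True)
print(label_share)  # roughly: left ~0.344, right ~0.342, center ~0.314

df_binary = df_long[df_long["bias_label"].isin(["left", "right"])].reset_index(drop=True)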

Technologies Used

  • Data Collection: NewsAPI, Trafilatura, Requests, BeautifulSoup
  • Data Processing: Pandas, NumPy, Regex
  • Machine Learning: PyTorch, Transformers (HuggingFace), Scikit-learn
  • NLP: RoBERTa-base (pretrained on 160GB text corpus)
  • Visualization: Matplotlib, Seaborn
  • Development: Google Colab (GPU), Python 3.10+

Key Learnings

  1. Data Engineering: Learned the critical importance of data pipeline design - 80% of ML project time is data preparation, not modeling
  2. ETL Best Practices: Gained hands-on experience with schema design, data lineage tracking, and quality validation at scale
  3. Transfer Learning: Discovered how pretrained transformers (RoBERTa) dramatically improve NLP performance with limited training data
  4. Production Thinking: Importance of reproducible pipelines, documentation, and automated validation for deployment-ready systems

Performance Metrics

Overall Results

  • Accuracy: 69.7% (baseline: 50%)
  • Dataset Size: 24,505 articles
  • Data Quality: 99.5% validation pass rate
  • Processing Speed: 24K+ articles processed in <2 hours

Topic-Specific Performance

Topic             Accuracy   Sample Size
Coronavirus       73.8%      1,242 articles
Politics          67.0%      2,232 articles
World             65.8%      1,161 articles
Economy & Jobs    65.3%      1,386 articles
Elections         63.3%      1,845 articles
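
A sketch of how per-topic accuracy like the table above can be computed from test-set predictions; the results DataFrame and its column names are assumptions for illustration:

# Sketch: per-topic accuracy from test-set predictions.
# Assumes a results DataFrame with 'Topics', 'bias_label', and 'prediction' columns.
topic_accuracy = (
    results.assign(correct=lambda d: d["bias_label"] == d["prediction"])
           .groupby("Topics")["correct"]
           .agg(accuracy="mean", sample_size="size")
           .sort_values("accuracy", ascending=False)
)
print(topic_accuracy)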

Future Enhancements

  • Three-Class Classification: Include Center articles for more nuanced bias detection
  • Attention Visualization: Add interpretability features to show which words/phrases indicate bias
  • Real-Time API: Deploy model as REST API for live article classification
  • Browser Extension: Create Chrome extension for instant bias detection while reading news
  • Temporal Analysis: Track how news sources’ bias evolves over time

Project Structure

news-article-bias-classifier/
├── notebooks/                    # Sequential workflow (7 notebooks)
│   ├── 01_data_exploration.ipynb
│   ├── 02_web_scraping.ipynb
│   ├── 03_data_cleaning.ipynb
│   ├── 04_data_transformation.ipynb
│   ├── 05_feature_extraction.ipynb
│   ├── 06_model_experiments.ipynb
│   └── 07_final_model.ipynb
├── data/
│   ├── raw/                      # Original Kaggle dataset
│   └── processed/                # Cleaned, transformed data
├── results/
│   ├── plots/                    # Accuracy curves, confusion matrices
│   └── metrics/                  # Performance logs
├── README.md                     # Full technical documentation
└── requirements.txt              # Python dependencies

Visual Results

Data Pipeline Flow

flowchart LR
    A[Raw Data<br/>8,478 rows] --> B[Web Scraping]
    B --> C[Text Cleaning]
    C --> D[ETL Transform<br/>24,505 articles]
    D --> E[RoBERTa Embeddings<br/>768-dim]
    E --> F[Neural Network]
    F --> G[69.7% Accuracy]

Model Performance

  • Training Curve: Converged at epoch 8 with early stopping
  • Confusion Matrix: Balanced predictions across Left/Right classes
  • Topic Analysis: Coronavirus articles showed clearest bias signals

  • Course: DATA 198 (Fall 2025) - Data Science Society @ UC Berkeley
  • Team Size: 5 members (my role: Data Pipeline & Preprocessing Lead)
  • Duration: 3 months (Sep - Dec 2025)

Wrapping up

It’s been a tough semester, but we made it through!

Huge thanks to Alex and Sameer, who were leading mentors throughout this project. We couldn’t have wrapped this up without them.

And another big thanks to my fantastic teammates – Edan, Mike, Avni, and Sriya. I learned so much from y’all.

I’ll be back with a new one 😁

Merry Christmas 🎄 :)

Dec 10, 2025 at 1:43 AM in my dorm.