HarmGuard AI
🟢 Real-time Analysis | 🧠 Deep Learning Model
Project Overview
HarmGuard AI is a real-time AI-powered content moderation platform designed to detect and classify harmful textual content across various online environments like social media, messaging apps, and community forums. The system uses advanced Natural Language Processing (NLP) and multi-label classification techniques to identify self-harm ideation, aggressive behavior, and references to violence.
Developed to support moderation workflows and early-intervention systems, this tool helps platforms maintain safer digital spaces by flagging and prioritizing potentially risky messages.
Detection Showcase


The system includes a Threat Analysis Panel providing visual, real-time feedback for each message analyzed.
Key Functionalities
- 🔍 Real-Time Text Classification: Processes text inputs instantly with low latency suitable for live environments.
- 🧠 Multi-Label Harm Detection: Identifies one or more risk categories per message, including `self_harm`, `harming_others`, `harmed_by_others`, and `reference_to_harm`.
- 🎯 Confidence Scoring: Provides a confidence score for each prediction, allowing filtering and prioritization (see the example endpoint after this list).
- ⚖️ Responsible AI: Designed with considerations to reduce bias and ensure fairness.
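A minimal sketch of how such an analysis endpoint could be exposed with Flask is shown below. The `/analyze` route, payload shape, and `classify_message` helper are illustrative assumptions, not the project's documented API; the real scoring logic would call the fine-tuned model described further down.

```python
# Illustrative sketch of a real-time analysis endpoint; route name, payload
# shape, and the classify_message helper are assumptions, not the actual API.
from flask import Flask, request, jsonify

app = Flask(__name__)

LABELS = ["self_harm", "harming_others", "harmed_by_others", "reference_to_harm"]

def classify_message(text: str) -> list[float]:
    """Placeholder for the fine-tuned model; see the inference sketch further below."""
    return [0.0] * len(LABELS)

@app.route("/analyze", methods=["POST"])
def analyze():
    text = request.get_json()["text"]
    scores = classify_message(text)
    return jsonify({
        "text": text,
        "scores": dict(zip(LABELS, scores)),
        "flagged": [label for label, score in zip(LABELS, scores) if score >= 0.5],
    })

if __name__ == "__main__":
    app.run()
```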
Technologies Used
- Python
- Flask
- Hugging Face Transformers
- BERT / RoBERTa / Fine-tuned LLM
- Scikit-learn
- Pandas
- NumPy
Model Training Highlights
Model Architecture
Fine-tuned `roberta-large` transformer model adapted for multi-label classification with 4 output nodes.
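As a rough illustration, the multi-label head could be configured as below. The checkpoint name and label set come from this README, while the loading code itself is an assumed sketch rather than the project's actual training script.

```python
# Sketch of the multi-label model setup (assumed, not the project's exact code).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["self_harm", "harming_others", "harmed_by_others", "reference_to_harm"]

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large",
    num_labels=len(LABELS),                     # 4 output nodes
    problem_type="multi_label_classification",  # sigmoid outputs + BCEWithLogitsLoss
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)
```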
Loss Function
`BCEWithLogitsLoss`, optimized via the Hugging Face `Trainer` for effective multi-label learning.
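When `problem_type="multi_label_classification"` is set, the model applies `BCEWithLogitsLoss` internally; the standalone snippet below only illustrates its multi-hot, float-valued targets, and the numbers are made up for the example.

```python
# Standalone illustration of the multi-label loss; values are arbitrary.
import torch
from torch.nn import BCEWithLogitsLoss

logits = torch.tensor([[2.1, -0.4, -3.0, 0.7]])  # raw model outputs for one message
labels = torch.tensor([[1.0, 0.0, 0.0, 1.0]])    # multi-hot targets (floats, not class ids)
loss = BCEWithLogitsLoss()(logits, labels)
print(loss.item())
```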
Evaluation Metrics
Macro F1-score (0.89) and Accuracy (0.91) maintained across all harm categories.
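A plausible `compute_metrics` for these figures is sketched below, assuming sigmoid scores thresholded at 0.5. The README does not state which accuracy variant was used, so subset accuracy here is one reasonable reading, not a confirmed detail.

```python
# Possible compute_metrics for macro F1 and (subset) accuracy; an assumption,
# mirroring the sigmoid + 0.5 threshold described in the inference step.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid over raw logits
    preds = (probs >= 0.5).astype(int)
    return {
        "macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
        "accuracy": accuracy_score(labels, preds),
    }
```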
Data Pipeline
CSV datasets processed with `pandas` and the Hugging Face `datasets` library for efficient batch loading.
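The loading step could look roughly like the sketch below, assuming a `text` column plus one 0/1 column per label; the file name and column layout are assumptions about the dataset, not documented facts.

```python
# Sketch of the CSV -> tokenized dataset step; column names and file name are assumed.
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

LABELS = ["self_harm", "harming_others", "harmed_by_others", "reference_to_harm"]
tokenizer = AutoTokenizer.from_pretrained("roberta-large")

df = pd.read_csv("train.csv")                              # hypothetical file name
df["labels"] = df[LABELS].astype(float).values.tolist()    # multi-hot float vectors

dataset = Dataset.from_pandas(df[["text", "labels"]], preserve_index=False)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
)
```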
Training Config
Hugging Face `Trainer` with early stopping (patience=3) and per-epoch validation.
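Putting it together, the training setup might resemble the sketch below, which reuses `model`, `dataset`, and `compute_metrics` from the earlier sketches; `eval_dataset` stands in for a held-out split, and the hyperparameter values are placeholders rather than the project's actual settings.

```python
# Possible Trainer configuration: per-epoch validation, early stopping with
# patience=3, best-model restoration. Hyperparameters are placeholders.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="harmguard-roberta-large",
    eval_strategy="epoch",            # `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,                      # from the architecture sketch above
    args=args,
    train_dataset=dataset,            # from the data pipeline sketch above
    eval_dataset=eval_dataset,        # hypothetical held-out validation split
    compute_metrics=compute_metrics,  # from the evaluation metrics sketch above
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```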
Inference Process
Logits are passed through a sigmoid activation, and labels are assigned wherever the score meets a configurable threshold (default ≥ 0.5).
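In code, that inference step could look like the following; the checkpoint directory is a hypothetical path, and the threshold is exposed as a parameter to reflect the configurable cutoff.

```python
# Sketch of inference: sigmoid over logits, then threshold for label assignment.
# MODEL_DIR is a hypothetical path to the fine-tuned checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["self_harm", "harming_others", "harmed_by_others", "reference_to_harm"]
MODEL_DIR = "harmguard-roberta-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR).eval()

def analyze_text(text: str, threshold: float = 0.5) -> dict[str, float]:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits)[0]
    return {label: round(p.item(), 3) for label, p in zip(LABELS, probs) if p >= threshold}

print(analyze_text("example message to analyze"))
```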