I’m exploring ways to analyze extensive WhatsApp or ANWhatsApk conversation datasets, and I’m particularly interested in techniques for sentiment analysis and topic modeling. Can anyone share insights or suggest robust methodologies, tools, or libraries that are well suited to handling such large-scale data and extracting meaningful insights from it?
Did you end up doing this analysis? This topic is very important to me.
Does anyone have code to share?
I'd suggest using streaming datasets to load a large corpus. You will likely have to curate the data yourself, since it is conversational. I've also seen distilled RoBERTa models on the Hugging Face Hub work well for sentiment analysis.
Jumping in on this older post because newer tools might make things easier now. You could try small transformer models like DistilBERT for quick sentiment checks, and something like BERTopic for topic modeling since it works well on chat-style text. If your WhatsApp export is huge, batching the messages or sampling can help. Curious if you’re still working on this or already picked a workflow.
Here is an updated take with today's resources:
Context (what’s changed since that older discussion)
The core idea still holds: small transformer sentiment + BERTopic topics is a strong baseline. What’s improved since then is mainly:
- Parsing/ETL: better WhatsApp-to-DataFrame tooling (less regex pain). (whatstk.readthedocs.io)
- Topic modeling ergonomics: clearer BERTopic best practices + guided topics + representation models (better labels). (maartengr.github.io)
- Scaling options: lighter BERTopic installs + Model2Vec backend (can avoid heavyweight stacks), plus ONNX Runtime pipelines for faster inference. (GitHub)
Below is a “good solutions today” workflow that stays practical for both small and huge exports.
Recommended “today” workflow (high level)
- Export the WhatsApp chat “without media” (to reduce noise and parsing issues). (whatstk.readthedocs.io)
- Parse it into a DataFrame (timestamp, sender, message) using a WhatsApp-aware parser. (whatstk.readthedocs.io)
- Choose your document unit for topics (single messages are often too short; use time windows or bursts).
- Sentiment:
  - quick baseline: DistilBERT SST-2
  - informal/social text: Twitter-RoBERTa sentiment-latest
  - enforce batching + truncation (critical for speed and correctness). (Hugging Face)
- Topics: BERTopic with best-practice settings; use guided topics or representation models for readable labels. (maartengr.github.io)
- Evaluate topics (coherence + manual sanity checks) and handle outliers. (AKSW Subversion)
Step-by-step pipeline (practical and reproducible)
1) Export + parse (avoid regex unless you must)
Export
The whatstk docs explicitly recommend exporting chats “Without Media” and note exports are done chat-by-chat. (whatstk.readthedocs.io)
Parse to DataFrame
whatstk’s “from_source → pandas.DataFrame” flow is designed specifically for WhatsApp exports. (whatstk.readthedocs.io)
Typical outputs you want:
- date (datetime)
- username (sender)
- message (string)
At this point, do a quick audit of what was parsed (message count, date range, senders, message lengths).
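For example, a minimal parse-and-audit sketch using whatstk's from_source flow (assuming a chat.txt exported “Without Media”; column names can vary slightly between whatstk versions):

```python
from whatstk import WhatsAppChat

# Parse the exported chat into a pandas DataFrame
chat = WhatsAppChat.from_source(filepath="chat.txt")
df = chat.df  # expected columns: date, username, message

# Quick audit before any modeling
print(len(df), "messages")
print("from", df["date"].min(), "to", df["date"].max())
print(df["username"].value_counts().head(10))   # who talks the most
print(df["message"].str.len().describe())       # message length distribution
```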
2) Clean + normalize (minimal but important)
Do not over-clean. For chats:
- Keep emojis (they often carry sentiment)
- Remove/flag system lines (“joined using invite link”, encryption notices)
- Optionally normalize URLs to a placeholder token like "<URL>"
Also decide how to handle:
- attachment placeholders (e.g., "<Media omitted>")
- very long pasted text (truncate or split)
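A minimal cleaning sketch along those lines; the system-message phrases and the "<Media omitted>" placeholder are assumptions based on common English exports, so check what your export actually contains:

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+")
SYSTEM_PHRASES = (
    "joined using this group's invite link",            # assumption: wording varies by locale
    "Messages and calls are end-to-end encrypted",
)

def clean_message(text) -> Optional[str]:
    """Lightly normalize a message; return None if it should be dropped."""
    if not isinstance(text, str):
        return None
    if text.strip() == "<Media omitted>":               # attachment placeholder
        return None
    if any(phrase in text for phrase in SYSTEM_PHRASES):  # system/notice lines
        return None
    return URL_RE.sub("<URL>", text).strip()            # keep emojis, normalize URLs only

# df comes from the parsing step above
df["clean"] = df["message"].map(clean_message)
df = df.dropna(subset=["clean"])
```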
3) Define the “document” for topic modeling (the main quality lever)
For topics, a single WhatsApp message is often too short to cluster well. Good “document” definitions:
- Time window: concatenate all messages in 5–15 minute bins.
- Burst: concatenate consecutive messages until a gap threshold (e.g., 10 minutes).
- Daily: concatenate all messages per day (or per day per user).
This one choice often reduces noise and improves topic coherence.
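A sketch of the time-window variant, reusing the df from the steps above (the 10-minute bin is just an example; tune it to how bursty your chat is):

```python
# One "document" per 10-minute window of conversation
df = df.sort_values("date")
df["window"] = df["date"].dt.floor("10min")

docs_df = (
    df.groupby("window", as_index=False)
      .agg(text=("clean", " ".join), n_messages=("clean", "size"))
)
docs = docs_df["text"].tolist()   # these are the documents fed to the topic model
```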
Sentiment “today”: fast baselines + correct batching
Option A: very quick baseline (DistilBERT SST-2)
DistilBERT SST-2 is still a solid polarity baseline and its model card reports 91.3% accuracy on the SST-2 dev set. (Hugging Face)
Option B: better for informal chat (Twitter-RoBERTa sentiment-latest)
The CardiffNLP model card describes it as RoBERTa trained on ~124M tweets (2018–2021) and fine-tuned for sentiment, which often transfers better to chatty/emoji-heavy text than review datasets. (Hugging Face)
Operational rules (these prevent common failures)
- Batch the inputs (don’t run the pipeline row-by-row; see the sketch after this list). Hugging Face pipelines are intended to abstract inference, but performance still depends heavily on batching. (Hugging Face)
- Always specify truncation (and padding if needed). Padding/truncation is the standard way to handle variable-length text batches and prevent overflow issues. (Hugging Face)
- If CPU-bound or the export is large: use ONNX Runtime pipelines. Optimum’s ONNX Runtime pipelines can load ONNX models and run accelerated inference without rewriting your pipeline code. (Hugging Face)
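A sketch combining the batching and truncation rules (the cardiffnlp model id is the commonly used one for sentiment-latest; double-check it against the model card, and adjust batch_size to your hardware):

```python
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    # or "distilbert-base-uncased-finetuned-sst-2-english" for the quick baseline
)

texts = df["clean"].tolist()                            # df from the cleaning step above
results = clf(texts, batch_size=32, truncation=True)   # batched + truncated, not row-by-row
df["sentiment"] = [r["label"] for r in results]
df["sentiment_score"] = [r["score"] for r in results]
```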
Topic modeling “today”: BERTopic, but run it with the modern playbook
Why BERTopic remains a top default
BERTopic’s docs define it as transformer embeddings + c-TF-IDF + clustering to produce interpretable topics. (maartengr.github.io)
What to do first: apply the BERTopic best practices + tuning guidance
BERTopic explicitly maintains a best-practices guide and a parameter tuning guide (including UMAP/HDBSCAN considerations). (maartengr.github.io)
If you already know the likely themes: guided topics
Guided topic modeling lets you provide seed topics and guide convergence (useful for chats where you expect recurring themes like “work”, “travel”, “sports”). (maartengr.github.io)
Make topic labels readable: representation models
BERTopic supports representation models (from fast keyword extraction to GPT-like labeling) to improve topic naming and summaries. (maartengr.github.io)
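A sketch putting the three pieces together (the seed topics are placeholders for themes you expect in your own chat):

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

seed_topic_list = [                       # assumption: replace with your expected themes
    ["work", "meeting", "deadline"],
    ["trip", "flight", "hotel"],
    ["football", "match", "game"],
]

topic_model = BERTopic(
    language="multilingual",                 # chats are often mixed-language; use "english" otherwise
    seed_topic_list=seed_topic_list,         # guided topic modeling
    representation_model=KeyBERTInspired(),  # more readable topic labels
    calculate_probabilities=False,           # keep memory in check on large exports
    verbose=True,
)

topics, _ = topic_model.fit_transform(docs)   # docs from the aggregation step
print(topic_model.get_topic_info().head(15))
```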
Scaling: what changes when your export is huge
1) Memory pitfalls (especially probabilities)
If you set calculate_probabilities=True, memory use can jump significantly (HDBSCAN soft clustering + big matrices). This shows up repeatedly in user reports and is called out in BERTopic’s API docs: calculating probabilities for all topics across all documents can slow computation and increase memory usage. (maartengr.github.io)
Default recommendation for large exports
- start with calculate_probabilities=False
- only compute full distributions later, for a subset or after topic reduction (see the sketch below)
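If you do need distributions later, BERTopic can approximate them after fitting instead of computing full probabilities during fit; a sketch on a subset:

```python
# Approximate per-document topic distributions after the model is fitted
subset = docs[:5_000]                       # or a random sample of the full export
topic_distr, _ = topic_model.approximate_distribution(subset)
```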
2) Outliers (-1 topic) can dominate on short text
A paper evaluating BERTopic on multi-domain short text found that HDBSCAN can classify a majority of documents as outliers, and that replacing HDBSCAN with k-means achieved similar performance without producing outliers. (arXiv)
This is why “document definition” (window/burst aggregation) and clustering choices matter for chat logs.
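Since BERTopic lets you swap in any scikit-learn-style clusterer, trying k-means is a small change; a sketch (n_clusters is a guess you would tune):

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic

# k-means assigns every document to a topic, so there is no -1 outlier bucket
cluster_model = KMeans(n_clusters=30, random_state=42)   # assumption: tune n_clusters
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, _ = topic_model.fit_transform(docs)
```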
3) Lightweight BERTopic + Model2Vec (newer, practical)
BERTopic releases describe combining light-weight installation with Model2Vec embeddings so you can run BERTopic without PyTorch. (GitHub)
This is useful when:
- you want simpler deployments
- you’re processing large logs on CPU machines
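A sketch of the Model2Vec route; this assumes a recent BERTopic version with the Model2Vec backend, and the potion model id is just an example (check the release notes for what is currently recommended):

```python
from model2vec import StaticModel
from bertopic import BERTopic

# Static, distilled embeddings: fast on CPU, no PyTorch-sized stack required
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")   # assumption: example model id
topic_model = BERTopic(embedding_model=embedding_model)
topics, _ = topic_model.fit_transform(docs)
```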
Evaluation: how to know if topics are “good”
Topic coherence (not just eyeballing)
The classic reference for coherence measures is Röder et al. (framework for constructing and evaluating coherence measures). (AKSW Subversion)
Practical tooling:
- Palmetto (implements coherence calculations; explicitly references Röder et al.). (GitHub)
- tomotopy coherence module (compute coherence programmatically). (Bab2Min)
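If you want to stay in Python (Palmetto is a Java library), one common alternative not listed above is gensim's CoherenceModel; a sketch scoring BERTopic's top words against the tokenized documents:

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Roughly tokenize the same documents the topic model saw
tokenized = [doc.lower().split() for doc in docs]
dictionary = Dictionary(tokenized)
vocab = set(dictionary.token2id)

# Top words per topic from BERTopic, skipping the -1 outlier topic
topic_words = [
    [word for word, _ in topic_model.get_topic(topic_id) if word in vocab]
    for topic_id in topic_model.get_topics()
    if topic_id != -1
]
topic_words = [words for words in topic_words if len(words) >= 3]

score = CoherenceModel(
    topics=topic_words,
    texts=tokenized,
    dictionary=dictionary,
    coherence="c_v",
).get_coherence()
print("c_v coherence:", score)
```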
Outliers and noise are expected with density clustering
HDBSCAN is explicitly a noise-aware clustering method; understanding how it treats noise/outliers helps interpret BERTopic results. (HDBSCAN)
Alternatives worth knowing (when BERTopic isn’t the best fit)
Top2Vec (embedding-driven topic discovery)
Top2Vec is a commonly used embedding-based topic modeling approach that jointly embeds topics/documents/words and can be a good alternative when you want less parameter tuning. (GitHub)
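A minimal Top2Vec sketch (it takes raw documents and handles embedding, dimensionality reduction, and clustering itself, and it decides the number of topics, so there is no k to pick up front):

```python
from top2vec import Top2Vec

# Works best with at least a few thousand documents
model = Top2Vec(documents=docs, speed="learn", workers=4)

print(model.get_num_topics())
topic_words, word_scores, topic_nums = model.get_topics()   # inspect the discovered topics
```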
Common pitfalls (and the fix)
- Modeling topics on single messages → aggregate into windows/bursts.
- Row-by-row sentiment inference → batch inputs (huge speedup). (Stack Overflow)
- Not setting truncation → enforce truncation/padding explicitly. (Hugging Face)
- Turning on full probability matrices too early → start with assigned-topic only; scale later. (maartengr.github.io)
- Too many outliers in BERTopic → change document unit; consider k-means clustering; evaluate outlier ratio. (arXiv)
A practical “default stack” (if you just want something that works)
- Parsing: whatstk export → DataFrame (whatstk.readthedocs.io)
- Sentiment:
  - baseline: DistilBERT SST-2 (Hugging Face)
  - informal chat: Twitter-RoBERTa sentiment-latest (Hugging Face)
  - always: batched pipeline + truncation (Hugging Face)
  - scale: Optimum ONNX Runtime pipeline (Hugging Face)
- Topics: BERTopic with best practices + guided topics + representation models (maartengr.github.io)
- Scale/ops: lightweight BERTopic + Model2Vec when needed (GitHub)
- Evaluation: coherence (Palmetto/tomotopy) + outlier ratio (GitHub)