Day 2: Spam Classifier for Spam Probability
My Telegram dev community is doing great! But sifting out the occasional phishing link or bot spam is a nightmare. It’s not overwhelming, but it’s enough to disrupt the flow, to demand attention I’d rather spend elsewhere. The standard binary classification – “spam” or “not spam” – feels too blunt, like using a hammer where a scalpel is needed. I want a quick gauge of how likely a message is spam so I can triage.
1. Frustration → Insight
Every day I watch hundreds of messages roll into our Telegram group. Some are clearly helpful, others clearly malicious, and a lot land in the gray area. A binary label isn't enough; I need a spam score so I can, for example, auto-mute anything above 80% and let a human review the 40–80% zone.
2. Updated Mental Model: Your Spam-Detection “Toolbox” Station
LangChain = pipelines of specialists + toolbox of tools.
Instead of a fixed “summarizer” station on our assembly line, today we slot in a SpamDetector tool—a black-box specialist that scores incoming messages. Think of your system as a workshop:
- Assembly Line (Chains) still connects inputs → outputs.
- Toolbox (Tools) holds reusable modules like our `SpamDetector`.
- Agent (Conductor) decides when to pull out the spam-scoring tool versus passing data straight to another station.

When a new message arrives, the agent reaches into the toolbox, grabs the `SpamDetector`, hands it the message, and gets back a `{type, probability}` verdict—then chooses the next step (mute, notify, or forward). This mix of chains + tools lets you build flexible, maintainable workflows without rewriting core logic each time.
3. The New Code: Hugging Face + LangChain Tool
```python
# 1. Install dependencies (once)
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# !pip install transformers langchain langchain-openai

# 2. Imports
from transformers import pipeline
from langchain.tools import Tool
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType

# 3. Prepare your HF-based spam detector tool
spam_pipe = pipeline(
    "text-classification",
    model="Titeiiko/OTIS-Official-Spam-Model",
    framework="pt",
)

def analyze_output(input_text: str) -> dict:
    out = spam_pipe(input_text)[0]
    label = "Not Spam" if out["label"] == "LABEL_0" else "Spam"
    return {"type": label, "probability": out["score"]}

spam_detection_tool = Tool(
    name="SpamDetector",
    description="Detects if a given text is spam or not using the OTIS model",
    func=analyze_output,
)

# 4. Initialize your LLM and agent
llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's API URL
    api_key="lm-studio",
    model="deepseek-r1-distill-qwen-7b",
    temperature=0,  # deterministic behavior for tool selection
)

agent = initialize_agent(
    tools=[spam_detection_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

# 5. Give the agent a raw prompt; it will reason and call SpamDetector under the hood
message = """
NIKЕ AND ОРЕNSEA launched collection, go fast take one https://opensea.io/collection/nikbox/overview
"""

response = agent.run(
    "Please analyze the following Telegram message and return its spam probability as a percentage:\n\n" + message
)
print("Agent Response:", response)
```
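Before wiring the tool into an agent, it can be handy to sanity-check the label-mapping logic in isolation. The sketch below is an assumption on my part, not part of the original code: it stubs the pipeline with a hypothetical `fake_pipe` that mimics the Hugging Face text-classification output shape, so you can test the mapping without downloading the model.

```python
# Hypothetical stub mimicking the HF text-classification output:
# a list with one {"label": ..., "score": ...} dict per input.
def fake_pipe(text):
    return [{"label": "LABEL_1", "score": 0.97}]

# Same logic as analyze_output, but with the pipeline passed in
# so it can be swapped for a stub in tests.
def analyze_output_with(pipe, input_text: str) -> dict:
    out = pipe(input_text)[0]
    label = "Not Spam" if out["label"] == "LABEL_0" else "Spam"
    return {"type": label, "probability": out["score"]}

result = analyze_output_with(fake_pipe, "FREE NFT drop, click now!")
print(result)  # {'type': 'Spam', 'probability': 0.97}
```

Once the mapping behaves as expected, the real `spam_pipe` drops in without any other changes.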
Key Changes & Why
- Moved from OpenAI to a specialized HF model for higher accuracy on spam tasks.
- Wrapped the HF `pipeline` in a LangChain `Tool` so you can drop it into any agent or chain.
- Returns a structured dict with `"type"` and `"probability"` for easy downstream logic (e.g., auto-mute, human review).
4. How It Works (Updated)
1. Model Initialization
   We spin up a Hugging Face `pipeline("text-classification")` with the `Titeiiko/OTIS-Official-Spam-Model` on PyTorch. By loading it once at startup, we avoid the per-call overhead of re-instantiating the model.
2. Scoring Function (`analyze_output`)
   - We call `spam_pipe(input_text)` to get a list of predictions (always one element here).
   - The pipeline returns `{"label": "LABEL_0" or "LABEL_1", "score": float}`. We map `"LABEL_0"` to Not Spam and `"LABEL_1"` to Spam, preserving the confidence score.
3. LangChain Tool Wrapping
   - `Tool(name="SpamDetector", func=analyze_output, …)` turns our pure Python function into a first-class LangChain component.
   - Anywhere in your chains or agents, you can now refer to `SpamDetector` and LangChain will handle calling `analyze_output` under the hood.
4. Runtime Usage
   - Your bot or chain invokes `spam_detection_tool.run(message)`.
   - You receive a structured dictionary, e.g. `{"type": "Spam", "probability": 0.97}`, where `probability` is the confidence score.
   - From there, comparing `probability` against your threshold is trivial (`if result["probability"] > 0.8: mute_user()`).

By refactoring to a Hugging Face pipeline and wrapping it as a `Tool`, we gain specialized accuracy and seamless integration into any larger LangChain workflow—without losing clarity or composability.
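One subtlety worth flagging: the pipeline's `score` is the confidence in the *predicted* label, not a spam probability per se. A verdict of `{"type": "Not Spam", "probability": 0.95}` means 95% confident it is clean, i.e. roughly 5% spam. A small helper (hypothetical, not in the original code) can normalize the verdict into a single spam probability:

```python
def spam_probability(verdict: dict) -> float:
    """Fold the {type, probability} verdict into one spam probability."""
    p = verdict["probability"]
    # Confidence in "Not Spam" is the complement of spam probability.
    return p if verdict["type"] == "Spam" else 1.0 - p

print(f"{spam_probability({'type': 'Spam', 'probability': 0.97}):.0%}")      # 97%
print(f"{spam_probability({'type': 'Not Spam', 'probability': 0.95}):.0%}")  # 5%
```

With the score normalized this way, all downstream thresholds can be stated in terms of spam probability alone.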
5. Next Steps
- Thresholds & Actions: Auto-mute at >0.8, human-review at 0.4–0.8, post at <0.4.
- Batch Processing: Loop over new messages in your bot’s webhook handler.
- Logging & Feedback: Store scores and review corrective actions to refine your prompt or even train a custom classifier later.
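The threshold policy above can be sketched as a small routing function; the function and action names here are hypothetical placeholders, and the cutoffs are meant to be tuned against your own traffic:

```python
def triage(spam_prob: float) -> str:
    """Map a spam probability (0.0–1.0) to a moderation action."""
    if spam_prob > 0.8:
        return "auto-mute"      # confident spam: act immediately
    if spam_prob >= 0.4:
        return "human-review"   # gray zone: queue for a moderator
    return "post"               # likely clean: let it through

print(triage(0.97))  # auto-mute
print(triage(0.55))  # human-review
print(triage(0.05))  # post
```

Keeping the policy in one function makes it trivial to log every decision alongside its score, which feeds directly into the feedback loop above.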
With this “Spam Scorer” in place, your Telegram community can self-moderate in real time—letting you focus on building, not policing. On to Day 3’s Translation Chain!