Python Social Media Crawling in 2026

8 min read

Two ways to crawl social data with Python in 2026: build it yourself with requests and BeautifulSoup, or call a unified API with one httpx.get(). Here is when each one wins.

There are two ways to crawl social data with Python in 2026: build a crawler with requests and BeautifulSoup, or call a unified API from Python. Build it yourself and you will be fighting TLS fingerprinting, datacenter-IP blocks, JS hydration, and rotating internal parameters indefinitely. The unified path is one httpx.get(). This post walks through both, with code you can run. For platform deep-dives, see TikTok crawling and Instagram scraping.

The two paths of Python crawling

Say "Python crawling" and most people picture requests fetching a page and BeautifulSoup parsing it. For blogs, news, and wikis that ship complete HTML, that is still the right answer. Social platforms are the problem. TikTok, Instagram, and YouTube ship near-empty HTML, and the real data (followers, view counts) lives inside JSON that JavaScript renders later. They also block automated requests aggressively.

| Aspect | DIY crawler (requests + BeautifulSoup) | Unified API (httpx.get) |
| --- | --- | --- |
| Time to data | Hours to days | About 5 minutes |
| Anti-bot | You manage proxies, TLS, CAPTCHA | Vendor handles it |
| Schema | Different per platform, write adapters | One schema, 42 platforms |
| Maintenance | Fix selectors every 3-6 weeks, forever | Vendor owns it |
| Cost | Proxies (about $10/GB) + engineer time | 1 credit per request |

requests and BeautifulSoup, by hand

The familiar starting point. For static HTML it is done in ten lines.

import requests
from bs4 import BeautifulSoup

# Plain requests + BeautifulSoup. Works fine on complete HTML pages.
res = requests.get("https://example.com/blog", timeout=10)
res.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(res.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("h2")]
print(titles)

# On social platforms this same code breaks:
# - requests has a non-browser TLS fingerprint, so you get a 403
# - followers/views live in JS-rendered JSON, not in the HTML
# - datacenter IPs get banned within minutes

Point the same code at a TikTok profile and the response is an empty shell or a 403. Swapping requests for httpx changes nothing: neither looks like a browser.

How social platforms block you

There are four layers, and one is never enough.

First, TLS fingerprinting. Platforms read the handshake and can tell real Chrome from a Python library, so you need curl_cffi to impersonate Chrome.

Second, IP reputation. Datacenter IPs get blocked on the first request, so residential proxies at roughly $10/GB are effectively required, and 10k requests/day is a realistic $50-100/month in bandwidth.

Third, hydration. TikTok renders public pages as JSON inside a <script id="__UNIVERSAL_DATA_FOR_REHYDRATION__"> tag, and your traversal path breaks every few weeks when the keys change.

Fourth, rotation. Instagram's internal GraphQL endpoint expects a doc_id that shifts every 2-4 weeks, and youtube-transcript-api reverse-engineers YouTube's internal timedtext endpoint with no SLA. Need transcripts? Try the YouTube transcript tool first.
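To see what the bandwidth figure implies, here is a rough back-of-envelope sketch. The $10/GB price and the 10k requests/day volume come from the text above; the 30-day month, the decimal gigabyte, and the average response sizes are assumptions for illustration, not measurements.

```python
# Back-of-envelope residential proxy bandwidth cost.
PRICE_PER_GB = 10.0            # USD, "roughly $10/GB" from the text
REQUESTS_PER_DAY = 10_000      # volume quoted in the text
DAYS_PER_MONTH = 30            # assumption
BYTES_PER_GB = 1_000_000_000   # assumption: decimal GB


def monthly_proxy_cost(avg_response_kb: float) -> float:
    """Monthly bandwidth cost in USD for an assumed average response size in KB."""
    monthly_bytes = avg_response_kb * 1_000 * REQUESTS_PER_DAY * DAYS_PER_MONTH
    return monthly_bytes / BYTES_PER_GB * PRICE_PER_GB


def implied_response_kb(monthly_budget_usd: float) -> float:
    """Average response size in KB that a given monthly budget pays for."""
    monthly_bytes = monthly_budget_usd / PRICE_PER_GB * BYTES_PER_GB
    return monthly_bytes / (REQUESTS_PER_DAY * DAYS_PER_MONTH) / 1_000


print(f"25 KB per response: ${monthly_proxy_cost(25):.0f}/month")
print(f"$50 budget covers about {implied_response_kb(50):.1f} KB per response")
print(f"$100 budget covers about {implied_response_kb(100):.1f} KB per response")
```

Under these assumptions the $50-100 range corresponds to responses of roughly 17 to 33 KB each. Cost scales linearly with payload size, so heavier pages move the bill up in proportion.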

The code is the easy part. Keeping all four layers alive, every few weeks, is the work.
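The hydration layer shows why. Below is a minimal sketch of pulling JSON out of the rehydration script tag, using only the standard library and a made-up HTML sample. The tag id is the one named above; the key path (userInfo, stats) is hypothetical and stands in for whatever the live page uses this month.

```python
import json
import re
from typing import Any

# Tag id taken from the text above; the regex approach is one option among several.
HYDRATION_RE = re.compile(
    r'<script[^>]*\bid="__UNIVERSAL_DATA_FOR_REHYDRATION__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_hydration(html: str) -> Any:
    """Return the parsed hydration JSON, or None if the tag is missing or invalid."""
    match = HYDRATION_RE.search(html)
    if match is None:
        return None  # empty shell or block page
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def dig(data: Any, *path: str) -> Any:
    """Walk nested dicts; return None as soon as a key is missing."""
    for key in path:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


# Hypothetical key path, for illustration only; the real keys change over time.
FOLLOWER_PATH = ("userInfo", "stats", "followerCount")

# Made-up sample page, not a real TikTok response.
sample_html = (
    '<html><body><script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" '
    'type="application/json">'
    '{"userInfo": {"stats": {"followerCount": 1234}}}'
    "</script></body></html>"
)

payload = extract_hydration(sample_html)
followers = dig(payload, *FOLLOWER_PATH)
print(followers)
```

Returning None instead of raising makes a renamed key look like missing data rather than a crash, which is convenient for monitoring. It is also how a crawler ends up silently collecting nothing for days.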

Calling a unified API from Python

Hand all four layers to a vendor: one HTTP call, one API key, one schema across platforms. In Python that is a single httpx.get().

import os

import httpx

r = httpx.get(
    "https://www.socialcrawl.dev/v1/tiktok/profile",
    params={"handle": "tiktok"},
    headers={"x-api-key": os.environ["SOCIALCRAWL_API_KEY"]},
)
r.raise_for_status()  # surface 4xx/5xx before touching the body
body = r.json()
profile = body["data"]
print(f"{profile['username']} - followers {profile['followers']:,}")
print(f"engagement rate: {profile['engagement_rate']}%")

Every response is the same { success, platform, data } envelope. See TikTok data API for the full endpoint list and credit costs.
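Because the envelope is constant, it can be checked once before any platform-specific code runs. The sketch below is one way to do that. The three keys come from the envelope described above; the Envelope and EnvelopeError names, the sample values, and the assumption that a failed call sets success to something other than true are all illustrative, not the vendor's documented behavior.

```python
from dataclasses import dataclass
from typing import Any


class EnvelopeError(Exception):
    """Raised when a response body is not a successful envelope."""


@dataclass
class Envelope:
    platform: str
    data: dict


def parse_envelope(body: Any) -> Envelope:
    """Validate the { success, platform, data } shape and return its payload."""
    if not isinstance(body, dict):
        raise EnvelopeError("response body is not a JSON object")
    # Assumption: a failed call carries success set to a non-true value.
    if body.get("success") is not True:
        raise EnvelopeError("request was not successful")
    missing = [key for key in ("platform", "data") if key not in body]
    if missing:
        raise EnvelopeError(f"missing envelope keys: {missing}")
    return Envelope(platform=body["platform"], data=body["data"])


# Made-up sample body for illustration.
sample = {
    "success": True,
    "platform": "tiktok",
    "data": {"username": "tiktok", "followers": 1000},
}
envelope = parse_envelope(sample)
print(envelope.platform, envelope.data["followers"])
```

In practice this would wrap r.json() from the request above, so that the rest of the pipeline only ever sees a validated data dict.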

The value of a unified schema

The request shape is unremarkable. The schema is the point. Raw TikTok JSON stores followers as followerCount, Instagram as edge_followed_by.count, YouTube as statistics.subscriberCount. Build across three platforms and you are writing per-platform adapters before you ship anything. A unified schema maps them all to followers, so one parser works everywhere.

import os

import httpx

KEY = os.environ["SOCIALCRAWL_API_KEY"]
BASE = "https://www.socialcrawl.dev/v1"

targets = [
    ("tiktok", {"handle": "nike"}),
    ("instagram", {"username": "nike"}),
    ("youtube", {"handle": "nike"}),
]

for platform, params in targets:
    r = httpx.get(f"{BASE}/{platform}/profile", params=params,
                  headers={"x-api-key": KEY})
    r.raise_for_status()  # stop on 4xx/5xx rather than parsing an error body
    data = r.json()["data"]
    # Same parser. followers is followers, whatever the platform.
    print(f"{platform}: {data['followers']:,} followers, "
          f"engagement {data['engagement_rate']}%")
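For contrast, here is roughly what the do-it-yourself side of that looks like: one adapter per platform, each reaching into a different raw field. The field names are the ones quoted earlier in this section; the sample payloads, their values, and the normalize helper are assumptions for illustration.

```python
from typing import Callable, Dict


def tiktok_followers(raw: dict) -> int:
    return int(raw["followerCount"])


def instagram_followers(raw: dict) -> int:
    return int(raw["edge_followed_by"]["count"])


def youtube_followers(raw: dict) -> int:
    # int() also covers the case where the count arrives as a string.
    return int(raw["statistics"]["subscriberCount"])


# One entry per platform; every new platform means another adapter.
ADAPTERS: Dict[str, Callable[[dict], int]] = {
    "tiktok": tiktok_followers,
    "instagram": instagram_followers,
    "youtube": youtube_followers,
}


def normalize(platform: str, raw: dict) -> dict:
    """Map a raw per-platform payload onto a single followers field."""
    return {"platform": platform, "followers": ADAPTERS[platform](raw)}


# Made-up raw payloads, shaped after the field names above.
samples = {
    "tiktok": {"followerCount": 1200},
    "instagram": {"edge_followed_by": {"count": 3400}},
    "youtube": {"statistics": {"subscriberCount": "5600"}},
}

normalized = [normalize(platform, raw) for platform, raw in samples.items()]
for row in normalized:
    print(f"{row['platform']}: {row['followers']:,} followers")
```

Each adapter is trivial on its own. The cost is that every one of them breaks independently when its platform renames a key.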

Responses also carry computed fields (engagement_rate, estimated_reach, content_category, language), so you can reason over the data instead of just reading it. Pricing is credit-based: 100 free credits to start, then £15 for 2,500 (Starter), £49 for 20,000 (Growth), £299 for 150,000 (Pro). Credits never expire, and there is no daily cap or approval queue. Working with YouTube? See the YouTube API guide.
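The tiers are easier to compare as a price per 1,000 requests. This sketch uses the prices and credit counts listed above and assumes 1 credit per request, as in the comparison table; endpoints that cost more credits would scale the result accordingly.

```python
# Tier prices (GBP) and credit counts as listed above.
TIERS = {
    "Starter": (15, 2_500),
    "Growth": (49, 20_000),
    "Pro": (299, 150_000),
}

# Assumption: 1 credit per request, as in the comparison table.
CREDITS_PER_REQUEST = 1


def cost_per_1000_requests(tier: str) -> float:
    """Price in GBP of 1,000 requests on the given tier."""
    price, credits = TIERS[tier]
    return price * 1_000 * CREDITS_PER_REQUEST / credits


for name in TIERS:
    print(f"{name}: £{cost_per_1000_requests(name):.2f} per 1,000 requests")
```

That works out to about £6.00, £2.45, and £1.99 per 1,000 requests for Starter, Growth, and Pro respectively.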

Legality and privacy

Crawling public social data is broadly defensible under US federal law, but terms of service and privacy law are separate matters. In hiQ Labs v. LinkedIn (9th Cir. 2022) the court held that the CFAA likely does not bar crawling publicly accessible, logged-out data, and in Meta v. Bright Data (N.D. Cal. 2024) the court found that Meta's terms did not prohibit logged-out scraping of public data. Both cover only US law. The GDPR (EU/UK) and Korea's PIPA still apply to personal data even when it is public. Avoid authenticated endpoints, never collect private or logged-in content, and respect rate limits. This is not legal advice and it varies by jurisdiction. See the scraping legal guide for more.

Choosing a path

One question decides it: how much maintenance can you absorb? For a one-off research project where you want to own every line, build it with curl_cffi and proxies. For a production pipeline or AI agent that must still run in 90 days with nobody watching, a unified API removes the maintenance and standardizes the schema. For static blogs and news, requests + BeautifulSoup is plenty. The math only changes once social data enters the picture.

Frequently asked questions

Can you crawl social media with Python?

Yes, two ways. Combine requests or httpx with BeautifulSoup to build your own crawler, or call a unified API with httpx.get(). If you build it yourself, social platforms block datacenter IPs and non-browser TLS fingerprints, so curl_cffi and residential proxies are effectively required, and data like follower counts lives in JS-rendered JSON that needs hydration parsing. A unified API handles all of that for you.

Can requests and BeautifulSoup scrape Instagram?

They work on static HTML but rarely on Instagram. It blocks datacenter IPs, catches the requests TLS fingerprint, renders follower counts with JS, and rotates its internal GraphQL doc_id every 2-4 weeks. A plain BeautifulSoup script tends to hit empty responses or 403s. Add curl_cffi and proxies, or use a unified API.

Should I use requests or httpx for Python crawling?

For static pages either is fine; httpx adds async support and HTTP/2. For social platforms both get blocked by TLS fingerprinting, so you need curl_cffi to impersonate Chrome. When calling a unified API it does not matter, and the examples here use httpx.get().

Is crawling social media legal?

In the US, crawling public logged-out data is broadly not a CFAA violation; hiQ v. LinkedIn (2022) and Meta v. Bright Data (2024) both came down on the side of crawling public, logged-out data. But platform terms often forbid automated collection, and the GDPR (EU/UK) and Korea's PIPA apply to personal data even when public. Avoid authenticated endpoints and private data. This is not legal advice.

Why do I need a unified schema?

Because field names differ per platform. Followers are followerCount on TikTok, edge_followed_by.count on Instagram, statistics.subscriberCount on YouTube. Working across platforms means writing an adapter for each. A unified schema maps them all to followers and adds computed fields like engagement_rate, so one parser covers every platform.
