Back to blog

How I scraped 100,000 fishing posts to find a secret fishing spot

See how I scraped hundreds of different sites for fishing reports and posts. Previously, I worked as a sneaker bot developer and learned a few neat tricks along the way about scraping at scale, bypassing antibots, and evading detection.

I'll walk through how I built scrapers to extract data from a few different websites to find new fishing spots - including the site crawling, data extraction, and then the storage of the data for retrieval.

The Problem

Recently, I've moved around a lot for work, and I found myself struggling to find "secret" spots that I spent years finding in other places. I needed a shortcut, and fortunately, people like giving hints as to where their favorite spots are in niche forums. People also like gloating when they catch a lot of fish, and they usually give a general direction as to where they caught them. It's just hard to extract these insights at scale and find the nuggets of good information. I decided to fire up old reliable (BeautifulSoup, some proxies, an async http client, and a coffee) to see if I could make something happen.

Many of these insights were spread across different forums or lake-specific fishing reports, so it was challenging to have a catch-all way to scrape each site. Each site structures its text differently - posts sit in different selectors, pagination works differently, and so on. The simplest solution was to create a tailored scraper that extracts the specific content for each site, which is how I originally solved this problem, but there are better ways that I will discuss later on in the article.

Indexing the sites

To extract all of the content on a site, you first need a way to index it. Fortunately, the sites in this scenario had a sitemap.xml. A sitemap is an XML-format list of all URLs on the site, and it makes it much easier to filter and find the URLs you want to extract content from. Below, I've included an example of what a sitemap might look like. You can see the URLs sitting neatly in each <url> block. Not all sitemaps will look like this, but it's a good example.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.secretfishingspots.com/threads/1</loc>
  </url>
  <url>
    <loc>https://www.secretfishingspots.com/threads/2</loc>
  </url>
</urlset>

In this scenario, it's an easy way to collect the URLs we want to extract content from. Typically, websites follow some sort of pattern in their URL structure. In forums, this might be something like /forum/{uniqueID} or /threads/{title}. We can match that pattern against every URL in the sitemap, and then we have a list of every forum post on the site. I ended up reducing the overall link count by about 30% by only matching URLs under /threads/fishing-reports*, so that I was only grabbing fishing reports specifically. Websites have a ton of noise, so any pre-filtering up front makes retrieval much simpler later on in the process.
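Here's a rough sketch of that filtering step. The /threads/fishing-reports prefix and the inline sitemap are illustrative - real sitemaps are often split into multiple files or gzipped, so adjust accordingly:

```python
import xml.etree.ElementTree as ET

# Sitemaps declare this XML namespace, so we need it to find <loc> elements
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def filter_sitemap_urls(sitemap_xml: str, pattern: str) -> list[str]:
    """Pull every <loc> URL out of a sitemap and keep only matching ones."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]
    return [u for u in urls if pattern in u]

# Tiny inline sitemap for demonstration
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.secretfishingspots.com/threads/fishing-reports/1</loc></url>
  <url><loc>https://www.secretfishingspots.com/threads/gear-talk/2</loc></url>
</urlset>"""

print(filter_sitemap_urls(sitemap, "/threads/fishing-reports"))
# keeps only the fishing-reports thread
```

In practice you'd fetch the sitemap over HTTP first, then feed the response body into this function.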

After this, I had a list of about 10,000 links for this specific site. Next, we'll extract the data from them (pretty quickly, but obviously within the ToS).

Data Extraction

With a pool of about 10,000 links, we'll need to understand the pattern each page uses to render text on the site. Every site organizes its text differently; some even render it with JavaScript rather than serving raw HTML. My approach: open the site in a browser, highlight whatever text I want to extract, and pop open inspect element. Inspect element will highlight the specific selectors for that text - it might look like this:

<p class="post-content">And the secret fishing spot is...</p>

We can see that our content is wrapped in the 'post-content' class. I used BeautifulSoup to extract this data with code that looked like this:

from bs4 import BeautifulSoup

# Parse the raw HTML
soup = BeautifulSoup(site_html, 'html.parser')

# Find all <p> tags with class 'post-content'
posts = soup.find_all('p', class_='post-content')

# Extract the text content
for post in posts:
    print(post.get_text())

Fortunately, the sites for this use case had simple HTML structures that made data extraction easy.

Queue up the threadpool, we're going to scrape

After creating our list of URLs and building our extraction pattern, we'll move on to applying the extraction pattern to each of these URLs in parallel. At scale, this is usually the hardest part. Anyone can fire up a scraper, run it once, and extract some content. It starts to get more challenging to do it at scale while evading antibots, IP blocks, and browser fingerprinting.

I ran a quick test against the site with this:

import requests

result = requests.get("https://www.secretfishingspots.com")

After running this, the request timed out, which was weird because this was working in my browser. The site had some sort of browser fingerprinting, and it was identifying that this was not a normal client trying to access the site. I was being blocked by the bot protection.

There are a few ways to bypass this, but in this case, it was fixed by adding all of the headers that I saw in my browser.
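Concretely, the fix looked something like this. The header values below are placeholders, not the real set - copy the exact headers (and their order) from your own browser's devtools network tab:

```python
import requests

# Placeholder headers mirrored from a real browser session -- swap in the
# exact values and ordering your own browser sends. Dicts preserve insertion
# order in Python, which matters to protections that check header order.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}

def fetch_frontpage():
    # Same request as before, but now carrying browser-like headers
    return requests.get("https://www.secretfishingspots.com",
                        headers=BROWSER_HEADERS, timeout=10)
```

With the headers in place, the request that previously timed out came back with a normal response.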

Your browser sends request headers like User-Agent, Cookie, and sec-ch-ua, plus HTTP/2 pseudo-headers like :authority, :method, :path, and :scheme. Advanced bot protections look at these too, and they will check whether the ordering is correct. Each browser orders its headers differently, and bot protections will catch headers that are out of order for the claimed fingerprint, user-agent, and other parameters.

This site wasn't too challenging - I just needed to send the headers as I saw them in the browser. That usually works for simpler sites. Now that we have a repeatable request pattern, we can wrap the whole flow inside a get_secretsite(url) helper to automate it.

We can get to the fun stuff now. We'll need a way to run these requests in parallel that doesn't get us IP banned. If we were to just launch all of these requests for each URL in the list of 10,000, it would be closer to a DDoS attack than scraping. It's hard to know the limit to prevent an IP ban, but some sites will return headers that let you know. Unfortunately, this site did not.
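When a site does expose them, the hints usually live in headers like Retry-After or the X-RateLimit-* family. These names are conventions rather than a standard, so check what your target actually sends; a small helper for reading them might look like this:

```python
def rate_limit_hints(headers: dict) -> dict:
    """Pull common rate-limit hints out of a response's headers, if present."""
    # Normalize keys, since header casing varies between servers
    lower = {k.lower(): v for k, v in headers.items()}
    hints = {}
    if "retry-after" in lower:
        # Note: Retry-After can also be an HTTP date; this only handles seconds
        hints["retry_after_seconds"] = int(lower["retry-after"])
    if "x-ratelimit-remaining" in lower:
        hints["remaining"] = int(lower["x-ratelimit-remaining"])
    if "x-ratelimit-limit" in lower:
        hints["limit"] = int(lower["x-ratelimit-limit"])
    return hints

print(rate_limit_hints({"Retry-After": "30", "X-RateLimit-Remaining": "0"}))
# {'retry_after_seconds': 30, 'remaining': 0}
```

Without headers like these, you're left picking a conservative concurrency limit and watching for 429s.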

I'm a big fan of using a semaphore, a synchronization primitive that caps how many tasks can hold a slot at once - in this case, how many requests are in flight. In effect, a semaphore lets you say "I want at most 50 requests running in parallel". When one request finishes, the next one starts, keeping us steady at 50 in flight. It's an easy way to keep our requests to the site in check.

import asyncio

from bs4 import BeautifulSoup

async def fetch_and_extract(url, semaphore):
    async with semaphore:
        # get_secretsite wraps our header-spoofed request flow from earlier
        async with get_secretsite(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            posts = soup.find_all('p', class_='post-content')
            return [post.get_text() for post in posts]

async def scrape_all(urls, max_concurrent=50):
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_and_extract(url, semaphore) for url in urls]
    # return_exceptions=True so one failed URL doesn't kill the whole run
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Run the scraper
urls = ["https://secretfishingspots.com/thread/1", "https://secretfishingspots.com/thread/2", ...]
all_posts = asyncio.run(scrape_all(urls))

After setting this up, I ran about 50 requests in parallel and wrote the extracted data to a file. It still took some time due to the size of the site, but it went fairly fast, and then I had access to all of the raw posts from the site!
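Since gather ran with return_exceptions=True, the results list mixes post lists with exception objects, so I filtered the failures out before writing anything. A sketch of that flush-to-disk step (the JSONL format and filename are just my choice here):

```python
import json

def write_posts(results, path="posts.jsonl"):
    """Flatten scrape results into one JSON line per post, skipping failures."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for result in results:
            if isinstance(result, Exception):
                continue  # a timed-out or blocked URL; skip it
            for text in result:
                f.write(json.dumps({"content": text}) + "\n")
                written += 1
    return written

# e.g. write_posts(all_posts) after the scrape finishes
```

One line per post keeps the file append-friendly, which matters later if you re-scrape for new posts.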

Extract Insights

After extracting all of the fishing posts, there were too many posts to read myself. I decided to embed and upload each of them into a vector database, so that I could ask questions about the posts.

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

def embed(text):
    return openai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

# Create the collection (text-embedding-3-small produces 1536-dim vectors)
qdrant.create_collection(
    collection_name="fishing",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Store posts
points = [PointStruct(id=i, vector=embed(p['content']), payload=p) for i, p in enumerate(posts)]
qdrant.upsert(collection_name="fishing", points=points)

# Search
results = qdrant.search(
    collection_name="fishing",
    query_vector=embed("where's the best spot to fish on ladybird lake?"),
    limit=10,
)

Once everything was indexed, I could query it conversationally. I'd ask things like "Where do people catch largemouth bass in Austin in the summer?" or "What lures work best for night fishing on Lake Travis?" The vector search would return the most relevant posts, and I'd feed those into an LLM for synthesis.
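The synthesis step was mostly just stuffing the retrieved posts into a prompt. A minimal sketch of that assembly, assuming each search hit's payload still carries the original 'content' field from ingestion:

```python
def build_synthesis_prompt(question, hits):
    """Assemble an LLM prompt from search hits (objects with a .payload dict)."""
    context = "\n\n".join(f"- {hit.payload['content']}" for hit in hits)
    return (
        "You are summarizing fishing forum posts.\n"
        "Answer the question using only the posts below.\n\n"
        f"Posts:\n{context}\n\n"
        f"Question: {question}"
    )
```

From there it's one chat-completion call with this string as the user message, and the model does the cross-referencing for you.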

One pattern I noticed: users would refer to spots by nicknames or vague landmarks - "the cove past the second dock" or "where the old tire used to be." By pulling dozens of posts mentioning the same nickname and cross-referencing details (nearby features, depth, which side of the lake), I was able to piece together a general idea of the spot. It had been mentioned in fragments across maybe 100 different posts over three years, but no single post gave it away. Naturally, I went out to the lake and caught quite a few fish in the "secret" spot, including a 5-pound largemouth bass! I still use the scraped data across not only this site, but many others, to find new spots and learn what lures work best.

Takeaways

I had a ton of fun writing these scrapers, and it brought me back to my sneaker dev days. It's always fun when there's a real-world component too. I also found out how truly annoying it is to do this at scale - it took forever to write the scrapers, set up the concurrency, write the antibot evasions, etc. Current tools like Firecrawl are too expensive for indexing content like this at scale, and these sites follow a repeatable pattern anyway, so why use an LLM for that? The work is cumbersome, and it becomes even more so if you want to continuously extract data (like new forum posts every day), because then you get into scheduling, re-embedding, etc.

Writing these scrapers took me a weekend. Maintaining them would take forever. So I built meter - you describe what you want to extract, and it handles the scraping, proxies, and antibot evasion automatically. It'll notify you when content changes too, so you're not babysitting cron jobs. Here's example output from a run through our SDK:

[
    {
        "post": "Secret spot #1..."
    },
    {
        "post": "Secret spot #2..."
    }
]

What's Next?

Want to learn more? Shoot me an email if you want to try meter out (mckinnon@meter.sh), or run a scrape on a site here for a deeper dive.

Happy scraping!