How I scraped 100,000 fishing posts to find a secret fishing spot
See how I scraped hundreds of different sites for fishing reports and posts. Previously, I worked as a sneaker bot developer and learned a few neat tricks along the way about scraping at scale, bypassing antibots, and evading detection.
I'll walk through how I built scrapers to extract data from a few different websites to find new fishing spots - including the site crawling, data extraction, and then the storage of the data for retrieval.
The Problem
Recently, I've moved around a lot for work, and I found myself struggling to find "secret" spots that I spent years finding in other places. I needed a shortcut, and fortunately, people like giving hints as to where their favorite spots are in niche forums. People also like gloating when they catch a lot of fish, and they usually give a general direction as to where they caught them. It's just hard to extract these insights at scale and find the nuggets of good information. I decided to fire up old reliable (BeautifulSoup, some proxies, an async http client, and a coffee) to see if I could make something happen.
Many of these insights were spread across different forums or lake-specific fishing reports, so it was challenging to have a catch-all way to scrape each site. Each of these sites has a different makeup of text - posts sitting in different selectors, etc. The simplest solution was to create a tailored scraper that extracts the specific content for each site, which is how I originally solved this problem, but there are better ways that I will discuss later on in the article.
Indexing the sites
To extract all content on a site, you need to somehow find a way to index the site. Fortunately, the sites in this scenario had a sitemap.xml. A sitemap is a list of all URLs on the site in XML format, and it makes it much easier to filter and find URLs that you want to extract content for. Below, I've included an example of what this sitemap might look like. You can see the URLs are sitting neatly in each <url> block. Not all sitemaps will look like this, but this is a great example.
<url>
<loc>https://www.secretfishingspots.com/threads/fishing-reports-1</loc>
</url>
<url>
<loc>https://www.secretfishingspots.com/threads/fishing-reports-2</loc>
</url>
In this scenario, the sitemap gives us an easy way to pull out the URLs we want to extract content from. Websites typically follow some sort of pattern in their URL structure; on forums, this might be something like /forum/{uniqueID} or /threads/{title}. We can match that pattern against every URL in the sitemap and end up with a list of every forum post on the site. I ended up reducing the overall link count by about 30% by only matching URLs under /threads/fishing-reports*, so that I was only grabbing fishing reports specifically. Websites have a ton of noise, so any pre-filtering up front makes retrieval much simpler later in the process.
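Here's a minimal sketch of that filtering step, assuming the sitemap lives at /sitemap.xml and ignoring, for now, the bot-protection headers covered later on:
import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://www.secretfishingspots.com/sitemap.xml"  # assumed location

# Pull down the sitemap and collect every <loc> entry
sitemap_xml = requests.get(SITEMAP_URL).text
soup = BeautifulSoup(sitemap_xml, "xml")  # the "xml" parser requires lxml
all_urls = [loc.text for loc in soup.find_all("loc")]

# Keep only the fishing-report threads to cut the noise up front
report_urls = [u for u in all_urls if "/threads/fishing-reports" in u]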
After this, I had a list of about 10,000 links for this specific site. Next, we'll extract the data from them (pretty quickly, but obviously within the ToS).
Data Extraction
With a pool of about 10,000 links, we'll need to understand the pattern each one uses to render text on the site. Each site organizes its text differently, and some even render the content with JavaScript rather than serving it as raw HTML. I will open the site in a browser, highlight whatever text I want to extract, and then pop open inspect element. Inspect element will highlight the specific selectors for that text - it might look like this:
<p class='post-content'>
And the secret fishing spot is...
</p>
We can see that our content is wrapped in the 'post-content' class. I used BeautifulSoup to extract this data with code that looked like this:
from bs4 import BeautifulSoup

# Parse your HTML
soup = BeautifulSoup(site_html, 'html.parser')
# Find all p tags with class 'post-content'
posts = soup.find_all('p', class_='post-content')
# Extract the text content
for post in posts:
    print(post.get_text())
Fortunately, the sites for this use case had simple HTML structures that made data extraction easy.
Queue up the threadpool, we're going to scrape
After creating our list of URLs and building our extraction pattern, we'll move on to applying the extraction pattern to each of these URLs in parallel. At scale, this is usually the hardest part. Anyone can fire up a scraper, run it once, and extract some content. It starts to get more challenging to do it at scale while evading antibots, IP blocks, and browser fingerprinting.
I ran a quick scrape on the site with this.
import requests
result = requests.get("https://www.secretfishingspots.com")
After running this, the request timed out, which was weird because this was working in my browser. The site had some sort of browser fingerprinting, and it was identifying that this was not a normal client trying to access the site. I was being blocked by the bot protection.
There are a few ways to bypass this, but in this case, it was fixed by adding all of the headers that I saw in my browser.
Your browser will send request headers like User-Agent, Cookie, sec-ch-ua, etc. It also sends HTTP/2 pseudo-headers like :authority, :method, :path, and :scheme. Advanced bot protections look at these too, and they will check whether the ordering is correct. Each browser orders its headers differently, and bot protections will catch headers that are out of order for the claimed fingerprint, user-agent, and other parameters.
This site wasn't too challenging; I just needed to send the same headers I saw in the browser, which usually works for simple sites. Now that we have a repeatable request flow, we can wrap it in a get_secretsite(url) helper to automate it.
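The post doesn't show the helper itself, so here's a minimal sketch of what a requests-based get_secretsite(url) might look like. The header values below are placeholders; note that requests won't reproduce HTTP/2 pseudo-headers or exact header ordering, so tougher protections may need a lower-level client:
import requests

# Header values copied from a browser session; these exact values are placeholders
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def get_secretsite(url):
    # Browser-like headers are enough to satisfy this site's fingerprinting
    response = requests.get(url, headers=BROWSER_HEADERS, timeout=30)
    response.raise_for_status()
    return response.text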
We can get to the fun stuff now. We'll need a way to run these requests in parallel that doesn't get us IP banned. If we were to just launch all of these requests for each URL in the list of 10,000, it would be closer to a DDoS attack than scraping. It's hard to know the limit to prevent an IP ban, but some sites will return headers that let you know. Unfortunately, this site did not.
I'm a big fan of using a semaphore, a synchronization primitive that limits how many requests are running in parallel at any one time. To visualize this, a semaphore lets you say, "I want 50 requests running in parallel." Whenever one request finishes, the semaphore lets the next one start, keeping us at 50 requests in flight. It's an easy way to keep our requests to the site in check.
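Here's roughly what that setup can look like; a minimal sketch that assumes Python 3.9+ (for asyncio.to_thread) plus the hypothetical get_secretsite helper and report_urls list from the earlier sketches:
import asyncio

CONCURRENCY = 50  # requests allowed in flight at once

async def scrape_all(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def fetch_one(url):
        # Wait here until one of the 50 "slots" frees up
        async with semaphore:
            # get_secretsite is blocking, so run it in a worker thread
            return await asyncio.to_thread(get_secretsite, url)

    # Failures (timeouts, blocks) come back as results instead of crashing the run
    return await asyncio.gather(*(fetch_one(u) for u in urls), return_exceptions=True)

# pages = asyncio.run(scrape_all(report_urls))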
After setting this up, I ran about 50 requests at a time and wrote the extracted data to a file. It still took some time due to the size of the site, but it went fairly fast, and then I had access to all of the raw posts from the site!
Extract Insights
I won't go too in depth on this side, but I will explain how I found some unique spots (and caught fish at them)! After extracting all of the fishing posts, there were far too many for me to read myself, so I embedded each of them and uploaded them into a vector database so that I could ask questions about the posts. I asked tons of questions, plugged many of the best posts into an LLM, and was able to extract some key insights about a secret fishing spot that a user had codenamed. From about a hundred posts, I was able to deduce a general location on a specific lake in Austin, TX. Naturally, I went out to the lake and caught quite a few fish there, including a 5 pound largemouth bass! I still use the scraped data, across not only this site but many others, to find new spots and learn which lures work best.
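As a rough illustration (the original pipeline isn't shown here), this is what the embed-and-query step can look like using sentence-transformers, with a brute-force similarity search standing in for a real vector database; the model name and question are placeholders:
from sentence_transformers import SentenceTransformer, util

posts = [...]  # the post text extracted during the scrape

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
post_embeddings = model.encode(posts, convert_to_tensor=True)

# Embed a question, then pull the most similar posts to feed into an LLM
question = "Where are people catching largemouth bass around Austin?"
query_embedding = model.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, post_embeddings, top_k=10)[0]
top_posts = [posts[hit["corpus_id"]] for hit in hits]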
Takeaways
I had a ton of fun writing these scrapers, and it did bring me back to my sneaker dev days. It's always fun when there's a real-world component too. I also found out how truly annoying it is to do this at scale: it took forever to write the scrapers, set up the concurrency, write the antibot evasions, etc. Current tools like firecrawl are too expensive for indexing content like this at scale, and these pages follow a repeatable pattern, so why run an LLM over every one of them? The work is cumbersome, and it becomes even more cumbersome if you want to continuously extract data (like new forum posts every day), because then you have to get into scheduling, re-embedding, etc.
I created meter for this. It uses an LLM once to generate the selectors, then turns them into a repeatable pattern that scales across all URLs with the same structure. In our case, this pattern would work for every forum post URL and would extract the posts on each page. You'd be able to say "Extract all of the forum posts" and get results like the ones below. The power of using an LLM once is that you get a repeatable pattern to run on any URL that matches it. It also cuts down on the LLM noise you get from markdown-based generators.
[
  {
    "post": "Secret spot #1..."
  },
  {
    "post": "Secret spot #2..."
  }
]
Meter manages the scheduling and orchestration of the scraping, including sending notifications if any content changes, and it also handles the bot protection and proxies for you.
What's Next?
Want to learn more? Shoot me an email if you want to try meter out (mckinnon@meter.sh) - check out our documentation for a deeper dive.
Happy scraping!