Tracking supermarket prices with Playwright
In December 2022, the author created a price tracking website for Greek supermarkets, utilizing Playwright for scraping, cloud services for automation, and Tailscale to bypass IP restrictions, optimizing for efficiency.
In December 2022, amid rising inflation, the author developed a website to track price changes in Greece's three largest supermarkets. The project faced challenges, particularly with scraping JavaScript-rendered sites, which required the use of Playwright, a tool that allows for browser automation. The author initially attempted to run the scraping on an old laptop but found it insufficient due to performance issues. Switching to cloud services, the author opted for Hetzner, which offered a more cost-effective solution compared to AWS. The scraping process was automated to run daily, utilizing a CI server on the old laptop to manage tasks on the more powerful cloud server. To bypass IP restrictions imposed by one supermarket, the author implemented Tailscale, allowing requests to appear as if they originated from a residential IP. Over time, the setup proved reliable, although it faced challenges from website changes that could disrupt scraping accuracy. The author optimized the process by upgrading server specifications and reducing data fetched during scraping, which improved efficiency and reduced costs. The overall monthly expenses for the scraping operation remained low, primarily due to the economical cloud service and the free tier of data storage used.
- The author built a price tracking website for supermarkets in Greece using Playwright for scraping.
- Initial attempts to scrape using an old laptop were unsuccessful due to performance limitations.
- The scraping process was automated and run on a cost-effective cloud server from Hetzner.
- Tailscale was used to circumvent IP restrictions from one supermarket.
- The setup has been optimized for efficiency and cost-effectiveness over time.
Related
How I scraped 6 years of Reddit posts in JSON
The article covers scraping 6 years of Reddit posts for self-promotion data, highlighting challenges like post limits and cutoffs. Pushshift is suggested for Reddit archives. Extracting URLs and checking website status are explained. Findings reveal 40% of sites inactive. Trends in online startups are discussed.
Evaluating a Decade of Hacker News Predictions: An Open-Source Approach
The blog post evaluates a decade of Hacker News predictions using LLMs and ClickHouse. Results show a 50% success rate, highlighting challenges in prediction nuances. Future plans include expanding the project. Website: https://hn-predictions.eamag.me/.
Storing Scraped Data in an SQLite Database on GitHub
The article explains Git scraping, saving data to a Git repository with GitHub Actions. Benefits include historical tracking and using SQLite for storage. Limitations and Datasette for data visualization are discussed.
How to save $13.27 on your SaaS bill
The author discusses managing costs with Vercel's analytics, converting images to reduce charges, and building a custom API using SQLite. They faced deployment challenges but plan future enhancements.
Archiving and Syndicating Mastodon Posts
The article details archiving Mastodon posts to a personal website using the PESOS model, emphasizing online presence, automation, and content organization through a custom tool developed in Go.
- Many commenters have created similar price tracking websites, sharing insights on the technical challenges of scraping and data management.
- Common issues include dealing with changing website structures, anti-scraping measures, and the complexities of accurately matching products across different retailers.
- Several users emphasize the importance of using advanced tools like Playwright and cloud services for effective scraping.
- There is a call for greater price transparency and the potential for collaborative efforts in data scraping.
- Some users express concerns about the ethical implications of price tracking and the impact of AI on pricing strategies.
At the time I wrote it I thought nobody else was doing it, but now I know of at least three startups doing the same in NZ. It seems the inflation really stoked a lot of innovation here. The patterns are about what you'd expect: supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment on brand loyalty vs. price-sensitive people; there might be three popular brands of chocolate and every week only one of them will be sold at a fair price.
One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
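As an illustration of that regex approach, here is a minimal, hypothetical normaliser; the patterns and the example product name are invented, and real catalogues would still need manual mapping on top:

```python
import re

# Hypothetical normaliser in the spirit described above: strip branding noise,
# unify units, and lowercase, so that "COCA-COLA Zero Sugar 1,5 LT" and
# "Coca Cola zero sugar 1.5l" land on the same key.
UNIT_PATTERNS = [
    (re.compile(r"(\d+)[.,](\d+)\s*(l|lt|ltr|liter|litre)s?\b", re.I), r"\1.\2l"),
    (re.compile(r"(\d+)\s*(g|gr|gram)s?\b", re.I), r"\1g"),
]

def normalise(name: str) -> str:
    name = name.lower()
    name = re.sub(r"[^\w\s.,]", " ", name)          # drop punctuation and symbols
    for pattern, repl in UNIT_PATTERNS:
        name = pattern.sub(repl, name)
    return re.sub(r"\s+", " ", name).strip()        # collapse whitespace

print(normalise("COCA-COLA Zero Sugar 1,5 LT"))      # "coca cola zero sugar 1.5l"
```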
I've found that building the scrapers and infrastructure is the comparatively easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether that's because my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.
A fun project, but challenging at times, and annoying problems to fix.
> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
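A minimal sketch of that split with Playwright's Python API, assuming the raw HTML is archived first and parsed in a separate step; the URL, selectors, and BeautifulSoup are placeholder choices, not the commenter's actual stack:

```python
from datetime import date
from pathlib import Path

from bs4 import BeautifulSoup                 # hypothetical parser choice
from playwright.sync_api import sync_playwright

RAW_DIR = Path("raw") / date.today().isoformat()
RAW_DIR.mkdir(parents=True, exist_ok=True)

def fetch(url: str, name: str) -> None:
    """Step 1: scrape. Save the rendered HTML verbatim; no parsing here."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        (RAW_DIR / f"{name}.html").write_text(page.content())
        browser.close()

def parse(name: str) -> list[dict]:
    """Step 2: parse. Runs against the archived file, so parser fixes can be replayed later."""
    soup = BeautifulSoup((RAW_DIR / f"{name}.html").read_text(), "html.parser")
    return [
        {"name": item.select_one(".name").get_text(strip=True),
         "price": item.select_one(".price").get_text(strip=True)}
        for item in soup.select(".product")   # placeholder selectors
    ]
```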
I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/
Most of the scraping in my project is done by making simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.
The main challenge I have been working on is linking products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...
It works for the most part, as long as at least one correct barcode number is provided for a product.
You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
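A rough sketch of such a gate using the thresholds mentioned above; the data shapes are assumptions:

```python
def safe_to_sync(old: dict[str, float], new: dict[str, float]) -> bool:
    """Refuse to publish a scrape that looks wrong rather than silently syncing it."""
    # No price should move by more than 100% between consecutive scrapes.
    for product, price in new.items():
        prev = old.get(product)
        if prev and abs(price - prev) / prev > 1.0:
            return False
    # The number of active products shouldn't swing by more than 20%.
    if old and abs(len(new) - len(old)) / len(old) > 0.2:
        return False
    return True

# Usage: only overwrite yesterday's snapshot if the new one passes.
# if safe_to_sync(yesterday_prices, today_prices): publish(today_prices)
```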
You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you can't just grab that API response.
Even then, MITM interception of the mobile app (to see the network requests and data) will also get blocked without decent cover-ups.
I tried, but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket price comparison services just have (cheap labour) people scrape it.
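For reference, pointing Playwright at a proxy is a launch option; the endpoint and credentials below are placeholders for whichever residential proxy provider is used:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "http://proxy.example.com:8000",  # hypothetical rotating endpoint
            "username": "user",
            "password": "secret",
        }
    )
    page = browser.new_page()
    page.goto("https://www.example-supermarket.com/")    # placeholder URL
    print(page.title())
    browser.close()
```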
For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).
On that note, it seems you are tracking price, but are you also checking the cost per gram (or ounce)? A manufacturer or store could keep the price the same but offer less to the consumer. Wonder if your tool would catch this.
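A tiny illustration of why unit price catches what shelf price misses; the figures are invented:

```python
def unit_price(price: float, grams: float) -> float:
    """Price per 100 g, the number that actually exposes shrinkflation."""
    return price / grams * 100

# Same shelf price, smaller pack: a shelf-price tracker sees no change,
# a unit-price tracker sees a 10% increase.
before = unit_price(3.50, 550)   # ~0.64 per 100 g
after = unit_price(3.50, 500)    # 0.70 per 100 g
print(f"{after / before - 1:.1%}")  # 10.0%
```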
I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintaining over the years. We are scraping over 30 million prices daily.
In Europe, that would probably be Aldi/Lidl.
In the U.S., maybe Costco/Trader Joe's.
For online, CamelCamelCamel/Amazon (for health/beauty/some electronics, but not food).
If you can buy direct from the manufacturer, sometimes that's even better. For example, I got a particular brand of soap I love at the soap's wholesaler site in bulk for less than half the retail price. For shampoo, buying the gallon size direct was way cheaper than buying from any retailer.
The flip side to this is that very often you find that the data populating the site is in a very simple JSON format to facilitate easy rendering, ironically making the scraping process a lot more reliable.
I really wish supermarkets were mandated to post this information whenever the price of a particular SKU updated.
The tools that could be built with such information would do amazing things for consumers.
What's the thought process behind using a CI server (which I thought was mainly for builds) for what is essentially a data pipeline?
One difference: I'm recording each scraping session as a HAR file (for proving provenance). mitmproxy (mitmdump) is invaluable for that.
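As an aside, Playwright itself can also write a HAR per browser context via record_har_path, which may be enough when mitmproxy isn't in the loop; a minimal sketch:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # The HAR file is written out when the context is closed.
    context = browser.new_context(record_har_path="session.har")
    page = context.new_page()
    page.goto("https://www.example-supermarket.com/")  # placeholder URL
    context.close()
    browser.close()
```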
For example, I recently purchased fruit and dairy at Safeway in the western US, and after I had everything I wanted, I searched each item in the Safeway app, and it had coupons I could apply for $1.5 to $5 off per item. The other week, my wife ran into the store to buy cream cheese. While she did that, I searched the item in the app, and “clipped” a $2.30 discount, so what would have been $5.30 to someone that didn’t use the app was $3.
I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.
These price obfuscation tactics are seen in many businesses, making price tracking very difficult.
Something I learned recently, which might help your scrapers, is the ability in Playwright to sniff the network calls made through the browser (basically, programmatic API to the Network tab of the browser).
The boost is that you let the website/webapp make its API calls and then the scraper works directly on that data, rather than waiting for the page to render DOM updates and scraping those.
This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.
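A minimal sketch of that sniffing idea with Playwright's Python API; the /api/products filter and URL are placeholders for whatever endpoint the site actually calls:

```python
from playwright.sync_api import sync_playwright

matched = []

def remember(response):
    # Programmatic "Network tab": keep only the API calls we care about.
    if "/api/products" in response.url:
        matched.append(response)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", remember)
    page.goto("https://www.example-supermarket.com/offers", wait_until="networkidle")

    # Read the bodies after the page has settled, then work on the data directly.
    payloads = [r.json() for r in matched if r.ok]
    browser.close()

print(f"captured {len(payloads)} product payloads")
```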
For my project (https://frankendash.com/), I also ran into issues with dynamically generated class names which change on every site update, so in the end I just went with saving a crop area from the website as an image and showing that.
We built a system for admins so they can match products from Site A with products from Site B.
The scraping part was not that hard. We used our product https://automatio.co/ where possible, and where we couldn't, we built some scrapers from scratch using simple cURL or Puppeteer.
Thanks for sharing your experience, especially since I haven't used Playwright before.
You just reminded me, it's probably still running today :-D
Good lesson on cloud economics. Below a certain threshold, you get a roughly linear performance gain from a more expensive instance type, so the spend stays essentially the same: a machine that costs twice as much per hour but finishes the workload in half the time produces the same bill, just with less waiting.
Nice blog post and very informative. Good to read that it costs you less than 70€ per year to run this and hope that the big supermarkets don’t block this somehow.
Have you thought of monetizing this? Perhaps with ads from the 3 big supermarkets you scrape ;-)
Someone showed me this a decade ago. The site had many obvious issues but it did list everything. If I remember correctly it was started to stop merchants from pricing things based on who is buying.
I forget which country it was.
Yep, AWS is hugely overrated and overpriced.
Wonder how the data from R2 is fed into the frontend?
Although it might be hard to do with messy data.
How would one scrape those? Anyone experienced?
Why did you pick Tailscale as the solution for proxy vs scraping with something like AWS Lambda?