Tracking supermarket prices with Playwright
In December 2022, the author created a price tracking website for Greek supermarkets, utilizing Playwright for scraping, cloud services for automation, and Tailscale to bypass IP restrictions, optimizing for efficiency.
In December 2022, amid rising inflation, the author developed a website to track price changes in Greece's three largest supermarkets. The project faced challenges, particularly with scraping JavaScript-rendered sites, which required the use of Playwright, a tool that allows for browser automation. The author initially attempted to run the scraping on an old laptop but found it insufficient due to performance issues. Switching to cloud services, the author opted for Hetzner, which offered a more cost-effective solution compared to AWS. The scraping process was automated to run daily, utilizing a CI server on the old laptop to manage tasks on the more powerful cloud server. To bypass IP restrictions imposed by one supermarket, the author implemented Tailscale, allowing requests to appear as if they originated from a residential IP. Over time, the setup proved reliable, although it faced challenges from website changes that could disrupt scraping accuracy. The author optimized the process by upgrading server specifications and reducing data fetched during scraping, which improved efficiency and reduced costs. The overall monthly expenses for the scraping operation remained low, primarily due to the economical cloud service and the free tier of data storage used.
- The author built a price tracking website for supermarkets in Greece using Playwright for scraping.
- Initial attempts to scrape using an old laptop were unsuccessful due to performance limitations.
- The scraping process was automated and run on a cost-effective cloud server from Hetzner.
- Tailscale was used to circumvent IP restrictions from one supermarket.
- The setup has been optimized for efficiency and cost-effectiveness over time.
Related
How I scraped 6 years of Reddit posts in JSON
The article covers scraping 6 years of Reddit posts for self-promotion data, highlighting challenges like post limits and cutoffs. Pushshift is suggested for Reddit archives. Extracting URLs and checking website status are explained. Findings reveal 40% of sites inactive. Trends in online startups are discussed.
Evaluating a Decade of Hacker News Predictions: An Open-Source Approach
The blog post evaluates a decade of Hacker News predictions using LLMs and ClickHouse. Results show a 50% success rate, highlighting challenges in prediction nuances. Future plans include expanding the project. Website: https://hn-predictions.eamag.me/.
Storing Scraped Data in an SQLite Database on GitHub
The article explains Git scraping, saving data to a Git repository with GitHub Actions. Benefits include historical tracking and using SQLite for storage. Limitations and Datasette for data visualization are discussed.
How to save $13.27 on your SaaS bill
The author discusses managing costs with Vercel's analytics, converting images to reduce charges, and building a custom API using SQLite. They faced deployment challenges but plan future enhancements.
Archiving and Syndicating Mastodon Posts
The article details archiving Mastodon posts to a personal website using the PESOS model, emphasizing online presence, automation, and content organization through a custom tool developed in Go.
- Many commenters have created similar price tracking websites, sharing insights on the technical challenges of scraping and data management.
- Common issues include dealing with changing website structures, anti-scraping measures, and the complexities of accurately matching products across different retailers.
- Several users emphasize the importance of using advanced tools like Playwright and cloud services for effective scraping.
- There is a call for greater price transparency and the potential for collaborative efforts in data scraping.
- Some users express concerns about the ethical implications of price tracking and the impact of AI on pricing strategies.
At the time I wrote it I thought nobody else was doing it, but now I know of at least three startups doing the same in NZ. It seems the inflation really stoked a lot of innovation here. The patterns are about what you'd expect: supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment on brand loyalty vs. price-sensitive people; there might be three popular brands of chocolate and every week only one of them will be sold at a fair price.
One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
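As an illustration of that regex approach, here is a minimal, hypothetical normaliser; the patterns and the example product name are invented, and real catalogues would still need manual mapping on top:

```python
import re

# Hypothetical normaliser in the spirit described above: strip branding noise,
# unify units, and lowercase, so that "COCA-COLA Zero Sugar 1,5 LT" and
# "Coca Cola zero sugar 1.5l" land on the same key.
UNIT_PATTERNS = [
    (re.compile(r"(\d+)[.,](\d+)\s*(l|lt|ltr|liter|litre)s?\b", re.I), r"\1.\2l"),
    (re.compile(r"(\d+)\s*(g|gr|gram)s?\b", re.I), r"\1g"),
]

def normalise(name: str) -> str:
    name = name.lower()
    name = re.sub(r"[^\w\s.,]", " ", name)          # drop punctuation and symbols
    for pattern, repl in UNIT_PATTERNS:
        name = pattern.sub(repl, name)
    return re.sub(r"\s+", " ", name).strip()        # collapse whitespace

print(normalise("COCA-COLA Zero Sugar 1,5 LT"))      # "coca cola zero sugar 1.5l"
```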
I've found that building the scrapers and infrastructure is the comparatively easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether that's because my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.
A fun project, but challenging at times, and annoying problems to fix.
> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
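A minimal sketch of that split with Playwright's Python API, assuming the raw HTML is archived first and parsed in a separate step; the URL, selectors, and BeautifulSoup are placeholder choices, not the commenter's actual stack:

```python
from datetime import date
from pathlib import Path

from bs4 import BeautifulSoup                 # hypothetical parser choice
from playwright.sync_api import sync_playwright

RAW_DIR = Path("raw") / date.today().isoformat()
RAW_DIR.mkdir(parents=True, exist_ok=True)

def fetch(url: str, name: str) -> None:
    """Step 1: scrape. Save the rendered HTML verbatim; no parsing here."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        (RAW_DIR / f"{name}.html").write_text(page.content())
        browser.close()

def parse(name: str) -> list[dict]:
    """Step 2: parse. Runs against the archived file, so parser fixes can be replayed later."""
    soup = BeautifulSoup((RAW_DIR / f"{name}.html").read_text(), "html.parser")
    return [
        {"name": item.select_one(".name").get_text(strip=True),
         "price": item.select_one(".price").get_text(strip=True)}
        for item in soup.select(".product")   # placeholder selectors
    ]
```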
I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/
Most of the scraping in my project is done by making simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.
The main challenge I have been working on is linking products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...
It works for the most part, as long as at least one correct barcode number is provided for a product.
You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
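A rough sketch of such a gate using the thresholds mentioned above; the data shapes are assumptions:

```python
def safe_to_sync(old: dict[str, float], new: dict[str, float]) -> bool:
    """Refuse to publish a scrape that looks wrong rather than silently syncing it."""
    # No price should move by more than 100% between consecutive scrapes.
    for product, price in new.items():
        prev = old.get(product)
        if prev and abs(price - prev) / prev > 1.0:
            return False
    # The number of active products shouldn't swing by more than 20%.
    if old and abs(len(new) - len(old)) / len(old) > 0.2:
        return False
    return True

# Usage: only overwrite yesterday's snapshot if the new one passes.
# if safe_to_sync(yesterday_prices, today_prices): publish(today_prices)
```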
You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you can't just grab that API response.
Even then, MITM interception of the mobile app (to see the network requests and data) will also get blocked without decent cover-ups.
I tried, but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket price comparison services just have (cheap labour) people scrape it.
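For reference, pointing Playwright at a proxy is a launch option; the endpoint and credentials below are placeholders for whichever residential proxy provider is used:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "http://proxy.example.com:8000",  # hypothetical rotating endpoint
            "username": "user",
            "password": "secret",
        }
    )
    page = browser.new_page()
    page.goto("https://www.example-supermarket.com/")    # placeholder URL
    print(page.title())
    browser.close()
```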
For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).
On that note, it seems you are tracking price, but are you also checking the cost per gram (or ounce)? A manufacturer or store could keep the price the same but offer less to the consumer. Wonder if your tool would catch this.
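A tiny illustration of why unit price catches what shelf price misses; the figures are invented:

```python
def unit_price(price: float, grams: float) -> float:
    """Price per 100 g, the number that actually exposes shrinkflation."""
    return price / grams * 100

# Same shelf price, smaller pack: a shelf-price tracker sees no change,
# a unit-price tracker sees a 10% increase.
before = unit_price(3.50, 550)   # ~0.64 per 100 g
after = unit_price(3.50, 500)    # 0.70 per 100 g
print(f"{after / before - 1:.1%}")  # 10.0%
```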
I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintaining over the years. We are scraping over 30 million prices daily.
In Europe, that would probably be Aldi/Lidl.
In the U.S., maybe Costco/Trader Joe's.
For online, CamelCamelCamel/Amazon (for health/beauty/some electronics, but not food).
If you can buy direct from the manufacturer, sometimes that's even better. For example, I got a particular brand of soap I love at the soap's wholesaler site in bulk for less than half the retail price. For shampoo, buying the gallon size direct was way cheaper than buying from any retailer.
The flip side to this is that very often you find that the data populating the site is in a very simple JSON format to facilitate easy rendering, ironically making the scraping process a lot more reliable.
I really wish supermarkets were mandated to post this information whenever the price of a particular SKU updated.
The tools that could be built with such information would do amazing things for consumers.
What's the thought process behind using a CI server (which I thought was mainly for builds) for what is essentially a data pipeline?
One difference: I'm recording each scraping session as a HAR file (for proving provenance). mitmproxy (mitmdump) is invaluable for that.
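As an aside, Playwright itself can also write a HAR per browser context via record_har_path, which may be enough when mitmproxy isn't in the loop; a minimal sketch:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # The HAR file is written out when the context is closed.
    context = browser.new_context(record_har_path="session.har")
    page = context.new_page()
    page.goto("https://www.example-supermarket.com/")  # placeholder URL
    context.close()
    browser.close()
```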
For example, I recently purchased fruit and dairy at Safeway in the western US, and after I had everything I wanted, I searched each item in the Safeway app, and it had coupons I could apply for $1.5 to $5 off per item. The other week, my wife ran into the store to buy cream cheese. While she did that, I searched the item in the app, and “clipped” a $2.30 discount, so what would have been $5.30 to someone that didn’t use the app was $3.
I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.
These price obfuscation tactics are seen in many businesses, making price tracking very difficult.
Something I learned recently, which might help your scrapers, is the ability in Playwright to sniff the network calls made through the browser (basically, programmatic API to the Network tab of the browser).
The boost is that you let the website/webapp make its API calls and then the scraper works directly on that data, rather than waiting for the page to render DOM updates and scraping those.
This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.
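A minimal sketch of that sniffing idea with Playwright's Python API; the /api/products filter and URL are placeholders for whatever endpoint the site actually calls:

```python
from playwright.sync_api import sync_playwright

matched = []

def remember(response):
    # Programmatic "Network tab": keep only the API calls we care about.
    if "/api/products" in response.url:
        matched.append(response)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", remember)
    page.goto("https://www.example-supermarket.com/offers", wait_until="networkidle")

    # Read the bodies after the page has settled, then work on the data directly.
    payloads = [r.json() for r in matched if r.ok]
    browser.close()

print(f"captured {len(payloads)} product payloads")
```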
For my project (https://frankendash.com/), I also ran into issues with dynamically generated class names which change on every site update, so in the end I just went with saving a crop area from the website as an image and showing that.
We built a system for admins so they can match products from Site A with products from Site B.
The scraping part was not that hard. We used our product https://automatio.co/ where possible, and where we couldn't, we built some scrapers from scratch using simple cURL or Puppeteer.
Thanks for sharing your experience, especially since I haven't used Playwright before.
You just reminded me, it's probably still running today :-D
Good lesson on cloud economics. Below a certain threshold, you get a roughly linear performance gain from a more expensive instance type, so the spend stays essentially the same: a machine that costs twice as much per hour but finishes the workload in half the time produces the same bill, just with less waiting.
Nice blog post and very informative. Good to read that it costs you less than 70€ per year to run this and hope that the big supermarkets don’t block this somehow.
Have you thought of monetizing this? Perhaps with ads from the 3 big supermarkets you scrape ;-)
Someone showed me this a decade ago. The site had many obvious issues but it did list everything. If I remember correctly it was started to stop merchants from pricing things based on who is buying.
I forget which country it was.
Yep, AWS is hugely overrated and overpriced.
Wonder how the data from R2 is fed into the frontend?
Although it might be hard to do with messy data.
How would one scrape those? Anyone experienced?
Why did you pick Tailscale as the solution for proxy vs scraping with something like AWS Lambda?