Tweaker Skit3000 has developed a site that lets Dutch shoppers compare supermarket prices. On checkjebon.nl, users can enter a shopping list or scan a receipt, and the site checks at which supermarket that list is cheapest.
Developer Sjoerd van den Hoorn, known on Tweakers as Skit3000, explains on GitHub how checkjebon.nl works. His website uses Tesseract OCR software to recognize text on receipts, followed by further processing to recognize product names. The site then tries to link a product to each product name entered. It first searches for exactly the same product name; if none is found, the algorithm broadens the search to product names in which all letters of the entry occur in the same order as entered. Those letters do not have to be adjacent, so several products can qualify.
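The two-step lookup described above can be sketched roughly as follows. This is an illustrative Python sketch, not the site's own JavaScript: the function names `is_subsequence` and `find_candidates` are hypothetical, and ignoring case and spaces in the entry is an assumption.

```python
def is_subsequence(query: str, name: str) -> bool:
    """True if all characters of query occur in name in the same order.

    The characters do not have to be adjacent. Case and whitespace in the
    query are ignored (an assumption, not confirmed by the source).
    """
    remaining = iter(name.lower())
    # `ch in remaining` advances the iterator, so order is enforced.
    return all(ch in remaining for ch in query.lower() if not ch.isspace())


def find_candidates(query: str, product_names: list[str]) -> list[str]:
    """Return exact matches if there are any, otherwise the broader matches."""
    wanted = query.strip().lower()
    exact = [name for name in product_names if name.lower() == wanted]
    if exact:
        return exact
    return [name for name in product_names if is_subsequence(wanted, name)]
```

With a list such as `["Halfvolle melk", "Campina halfvolle melk 1 l", "Volle yoghurt"]`, the entry `halfvolle melk` hits the exact match only, while `hv melk` falls through to the broader search and matches both milk products.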
If a volume or weight is provided, the algorithm filters out all products that contain less than the specified amount. If that leaves no results, the filter is undone. The algorithm then ranks the results by price, and only the cheapest matching item in each supermarket is shown as that supermarket's comparable option.
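A minimal Python sketch of this filter-and-rank step is given below. The `Product` fields and the function name `cheapest_per_store` are hypothetical, and the sketch assumes that amounts are already expressed in one common unit and that the filter is dropped when it empties the whole candidate list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    store: str
    name: str
    price: float   # price in euros
    amount: float  # content in one common unit, e.g. liters (assumption)


def cheapest_per_store(
    candidates: list[Product], min_amount: Optional[float] = None
) -> dict[str, Product]:
    """Pick the cheapest candidate per store, honoring a minimum amount if possible."""
    pool = candidates
    if min_amount is not None:
        filtered = [p for p in candidates if p.amount >= min_amount]
        if filtered:  # if the filter removes everything, it is undone
            pool = filtered
    best: dict[str, Product] = {}
    for product in sorted(pool, key=lambda p: p.price):
        # The first product seen for a store is its cheapest, since pool is sorted.
        best.setdefault(product.store, product)
    return best
```

Whether the real site undoes the filter per supermarket or for the whole result set is not stated; the sketch picks the simpler of the two.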
Skit3000 uses Node-RED for this, he tells Tweakers. “It is a handy tool for developing prototypes quickly, because you can stick all kinds of blocks of functionality together without first having to download all kinds of dependencies. The visual environment lets you easily inspect the output of nodes and essentially makes Node-RED to Node.js what Jupyter is to Python. With the Inject node, which lets you start a flow manually, you can also set things up with a few clicks so that the flow starts at set times, without having to create a cron job yourself. It’s also easy to temporarily disconnect two nodes, so that while you’re debugging you avoid polluting your database with test data. This is ideal for scraping prices. If a flow gets stuck somewhere during scraping, you can start it, inspect what goes wrong, make adjustments, test, and immediately put it back into production, all from one environment.”
“All scrapers start with an Inject node that is automatically triggered on a daily basis. This is followed by a template node containing JavaScript code that must be executed in a browser to navigate through the supermarket’s site. After this comes an nbrowser node which launches a headless browser, goes to the supermarket’s site and runs the script.”
“While this is going on, a subflow is running that checks every ten seconds a variable that the scrape script should change once it’s done, meanwhile showing the number of products scraped so far. If there’s an error, or the flow gets stuck in a loop without encountering new products, the subflow exits via the error output that I created and the main flow sends me an email so that I know I have to do something about it. If there are no errors and all pages have been scraped, the subflow exits via the ok output and the products are saved in a JSON file.”
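The monitoring loop in this quote could look something like the Python sketch below. It is not the actual Node-RED subflow: `watch_scraper`, the status tuple, and the `max_stalled_checks` threshold are all hypothetical, and only the ten-second interval comes from the quote.

```python
import time
from typing import Callable, Optional, Tuple

# (done, products_scraped, error_message): a hypothetical status shape.
Status = Tuple[bool, int, Optional[str]]


def watch_scraper(
    read_status: Callable[[], Status],
    interval_s: float = 10.0,        # the quote mentions a ten-second check
    max_stalled_checks: int = 6,     # assumed threshold, not from the source
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the scraper until it finishes; return "ok" or "error"."""
    last_count = -1
    stalled = 0
    while True:
        done, count, error = read_status()
        if error is not None:
            return "error"
        if done:
            return "ok"
        print(f"{count} products scraped so far")
        if count > last_count:
            last_count, stalled = count, 0
        else:
            stalled += 1  # no new products since the previous check
            if stalled >= max_stalled_checks:
                return "error"
        sleep(interval_s)
```

The caller would then route an `"error"` result to a notification step such as an email, and an `"ok"` result to the step that saves the products.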
“Finally, when all the scrapers are done, a flow runs that merges and cleans up the various JSON files. Because Checkjebon loads all data as one file and that file covers more than one hundred thousand products, I shorten the attribute names from ‘price’ to ‘p’, et cetera. I thought this was unnecessary, since everything is hosted through GitHub Pages and they use gzip compression, but apparently the token used for ‘price’ is longer than the eight bits needed for a single ‘p’. I’ve looked into compressing further by moving from a JSON object to a two-dimensional array, but that yielded only a small saving. It also took longer to read that back into a JSON object in the user’s browser.”
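To try the size comparison from this quote on your own data, a small Python sketch such as the one below may help. Only the `price` to `p` renaming comes from the quote; the other entries in `KEY_MAP` and the function names are assumptions made for illustration.

```python
import gzip
import json

# Only "price" -> "p" is mentioned in the article; the rest is hypothetical.
KEY_MAP = {"name": "n", "price": "p", "amount": "a"}


def shorten_keys(products: list[dict]) -> list[dict]:
    """Rename known attributes to their short form; unknown keys are kept."""
    return [{KEY_MAP.get(k, k): v for k, v in p.items()} for p in products]


def to_rows(products: list[dict], columns: list[str]) -> list[list]:
    """Two-dimensional array variant: one row per product, no keys at all."""
    return [[p.get(c) for c in columns] for p in products]


def gzipped_size(data) -> int:
    """Size in bytes of the compact JSON encoding after gzip compression."""
    raw = json.dumps(data, separators=(",", ":")).encode("utf-8")
    return len(gzip.compress(raw))
```

Calling `gzipped_size` on the original list, on `shorten_keys(...)`, and on `to_rows(...)` shows how much each variant saves after compression. No figures are given here, because the outcome depends on the data.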
Since both the scrapers and the website use JavaScript, Skit3000 says he can reuse pieces of code, such as the code that detects quantities and converts them to liters and kilograms. “If I had used a different language, I would have had to write this code twice.”
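As an illustration of what such quantity detection might involve, here is a minimal Python sketch. It is not the shared JavaScript code from the site: the unit table, the regular expression, and the name `parse_quantity` are assumptions.

```python
import re
from typing import Optional, Tuple

# Unit -> (base unit, divisor to reach the base unit). Hypothetical table.
UNITS = {
    "kg": ("kg", 1), "kilo": ("kg", 1), "gram": ("kg", 1000), "g": ("kg", 1000),
    "ml": ("l", 1000), "cl": ("l", 100), "dl": ("l", 10),
    "liter": ("l", 1), "l": ("l", 1),
}

# A number with an optional decimal comma or point, followed by a known unit.
QUANTITY = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(kg|kilo|gram|g|ml|cl|dl|liter|l)\b", re.IGNORECASE
)


def parse_quantity(text: str) -> Optional[Tuple[float, str]]:
    """Return (amount, unit) in liters or kilograms, or None if nothing is found."""
    match = QUANTITY.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))  # accept a Dutch decimal comma
    unit, divisor = UNITS[match.group(2).lower()]
    return value / divisor, unit
```

For example, `parse_quantity("Kaas 500 g")` yields `(0.5, "kg")`, and a name without a quantity yields `None`.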
Checkjebon.nl currently covers the product ranges of the Albert Heijn, Coop, Dirk, Hoogvliet, Jan Linders, Jumbo, Plus, Spar and Vomar supermarkets. Support for the ranges of Aldi and DekaMarkt is limited, because these chains do not publish all of their prices online.
The developer says all data is processed locally. According to him, processing takes place in the browser, and the results are neither stored by the website nor sold to advertisers.