blg.tch.re – Use AI to extract structured data from scraped pages (Apr. 03, 2024)

Use AI to extract structured data from scraped pages

Disclaimer: this is a personal note about something I want to explore in the future.

I recently saw two interesting projects on HN that could be used to extract data from scraped web pages.

LaVague

LaVague is a project written in Python. It automates a browser by executing instructions written in natural language.

It uses a local LLM with a Selenium integration. A Playwright integration is in progress: https://github.com/lavague-ai/LaVague/pull/76

LaVague executes the instructions produced by the LLM and generates a Selenium script, so the operation can easily be replayed without LaVague itself. In other words, LaVague is a way to build a scraping script.
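To make the idea concrete, here is a minimal sketch of what such a standalone replay script could look like. This is not actual LaVague output, only an illustration: the CSS selector `a.article-link` and the function names `extract_links` and `scrape` are hypothetical, and it assumes Selenium 4 with a Chrome driver available. The point is that once the script exists, running it needs a browser but no LLM.

```python
# Hypothetical sketch of a standalone replay script (not actual LaVague output).
# Assumes Selenium 4 is installed and a Chrome driver is available.

CSS_SELECTOR = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def extract_links(driver, selector="a.article-link"):
    """Return a list of {"title", "url"} dicts for every element matching selector."""
    rows = []
    for element in driver.find_elements(CSS_SELECTOR, selector):
        rows.append({
            "title": element.text.strip(),
            "url": element.get_attribute("href"),
        })
    return rows


def scrape(url):
    """Open url in a real browser and extract the links; no LLM is involved."""
    # Imported here so extract_links can be tried without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return extract_links(driver)
    finally:
        driver.quit()
```

Calling `scrape(...)` with the URL of a listing page would return the structured rows. The extraction logic is kept separate from the browser setup so it can be checked against a fake driver.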

Related HN discussion: https://news.ycombinator.com/item?id=39698546

Skyvern

Skyvern is also a project written in Python. It uses computer vision plus an LLM to execute instructions, and it drives the browser through Playwright.

Skyvern does not generate a scraping script, so it must be used every time a scrape is executed.

Moreover, Skyvern seems to use the ChatGPT API only, so each scrape generates ChatGPT requests and the final scraping cost could be very high. Version 0.1.1 introduces LiteLLM, which makes it possible to use any LLM provider with Skyvern.
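A rough back-of-the-envelope estimate shows why the per-scrape cost matters. The sketch below is only an illustration: the function `estimate_scrape_cost` is hypothetical, and every figure (number of LLM calls per page, tokens per call, price per 1,000 tokens) is a made-up placeholder, not a real Skyvern or OpenAI number.

```python
def estimate_scrape_cost(pages, llm_calls_per_page, tokens_per_call, usd_per_1k_tokens):
    """Rough cost in USD of scraping `pages` pages when each page triggers LLM calls."""
    total_tokens = pages * llm_calls_per_page * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens


# All figures below are made-up placeholders, not measured values.
cost = estimate_scrape_cost(
    pages=1_000,
    llm_calls_per_page=5,
    tokens_per_call=2_000,
    usd_per_1k_tokens=0.01,
)
print(f"Estimated cost: ${cost:.2f}")  # prints "Estimated cost: $100.00"
```

The cost grows linearly with the number of pages, and it is paid again on every run, whereas a generated script pays the LLM cost only once, when the script is built.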

Skyvern seems to be a good candidate for solving CAPTCHAs.

Related HN discussion: https://news.ycombinator.com/item?id=39706004