


# Headless Chrome Crawler

API | Examples | Tips | Code of Conduct | Contributing | Changelog

Distributed crawler powered by Headless Chrome.

## Features

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React, and Vue.js.

Powered by Headless Chrome, the crawler provides simple APIs to crawl these dynamic websites with the following features:

* Support both depth-first search and breadth-first search algorithms
* Support CSV and JSON Lines for exporting results
* Pause at the max request and resume at any time
* Insert jQuery automatically for scraping
* Save screenshots for the crawling evidence

## Usage

```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  // Launch the crawler, queue URLs, then close (see the API reference).
})();
```

## Examples

The examples can be run from the root folder as follows:

```sh
NODE_PATH=./ node examples/priority-queue.js
```

## API reference

## FAQ

### How is this different from other crawlers?

Static crawlers are based on simple requests to HTML files. They are generally fast, but fail to scrape the contents when the HTML changes dynamically in the browser. Dynamic crawlers based on PhantomJS and Selenium can handle such dynamic applications.

## Changelog

### Fixed

* Fix: 🔒 update jQuery to fix XSS vulnerability
* Fix: 🔒 update puppeteer to fix Use After Free vulnerability (Closes #350)
* Fix: 🔒 update jquery and lodash to fix Prototype Pollution vulnerability
* Fix: 🔒 high-severity lodash vulnerability (Closes #339)
* Fix crawler pending indefinitely when mixed content is present (Closes #260)
* Fix crawler failure to follow urls with `#` hashes in them (Closes #332)
* Fix `crawler.response` returning `null` when connecting to a specific Chrome instance (Closes #354)

### Changed

* Merge pull request #374 from kulikalov/update-dependencies
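The `examples/priority-queue.js` example relies on queued requests being processed in priority order: the highest-priority URL is dequeued first. A minimal, self-contained sketch of that idea (the `PriorityQueue` class below is hypothetical and for illustration only; it is not the library's actual implementation):

```javascript
// Illustrative sketch only, not headless-chrome-crawler's internal queue.
// Higher-priority requests are dequeued before lower-priority ones.
class PriorityQueue {
  constructor() {
    this.entries = [];
  }
  push(item, priority) {
    this.entries.push({ item, priority });
    // Keep the highest priority at the front of the array.
    this.entries.sort((a, b) => b.priority - a.priority);
  }
  pop() {
    const entry = this.entries.shift();
    return entry ? entry.item : null;
  }
}

const queue = new PriorityQueue();
queue.push('https://example.com/low', 0);
queue.push('https://example.com/high', 2);
queue.push('https://example.com/mid', 1);
console.log(queue.pop()); // the highest-priority URL comes out first
```

A real crawler queue would also deduplicate URLs and persist entries (e.g. to a cache storage) so the crawl can pause and resume, but the ordering rule is the same.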

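The claim above that simple-request crawlers capture empty bodies on sites built with modern frontend frameworks can be seen from the raw HTML shell such an app serves before any JavaScript runs. The markup below is a made-up example of such a shell:

```javascript
// Hypothetical raw HTML shell served by a single-page application.
// All visible content is rendered later by /bundle.js in the browser.
const html = [
  '<html>',
  '  <head><title>My App</title></head>',
  '  <body>',
  '    <div id="app"></div>',
  '    <script src="/bundle.js"></script>',
  '  </body>',
  '</html>',
].join('\n');

// A static crawler sees only this markup, so the content container is empty.
const match = html.match(/<div id="app">([\s\S]*?)<\/div>/);
const appContent = match[1].trim();
console.log(appContent === '' ? 'empty body captured' : appContent);
```

A headless browser, by contrast, executes `/bundle.js` and waits for the page to render, so the scraped `#app` element contains the real content.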