Scrape Web Page Content with Node.js and Generate a Local PDF File
This project implements one requirement: given a web page address, crawl its content and output it as the PDF document we want. Note: a high-quality PDF document. Anyone familiar with Node.js and puppeteer crawlers can do this; please read this document carefully and execute each step in sequence.
Step 1: Install Node.js. The Chinese Node.js site http://nodejs.cn/download/ is recommended; download the package for your operating system.
Step 2: After downloading and installing Node.js, start the Windows command line tool (open the Windows system search, type cmd, and press Enter).
Step 3: Check whether the environment variables were configured automatically: enter node -v in the command line tool. If a version string such as v10.*** appears, Node.js was installed successfully.
Step 4: If node -v still prints no version string in Step 3, restart your computer and try again.
Step 5: Open this project's folder and open a command line tool there (on Windows, type cmd into the folder's address bar and press Enter), then run npm i cnpm nodemon -g.
Step 6: Download the puppeteer crawler package. After completing Step 5, run cnpm i puppeteer --save.
Step 7: After the download in Step 6 completes, open this project's url.js and replace the default address (http://nodejs.cn/) with the web page you want the crawler to fetch.
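The contents of url.js are not reproduced in this document; presumably it simply exports the target address. A minimal sketch, assuming the file exports a single URL string (the variable name is illustrative):

```javascript
// url.js — hypothetical sketch: export the page address the crawler should visit.
// Replace the default with any page you want to turn into a PDF.
const url = 'http://nodejs.cn/';

module.exports = url;
```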
Step 8: Enter nodemon index.js in the command line to crawl the page; the result is output automatically to the index.pdf file in the current folder.
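The project's index.js is not reproduced here, but the crawl-and-print flow it performs can be sketched with puppeteer's standard API (the function name and options below are illustrative, not the project's actual code):

```javascript
// Sketch of the crawl flow: open the page in headless Chromium, then print it to PDF.
async function crawlToPdf(url, outPath) {
  const puppeteer = require('puppeteer'); // required here so the sketch is self-contained
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until network traffic quiets down, so most page resources have loaded.
  await page.goto(url, { waitUntil: 'networkidle2' });
  // Print the rendered page to a PDF file on disk.
  await page.pdf({ path: outPath, format: 'A4', printBackground: true });
  await browser.close();
}

// Usage: crawlToPdf('http://nodejs.cn/', 'index.pdf');
```

Running this under nodemon, as the step describes, simply re-executes the script whenever a source file changes.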
TIPS: The design idea of this project is one web page, one index.pdf. Each time you crawl a page, copy index.pdf out first, then change the url address and crawl again to generate a new one. Of course, you can also crawl multiple pages in a loop to generate multiple PDF files.
For pages that lazy-load images, such as the JD.com homepage, some of the crawled content will still be in its loading state. Crawlers can also run into problems on sites with anti-crawler mechanisms, but the vast majority of websites work fine.
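For lazy-loading pages, one common workaround (not part of this project) is to scroll the page to the bottom before printing, so the deferred images are triggered to load. A sketch using puppeteer's page.evaluate, where page is a puppeteer Page object and the step size and delay are illustrative values:

```javascript
// Scroll a puppeteer page to the bottom in small steps to trigger lazy-loaded images.
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let scrolled = 0;
      const step = 200; // pixels per tick
      const timer = setInterval(() => {
        window.scrollBy(0, step);
        scrolled += step;
        // Stop once we have scrolled past the full document height.
        if (scrolled >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}

// Usage (inside the crawl flow, before page.pdf): await autoScroll(page);
```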