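1. Preface

    The overall idea: use a multithreaded crawler (mutiSpider) to fetch the blogs recommended on the cnblogs home page, then, for each user name, fetch that user's most-viewed ranking (TopViewPosts), most-commented ranking (TopFeedbackPosts), and most-recommended ranking (TopDiggPosts). The data is then processed (identical categories are merged) and sorted (the most-viewed ranking is used as the example here) to find the most-read articles. A word cloud (wordcloud) image is generated from the result and finally emailed to myself. (If you are interested, you can deploy this on a server!)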

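  1.1 Reference links:

   Expert's blog: https://www.cnblogs.com/lovesoo/p/7780957.html (I recommend reading this first; my work improves on and extends that post)

   Word cloud download: https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud (I downloaded wordcloud-1.5.0-cp36-cp36m-win32.whl)

   Sending email: https://www.runoob.com/python/python-email.html (the Runoob tutorial, recommended)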

 1.2 Final result:

 

 

 

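2. Environment setup:

  Python 3.6.5 (this corresponds to cp36; it is worth remembering, because you will need it whenever you download a whl file)

  PyCharm + a QQ Mail authorization code + wordcloud-1.5.0-cp36-cp36m-win32.whl

  Windows 10, 64-bit (my system is 64-bit, but the win_amd64.whl build of wordcloud was rejected as incompatible while the win32.whl build installed fine, most likely because the installed Python itself is a 32-bit build)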
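Which wheel file fits depends on the interpreter rather than on the operating system. The following is a minimal sketch, using only the standard library, of how one might check the CPython tag and the bitness of the running interpreter before downloading a whl file. The helper name `wheel_tags` is made up for this illustration.

```python
import struct
import sys


def wheel_tags():
    """Return the CPython tag (e.g. 'cp36') and the interpreter's pointer width in bits."""
    python_tag = "cp{0}{1}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8  # size of a C pointer, in bits
    return python_tag, bits


if __name__ == "__main__":
    tag, bits = wheel_tags()
    # On Windows, a 32-bit interpreter needs a win32 wheel and a 64-bit one a win_amd64 wheel.
    print(tag, "{0}-bit".format(bits))
```

If this reports 32-bit on a 64-bit Windows machine, that would explain why only the win32 wheel installs.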

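2.0  What the reader needs to provide

  1. The image for the word cloud (mine is avatar.jpg) and a font file from your computer (see the View_wordcloud function for details).

  2. The SMTP authorization code of your mailbox (the authorization code is used as the password).

  3. All code, images, and so on are assumed to be in the same folder.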

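2.1 Libraries to import (word cloud + email + crawler)

  Notes: 1. requests and BeautifulSoup are needed for the crawler; wordcloud and jieba for the word cloud; smtplib and email for sending mail. Everything else is basic Python.

    2. Installing wordcloud often fails. Download the whl file from the word cloud download link in section 1.1, then run pip install on it in a local cmd window.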

 

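3. Writing the crawler

3.1 Recommended blogs on the cnblogs home page

  In the browser's developer tools, select XHR and find https://www.cnblogs.com/aggsite/UserStats. It can be fetched directly with requests; the response is HTML.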

 

 

#coding:utf-8
import requests

r=requests.get('https://www.cnblogs.com/aggsite/UserStats')
print(r.text)

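  The data then needs some basic processing. One option is to parse the HTML with Beautiful Soup; the other is to filter the content with a regular expression.

  With BeautifulSoup we use the CSS selector method .select to find all a elements under #blogger_list > ul > li, and then clean up the result by dropping the "More recommended blogs" and "Blog list (by score)" links.

The regular expression approach works the same way: we first build a pattern that matches the links, use re.findall to find all of them, and then clean up the result by dropping the "More recommended blogs" and "Blog list (by score)" links.

That completes the first step: we have the list of blogs recommended on the home page.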


#coding:utf-8
import requests
import re
import json
from bs4 import BeautifulSoup

# Fetch the list of recommended blogs
r = requests.get('https://www.cnblogs.com/aggsite/UserStats')

# Parse with BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')
users = [(i.text, i['href']) for i in soup.select('#blogger_list > ul > li > a') if 'AllBloggers.aspx' not in i['href'] and 'expert' not in i['href']]
print(json.dumps(users, ensure_ascii=False))

# Alternatively, use a regular expression
user_re = re.compile('<a href="(http://www.cnblogs.com/.+)" target="_blank">(.+)</a>')
users = [(name, url) for url, name in re.findall(user_re, r.text) if 'AllBloggers.aspx' not in url and 'expert' not in url]
print(json.dumps(users, ensure_ascii=False))
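To try the filtering logic without touching the network, here is a small offline sketch. The sample HTML is invented to imitate the structure described above and is not real cnblogs output. The pattern also differs slightly from the one above: it uses `[^"]+` and `[^<]+` instead of the greedy `.+`, so that several links on one line cannot be merged into a single match.

```python
import re

# Hypothetical sample imitating the structure described above (not real data).
SAMPLE_HTML = '''
<div id="blogger_list"><ul>
<li><a href="http://www.cnblogs.com/alice/" target="_blank">Alice</a></li>
<li><a href="http://www.cnblogs.com/bob/" target="_blank">Bob</a></li>
<li><a href="http://www.cnblogs.com/AllBloggers.aspx" target="_blank">Blog list (by score)</a></li>
<li><a href="http://www.cnblogs.com/expert/" target="_blank">More recommended blogs</a></li>
</ul></div>
'''

# Group 1 is the blog URL, group 2 the link text (the user's display name).
USER_RE = re.compile(r'<a href="(http://www\.cnblogs\.com/[^"]+)" target="_blank">([^<]+)</a>')


def extract_users(html):
    """Return (name, url) pairs, skipping the two navigation links."""
    return [(name, url) for url, name in USER_RE.findall(html)
            if 'AllBloggers.aspx' not in url and 'expert' not in url]


if __name__ == '__main__':
    print(extract_users(SAMPLE_HTML))
```

On the sample above this keeps only the Alice and Bob entries.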


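  Now that we have the recommended users' blogs, the next step is to open one user's blog and find the sidecolumn.aspx endpoint, which returns the information we need: the post categories. Click Headers to inspect the call. It is also a GET request, the path contains the blog user name, and it takes the parameter blogApp=<user name>. The request URL shown in Headers is:

https://www.cnblogs.com/meditation5201314/mvc/blog/sidecolumn.aspx?blogApp=meditation5201314, so a plain requests call is enough.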

   

 

   

 

#coding:utf-8
import requests

user='meditation5201314'
url = 'http://www.cnblogs.com/{0}/mvc/blog/sidecolumn.aspx'.format(user)
blogApp = user
payload = dict(blogApp=blogApp)
r = requests.get(url, params=payload)
print(r.text)
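requests builds the query string from `params` on its own. The sketch below reproduces the resulting URL with the standard library only, and also shows the reverse step of recovering the user name from a blog URL. The full script in this post does the latter with `url.split('/')[-2]`, which relies on the trailing slash; the helper names here are made up for this illustration.

```python
from urllib.parse import urlencode, urlsplit


def sidecolumn_url(user):
    """Build the sidecolumn.aspx URL for a blog user, including the blogApp parameter."""
    base = 'https://www.cnblogs.com/{0}/mvc/blog/sidecolumn.aspx'.format(user)
    return base + '?' + urlencode({'blogApp': user})


def user_from_blog_url(url):
    """Return the first path segment of a blog URL, with or without a trailing slash."""
    parts = [p for p in urlsplit(url).path.split('/') if p]
    return parts[0] if parts else None


if __name__ == '__main__':
    print(sidecolumn_url('meditation5201314'))
    print(user_from_blog_url('https://www.cnblogs.com/meditation5201314/'))
```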

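  At this point we can get a blog's categories and the number of posts in each. I will not walk through the other two parts. There are three features in total: fetching a user's most-viewed ranking (TopViewPosts), most-commented ranking (TopFeedbackPosts), and most-recommended ranking (TopDiggPosts). See the recommended blog post in section 1.1 for details; the multithreaded crawler code is there as well and is fairly simple. After that, the data is sorted. See the code below.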
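As a side note on the ranking data: judging from the regular expression used in the full script, each ranking entry appears to be text of the form "N. title(count)". That format is an assumption inferred from the pattern, not something verified against the live endpoint. A minimal sketch that parses one entry and returns None instead of raising when the text does not match:

```python
import re

# Same pattern as post_re in the full script: rank, title, then a count in parentheses.
POST_RE = re.compile(r'\d+\. (.+)\((\d+)\)')


def parse_ranking_entry(text):
    """Return (title, count) for a ranking entry, or None if the text does not match."""
    match = POST_RE.search(text)
    if match is None:
        return None
    title, count = match.groups()
    return title, int(count)


if __name__ == '__main__':
    print(parse_ranking_entry('1. Python crawler notes(1234)'))
    print(parse_ranking_entry('no count here'))
```

The None check matters because calling `.groups()` directly on a failed `re.search` raises AttributeError.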

Full code so far:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/5/7 21:37
# @Author : Empirefree
# @File : __init__.py.py
# @Software: PyCharm Community Edition

import requests
import re
import json
from bs4 import BeautifulSoup
from concurrent import futures
from wordcloud import WordCloud
import jieba
import os
from os import path
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart


def Cnblog_getUsers():
    r = requests.get('https://www.cnblogs.com/aggsite/UserStats')
    # Parse the recommended blogs with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    users = [(i.text, i['href']) for i in soup.select('#blogger_list > ul > li > a') if
             'AllBloggers.aspx' not in i['href'] and 'expert' not in i['href']]
    # print(json.dumps(users, ensure_ascii=False))
    return users


def My_Blog_Category(user):
    myusers = user
    category_re = re.compile(r'(.+)\((\d+)\)')
    url = 'https://www.cnblogs.com/{0}/mvc/blog/sidecolumn.aspx'.format(myusers)
    blogApp = myusers
    payload = dict(blogApp=blogApp)
    r = requests.get(url, params=payload)
    # Parse the category sidebar with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    category = [re.search(category_re, i.text).groups() for i in soup.select('.catListPostCategory > ul > li') if
                re.search(category_re, i.text)]
    # print(json.dumps(category, ensure_ascii=False))
    return dict(category=category)


def getPostsDetail(Posts):
    # Extract post details: title, count, URL
    post_re = re.compile(r'\d+\. (.+)\((\d+)\)')
    soup = BeautifulSoup(Posts, 'lxml')
    return [list(re.search(post_re, i.text).groups()) + [i['href']] for i in soup.find_all('a')]


def My_Blog_Detail(user):
    url = 'http://www.cnblogs.com/mvc/Blog/GetBlogSideBlocks.aspx'
    blogApp = user
    showFlag = 'ShowRecentComment, ShowTopViewPosts, ShowTopFeedbackPosts, ShowTopDiggPosts'
    payload = dict(blogApp=blogApp, showFlag=showFlag)
    r = requests.get(url, params=payload)

    print(json.dumps(r.json(), ensure_ascii=False))
    # Recent comments (the data looks slightly different), most-viewed, most-commented, most-recommended
    TopViewPosts = getPostsDetail(r.json()['TopViewPosts'])
    TopFeedbackPosts = getPostsDetail(r.json()['TopFeedbackPosts'])
    TopDiggPosts = getPostsDetail(r.json()['TopDiggPosts'])
    # print(json.dumps(dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts),ensure_ascii=False))
    return dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts)


def My_Blog_getTotal(url):
    # Fetch everything for one blog: categories and rankings
    # Derive the blog user name from the URL
    print('Spider blog:\t{0}'.format(url))
    user = url.split('/')[-2]
    print(user)
    return dict(My_Blog_Detail(user), **My_Blog_Category(user))


def mutiSpider(max_workers=4):
    try:
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:  # multithreading
        # with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:  # multiprocessing
            for blog in executor.map(My_Blog_getTotal, [i[1] for i in users]):
                blogs.append(blog)
    except Exception as e:
        print(e)


def countCategory(category, category_name):
    # Sum the post counts of all categories with the same name
    n = 0
    for name, count in category:
        if name.lower() == category_name:
            n += int(count)
    return n


if __name__ == '__main__':
    # Cnblog_getUsers()
    # user = 'meditation5201314'
    # My_Blog_Category(user)
    # My_Blog_Detail(user)
    print(os.path.dirname(os.path.realpath(__file__)))
    bmppath = os.path.dirname(os.path.realpath(__file__))
    blogs = []

    # Fetch the list of recommended blogs
    users = Cnblog_getUsers()
    # print(users)
    # print(json.dumps(users, ensure_ascii=False))

    # Fetch the blog data with multiple threads/processes
    mutiSpider()
    # print(json.dumps(blogs,ensure_ascii=False))

    # Collect all category entries
    category = [category for blog in blogs if blog['category'] for category in blog['category']]

    # Merge identical categories
    new_category = {}
    for name, count in category:
        # Convert everything to lower case
        name = name.lower()
        if name not in new_category:
            new_category[name] = countCategory(category, name)
    # sorted() returns a new list, so the result has to be kept
    sorted_category = sorted(new_category.items(), key=lambda i: int(i[1]), reverse=True)
    print(sorted_category)
    TopViewPosts =   # the right-hand side is missing here; it should collect the 'TopViewPosts' entries of every blog in blogs
    TopViewPosts = sorted(TopViewPosts, key=lambda i: int(i[1]), reverse=True)
    print(TopViewPosts)
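The main flow above (crawl in a thread pool, merge categories, sort the most-viewed posts) can be exercised offline. The sketch below is not the author's code: `fake_fetch` and `FAKE_BLOGS` are hypothetical stand-ins for My_Blog_getTotal and real responses, `collections.Counter` replaces the countCategory loop, and flattening each blog's 'TopViewPosts' list is an assumption about what the TopViewPosts assignment in the listing is meant to do.

```python
from collections import Counter
from concurrent import futures

# Hypothetical canned data in the shape returned by My_Blog_getTotal (no network access).
FAKE_BLOGS = {
    'https://www.cnblogs.com/alice/': {
        'category': [('Python', '3'), ('Linux', '2')],
        'TopViewPosts': [['Intro to crawlers', '120', 'https://www.cnblogs.com/alice/p/1.html']],
    },
    'https://www.cnblogs.com/bob/': {
        'category': [('python', '5')],
        'TopViewPosts': [['Word clouds', '300', 'https://www.cnblogs.com/bob/p/2.html']],
    },
}


def fake_fetch(url):
    """Stand-in for My_Blog_getTotal: look up canned data instead of crawling."""
    return FAKE_BLOGS[url]


def crawl(urls, fetch, max_workers=4):
    """Run fetch over urls in a thread pool; results come back in input order."""
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


def merge_categories(blogs):
    """Sum post counts per lower-cased category name, largest first."""
    totals = Counter()
    for blog in blogs:
        for name, count in blog['category']:
            totals[name.lower()] += int(count)
    return totals.most_common()


def top_view_posts(blogs):
    """Flatten every blog's TopViewPosts and sort by view count, descending."""
    posts = [post for blog in blogs for post in blog['TopViewPosts']]
    return sorted(posts, key=lambda post: int(post[1]), reverse=True)


if __name__ == '__main__':
    blogs = crawl(list(FAKE_BLOGS), fake_fetch)
    print(merge_categories(blogs))
    print(top_view_posts(blogs))
```

Returning the results from `crawl` instead of appending to a module-level list keeps the function usable from other modules and easier to test.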


  

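 4. Generating the word cloud

  Here we process the content of the recommended blogs (a list). You can search online for the details of using word clouds; in short, a picture is generated from a given image and text by combining the two. font_path points to a font on my own machine. Search your C: drive for one, since it will not be the same on every computer.

  Note on installing wordcloud: this part is tricky. Installing it from within PyCharm did not work for me, so I first downloaded the whl file and then ran the following in cmd:

pip install  wordcloud-1.5.0-cp36-cp36m-win32.whl

  I then copied the resulting folder into PyCharm's venv/Lib/site-packages/, and after that it worked (I personally recommend this approach; it has never failed me!)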

 

def View_wordcloud(TopViewPosts):
    ## Generate the word cloud
    # Join the post titles into one long text
    contents = ' '.join([i[0] for i in TopViewPosts])
    # Segment the Chinese text with jieba
    cut_texts = ' '.join(jieba.cut(contents))
    # Font (a Microsoft YaHei file here), at most 2000 words, white background, image 1000 wide and 667 high
    cloud = WordCloud(font_path='C:\\Windows\\WinSxS\\amd64_microsoft-windows-b..core-fonts-chs-boot_31bf3856ad364e35_10.0.17134.1_none_ba644a56789f974c\\msyh_boot.ttf', max_words=2000, background_color="white", width=1000,
                      height=667, margin=2)
    # Generate the word cloud (generate() returns the same WordCloud object)
    wordcloud = cloud.generate(cut_texts)
    # Save the image
    file_name = 'avatar'
    wordcloud.to_file('{0}.jpg'.format(file_name))
    # Show the image
    wordcloud.to_image().show()
    # Save a second copy as temp.jpg next to the script; bmppath is set in the main block
    cloud.to_file(path.join(bmppath, 'temp.jpg'))

  In the code above, cloud.to_file(path.join(bmppath, 'temp.jpg')) saves temp.jpg, so the image sent later is temp.jpg by default.

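  5. Sending email:

  Apply for an authorization code in QQ Mail, then send the mail to yourself. Embedding an img in the body was fiddly and took me a long time to figure out: the image has to be referenced by cid (a bit like ajax or format). The HTML body points at cid:image1, and the attached image carries the matching Content-ID header <image1>.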

def Send_email():
    my_sender = '1842449680@qq.com'  # sender's email account
    my_pass = 'XXXXXXXX'  # put your authorization code here; it is used as the sender's password
    my_user = '1842449680@qq.com'  # recipient's email account; I send it to myself

    ret = True
    try:
        msg = MIMEMultipart()
        # msg = MIMEText('email body goes here', 'plain', 'utf-8')
        msg['From'] = formataddr(["Empirefree", my_sender])  # sender's display name and account
        msg['To'] = formataddr(["Empirefree", my_user])  # recipient's display name and account
        msg['Subject'] = "Word cloud of the blogs recommended on the cnblogs home page"  # subject (title) of the email

        content = '<b>SKT, <i>Empirefree</i> </b>sends you the latest cnblogs content.<br><p><img src="cid:image1"><p>'
        msgText = MIMEText(content, 'html', 'utf-8')
        msg.attach(msgText)
        with open('temp.jpg', 'rb') as fp:
            img = MIMEImage(fp.read())
        img.add_header('Content-ID', '<image1>')  # must match cid:image1 in the HTML body
        msg.attach(img)

        server = smtplib.SMTP_SSL("smtp.qq.com", 465)  # QQ Mail SMTP server over SSL, port 465
        server.login(my_sender, my_pass)  # sender's account and authorization code
        server.sendmail(my_sender, [my_user, ], msg.as_string())  # sender, list of recipients, message text
        server.quit()  # close the connection
    except Exception:  # if anything in the try block raises, ret is set to False
        ret = False
    if ret:
        print("Email sent successfully")
    else:
        print("Failed to send email")
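The cid mechanism can be checked without sending anything. The sketch below builds a message with an inline image and leaves the SMTP step out. It departs from the function above in three labelled ways: it uses the 'related' multipart subtype, which is commonly used for inline images, instead of the default 'mixed'; it passes the image subtype explicitly so the bytes do not have to be sniffed; and it reads the authorization code from an environment variable whose name, QQ_SMTP_AUTH_CODE, is made up for this example.

```python
import os
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(sender, recipient, image_bytes, cid='image1'):
    """Build an HTML email whose body shows the attached image inline via its Content-ID."""
    msg = MIMEMultipart('related')
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = 'Word cloud'
    # The img tag refers to the attachment through the cid: URL scheme.
    html = '<p>Latest word cloud:</p><p><img src="cid:{0}"></p>'.format(cid)
    msg.attach(MIMEText(html, 'html', 'utf-8'))
    # The subtype is given explicitly so the image type is not guessed from the data.
    img = MIMEImage(image_bytes, _subtype='jpeg')
    img.add_header('Content-ID', '<{0}>'.format(cid))
    msg.attach(img)
    return msg


def load_auth_code():
    """Read the SMTP authorization code from a (hypothetical) environment variable."""
    return os.environ.get('QQ_SMTP_AUTH_CODE')


if __name__ == '__main__':
    # Placeholder bytes stand in for the contents of temp.jpg.
    message = build_message('me@example.com', 'me@example.com', b'\xff\xd8\xff')
    print(message.get_content_type())
```

Keeping the authorization code out of the source file also makes it safer to share the script or deploy it to a server.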

 

Final complete code:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/5/7 21:37
# @Author : Empirefree
# @File : __init__.py.py
# @Software: PyCharm Community Edition

import requests
import re
import json
from bs4 import BeautifulSoup
from concurrent import futures
from wordcloud import WordCloud
import jieba
import os
from os import path
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart


def Cnblog_getUsers():
    r = requests.get('https://www.cnblogs.com/aggsite/UserStats')
    # Parse the recommended blogs with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    users = [(i.text, i['href']) for i in soup.select('#blogger_list > ul > li > a') if
             'AllBloggers.aspx' not in i['href'] and 'expert' not in i['href']]
    # print(json.dumps(users, ensure_ascii=False))
    return users


def My_Blog_Category(user):
    myusers = user
    category_re = re.compile(r'(.+)\((\d+)\)')
    url = 'https://www.cnblogs.com/{0}/mvc/blog/sidecolumn.aspx'.format(myusers)
    blogApp = myusers
    payload = dict(blogApp=blogApp)
    r = requests.get(url, params=payload)
    # Parse the category sidebar with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    category = [re.search(category_re, i.text).groups() for i in soup.select('.catListPostCategory > ul > li') if
                re.search(category_re, i.text)]
    # print(json.dumps(category, ensure_ascii=False))
    return dict(category=category)


def getPostsDetail(Posts):
    # Extract post details: title, count, URL
    post_re = re.compile(r'\d+\. (.+)\((\d+)\)')
    soup = BeautifulSoup(Posts, 'lxml')
    return [list(re.search(post_re, i.text).groups()) + [i['href']] for i in soup.find_all('a')]


def My_Blog_Detail(user):
    url = 'http://www.cnblogs.com/mvc/Blog/GetBlogSideBlocks.aspx'
    blogApp = user
    showFlag = 'ShowRecentComment, ShowTopViewPosts, ShowTopFeedbackPosts, ShowTopDiggPosts'
    payload = dict(blogApp=blogApp, showFlag=showFlag)
    r = requests.get(url, params=payload)

    print(json.dumps(r.json(), ensure_ascii=False))
    # Recent comments (the data looks slightly different), most-viewed, most-commented, most-recommended
    TopViewPosts = getPostsDetail(r.json()['TopViewPosts'])
    TopFeedbackPosts = getPostsDetail(r.json()['TopFeedbackPosts'])
    TopDiggPosts = getPostsDetail(r.json()['TopDiggPosts'])
    # print(json.dumps(dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts),ensure_ascii=False))
    return dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts)


def My_Blog_getTotal(url):
    # Fetch everything for one blog: categories and rankings
    # Derive the blog user name from the URL
    print('Spider blog:\t{0}'.format(url))
    user = url.split('/')[-2]
    print(user)
    return dict(My_Blog_Detail(user), **My_Blog_Category(user))


def mutiSpider(max_workers=4):
    try:
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:  # multithreading
        # with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:  # multiprocessing
            for blog in executor.map(My_Blog_getTotal, [i[1] for i in users]):
                blogs.append(blog)
    except Exception as e:
        print(e)


def countCategory(category, category_name):
    # Sum the post counts of all categories with the same name
    n = 0
    for name, count in category:
        if name.lower() == category_name:
            n += int(count)
    return n


def View_wordcloud(TopViewPosts):
    ## Generate the word cloud
    # Join the post titles into one long text
    contents = ' '.join([i[0] for i in TopViewPosts])
    # Segment the Chinese text with jieba
    cut_texts = ' '.join(jieba.cut(contents))
    # Font (a Microsoft YaHei file here), at most 2000 words, white background, image 1000 wide and 667 high
    cloud = WordCloud(font_path='C:\\Windows\\WinSxS\\amd64_microsoft-windows-b..core-fonts-chs-boot_31bf3856ad364e35_10.0.17134.1_none_ba644a56789f974c\\msyh_boot.ttf', max_words=2000, background_color="white", width=1000,
                      height=667, margin=2)
    # Generate the word cloud (generate() returns the same WordCloud object)
    wordcloud = cloud.generate(cut_texts)
    # Save the image
    file_name = 'avatar'
    wordcloud.to_file('{0}.jpg'.format(file_name))
    # Show the image
    wordcloud.to_image().show()
    # Save a second copy as temp.jpg next to the script; bmppath is set in the main block
    cloud.to_file(path.join(bmppath, 'temp.jpg'))


def Send_email():
    my_sender = '1842449680@qq.com'  # sender's email account
    my_pass = 'XXXXXXXX'  # authorization code
    my_user = '1842449680@qq.com'  # recipient's email account; I send it to myself

    ret = True
    try:
        msg = MIMEMultipart()
        # msg = MIMEText('email body goes here', 'plain', 'utf-8')
        msg['From'] = formataddr(["Empirefree", my_sender])  # sender's display name and account
        msg['To'] = formataddr(["Empirefree", my_user])  # recipient's display name and account
        msg['Subject'] = "Word cloud of the blogs recommended on the cnblogs home page"  # subject (title) of the email

        content = '<b>SKT, <i>Empirefree</i> </b>sends you the latest cnblogs content.<br><p><img src="cid:image1"><p>'
        msgText = MIMEText(content, 'html', 'utf-8')
        msg.attach(msgText)
        with open('temp.jpg', 'rb') as fp:
            img = MIMEImage(fp.read())
        img.add_header('Content-ID', '<image1>')  # must match cid:image1 in the HTML body
        msg.attach(img)

        server = smtplib.SMTP_SSL("smtp.qq.com", 465)  # QQ Mail SMTP server over SSL, port 465
        server.login(my_sender, my_pass)  # sender's account and authorization code
        server.sendmail(my_sender, [my_user, ], msg.as_string())  # sender, list of recipients, message text
        server.quit()  # close the connection
    except Exception:  # if anything in the try block raises, ret is set to False
        ret = False
    if ret:
        print("Email sent successfully")
    else:
        print("Failed to send email")


if __name__ == '__main__':
    # Cnblog_getUsers()
    # user = 'meditation5201314'
    # My_Blog_Category(user)
    # My_Blog_Detail(user)
    print(os.path.dirname(os.path.realpath(__file__)))
    bmppath = os.path.dirname(os.path.realpath(__file__))
    blogs = []

    # Fetch the list of recommended blogs
    users = Cnblog_getUsers()
    # print(users)
    # print(json.dumps(users, ensure_ascii=False))

    # Fetch the blog data with multiple threads/processes
    mutiSpider()
    # print(json.dumps(blogs,ensure_ascii=False))

    # Collect all category entries
    category = [category for blog in blogs if blog['category'] for category in blog['category']]

    # Merge identical categories
    new_category = {}
    for name, count in category:
        # Convert everything to lower case
        name = name.lower()
        if name not in new_category:
            new_category[name] = countCategory(category, name)
    # sorted() returns a new list, so the result has to be kept
    sorted_category = sorted(new_category.items(), key=lambda i: int(i[1]), reverse=True)
    print(sorted_category)
    TopViewPosts =   # the right-hand side is missing here; it should collect the 'TopViewPosts' entries of every blog in blogs
    TopViewPosts = sorted(TopViewPosts, key=lambda i: int(i[1]), reverse=True)
    print(TopViewPosts)

    View_wordcloud(TopViewPosts)
    Send_email()


 

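    Summary: overall, the program starts from the recommended blogs, crawls each recommended user's most-viewed, most-commented, and most-recommended rankings, processes the data, turns the processed data into a word cloud, and finally emails it to the user.

 Difficulty 1: the crawler know-how needed to fetch the users and their blogs (regular expressions, parsing, and so on)

 Difficulty 2: installing wordcloud (it really is quite a hassle...)

 Difficulty 3: embedding an image in the email body (the Runoob tutorial gives no example of an inline image for QQ Mail; I had to look it up on the official site myself.)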

Last modification:March 25, 2020