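1. Preface:
Overall approach: use multithreading (mutiSpider) to crawl the recommended blogs on the cnblogs home page. For each user name, fetch that user's most-viewed posts (TopViewPosts), most-commented posts (TopFeedbackPosts), and most-recommended posts (TopDiggPosts). Then process the data (merge categories), do a basic sort (the most-viewed ranking is the example here) to find the most-read posts, generate an image with a word cloud (wordcloud), and finally email it to myself. (If you are interested, you can deploy it to a server!)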
1.1 Reference links:
Expert's blog: https://www.cnblogs.com/lovesoo/p/7780957.html (I recommend reading this first; my code improves and extends it)
wordcloud download: https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud (I downloaded wordcloud-1.5.0-cp36-cp36m-win32.whl)
Sending email: https://www.runoob.com/python/python-email.html (the runoob tutorial, recommended)
1.2 Result:
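2. Environment setup:
Python 3.6.5 (this corresponds to cp36; remember it, because you will need it whenever you download whl files)
PyCharm + QQ Mail authorization code + wordcloud-1.5.0-cp36-cp36m-win32.whl
Windows 10, 64-bit (although my system is 64-bit, the win_amd64.whl build of wordcloud was not compatible; switching to win32.whl worked)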
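A likely explanation for the win32 wheel working on a 64-bit system is that pip matches a wheel against the bitness of the Python interpreter, not of Windows, and a 32-bit Python can be installed on 64-bit Windows. The sketch below prints the tags to look for in a wheel file name. The helper name interpreter_wheel_hint is hypothetical, and the "cp" prefix assumes CPython.

```python
import struct
import sys


def interpreter_wheel_hint():
    """Return (python_tag, bits) for the running interpreter (assumes CPython)."""
    # Pointer size in bytes times 8 gives the interpreter's bitness.
    bits = struct.calcsize("P") * 8
    # For example 'cp36' for CPython 3.6.
    python_tag = "cp{0}{1}".format(sys.version_info.major, sys.version_info.minor)
    return python_tag, bits


tag, bits = interpreter_wheel_hint()
# On Windows, a 64-bit interpreter needs win_amd64 wheels, a 32-bit one needs win32.
platform_tag = "win_amd64" if bits == 64 else "win32"
print("Python tag:", tag, "| interpreter bits:", bits, "| Windows wheel suffix:", platform_tag)
```

If this reports 32 bits, the win32.whl is the matching download even on 64-bit Windows.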
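2.0 What the reader needs to provide:
1. The image for the word cloud (mine is avatar.jpg) and a font installed on your computer (see the View_wordcloud function for details)
2. The SMTP authorization code for your mailbox (the "password" is the authorization code)
3. By default, all code, images, etc. are in the same folder.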
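2.1 Libraries to import (word cloud + email + crawler)
Notes: 1. requests and BeautifulSoup are needed for the crawler; wordcloud and jieba for the word cloud; smtplib and email for sending mail. Everything else is basic Python.
2. Installing wordcloud often fails with an error. Download the whl file from the site (see the wordcloud download link in 1.1), then run pip install on it locally in cmd.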
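Before running the crawler it can help to check which third-party packages are still missing. This is a minimal sketch using only the standard library; the function name missing_packages and the REQUIRED mapping are illustrative, and lxml is listed because the code in this post passes 'lxml' to BeautifulSoup.

```python
import importlib.util

# import name -> name to pass to pip
REQUIRED = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "wordcloud": "wordcloud",
    "jieba": "jieba",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be found for import."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages()
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All third-party packages are available.")
```

smtplib and email ship with Python, so they are not in the list.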
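3. Writing the crawler
3.1 Recommended blogs on the cnblogs home page
In the browser's developer tools, select XHR and find https://www.cnblogs.com/aggsite/UserStats. A plain requests GET fetches it, and the response is HTML.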
# coding: utf-8
import requests

r = requests.get('https://www.cnblogs.com/aggsite/UserStats')
print(r.text)
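The data then needs some basic processing. One option is to parse the HTML with Beautiful Soup; the other is to filter the content with a regular expression.
With BeautifulSoup we use the CSS selector method .select to find all a elements under id="blogger_list" > ul > li, then drop the "More recommended blogs" and "Blog list (by points)" links from the result.
The regular-expression approach is analogous: first build a pattern that matches the entries, then use re.findall to find all of them, and again drop the "More recommended blogs" and "Blog list (by points)" links.
That completes the first step: we have the list of recommended blogs from the home page.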
# coding: utf-8
import requests
import re
import json
from bs4 import BeautifulSoup

# Fetch the list of recommended blogs
r = requests.get('https://www.cnblogs.com/aggsite/UserStats')

# Parse with BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')
users = [(i.text, i['href']) for i in soup.select('#blogger_list > ul > li > a')
         if 'AllBloggers.aspx' not in i['href'] and 'expert' not in i['href']]
print(json.dumps(users, ensure_ascii=False))

# Alternatively, use a regular expression
user_re = re.compile('<a href="(http://www.cnblogs.com/.+)" target="_blank">(.+)</a>')
users = [(name, url) for url, name in re.findall(user_re, r.text)
         if 'AllBloggers.aspx' not in url and 'expert' not in url]
print(json.dumps(users, ensure_ascii=False))
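The regular-expression filter can be tried without any network access. The sketch below runs on a made-up HTML string (SAMPLE_HTML is an assumption about the shape of the response, not captured data). It also swaps the greedy `.+` groups for tighter ones, because a greedy group can run across several links when they all sit on one line.

```python
import re

# Hypothetical sample that mimics the shape of the recommended-blog list.
SAMPLE_HTML = (
    '<div id="blogger_list"><ul>'
    '<li><a href="http://www.cnblogs.com/alice/" target="_blank">Alice</a></li>'
    '<li><a href="http://www.cnblogs.com/bob/" target="_blank">Bob</a></li>'
    '<li><a href="http://www.cnblogs.com/AllBloggers.aspx" target="_blank">Blog list</a></li>'
    '<li><a href="http://www.cnblogs.com/expert/" target="_blank">More</a></li>'
    '</ul></div>'
)

# [^"]+ and .+? keep each match inside a single <a> tag.
USER_RE = re.compile(
    r'<a href="(https?://www\.cnblogs\.com/[^"]+)" target="_blank">(.+?)</a>')


def extract_users(html):
    """Return (name, url) pairs, skipping the two navigation links."""
    return [(name, url) for url, name in USER_RE.findall(html)
            if 'AllBloggers.aspx' not in url and 'expert' not in url]


users = extract_users(SAMPLE_HTML)
print(users)
# The user name is the last path segment before the trailing slash.
print([url.split('/')[-2] for _, url in users])
```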
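Now we have the blogs of the recommended users. Next, open one user's blog and find the sidecolumn.aspx endpoint, which returns the information we need: the post categories. Click Headers to inspect the request. It is also a GET endpoint, the path contains the blog user name, and it takes the parameter blogApp=<user name>:
https://www.cnblogs.com/meditation5201314/mvc/blog/sidecolumn.aspx?blogApp=meditation5201314. Sending a requests GET is enough.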
# coding: utf-8
import requests

user = 'meditation5201314'
url = 'http://www.cnblogs.com/{0}/mvc/blog/sidecolumn.aspx'.format(user)
blogApp = user
payload = dict(blogApp=blogApp)
r = requests.get(url, params=payload)
print(r.text)
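The sidebar lists each category as a name followed by a post count in parentheses. The sketch below shows how the pattern used later in My_Blog_Category splits such a string; the sample strings are invented, and parse_category is a hypothetical helper name.

```python
import re

# Same pattern as category_re in the full code, written as a raw string.
CATEGORY_RE = re.compile(r'(.+)\((\d+)\)')


def parse_category(text):
    """Split 'Name(12)' into ('Name', 12); return None when there is no count."""
    match = CATEGORY_RE.search(text.strip())
    if match is None:
        return None
    name, count = match.groups()
    return name.strip(), int(count)


print(parse_category('Python(12)'))
print(parse_category('C++ (STL)(3)'))   # greedy (.+) keeps the inner parentheses in the name
print(parse_category('No count here'))
```

Because the first group is greedy, only the last parenthesised number is treated as the count.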
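This gives the blog's categories and the number of posts in each. I won't show the other two requests here. There are three features in total: fetching the user's most-viewed posts (TopViewPosts), most-commented posts (TopFeedbackPosts), and most-recommended posts (TopDiggPosts). See the recommended blog post in 1.1 for details. The multithreaded crawler code is also there and is fairly simple. After that the data is sorted, as in the full code for this part.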
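The multithreading step can be sketched in isolation. Here fetch_blog is a stand-in for My_Blog_getTotal that does no network access, and crawl_all is a hypothetical name. The sketch uses submit with as_completed instead of executor.map, so one failing blog is recorded and the others are still collected; with map inside a single try block, the first exception ends the loop and the remaining results are lost.

```python
from concurrent import futures


def fetch_blog(url):
    """Stand-in for My_Blog_getTotal: no network, just extracts the user name."""
    user = url.rstrip('/').split('/')[-1]
    if not user:
        raise ValueError('no user name in {0!r}'.format(url))
    return {'user': user}


def crawl_all(urls, max_workers=4):
    """Run fetch_blog for every URL in a thread pool; return (results, errors)."""
    results, errors = [], []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch_blog, u): u for u in urls}
        for fut in futures.as_completed(future_to_url):
            url = future_to_url[fut]
            try:
                results.append(fut.result())
            except Exception as exc:
                # Keep going: one bad blog should not discard the rest.
                errors.append((url, exc))
    return results, errors


results, errors = crawl_all([
    'https://www.cnblogs.com/alice/',
    'https://www.cnblogs.com/bob/',
    '',  # deliberately invalid, to show the error path
])
print(sorted(r['user'] for r in results), len(errors))
```

Results arrive in completion order, so sort them if the order matters.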
Full code for this part:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/5/7 21:37
# @Author : Empirefree
# @File : __init__.py.py
# @Software: PyCharm Community Edition

import requests
import re
import json
from bs4 import BeautifulSoup
from concurrent import futures
from wordcloud import WordCloud
import jieba
import os
from os import path
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart


def Cnblog_getUsers():
    r = requests.get('https://www.cnblogs.com/aggsite/UserStats')
    # Parse the recommended blogs with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    users = [(i.text, i['href']) for i in soup.select('#blogger_list > ul > li > a')
             if 'AllBloggers.aspx' not in i['href'] and 'expert' not in i['href']]
    # print(json.dumps(users, ensure_ascii=False))
    return users


def My_Blog_Category(user):
    myusers = user
    category_re = re.compile(r'(.+)\((\d+)\)')
    url = 'https://www.cnblogs.com/{0}/mvc/blog/sidecolumn.aspx'.format(myusers)
    blogApp = myusers
    payload = dict(blogApp=blogApp)
    r = requests.get(url, params=payload)
    # Parse the sidebar categories with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    category = [re.search(category_re, i.text).groups()
                for i in soup.select('.catListPostCategory > ul > li')
                if re.search(category_re, i.text)]
    # print(json.dumps(category, ensure_ascii=False))
    return dict(category=category)


def getPostsDetail(Posts):
    # Get post details: title, count, URL
    post_re = re.compile(r'\d+\. (.+)\((\d+)\)')
    soup = BeautifulSoup(Posts, 'lxml')
    return [list(re.search(post_re, i.text).groups()) + [i['href']] for i in soup.find_all('a')]


def My_Blog_Detail(user):
    url = 'http://www.cnblogs.com/mvc/Blog/GetBlogSideBlocks.aspx'
    blogApp = user
    showFlag = 'ShowRecentComment, ShowTopViewPosts, ShowTopFeedbackPosts, ShowTopDiggPosts'
    payload = dict(blogApp=blogApp, showFlag=showFlag)
    r = requests.get(url, params=payload)

    print(json.dumps(r.json(), ensure_ascii=False))
    # Recent comments (their data looks a bit different), most viewed, most commented, most recommended
    TopViewPosts = getPostsDetail(r.json()['TopViewPosts'])
    TopFeedbackPosts = getPostsDetail(r.json()['TopFeedbackPosts'])
    TopDiggPosts = getPostsDetail(r.json()['TopDiggPosts'])
    # print(json.dumps(dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts), ensure_ascii=False))
    return dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts)


def My_Blog_getTotal(url):
    # Get all information for one blog, including categories and rankings
    # The blog user name is taken from the URL
    print('Spider blog:\t{0}'.format(url))
    user = url.split('/')[-2]
    print(user)
    return dict(My_Blog_Detail(user), **My_Blog_Category(user))


def mutiSpider(max_workers=4):
    # Uses the module-level names users and blogs set in __main__
    try:
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:  # threads
        # with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:  # processes
            for blog in executor.map(My_Blog_getTotal, [i[1] for i in users]):
                blogs.append(blog)
    except Exception as e:
        print(e)


def countCategory(category, category_name):
    # Sum the post counts of categories with the same (lower-cased) name
    n = 0
    for name, count in category:
        if name.lower() == category_name:
            n += int(count)
    return n


if __name__ == '__main__':
    # Cnblog_getUsers()
    # user = 'meditation5201314'
    # My_Blog_Category(user)
    # My_Blog_Detail(user)
    print(os.path.dirname(os.path.realpath(__file__)))
    bmppath = os.path.dirname(os.path.realpath(__file__))
    blogs = []

    # Get the list of recommended blogs
    users = Cnblog_getUsers()
    # print(users)
    # print(json.dumps(users, ensure_ascii=False))

    # Fetch blog information with threads/processes
    mutiSpider()
    # print(json.dumps(blogs, ensure_ascii=False))

    # Collect all category entries
    category = [category for blog in blogs if blog['category'] for category in blog['category']]

    # Merge identical categories
    new_category = {}
    for name, count in category:
        # Convert everything to lower case
        name = name.lower()
        if name not in new_category:
            new_category[name] = countCategory(category, name)
    # sorted() returns a new list, so the result has to be kept
    sorted_category = sorted(new_category.items(), key=lambda i: int(i[1]), reverse=True)
    print(sorted_category)
    # NOTE: the right-hand side of this assignment is missing in the original post;
    # flattening every blog's TopViewPosts is an assumed reconstruction.
    TopViewPosts = [post for blog in blogs for post in blog['TopViewPosts']]
    TopViewPosts = sorted(TopViewPosts, key=lambda i: int(i[1]), reverse=True)
    print(TopViewPosts)
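The merge-and-sort step at the end of the script can be tried on its own. This sketch uses collections.Counter as an alternative to countCategory (one pass instead of rescanning the list for every name); the sample data is invented. It also shows why the counts go through int() before sorting: as strings, '99' would sort above '250'.

```python
from collections import Counter


def merge_categories(pairs):
    """Sum counts per lower-cased category name, largest total first."""
    totals = Counter()
    for name, count in pairs:
        totals[name.lower()] += int(count)
    return totals.most_common()


# Invented sample in the (name, count-as-string) shape the crawler returns.
pairs = [('Python', '12'), ('python', '3'), ('Java', '7'), ('JAVA', '1')]
print(merge_categories(pairs))

# Posts are [title, count, url]; the count is a string, so convert before sorting.
posts = [['A', '10', 'u1'], ['B', '250', 'u2'], ['C', '99', 'u3']]
ranked = sorted(posts, key=lambda p: int(p[1]), reverse=True)
print([p[0] for p in ranked])
```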
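4. Generating the word cloud
Process the recommended-blog content (a list). For details on using wordcloud you can search online. In short, it generates an image from a given img and txt, combining the two. font_path points to a font on your own machine; search your C: drive for one, since the path will not be the same for everyone.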
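Note on installing wordcloud: this is somewhat complicated. Installing from inside PyCharm did not work for me, so I first downloaded the whl file from the site and then ran in cmd
pip install wordcloud-1.5.0-cp36-cp36m-win32.whl
and then copied the installed folder into PyCharm's venv/Lib/site-packages/, after which it worked (I personally recommend this approach; it has never failed me!)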
def View_wordcloud(TopViewPosts):
    # Generate the word cloud
    # Join the post titles into one long text
    contents = ' '.join([i[0] for i in TopViewPosts])
    # Segment the Chinese text with jieba
    cut_texts = ' '.join(jieba.cut(contents))
    # Set the font (machine-specific path), at most 2000 words, white background, image 1000 wide and 667 high
    cloud = WordCloud(font_path='C:\\Windows\\WinSxS\\amd64_microsoft-windows-b..core-fonts-chs-boot_31bf3856ad364e35_10.0.17134.1_none_ba644a56789f974c\\msyh_boot.ttf',
                      max_words=2000, background_color="white", width=1000, height=667, margin=2)
    # Generate the word cloud
    wordcloud = cloud.generate(cut_texts)
    # Save the image
    file_name = 'avatar'
    wordcloud.to_file('{0}.jpg'.format(file_name))
    # Show the image
    wordcloud.to_image().show()
    # bmppath is the script folder, set in __main__
    cloud.to_file(path.join(bmppath, 'temp.jpg'))
In the code above, cloud.to_file(path.join(bmppath, 'temp.jpg')) saves temp.jpg, so the image sent later is simply temp.jpg.
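Since font_path differs from machine to machine, one option is to try a few candidate paths and use the first that exists. This is only a sketch: pick_font is a hypothetical helper, and the paths in CANDIDATE_FONTS are typical Windows font locations that are assumptions, not paths taken from this post.

```python
import os

# Assumed typical locations of Chinese-capable fonts on Windows; adjust for your machine.
CANDIDATE_FONTS = [
    r'C:\Windows\Fonts\msyh.ttc',
    r'C:\Windows\Fonts\msyh.ttf',
    r'C:\Windows\Fonts\simhei.ttf',
]


def pick_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that is an existing file, else None."""
    for font_path in candidates:
        if os.path.isfile(font_path):
            return font_path
    return None


font = pick_font()
if font is None:
    print('No candidate font found; set font_path by hand.')
else:
    print('Using font:', font)
```

The returned path could then be passed as font_path to WordCloud in View_wordcloud.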
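5. Sending the email:
Request an authorization code in QQ Mail, then send the mail to yourself. Embedding an img in the body is the troublesome part, and I searched for a long time. The image has to be referenced with cid, which is a bit like ajax and format.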
def Send_email():
    my_sender = '1842449680@qq.com'  # sender's email account
    my_pass = 'XXXXXXXX'  # put your authorization code here (it is used as the password)
    my_user = '1842449680@qq.com'  # recipient's email account; I send to myself
    ret = True
    try:
        msg = MIMEMultipart()
        # msg = MIMEText('email body goes here', 'plain', 'utf-8')
        msg['From'] = formataddr(["Empirefree", my_sender])  # sender display name, sender account
        msg['To'] = formataddr(["Empirefree", my_user])  # recipient display name, recipient account
        msg['Subject'] = "Word cloud of recommended blogs on the cnblogs home page"  # subject, i.e. the title
        content = '<b>SKT, <i>Empirefree</i></b> sends you the latest cnblogs content.<br><p><img src="cid:image1"></p>'
        msgText = MIMEText(content, 'html', 'utf-8')
        msg.attach(msgText)
        with open('temp.jpg', 'rb') as fp:
            img = MIMEImage(fp.read())
        # The Content-ID must match the cid used in the HTML body
        img.add_header('Content-ID', '<image1>')
        msg.attach(img)
        server = smtplib.SMTP_SSL("smtp.qq.com", 465)  # the sender's SMTP server; the SSL port is 465
        server.login(my_sender, my_pass)  # sender account, authorization code
        server.sendmail(my_sender, [my_user, ], msg.as_string())  # sender, recipients, message
        server.quit()  # close the connection
    except Exception:  # any exception raised inside try sets ret = False
        ret = False
    if ret:
        print("Email sent successfully")
    else:
        print("Failed to send email")
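The cid mechanism can be checked without sending anything. The sketch below only builds the message; build_message is a hypothetical helper, the addresses are placeholders, and the image bytes are fake. It uses MIMEMultipart('related') rather than the default container used above, which is a swapped-in choice: 'related' is the multipart subtype meant for HTML with embedded resources.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(sender, recipient, subject, image_bytes, cid='image1'):
    """Build an HTML mail whose <img> points at an attached image via cid."""
    # 'related' tells mail clients that the image belongs to the HTML body.
    msg = MIMEMultipart('related')
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject

    html = '<p>Word cloud:</p><p><img src="cid:{0}"></p>'.format(cid)
    msg.attach(MIMEText(html, 'html', 'utf-8'))

    # The subtype is given explicitly, so no image-type detection is needed.
    img = MIMEImage(image_bytes, _subtype='jpeg')
    # Content-ID is the cid wrapped in angle brackets.
    img.add_header('Content-ID', '<{0}>'.format(cid))
    img.add_header('Content-Disposition', 'inline', filename='temp.jpg')
    msg.attach(img)
    return msg


# Placeholder addresses and fake image bytes, for inspection only.
msg = build_message('me@example.com', 'me@example.com', 'Word cloud',
                    b'\xff\xd8\xff\xe0 fake jpeg bytes')
parts = msg.get_payload()
print(msg.get_content_type())
print(parts[1]['Content-ID'])
print(parts[0].get_payload(decode=True).decode('utf-8'))
```

The string after cid: in the HTML and the Content-ID header (minus the angle brackets) must be identical, otherwise the client shows a broken image.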
Final complete code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/5/7 21:37
# @Author : Empirefree
# @File : __init__.py.py
# @Software: PyCharm Community Edition

import requests
import re
import json
from bs4 import BeautifulSoup
from concurrent import futures
from wordcloud import WordCloud
import jieba
import os
from os import path
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart


def Cnblog_getUsers():
    r = requests.get('https://www.cnblogs.com/aggsite/UserStats')
    # Parse the recommended blogs with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    users = [(i.text, i['href']) for i in soup.select('#blogger_list > ul > li > a')
             if 'AllBloggers.aspx' not in i['href'] and 'expert' not in i['href']]
    # print(json.dumps(users, ensure_ascii=False))
    return users


def My_Blog_Category(user):
    myusers = user
    category_re = re.compile(r'(.+)\((\d+)\)')
    url = 'https://www.cnblogs.com/{0}/mvc/blog/sidecolumn.aspx'.format(myusers)
    blogApp = myusers
    payload = dict(blogApp=blogApp)
    r = requests.get(url, params=payload)
    # Parse the sidebar categories with BeautifulSoup
    soup = BeautifulSoup(r.text, 'lxml')
    category = [re.search(category_re, i.text).groups()
                for i in soup.select('.catListPostCategory > ul > li')
                if re.search(category_re, i.text)]
    # print(json.dumps(category, ensure_ascii=False))
    return dict(category=category)


def getPostsDetail(Posts):
    # Get post details: title, count, URL
    post_re = re.compile(r'\d+\. (.+)\((\d+)\)')
    soup = BeautifulSoup(Posts, 'lxml')
    return [list(re.search(post_re, i.text).groups()) + [i['href']] for i in soup.find_all('a')]


def My_Blog_Detail(user):
    url = 'http://www.cnblogs.com/mvc/Blog/GetBlogSideBlocks.aspx'
    blogApp = user
    showFlag = 'ShowRecentComment, ShowTopViewPosts, ShowTopFeedbackPosts, ShowTopDiggPosts'
    payload = dict(blogApp=blogApp, showFlag=showFlag)
    r = requests.get(url, params=payload)

    print(json.dumps(r.json(), ensure_ascii=False))
    # Recent comments (their data looks a bit different), most viewed, most commented, most recommended
    TopViewPosts = getPostsDetail(r.json()['TopViewPosts'])
    TopFeedbackPosts = getPostsDetail(r.json()['TopFeedbackPosts'])
    TopDiggPosts = getPostsDetail(r.json()['TopDiggPosts'])
    # print(json.dumps(dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts), ensure_ascii=False))
    return dict(TopViewPosts=TopViewPosts, TopFeedbackPosts=TopFeedbackPosts, TopDiggPosts=TopDiggPosts)


def My_Blog_getTotal(url):
    # Get all information for one blog, including categories and rankings
    # The blog user name is taken from the URL
    print('Spider blog:\t{0}'.format(url))
    user = url.split('/')[-2]
    print(user)
    return dict(My_Blog_Detail(user), **My_Blog_Category(user))


def mutiSpider(max_workers=4):
    # Uses the module-level names users and blogs set in __main__
    try:
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:  # threads
        # with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:  # processes
            for blog in executor.map(My_Blog_getTotal, [i[1] for i in users]):
                blogs.append(blog)
    except Exception as e:
        print(e)


def countCategory(category, category_name):
    # Sum the post counts of categories with the same (lower-cased) name
    n = 0
    for name, count in category:
        if name.lower() == category_name:
            n += int(count)
    return n


def View_wordcloud(TopViewPosts):
    # Generate the word cloud
    # Join the post titles into one long text
    contents = ' '.join([i[0] for i in TopViewPosts])
    # Segment the Chinese text with jieba
    cut_texts = ' '.join(jieba.cut(contents))
    # Set the font (machine-specific path), at most 2000 words, white background, image 1000 wide and 667 high
    cloud = WordCloud(font_path='C:\\Windows\\WinSxS\\amd64_microsoft-windows-b..core-fonts-chs-boot_31bf3856ad364e35_10.0.17134.1_none_ba644a56789f974c\\msyh_boot.ttf',
                      max_words=2000, background_color="white", width=1000, height=667, margin=2)
    # Generate the word cloud
    wordcloud = cloud.generate(cut_texts)
    # Save the image
    file_name = 'avatar'
    wordcloud.to_file('{0}.jpg'.format(file_name))
    # Show the image
    wordcloud.to_image().show()
    # bmppath is the script folder, set in __main__
    cloud.to_file(path.join(bmppath, 'temp.jpg'))


def Send_email():
    my_sender = '1842449680@qq.com'  # sender's email account
    my_pass = 'XXXXXXXX'  # authorization code
    my_user = '1842449680@qq.com'  # recipient's email account; I send to myself

    ret = True
    try:
        msg = MIMEMultipart()
        # msg = MIMEText('email body goes here', 'plain', 'utf-8')
        msg['From'] = formataddr(["Empirefree", my_sender])  # sender display name, sender account
        msg['To'] = formataddr(["Empirefree", my_user])  # recipient display name, recipient account
        msg['Subject'] = "Word cloud of recommended blogs on the cnblogs home page"  # subject, i.e. the title

        content = '<b>SKT, <i>Empirefree</i></b> sends you the latest cnblogs content.<br><p><img src="cid:image1"></p>'
        msgText = MIMEText(content, 'html', 'utf-8')
        msg.attach(msgText)
        with open('temp.jpg', 'rb') as fp:
            img = MIMEImage(fp.read())
        # The Content-ID must match the cid used in the HTML body
        img.add_header('Content-ID', '<image1>')
        msg.attach(img)

        server = smtplib.SMTP_SSL("smtp.qq.com", 465)  # the sender's SMTP server; the SSL port is 465
        server.login(my_sender, my_pass)  # sender account, authorization code
        server.sendmail(my_sender, [my_user, ], msg.as_string())  # sender, recipients, message
        server.quit()  # close the connection
    except Exception:  # any exception raised inside try sets ret = False
        ret = False
    if ret:
        print("Email sent successfully")
    else:
        print("Failed to send email")


if __name__ == '__main__':
    # Cnblog_getUsers()
    # user = 'meditation5201314'
    # My_Blog_Category(user)
    # My_Blog_Detail(user)
    print(os.path.dirname(os.path.realpath(__file__)))
    bmppath = os.path.dirname(os.path.realpath(__file__))
    blogs = []

    # Get the list of recommended blogs
    users = Cnblog_getUsers()
    # print(users)
    # print(json.dumps(users, ensure_ascii=False))

    # Fetch blog information with threads/processes
    mutiSpider()
    # print(json.dumps(blogs, ensure_ascii=False))

    # Collect all category entries
    category = [category for blog in blogs if blog['category'] for category in blog['category']]

    # Merge identical categories
    new_category = {}
    for name, count in category:
        # Convert everything to lower case
        name = name.lower()
        if name not in new_category:
            new_category[name] = countCategory(category, name)
    # sorted() returns a new list, so the result has to be kept
    sorted_category = sorted(new_category.items(), key=lambda i: int(i[1]), reverse=True)
    print(sorted_category)
    # NOTE: the right-hand side of this assignment is missing in the original post;
    # flattening every blog's TopViewPosts is an assumed reconstruction.
    TopViewPosts = [post for blog in blogs for post in blog['TopViewPosts']]
    TopViewPosts = sorted(TopViewPosts, key=lambda i: int(i[1]), reverse=True)
    print(TopViewPosts)

    View_wordcloud(TopViewPosts)
    Send_email()
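Summary: overall, the program starts from the recommended blogs, crawls each recommended user's most-viewed, most-commented, and most-recommended rankings, processes the data, turns the processed data into a word cloud, and finally emails it to the user.
Difficulty 1: the crawling knowledge needed to fetch users and blogs (regular expressions, parsing, and so on)
Difficulty 2: installing wordcloud (it really is a hassle...)
Difficulty 3: embedding an image in the email body (the runoob tutorial has no example of an inline image for QQ Mail; I found it on the official site myself.)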