Problem
When requests is used to hit the same IP over and over, especially at high frequency, the error "Max retries exceeded with url" shows up very easily.
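To make the symptom concrete, here is a minimal sketch of the pattern described above (the URL is a placeholder, not a real target):

import requests

# Many rapid requests to the same host, with no close() and no throttling.
# Once the server starts refusing or connections pile up, requests raises
# a ConnectionError whose message contains "Max retries exceeded with url ...".
for i in range(1000):
    res = requests.get('https://www.example.com/page/{}'.format(i))
    print(res.status_code)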
Analysis
First, close the connection promptly with close(); second, wrap the request in try/except so failed fetches can be retried. Keep looping over the request and use sleep to hold the request frequency down.
import logging
import requests

logger = logging.getLogger(__name__)

def get(url):
    try:
        res = requests.get(url)
        # If the response status code is not 200, raise an exception explicitly
        res.raise_for_status()
        # Close the connection !!! -- very important
        res.close()
    except Exception as e:
        logger.error(e)
    else:
        return res.json()
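The helper above only covers the close-and-catch part; a minimal sketch of the loop-plus-sleep retry mentioned in the analysis might look like this (the get_with_retry name, retry count, and delay are my own choices, not from the original):

import time

def get_with_retry(url, retries=5, delay=5):
    # Call the get() helper above a bounded number of times,
    # sleeping between attempts to keep the request frequency low.
    for attempt in range(retries):
        data = get(url)
        if data is not None:
            return data
        time.sleep(delay)
    return None  # every attempt failed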
Someone else's code for scraping Zhihu articles uses the same approach:
html=""
while html == "":#因为请求可能被知乎拒绝,采用循环+sleep的方式重复发送,但保持频率不太高
try:
proxies = get_random_ip(ipList)
print("这次试用ip:{}".format(proxies))
r = requests.request("GET", url, headers=headers, params=querystring, proxies=proxies)
r.encoding = 'utf-8'
html = r.text
return html
except:
print("Connection refused by the server..")
print("Let me sleep for 5 seconds")
print("ZZzzzz...")
sleep(5)
print("Was a nice sleep, now let me continue...")
continue
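One caveat about the loop above: if the server keeps refusing, while html == "" never exits. A sketch with a bounded number of attempts, reusing get_random_ip, headers, querystring and ipList from the snippet (the fetch_html name, the max_tries value, and narrowing the bare except to requests.exceptions.RequestException are my additions):

from time import sleep
import requests

def fetch_html(url, headers, querystring, ipList, max_tries=10):
    # Same retry-with-sleep idea, but give up after max_tries attempts
    # instead of looping forever.
    for attempt in range(max_tries):
        try:
            proxies = get_random_ip(ipList)  # proxy helper from the original snippet
            r = requests.request("GET", url, headers=headers,
                                 params=querystring, proxies=proxies)
            r.encoding = 'utf-8'
            return r.text
        except requests.exceptions.RequestException:
            print("Connection refused by the server, sleeping 5 seconds...")
            sleep(5)
    return ""  # all attempts failed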
Putting it together
How do we keep the whole crawl running without interruption, so that when Python 3 raises an error it only skips the current iteration and the loop continues?
import time

if __name__ == '__main__':
    for i in range(80):
        url = 'https://www.xxxx.com/qiye/{}.htm'.format(1000 - i)
        try:
            getData(url)  # put your scraping function here; an error no longer matters
        except:
            print(url + " Let me sleep for 5 seconds")
            print("ZZzzzz...")
            time.sleep(5)
            print("Was a nice sleep, now let me continue...")
            continue
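To tie the two parts together, here is a sketch where the outer loop never dies and each URL gets its own bounded retries via the get_with_retry() helper sketched earlier (the helper, the failed list, and the final report are my additions; the URL pattern and range come from the original):

import time

if __name__ == '__main__':
    failed = []  # remember URLs that never succeeded
    for i in range(80):
        url = 'https://www.xxxx.com/qiye/{}.htm'.format(1000 - i)
        data = get_with_retry(url)  # bounded retries per URL
        if data is None:
            failed.append(url)  # record it and move on; the loop keeps running
            time.sleep(5)       # back off a little before the next URL
    print("Finished; {} URLs could not be fetched.".format(len(failed)))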