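Getting Started with Python Web Crawlers

"Crawler" (or "spider") is a metaphorical name for techniques that fetch data from the internet. Python is a popular choice for writing crawlers because its standard library and third-party packages make the job convenient. Node.js can also be used to write crawlers.

Introduction

As the figure shows, a crawl has four steps. You hand the address of the page you want to crawl to a downloader (urllib2, requests), and the downloader fetches the page. You then extract the data you need with a filtering technique (regular expressions, XPath). Finally, you save that data for later use.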
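
The download, extract, and save steps can be sketched roughly as follows. This is a minimal Python 3 sketch, not code from the original article: the names extract_links, save_records, crawl, and fake_download are hypothetical, and the downloader is a stub returning fixed HTML so that the example runs without network access.

```python
import json
import os
import re
import tempfile


def extract_links(html):
    """Extract step: pull every href value out of the page with a regex."""
    return re.findall(r'href="(.*?)"', html)


def save_records(records, path):
    """Save step: write the extracted data to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def crawl(url, download, path):
    """Run the pipeline: download the page, extract data, then save it."""
    html = download(url)
    links = extract_links(html)
    save_records(links, path)
    return links


def fake_download(url):
    """Hypothetical stand-in for urllib2/requests that returns fixed HTML."""
    return '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'


out_path = os.path.join(tempfile.gettempdir(), "links.json")
links = crawl("http://example.com", fake_download, out_path)
print(links)
```

Passing the downloader in as an argument keeps the extract and save steps independent of whichever HTTP library is used.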

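urllib and urllib2 (Page Downloading)

In Python 2.7, urllib and urllib2 are two separate modules; in Python 3.x they are merged into a single urllib package.
Both modules handle URL requests, but they offer different functionality:

urllib2 can accept a Request object, which lets you set the headers of a request, whereas urllib accepts only a URL. This means that with urllib alone you cannot spoof your User-Agent string.
urllib provides the urlencode function, which builds GET query strings; urllib2 has no equivalent, so the two modules are often used together.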
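
In Python 3 the same two jobs live in urllib.parse and urllib.request. The following is a minimal sketch assuming Python 3; the URL, the query parameters, and the User-Agent value are placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build a GET query string (the job urllib.urlencode does in Python 2).
query = urlencode({"q": "python", "page": 2})
url = "http://www.example.com/search?" + query

# Attach custom headers through a Request object (the urllib2.Request job).
req = Request(url, headers={"User-Agent": "my-crawler/0.1"})

print(req.full_url)
# Request stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```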

The __urllib2__ module defines functions and classes for opening URLs.
It supports authentication, redirections, cookies, and more.

urllib2.urlopen(url)

Opens a URL, which can be given either as a string or as a Request object.

import urllib2
url = r'https://www.baidu.com'
html = urllib2.urlopen(url).read()
print html

In the example above, urllib2.urlopen(url) receives a string.

import urllib2
url = r'https://www.baidu.com'
req = urllib2.Request(url)
html = urllib2.urlopen(req).read()
print html

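In the example above, urllib2.urlopen(req) receives a Request object.
urllib2 fetches pages through an opener object. If the first argument to urllib2.urlopen() is a string, urllib2 uses the default opener to request the page.
urllib2.urlopen() returns a file-like object with three additional methods:

geturl() - returns the URL of the resource retrieved, typically used to check whether a redirect occurred.
info() - returns the page's __meta__ information, such as the response headers.
getcode() - returns the HTTP status code, typically used to check the outcome of the request.

urllib2.Request(url[, data][, headers])

This class is an abstraction of a URL request. A __Request__ instance is commonly used to set request headers.

url - the URL to request; required.
data - the data to send to the server; it should be a string. If data is set, urllib2.urlopen(req) sends the request as a POST.
headers - should be a dict; often used to spoof the User-Agent.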

import urllib
import urllib2
url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'username' : 'cqc',  'password' : 'XXXX'}
headers = {'User-Agent' : user_agent }
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
page = response.read()
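
Whether a request would go out as GET or POST can be checked before anything is sent. The sketch below assumes Python 3, where the body must be bytes rather than a string; it reuses the placeholder URL and credentials from the example above and does not contact any server.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.server.com/login"
values = {"username": "cqc", "password": "XXXX"}

# In Python 3 the body must be bytes, so encode the urlencoded string.
data = urlencode(values).encode("ascii")

get_req = Request(url)              # no data: the request is a GET
post_req = Request(url, data=data)  # data present: the request is a POST

print(get_req.get_method())
print(post_req.get_method())
```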

urllib2.install_opener(opener) and urllib2.build_opener([handler, ...])

install_opener and build_opener are usually used together.
build_opener returns an OpenerDirector instance. Its handler arguments can be instances of BaseHandler or of its subclasses, such as: ProxyHandler (if proxy settings are detected), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
install_opener installs an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen() to use it; otherwise, simply call OpenerDirector.open() instead of urlopen().

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Overriding the 301/302 hooks with no-ops stops urllib2 from following redirects.
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://www.google.cn')

By default, urllib2 follows 30x redirects automatically. Defining a custom HTTPRedirectHandler subclass, as above, disables this.

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
			uri='https://mahler:8092/site-updates.py',
			user='klem',
			passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')

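Exception Handling: urllib2.URLError and urllib2.HTTPError

URLError - raised by the handlers when they run into a problem, usually because there is no network connection (no route to the specified server) or because the specified server does not exist. It is a subclass of IOError. The exception has a 'reason' attribute that holds an error code and a text description. In the code below, the request targets an unreachable address; after catching the exception, we print the reason object to see the error code and the description.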

import urllib2
req = urllib2.Request('http://www.python11.org/')
try:
    response = urllib2.urlopen(req)
except urllib2.URLError, e:
    print e.reason
    print e.reason[0]
    print e.reason[1]

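__HTTPError__ - HTTPError is a subclass of URLError. Every HTTP response from the server contains a status code. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers deal with some of these responses for you; for example, when the response is a redirect that asks the client to fetch a different URL, urllib2 handles it automatically. For responses it cannot handle, urlopen raises an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required). The exception has two important attributes, reason and code.

When an error is raised, the server returns an HTTP error code and an error page. You can use the HTTPError instance as the response for that page. This means that, besides the code and reason attributes, it also has the read, geturl, and info methods, as the code below shows.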

import urllib2
req = urllib2.Request('http://www.python.org/fish.html')
try:
	response=urllib2.urlopen(req)
except urllib2.HTTPError,e:
	print e.code
	print e.reason
	print e.geturl()
	print e.read()

To handle both HTTPError and URLError, catch HTTPError before URLError: HTTPError is a subclass of URLError, so an except URLError clause placed first would also catch every HTTPError. For example:

import urllib2
req = urllib2.Request('http://www.python.org/fish.html')
try:
	response=urllib2.urlopen(req)
except urllib2.HTTPError,e:
	print 'The server couldn\'t fulfill the request.'
	print 'Error code: ',e.code
	print 'Error reason: ',e.reason   
except urllib2.URLError,e:
	print 'We failed to reach a server.'
	print 'Reason: ', e.reason
else:
	# everything is fine
	response.read()
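
The effect of the clause order can be tried offline. This is a minimal Python 3 sketch (in Python 3 the two exceptions live in urllib.error); the exceptions are built by hand rather than produced by a real request, and the describe helper and the example URL are hypothetical.

```python
import io
from urllib.error import HTTPError, URLError


def describe(exc):
    """Raise exc and report which except clause caught it."""
    try:
        raise exc
    except HTTPError as e:   # subclass first
        return "HTTPError %d" % e.code
    except URLError as e:    # base class second
        return "URLError %s" % e.reason


# Hand-built exceptions, so the sketch runs without a network connection.
not_found = HTTPError("http://www.example.com/x", 404, "Not Found", {}, io.BytesIO())
unreachable = URLError("no route to host")

print(issubclass(HTTPError, URLError))
print(describe(not_found))
print(describe(unreachable))
```

If the two except clauses were swapped, the 404 case would be reported by the URLError branch instead.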

Requests

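__requests__ is widely regarded as the easiest HTTP library to use in Python. At the time of writing it supports Python 2.6, 2.7, 3.4, 3.5, and 3.6.

Requests is built on urllib3 and covers the features of urllib2. It supports HTTP keep-alive and connection pooling, sessions with persistent cookies, file uploads, automatic detection of the response encoding, internationalized URLs, and automatic encoding of POST data.

Request Methods

requests supports the get, post, put, delete, head, and options request methods.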

import requests
r = requests.get('https://github.com/timeline.json')
r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")

Passing parameters to a __get__ request:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)
# http://httpbin.org/get?key2=value2&key1=value1

You can also pass a list as a value:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3

requests automatically encodes the URL correctly.
__post__ requests
Sometimes you need to send form-encoded data; to do so, submit it with __post__:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)
'''{
	...
	"form": {
		"key2": "value2",
		"key1": "value1"
	},
	...
}'''

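The data to submit goes in the __data__ parameter. You can also pass a list of tuples as data.

Response Content and Encoding

Use r.text to get the response body.
In Python 2, r.text is a unicode object, and printing it can fail when the output encoding cannot represent its characters. To print the content, encode it explicitly: r.text.encode('utf-8')

Response Status Codes

__status_code__ holds the response status code:

r = requests.get('http://httpbin.org/get')
r.status_code # 200

Requests ships with a built-in status code lookup object, requests.codes:

r.status_code == requests.codes.ok # True

Raising an exception with __response.raise_for_status()__
For a failed request, __response.raise_for_status()__ raises an exception: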

bad_r = requests.get('http://httpbin.org/status/404')
bad_r.status_code
# 404
bad_r.raise_for_status()
'''Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error'''

Response Headers

__response.headers__ exposes the headers returned by the server:

r.headers['Content-Type'] # 'application/json'

If a response contains cookies, you can access them quickly:

url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)

r.cookies['example_cookie_name']
# 'example_cookie_value'

To send your own cookies to the server, use the __cookies__ parameter:

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
r.text # '{"cookies": {"cookies_are": "working"}}'

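Cookies are returned in a __RequestsCookieJar__, which behaves like a dict but offers a more complete interface suited to use across multiple domains and paths. You can also pass a cookie jar to Requests: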

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
url = 'http://httpbin.org/cookies'
r = requests.get(url, cookies=jar)
r.text
# '{"cookies": {"tasty_cookie": "yum"}}'

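Redirection and History

By default, Requests handles all redirects automatically.
You can use the __history__ attribute of the response object to track redirects.
__Response.history__ is a list of __Response__ objects, sorted from the oldest request to the most recent.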

r = requests.get('http://github.com')
r.url # https://github.com
r.status_code # 200
r.history # [<Response [301]>]

The __allow_redirects__ parameter disables redirect handling:

r = requests.get('http://github.com', allow_redirects=False)
r.status_code # 301
r.history # []

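Session Objects

A session object lets you persist certain parameters across requests. It also persists cookies across all requests made from the same Session instance.
Persisting cookies across requests: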

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

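Sessions can also provide default data to the request methods, by setting properties on the session object:

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'}) # session level

s.get('http://httpbin.org/headers', headers={'x-test2': 'true'}) # method level

Any dictionary you pass to a request method is merged with the session-level values that are set. Method-level parameters override session parameters.
__Note:__ method-level parameters are not persisted across requests.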

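Regular Expressions

Once we have the page content, the next step is to find the data we need, and regular expressions are one way to do that.
Python supports regular expressions through the __re__ module. The main functions of __re__ are the following eight: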

import re
pattern = re.compile(string[,flags])
re.match(pattern, string[,flags])
re.search(pattern, string[,flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count] )
re.subn(pattern, repl, string[, count])

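re.match(pattern, string[, flags])

This function tries to match pattern starting at the beginning of string (the string being searched) and proceeds forward. If it meets a character that cannot match, it immediately returns None; if it reaches the end of string before the pattern is complete, it also returns None. Both results mean the match failed. Otherwise the match succeeds and stops there, without scanning the rest of string.

re.search(pattern, string[, flags])

search is very similar to match. The difference is that match() only checks whether the pattern matches at the start of string, whereas search() scans the whole string for a match. match() returns a result only if the match succeeds at position 0; if the pattern matches only somewhere later, match() returns None. search() returns the same kind of object as match(), with the same methods and attributes.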
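
The difference between the two functions shows up on a string whose match does not start at position 0. A small Python 3 sketch; the sample text is made up for illustration.

```python
import re

text = "price: 42 USD"

# match() only succeeds if the pattern matches at position 0.
print(re.match(r"\d+", text))

# search() scans the whole string for the first match.
m = re.search(r"\d+", text)
print(m.group(), m.start())

# match() succeeds when the pattern does match at position 0.
print(re.match(r"\w+", text).group())
```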

re.split(pattern,string[,maxsplit])

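Splits string at every substring that matches pattern and returns the pieces as a list. maxsplit sets the maximum number of splits; if omitted, the string is split at every match.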

import re

pattern = re.compile(r'\d+')
print re.split(pattern,'one1two2three3four4')
# ['one', 'two', 'three', 'four', '']
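
The maxsplit argument can be illustrated on the same input. A minimal Python 3 sketch:

```python
import re

pattern = re.compile(r"\d+")

# Split at most twice; the unsplit remainder becomes the last element.
parts = re.split(pattern, "one1two2three3four4", maxsplit=2)
print(parts)
```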

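re.findall(pattern, string[, flags])

Searches string and returns all matching substrings as a list.

import re

pattern = re.compile(r'\d+')
# finditer() finds the same matches as findall(), but yields them one by one as Match objects
for m in re.finditer(pattern,'one1two2three3four4'):
    print m.group(),
# 1 2 3 4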
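
For comparison, here is a short Python 3 sketch of findall() itself on the same input. It also shows that when the pattern contains groups, findall() returns the group values instead of the whole match; the second pattern is made up for illustration.

```python
import re

text = "one1two2three3four4"

# Without groups, findall() returns the matched substrings.
digits = re.findall(r"\d+", text)
print(digits)

# With groups, findall() returns one tuple of group values per match.
pairs = re.findall(r"([a-z]+)(\d)", text)
print(pairs)
```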

re.sub(pattern, repl, string[,count])

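Replaces every matching substring in string with repl and returns the resulting string. When repl is a string, it can refer to groups with \id, \g<id>, or \g<name>, but \0 cannot be used. When repl is a function, it must take a single argument (a Match object) and return the replacement string (the returned string cannot contain further group references). count sets the maximum number of replacements; if omitted, every match is replaced.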

import re
pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
print re.sub(pattern,r'\2 \1', s)
def func(m):
	return m.group(1).title() + ' ' + m.group(2).title()
print re.sub(pattern,func, s)

### output ###
# say i, world hello!
# I Say, Hello World!
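
Named group references and the count argument can be tried on the same sentence. A minimal Python 3 sketch; the group names first and second are chosen for illustration.

```python
import re

pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

# Refer to groups by name with \g<name>.
swapped = pattern.sub(r"\g<second> \g<first>", s)
print(swapped)

# count=1 limits the substitution to the first match only.
once = pattern.sub(r"\g<second> \g<first>", s, count=1)
print(once)
```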

re.subn(pattern,repl,string[, count])

Returns the tuple (sub(pattern, repl, string[, count]), number of substitutions made).
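
A short Python 3 sketch of the returned tuple; the replacement character is arbitrary.

```python
import re

# subn() returns the new string together with the number of replacements.
result, n = re.subn(r"\d+", "#", "one1two2three3four4")
print(result, n)
```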

(.*?)

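__(.*?)__ is the most frequently used pattern; it captures the content that sits between the surrounding markers.

(): marks the content we want to extract
.*: matches any character, zero or more times
?: makes the match non-greedy, so it stops at the first place the rest of the pattern can match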

import re
text = '<a href = "www.baidu.com">....'
urls = re.findall('<a href = (.*?)>',text,re.S)
for each in urls:
	print each

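__Note:__ re.S makes "." match newline characters as well. Without it, the match fails for tags whose opening and closing parts are spread over several lines.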
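
The effect of the ? and of re.S can be seen side by side. A minimal Python 3 sketch; the HTML snippets are made up for illustration.

```python
import re

text = '<a href="a.html">A</a><a href="b.html">B</a>'

# Greedy: .* runs on to the last closing quote it can find.
greedy = re.findall(r'href="(.*)"', text)
# Non-greedy: .*? stops at the first closing quote.
lazy = re.findall(r'href="(.*?)"', text)
print(greedy)
print(lazy)

# Without re.S, "." does not match a newline, so a tag split across lines fails.
multiline = "<p>first\nsecond</p>"
without_s = re.findall(r"<p>(.*?)</p>", multiline)
with_s = re.findall(r"<p>(.*?)</p>", multiline, re.S)
print(without_s)
print(with_s)
```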

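lxml and XPath

lxml is a library for parsing page content, and it works together with XPath. XPath uses path expressions to select nodes or node sets in an XML document; nodes are selected by following a path or steps.

from lxml import etree

html = '''
<div id="test1">content1</div>
<div id="test2">content2</div>
<div id="test3">content3</div>
'''

selector = etree.HTML(html)
# select the text of every div whose id starts with "test"
content = selector.xpath('//div[starts-with(@id,"test")]/text()')
for each in content:
    print each

html1 = '''
<div id="class">Hello,
<font color=red>my</font>
world!
</div>
'''

selector = etree.HTML(html1)
tmp = selector.xpath('//div[@id="class"]')[0]
# string(.) joins all text inside the current node, including nested tags
info = tmp.xpath('string(.)')
content2 = info.replace('\n', '')
print content2

XPath syntax:

// selects matching nodes anywhere in the document
/ steps down one level in the path
[@XX=xx] selects tags with a specific attribute value
/text() returns the content as text
/@para returns the value of the attribute para
string(.) returns everything under the current node as one string
starts-with(@attr, str) matches tags whose attribute attr starts with str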

Beautiful Soup

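Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.

The latest version of Beautiful Soup is __beautifulsoup4__.

pip install beautifulsoup4

Beautiful Soup supports the HTML parser in the Python standard library as well as third-party parsers. __lxml__ is recommended because it is fast and tolerant of malformed markup.

pip install lxml
BeautifulSoup(markup, "lxml")

BeautifulSoup parses a document into a __BeautifulSoup__ object, which can print the document with standard indentation: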

from bs4 import BeautifulSoup
import lxml
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc,'lxml')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

A few simple ways to navigate the structured data:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

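Finding the links of all <a> tags in the document:

for link in soup.find_all('a'):
	print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Getting all the text in the document:

print(soup.get_text())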
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

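Beautiful Soup turns a complex HTML document into a complex tree in which every node is a Python object. All objects fall into four kinds: __Tag__, __NavigableString__, __BeautifulSoup__, and __Comment__.

Tag

A __Tag__ object corresponds to a tag in the original XML or HTML document: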

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

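Tag has many methods and attributes, for example:

tag.name - the name of the tag.
tag['class'] - the value of one attribute, here class.
tag.attrs - all attributes of the tag, as a dict.
tag.contents - the tag's child nodes, as a list.

NavigableString

Strings are usually contained in tags. Beautiful Soup uses the NavigableString class to wrap the string inside a tag: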

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

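A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in navigating and searching the tree. You can convert a NavigableString to a Unicode string with unicode():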

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>

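The string contained in a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
# <b class="boldest">No longer bold</b>

BeautifulSoup

The BeautifulSoup object represents the document as a whole. For most purposes you can treat it as a Tag object; it supports most of the methods described in navigating and searching the tree.
Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. Since it is sometimes useful to look at its .name, however, it is given the special .name "[document]".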

find_all()

find_all() searches all descendants of the current tag and returns those that match the filters.

soup.find_all('title') # find all tags named title via the name argument
# [<title>The Dormouse's story</title>]
soup.find_all('p','title')
# [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all('a', class_="sister") # search tags by CSS class
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(id="link2") # search by keyword argument
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all("a", text="Elsie") # the text argument searches the string content of the document
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
import re
soup.find(text=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'

find()

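find() is used the same way as find_all(); the only difference is that find_all() returns every matching tag in the document, while find() returns only the first.

Parsing XML

XML stands for Extensible Markup Language, and the markup is the key part. You create content and then mark it up with delimiting tags, so that each word, phrase, or block becomes identifiable, classifiable information.
Features:

XML is designed to transport data, not to display it.
XML tags are not predefined; you define your own.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

XML is one of the most common tools for exchanging data between applications, and it is increasingly popular for storing and describing information. Knowing how to parse XML files is therefore important for web development.
The Python standard library provides six packages for working with XML:
__xml.dom__, __xml.dom.minidom__, __xml.dom.pulldom__, __xml.sax__, __xml.parsers.expat__, __xml.etree.ElementTree__
__xml.etree.ElementTree__ is recommended because it has an efficient C implementation, __xml.etree.cElementTree__. Compared with DOM, ET is faster and its API is more direct and convenient.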

ElementTree

try:
	import xml.etree.cElementTree as ET
except ImportError:
	import xml.etree.ElementTree as ET

The above is a common way to import it. Since Python 3.3, however, this is no longer necessary: the ElementTree module automatically uses the C accelerator when it is available and falls back to the Python implementation otherwise.

Loading Data

country_data_as_string = '''<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>'''
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='docl.xml') # load from a file
root = ET.fromstring(country_data_as_string) # load from a string

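Getting Tags and Attributes

root.tag # 'data'
root.attrib # {}

tag returns the tag name, attrib returns the element's attributes as a dict, and text returns the text inside the element.
root is the root element of this XML; to reach its children, iterate over it.

for child in root:
	print child.tag, child.attrib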
#country {'name': 'Liechtenstein'}
#country {'name': 'Singapore'}
#country {'name': 'Panama'}

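Child elements are nested, so we can reach them by index:

root[0][1].text # '2008'

Finding Nodes

__Element__ has methods for quickly finding the nodes you want. element.iter() iterates recursively over the whole subtree below the element: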

for neighbor in root.iter('neighbor'):
	print neighbor.attrib
#{'name': 'Austria', 'direction': 'E'}
#{'name': 'Switzerland', 'direction': 'W'}
#{'name': 'Malaysia', 'direction': 'N'}
#{'name': 'Costa Rica', 'direction': 'W'}
#{'name': 'Colombia', 'direction': 'E'}

__element.findall()__ finds the direct children of the current element that match a tag:

for country in root.findall('country'):
	rank = country.find('rank').text
	name = country.get('name')
	print name, rank
#Liechtenstein 1
#Singapore 4
#Panama 68

find() returns the first matching child, and get() returns the value of an attribute of the current element.
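
The difference in reach between findall(), iter(), and find() can be checked on a small document. A minimal Python 3 sketch using a shortened copy of the country data; the city tag is a made-up name that does not occur in the data.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<country name='Singapore'><rank>4</rank>"
    "<neighbor name='Malaysia' direction='N'/></country>"
    "</data>"
)

# findall() with a bare tag name only looks at direct children of root.
direct = root.findall("neighbor")
print(direct)

# iter() walks the whole subtree, so nested elements are found.
nested = [n.get("name") for n in root.iter("neighbor")]
print(nested)

# find() returns the first matching child, or None when nothing matches.
country = root.find("country")
print(country.get("name"), country.find("rank").text)
print(root.find("city"))
```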

XPath Queries

__ElementTree__ supports a limited subset of XPath queries:

root.findall('./country/neighbor') # all neighbor nodes under country nodes
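
Attribute predicates and the .// prefix are part of the XPath subset that ElementTree understands. A minimal Python 3 sketch on a shortened copy of the country data:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<country name='Liechtenstein'><rank>1</rank>"
    "<neighbor name='Austria' direction='E'/>"
    "<neighbor name='Switzerland' direction='W'/></country>"
    "<country name='Singapore'><rank>4</rank>"
    "<neighbor name='Malaysia' direction='N'/></country>"
    "</data>"
)

# [@attr='value'] filters elements on an attribute value.
rank = root.find("./country[@name='Singapore']/rank").text
print(rank)

# .// searches at any depth below the current element.
west = [n.get("name") for n in root.findall(".//neighbor[@direction='W']")]
print(west)
```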

PyMongo

__PyMongo__ is the recommended library for working with __MongoDB__ from Python.

pip install pymongo

Connecting to MongoDB

__MongoClient__ creates a MongoDB client.

from pymongo import MongoClient
client = MongoClient('localhost', 27017) # connect with host and port
client = MongoClient('mongodb://localhost:27017/') # connect with a URI

We often connect to MongoDB with a username and password.
__Username__ and __password__ must be percent-escaped.
Use __urllib.parse.quote_plus()__ in Python 3 and __urllib.quote_plus()__ in Python 2.

from pymongo import MongoClient
import urllib.parse
username = urllib.parse.quote_plus('user')
password = urllib.parse.quote_plus('pass/word')
client = MongoClient('mongodb://%s:%s@127.0.0.1' % (username, password))
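
What percent-escaping does to the credentials can be inspected without a running MongoDB server. A minimal Python 3 sketch that only builds the URI string; the second password is a made-up value.

```python
from urllib.parse import quote_plus

username = quote_plus("user")
password = quote_plus("pass/word")

# "/" is reserved in URIs, so it is escaped as %2F; plain letters are unchanged.
uri = "mongodb://%s:%s@127.0.0.1" % (username, password)
print(uri)

# "@" and ":" would otherwise be read as URI delimiters; a space becomes "+".
tricky = quote_plus("p@ss:w ord")
print(tricky)
```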

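Getting a Database

__pymongo__ offers two ways to get a database.

db = client.test_database # attribute-style access
db = client['test-database'] # dictionary-style access, for names with special characters

Getting a Collection

A collection can also be obtained in two ways:

collection = db.test_collection
collection = db['test-collection']

A collection is only created once data is inserted into it.

Inserting a Document: insert_one()

p = db.person
person = {'name': 'fynn', 'age': 27}
person_id = p.insert_one(person).inserted_id # insert the document and get its '_id'

MongoDB automatically adds a unique _id field to every inserted document.
__insert_one()__ returns an InsertOneResult instance, whose inserted_id field is the _id.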

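Getting a Single Document: find_one()

import pprint
pprint.pprint(p.find_one())
# {u'_id': ObjectId('...'),
#  u'name': u'fynn',
#  u'age': 27}

__find_one()__ supports query conditions:

p.find_one({'name': 'fynn'})
p.find_one({'_id': person_id})

_id is actually an ObjectId object, but in web development it is often serialized to a string. Queries only match the object form, so the string must be converted back to an ObjectId first.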

from bson.objectid import ObjectId
def get(person_id):
	# convert the string form of the id back to an ObjectId before querying
	return client.db.collection.find_one({'_id': ObjectId(person_id)})

Querying Multiple Documents: find()

p.find()
p.find({'age': 27})

Inserting Multiple Documents: insert_many()

persons = [{'name': 'fynn', 'age': 27}, {'name': 'echo', 'age': 27}]

result = p.insert_many(persons)
result.inserted_ids # [ObjectId('....'), ObjectId('...')]

Updating a Document: update_one()

p.update_one({'age': 21}, {'$set': {'name': 'fynn'}})

Deleting Documents: delete_one(), delete_many()

p.delete_one({'name': 'fynn'})

Counting: count()

p.count() # 2
p.find({'name': 'fynn'}).count() # 1

Range Queries

p.find({'age': {'$lt': 30}}) # younger than 30

Creating an Index

import pymongo
db.person.create_index([('name', pymongo.ASCENDING)], unique=True)

References:

https://docs.python.org/2/library/urllib2.html
http://www.cnblogs.com/wly923/archive/2013/05/07/3057122.html
http://www.w3school.com.cn/xpath/xpath_syntax.asp
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
https://api.mongodb.com/python/current/tutorial.html
