Chinese text in pages scraped with Scrapy turns into Unicode escape strings

程序员瑞娜, Technology, November 9, 2022


If you are not familiar with character encodings, read this first: http://www.cnblogs.com/jiangtu/p/6245264.html

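While learning and using Scrapy to scrape information from the web, I noticed that Scrapy writes any field containing Chinese as Unicode escape sequences (for example `\u4e2d\u6587`) instead of readable characters.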

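The root cause lies in Python's JSON serialization: when the `ensure_ascii` option is enabled, every non-ASCII character is written as a `\uXXXX` escape. Note that `ensure_ascii` is a serializer flag rather than an encoding, and `json.dumps` enables it by default.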
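The default behaviour is easy to reproduce with the standard library alone. The following is a minimal sketch; the `item` dict is a made-up example, not data from a real crawl.

```python
import json

# Hypothetical scraped item containing Chinese text.
item = {"title": "中文"}

# json.dumps uses ensure_ascii=True by default, so each non-ASCII
# character is written as a \uXXXX escape sequence.
escaped = json.dumps(item)
print(escaped)  # {"title": "\u4e2d\u6587"}

# The escapes are only a different representation of the same data:
# parsing the JSON restores the original characters.
restored = json.loads(escaped)
print(restored == item)  # True
```

So the output is still valid JSON and no data is lost; it is just hard for a human to read.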

In Scrapy, JsonItemExporter also enables ensure_ascii by default:

class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        kwargs.setdefault('ensure_ascii', not self.encoding) # look here
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

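As the line marked `# look here` (line 6 of the excerpt) shows, when no encoding is passed, `self.encoding` is unset, so `not self.encoding` evaluates to True and ensure_ascii defaults to True.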
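To see how that `setdefault` line behaves in isolation, here is a small sketch. `SketchExporter` is a hypothetical stand-in written for this illustration, not Scrapy's class, and it uses the standard library's `json.JSONEncoder` in place of `ScrapyJSONEncoder`.

```python
import json


class SketchExporter:
    """Hypothetical stand-in that mimics only the default logic."""

    def __init__(self, encoding=None, **kwargs):
        self.encoding = encoding
        # No encoding given: `not None` is True, so non-ASCII is escaped.
        # encoding='utf-8': `not 'utf-8'` is False, so characters are kept.
        # setdefault never overrides an ensure_ascii passed by the caller.
        kwargs.setdefault("ensure_ascii", not self.encoding)
        self.encoder = json.JSONEncoder(**kwargs)

    def encode(self, item):
        return self.encoder.encode(item)


item = {"name": "中文"}  # made-up example item

default_out = SketchExporter().encode(item)
utf8_out = SketchExporter(encoding="utf-8").encode(item)
explicit_out = SketchExporter(encoding="utf-8", ensure_ascii=True).encode(item)

print(default_out)   # {"name": "\u4e2d\u6587"}
print(explicit_out)  # {"name": "\u4e2d\u6587"}
```

Here `utf8_out` holds the readable form `{"name": "中文"}`, while the other two results stay escaped.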

So all we need to do is pass an encoding when instantiating the exporter in the pipeline:

exporter = MyJsonExporter(fi, encoding='utf-8')

After that, Chinese text is exported as readable characters.

The same applies to `json.dumps()`: pass `ensure_ascii=False`.
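A minimal sketch of the `json` side of the fix, using only the standard library. The `item` dict and the temporary file path are assumptions made for this illustration.

```python
import json
import os
import tempfile

# Hypothetical item containing Chinese text.
item = {"title": "中文"}

# ensure_ascii=False keeps non-ASCII characters as they are.
readable = json.dumps(item, ensure_ascii=False)

# When writing to disk, open the file with an explicit encoding so the
# result does not depend on the platform's default encoding.
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(item, f, ensure_ascii=False)

# Read the file back to confirm the characters were stored unescaped.
with open(path, encoding="utf-8") as f:
    on_disk = f.read()

print(on_disk == readable)  # True
```

Both `readable` and the file contents are `{"title": "中文"}`.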
