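I want to scrape some Weibo data. When I am not logged in, the page can still be browsed, but the post body is embedded in a script tag as escaped text. How do I convert it to get the correct Chinese?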
<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_content_weiboDetail","js":[],"css":["style\/css\/module\/global\/person_info_big.css?version=f05544dd408986d7","style\/css\/module\/list\/feed.css?version=f05544dd408986d7","style\/css\/module\/forms\/feed_repeat.css?version=f05544dd408986d7","style\/css\/module\/list\/comment_list.css?version=f05544dd408986d7","style\/css\/module\/global\/pages.css?version=f05544dd408986d7","style\/css\/module\/tab\/tab_c.css?version=f05544dd408986d7","style\/css\/module\/layer\/layer_faces.css?version=f05544dd408986d7","style\/css\/module\/layer\/layer_addfavor_tags.css?version=f05544dd408986d7","style\/css\/module\/layer\/layer_forward.css?version=f05544dd408986d7","style\/css\/module\/layer\/layer_menu_list.css?version=f05544dd408986d7","style\/css\/module\/tab\/tab_b.css?version=f05544dd408986d7","style\/css\/module\/layer\/layer_send_pic.css?version=f05544dd408986d7"],"html":"<div node-type=\"weibo_info\" class=\"feed_lists W_linka W_texta\" action-data=\"&ispower=1\">\n\t<dl class=\"feed_list clearfix feed_list_hover W_no_border\" mid=\"3502493067895526\" >\n\t<dd class=\"content\">\n\t<p><em nick-name=\"\u5415\u6674\u6625-\">\u6709\u505a\u5efa\u7b51\u8bbe\u8ba1\u65b9\u9762\u7684\u670b\u53cb\u4e48\uff1f\u8bf7\u5404\u4f4d\u4eb2\u4eec\u5e2e\u624b\u4ecb\u7ecd\u4ecb\u7ecd\uff0c<img src=\"http:\/\/img.t.sinajs.cn\/t35\/style\/images\/common\/face\/ext\/normal\/c3\/zy_org.gif\" title=\"[\u6324\u773c]\" alt=\"[\u6324\u773c]\" type=\"face\" \/><\/em><\/p>\n<!--pic-->\n\t<!--\/pic-->\n<!--retweeted-->\n\t<!--\/retweeted-->\n\t<div class=\"wTablist W_linkb W_textb\" node-type =\"feed_list_tagList\" style=\"display:none;\">\u6807\u7b7e\uff1a\n\t \t \t \n\t <\/div>\n\t \t\t<p class=\"info W_linkb W_textb\">\n\t\t<span>\t\n\t\t\t <a href=\"javascript:void(0);\" action-type=\"login\">\u8f6c\u53d1(10)<\/a><i class=\"W_vline\">|<\/i>\n\t\t\t<em class=\"hover\">\n\t\t\t<a href=\"javascript:void(0);\" 
action-type=\"login\">\u6536\u85cf<\/a><i class=\"W_vline\">|<\/i>\n\t<\/em>\t\n\t<a href=\"javascript:void(0);\" action-type=\"login\">\u8bc4\u8bba(5)<\/a>\t\t\n\t\t<\/span>\n\t\t\u4eca\u5929 14:48 \u6765\u81ea<a target=\"_blank\" href=\"http:\/\/se.360.cn\/?fromweibo\" rel=\"nofollow\">360\u5b89\u5168\u6d4f\u89c8\u5668<\/a>\n\t\t\t<em class=\"hover\"><i class=\"W_vline\">|<\/i><a href=\"javascript:void(0);\" action-type=\"login\">\u4e3e\u62a5<\/a><\/em>\n\t\t<\/p>\n\t<\/dd>\n\t<\/dl>\t\n<\/div>\n<div class=\"unlogin_vip_info\">\n<span class=\"icon_warn\"><\/span>\n<a target=\"_blank\" href=\"http:\/\/weibo.com\/signup\/signup.php?inviteCode=1832604244&entry=weiyonghu\" >\u5feb\u901f\u5f00\u901a\u5fae\u535a<\/a>\u4f60\u53ef\u4ee5\u67e5\u770b\u66f4\u591a\u5185\u5bb9\uff0c\u8fd8\u53ef\u4ee5\u8bc4\u8bba\u3001\u8f6c\u53d1\u5fae\u535a\u3002\n<\/div>\n"})</script>
The part I need to extract is inside content, as follows:
<dd class=\"content\">\n\t<p><em nick-name=\"\u5415\u6674\u6625-\">\u6709\u505a\u5efa\u7b51\u8bbe\u8ba1\u65b9\u9762\u7684\u670b\u53cb\u4e48\uff1f\u8bf7\u5404\u4f4d\u4eb2\u4eec\u5e2e\u624b\u4ecb\u7ecd\u4ecb\u7ecd\uff0c<img src=\"http:\/\/img.t.sinajs.cn\/t35\/style\/images\/common\/face\/ext\/normal\/c3\/zy_org.gif\" title=\"[\u6324\u773c]\" alt=\"[\u6324\u773c]\" type=\"face\" \/><\/em>
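One possible way to get readable Chinese, sketched in Python under a few assumptions: the argument passed to `STK.pageletM.view(...)` looks like plain JSON, so the `\uXXXX` sequences are JSON string escapes and a JSON parser decodes them (along with `\/`, `\n`, `\t`, and `\"`) without any manual conversion. The names `extract_pagelet`, `PostTextParser`, and `SAMPLE` are hypothetical, and `SAMPLE` is a shortened stand-in for the page source above, not the full response. The regular expression also assumes the call ends with `})</script>`.

```python
import json
import re
from html.parser import HTMLParser

# Shortened stand-in for the fetched page source (assumption: same shape as the real page).
SAMPLE = r'''<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_content_weiboDetail","html":"<dd class=\"content\">\n\t<p><em nick-name=\"\u5415\u6674\u6625-\">\u6709\u505a\u5efa\u7b51\u8bbe\u8ba1<\/em><\/p>\n<\/dd>"})</script>'''

# Capture the JSON object passed to STK.pageletM.view(...).
VIEW_CALL = re.compile(r'STK\.pageletM\.view\((\{.*?\})\)\s*</script>', re.S)


def extract_pagelet(page_source):
    """Return the decoded pagelet dict from the first view(...) call."""
    match = VIEW_CALL.search(page_source)
    if match is None:
        raise ValueError("no STK.pageletM.view(...) call found")
    # json.loads turns \uXXXX escapes into real characters.
    return json.loads(match.group(1))


class PostTextParser(HTMLParser):
    """Collect the text inside the first <em nick-name="..."> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside the target <em>
        self.nick = None    # value of the nick-name attribute
        self.parts = []     # text fragments of the post body

    def handle_starttag(self, tag, attrs):
        if tag != "em":
            return
        attr = dict(attrs)
        if self.depth:
            self.depth += 1          # nested <em> inside the post
        elif "nick-name" in attr and self.nick is None:
            self.depth = 1
            self.nick = attr["nick-name"]

    def handle_endtag(self, tag):
        if tag == "em" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


pagelet = extract_pagelet(SAMPLE)
parser = PostTextParser()
parser.feed(pagelet["html"])
post_text = "".join(parser.parts).strip()

if __name__ == "__main__":
    print(parser.nick, post_text)
```

Parsing the decoded `html` value with an HTML parser, rather than searching the escaped source directly, avoids having to match `\"` and `<\/em>` by hand. Emoticon `<img>` tags are skipped here; their `title` attribute could be read in `handle_starttag` if the emoticon names are needed.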