TrBlog

Python str and bytes issues

Posted on 2019-07-18 Edited on 2022-01-19 In Python

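Python character encoding issues

I have been reviewing Python recently and ran into encoding problems again, so I decided to understand them properly and write down my understanding.

What is an encoding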

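Encodings exist because a computer can only store 0s and 1s. In the early days, when only English letters mattered, one byte was enough to represent every character. That encoding is ASCII: each character maps through the table to a number, and that number is stored on disk in binary. For example, the ASCII code of the letter B is 66, which is stored as the binary value 01000010. If those bytes are later read back with a single-byte code table that is not compatible with ASCII, the reader cannot interpret them correctly and the result is garbled text (mojibake).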
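As a quick check of the letter-to-number mapping described above, here is a minimal Python 3 sketch. The variable names are only illustrative.

```python
# Look up the ASCII code of 'B' and show it as an 8-bit binary value.
letter = 'B'
code = ord(letter)             # numeric code of the character
bits = format(code, '08b')     # zero-padded 8-bit binary string
print(code, bits)

# The single byte produced by encoding is that same number.
stored = letter.encode('ascii')
print(stored[0] == code)
```

Encoding with 'ascii' only works for the 128 characters in the table; anything else raises UnicodeEncodeError.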

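Later, as computing spread, ASCII was no longer enough and many code tables appeared. One standard among them is Unicode, which assigns every character a number called a code point. Unicode is a character set, not a physical storage format: its first 128 code points match ASCII, but a code point still has to be turned into bytes by an encoding, ideally one that stays compatible with how ASCII text is stored. UTF-8 is such an encoding of Unicode. It stores each code point in a variable number of bytes: ASCII characters take one byte (so ASCII text is also valid UTF-8), and most Chinese characters take three. In memory, narrow builds of Python 2 store each code point in 2 bytes, while Python 3.3 and later use 1, 2, or 4 bytes per character depending on the string.

To verify this in Python: the variable a holds a Chinese character written as a Unicode escape. In Python 3 the escape is displayed as the character itself, and the type is str. Encoding the str as UTF-8 gives three bytes (24 bits). Decoding those bytes as GBK then fails, because GBK is not compatible with UTF-8 and cannot interpret this byte sequence.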

>>> a = '\u8bf7'
>>> a
'请'
>>> a.encode('utf-8')
b'\xe8\xaf\xb7'
>>> a= a.encode('utf-8')
>>> a.decode('gbk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xb7 in position 2: incomplete multibyte sequence
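To make the variable-length point concrete, the following Python 3 sketch compares how many bytes the same characters take in UTF-8 and in GBK, while their code points stay the same. The variable names are illustrative.

```python
chars = ('B', '请')

# Code points belong to Unicode itself and do not depend on any encoding.
code_points = {ch: hex(ord(ch)) for ch in chars}

# The number of bytes depends on the encoding that is chosen.
utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in chars}
gbk_lengths = {ch: len(ch.encode('gbk')) for ch in chars}

print(code_points)
print(utf8_lengths)
print(gbk_lengths)
```

Both encodings keep 'B' as a single ASCII-compatible byte, but they disagree on the bytes for '请', which is why bytes written with one cannot be read back with the other.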

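The difference between unicode and str in Python

In Python 2.7, str holds raw bytes; in Python 3, str holds Unicode characters (Python 3 merged str and unicode and dropped the separate unicode type). In 2.7, print writes the bytes of a str to the terminal as they are, and the terminal interprets them using its own encoding.

In Python 3, str is Unicode text and must be encoded before it is transmitted:

str ==> encode: bytes

bytes ==> decode: str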
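
The two directions can be tried out with a short Python 3 sketch. It also shows that Python 3 refuses to mix the two types implicitly. The variable names are illustrative.

```python
text = '请'                          # str: Unicode text
data = text.encode('utf-8')         # str -> bytes
restored = data.decode('utf-8')     # bytes -> str

print(type(text).__name__, type(data).__name__, restored == text)

# Concatenating str and bytes is an error instead of a silent conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```
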
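In Python 2, unicode is the text string type and str is the byte string type! You can see the bytes of a str directly, but not the in-memory representation of a unicode object. Converting between unicode and str objects requires encode and decode. A unicode object holds code points rather than encoded bytes, so decoding the bytes of a str with the right encoding yields the unicode characters.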
>>> a = '唐'
>>> a
'\xe5\x94\x90'
>>> a.decode('utf-8')
u'\u5510'
>>>
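Python 2 let byte strings and text strings be mixed as if they were the same thing. This was a design mistake, and Python 3 removed it!

Python 2 handles the relationship between the two automatically. For example, when a unicode object is concatenated with a str object, Python implicitly converts the str to unicode, but it decodes with the default codec, ASCII. Hardly anyone uses pure ASCII these days, so this fails easily, which is very annoying.

For example: a is a str object and b is a unicode object, and the two are concatenated. The str holds bytes and has to be converted to unicode, so Python implicitly runs str.decode('ascii'). The actual bytes are UTF-8, so the decode fails.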
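
Python 3 no longer performs this implicit conversion, but the failure can be reproduced by doing by hand what Python 2 did behind the scenes. The following is a Python 3 sketch of that idea, not Python 2 code; the variable names are illustrative.

```python
a = '唐'.encode('utf-8')   # plays the role of the Python 2 str (bytes)
b = '朝'                   # plays the role of the Python 2 unicode object

try:
    # Python 2 effectively ran a.decode('ascii') before concatenating.
    result = a.decode('ascii') + b
except UnicodeDecodeError as exc:
    result = None
    print('ASCII decode failed:', exc.reason)

# Decoding explicitly with the encoding that produced the bytes works.
fixed = a.decode('utf-8') + b
print(fixed)
```
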

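(2.7) A str is a sequence of encoded bytes in Python's memory, and the encoding depends on the environment: for text typed into the interpreter it is the terminal's encoding, and for a source file it is the encoding the file was saved in. When a str is printed, Python passes these bytes through and the terminal decodes them into characters for display.

A unicode object is a sequence of characters from the Unicode character set, not encoded bytes (in memory it is essentially the code-point numbers, 2 bytes each on narrow Python 2 builds). To store a unicode object on disk you must choose an encoding yourself, such as UTF-8 (which uses three bytes for most Chinese characters). If you choose an encoding such as GBK, any character outside GBK's repertoire cannot be written, because GBK covers only part of Unicode.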
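
A small Python 3 sketch of what "cannot be written" means in practice: a character GBK knows encodes fine, while one outside its repertoire raises an error. The sample characters and variable names are illustrative.

```python
inside = '唐'        # a character GBK can represent
outside = '😀'       # an emoji, which GBK has no code for

gbk_bytes = inside.encode('gbk')
print(len(gbk_bytes))

try:
    outside.encode('gbk')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# UTF-8 has a byte sequence for this character.
utf8_bytes = outside.encode('utf-8')
print(len(utf8_bytes))
```
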

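However, Python gives no direct view of a unicode object's in-memory bytes. (Note that in Python 3, what Python 2 called unicode is simply the str type.)

(2.7) An example: suppose '唐' is a str and the terminal's encoding is GBK. After a = '唐', a holds the GBK bytes in memory. When a is printed again, the bytes are read from memory and displayed according to the terminal's code table. If the terminal was UTF-8 when the value was written and is switched to GBK before printing, the output should be garbled.

Verifying the garbled output in Python: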

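>>> a = '唐'
>>> a  # in Python 2.7, evaluating a shows the bytes it refers to; use print to see the character
'\xe5\x94\x90' # the bytes in memory; the terminal is UTF-8, so three bytes (24 bits)
>>> type(a)
<type 'str'>
>>> print(a)
唐
# at this point I switched the terminal's encoding to GBK
>>> print(a)
鍞�
# garbled, because GBK cannot make sense of these 24 bits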
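
The same effect can be reproduced in Python 3 without touching the terminal settings, by decoding UTF-8 bytes as GBK. This is a sketch with illustrative names; errors='replace' is used so that undecodable bytes become U+FFFD instead of raising.

```python
text = '唐'
raw = text.encode('utf-8')          # the bytes a UTF-8 terminal would have stored

# Interpret the same bytes with the wrong code table.
garbled = raw.decode('gbk', errors='replace')
print(repr(garbled))

# Decoding with the encoding that produced the bytes restores the text.
print(raw.decode('utf-8') == text)
```
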

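Things to watch out for in Python 2 (based on Stack Overflow)

All text strings should be of type unicode, not str. If you are handling text and the variable's type is str, that is a bug! (Text in memory should be unicode; encoded bytes belong on disk.)
To decode a byte string into text, use the correct codec: var.decode(encoding) (for example, var.decode('utf-8')). To encode a text string into bytes, use var.encode(encoding).
Never call str() on a unicode string (str() converts it to bytes, with an encoding that depends on the environment), and never call unicode() on a byte string without specifying the encoding.
When the application reads data from outside (disk, network), treat it as a byte string, i.e. type str, then call .decode() to interpret it as text. Likewise, always call .encode() on text before sending it out.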
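
The last rule (decode on the way in, encode on the way out) carries over to Python 3 unchanged. Below is a minimal Python 3 sketch of it; the helper name shout and the temporary file are made up for illustration.

```python
import os
import tempfile

def shout(raw, encoding='utf-8'):
    """Decode at the input boundary, work on text, encode at the output boundary."""
    text = raw.decode(encoding)       # bytes -> text as soon as data comes in
    text = text.upper() + '!'         # all processing happens on text
    return text.encode(encoding)      # text -> bytes just before it goes out

# Round trip through a real file opened in binary mode.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wb') as f:
    f.write('唐 tang'.encode('utf-8'))
with open(path, 'rb') as f:
    incoming = f.read()

outgoing = shout(incoming)
print(outgoing.decode('utf-8'))
```

Keeping the encoding as an explicit parameter avoids relying on whatever default the environment happens to have.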

vpn

Posted on 2019-07-12 Edited on 2022-01-19 In Vpn

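VPN configuration

Supported environments
Supported systems: CentOS 6+, Debian 7+, Ubuntu 12+
Memory requirement: ≥128M

About this script
1. One-click installation of the Shadowsocks-Python, ShadowsocksR, Shadowsocks-Go, or Shadowsocks-libev server (choose one of the four);
2. The startup scripts and configuration file names of the different versions no longer collide;
3. Each run installs one version;
4. Run it multiple times to install multiple versions, which can coexist (make sure each uses a different port);
5. If multiple versions are installed, uninstalling also takes multiple runs (one version per run);

Default configuration
Server port: set it yourself (if not set, a random port between 9000 and 19999 is generated)
Password: set it yourself (if not set, the default is teddysun.com)
Encryption method: set it yourself (if not set, the Python and libev versions default to aes-256-gcm, and the R and Go versions default to aes-256-cfb)
Protocol: set it yourself (if not set, the default is origin) (ShadowsocksR only)
Obfuscation (obfs): set it yourself (if not set, the default is plain) (ShadowsocksR only)
Note: the script creates a single-user configuration file by default. To configure multiple users, edit the corresponding configuration file by hand and restart.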

Client downloads
Standard Windows client
https://github.com/shadowsocks/shadowsocks-windows/releases

ShadowsocksR Windows client
https://github.com/shadowsocksrr/shadowsocksr-csharp/releases

Usage
Log in as root and run the following commands in order:

wget --no-check-certificate -O shadowsocks-all.sh https://raw.githubusercontent.com/teddysun/shadowsocks_install/master/shadowsocks-all.sh
chmod +x shadowsocks-all.sh
./shadowsocks-all.sh 2>&1 | tee shadowsocks-all.log

After installation completes, the script prints the following

Congratulations, your_shadowsocks_version install completed!
Your Server IP :your_server_ip
Your Server Port :your_server_port
Your Password :your_password
Your Encryption Method:your_encryption_method

Your QR Code: (For Shadowsocks Windows, OSX, Android and iOS clients)
ss://your_encryption_method:your_password@your_server_ip:your_server_port
Your QR Code has been saved as a PNG file path:
your_path.png

Welcome to visit:https://teddysun.com/486.html
Enjoy it!

Uninstalling
If multiple versions are installed, uninstalling also takes multiple runs (one version per run)
Log in as root and run the following command:

./shadowsocks-all.sh uninstall

Startup scripts
The arguments after the startup script mean, from left to right: start, stop, restart, and show status.

Shadowsocks-Python version:
/etc/init.d/shadowsocks-python start | stop | restart | status

ShadowsocksR version:
/etc/init.d/shadowsocks-r start | stop | restart | status

Shadowsocks-Go version:
/etc/init.d/shadowsocks-go start | stop | restart | status

Shadowsocks-libev version:
/etc/init.d/shadowsocks-libev start | stop | restart | status

Default configuration file for each version
Shadowsocks-Python version:
/etc/shadowsocks-python/config.json

ShadowsocksR version:
/etc/shadowsocks-r/config.json

Shadowsocks-Go version:
/etc/shadowsocks-go/config.json

Shadowsocks-libev version:
/etc/shadowsocks-libev/config.json
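
One way to check what ended up in a configuration file is to load the JSON and print the fields of interest. The sketch below is only an illustration: it assumes the file uses the keys server_port and method, the sample content is made up rather than taken from the script's real output, and summarize_config is a hypothetical helper.

```python
import json
import os
import tempfile

def summarize_config(path):
    """Hypothetical helper: return (port, method) from a JSON config file."""
    with open(path, 'r', encoding='utf-8') as f:
        config = json.load(f)
    # .get() avoids a KeyError if a key is absent from a given file
    return config.get('server_port'), config.get('method')

# Illustrative sample written to a temporary file, not a real server config.
sample = {'server': '0.0.0.0', 'server_port': 9000,
          'password': 'your_password', 'method': 'aes-256-gcm'}
sample_path = os.path.join(tempfile.mkdtemp(), 'config.json')
with open(sample_path, 'w', encoding='utf-8') as f:
    json.dump(sample, f)

port, method = summarize_config(sample_path)
print(port, method)
```

On a real server the path would be one of the files listed above, and reading it normally requires root.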

BBR plus

wget "https://github.com/chiakge/Linux-NetSpeed/raw/master/tcp.sh" && chmod +x tcp.sh && ./tcp.sh

BBR

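Using the BBR accelerator
Speed up your connection and make it fly! Use the BBR acceleration tool.

Download the BBR script
wget --no-check-certificate https://github.com/teddysun/across/raw/master/bbr.sh
Make the script executable
chmod +x bbr.sh
Start the BBR installation
./bbr.sh
Next, press any key to begin the installation and wait a while. Shortly after it finishes, the script asks whether to reboot the VPS; enter y to confirm the reboot.

After the reboot, run lsmod | grep bbr. If tcp_bbr appears, BBR is running.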
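
lsmod shows that the module is loaded. On Linux the kernel also reports the congestion control algorithm in use through /proc/sys/net/ipv4/tcp_congestion_control, which can be read from Python. This is a hedged sketch; the function name is illustrative and it simply returns None where the file is not available.

```python
def current_congestion_control(path='/proc/sys/net/ipv4/tcp_congestion_control'):
    """Return the active TCP congestion control algorithm, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

algorithm = current_congestion_control()
if algorithm == 'bbr':
    print('bbr is active')
else:
    print('bbr is not active, current value: {}'.format(algorithm))
```
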

Then visit YouTube again: 1080p HD plays smoothly without stuttering!

Very simple thread management in Python

Posted on 2019-07-10 Edited on 2022-01-19 In Python

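Problem

A multi-threaded crawler with a web backend: calling an API and passing JSON starts a crawl on its own thread. Because a crawl runs for a long time, an API to cancel a thread is needed. (Each search topic corresponds to one thread.)

Code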

import threading
import time


def thread(d_input, pool, thread_id):
    """Worker: runs up to six rounds and stops early once it is cancelled."""
    for i in range(6):
        # -1 is the cancel flag; returning ends only this thread
        if pool.pool[thread_id] == -1:
            return
        print('thread run ' + str(thread_id) + '****' + str(d_input))
        time.sleep(2)


class ThreadPool(object):
    # class-level state, shared by every ThreadPool instance
    _pool = {}
    _count = 0

    @property
    def pool(self):
        return ThreadPool._pool

    @pool.setter
    def pool(self, pool):
        ThreadPool._pool = pool

    @property
    def count(self):
        return ThreadPool._count

    @count.setter
    def count(self, count):
        ThreadPool._count = count


def mainThread():
    # prepare the thread pool
    pool = ThreadPool()
    # start up ten threads
    for i in range(10):
        # create a timer thread and pass the input, the pool object and the thread id
        timer = threading.Timer(1, thread, [i, pool, pool.count])
        # register the thread id as running (1)
        pool.pool[pool.count] = 1
        pool.count += 1
        timer.start()

    # let the threads run for a while
    time.sleep(4)
    # cancel the first six threads
    for thread_id in range(6):
        pool.pool[thread_id] = -1


mainThread()
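
As an alternative to the flag dictionary above, the standard library's threading.Event can carry the cancel signal. This is a different technique from the one in the post, sketched here under assumptions: the names worker, stop_events and log are illustrative, and the short waits stand in for real crawling work.

```python
import threading
import time

def worker(topic, stop_event, log):
    """Runs up to six rounds; Event.wait doubles as an interruptible sleep."""
    for _ in range(6):
        if stop_event.is_set():
            return
        log.append(topic)
        stop_event.wait(0.2)    # returns early as soon as the event is set

stop_events = {}   # topic -> Event, one entry per running thread
threads = {}
log = []

for topic in range(3):
    stop_events[topic] = threading.Event()
    threads[topic] = threading.Thread(
        target=worker, args=(topic, stop_events[topic], log))
    threads[topic].start()

time.sleep(0.3)
stop_events[0].set()            # cancel only the thread for topic 0
for t in threads.values():
    t.join()

print(log.count(0), log.count(1), log.count(2))
```

Because wait() returns as soon as the event is set, a cancelled thread does not have to finish its current sleep before it notices the request.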

A question about the setter decorator in Python classes

Posted on 2019-07-10 Edited on 2022-01-19 In Python

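Overview

This problem came up while setting up the crawler's thread pool. I defined a thread pool class and declared a variable inside the class but outside any method. By Python's rules, this is a class attribute (a static variable) that belongs to the class rather than to any single instance.

The variable is a dict ({}), and I gave it a setter and a getter. The property decorator is used so that instances can access the class variable as well. The class code is as follows: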

import threading
import time


def thread(input):
    for i in range(6):
        print('hello+'+str(input))
        time.sleep(2)
        if i == 4:
            print('stop')


class ThreadPool(object):
    # class attribute shared by all instances
    _pool = {}

    @property
    def pool(self):
        return ThreadPool._pool

    @pool.setter
    def pool(self, pool):
        ThreadPool._pool = pool

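Problem

Here is how it is used. Why can the third line add an entry directly through the dict's item assignment? It looks as if the variable is accessed directly rather than through the setter and getter. Still unresolved.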

tp = ThreadPool()
timer = threading.Timer(1, thread, [1])
tp.pool[timer.name]=timer
print(tp.pool)
timer.start()
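
The question above has a short answer: tp.pool[key] = value never assigns to the pool attribute. Python first evaluates tp.pool, which calls the getter and returns the shared dict, and then calls __setitem__ on that dict. The setter only runs for an assignment to the attribute itself, such as tp.pool = {...}. The sketch below makes this visible by logging accessor calls; the calls list and the class name LoggedPool are illustrative additions, not part of the original code.

```python
calls = []   # records which accessor ran

class LoggedPool(object):
    _pool = {}

    @property
    def pool(self):
        calls.append('getter')
        return LoggedPool._pool

    @pool.setter
    def pool(self, pool):
        calls.append('setter')
        LoggedPool._pool = pool

tp = LoggedPool()
tp.pool['a'] = 1        # getter, then dict.__setitem__ on the returned dict
after_item_assignment = list(calls)

tp.pool = {'b': 2}      # attribute assignment: this is what triggers the setter
after_attribute_assignment = list(calls)

print(after_item_assignment, after_attribute_assignment)
```

If item assignment should be controlled as well, the getter would have to return a copy or a read-only view instead of the dict itself.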

Configuring YouCompleteMe on Linux

Posted on 2019-06-25 Edited on 2022-01-19

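Tinkering notes

This is a Vim plugin for Linux that provides automatic code completion.

Set up Vundle, the Vim plugin manager

Run the following command to fetch Vundle: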

git clone https://github.com/gmarik/Vundle.vim.git ~/.vim/bundle/Vundle.vim

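Configure ~/.vimrc (better not to touch /etc/vimrc) and reopen Vim afterwards. The configuration below includes two plugins, vim-autoformat and NERDTree; these two plugins can be written up later.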


" vundle
set nocompatible              " required
filetype off                  " required

" set the runtime path to include Vundle and initialize
set rtp+=~/.vim/bundle/Vundle.vim
call vundle#begin()

" alternatively, pass a path where Vundle should install plugins
"call vundle#begin('~/some/path/here')

" let Vundle manage Vundle, required
Plugin 'gmarik/Vundle.vim'

" Add all your plugins here (note older versions of Vundle used Bundle instead of Plugin)
Plugin 'Chiel92/vim-autoformat'
nnoremap <F6> :Autoformat<CR>
let g:autoformat_autoindent = 0 
let g:autoformat_retab = 0 
let g:autoformat_remove_trailing_spaces = 0 
                           
Plugin 'https://github.com/scrooloose/nerdtree'
nnoremap <F3> :NERDTreeToggle<CR>

" All of your Plugins must be added before the following line
call vundle#end()            " required
filetype plugin indent on    " required

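After adding the plugins, reopen Vim and run :PluginInstall to install those two plugins.

Installing and configuring YCM

Ensure that your version of Vim is at least 7.4.1578 and that it has support for Python 2 or Python 3 scripting. Check the version with vim --version.
Download YCM itself. There are two ways to install it:

a. Install with Vundle (recommended): add Plugin 'Valloric/YouCompleteMe' to the plugin section of your vimrc, then run :PluginInstall.

b. Or, inside ~/.vim/bundle/, run

git clone https://github.com/ycm-core/YouCompleteMe.git

After the download, run git submodule update --init --recursive
Install YCM (when building with C-family language support):
1. Install cmake and the Python headers, typically packaged as python-dev or python3-dev, although Arch Linux does not seem to have those packages
Then, to support C, we need to compile ycm_core. First run the following commands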
```
   cd ~
   mkdir ycm_build
   cd ycm_build
```
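Then download the latest libclang or clangd (the support files for the build). Arch Linux does not seem to package libclang, so I downloaded a prebuilt clangd from llvm.org.
After downloading, put the files in ~/ycm_temp/llvm_root_dir (create it if it does not exist)
Generate the build configuration (the system is Linux, so choose the Unix Makefiles generator; cmake and make must already be installed):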
```
  cmake -G "Unix Makefiles" -DPATH_TO_LLVM_ROOT=~/ycm_temp/llvm_root_dir . ~/.vim/bundle/YouCompleteMe/third_party/ycmd/cpp
  
```
When this finishes, the build files have been generated in the current directory
Build ycm_core itself
```
    cmake --build . --target ycm_core --config Release
```
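The official documentation says --config Release is specific to Windows and can be omitted

Note: after cmake has generated the build files, do not move the directory they are in, or the build step will fail to find it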

git clone need sudo: socks4 error

Posted on 2019-06-25 Edited on 2022-01-19 In Linux

git clone does not work when a proxy is enabled

Cause: git was using SOCKS4 as the proxy protocol for HTTPS, whereas Shadowsocks uses SOCKS5

Fix:

git config --global http.proxy 'socks5://127.0.0.1:1080'

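Text sentiment analysis

Posted on 2019-06-12 Edited on 2022-01-19 In DataAnalysis

Sentiment analysis

wordcloud: word cloud analysis for big data

Posted on 2019-06-12 Edited on 2022-01-19 In DataAnalysis

Word cloud

With Anaconda and Jupyter Notebook set up, Python is used in that environment as the analysis tool.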

jupyter_notebook

Posted on 2019-06-12 Edited on 2022-01-19 In DataAnalysis

jupyter_notebook

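Description: a tool that keeps code and documentation together

Anaconda manager

Posted on 2019-06-12 Edited on 2022-01-19 In DataAnalysis

Anaconda manager

Description: a tool needed when doing big data work in Python; once configured, conda is available.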