简单的压缩算法
所谓压缩,就是读取数据并对其进行编码(修改格式)的操作,以便减少数据占用的空间
解压缩,是压缩的逆过程,即将数据恢复成原始数据
数据类型占用的二进制位数要比实际的数据多,所以我们可以考虑对数据进行压缩的办法,就是对存储它的数据类型进行压缩。
以DNA组成的基因为例,核苷酸一般的值为A,T,C,G这4个,
如果基因的序列使用字符串来存储,则,每个核苷酸占1个字符,即8个二进制。
但是发现如果只有4个值的话,我们完全可以用2个二进制来存储,即00,01,10,11,4个来存储序列,这样原本需要8个二进制位来存储,现在只需要2个二进制位,相当于每个字符串的存储空间可以减少75%。
然而,Python中不包含可处理任意长度位串的现成结构体。因此,
from __future__ import absolute_import
from __future__ import unicode_literals
class CompressedGene:
def __init__(self, gene: str) -> None:
self._compress(gene)
def _compress(self, gene: str) -> None:
self.bit_string: int = 1
for nucleotide in gene.upper():
self.bit_string <<= 2
if nucleotide == "A":
self.bit_string |= 0b00
elif nucleotide == "T":
self.bit_string |= 0b01
elif nucleotide == "C":
self.bit_string |= 0b10
elif nucleotide == "G":
self.bit_string |= 0b11
else:
raise ValueError('Invalid Nucleotide: {}'.format(nucleotide))
def decompress(self) -> str:
gene: str = ""
for i in range(0, self.bit_string.bit_length() - 1, 2):
bits: int = self.bit_string >> i & 0b11
if bits == 0b00:
gene += "A"
elif bits == 0b01:
gene += "T"
elif bits == 0b10:
gene += "C"
elif bits == 0b11:
gene += "G"
else:
raise ValueError("Invalid bits:{}".format(bits))
return gene[::-1] # 技巧:切片反转打印
def __str__(self):
return self.decompress()
if __name__ == '__main__':
from sys import getsizeof
original: str = "TAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATA" * 100
print("original is {} bytes".format(getsizeof(original)))
compressed: CompressedGene = CompressedGene(original)
print("compressed is {} bytes".format(getsizeof(compressed.bit_string)))
print(compressed)
print("original and decompressed are the same: {}".format(original == compressed.decompress()))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# 练习
编写int封装类,以使其能通用地当作位序列来使用
编辑 (opens new window)
上次更新: 2022/05/25, 16:53:28