Memory usage of some Python types

I ran into a memory usage problem recently: a JSON decoder allocated roughly 10 times as much memory as the total size of the input data. (I use cjson.) This drove me to investigate the per-object overhead, and I ended up with the following memory usage table:

(This table was generated on a 32-bit Debian GNU/Linux machine running Python 2.7.)

Each entry below lists: the type, the parameter it is measured against, sys.getsizeof, the memory usage (cumulative average over many small instances), and whether the memory usage of a single large instance is approximately sys.getsizeof.

int (parameter: value)
  • sys.getsizeof: 12 (value < 2^31); 18, 20, … (for larger and larger values)
  • memory usage: 12; 24, …
  • single large instance ~ sys.getsizeof: yes

float (parameter: value)
  • sys.getsizeof: 16
  • memory usage: 16
  • single large instance ~ sys.getsizeof: yes

str (parameter: len as i)
  • sys.getsizeof: 21 + i (i <= 1E8)
  • memory usage: 0 (i = 0:1); 24 (i = 2:3); 32 (i = 4:11); 40 (i = 12:19), …
  • single large instance ~ sys.getsizeof: yes

[] (parameter: len as i)
  • sys.getsizeof: 32 + 4i (i <= 1E8)
  • memory usage: 32 (i = 0); 48 (i = 1:3); 32 + (i / 2 + 1) * 8 (i = 4:100)
  • single large instance ~ sys.getsizeof: yes

() (parameter: len as i)
  • sys.getsizeof: 24 + 4i (i <= 1E8)
  • memory usage: 0 (i = 0); 41 (i = 1:3); 24 + (i / 2 + 1) * 8 (i = 4:100)
  • single large instance ~ sys.getsizeof: yes

dict (parameter: len as i)
  • sys.getsizeof: 136 (0 <= i <= 5, 5 = (101)b); 520 (i <= 21, 21 = (10101)b); 1672 (i <= 85, (1010101)b); 6280 (i <= 341, (101010101)b); 1573000 (i <= 87381, (10…101)b); 3145864 (i <= 174762, (10…1010)b); 6291592 (i <= 349525, (10…10101)b); 12583048 (i <= 699050, (10…101010)b)
    summary: 136 + 24 * 4^k for i <= 1 + 4 + … + 4^k (5 < i <= 87381); 136 + 24 * 2^k for i <= 2^k + 2^(k-2) + … (87381 < i <= 699050)
  • memory usage: 143 (i = 0:5); 535 (i = 6:21); 1686 (i = 22:85); 6298 (i = 86)
  • single large instance ~ sys.getsizeof: yes

set (parameter: len as i)
  • sys.getsizeof: 112 (0 <= i <= 5, 5 = (101)b); 368 (i <= 21, 21 = (10101)b); 1136 (i <= 85, (1010101)b); 4208 (i <= 341, (101010101)b); 1048688 (i <= 87381, (10…101)b); 2097264 (i <= 174762, (10…1010)b)
    summary: 112 + 16 * 4^k for i <= 1 + 4 + … + 4^k (5 < i <= 87381); 112 + 16 * 2^k for i <= 2^k + 2^(k-2) + … (i > 87381)
  • memory usage: 115; 379; 1148; 4219
  • single large instance ~ sys.getsizeof: yes

namedtuple (parameter: len as i)
  • sys.getsizeof: same as ()
  • memory usage: 32 (i = 0:1); 25 + (i / 2 + 1) * 8 (i = 2:100)
  • single large instance ~ sys.getsizeof: yes (I guess the field dict is stored on the type, not on each instance)

old-style class instance, no method besides __init__, a variable number of members set from *args (parameter: number of members as i)
  • sys.getsizeof: 32
  • memory usage: 32 (i = 0); 176 (i = 1:5); 567 (i = 6:21); 1719 (i = 22:…)
    summary: ~= 32 + dict size if i > 0
  • single large instance ~ sys.getsizeof: no

new-style class instance (parameter: number of members as i)
  • sys.getsizeof: 28
  • memory usage: 176 (i = 0:5); 567 (i = 6:21); 1719 (i = 22:…)
    summary: ~= 32 + dict size
  • single large instance ~ sys.getsizeof: no
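
For what it's worth, the jump points in the sys.getsizeof column for dict can be reproduced with a small loop like the sketch below: add keys one at a time and print the lengths at which the reported size changes. The concrete numbers depend on the interpreter build (the ones above are from 32-bit Python 2.7), and sys.getsizeof only counts the container itself, not the keys and values it references.

from sys import getsizeof

d = {}
last = getsizeof(d)
print("0 %d" % last)
for i in xrange(1, 1000):
        d["s" + str(i)] = 1
        size = getsizeof(d)
        if size != last:
                # a jump here means the dict just resized at length i
                print("%d %d" % (i, size))
                last = size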

Remarks:

There are some zeros in the memory usage column. This is because “”, (), and some single-character strings are singletons (shared objects), so creating more of them costs no extra memory.

bool is not listed because its two values are singletons.
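
A quick way to see this caching behaviour (these are CPython 2 implementation details, not language guarantees) is to build the values at runtime and compare identity:

print(int("100") is int("100"))    # True: small ints are cached
print(int("300") is int("300"))    # False: larger ints are separate objects
print("".join([]) is "")           # True: the empty string is shared
print(chr(97) is chr(97))          # True: single-character strings are shared
print(tuple([]) is ())             # True: the empty tuple is a singleton
print(bool(1) is True)             # True: bool has only two singleton values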

Coming back to my JSON decoding case: I have 650K lines of JSON to decode, 111 MB in total, which ends up using 1212 MB in Python. Each JSON object has 7 fields: on average 3.35 string values, 1 large int, 2 small ints, and 0.65 nested dicts; the nested dict contains about 3 keys and 3 str/int/bool/float values when present. The overhead of each JSON record is therefore (7 + 3.35 + 3 * 0.65) string overheads + 1 dict overhead (535) + 0.65 dict overhead (143) + 1 large-int overhead (11) + 2 small-int overheads (small ints may be singletons) + 3 str/int/bool/float overheads (assume 8) ~= 12.3 * 24.5 + 535 + 0.65 * 143 + 11 + 2 * 0 + 3 * 8 = 964.3 ~= 1 KB. This is still an underestimate, but it is clear that dict is the root of the problem.
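
Spelled out as a back-of-the-envelope calculation (the averages and overheads are the ones quoted above; the per-string overhead is taken as roughly 24.5 bytes):

# Rough per-record overhead estimate using the averages quoted above.
n_strings = 7 + 3.35 + 3 * 0.65     # field-name strings + string values + nested dict keys
str_overhead = 24.5                 # approximate overhead per small string
outer_dict = 535                    # dict with 7 keys (the 6:21 bucket)
inner_dict = 0.65 * 143             # nested dict with ~3 keys, present 65% of the time
large_int = 11
small_ints = 2 * 0                  # small ints are cached, effectively free
misc = 3 * 8                        # remaining str/int/bool/float values

total = n_strings * str_overhead + outer_dict + inner_dict + large_int + small_ints + misc
print(total)                        # ~964 bytes of overhead per decoded record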

In my case the fields of each JSON record are fixed, so I can use a namedtuple to save memory; classes won't do. I can also keep a dict of string values to deduplicate repeated strings in the JSON data, just like the singletons. A sketch of this follows.
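
A minimal sketch of that idea (the field names here are made up for illustration): decode each line with cjson as usual, then repack the resulting dict into a namedtuple and route every string value through a memo dict so repeated values share one object.

from collections import namedtuple

# Hypothetical field names; in practice they come from the fixed JSON schema.
Record = namedtuple("Record", ["name", "count", "extra"])

_string_memo = {}

def share(value):
        # Return one shared copy per distinct string value, mimicking singletons.
        if isinstance(value, basestring):
                return _string_memo.setdefault(value, value)
        return value

def to_record(obj):
        # obj is the plain dict produced by the JSON decoder.
        return Record(share(obj.get("name")),
                      share(obj.get("count")),
                      share(obj.get("extra")))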

P.S. The memory usage inspection code:

#!/usr/bin/env python

from sys import getsizeof
from resource import getrusage
import resource
import math

from collections import namedtuple

n = 1000000
l = [None] * n          # keep all n instances alive so the per-instance cost can be averaged

def makeint(i):
        return i

m = math.sqrt(2) * 1E15         # large offset so every float is a distinct, non-trivial value
def makefloat(i):
        return 0.0 + i + m

def makelist(i, k):
        return [1] * k

def maketuple(i, k):
        return (1,) * k

def makestr(i, k):
        return "".join(["1"] * k)

str_list = [("s" + str(i)) for i in xrange(174762)]     # precomputed unique keys / attribute names
def makedict(i, k):
        d = {}
        for j in xrange(k):
                d[str_list[j]] = "1"
        return d

NTuple = None
def makenamedtuple(i, k):
        global NTuple           # without this, assigning below makes NTuple local and the lookup fails
        p = str_list[:k]
        if NTuple is None:
                NTuple = namedtuple("NTuple", p)
        return NTuple(*p)

class NewClass(object):
        def __init__(self, *p):
                for i in xrange(len(p)):
                        self.__dict__[str_list[i]] = p[i]

def makenewclass(i, k):
        p = str_list[:k]
        return NewClass(*p)

class OldClass:
        def __init__(self, *p):
                for i in xrange(len(p)):
                        self.__dict__[str_list[i]] = p[i]

def makeoldclass(i, k):
        p = str_list[:k]
        return OldClass(*p)

def makeset(i, k):
        return set(str_list[:k])

def makebool(i):
        return True

# ru_maxrss is reported in KiB on Linux.
before = getrusage(resource.RUSAGE_SELF).ru_maxrss
i = 0
k = 1365                        # size parameter for the two-argument constructors
print(getsizeof(makebool(0)))
# Swap makebool for any other make* constructor above (passing k where needed).
while i < n:
        l[i] = makebool(i)
        i += 1
after = getrusage(resource.RUSAGE_SELF).ru_maxrss
print("before(KiB): %d, after(KiB): %d, average(B): %f" \
      % (before, after, (after - before) * 1024.0 / n))

OK, fine... I'll still add a bit in plain words. No disrespect meant; the serious tone above was really just me putting on an act...

I've been fiddling with Python lately and didn't expect its memory overhead to be this big: reading a 100-odd MB JSON file ate more than 1 GB of RAM, which my machine couldn't take. So I frantically searched for "python memory overhead" and found someone had posted a similar table, but it seemed incomplete and not quite right, so I spent an afternoon and an evening putting together my own. The root cause of the excessive memory consumption is dict. To save memory, don't use dict; use namedtuple, and classes aren't reliable either. And since my JSON has quite a few repeated string values, keep them in a dict and drop the duplicates outright.
