Golang's Memory Management


Related reading: Plain-Spoken Go Memory Management (Part 1): Memory Allocation Principles; Plain-Spoken Go Memory Management (Part 2): Demystifying Stack Memory Management.


Go's memory management is based on tcmalloc, which is a strong starting point, and the Go implementation adds many optimizations of its own. Below we walk through the source to see how Go's memory management works. The analysis is based on go1.8rc3, with go1.13 excerpts where noted.

1. Introduction to tcmalloc

For background on tcmalloc, see the linked introduction; the original paper is "TCMalloc: Thread-Caching Malloc".

2. Golang Memory Management

0. Background

First, a quick primer on Go's runtime scheduler. Go has three basic concepts: G, M and P.

  • G: the execution context of a goroutine.
  • M: an operating-system thread.
  • P: Processor. The key to scheduling; you can roughly think of it as a logical CPU.

Running a goroutine requires a G, a P and an M working together. That is enough for now; the details are left for a later article.

1. Escape analysis

In languages with manual memory management such as C/C++, variables allocated with malloc or new go on the heap. Go is different: even though Go also has new, the compiler performs escape analysis to decide where each variable should live. Here is a simple example.

package main

import ()

func foo() *int {
	var x int
	return &x
}
func bar() int {
	x := new(int)
	*x = 1
	return *x
}
func main() {}

Save the file above as escape.go and run:

$ go run -gcflags '-m -l' escape.go
./escape.go:6: moved to heap: x
./escape.go:7: &x escapes to heap
./escape.go:11: bar new(int) does not escape

With go1.13 the output looks like this:

go run -gcflags "-m -l" cmd.go

.\cmd.go:6:6: moved to heap: x
.\cmd.go:10:10: bar new(int) does not escape

This means x in foo() ends up allocated on the heap, while x in bar() ends up on the stack, even though it was created with new. The official FAQ on golang.org has a related entry: How do I know whether a variable is allocated on the heap or the stack?

From a correctness standpoint, you don’t need to know. Each variable in Go exists as long as there are references to it. The storage location chosen by the implementation is irrelevant to the semantics of the language.

The storage location does have an effect on writing efficient programs. When possible, the Go compilers will allocate variables that are local to a function in that function’s stack frame. However, if the compiler cannot prove that the variable is not referenced after the function returns, then the compiler must allocate the variable on the garbage-collected heap to avoid dangling pointer errors. Also, if a local variable is very large, it might make more sense to store it on the heap rather than the stack.

In the current compilers, if a variable has its address taken, that variable is a candidate for allocation on the heap. However, a basic escape analysis recognizes some cases when such variables will not live past the return from the function and can reside on the stack.

In short: for correctness you do not need to know. A Go variable lives as long as it is referenced, and where it is stored is an implementation decision, not part of the language semantics. It does matter for efficiency, though: when it can, the compiler keeps function-local variables in the function's stack frame, but if it cannot prove that a variable is unreferenced after the function returns, the variable must go on the garbage-collected heap; very large locals may also be placed on the heap. Taking a variable's address makes it a candidate for heap allocation, but escape analysis can still keep it on the stack when it provably does not outlive the function.

2. Key data structures

The key players:

  • mcache: a per-P cache; think of it as the local cache.
  • mcentral: a global cache; mcache refills from it when it runs short.
  • mheap: when mcentral runs short, it asks mheap, which in turn asks the operating system for memory.

You can view this as a multi-level allocator. The simplified allocation path is as follows (details later; a small sketch of the fallback chain follows the list):

  • Allocate from mcache first.
  • If mcache is short, refill it from mcentral.
  • If mcentral is short, refill it from mheap.
  • If mheap is short, request more memory from the operating system.
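To make the fallback chain concrete, here is a minimal, self-contained sketch. The level type and its numbers are invented purely for illustration (this is not the runtime's API); each tier serves requests from its own free count and refills from the tier below it on a miss.

```go
package main

import "fmt"

// Hypothetical stand-ins for mcache / mcentral / mheap: each level holds a
// count of free blocks and refills from the next level when it runs dry.
type level struct {
	name   string
	blocks int
	next   *level // nil means "ask the operating system"
}

func (l *level) alloc() string {
	if l.blocks == 0 {
		if l.next != nil {
			l.next.alloc() // refill path: pull from the next level
			l.blocks += 4  // pretend we received a batch of blocks
			fmt.Printf("%s refilled from %s\n", l.name, l.next.name)
		} else {
			l.blocks += 64 // pretend the OS mapped more pages
			fmt.Println(l.name, "grew from the OS")
		}
	}
	l.blocks--
	return l.name
}

func main() {
	heap := &level{name: "mheap"}
	central := &level{name: "mcentral", next: heap}
	cache := &level{name: "mcache", next: central}

	for i := 0; i < 6; i++ {
		fmt.Println("allocated from", cache.alloc())
	}
}
```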

2.1 mcache

Every goroutine runs bound to a P, and mcache is the per-P cache. The benefit is that allocating from it requires no locking. The mcache structure is shown below (runtime/mcache.go).

 1// Per-thread (in Go, per-P) cache for small objects.
 2// No locking needed because it is per-thread (per-P).
 3//
 4// mcaches are allocated from non-GC'd memory, so any heap pointers
 5// must be specially handled.
 6//
 7//go:notinheap
 8type mcache struct {
 9	// The following members are accessed on every malloc,
10	// so they are grouped here for better caching.
11	next_sample uintptr // trigger heap sample after allocating this many bytes
12	local_scan  uintptr // bytes of scannable heap allocated
13
14	// Allocator cache for tiny objects w/o pointers.
15	// See "Tiny allocator" comment in malloc.go.
16
17	// tiny points to the beginning of the current tiny block, or
18	// nil if there is no current tiny block.
19	//
20	// tiny is a heap pointer. Since mcache is in non-GC'd memory,
21	// we handle it by clearing it in releaseAll during mark
22	// termination.
23	tiny             uintptr
24	tinyoffset       uintptr
25	local_tinyallocs uintptr // number of tiny allocs not counted in other stats
26
27	// The rest is not accessed on every malloc.
28
29	alloc [numSpanClasses]*mspan // spans to allocate from, indexed by spanClass
30
31	stackcache [_NumStackOrders]stackfreelist
32
33	// Local allocator stats, flushed during GC.
34	local_largefree  uintptr                  // bytes freed for large objects (>maxsmallsize)
35	local_nlargefree uintptr                  // number of frees for large objects (>maxsmallsize)
36	local_nsmallfree [_NumSizeClasses]uintptr // number of frees for small objects (<=maxsmallsize)
37
38	// flushGen indicates the sweepgen during which this mcache
39	// was last flushed. If flushGen != mheap_.sweepgen, the spans
40	// in this mcache are stale and need to the flushed so they
41	// can be swept. This is done in acquirep.
42	flushGen uint32
43}

For now, focus on alloc [numSpanClasses]*mspan. This is an array of pointers to mspan, indexed by span class; there are _NumSizeClasses = 67 size classes (each with a scan and a noscan variant). Each element is an mspan carved into blocks of one particular size, and an allocation picks the element that matches the object's size. The 67 block sizes are 0, 8 bytes, 16 bytes, ..., which differs slightly from tcmalloc.

//file: sizeclasses.go
var class_to_size = [_NumSizeClasses]uint16{0, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304, 2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912, 8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432, 19072, 20480, 21760, 24576, 27264, 28672, 32768}

If you think about this for a moment there is a small puzzle: alloc looks like the freelist array of a classic memory pool, where each element would normally be a linked list of blocks of one size. Here every element is an mspan, so the mspan itself must record the block size it is carved into. Let's look at the mspan structure.
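As a quick illustration of how a requested size maps onto one of these block sizes, the sketch below does a simple linear search over an abbreviated copy of class_to_size. The real runtime avoids the search with the size_to_class8/size_to_class128 lookup tables discussed in section 4.3; the table and helper here are purely illustrative.

```go
package main

import "fmt"

// Abbreviated copy of class_to_size, for illustration only.
var classToSize = []uint16{0, 8, 16, 32, 48, 64, 80, 96, 112, 128}

// smallestClass returns the first size class whose block size fits n bytes.
func smallestClass(n uintptr) (class int, blockSize uint16) {
	for c := 1; c < len(classToSize); c++ {
		if uintptr(classToSize[c]) >= n {
			return c, classToSize[c]
		}
	}
	return -1, 0 // larger than this abbreviated table covers
}

func main() {
	for _, n := range []uintptr{1, 8, 17, 33, 100} {
		c, sz := smallestClass(n)
		fmt.Printf("request %3d bytes -> class %d (block %d bytes, waste %d)\n",
			n, c, sz, uintptr(sz)-n)
	}
}
```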

2.2 mspan

In tcmalloc, a span is the basic unit of memory management. Go's mspan looks like this (some fields omitted).

 1// runtime/mheap.go
 2//go:notinheap
 3type mspan struct {
 4	next *mspan     // next span in list, or nil if none
 5	prev *mspan     // previous span in list, or nil if none
 6	list *mSpanList // For debugging. TODO: Remove.
 7
 8	startAddr uintptr // address of first byte of span aka s.base()
 9	npages    uintptr // number of pages in span
10
11	manualFreeList gclinkptr // list of free objects in mSpanManual spans
12
13	// freeindex is the slot index between 0 and nelems at which to begin scanning
14	// for the next free object in this span.
15	// Each allocation scans allocBits starting at freeindex until it encounters a 0
16	// indicating a free object. freeindex is then adjusted so that subsequent scans begin
17	// just past the newly discovered free object.
18	//
19	// If freeindex == nelem, this span has no free objects.
20	//
21	// allocBits is a bitmap of objects in this span.
22	// If n >= freeindex and allocBits[n/8] & (1<<(n%8)) is 0
23	// then object n is free;
24	// otherwise, object n is allocated. Bits starting at nelem are
25	// undefined and should never be referenced.
26	//
27	// Object n starts at address n*elemsize + (start << pageShift).
28	freeindex uintptr
29	// TODO: Look up nelems from sizeclass and remove this field if it
30	// helps performance.
31	nelems uintptr // number of object in the span.
32
33	// Cache of the allocBits at freeindex. allocCache is shifted
34	// such that the lowest bit corresponds to the bit freeindex.
35	// allocCache holds the complement of allocBits, thus allowing
36	// ctz (count trailing zero) to use it directly.
37	// allocCache may contain bits beyond s.nelems; the caller must ignore
38	// these.
 39	allocCache uint64 // bitmap used to track free objects; 1 means available
40
41	// allocBits and gcmarkBits hold pointers to a span's mark and
42	// allocation bits. The pointers are 8 byte aligned.
43	// There are three arenas where this data is held.
44	// free: Dirty arenas that are no longer accessed
45	//       and can be reused.
46	// next: Holds information to be used in the next GC cycle.
47	// current: Information being used during this GC cycle.
48	// previous: Information being used during the last GC cycle.
49	// A new GC cycle starts with the call to finishsweep_m.
50	// finishsweep_m moves the previous arena to the free arena,
51	// the current arena to the previous arena, and
52	// the next arena to the current arena.
53	// The next arena is populated as the spans request
54	// memory to hold gcmarkBits for the next GC cycle as well
55	// as allocBits for newly allocated spans.
56	//
57	// The pointer arithmetic is done "by hand" instead of using
58	// arrays to avoid bounds checks along critical performance
59	// paths.
60	// The sweep will free the old allocBits and set allocBits to the
61	// gcmarkBits. The gcmarkBits are replaced with a fresh zeroed
62	// out memory.
63	allocBits  *gcBits
64	gcmarkBits *gcBits
65
66	// sweep generation:
67	// if sweepgen == h->sweepgen - 2, the span needs sweeping
68	// if sweepgen == h->sweepgen - 1, the span is currently being swept
69	// if sweepgen == h->sweepgen, the span is swept and ready to use
70	// if sweepgen == h->sweepgen + 1, the span was cached before sweep began and is still cached, and needs sweeping
71	// if sweepgen == h->sweepgen + 3, the span was swept and then cached and is still cached
72	// h->sweepgen is incremented by 2 after every GC
73
74	sweepgen    uint32
75	divMul      uint16     // for divide by elemsize - divMagic.mul
76	baseMask    uint16     // if non-0, elemsize is a power of 2, & this will get object allocation base
77	allocCount  uint16     // number of allocated objects
78	spanclass   spanClass  // size class and noscan (uint8)
79	state       mSpanState // mspaninuse etc
80	needzero    uint8      // needs to be zeroed before allocation
81	divShift    uint8      // for divide by elemsize - divMagic.shift
82	divShift2   uint8      // for divide by elemsize - divMagic.shift2
83	scavenged   bool       // whether this span has had its pages released to the OS
84	elemsize    uintptr    // computed from sizeclass or from npages
85	limit       uintptr    // end of data in span
86	speciallock mutex      // guards specials list
87	specials    *special   // linked list of special records sorted by offset.
88}

From this structure we can see (a small sketch of how freeindex and the allocation bitmap cooperate follows the list):

  • next, prev: link fields; mspans are normally chained into linked lists.
  • npages: an mspan's size is a whole number of pages.
  • spanclass (size class): a value between 0 and _NumSizeClasses; this answers the puzzle above. For example, with size class 3 the mspan is carved into 32-byte blocks.
  • elemsize: computed from the size class or from npages. For size class 3, elemsize is 32 bytes. Allocations larger than 32 KB are served as whole pages, with elemsize = page_size * npages.
  • nelems: the total number of blocks (objects) in the span.
  • freeindex: in the range 0 to nelems-1; the slot index at which to start scanning for the next free object.
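Here is a rough model of how freeindex and an allocation bitmap cooperate to find the next free object. The toySpan type is invented for illustration; it uses a plain byte-slice bitmap and skips the allocCache fast path.

```go
package main

import "fmt"

// toySpan mimics the parts of mspan used to find a free object.
type toySpan struct {
	base      uintptr
	elemsize  uintptr
	nelems    uintptr
	freeindex uintptr
	allocBits []byte // bit n set => object n is allocated
}

// nextFree scans allocBits from freeindex for a 0 bit, marks it allocated,
// and returns the object's address (0 means the span is full).
func (s *toySpan) nextFree() uintptr {
	for i := s.freeindex; i < s.nelems; i++ {
		if s.allocBits[i/8]&(1<<(i%8)) == 0 {
			s.allocBits[i/8] |= 1 << (i % 8)
			s.freeindex = i + 1
			return s.base + i*s.elemsize
		}
	}
	return 0
}

func main() {
	s := &toySpan{base: 0x1000, elemsize: 32, nelems: 16, allocBits: make([]byte, 2)}
	for i := 0; i < 3; i++ {
		fmt.Printf("object at %#x\n", s.nextFree())
	}
}
```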

2.3 mcentral

As noted above, when mcache runs out it requests spans from mcentral, so let's look at mcentral next.

// runtime/mheap.go
//go:notinheap
type mSpanList struct {
	first *mspan // first span in list, or nil if none
	last  *mspan // last span in list, or nil if none
}

// runtime/mcentral.go
// Central list of free objects of a given size.
//
//go:notinheap
type mcentral struct {
	lock      mutex
	spanclass spanClass
	nonempty  mSpanList // list of spans with a free object, ie a nonempty free list
	empty     mSpanList // list of spans with no free objects (or cached in an mcache)

	// nmalloc is the cumulative count of objects allocated from
	// this mcentral, assuming all spans in mcaches are
	// fully-allocated. Written atomically, read under STW.
	nmalloc uint64
}

Looking at mcentral:

  • spanclass: mcentral also carries a span class, so is there one mcentral per size class? Yes.
  • lock: required because multiple Ps compete for the same mcentral.
  • nonempty: a doubly linked list of mspans that still have free objects.
  • empty: spans with no free objects left (or spans currently cached in an mcache); effectively a way to keep track of every span that has been handed out.

Where does mcentral live? Although we present mcentral and mheap as two separate pieces, they are both global structures and can be defined together, and in fact they are: the mcentrals are embedded inside mheap.
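The spanclass mentioned above packs the size class together with a noscan bit, which is why each size class ends up with two mcentrals (and two mcache.alloc slots): one for objects that contain pointers and one for objects that do not. A small reconstruction of that encoding, matching the runtime's makeSpanClass but shown here only as an illustrative sketch:

```go
package main

import "fmt"

type spanClass uint8

// makeSpanClass packs a size class and a "contains no pointers" flag into one
// byte: the low bit is the noscan flag, the remaining bits are the size class.
func makeSpanClass(sizeclass uint8, noscan bool) spanClass {
	sc := spanClass(sizeclass) << 1
	if noscan {
		sc |= 1
	}
	return sc
}

func (sc spanClass) sizeclass() uint8 { return uint8(sc >> 1) }
func (sc spanClass) noscan() bool     { return sc&1 != 0 }

func main() {
	sc := makeSpanClass(3, true)
	fmt.Println(sc, sc.sizeclass(), sc.noscan()) // 7 3 true
}
```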

2.4 mheap

  1// runtime/mheap.go
  2
  3// minPhysPageSize is a lower-bound on the physical page size. The
  4// true physical page size may be larger than this. In contrast,
  5// sys.PhysPageSize is an upper-bound on the physical page size.
  6const minPhysPageSize = 4096
  7
  8// Main malloc heap.
  9// The heap itself is the "free" and "scav" treaps,
 10// but all the other global data is here too.
 11//
 12// mheap must not be heap-allocated because it contains mSpanLists,
 13// which must not be heap-allocated.
 14//
 15//go:notinheap
 16type mheap struct {
 17	// lock must only be acquired on the system stack, otherwise a g
 18	// could self-deadlock if its stack grows with the lock held.
 19	lock      mutex
 20	free      mTreap // free spans
 21	sweepgen  uint32 // sweep generation, see comment in mspan
 22	sweepdone uint32 // all spans are swept
 23	sweepers  uint32 // number of active sweepone calls
 24
 25	// allspans is a slice of all mspans ever created. Each mspan
 26	// appears exactly once.
 27	//
 28	// The memory for allspans is manually managed and can be
 29	// reallocated and move as the heap grows.
 30	//
 31	// In general, allspans is protected by mheap_.lock, which
 32	// prevents concurrent access as well as freeing the backing
 33	// store. Accesses during STW might not hold the lock, but
 34	// must ensure that allocation cannot happen around the
 35	// access (since that may free the backing store).
 36	allspans []*mspan // all spans out there
 37
 38	// sweepSpans contains two mspan stacks: one of swept in-use
 39	// spans, and one of unswept in-use spans. These two trade
 40	// roles on each GC cycle. Since the sweepgen increases by 2
 41	// on each cycle, this means the swept spans are in
 42	// sweepSpans[sweepgen/2%2] and the unswept spans are in
 43	// sweepSpans[1-sweepgen/2%2]. Sweeping pops spans from the
 44	// unswept stack and pushes spans that are still in-use on the
 45	// swept stack. Likewise, allocating an in-use span pushes it
 46	// on the swept stack.
 47	sweepSpans [2]gcSweepBuf
 48
 49	_ uint32 // align uint64 fields on 32-bit for atomics
 50
 51	// Proportional sweep
 52	//
 53	// These parameters represent a linear function from heap_live
 54	// to page sweep count. The proportional sweep system works to
 55	// stay in the black by keeping the current page sweep count
 56	// above this line at the current heap_live.
 57	//
 58	// The line has slope sweepPagesPerByte and passes through a
 59	// basis point at (sweepHeapLiveBasis, pagesSweptBasis). At
 60	// any given time, the system is at (memstats.heap_live,
 61	// pagesSwept) in this space.
 62	//
 63	// It's important that the line pass through a point we
 64	// control rather than simply starting at a (0,0) origin
 65	// because that lets us adjust sweep pacing at any time while
 66	// accounting for current progress. If we could only adjust
 67	// the slope, it would create a discontinuity in debt if any
 68	// progress has already been made.
 69	pagesInUse         uint64  // pages of spans in stats mSpanInUse; R/W with mheap.lock
 70	pagesSwept         uint64  // pages swept this cycle; updated atomically
 71	pagesSweptBasis    uint64  // pagesSwept to use as the origin of the sweep ratio; updated atomically
 72	sweepHeapLiveBasis uint64  // value of heap_live to use as the origin of sweep ratio; written with lock, read without
 73	sweepPagesPerByte  float64 // proportional sweep ratio; written with lock, read without
 74	// TODO(austin): pagesInUse should be a uintptr, but the 386
 75	// compiler can't 8-byte align fields.
 76
 77	// Scavenger pacing parameters
 78	//
 79	// The two basis parameters and the scavenge ratio parallel the proportional
 80	// sweeping implementation, the primary differences being that:
 81	//  * Scavenging concerns itself with RSS, estimated as heapRetained()
 82	//  * Rather than pacing the scavenger to the GC, it is paced to a
 83	//    time-based rate computed in gcPaceScavenger.
 84	//
 85	// scavengeRetainedGoal represents our goal RSS.
 86	//
 87	// All fields must be accessed with lock.
 88	//
 89	// TODO(mknyszek): Consider abstracting the basis fields and the scavenge ratio
 90	// into its own type so that this logic may be shared with proportional sweeping.
 91	scavengeTimeBasis     int64
 92	scavengeRetainedBasis uint64
 93	scavengeBytesPerNS    float64
 94	scavengeRetainedGoal  uint64
 95	scavengeGen           uint64 // incremented on each pacing update
 96
 97	// Page reclaimer state
 98
 99	// reclaimIndex is the page index in allArenas of next page to
100	// reclaim. Specifically, it refers to page (i %
101	// pagesPerArena) of arena allArenas[i / pagesPerArena].
102	//
103	// If this is >= 1<<63, the page reclaimer is done scanning
104	// the page marks.
105	//
106	// This is accessed atomically.
107	reclaimIndex uint64
108	// reclaimCredit is spare credit for extra pages swept. Since
109	// the page reclaimer works in large chunks, it may reclaim
110	// more than requested. Any spare pages released go to this
111	// credit pool.
112	//
113	// This is accessed atomically.
114	reclaimCredit uintptr
115
116	// Malloc stats.
117	largealloc  uint64                  // bytes allocated for large objects
118	nlargealloc uint64                  // number of large object allocations
119	largefree   uint64                  // bytes freed for large objects (>maxsmallsize)
120	nlargefree  uint64                  // number of frees for large objects (>maxsmallsize)
121	nsmallfree  [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)
122
123	// arenas is the heap arena map. It points to the metadata for
124	// the heap for every arena frame of the entire usable virtual
125	// address space.
126	//
127	// Use arenaIndex to compute indexes into this array.
128	//
129	// For regions of the address space that are not backed by the
130	// Go heap, the arena map contains nil.
131	//
132	// Modifications are protected by mheap_.lock. Reads can be
133	// performed without locking; however, a given entry can
134	// transition from nil to non-nil at any time when the lock
135	// isn't held. (Entries never transitions back to nil.)
136	//
137	// In general, this is a two-level mapping consisting of an L1
138	// map and possibly many L2 maps. This saves space when there
139	// are a huge number of arena frames. However, on many
140	// platforms (even 64-bit), arenaL1Bits is 0, making this
141	// effectively a single-level map. In this case, arenas[0]
142	// will never be nil.
143	arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
144
145	// heapArenaAlloc is pre-reserved space for allocating heapArena
146	// objects. This is only used on 32-bit, where we pre-reserve
147	// this space to avoid interleaving it with the heap itself.
148	heapArenaAlloc linearAlloc
149
150	// arenaHints is a list of addresses at which to attempt to
151	// add more heap arenas. This is initially populated with a
152	// set of general hint addresses, and grown with the bounds of
153	// actual heap arena ranges.
154	arenaHints *arenaHint
155
156	// arena is a pre-reserved space for allocating heap arenas
157	// (the actual arenas). This is only used on 32-bit.
158	arena linearAlloc
159
160	// allArenas is the arenaIndex of every mapped arena. This can
161	// be used to iterate through the address space.
162	//
163	// Access is protected by mheap_.lock. However, since this is
164	// append-only and old backing arrays are never freed, it is
165	// safe to acquire mheap_.lock, copy the slice header, and
166	// then release mheap_.lock.
167	allArenas []arenaIdx
168
169	// sweepArenas is a snapshot of allArenas taken at the
170	// beginning of the sweep cycle. This can be read safely by
171	// simply blocking GC (by disabling preemption).
172	sweepArenas []arenaIdx
173
174	// curArena is the arena that the heap is currently growing
175	// into. This should always be physPageSize-aligned.
176	curArena struct {
177		base, end uintptr
178	}
179
180	_ uint32 // ensure 64-bit alignment of central
181
182	// central free lists for small size classes.
183	// the padding makes sure that the mcentrals are
184	// spaced CacheLinePadSize bytes apart, so that each mcentral.lock
185	// gets its own cache line.
186	// central is indexed by spanClass.
187	central [numSpanClasses]struct {
188		mcentral mcentral
189		pad      [cpu.CacheLinePadSize - unsafe.Sizeof(mcentral{})%cpu.CacheLinePadSize]byte
190	}
191
192	spanalloc             fixalloc // allocator for span*
193	cachealloc            fixalloc // allocator for mcache*
194	treapalloc            fixalloc // allocator for treapNodes*
195	specialfinalizeralloc fixalloc // allocator for specialfinalizer*
196	specialprofilealloc   fixalloc // allocator for specialprofile*
197	speciallock           mutex    // lock for special record allocators.
198	arenaHintAlloc        fixalloc // allocator for arenaHints
199
200	unused *specialfinalizer // never set, just here to force the specialfinalizer type into DWARF
201}
202
203var mheap_ mheap

mheap_ is a global variable, initialized during runtime startup in mallocinit(). Let's go through the mheap structure.

  • allspans []*mspan: every span is allocated through mheap_, and every mspan ever created is recorded in allspans. The lock field protects concurrent access; the comment also mentions STW, which will be covered in a separate article on Go's GC.

  • central [numSpanClasses]…: these are the mcentrals introduced above, one per span class. The pad field is byte padding that keeps each mcentral on its own cache line to avoid false sharing; see False Sharing on Wikipedia for details.

  • sweepgen, sweepdone: GC-related. (Go's GC is mark-and-sweep; these track sweeping and are not covered further here.)

  • free [_MaxMHeapList]mSpanList (go1.8): an array of span lists in which free[n] links mspans made of n pages, for n from 1 to 127 (_MaxMHeapList - 1). For example, free[3] is a linked list of mspans of 3 pages each. free holds unallocated spans; there is a corresponding busy list. (In go1.13 this array is replaced by the free mTreap shown in the struct above.)

  • freelarge mSpanList (go1.8): a list of mspans whose page count exceeds 127; the counterpart is busylarge.

  • spans []*mspan (go1.8): records the mapping from arena page numbers to mspans; see the lookup sketch after this list.

  • arena_start, arena_end, arena_used (go1.8): to explain these we first need the arena. The arena is the contiguous virtual address range from which Go allocates heap memory; it is managed by mheap, and every heap allocation comes out of it. How is availability tracked? Operating systems commonly do this in one of two ways: chain all free memory into a list, or mark blocks with a bitmap. Combined with spans above, the (go1.8) layout looks like this:

+-----------------------+---------------------+-----------------------+
|    spans              |    bitmap           |   arena               |
+-----------------------+---------------------+-----------------------+

  • spanalloc, cachealloc fixalloc: a fixalloc is a free-list allocator for fixed-size blocks; cachealloc, for example, allocates mcache-sized blocks.

  • The remaining fields are statistics and GC state, which we set aside for a dedicated article.
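To make the spans mapping concrete, here is a hypothetical go1.8-style lookup: given an arena address, compute its page number and index a spans table. All names here are invented for illustration.

```go
package main

import "fmt"

const pageSize = 8192 // Go's heap page size (8 KB)

type span struct{ startPage, npages uintptr }

// spanOf maps an arena address to the span covering it, go1.8 style:
// page_num = (addr - arenaStart) / pageSize, then index the spans table.
func spanOf(addr, arenaStart uintptr, spans []*span) *span {
	pageNum := (addr - arenaStart) / pageSize
	if pageNum >= uintptr(len(spans)) {
		return nil
	}
	return spans[pageNum]
}

func main() {
	arenaStart := uintptr(0xc0) << 32 // 0xc000000000, typical heap start on 64-bit
	s := &span{startPage: 0, npages: 2}
	spans := []*span{s, s, nil, nil} // the 2-page span covers pages 0 and 1

	fmt.Println(spanOf(arenaStart+pageSize+128, arenaStart, spans) == s) // true
	fmt.Println(spanOf(arenaStart+3*pageSize, arenaStart, spans) == nil) // true
}
```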

3. Initialization

During runtime startup the structures introduced above are initialized. Let's look at the initialization code, mallocinit().

  1// runtime/malloc.go
  2
  3// OS memory management abstraction layer
  4//
  5// Regions of the address space managed by the runtime may be in one of four
  6// states at any given time:
  7// 1) None - Unreserved and unmapped, the default state of any region.
  8// 2) Reserved - Owned by the runtime, but accessing it would cause a fault.
  9//               Does not count against the process' memory footprint.
 10// 3) Prepared - Reserved, intended not to be backed by physical memory (though
 11//               an OS may implement this lazily). Can transition efficiently to
 12//               Ready. Accessing memory in such a region is undefined (may
 13//               fault, may give back unexpected zeroes, etc.).
 14// 4) Ready - may be accessed safely.
 15//
 16// This set of states is more than is strictly necessary to support all the
 17// currently supported platforms. One could get by with just None, Reserved, and
 18// Ready. However, the Prepared state gives us flexibility for performance
 19// purposes. For example, on POSIX-y operating systems, Reserved is usually a
 20// private anonymous mmap'd region with PROT_NONE set, and to transition
 21// to Ready would require setting PROT_READ|PROT_WRITE. However the
 22// underspecification of Prepared lets us use just MADV_FREE to transition from
 23// Ready to Prepared. Thus with the Prepared state we can set the permission
 24// bits just once early on, we can efficiently tell the OS that it's free to
 25// take pages away from us when we don't strictly need them.
 26//
 27// For each OS there is a common set of helpers defined that transition
 28// memory regions between these states. The helpers are as follows:
 29//
 30// sysAlloc transitions an OS-chosen region of memory from None to Ready.
 31// More specifically, it obtains a large chunk of zeroed memory from the
 32// operating system, typically on the order of a hundred kilobytes
 33// or a megabyte. This memory is always immediately available for use.
 34//
 35// sysFree transitions a memory region from any state to None. Therefore, it
 36// returns memory unconditionally. It is used if an out-of-memory error has been
 37// detected midway through an allocation or to carve out an aligned section of
 38// the address space. It is okay if sysFree is a no-op only if sysReserve always
 39// returns a memory region aligned to the heap allocator's alignment
 40// restrictions.
 41//
 42// sysReserve transitions a memory region from None to Reserved. It reserves
 43// address space in such a way that it would cause a fatal fault upon access
 44// (either via permissions or not committing the memory). Such a reservation is
 45// thus never backed by physical memory.
 46// If the pointer passed to it is non-nil, the caller wants the
 47// reservation there, but sysReserve can still choose another
 48// location if that one is unavailable.
 49// NOTE: sysReserve returns OS-aligned memory, but the heap allocator
 50// may use larger alignment, so the caller must be careful to realign the
 51// memory obtained by sysReserve.
 52//
 53// sysMap transitions a memory region from Reserved to Prepared. It ensures the
 54// memory region can be efficiently transitioned to Ready.
 55//
 56// sysUsed transitions a memory region from Prepared to Ready. It notifies the
 57// operating system that the memory region is needed and ensures that the region
 58// may be safely accessed. This is typically a no-op on systems that don't have
 59// an explicit commit step and hard over-commit limits, but is critical on
 60// Windows, for example.
 61//
 62// sysUnused transitions a memory region from Ready to Prepared. It notifies the
 63// operating system that the physical pages backing this memory region are no
 64// longer needed and can be reused for other purposes. The contents of a
 65// sysUnused memory region are considered forfeit and the region must not be
 66// accessed again until sysUsed is called.
 67//
 68// sysFault transitions a memory region from Ready or Prepared to Reserved. It
 69// marks a region such that it will always fault if accessed. Used only for
 70// debugging the runtime.
 71
 72func mallocinit() {
 73	if class_to_size[_TinySizeClass] != _TinySize {
 74		throw("bad TinySizeClass")
 75	}
 76
 77	testdefersizes()
 78
 79	if heapArenaBitmapBytes&(heapArenaBitmapBytes-1) != 0 {
 80		// heapBits expects modular arithmetic on bitmap
 81		// addresses to work.
 82		throw("heapArenaBitmapBytes not a power of 2")
 83	}
 84
 85	// Copy class sizes out for statistics table.
 86	for i := range class_to_size {
 87		memstats.by_size[i].size = uint32(class_to_size[i])
 88	}
 89
 90	// Check physPageSize.
 91	if physPageSize == 0 {
 92		// The OS init code failed to fetch the physical page size.
 93		throw("failed to get system page size")
 94	}
 95	if physPageSize < minPhysPageSize {
 96		print("system page size (", physPageSize, ") is smaller than minimum page size (", minPhysPageSize, ")\n")
 97		throw("bad system page size")
 98	}
 99	if physPageSize&(physPageSize-1) != 0 {
100		print("system page size (", physPageSize, ") must be a power of 2\n")
101		throw("bad system page size")
102	}
103	if physHugePageSize&(physHugePageSize-1) != 0 {
104		print("system huge page size (", physHugePageSize, ") must be a power of 2\n")
105		throw("bad system huge page size")
106	}
107	if physHugePageSize != 0 {
108		// Since physHugePageSize is a power of 2, it suffices to increase
109		// physHugePageShift until 1<<physHugePageShift == physHugePageSize.
110		for 1<<physHugePageShift != physHugePageSize {
111			physHugePageShift++
112		}
113	}
114
115	// Initialize the heap.
116	mheap_.init()
117	_g_ := getg()
118	_g_.m.mcache = allocmcache()
119    // sys.PtrSize == 8 means this is a 64-bit system.
120	// Create initial arena growth hints.
121	if sys.PtrSize == 8 {
122		// On a 64-bit machine, we pick the following hints
123		// because:
124		//
125		// 1. Starting from the middle of the address space
126		// makes it easier to grow out a contiguous range
127		// without running in to some other mapping.
128		//
129		// 2. This makes Go heap addresses more easily
130		// recognizable when debugging.
131		//
132		// 3. Stack scanning in gccgo is still conservative,
133		// so it's important that addresses be distinguishable
134		// from other data.
135		//
136		// Starting at 0x00c0 means that the valid memory addresses
137		// will begin 0x00c0, 0x00c1, ...
138		// In little-endian, that's c0 00, c1 00, ... None of those are valid
139		// UTF-8 sequences, and they are otherwise as far away from
140		// ff (likely a common byte) as possible. If that fails, we try other 0xXXc0
141		// addresses. An earlier attempt to use 0x11f8 caused out of memory errors
142		// on OS X during thread allocations.  0x00c0 causes conflicts with
143		// AddressSanitizer which reserves all memory up to 0x0100.
144		// These choices reduce the odds of a conservative garbage collector
145		// not collecting memory because some non-pointer block of memory
146		// had a bit pattern that matched a memory address.
147		//
148		// However, on arm64, we ignore all this advice above and slam the
149		// allocation at 0x40 << 32 because when using 4k pages with 3-level
150		// translation buffers, the user address space is limited to 39 bits
151		// On darwin/arm64, the address space is even smaller.
152		// On AIX, mmaps starts at 0x0A00000000000000 for 64-bit.
153		// processes.
154		for i := 0x7f; i >= 0; i-- {
155			var p uintptr
156			switch {
157			case GOARCH == "arm64" && GOOS == "darwin":
158				p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
159			case GOARCH == "arm64":
160				p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
161			case GOOS == "aix":
162				if i == 0 {
163					// We don't use addresses directly after 0x0A00000000000000
164					// to avoid collisions with others mmaps done by non-go programs.
165					continue
166				}
167				p = uintptr(i)<<40 | uintptrMask&(0xa0<<52)
168			case raceenabled:
169				// The TSAN runtime requires the heap
170				// to be in the range [0x00c000000000,
171				// 0x00e000000000).
172				p = uintptr(i)<<32 | uintptrMask&(0x00c0<<32)
173				if p >= uintptrMask&0x00e000000000 {
174					continue
175				}
176			default:
177				p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
178            }
179            // Before go1.10 a single contiguous virtual range was reserved instead:
180            // pSize = bitmapSize + spansSize + arenaSize + _PageSize  // reserve pSize bytes of contiguous virtual address space from the OS
181            // p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))            
182            // if p != 0 {                
183            //  break
184            // }
185			hint := (*arenaHint)(mheap_.arenaHintAlloc.alloc())
186			hint.addr = p
187			hint.next, mheap_.arenaHints = mheap_.arenaHints, hint
188		}
189	} else { // 32-bit systems take this branch
190		// On a 32-bit machine, we're much more concerned
191		// about keeping the usable heap contiguous.
192		// Hence:
193		//
194		// 1. We reserve space for all heapArenas up front so
195		// they don't get interleaved with the heap. They're
196		// ~258MB, so this isn't too bad. (We could reserve a
197		// smaller amount of space up front if this is a
198		// problem.)
199		//
200		// 2. We hint the heap to start right above the end of
201		// the binary so we have the best chance of keeping it
202		// contiguous.
203		//
204		// 3. We try to stake out a reasonably large initial
205		// heap reservation.
206
207		const arenaMetaSize = (1 << arenaBits) * unsafe.Sizeof(heapArena{})
208		meta := uintptr(sysReserve(nil, arenaMetaSize))
209		if meta != 0 {
210			mheap_.heapArenaAlloc.init(meta, arenaMetaSize)
211		}
212
213		// We want to start the arena low, but if we're linked
214		// against C code, it's possible global constructors
215		// have called malloc and adjusted the process' brk.
216		// Query the brk so we can avoid trying to map the
217		// region over it (which will cause the kernel to put
218		// the region somewhere else, likely at a high
219		// address).
220		procBrk := sbrk0()
221
222		// If we ask for the end of the data segment but the
223		// operating system requires a little more space
224		// before we can start allocating, it will give out a
225		// slightly higher pointer. Except QEMU, which is
226		// buggy, as usual: it won't adjust the pointer
227		// upward. So adjust it upward a little bit ourselves:
228		// 1/4 MB to get away from the running binary image.
229		p := firstmoduledata.end
230		if p < procBrk {
231			p = procBrk
232		}
233		if mheap_.heapArenaAlloc.next <= p && p < mheap_.heapArenaAlloc.end {
234			p = mheap_.heapArenaAlloc.end
235		}
236		p = round(p+(256<<10), heapArenaBytes)
237		// Because we're worried about fragmentation on
238		// 32-bit, we try to make a large initial reservation.
239		arenaSizes := []uintptr{
240			512 << 20,
241			256 << 20,
242			128 << 20,
243		}
244		for _, arenaSize := range arenaSizes {
245			a, size := sysReserveAligned(unsafe.Pointer(p), arenaSize, heapArenaBytes)
246			if a != nil {
247				mheap_.arena.init(uintptr(a), size)
248				p = uintptr(a) + size // For hint below
249				break
250			}
251		}
252		hint := (*arenaHint)(mheap_.arenaHintAlloc.alloc())
253		hint.addr = p
254		hint.next, mheap_.arenaHints = mheap_.arenaHints, hint
255	}
256}

3.1 The arena

The arena-related address arithmetic looks like this in go1.13, where the heap is grown on demand:

// runtime/mheap.go
// Try to add at least npage pages of memory to the heap,
// returning whether it worked.
//
// h must be locked.
func (h *mheap) grow(npage uintptr) bool {
	ask := npage << _PageShift

	nBase := round(h.curArena.base+ask, physPageSize)

round is an alignment helper that rounds n up to a multiple of a (a must be a power of two). The implementation is a neat bit trick:

// runtime/stubs.go
// round n up to a multiple of a.  a must be a power of 2.
func round(n, a uintptr) uintptr {
	return (n + a - 1) &^ (a - 1)
}
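A few worked examples of round (the function is restated locally so the snippet runs on its own; the mask trick only works because a is a power of two):

```go
package main

import "fmt"

// round rounds n up to a multiple of a; a must be a power of 2.
func round(n, a uintptr) uintptr {
	return (n + a - 1) &^ (a - 1)
}

func main() {
	fmt.Println(round(1, 8192))    // 8192
	fmt.Println(round(8192, 8192)) // 8192
	fmt.Println(round(8193, 8192)) // 16384
	fmt.Println(round(5000, 4096)) // 8192
}
```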

Note that Go's memory manager uses a virtual page size of 8 KB:

// runtime/malloc.go
_PageSize = 1 << _PageShift
_PageShift = 13

For comparison, the go1.8 version:

In go1.8 the arena-related sizes were initialized like this:

arenaSize := round(_MaxMem, _PageSize)
bitmapSize = arenaSize / (sys.PtrSize * 8 / 2)
spansSize = arenaSize / _PageSize * sys.PtrSize
spansSize = round(spansSize, _PageSize)

_MaxMem = uintptr(1<<_MHeapMap_TotalBits - 1)

First, the variable _MaxMem (a further constant inside it is omitted here): in short, _MaxMem is the size reserved for the arena region, 512 GB on 64-bit systems and 32 GB on 64-bit Windows. round rounds up to a multiple of _PageSize, using the bit trick shown above.

The bitmap uses two bits per heap word to record its state, which works out to 16 GB. If you have read the Go source you may have noticed that the comment there says 32 GB; that comment is simply stale, from the days when four bits per word were used.

spans records the mapping between arena page numbers and mspan pointers. For an arena address x, the page number is page_num = (x - arena_start) / page_size, and spans[page_num] points to the mspan covering x. With a 512 GB arena, the spans region takes 512 GB / 8 KB * 8 = 512 MB. Note again that Go's heap page size is 8 KB:

_PageSize = 1 << _PageShift
_PageShift = 13

So this contiguous stretch of virtual memory is laid out as follows on 64-bit:

+-----------------------+---------------------+-----------------------+
|    spans 512M         |    bitmap 16G       |   arena 512G          |
+-----------------------+---------------------+-----------------------+
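The arithmetic above is easy to check; this small calculation reproduces the 512 GB / 16 GB / 512 MB figures for a 64-bit arena, assuming an 8-byte pointer and the 8 KB page:

```go
package main

import "fmt"

func main() {
	var (
		ptrSize   int64 = 8
		pageSize  int64 = 8192
		arenaSize int64 = 512 << 30 // 512 GB reserved for the arena on 64-bit
	)

	bitmapSize := arenaSize / (ptrSize * 8 / 2) // 2 bits per heap word
	spansSize := arenaSize / pageSize * ptrSize // one *mspan pointer per page

	fmt.Printf("arena  = %d GB\n", arenaSize>>30)  // 512
	fmt.Printf("bitmap = %d GB\n", bitmapSize>>30) // 16
	fmt.Printf("spans  = %d MB\n", spansSize>>20)  // 512
}
```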

3.2 Reserving virtual address space

The go1.8 version:

The key part is the loop below, which tries reservations starting from different candidate addresses:

// Try reservations starting from different base addresses.
for i := 0; i <= 0x7f; i++ {
    switch {
    case GOARCH == "arm64" && GOOS == "darwin":
        p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
    case GOARCH == "arm64":
        p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
    default:
        p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
    }
    pSize = bitmapSize + spansSize + arenaSize + _PageSize
    // Reserve pSize bytes of contiguous virtual address space from the OS.
    p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
    if p != 0 {
        break
    }
}

At initialization, Go reserves one contiguous range of address space from the operating system: the spans + bitmap + arena layout above. p is the starting address of that range, and its value differs between platforms. The reservation itself is made with a platform-specific system call: on Unix-like systems mmap (which asks the kernel for a new virtual address range, optionally with a requested start address and length), and on Windows VirtualAlloc (similar to mmap).

//bsd
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    if sys.PtrSize == 8 && uint64(n) > 1<<32 || sys.GoosNacl != 0 {
        *reserved = false
        return v
    }

    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
       return nil
    }
    *reserved = true
    return p
}

//darwin
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
       return nil
    }
    return p
}

//linux
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    ...
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
       return nil
    }
    *reserved = true
    return p
}

//windows
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    // v is just a hint.
    // First try at v.
    v = unsafe.Pointer(stdcall4(_VirtualAlloc, uintptr(v), n, _MEM_RESERVE, _PAGE_READWRITE))
    if v != nil {
        return v
    }
    // Next let the kernel choose the address.
    return unsafe.Pointer(stdcall4(_VirtualAlloc, 0, n, _MEM_RESERVE, _PAGE_READWRITE))
}

The go1.13 version:

// linux
func sysReserve(v unsafe.Pointer, n uintptr) unsafe.Pointer {
	p, err := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
	if err != 0 {
		return nil
	}
	return p
}

// windows
func sysReserve(v unsafe.Pointer, n uintptr) unsafe.Pointer {
	// v is just a hint.
	// First try at v.
	// This will fail if any of [v, v+n) is already reserved.
	v = unsafe.Pointer(stdcall4(_VirtualAlloc, uintptr(v), n, _MEM_RESERVE, _PAGE_READWRITE))
	if v != nil {
		return v
	}

	// Next let the kernel choose the address.
	return unsafe.Pointer(stdcall4(_VirtualAlloc, 0, n, _MEM_RESERVE, _PAGE_READWRITE))
}
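Outside the runtime you can reserve (but not commit) address space in much the same way with an anonymous PROT_NONE mapping. A hedged sketch for Linux/macOS using the syscall package; the size and the minimal error handling are illustrative only:

```go
//go:build linux || darwin

package main

import (
	"fmt"
	"syscall"
)

func main() {
	const size = 1 << 30 // reserve 1 GB of address space; no physical memory yet

	// PROT_NONE + anonymous private mapping: the range is reserved but any
	// access faults, which is essentially what sysReserve wants.
	mem, err := syscall.Mmap(-1, 0, size, syscall.PROT_NONE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(mem)

	fmt.Printf("reserved %d MB at %p\n", size>>20, &mem[0])
}
```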

3.3 mheap initialization

The go1.8 code:

As we saw when introducing the mheap structure, spans, bitmap and arena all live inside mheap. Once the address space has been reserved from the OS, mheap itself is initialized:

func mallocinit() {
  ...
    p1 := round(p, _PageSize)

    spansStart := p1
    mheap_.bitmap = p1 + spansSize + bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1 + (spansSize + bitmapSize)
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1 + (spansSize + bitmapSize)
    mheap_.arena_reserved = reserved
    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }
    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    // Get the current G.
    _g_ := getg()
    // Get the mcache of the M bound to this G.
    _g_.m.mcache = allocmcache()
}

p is the start of the contiguous virtual range; it is first aligned, and then the spans, bitmap and arena addresses are derived from it. mheap_.init() initializes the fixalloc allocators and related members, as well as the mcentrals:

func (h *mheap) init(spansStart, spansBytes uintptr) {
    h.spanalloc.init(unsafe.Sizeof(mspan{}), recordspan, unsafe.Pointer(h), &memstats.mspan_sys)
    h.cachealloc.init(unsafe.Sizeof(mcache{}), nil, nil, &memstats.mcache_sys)
    h.specialfinalizeralloc.init(unsafe.Sizeof(specialfinalizer{}), nil, nil, &memstats.other_sys)
    h.specialprofilealloc.init(unsafe.Sizeof(specialprofile{}), nil, nil, &memstats.other_sys)

    h.spanalloc.zero = false

    // h->mapcache needs no init
    for i := range h.free {
        h.free[i].init()
        h.busy[i].init()
    }

    h.freelarge.init()
    h.busylarge.init()
    for i := range h.central {
        h.central[i].mcentral.init(int32(i))
    }

    sp := (*slice)(unsafe.Pointer(&h.spans))
    sp.array = unsafe.Pointer(spansStart)
    sp.len = 0
    sp.cap = int(spansBytes / sys.PtrSize)
}

After mheap is initialized, the current thread, i.e. the M, is set up:

// Get the current G.
_g_ := getg()
// Get the mcache of the M bound to this G.
_g_.m.mcache = allocmcache()

3.4 Per-P mcache initialization

The go1.8 code:

Nothing above initialized the per-P mcaches, because at that point the Ps themselves do not exist yet. Let's look at the bootstrap code:

func schedinit() {
  ...
  mallocinit()
  ...
  if procs > _MaxGomaxprocs {
    procs = _MaxGomaxprocs
  }
  if procresize(procs) != nil {
    ...
  }
}

mallocinit() was covered above. The Ps are initialized in procresize(); below we only look at the memory-related part:

func procresize(nprocs int32) *p {
    ...
    // initialize new P's
    for i := int32(0); i < nprocs; i++ {
        pp := allp[i]
        if pp == nil {
            pp = new(p)
            pp.id = i
            pp.status = _Pgcstop
            pp.sudogcache = pp.sudogbuf[:0]
            for i := range pp.deferpool {
                pp.deferpool[i] = pp.deferpoolbuf[i][:0]
            }
            atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
        }
        // Initialize the P's mcache.
        if pp.mcache == nil {
            if old == 0 && i == 0 {
                if getg().m.mcache == nil {
                    throw("missing mcache?")
                }
                // P[0] takes the mcache of the bootstrap goroutine.
                pp.mcache = getg().m.mcache // bootstrap
            } else {
                // Every other P allocates a fresh mcache.
                pp.mcache = allocmcache()
            }
        }
        ...
    }
  ...
}

All Ps live in the global array allp; procresize() initializes the Ps that will be used and strips the resources of any surplus Ps. Now allocmcache:

func allocmcache() *mcache {
    lock(&mheap_.lock)
    c := (*mcache)(mheap_.cachealloc.alloc())
    unlock(&mheap_.lock)
    for i := 0; i < _NumSizeClasses; i++ {
        c.alloc[i] = &emptymspan
    }
    c.next_sample = nextSample()
    return c
}

allocmcache allocates from mheap's dedicated allocator cachealloc. As noted earlier, a fixalloc is a free list of fixed-size blocks, so cachealloc is the free list for mcache objects and each allocation simply takes the head of the list. At this point mcache.alloc is not backed by real spans yet; those are obtained from mcentral later.

4. Memory allocation

The main flow for allocating an object is as follows (a small size-dispatch sketch follows the list):

  1. If the object is larger than 32 KB, allocate directly from mheap.
  2. If the object is smaller than 16 bytes, allocate from mcache's tiny allocator. (tiny is really just a pointer; more on that below.)
  3. If the object is between 16 bytes and 32 KB, allocate from the mcache span of the matching size class.
  4. If that size class's span in mcache has no free block, refill from mcentral.
  5. If mcentral has no usable span either, request one from mheap and carve it up.
  6. If mheap has no suitable span, request memory from the operating system.
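The three-way split can be summarized in a few lines. The thresholds below match the runtime's maxTinySize (16 bytes) and maxSmallSize (32 KB); the classifier function itself is just an illustration, not runtime code.

```go
package main

import "fmt"

const (
	maxTinySize  = 16      // pointer-free objects below this use the tiny allocator
	maxSmallSize = 32 << 10 // objects above this go straight to mheap
)

// allocPath reports which allocation path an object of the given size takes.
func allocPath(size uintptr, hasPointers bool) string {
	switch {
	case size > maxSmallSize:
		return "large: allocate whole pages from mheap"
	case size < maxTinySize && !hasPointers:
		return "tiny: bump-allocate inside mcache.tiny"
	default:
		return "small: mcache span of the matching size class (refill via mcentral/mheap)"
	}
}

func main() {
	fmt.Println(allocPath(8, false))
	fmt.Println(allocPath(8, true))
	fmt.Println(allocPath(1024, false))
	fmt.Println(allocPath(64<<10, false))
}
```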

Let's look at the functions involved in allocating on the heap, that is, in the arena.

package main

func foo() *int {
	x := 1
	return &x
}

func main() {
	x := foo()
	println(*x)
}

By the escape analysis described earlier, x in foo() is allocated on the heap. Save this as test1.go and look at the generated assembly.

$ go build -gcflags '-l' -o test1 test1.go
$ go tool objdump -s "main\.foo" test1
TEXT main.foo(SB) /Users/didi/code/go/malloc_example/test2.go
    test2.go:3  0x2040  65488b0c25a0080000  GS MOVQ GS:0x8a0, CX
    test2.go:3  0x2049  483b6110        CMPQ 0x10(CX), SP
    test2.go:3  0x204d  762a            JBE 0x2079
    test2.go:3  0x204f  4883ec10        SUBQ $0x10, SP
    test2.go:4  0x2053  488d1d66460500      LEAQ 0x54666(IP), BX
    test2.go:4  0x205a  48891c24        MOVQ BX, 0(SP)
    test2.go:4  0x205e  e82d8f0000      CALL runtime.newobject(SB)
    test2.go:4  0x2063  488b442408      MOVQ 0x8(SP), AX
    test2.go:4  0x2068  48c70001000000      MOVQ $0x1, 0(AX)
    test2.go:5  0x206f  4889442418      MOVQ AX, 0x18(SP)
    test2.go:5  0x2074  4883c410        ADDQ $0x10, SP
    test2.go:5  0x2078  c3          RET
    test2.go:3  0x2079  e8a28d0400      CALL runtime.morestack_noctxt(SB)
    test2.go:3  0x207e  ebc0            JMP main.foo(SB)

Heap allocation goes through runtime.newobject, which in turn calls mallocgc.

  1// runtime/malloc.go
  2// Allocate an object of size bytes.
  3// Small objects are allocated from the per-P cache's free lists.
  4// Large objects (> 32 kB) are allocated straight from the heap.
  5func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
  6	if gcphase == _GCmarktermination {
  7		throw("mallocgc called with gcphase == _GCmarktermination")
  8	}
  9
 10	if size == 0 {
 11		return unsafe.Pointer(&zerobase)
 12	}
 13
 14	if debug.sbrk != 0 {
 15		align := uintptr(16)
 16		if typ != nil {
 17			// TODO(austin): This should be just
 18			//   align = uintptr(typ.align)
 19			// but that's only 4 on 32-bit platforms,
 20			// even if there's a uint64 field in typ (see #599).
 21			// This causes 64-bit atomic accesses to panic.
 22			// Hence, we use stricter alignment that matches
 23			// the normal allocator better.
 24			if size&7 == 0 {
 25				align = 8
 26			} else if size&3 == 0 {
 27				align = 4
 28			} else if size&1 == 0 {
 29				align = 2
 30			} else {
 31				align = 1
 32			}
 33		}
 34		return persistentalloc(size, align, &memstats.other_sys)
 35	}
 36
 37	// assistG is the G to charge for this allocation, or nil if
 38	// GC is not currently active.
 39	var assistG *g
 40	if gcBlackenEnabled != 0 {
 41		// Charge the current user G for this allocation.
 42		assistG = getg()
 43		if assistG.m.curg != nil {
 44			assistG = assistG.m.curg
 45		}
 46		// Charge the allocation against the G. We'll account
 47		// for internal fragmentation at the end of mallocgc.
 48		assistG.gcAssistBytes -= int64(size)
 49
 50		if assistG.gcAssistBytes < 0 {
 51			// This G is in debt. Assist the GC to correct
 52			// this before allocating. This must happen
 53			// before disabling preemption.
 54			gcAssistAlloc(assistG)
 55		}
 56	}
 57
 58	// Set mp.mallocing to keep from being preempted by GC.
 59	mp := acquirem()
 60	if mp.mallocing != 0 {
 61		throw("malloc deadlock")
 62	}
 63	if mp.gsignal == getg() {
 64		throw("malloc during signal")
 65	}
 66	mp.mallocing = 1
 67
 68	shouldhelpgc := false
 69	dataSize := size
 70	c := gomcache()
 71	var x unsafe.Pointer
 72	noscan := typ == nil || typ.ptrdata == 0
 73	if size <= maxSmallSize {
 74		if noscan && size < maxTinySize { // tiny path: small objects under 16 bytes
 75			// Tiny allocator.
 76			//
 77			// Tiny allocator combines several tiny allocation requests
 78			// into a single memory block. The resulting memory block
 79			// is freed when all subobjects are unreachable. The subobjects
 80			// must be noscan (don't have pointers), this ensures that
 81			// the amount of potentially wasted memory is bounded.
 82			//
 83			// Size of the memory block used for combining (maxTinySize) is tunable.
 84			// Current setting is 16 bytes, which relates to 2x worst case memory
 85			// wastage (when all but one subobjects are unreachable).
 86			// 8 bytes would result in no wastage at all, but provides less
 87			// opportunities for combining.
 88			// 32 bytes provides more opportunities for combining,
 89			// but can lead to 4x worst case wastage.
 90			// The best case winning is 8x regardless of block size.
 91			//
 92			// Objects obtained from tiny allocator must not be freed explicitly.
 93			// So when an object will be freed explicitly, we ensure that
 94			// its size >= maxTinySize.
 95			//
 96			// SetFinalizer has a special case for objects potentially coming
 97			// from tiny allocator, it such case it allows to set finalizers
 98			// for an inner byte of a memory block.
 99			//
100			// The main targets of tiny allocator are small strings and
101			// standalone escaping variables. On a json benchmark
102			// the allocator reduces number of allocations by ~12% and
103			// reduces heap size by ~20%.
104			off := c.tinyoffset // align the offset for the object size
105			// Align tiny pointer for required (conservative) alignment.
106			if size&7 == 0 {
107				off = round(off, 8)
108			} else if size&3 == 0 {
109				off = round(off, 4)
110			} else if size&1 == 0 {
111				off = round(off, 2)
112			}
113			if off+size <= maxTinySize && c.tiny != 0 { // allocate from the current tiny block
114				// The object fits into existing tiny block.
115				x = unsafe.Pointer(c.tiny + off)
116				c.tinyoffset = off + size
117				c.local_tinyallocs++
118				mp.mallocing = 0
119				releasem(mp)
120				return x
121            }
122             // The tiny block is exhausted; allocate a fresh 16-byte block for it.
123			// Allocate a new maxTinySize block.
124			span := c.alloc[tinySpanClass]
125			v := nextFreeFast(span)
126			if v == 0 {
127				v, _, shouldhelpgc = c.nextFree(tinySpanClass)
128			}
129			x = unsafe.Pointer(v) // zero the newly obtained block
130			(*[2]uint64)(x)[0] = 0
131			(*[2]uint64)(x)[1] = 0
132			// See if we need to replace the existing tiny block with the new one
133            // based on amount of remaining free space.
134            // If the new block is not used up, keep the remainder as the tiny block and record how much was used in tinyoffset.
135			if size < c.tinyoffset || c.tiny == 0 {
136				c.tiny = uintptr(x)
137				c.tinyoffset = size
138			}
139			size = maxTinySize
140		} else {
141			var sizeclass uint8 // compute the size class
142			if size <= smallSizeMax-8 {
143				sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
144			} else {
145				sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
146			}
147			size = uintptr(class_to_size[sizeclass])
148			spc := makeSpanClass(sizeclass, noscan)
149			span := c.alloc[spc]
150			v := nextFreeFast(span) // allocate one object from the matching span
151			if v == 0 {
152				v, span, shouldhelpgc = c.nextFree(spc)
153			}
154			x = unsafe.Pointer(v)
155			if needzero && span.needzero != 0 {
156				memclrNoHeapPointers(unsafe.Pointer(v), size)
157			}
158		}
159	} else { // object size > 32 KB
160		var s *mspan
161		shouldhelpgc = true
162		systemstack(func() {
163			s = largeAlloc(size, needzero, noscan)
164		})
165		s.freeindex = 1
166		s.allocCount = 1
167		x = unsafe.Pointer(s.base())
168		size = s.elemsize
169	}
170
171	var scanSize uintptr
172	if !noscan {
173		// If allocating a defer+arg block, now that we've picked a malloc size
174		// large enough to hold everything, cut the "asked for" size down to
175		// just the defer header, so that the GC bitmap will record the arg block
176		// as containing nothing at all (as if it were unused space at the end of
177		// a malloc block caused by size rounding).
178		// The defer arg areas are scanned as part of scanstack.
179		if typ == deferType {
180			dataSize = unsafe.Sizeof(_defer{})
181		}
182		heapBitsSetType(uintptr(x), size, dataSize, typ)
183		if dataSize > typ.size {
184			// Array allocation. If there are any
185			// pointers, GC has to scan to the last
186			// element.
187			if typ.ptrdata != 0 {
188				scanSize = dataSize - typ.size + typ.ptrdata
189			}
190		} else {
191			scanSize = typ.ptrdata
192		}
193		c.local_scan += scanSize
194	}
195
196	// Ensure that the stores above that initialize x to
197	// type-safe memory and set the heap bits occur before
198	// the caller can make x observable to the garbage
199	// collector. Otherwise, on weakly ordered machines,
200	// the garbage collector could follow a pointer to x,
201	// but see uninitialized memory or stale heap bits.
202	publicationBarrier()
203
204	// Allocate black during GC.
205	// All slots hold nil so no scanning is needed.
206	// This may be racing with GC so do it atomically if there can be
207	// a race marking the bit.
208	if gcphase != _GCoff {
209		gcmarknewobject(uintptr(x), size, scanSize)
210	}
211
212	if raceenabled {
213		racemalloc(x, size)
214	}
215
216	if msanenabled {
217		msanmalloc(x, size)
218	}
219
220	mp.mallocing = 0
221	releasem(mp)
222
223	if debug.allocfreetrace != 0 {
224		tracealloc(x, size, typ)
225	}
226
227	if rate := MemProfileRate; rate > 0 {
228		if rate != 1 && size < c.next_sample {
229			c.next_sample -= size
230		} else {
231			mp := acquirem()
232			profilealloc(mp, x, size)
233			releasem(mp)
234		}
235	}
236
237	if assistG != nil {
238		// Account for internal fragmentation in the assist
239		// debt now that we know it.
240		assistG.gcAssistBytes -= int64(size - dataSize)
241	}
242
243	if shouldhelpgc {
244		if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
245			gcStart(t)
246		}
247	}
248
249	return x
250}
251
252func largeAlloc(size uintptr, needzero bool, noscan bool) *mspan {
253	// print("largeAlloc size=", size, "\n")
254
255	if size+_PageSize < size {
256		throw("out of memory")
257	}
258	npages := size >> _PageShift
259	if size&_PageMask != 0 {
260		npages++
261	}
262
263	// Deduct credit for this span allocation and sweep if
264	// necessary. mHeap_Alloc will also sweep npages, so this only
265	// pays the debt down to npage pages.
266	deductSweepCredit(npages*_PageSize, npages)
267
268	s := mheap_.alloc(npages, makeSpanClass(0, noscan), true, needzero)
269	if s == nil {
270		throw("out of memory")
271	}
272	s.limit = s.base() + size
273	heapBitsForAddr(s.base()).initSpan(s)
274	return s
275}
276
277// implementation of new builtin
278// compiler (both frontend and SSA backend) knows the signature
279// of this function
280func newobject(typ *_type) unsafe.Pointer {
281	return mallocgc(typ.size, typ, true)
282}

The allocation path splits into three cases by object size: size < 16 bytes, 16 bytes <= size <= 32 KB, and size > 32 KB.

4.1 Size smaller than 16 bytes

For blocks smaller than 16 bytes, mcache has a dedicated tiny region. tiny is a pointer to the start of the current tiny block, and tinyoffset records how far into that block allocation has progressed; subsequent allocations are placed at tinyoffset. The offset is first aligned for the object's size (for example, if size is a multiple of 8, tinyoffset is rounded up to a multiple of 8) and the object is then allocated there. If the remaining space in the tiny block is insufficient, a new 16-byte block is requested and the object is allocated from it; whatever is left over becomes the new tiny block.
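Below is a stripped-down model of that bump allocation, tracking offsets only (no real memory). The alignment rules mirror the ones in mallocgc above, but the type and bookkeeping are invented for illustration.

```go
package main

import "fmt"

const maxTinySize = 16

// tinyAlloc models mcache.tiny/tinyoffset: it hands out offsets inside the
// current 16-byte block and starts a fresh block when a request doesn't fit.
type tinyAlloc struct {
	block  int // which 16-byte block we are in (stands in for c.tiny)
	offset uintptr
}

func align(off, a uintptr) uintptr { return (off + a - 1) &^ (a - 1) }

func (t *tinyAlloc) alloc(size uintptr) (block int, off uintptr) {
	switch {
	case size&7 == 0:
		t.offset = align(t.offset, 8)
	case size&3 == 0:
		t.offset = align(t.offset, 4)
	case size&1 == 0:
		t.offset = align(t.offset, 2)
	}
	if t.offset+size > maxTinySize {
		t.block++ // grab a new 16-byte block
		t.offset = 0
	}
	off = t.offset
	t.offset += size
	return t.block, off
}

func main() {
	var t tinyAlloc
	for _, n := range []uintptr{1, 2, 8, 7, 4} {
		b, off := t.alloc(n)
		fmt.Printf("size %d -> block %d offset %d\n", n, b, off)
	}
}
```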

4.2 Size larger than 32 KB

Allocations larger than 32 KB skip mcache and mcentral entirely and are served by mheap, always in whole pages: the size is shifted right by the page shift and rounded up if any low bits remain.
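For example, the page count can be computed the same way largeAlloc does it (_PageShift = 13, so pages are 8 KB); the helper below simply restates that arithmetic:

```go
package main

import "fmt"

const (
	pageShift = 13
	pageSize  = 1 << pageShift
	pageMask  = pageSize - 1
)

// pagesFor mirrors largeAlloc's page computation: shift right, then round up
// if any low bits were set.
func pagesFor(size uintptr) uintptr {
	npages := size >> pageShift
	if size&pageMask != 0 {
		npages++
	}
	return npages
}

func main() {
	fmt.Println(pagesFor(33 << 10)) // 33 KB -> 5 pages (40 KB)
	fmt.Println(pagesFor(64 << 10)) // 64 KB -> 8 pages
	fmt.Println(pagesFor(100000))   // 100000 B -> 13 pages
}
```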

4.3 Size between 16 bytes and 32 KB

For sizes between 16 bytes and 32 KB, the size class is computed first and the allocation is attempted from mcache.alloc[sizeclass]. If that span cannot satisfy the request, mcache refills from mcentral and then allocates. After handing a span to an mcache, mcentral checks whether it needs to grow and, if so, requests more from mheap.

How is the size class computed? Two tables are precomputed: size_to_class8 and size_to_class128. The i-th entry of size_to_class8 gives the size class for sizes in the interval ((i-1)*8, i*8] (smallSizeDiv = 8); size_to_class128 works the same way in 128-byte steps. Sizes up to smallSizeMax - 8 = 1024 - 8 = 1016 bytes use size_to_class8, larger ones use size_to_class128. The table starts 0, 1, 2, 3, 3, 4, 4, .... For example, to allocate 17 bytes (anything under 16 bytes goes through mcache.tiny), sizeclass = size_to_class8[(17+7)/8] = size_to_class8[3] = 3. Trading a little space for constant-time lookups like this clearly pays off.
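Here is a compressed illustration of that table lookup. The tables below cover only the first few classes and were written out by hand from class_to_size above; the real tables are generated into sizeclasses.go.

```go
package main

import "fmt"

const smallSizeDiv = 8

// First entries of the real tables, reproduced by hand for illustration:
// class_to_size: 0, 8, 16, 32, 48, 64, ...
// size_to_class8[i] = size class for sizes in ((i-1)*8, i*8].
var sizeToClass8 = []uint8{0, 1, 2, 3, 3, 4, 4, 5, 5}
var classToSize = []uint16{0, 8, 16, 32, 48, 64}

func sizeClassOf(size uintptr) uint8 {
	return sizeToClass8[(size+smallSizeDiv-1)/smallSizeDiv]
}

func main() {
	for _, n := range []uintptr{1, 8, 9, 17, 33, 48} {
		c := sizeClassOf(n)
		fmt.Printf("size %2d -> class %d (block %d bytes)\n", n, c, classToSize[c])
	}
}
```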

With the size class known, the allocation goes to mcache.alloc[sizeclass]. Note that this is an mspan pointer; the fast-path allocation is done by nextFreeFast():

 1// nextFreeFast returns the next free object if one is quickly available.
 2// Otherwise it returns 0.
 3func nextFreeFast(s *mspan) gclinkptr {
 4	theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
 5	if theBit < 64 {
 6		result := s.freeindex + uintptr(theBit)
 7		if result < s.nelems {
 8			freeidx := result + 1
 9			if freeidx%64 == 0 && freeidx != s.nelems {
10				return 0
11			}
12			s.allocCache >>= uint(theBit + 1)
13			s.freeindex = freeidx
14			s.allocCount++
15			return gclinkptr(result*s.elemsize + s.base())
16		}
17	}
18	return 0
19}

allocCache is a bitmap of object availability (1 means free); the object's address is then computed from the span's freeindex, elemsize and base address.
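The ctz trick is easy to reproduce with math/bits: since allocCache stores the complement of allocBits, the number of trailing zeros is exactly the distance from freeindex to the next free object. A simplified model (not the runtime code, and without the cache-refill handling):

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextFreeFromCache models nextFreeFast: allocCache holds inverted alloc bits
// (1 = free), with bit 0 corresponding to object freeindex.
func nextFreeFromCache(allocCache uint64, freeindex, nelems uintptr) (idx uintptr, ok bool) {
	theBit := uintptr(bits.TrailingZeros64(allocCache))
	if theBit == 64 {
		return 0, false // nothing free in the cached window
	}
	idx = freeindex + theBit
	if idx >= nelems {
		return 0, false
	}
	return idx, true
}

func main() {
	// Objects 0 and 1 (relative to freeindex) are allocated, object 2 is free:
	// the inverted bitmap ends in ...11111100, i.e. two trailing zeros.
	cache := ^uint64(0) &^ 0b11
	if idx, ok := nextFreeFromCache(cache, 10, 128); ok {
		fmt.Println("next free object index:", idx) // 12
	}
}
```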

If mcache.alloc[sizeclass] has run out, mcache refills itself from mcentral:

 1// nextFree returns the next free object from the cached span if one is available.
 2// Otherwise it refills the cache with a span with an available object and
 3// returns that object along with a flag indicating that this was a heavy
 4// weight allocation. If it is a heavy weight allocation the caller must
 5// determine whether a new GC cycle needs to be started or if the GC is active
 6// whether this goroutine needs to assist the GC.
 7//
 8// Must run in a non-preemptible context since otherwise the owner of
 9// c could change.
10func (c *mcache) nextFree(spc spanClass) (v gclinkptr, s *mspan, shouldhelpgc bool) {
11	s = c.alloc[spc]
12	shouldhelpgc = false
13	freeIndex := s.nextFreeIndex()
14	if freeIndex == s.nelems {
15		// The span is full.
16		if uintptr(s.allocCount) != s.nelems {
17			println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
18			throw("s.allocCount != s.nelems && freeIndex == s.nelems")
19		}
20		c.refill(spc)
21		shouldhelpgc = true
22		s = c.alloc[spc]
23
24		freeIndex = s.nextFreeIndex()
25	}
26
27	if freeIndex >= s.nelems {
28		throw("freeIndex is not valid")
29	}
30
31	v = gclinkptr(freeIndex*s.elemsize + s.base())
32	s.allocCount++
33	if uintptr(s.allocCount) > s.nelems {
34		println("s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
35		throw("s.allocCount > s.nelems")
36	}
37	return
38}

mcache refills from mcentral, and if mcentral itself is short it grows from mheap:

 1// refill acquires a new span of span class spc for c. This span will
 2// have at least one free object. The current span in c must be full.
 3//
 4// Must run in a non-preemptible context since otherwise the owner of
 5// c could change.
 6func (c *mcache) refill(spc spanClass) {
 7	// Return the current cached span to the central lists.
 8	s := c.alloc[spc]
 9
10	if uintptr(s.allocCount) != s.nelems {
11		throw("refill of span with free space remaining")
12	}
13	if s != &emptymspan {
14		// Mark this span as no longer cached.
15		if s.sweepgen != mheap_.sweepgen+3 {
16			throw("bad sweepgen in refill")
17		}
18		atomic.Store(&s.sweepgen, mheap_.sweepgen)
19	}
20
21	// Get a new cached span from the central lists.
22	s = mheap_.central[spc].mcentral.cacheSpan()
23	if s == nil {
24		throw("out of memory")
25	}
26
27	if uintptr(s.allocCount) == s.nelems {
28		throw("span has no free space")
29	}
30
31	// Indicate that this span is cached and prevent asynchronous
32	// sweeping in the next sweep phase.
33	s.sweepgen = mheap_.sweepgen + 3
34
35	c.alloc[spc] = s
36}
  1
  2// Allocate a span to use in an mcache.
  3func (c *mcentral) cacheSpan() *mspan {
  4	// Deduct credit for this span allocation and sweep if necessary.
  5	spanBytes := uintptr(class_to_allocnpages[c.spanclass.sizeclass()]) * _PageSize
  6	deductSweepCredit(spanBytes, 0)
  7
  8	lock(&c.lock)
  9	traceDone := false
 10	if trace.enabled {
 11		traceGCSweepStart()
 12	}
 13	sg := mheap_.sweepgen
 14retry:
 15	var s *mspan
 16	for s = c.nonempty.first; s != nil; s = s.next {
 17		if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
 18			c.nonempty.remove(s)
 19			c.empty.insertBack(s)
 20			unlock(&c.lock)
 21			s.sweep(true)
 22			goto havespan
 23		}
 24		if s.sweepgen == sg-1 {
 25			// the span is being swept by background sweeper, skip
 26			continue
 27		}
 28		// we have a nonempty span that does not require sweeping, allocate from it
 29		c.nonempty.remove(s)
 30		c.empty.insertBack(s)
 31		unlock(&c.lock)
 32		goto havespan
 33	}
 34
 35	for s = c.empty.first; s != nil; s = s.next {
 36		if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
 37			// we have an empty span that requires sweeping,
 38			// sweep it and see if we can free some space in it
 39			c.empty.remove(s)
 40			// swept spans are at the end of the list
 41			c.empty.insertBack(s)
 42			unlock(&c.lock)
 43			s.sweep(true)
 44			freeIndex := s.nextFreeIndex()
 45			if freeIndex != s.nelems {
 46				s.freeindex = freeIndex
 47				goto havespan
 48			}
 49			lock(&c.lock)
 50			// the span is still empty after sweep
 51			// it is already in the empty list, so just retry
 52			goto retry
 53		}
 54		if s.sweepgen == sg-1 {
 55			// the span is being swept by background sweeper, skip
 56			continue
 57		}
 58		// already swept empty span,
 59		// all subsequent ones must also be either swept or in process of sweeping
 60		break
 61	}
 62	if trace.enabled {
 63		traceGCSweepDone()
 64		traceDone = true
 65	}
 66	unlock(&c.lock)
 67
 68	// Replenish central list if empty.
 69	s = c.grow()
 70	if s == nil {
 71		return nil
 72	}
 73	lock(&c.lock)
 74	c.empty.insertBack(s)
 75	unlock(&c.lock)
 76
 77	// At this point s is a non-empty span, queued at the end of the empty list,
 78	// c is unlocked.
 79havespan:
 80	if trace.enabled && !traceDone {
 81		traceGCSweepDone()
 82	}
 83	n := int(s.nelems) - int(s.allocCount)
 84	if n == 0 || s.freeindex == s.nelems || uintptr(s.allocCount) == s.nelems {
 85		throw("span has no free objects")
 86	}
 87	// Assume all objects from this span will be allocated in the
 88	// mcache. If it gets uncached, we'll adjust this.
 89	atomic.Xadd64(&c.nmalloc, int64(n))
 90	usedBytes := uintptr(s.allocCount) * s.elemsize
 91	atomic.Xadd64(&memstats.heap_live, int64(spanBytes)-int64(usedBytes))
 92	if trace.enabled {
 93		// heap_live changed.
 94		traceHeapAlloc()
 95	}
 96	if gcBlackenEnabled != 0 {
 97		// heap_live changed.
 98		gcController.revise()
 99	}
100	freeByteBase := s.freeindex &^ (64 - 1)
101	whichByte := freeByteBase / 8
102	// Init alloc bits cache.
103	s.refillAllocCache(whichByte)
104
105	// Adjust the allocCache so that s.freeindex corresponds to the low bit in
106	// s.allocCache.
107	s.allocCache >>= s.freeindex % 64
108
109	return s
110}
111
112// Return span from an mcache.
113func (c *mcentral) uncacheSpan(s *mspan) {
114	if s.allocCount == 0 {
115		throw("uncaching span but s.allocCount == 0")
116	}
117
118	sg := mheap_.sweepgen
119	stale := s.sweepgen == sg+1
120	if stale {
121		// Span was cached before sweep began. It's our
122		// responsibility to sweep it.
123		//
124		// Set sweepgen to indicate it's not cached but needs
125		// sweeping and can't be allocated from. sweep will
126		// set s.sweepgen to indicate s is swept.
127		atomic.Store(&s.sweepgen, sg-1)
128	} else {
129		// Indicate that s is no longer cached.
130		atomic.Store(&s.sweepgen, sg)
131	}
132
133	n := int(s.nelems) - int(s.allocCount)
134	if n > 0 {
135		// cacheSpan updated alloc assuming all objects on s
136		// were going to be allocated. Adjust for any that
137		// weren't. We must do this before potentially
138		// sweeping the span.
139		atomic.Xadd64(&c.nmalloc, -int64(n))
140
141		lock(&c.lock)
142		c.empty.remove(s)
143		c.nonempty.insert(s)
144		if !stale {
145			// mCentral_CacheSpan conservatively counted
146			// unallocated slots in heap_live. Undo this.
147			//
148			// If this span was cached before sweep, then
149			// heap_live was totally recomputed since
150			// caching this span, so we don't do this for
151			// stale spans.
152			atomic.Xadd64(&memstats.heap_live, -int64(n)*int64(s.elemsize))
153		}
154		unlock(&c.lock)
155	}
156
157	if stale {
158		// Now that s is in the right mcentral list, we can
159		// sweep it.
160		s.sweep(false)
161	}
162}
163
164// freeSpan updates c and s after sweeping s.
165// It sets s's sweepgen to the latest generation,
166// and, based on the number of free objects in s,
167// moves s to the appropriate list of c or returns it
168// to the heap.
169// freeSpan reports whether s was returned to the heap.
170// If preserve=true, it does not move s (the caller
171// must take care of it).
172func (c *mcentral) freeSpan(s *mspan, preserve bool, wasempty bool) bool {
173	if sg := mheap_.sweepgen; s.sweepgen == sg+1 || s.sweepgen == sg+3 {
174		throw("freeSpan given cached span")
175	}
176	s.needzero = 1
177
178	if preserve {
179		// preserve is set only when called from (un)cacheSpan above,
180		// the span must be in the empty list.
181		if !s.inList() {
182			throw("can't preserve unlinked span")
183		}
184		atomic.Store(&s.sweepgen, mheap_.sweepgen)
185		return false
186	}
187
188	lock(&c.lock)
189
190	// Move to nonempty if necessary.
191	if wasempty {
192		c.empty.remove(s)
193		c.nonempty.insert(s)
194	}
195
196	// delay updating sweepgen until here. This is the signal that
197	// the span may be used in an mcache, so it must come after the
198	// linked list operations above (actually, just after the
199	// lock of c above.)
200	atomic.Store(&s.sweepgen, mheap_.sweepgen)
201
202	if s.allocCount != 0 {
203		unlock(&c.lock)
204		return false
205	}
206
207	c.nonempty.remove(s)
208	unlock(&c.lock)
209	mheap_.freeSpan(s, false)
210	return true
211}
212
213// grow allocates a new empty span from the heap and initializes it for c's size class.
214func (c *mcentral) grow() *mspan {
215	npages := uintptr(class_to_allocnpages[c.spanclass.sizeclass()])
216	size := uintptr(class_to_size[c.spanclass.sizeclass()])
217	// here mcentral requests a new span from mheap
218	s := mheap_.alloc(npages, c.spanclass, false, true)
219	if s == nil {
220		return nil
221	}
222
223	// Use division by multiplication and shifts to quickly compute:
224	// n := (npages << _PageShift) / size
225	n := (npages << _PageShift) >> s.divShift * uintptr(s.divMul) >> s.divShift2
226	s.limit = s.base() + size*n
227	heapBitsForAddr(s.base()).initSpan(s)
228	return s
229}

If mheap itself runs short, it asks the OS for more memory. Continuing from mheap_.alloc() called above:

  1// alloc allocates a new span of npage pages from the GC'd heap.
  2//
  3// Either large must be true or spanclass must indicates the span's
  4// size class and scannability.
  5//
  6// If needzero is true, the memory for the returned span will be zeroed.
  7func (h *mheap) alloc(npage uintptr, spanclass spanClass, large bool, needzero bool) *mspan {
  8	// Don't do any operations that lock the heap on the G stack.
  9	// It might trigger stack growth, and the stack growth code needs
 10	// to be able to allocate heap.
 11	var s *mspan
 12	systemstack(func() {
 13		s = h.alloc_m(npage, spanclass, large)
 14	})
 15
 16	if s != nil {
 17		if needzero && s.needzero != 0 {
 18			memclrNoHeapPointers(unsafe.Pointer(s.base()), s.npages<<_PageShift)
 19		}
 20		s.needzero = 0
 21	}
 22	return s
 23}
 24
 25// alloc_m is the internal implementation of mheap.alloc.
 26//
 27// alloc_m must run on the system stack because it locks the heap, so
 28// any stack growth during alloc_m would self-deadlock.
 29//
 30//go:systemstack
 31func (h *mheap) alloc_m(npage uintptr, spanclass spanClass, large bool) *mspan {
 32	_g_ := getg()
 33
 34	// To prevent excessive heap growth, before allocating n pages
 35	// we need to sweep and reclaim at least n pages.
 36	if h.sweepdone == 0 {
 37		h.reclaim(npage)
 38	}
 39
 40	lock(&h.lock)
 41	// transfer stats from cache to global
 42	memstats.heap_scan += uint64(_g_.m.mcache.local_scan)
 43	_g_.m.mcache.local_scan = 0
 44	memstats.tinyallocs += uint64(_g_.m.mcache.local_tinyallocs)
 45	_g_.m.mcache.local_tinyallocs = 0
 46
 47	s := h.allocSpanLocked(npage, &memstats.heap_inuse)
 48	if s != nil {
 49		// Record span info, because gc needs to be
 50		// able to map interior pointer to containing span.
 51		atomic.Store(&s.sweepgen, h.sweepgen)
 52		h.sweepSpans[h.sweepgen/2%2].push(s) // Add to swept in-use list.
 53		s.state = mSpanInUse
 54		s.allocCount = 0
 55		s.spanclass = spanclass
 56		if sizeclass := spanclass.sizeclass(); sizeclass == 0 {
 57			s.elemsize = s.npages << _PageShift
 58			s.divShift = 0
 59			s.divMul = 0
 60			s.divShift2 = 0
 61			s.baseMask = 0
 62		} else {
 63			s.elemsize = uintptr(class_to_size[sizeclass])
 64			m := &class_to_divmagic[sizeclass]
 65			s.divShift = m.shift
 66			s.divMul = m.mul
 67			s.divShift2 = m.shift2
 68			s.baseMask = m.baseMask
 69		}
 70
 71		// Mark in-use span in arena page bitmap.
 72		arena, pageIdx, pageMask := pageIndexOf(s.base())
 73		arena.pageInUse[pageIdx] |= pageMask
 74
 75		// update stats, sweep lists
 76		h.pagesInUse += uint64(npage)
 77		if large {
 78			memstats.heap_objects++
 79			mheap_.largealloc += uint64(s.elemsize)
 80			mheap_.nlargealloc++
 81			atomic.Xadd64(&memstats.heap_live, int64(npage<<_PageShift))
 82		}
 83	}
 84	// heap_scan and heap_live were updated.
 85	if gcBlackenEnabled != 0 {
 86		gcController.revise()
 87	}
 88
 89	if trace.enabled {
 90		traceHeapAlloc()
 91	}
 92
 93	// h.spans is accessed concurrently without synchronization
 94	// from other threads. Hence, there must be a store/store
 95	// barrier here to ensure the writes to h.spans above happen
 96	// before the caller can publish a pointer p to an object
 97	// allocated from s. As soon as this happens, the garbage
 98	// collector running on another processor could read p and
 99	// look up s in h.spans. The unlock acts as the barrier to
100	// order these writes. On the read side, the data dependency
101	// between p and the index in h.spans orders the reads.
102	unlock(&h.lock)
103	return s
104}
105
106// Try to add at least npage pages of memory to the heap,
107// returning whether it worked.
108//
109// h must be locked.
110func (h *mheap) grow(npage uintptr) bool {
111	ask := npage << _PageShift
112
113	nBase := round(h.curArena.base+ask, physPageSize)
114	if nBase > h.curArena.end {
115		// Not enough room in the current arena. Allocate more
116		// arena space. This may not be contiguous with the
117		// current arena, so we have to request the full ask.
118		av, asize := h.sysAlloc(ask) // sysAlloc eventually makes a system call (mmap, or VirtualAlloc on Windows, much like the initialization path) to request memory from the OS
119		if av == nil {
120			print("runtime: out of memory: cannot allocate ", ask, "-byte block (", memstats.heap_sys, " in use)\n")
121			return false
122		}
123
124		if uintptr(av) == h.curArena.end {
125			// The new space is contiguous with the old
126			// space, so just extend the current space.
127			h.curArena.end = uintptr(av) + asize
128		} else {
129			// The new space is discontiguous. Track what
130			// remains of the current space and switch to
131			// the new space. This should be rare.
132			if size := h.curArena.end - h.curArena.base; size != 0 {
133				h.growAddSpan(unsafe.Pointer(h.curArena.base), size)
134			}
135			// Switch to the new space.
136			h.curArena.base = uintptr(av)
137			h.curArena.end = uintptr(av) + asize
138		}
139
140		// The memory just allocated counts as both released
141		// and idle, even though it's not yet backed by spans.
142		//
143		// The allocation is always aligned to the heap arena
144		// size which is always > physPageSize, so its safe to
145		// just add directly to heap_released. Coalescing, if
146		// possible, will also always be correct in terms of
147		// accounting, because s.base() must be a physical
148		// page boundary.
149		memstats.heap_released += uint64(asize)
150		memstats.heap_idle += uint64(asize)
151
152		// Recalculate nBase
153		nBase = round(h.curArena.base+ask, physPageSize)
154	}
155
156	// Grow into the current arena.
157	v := h.curArena.base
158	h.curArena.base = nBase
159	h.growAddSpan(unsafe.Pointer(v), nBase-v)
160	return true
161}

The whole call chain is shown above; at the very end, sysAlloc issues a system call (mmap, or VirtualAlloc on Windows, much like the initialization path) to request memory from the operating system.

5. 内存回收

5.1 mcache 回收

Reclaiming an mcache can be split into two parts: the first returns the unused memory held in alloc to the corresponding mcentral.

 1func freemcache(c *mcache) {
 2    systemstack(func() {
 3        c.releaseAll()
 4        ...
 5
 6        lock(&mheap_.lock)
 7        purgecachedstats(c)
 8        mheap_.cachealloc.free(unsafe.Pointer(c))
 9        unlock(&mheap_.lock)
10    })
11}
12func (c *mcache) releaseAll() {
13    for i := 0; i < _NumSizeClasses; i++ {
14        s := c.alloc[i]
15        if s != &emptymspan {
16            mheap_.central[i].mcentral.uncacheSpan(s)
17            c.alloc[i] = &emptymspan
18        }
19    }
20    // Clear tinyalloc pool.
21    c.tiny = 0
22    c.tinyoffset = 0
23}

releaseAll() returns the mspan of every sizeclass in mcache.alloc to its mcentral. Note that returning a span to mcentral requires taking a lock, because mcentral is shared globally. The second part hands the remaining mcache (by now essentially an empty shell) back to mheap.cachealloc, which simply pushes it onto the head of a free list:

1func (f *fixalloc) free(p unsafe.Pointer) {
2    f.inuse -= f.size
3    v := (*mlink)(p)
4    v.next = f.list
5    f.list = v
6}

5.2 mcentral 回收

When an mspan no longer has any objects in use (allocCount == 0), mcentral returns it to mheap.

 1func (c *mcentral) freeSpan(s *mspan, preserve bool, wasempty bool) bool {
 2    ...
 3    lock(&c.lock)
 4    ...   
 5    if s.allocCount != 0 {
 6        unlock(&c.lock)
 7        return false
 8    }
 9
10    c.nonempty.remove(s)
11    unlock(&c.lock)
12    mheap_.freeSpan(s, false)
13    return true
14}

5.3 mheap

mheap does not hand freed spans straight back to the operating system; it keeps them for reuse and performs some bookkeeping on them, such as coalescing adjacent free spans. (Unused memory is only released to the OS lazily, by the runtime's background scavenger.) A toy sketch of coalescing follows.
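
A toy sketch of the idea behind coalescing, using a plain slice instead of the runtime's real free-span data structures; every name here is illustrative:

package main

import "fmt"

// freeSpan is a toy stand-in for a free mspan: it covers pages [start, start+npages).
type freeSpan struct {
	start, npages uintptr
}

// coalesce merges s with any spans in free that are directly adjacent to it,
// returning the merged span and the remaining free list.
func coalesce(s freeSpan, free []freeSpan) (freeSpan, []freeSpan) {
	rest := free[:0]
	for _, f := range free {
		switch {
		case f.start+f.npages == s.start: // f sits immediately before s
			s.start = f.start
			s.npages += f.npages
		case s.start+s.npages == f.start: // f sits immediately after s
			s.npages += f.npages
		default:
			rest = append(rest, f)
		}
	}
	return s, rest
}

func main() {
	free := []freeSpan{{start: 0, npages: 2}, {start: 6, npages: 3}}
	s, free := coalesce(freeSpan{start: 2, npages: 4}, free)
	fmt.Println(s, free) // {0 9} [] : all three spans merged into one
}

In a well-maintained free list each side of a span has at most one adjacent neighbour, so a single pass like this is enough.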

3. 总结

tcmalloc is a design on paper; putting it into practice still means solving plenty of engineering problems. Reading the Golang source teaches you not only how the allocator works but also many interesting tricks along the way, such as padding structs to a cache line to avoid false sharing, or using a de Bruijn sequence to compute trailing zeros (as sys.Ctz64 does; a sketch of the trick is shown below). That, I think, is what makes reading source code worthwhile.
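
A minimal sketch of the de Bruijn trick, using the widely known constant 0x03f79d71b4cb0a89 (not necessarily the exact constant the runtime uses) and building the lookup table at init; x must be non-zero:

package main

import "fmt"

// deBruijn64 is a 64-bit de Bruijn sequence: every 6-bit window in it is unique.
// (A constant widely used for bit-scan tricks; the runtime's constant may differ.)
const deBruijn64 uint64 = 0x03f79d71b4cb0a89

// tab maps the top 6 bits of deBruijn64<<k back to k.
var tab [64]uint8

func init() {
	for i := uint(0); i < 64; i++ {
		tab[(deBruijn64<<i)>>58] = uint8(i)
	}
}

// ctz64 returns the number of trailing zero bits of x; x must be non-zero.
func ctz64(x uint64) int {
	// x & -x isolates the lowest set bit (1<<k); multiplying by the de Bruijn
	// constant is then just a left shift by k, and the top 6 bits identify k.
	return int(tab[((x&-x)*deBruijn64)>>58])
}

func main() {
	fmt.Println(ctz64(0b1100), ctz64(1<<63)) // 2 63
}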

参考

Golang的内存管理(上篇) Golang的内存管理(中篇) Golang的内存管理(下篇)

Memory Allocation

Virtual Memory

Virtual Memory.pdf http://csapp.cs.cmu.edu/2e/ch9-preview.pdf

内存分配

The allocation functions above live inside the Go runtime and cannot be used from normal code. If you need to allocate memory from the OS yourself, use a small library such as https://github.com/smasher164/mem; excerpts from it are shown below (helpers like header and szheader referenced in munmap are defined elsewhere in that package).

 1package mem
 2
 3import (
 4	"unsafe"
 5
 6	"golang.org/x/sys/unix"
 7)
 8
 9type slice struct {
10	addr unsafe.Pointer
11	len  int
12	cap  int
13}
14
15func mmap(size uint) (unsafe.Pointer, error) {
16	b, err := unix.Mmap(-1, 0, int(size), unix.PROT_READ|unix.PROT_WRITE, unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
17	if err != nil {
18		return nil, err
19	}
20	sl := (*slice)(unsafe.Pointer(&b))
21	return sl.addr, nil
22}
23
24func munmap(p unsafe.Pointer) error {
25	size := int(((*header)(p)).size + szheader)
26	b := *(*[]byte)(unsafe.Pointer(&slice{p, size, size}))
27	return unix.Munmap(b)
28}

Windows

 1package mem
 2
 3import (
 4	"unsafe"
 5
 6	"golang.org/x/sys/windows"
 7)
 8
 9func mmap(size uint) (unsafe.Pointer, error) {
10	p, err := windows.VirtualAlloc(0, uintptr(size), windows.MEM_COMMIT|windows.MEM_RESERVE, windows.PAGE_READWRITE)
11	if err != nil {
12		return nil, err
13	}
14	return unsafe.Pointer(p), nil
15}
16
17func munmap(p unsafe.Pointer) error {
18	return windows.VirtualFree(uintptr(p), 0, windows.MEM_RELEASE)
19}