Linux Kernel Basics, Part 2: The Physical Memory Mapping Region

2022-10-12

kernel_pwn

Updated 2022-10-13
PS: Readers should first work through Linux Kernel Basics, Part 1, and complete its exercises before studying this post.

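direct mapping of all physical memory
Linux kernel memory allocation functions
task_struct
Process kernel stack
Switching between kernel mode and user mode (again)
- User mode -> kernel mode
- Kernel mode -> user mode
references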

direct mapping of all physical memory

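The official virtual memory layout for 64-bit Linux is documented at https://elixir.bootlin.com/linux/latest/source/Documentation/x86/x86_64/mm.rst

This section focuses on one entry in that layout.

The entry is direct mapping of all physical memory (page_offset_base). It means that this range of virtual memory directly maps all of physical memory. In other words, addresses in this region are linearly related to physical addresses (virtual_addr = physical_addr + 0xffff888000000000 when page_offset_base has its default value, i.e. 4-level paging without KASLR).
It also follows that, with this layout, 64-bit Linux supports at most 64 TB of physical memory, since that is the size of the region.
Because the direct mapping lives in the kernel's virtual address space, the physical memory behind any user-space virtual address also has a kernel-space virtual address. That is, the same physical memory can be read and written through both a user-space virtual address and a kernel-space virtual address.
This enables an attack: write shellcode or a ROP chain at a user-space virtual address, then, by leaking the matching kernel-space address, get that payload used from kernel mode. It is a very powerful technique.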
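The linear relationship above can be sketched in plain user-space C. This is only an illustration of the arithmetic: the helper names are hypothetical, and the base constant assumes 4-level paging with KASLR disabled (on a real target the base would have to be leaked or read at run time).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default page_offset_base (4-level paging, no KASLR). */
#define DIRECT_MAP_BASE 0xffff888000000000ULL
#define DIRECT_MAP_SIZE (64ULL << 40)   /* the region spans 64 TB */

/* Hypothetical helper: 1 if virt lies inside the direct-mapping region. */
static int in_directmap(uint64_t virt)
{
    return virt >= DIRECT_MAP_BASE &&
           virt - DIRECT_MAP_BASE < DIRECT_MAP_SIZE;
}

/* Hypothetical helper: physical address -> direct-map virtual address. */
static uint64_t phys_to_directmap(uint64_t phys)
{
    assert(phys < DIRECT_MAP_SIZE);
    return phys + DIRECT_MAP_BASE;
}

/* Hypothetical helper: direct-map virtual address -> physical address. */
static uint64_t directmap_to_phys(uint64_t virt)
{
    assert(in_directmap(virt));
    return virt - DIRECT_MAP_BASE;
}
```

Under these assumptions, physical address 0x1000 is visible in the kernel at 0xffff888000001000, and the first address past the region is 0xffffc88000000000.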

Linux kernel memory allocation functions

kmalloc

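Memory returned by kmalloc lies in the direct mapping of all physical memory region. It is physically contiguous, and its virtual address differs from the real physical address by a constant, so converting between the two is simple. A single allocation traditionally cannot exceed 128 KB (the exact upper bound depends on the kernel version and configuration).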

kzalloc

kzalloc = kmalloc + zeroing the allocated memory

vmalloc

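vmalloc() returns a region that is contiguous in virtual memory but not necessarily contiguous in physical memory. Because vmalloc() does not have to find physically contiguous pages, it is not subject to kmalloc's size limit, so it is the function to use when a large allocation is needed.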

__get_free_pages:

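Like kmalloc, __get_free_pages returns virtual addresses in the direct mapping of all physical memory region. It is the lowest-level allocation function offered to callers, is built on the buddy system, and likewise returns physically contiguous memory. Its allocation granularity is a page.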
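Since the granularity is a page, __get_free_pages takes an order (the allocation is 2^order pages) rather than a byte count. The user-space sketch below shows how a byte size maps to an order; size_to_order is a hypothetical stand-in for the idea behind the kernel's get_order(), and both the 4 KiB page size and the order-10 cap are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                    /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define SKETCH_MAX_ORDER 10              /* assumption: cap used only by this sketch */

/* Hypothetical helper: smallest order with (PAGE_SIZE << order) >= size. */
static unsigned int size_to_order(size_t size)
{
    unsigned int order = 0;

    assert(size > 0 && size <= (PAGE_SIZE << SKETCH_MAX_ORDER));
    while ((PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

In kernel code the result would be passed along the lines of __get_free_pages(GFP_KERNEL, order). A request for 4097 bytes already needs order 1, i.e. two full pages.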

task_struct

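To manage and track running processes conveniently, an OS uses a data structure called the PCB (Process Control Block) to record all information about a process.
When a process starts, the OS creates a PCB for it (declared as task_struct in Linux), and the PCB is destroyed only after the process ends.
task_struct is defined in <include/linux/sched.h>.
The source can be read at: https://elixir.bootlin.com/linux/v6.0-rc5/source/include/linux/sched.h
Because task_struct records everything about a process, its layout is extremely complex. For this post we only need to look at two of its members: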

struct task_struct {
    struct thread_info	thread_info;
    ...
    void* stack;
    ...
};

In the thread_info structure (as defined in older kernels), we need to know one member:

struct thread_info{
    ...
    struct task_struct * task;
};

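What these members are used for is explained below.

Process kernel stack

As is well known, the stack is an indispensable structure in modern computer systems. A process actually has two stacks: a user stack and a kernel stack. User-mode code runs on the user stack. When the process enters kernel mode, for example through a system call, kernel code runs on the kernel stack.
The kernel stack is defined as follows: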

union thread_union {
#ifndef CONFIG_ARCH_TASK_STRUCT_ON_STACK
	struct task_struct task;
#endif
#ifndef CONFIG_THREAD_INFO_IN_TASK
	struct thread_info thread_info;
#endif
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};

THREAD_SIZE is defined per architecture as follows (for x86, see arch/x86/include/asm/page_64_types.h and arch/x86/include/asm/page_32_types.h):

// ARM, 8K
#define THREAD_SIZE_ORDER	1
#define THREAD_SIZE		(PAGE_SIZE << THREAD_SIZE_ORDER)
#define THREAD_START_SP		(THREAD_SIZE - 8)

// ARM64, 16K
#define THREAD_SIZE		16384
#define THREAD_START_SP		(THREAD_SIZE - 16)

//X86_64, 16K
#define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

pt_regs

pt_regs is a register-set structure used to save the context when switching between user mode and kernel mode.

struct pt_regs { //x86 32bits
	unsigned long bx;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
	unsigned long bp;
	unsigned long ax;
	unsigned short ds;
	unsigned short __dsh;
	unsigned short es;
	unsigned short __esh;
	unsigned short fs;
	unsigned short __fsh;
	/*
	 * On interrupt, gs and __gsh store the vector number.  They never
	 * store gs any more.
	 */
	unsigned short gs;
	unsigned short __gsh;
	/* On interrupt, this is the error code. */
	unsigned long orig_ax;
	unsigned long ip;
	unsigned short cs;
	unsigned short __csh;
	unsigned long flags;
	unsigned long sp;
	unsigned short ss;
	unsigned short __ssh;
};

The 64-bit definition is as follows:

struct pt_regs { //x86-64 64bits
/*
 * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
 * unless syscall needs a complete, fully filled "struct pt_regs".
 */
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;
	unsigned long bx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
/*
 * On syscall entry, this is syscall#. On CPU exception, this is error code.
 * On hw interrupt, it's IRQ number:
 */
	unsigned long orig_ax;
/* Return frame for iretq */
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
/* top of stack page */
};
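To make the 64-bit layout concrete, here is a small user-space mirror of it. pt_regs_sketch and iret_frame are hypothetical names for this illustration, and uint64_t stands in for unsigned long. The structure has 21 eight-byte slots (168 bytes), and the five slots consumed by iretq begin at offset 128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the x86-64 pt_regs layout above. */
struct pt_regs_sketch {
    uint64_t r15, r14, r13, r12, bp, bx;             /* callee-preserved */
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;   /* callee-clobbered */
    uint64_t orig_ax;                                /* syscall nr / error code / IRQ */
    uint64_t ip, cs, flags, sp, ss;                  /* return frame for iretq */
};

/* 21 slots of 8 bytes each. */
_Static_assert(sizeof(struct pt_regs_sketch) == 21 * 8,
               "unexpected pt_regs_sketch size");

/* Hypothetical helper: pointer to the start of the iretq return frame. */
static const uint64_t *iret_frame(const struct pt_regs_sketch *regs)
{
    assert(regs != NULL);
    return &regs->ip;
}
```

The return frame sits at the highest addresses of the structure, matching the "top of stack page" comment in the kernel definition.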

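Since execution has to switch between the user stack and the kernel stack, the OS needs to know where both are, that is, how to locate the user stack and the kernel stack.

Finding the kernel stack from task_struct (32-bit)

As mentioned earlier, task_struct contains a void *stack member. The kernel stack is found with the following code: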

static inline void *task_stack_page(const struct task_struct *task)
{
	return task->stack;
}

The location of pt_regs can then be computed with the following code:

//processor.h	(arch\x86\include\asm)
#define task_pt_regs(task) \
({									\
	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
	((struct pt_regs *)__ptr) - 1;					\
})

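The code above also shows that pt_regs sits at the high-address end of the kernel stack, just below the top-of-stack padding. task_struct therefore gives convenient access to both the kernel stack and pt_regs, as shown in the figure below.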
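The pointer arithmetic in task_pt_regs can be replayed with plain integers. The sketch below is a user-space illustration only: pt_regs_addr is a hypothetical helper, and the stack size, padding and structure size are passed in as parameters because their real values depend on architecture and kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring task_pt_regs():
 *   top     = stack_page + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING
 *   pt_regs = top - sizeof(struct pt_regs)   (the "- 1" in the macro)
 */
static uint64_t pt_regs_addr(uint64_t stack_page, uint64_t thread_size,
                             uint64_t padding, uint64_t regs_size)
{
    assert(thread_size >= padding + regs_size);
    return stack_page + thread_size - padding - regs_size;
}
```

For example, assuming a 16 KiB stack at 0x10000, no padding, and a 168-byte pt_regs, the structure starts at 0x13f58 and ends exactly at the top of the stack region.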

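Finding task_struct from the kernel stack (32-bit)

As the figure above shows, the thread_info structure is stored at the low-address end of the kernel stack region, the end toward which the stack grows (on x86 the stack grows from high addresses to low addresses). thread_info holds a pointer to task_struct.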
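Because the stack region is THREAD_SIZE-aligned and thread_info sits at its lowest address, older 32-bit kernels could find thread_info simply by masking the current stack pointer. A minimal user-space sketch of that masking follows; thread_info_addr is a hypothetical name, and the 8 KiB stack size used in the example is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round a kernel stack pointer down to the start of
 * its thread_size-aligned stack region, where thread_info is stored.
 * thread_size must be a power of two for the mask to be valid.
 */
static uintptr_t thread_info_addr(uintptr_t sp, uintptr_t thread_size)
{
    assert(thread_size != 0 && (thread_size & (thread_size - 1)) == 0);
    return sp & ~(thread_size - 1);
}
```

With an 8 KiB stack, any stack pointer between 0xc1234000 and 0xc1235fff maps back to 0xc1234000. From there, the task member of thread_info leads to task_struct.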

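Locating task_struct and the kernel stack on 64-bit CPUs

On 64-bit, a per-CPU variable holds the pointer to the current task_struct, so thread_info is no longer needed for the lookup.
The figure below shows the difference between 32-bit and 64-bit:

Most of the figures come from the article 内核栈和用户栈 (Kernel Stack and User Stack).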

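Switching between kernel mode and user mode (again)

With these basics in place, let's take another look at how Linux switches between user mode and kernel mode.

User mode -> kernel mode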

Ways to enter kernel mode:

System calls

Exceptions (fault/trap)

Device interrupts

When any of these occurs, the OS mainly does the following:

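Switch the gs register with swapgs
Save the user stack information (store the user stack pointer in a per-CPU variable, then load the kernel stack pointer from a per-CPU variable into rsp/esp)
Save the user-mode registers (push each register onto the kernel stack, forming pt_regs)
Determine, via assembly instructions, whether the caller is 32-bit or 64-bit
Hand control to the kernel and execute the system call (through sys_call_table)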

Kernel mode -> user mode

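1. swapgs
2. iretq/sysretq; for iretq, the stack must hold the following values, in this order from the stack pointer upward:
	user_shell_addr (the user-mode rip to return to)
	user_cs
	user_eflags
	user_sp
	user_ss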
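The five values listed for iretq can be laid out in memory as follows. This is a minimal sketch with hypothetical names (iret_frame, push_iret_frame). It only fills a buffer in the order iretq pops it and does not execute anything privileged. The selector values in the usage note are the ones commonly seen for 64-bit user code on Linux and are an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the five values iretq pops, in pop order. */
struct iret_frame {
    uint64_t rip;      /* user_shell_addr: where user-mode execution resumes */
    uint64_t cs;       /* user_cs */
    uint64_t rflags;   /* user_eflags */
    uint64_t rsp;      /* user_sp */
    uint64_t ss;       /* user_ss */
};

/*
 * Hypothetical helper: write the frame into chain[] starting at idx,
 * lowest address first, and return the index just past the frame.
 */
static size_t push_iret_frame(uint64_t *chain, size_t idx,
                              const struct iret_frame *f)
{
    assert(chain != NULL && f != NULL);
    chain[idx++] = f->rip;
    chain[idx++] = f->cs;
    chain[idx++] = f->rflags;
    chain[idx++] = f->rsp;
    chain[idx++] = f->ss;
    return idx;
}
```

A typical use would save cs (often 0x33), ss (often 0x2b), rflags and rsp while still in user mode, then place this frame directly after the swapgs and iretq steps so that rsp points at the rip slot when iretq runs.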

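references

arttnba3's blog
内核栈和用户栈 (Kernel Stack and User Stack)
kmalloc、kzalloc、vmalloc、__get_free_pages的区别 (The Differences Between kmalloc, kzalloc, vmalloc, and __get_free_pages)
Linux memory layout (official documentation)
一篇文章读懂mmap (Understanding mmap in One Article)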
