linux kernel — 从timer wheel到non-cascading timer

一些背景知识

linux低精度timer的时间轮(timer wheel)在kernel开发人员中可以说是人尽皆知了。它的基本思路就是将剩余时间不同的timer进行级联处理,每次定时器的中断到来的时候,首先检查最低一级是否有到期的timer,再检查是否需要将高一级timer移动到第一级timer上。这样的设计减少每次tick带来时需要检查的timer数量同时也能保证定时器准确的到期时间。

如今的kernel timer经过了重新设计,取消了级联,也就是所谓的non-cascading timer。通常来说我们对定时器精度的要求会随着定时周期长短而发生变化。定时的周期越长,往往我们对定时精度的要求也就越低。所以kernel当中就产生了根据定时周期长短来决定不同的定时精度的设计。

这里先贴上kernel当中的一段注释(timer.c的开头部分):

/*
* The timer wheel has LVL_DEPTH array levels. Each level provides an array of
* LVL_SIZE buckets. Each level is driven by its own clock and therefor each
* level has a different granularity.
*
* The level granularity is:        LVL_CLK_DIV ^ lvl
* The level clock frequency is:    HZ / (LVL_CLK_DIV ^ level)
*
* The array level of a newly armed timer depends on the relative expiry
* time. The farther the expiry time is away the higher the array level and
* therefor the granularity becomes.
*
* Contrary to the original timer wheel implementation, which aims for 'exact'
* expiry of the timers, this implementation removes the need for recascading
* the timers into the lower array levels. The previous 'classic' timer wheel
* implementation of the kernel already violated the 'exact' expiry by adding
* slack to the expiry time to provide batched expiration. The granularity
* levels provide implicit batching.
*
* This is an optimization of the original timer wheel implementation for the
* majority of the timer wheel use cases: timeouts. The vast majority of
* timeout timers (networking, disk I/O ...) are canceled before expiry. If
* the timeout expires it indicates that normal operation is disturbed, so it
* does not matter much whether the timeout comes with a slight delay.
*
* The only exception to this are networking timers with a small expiry
* time. They rely on the granularity. Those fit into the first wheel level,
* which has HZ granularity.
*
* We don't have cascading anymore. timers with a expiry time above the
* capacity of the last wheel level are force expired at the maximum timeout
* value of the last wheel level. From data sampling we know that the maximum
* value observed is 5 days (network connection tracking), so this should not
* be an issue.
*
* The currently chosen array constants values are a good compromise between
* array size and granularity.
*
* This results in the following granularity and range levels:
*
* HZ 1000 steps
* Level Offset  Granularity            Range
*  0      0         1 ms                0 ms -         63 ms
*  1     64         8 ms               64 ms -        511 ms
*  2    128        64 ms              512 ms -       4095 ms (512ms - ~4s)
*  3    192       512 ms             4096 ms -      32767 ms (~4s - ~32s)
*  4    256      4096 ms (~4s)      32768 ms -     262143 ms (~32s - ~4m)
*  5    320     32768 ms (~32s)    262144 ms -    2097151 ms (~4m - ~34m)
*  6    384    262144 ms (~4m)    2097152 ms -   16777215 ms (~34m - ~4h)
*  7    448   2097152 ms (~34m)  16777216 ms -  134217727 ms (~4h - ~1d)
*  8    512  16777216 ms (~4h)  134217728 ms - 1073741822 ms (~1d - ~12d)
*
* HZ  300
* Level Offset  Granularity            Range
*  0      0         3 ms                0 ms -        210 ms
*  1     64        26 ms              213 ms -       1703 ms (213ms - ~1s)
*  2    128       213 ms             1706 ms -      13650 ms (~1s - ~13s)
*  3    192      1706 ms (~1s)      13653 ms -     109223 ms (~13s - ~1m)
*  4    256     13653 ms (~13s)    109226 ms -     873810 ms (~1m - ~14m)
*  5    320    109226 ms (~1m)     873813 ms -    6990503 ms (~14m - ~1h)
*  6    384    873813 ms (~14m)   6990506 ms -   55924050 ms (~1h - ~15h)
*  7    448   6990506 ms (~1h)   55924053 ms -  447392423 ms (~15h - ~5d)
*  8    512  55924053 ms (~15h) 447392426 ms - 3579139406 ms (~5d - ~41d)
*
* HZ  250
* Level Offset  Granularity            Range
*  0      0         4 ms                0 ms -        255 ms
*  1     64        32 ms              256 ms -       2047 ms (256ms - ~2s)
*  2    128       256 ms             2048 ms -      16383 ms (~2s - ~16s)
*  3    192      2048 ms (~2s)      16384 ms -     131071 ms (~16s - ~2m)
*  4    256     16384 ms (~16s)    131072 ms -    1048575 ms (~2m - ~17m)
*  5    320    131072 ms (~2m)    1048576 ms -    8388607 ms (~17m - ~2h)
*  6    384   1048576 ms (~17m)   8388608 ms -   67108863 ms (~2h - ~18h)
*  7    448   8388608 ms (~2h)   67108864 ms -  536870911 ms (~18h - ~6d)
*  8    512  67108864 ms (~18h) 536870912 ms - 4294967288 ms (~6d - ~49d)
*
* HZ  100
* Level Offset  Granularity            Range
*  0      0         10 ms               0 ms -        630 ms
*  1     64         80 ms             640 ms -       5110 ms (640ms - ~5s)
*  2    128        640 ms            5120 ms -      40950 ms (~5s - ~40s)
*  3    192       5120 ms (~5s)     40960 ms -     327670 ms (~40s - ~5m)
*  4    256      40960 ms (~40s)   327680 ms -    2621430 ms (~5m - ~43m)
*  5    320     327680 ms (~5m)   2621440 ms -   20971510 ms (~43m - ~5h)
*  6    384    2621440 ms (~43m) 20971520 ms -  167772150 ms (~5h - ~1d)
*  7    448   20971520 ms (~5h) 167772160 ms - 1342177270 ms (~1d - ~15d)
*/

当tick频率为1000hz的时候,定时器精度分为1,8,64ms…等不同的等级,同时他们对应了0-63,64-511,512-4095ms等不同的定时长度。总结一下就是,在不同的level当中,精度以2^3为倍数递增,而范围以2^6为倍数递增。timer的相关代码中,非常重要的一点就是理解kernel如何进行timer到期的检查。

timer相关的代码可以分为两个部分,一部分是是对于timer的操作(包含了创建销毁等),另一部分就是kernel如何找到到期的timer并执行他们。当然timer的设计当中还考虑了很多细节问题,本文会暂时忽略这些不影响对timer框架理解的细节问题,或许以后会以单独的章节补充在本文的最后。

数据结构

kernel当中使用struct timer_list存储timer的数据。

  • entry是存储了所有timer的hlist的入口
  • expires为timer的到期时间
  • function是timer到期时执行的callback
  • flags保存了一些标志位
    struct timer_list {
/*
* All fields that change during normal runtime grouped to the
* same cacheline
*/
struct hlist_node   entry;
unsigned long       expires;
void            (*function)(struct timer_list *);
u32         flags;
#ifdef CONFIG_LOCKDEP
struct lockdep_map  lockdep_map;
#endif
};

为了管理timer,kernel当中定义了一个percpu变量,timer_base用来作为当前cpu的timer的管理结构。当kernel支持deferable timer时,每个CPU会有两个timer base结构,分别用来存储普通的timer和deferable timer。

  • running_timer用来表示当前正在执行的timer
  • clk存储当前时间
  • next_expiry表示下一个到期的timer时间
  • pending_map用一个bit表示一个timer哈希表,如果某个bit置位表示某个精度某个时间有timer注册
  • vectors给每个精度每个时间分配一个哈希表,用来记录这个时间到期的timer
    static DEFINE_PER_CPU(struct timer_base, timer_bases[NR_BASES]);
struct timer_base {
raw_spinlock_t      lock;
struct timer_list   *running_timer;
unsigned long       clk;
unsigned long       next_expiry;
unsigned int        cpu;
bool            is_idle;
bool            must_forward_clk;
DECLARE_BITMAP(pending_map, WHEEL_SIZE);
struct hlist_head   vectors[WHEEL_SIZE];
} ____cacheline_aligned

timer的创建和销毁

创建timer

kernel中我们可以动态或者静态的定义timer。

静态定义timer需要使用DEFINE_TIME宏。该宏主要是完成timer_list的初始化。

#define __TIMER_INITIALIZER(_function, _flags) {       \
.entry = { .next = TIMER_ENTRY_STATIC },    \
.function = (_function),            \
.flags = (_flags),              \
__TIMER_LOCKDEP_MAP_INITIALIZER(        \
__FILE__ ":" __stringify(__LINE__)) \
#define DEFINE_TIMER(_name, _function)             \
struct timer_list _name =               \
__TIMER_INITIALIZER(_function, 0

动态定义后需要使用timer_setup来初始化结构体中的内容。

/**
* timer_setup - prepare a timer for first use
* @timer: the timer in question
* @callback: the function to call when timer expires
* @flags: any TIMER_* flags
*
* Regular timer initialization should use either DEFINE_TIMER() above,
* or timer_setup(). For timers on the stack, timer_setup_on_stack() must
* be used and must be balanced with a call to destroy_timer_on_stack().
*/
#define timer_setup(timer, callback, flags)            \
__init_timer((timer), (callback), (flags)
#define __init_timer(_timer, _fn, _flags)              \
init_timer_key((_timer), (_fn), (_flags), NULL, NULL)
/**
* init_timer_key - initialize a timer
* @timer: the timer to be initialized
* @func: timer callback function
* @flags: timer flags
* @name: name of the timer
* @key: lockdep class key of the fake lock used for tracking timer
*       sync lock dependencies
*
* init_timer_key() must be done to a timer prior calling *any* of the
* other timer functions.
*/
void init_timer_key(struct timer_list *timer,
void (*func)(struct timer_list *), unsigned int flags,
const char *name, struct lock_class_key *key)
{
debug_init(timer);
do_init_timer(timer, func, flags, name, key);
}
static void do_init_timer(struct timer_list *timer,
void (*func)(struct timer_list *),
unsigned int flags,
const char *name, struct lock_class_key *key)
{
timer->entry.pprev = NULL;
timer->function = func;
timer->flags = flags | raw_smp_processor_id();
lockdep_init_map(&timer->lockdep_map, name, key, 0);
}

注册(修改)timer

注册timer可以使用add_timer函数。而注册timer和修改timer所需要做的工作其实是相似的。kernel中都是使用mod_timer函数实现的,而add_timer只是mod_timer的一个简单的封装。

/**
* add_timer - start a timer
* @timer: the timer to be added
*
* The kernel will do a ->function(@timer) callback from the
* timer interrupt at the ->expires point in the future. The
* current time is 'jiffies'.
*
* The timer's ->expires, ->function fields must be set prior calling this
* function.
*
* Timers with an ->expires field in the past will be executed in the next
* timer tick.
*/
void add_timer(struct timer_list *timer)
{
BUG_ON(timer_pending(timer));
mod_timer(timer, timer->expires);
}

修改timer使用的接口是mod_timer。通过注释可以知道,mod_timer相当于完成了以下的工作:del_timer,set expire,add_timer。

mod_timer首先需要判断当前的timer是否处于pending状态,timer pending指的就是当前的timer已经注册,并等待执行的状态。如果timer pending,则将timer移除,重新计算level后再进行注册。如果timer没有处于pending状态,则直接注册当前timer即可。具体的操作我们可以参考以下代码

/**
* mod_timer - modify a timer's timeout
* @timer: the timer to be modified
* @expires: new timeout in jiffies
*
* mod_timer() is a more efficient way to update the expire field of an
* active timer (if the timer is inactive it will be activated)
*
* mod_timer(timer, expires) is equivalent to:
*
*     del_timer(timer); timer->expires = expires; add_timer(timer);
*
* Note that if there are multiple unserialized concurrent users of the
* same timer, then mod_timer() is the only safe way to modify the timeout,
* since add_timer() cannot modify an already running timer.
*
* The function returns whether it has modified a pending timer or not.
* (ie. mod_timer() of an inactive timer returns 0, mod_timer() of an
* active timer returns 1.)
*/
int mod_timer(struct timer_list *timer, unsigned long expires)
{
return __mod_timer(timer, expires, 0);
}
static inline int
__mod_timer(struct timer_list *timer, unsigned long expires, unsigned int options)
{
struct timer_base *base, *new_base;
unsigned int idx = UINT_MAX;
unsigned long clk = 0, flags;
int ret = 0;
BUG_ON(!timer->function);
/*
* This is a common optimization triggered by the networking code - if
* the timer is re-modified to have the same timeout or ends up in the
* same array bucket then just return:
*/
/*检查timer是否处于pending状态*/
if (timer_pending(timer)) {
/*
* The downside of this optimization is that it can result in
* larger granularity than you would get from adding a new
* timer with this expiry.
*/
long diff = timer->expires - expires;
if (!diff)
return 1;
if (options & MOD_TIMER_REDUCE && diff <= 0)
return 1;
/*
* We lock timer base and calculate the bucket index right
* here. If the timer ends up in the same bucket, then we
* just update the expiry time and avoid the whole
* dequeue/enqueue dance.
*/
base = lock_timer_base(timer, &flags);
forward_timer_base(base);
/*
* 在timer base已经确定了之后,再次检查之前的条件
* 如果设定了MOD_TIMER_REDUCE的话,只允许提前timer不允许推后timer
*/
if (timer_pending(timer) && (options & MOD_TIMER_REDUCE) &&
time_before_eq(timer->expires, expires)) {
ret = 1;
goto out_unlock;
}
clk = base->clk;
//index指的就是timer在pending_map中是第几个bit
idx = calc_wheel_index(expires, clk);
/*
* Retrieve and compare the array index of the pending
* timer. If it matches set the expiry to the new value so a
* subsequent call will exit in the expires check above.
*/
/*
* 如果index发生变化的话,说明实际上的到期时间是没变的
* 所以只需要修改一下expire把新的时间记录下来即可
*/
if (idx == timer_get_idx(timer)) {
if (!(options & MOD_TIMER_REDUCE))
timer->expires = expires;
else if (time_after(timer->expires, expires))
timer->expires = expires;
ret = 1;
goto out_unlock;
}
} else {
base = lock_timer_base(timer, &flags);
forward_timer_base(base);
}
/*如果处于pending状态的话,把当前timer从hlist上删除*/
ret = detach_if_pending(timer, base, false);
if (!ret && (options & MOD_TIMER_PENDING_ONLY))
goto out_unlock;
/*
* 获取当前timer的timer base,在smp并开启了no hz common的系统上
* 每个cpu会存在两个timer base,
* deferrable timer会被存储在另一个timer base上
*此外,节约功耗,休眠的cpu有可能会将他的timer转移到另一个cpu上
*/
new_base = get_target_base(base, timer->flags);
/*
* 假如需要变更timer base
* 则向flag中写入TIMER_MIGRATION后更改base指针
*/
if (base != new_base) {
/*
* We are trying to schedule the timer on the new base.
* However we can't change timer's base while it is running,
* otherwise del_timer_sync() can't detect that the timer's
* handler yet has not finished. This also guarantees that the
* timer is serialized wrt itself.
*/
if (likely(base->running_timer != timer)) {
/* See the comment in lock_timer_base() */
timer->flags |= TIMER_MIGRATING;
raw_spin_unlock(&base->lock);
base = new_base;
raw_spin_lock(&base->lock);
WRITE_ONCE(timer->flags,
(timer->flags & ~TIMER_BASEMASK) | base->cpu);
forward_timer_base(base);
}
}
debug_activate(timer, expires);
timer->expires = expires;
/*
* If 'idx' was calculated above and the base time did not advance
* between calculating 'idx' and possibly switching the base, only
* enqueue_timer() and trigger_dyntick_cpu() is required. Otherwise
* we need to (re)calculate the wheel index via
* internal_add_timer().
*/
/*
* 如果设定了一个index,并且clk没有发生变化
* 则直接将timer加入hlist,否则的话则需要重新计算idx
*/
if (idx != UINT_MAX && clk == base->clk) {
enqueue_timer(base, timer, idx);
trigger_dyntick_cpu(base, timer);
} else {
internal_add_timer(base, timer);
}
out_unlock:
raw_spin_unlock_irqrestore(&base->lock, flags);
return ret;
}

几个与hlist操作相关的函数也列在下方做个参考,内容都很简单,就不做详细解释了。需要注意的是,在加入hlist和从hlist删除的时候,也要对pending_map做相应的操作。

static int detach_if_pending(struct timer_list *timer, struct timer_base *base,
bool clear_pending)
{
unsigned idx = timer_get_idx(timer);
if (!timer_pending(timer))
return 0;
if (hlist_is_singular_node(&timer->entry, base->vectors + idx))
__clear_bit(idx, base->pending_map);
detach_timer(timer, clear_pending);
return 1;
}
static inline void detach_timer(struct timer_list *timer, bool clear_pending)
{
struct hlist_node *entry = &timer->entry;
debug_deactivate(timer);
__hlist_del(entry);
if (clear_pending)
entry->pprev = NULL;
entry->next = LIST_POISON2;
}
static void enqueue_timer(struct timer_base *base, struct timer_list *timer,
unsigned int idx)
{
hlist_add_head(&timer->entry, base->vectors + idx);
__set_bit(idx, base->pending_map);
timer_set_idx(timer, idx);
}

而有关timer到期判定的关键,就在于pending_map上。还记得我们之前说的不同level情况下精度和范围的递增关系么。pending_map事实上就是以2^6(范围倍数关系)个bit为单位,把不同精度的timer组合起来。每个bit就代表了某种精度的某个事件是否有timer已经注册,并且提供了一个链表把这些timer组合起来。

pending_map如果用一个简单的示意图来表示大概就是这样的(其中不同的level就对应了不同的精度):

·····|— level 2,2^6 bits —|— level 1, 2^6 bits —|—level 0, 2^6 bits—|

所以在每个中断到来的时候,只需要根据当前的事件,推算出我们当前对应的是那个bit,如果是置位状态,执行链表上所有的任务即可。关于如何计算出需要检查的bit,在后面的代码中会再说到。

最后说说这个最关键的calc_wheel_index函数,由他计算出了我们的timer应该对应到pending_map的哪个bit上。首先需要根据到期时间距现在时间的长短得出timer属于哪个level。对时间粒度的处理上,kernel宁愿推后timer的执行也不愿意让timer提前执行,所以根据这个level我们舍弃掉低于这个level的精度的尾数再进一,注意如果刚好expire time是时间粒度的整数倍的话,其实是会延后到下一次对应粒度的时间去执行的(假如粒度是8ms,expire time也是8ms,实际执行的时间会和expire time为9ms的一样延后到16ms执行)。最后再移位到pending_map中属于这个level的位置上即可。

前面这部分很难用语言描述清楚,我们以1000HZ,clk=0, expire=200为例:

  1. 根据各level的范围,200ms在64-511ms之间,为level1
  2. level1的精度是8ms,所以首先处理进位,加上8。再进行移位,将208左移2^3位得到26。
  3. level1在pending_map中是从12^6+1个bit开始的,所以我们将上面得到的26再加上12^6这个偏移量,就是最后200ms这个timer所对应bit的位置了。

最后也附上相关的代码给大家参考 :

/* Clock divisor for the next level */
#define LVL_CLK_SHIFT  3
#define LVL_CLK_DIV    (1UL << LVL_CLK_SHIFT)
#define LVL_CLK_MASK   (LVL_CLK_DIV - 1)
#define LVL_SHIFT(n)   ((n) * LVL_CLK_SHIFT)
#define LVL_GRAN(n)    (1UL << LVL_SHIFT(n))
/*
* The time start value for each level to select the bucket at enqueue
* time.
*/
#define LVL_START(n)   ((LVL_SIZE - 1) << (((n) - 1) * LVL_CLK_SHIFT))
/* Size of each clock level */
#define LVL_BITS   6
#define LVL_SIZE   (1UL << LVL_BITS)
#define LVL_MASK   (LVL_SIZE - 1)
#define LVL_OFFS(n)    ((n) * LVL_SIZE
static inline unsigned calc_index(unsigned expires, unsigned lvl)
{
/*右移level对应时间粒度,加上LVL_GRAN(lvl)来处理进位*/
expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
/*把expire舍弃小于对应粒度之后的值加上该粒度在pending_map中对应的起始偏移*/
return LVL_OFFS(lvl) + (expires & LVL_MASK);
}
static int calc_wheel_index(unsigned long expires, unsigned long clk)
{
unsigned long delta = expires - clk;
unsigned int idx;
/*首先根据delta的大小判定应该是属于哪个level的*/
if (delta < LVL_START(1)) {
idx = calc_index(expires, 0);
} else if (delta < LVL_START(2)) {
idx = calc_index(expires, 1);
} else if (delta < LVL_START(3)) {
idx = calc_index(expires, 2);
} else if (delta < LVL_START(4)) {
idx = calc_index(expires, 3);
} else if (delta < LVL_START(5)) {
idx = calc_index(expires, 4);
} else if (delta < LVL_START(6)) {
idx = calc_index(expires, 5);
} else if (delta < LVL_START(7)) {
idx = calc_index(expires, 6);
} else if (LVL_DEPTH > 8 && delta < LVL_START(8)) {
idx = calc_index(expires, 7);
} else if ((long) delta < 0) {
idx = clk & LVL_MASK;
} else {
/*
* Force expire obscene large timeouts to expire at the
* capacity limit of the wheel.
*/
if (expires >= WHEEL_TIMEOUT_CUTOFF)
expires = WHEEL_TIMEOUT_MAX;
idx = calc_index(expires, LVL_DEPTH - 1);
}
return idx;
}

删除(销毁)timer

删除或者销毁timer通常有两个接口可以使用:del_timer和del_timer_sync。

del_timer_sync和del_timer的区别在于:del_timer_sync是个同步接口,在SMP系统中,它会等待timer handler执行完成再移除timer。

int del_timer(struct timer_list *timer)
{
struct timer_base *base;
unsigned long flags;
int ret = 0;
debug_assert_init(timer);
if (timer_pending(timer)) {
base = lock_timer_base(timer, &flags);
ret = detach_if_pending(timer, base, true);
raw_spin_unlock_irqrestore(&base->lock, flags);
}
return ret;
}
int del_timer_sync(struct timer_list *timer)
{
#ifdef CONFIG_LOCKDEP
unsigned long flags;
/*
* If lockdep gives a backtrace here, please reference
* the synchronization rules above.
*/
local_irq_save(flags);
lock_map_acquire(&timer->lockdep_map);
lock_map_release(&timer->lockdep_map);
local_irq_restore(flags);
#endif
/*
* don't use it in hardirq context, because it
* could lead to deadlock.
*/
WARN_ON(in_irq() && !(timer->flags & TIMER_IRQSAFE));
for (;;) {
int ret = try_to_del_timer_sync(timer);
if (ret >= 0)
return ret;
cpu_relax();
}
}

timer到期处理流程

每个tick到来的时候会触发一个TIMER_SOFTIRQ, 低精度的timer到期检查和运行就是在这个softirq中执行的。

timer的softirq handler是run_timer_softirq:

/*
* This function runs timers and the timer-tq in bottom half context.
*/
static __latent_entropy void run_timer_softirq(struct softirq_action *h)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
__run_timers(base);
if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
}

__run_timers函数会逐个检查clk指导clk等于当前的jiffies。每次检查当前clk到期的所有level的timer,并执行。

/**
* __run_timers - run all expired timers (if any) on this CPU.
* @base: the timer vector to be processed.
*/
static inline void __run_timers(struct timer_base *base)
{
/*定义了一个长度包含了所有level的数组,用来存储到期的timer hlist*/
struct hlist_head heads[LVL_DEPTH];
int levels;
if (!time_after_eq(jiffies, base->clk))
return;
timer_base_lock_expiry(base);
raw_spin_lock_irq(&base->lock);
/*
* timer_base::must_forward_clk must be cleared before running
* timers so that any timer functions that call mod_timer() will
* not try to forward the base. Idle tracking / clock forwarding
* logic is only used with BASE_STD timers.
*
* The must_forward_clk flag is cleared unconditionally also for
* the deferrable base. The deferrable base is not affected by idle
* tracking and never forwarded, so clearing the flag is a NOOP.
*
* The fact that the deferrable base is never forwarded can cause
* large variations in granularity for deferrable timers, but they
* can be deferred for long periods due to idle anyway.
*/
base->must_forward_clk = false;
while (time_after_eq(jiffies, base->clk)) {
/*检查到期的timer并将对应的hlist存储在heads数组当中*/
levels = collect_expired_timers(base, heads);
base->clk++;
/* 执行所有到期的timer*/
while (levels--)
expire_timers(base, heads + levels);
}
raw_spin_unlock_irq(&base->lock);
timer_base_unlock_expiry(base);
}

collect_expired_timers是对__collect_expired_timers的一层封装,处理了在休眠很长时间后将clk forward到最近一个timer的到期时间之前以减少无用的计算。

static int collect_expired_timers(struct timer_base *base,
struct hlist_head *heads)
{
/*
* NOHZ optimization. After a long idle sleep we need to forward the
* base to current jiffies. Avoid a loop by searching the bitfield for
* the next expiring timer.
*/
if ((long)(jiffies - base->clk) > 2) {
/* 
unsigned long next = __next_timer_interrupt(base);
/*
* If the next timer is ahead of time forward to current
* jiffies, otherwise forward to the next expiry time:
*/
if (time_after(next, jiffies)) {
/*
* The call site will increment base->clk and then
* terminate the expiry loop immediately.
*/
base->clk = jiffies;
return 0;
}
base->clk = next;
}
return __collect_expired_timers(base, heads);
}

其中最重要的依旧是timer到期判定部分的代码。之前有说到,根据延时的长短,kernel会将pending_map中对应的bit置1。这里的判定是类似的,根据当前的clk,不断的右移得到对应精度的clk数值,再加上对应level在pending_map中所处的位置,就可以检测对应的bit来判定有无定时器到期了。

static int __collect_expired_timers(struct timer_base *base,
struct hlist_head *heads)
{
unsigned long clk = base->clk;
struct hlist_head *vec;
int i, levels = 0;
unsigned int idx;
for (i = 0; i < LVL_DEPTH; i++) {
idx = (clk & LVL_MASK) + i * LVL_SIZE;
if (__test_and_clear_bit(idx, base->pending_map)) {
vec = base->vectors + idx;
hlist_move_list(vec, heads++);
levels++;
}
/* Is it time to look at the next level? */
if (clk & LVL_CLK_MASK)
break;
/* Shift clock for the next level granularity */
clk >>= LVL_CLK_SHIFT;
}
return levels;
}

最后在expire_timers中将timer从hlist中移除,并根据flag执行timer到期的回调函数。

static void expire_timers(struct timer_base *base, struct hlist_head *head)
{
while (!hlist_empty(head)) {
struct timer_list *timer;
void (*fn)(struct timer_list *);
timer = hlist_entry(head->first, struct timer_list, entry);
base->running_timer = timer;
detach_timer(timer, true);
fn = timer->function;
if (timer->flags & TIMER_IRQSAFE) {
raw_spin_unlock(&base->lock);
call_timer_fn(timer, fn);
raw_spin_lock(&base->lock);
} else {
raw_spin_unlock_irq(&base->lock);
call_timer_fn(timer, fn);
raw_spin_lock_irq(&base->lock);
}
}
}

总结时间

无论是timer wheel还是non cascading timer都是kernel开发人员所创造的非常精巧的设计。从timer wheel到non cascading的切换主要基于以下两点:大部分的级联操作都是没有意义的,因为大部分定时器在这之前都被重新装填;在NOHZ系统中,系统唤醒后需要一个快速的方法找到下一个到期的定时器。而non cascading timer无疑是从根本上解决了这两个问题。但timer wheel在定时的准确性上也有自己的优势,只是non cascading timer更加适合当今的应用场景。kernel中大部分设计都不是完美的,正如timer的精准和效率总是相互矛盾的。对于我们来说,如何取得这两者之间的平衡就显得尤为重要。linux kernel的设计也极好的体现了代码设计中的平衡。

发表评论

电子邮件地址不会被公开。 必填项已用*标注