<?xml version="1.0"?>
<linuxdoc><article opts="null"><titlepag><title>KernelAnalysis-HOWTO</title><author><name>Roberto Arcomano </name></author><date>v0.63 - July 31, 2002</date><abstract>This document tries to explain some things about the Linux Kernel, such
 as the most important components, how they work, and so on. This HOWTO should
 help prevent the reader from needing to browse all the kernel source files
 searching for the "right function," declaration, and definition, and then linking
 each to the other. You can find the latest version of this document at <url url="http://bertolinux.fatamorgana.com" name="http://bertolinux.fatamorgana.com"></url>. If
 you have suggestions to help make this document better, please submit your
 ideas to me at the following address: <url url="mailto:berto@fatamorgana.com" name="berto@fatamorgana.com"></url>.</abstract></titlepag><sect><heading>Introduction</heading><sect1><heading>Introduction</heading><p>This HOWTO tries to explain how parts of the Linux Kernel work, what the
main functions and data structures are, and how the "wheel spins". You can
find the latest version of this document at <url url="http://www.fatamorgana.com/bertolinux" name="http://www.fatamorgana.com/bertolinux"></url>. If you have suggestions to help
make this document better, please submit your ideas to me at the following
address: <url url="mailto:berto@fatamorgana.com" name="berto@fatamorgana.com"></url>. Code used within this document refers to the Linux Kernel version
2.4.x, which was the latest stable kernel version at the time of writing this HOWTO.</p></sect1><sect1><heading>Copyright</heading><p>Copyright (C) 2000,2001,2002 Roberto Arcomano. This document is free; you
can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version. This document is distributed
in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details. You can get a copy of
the GNU GPL <url url="http://www.gnu.org/copyleft/gpl.html" name="here"></url>.</p></sect1><sect1><heading>Translations</heading><p>If you want to translate this document you are free to do so. However,
you will need to do the following: </p><p><enum><item>Check that another version of the document doesn't already exist at your
local LDP.</item><item>Maintain all 'Introduction' sections (including 'Introduction', 'Copyright',
'Translations', 'Credits').</item></enum></p><p>Warning! Do not translate the TXT or HTML file; modify the
LyX file instead, so that it can be converted to all other formats (TXT, HTML,
RIFF, etc.). To do that, you can use the "LyX" application, which you can download from <url url="http://www.lyx.org" name="http://www.lyx.org"></url>.</p><p>There is no need to ask me for permission to translate. You just have to let me know (if you want)
about your translation.</p><p>Thank you for your translation!</p></sect1><sect1><heading>Credits</heading><p>Thanks to <url url="http://www.linuxdoc.org" name="Linux Documentation Project"></url> for publishing and uploading my document quickly.</p></sect1></sect><sect><heading>Syntax used</heading><sect1><heading>Function Syntax</heading><p>When speaking about a function, we write: </p><p><verb>"function_name  [ file location . extension ]"</verb></p><p>For example: </p><p><verb>"schedule [kernel/sched.c]" </verb></p><p>tells us that we are talking about the </p><p>"schedule" </p><p>function, which can be found in the file </p><p>[ kernel/sched.c ]</p><p>Note: We also assume /usr/src/linux as the starting directory.</p></sect1><sect1><heading>Indentation</heading><p>Indentation in source code is 3 blank characters.</p></sect1><sect1><heading>InterCallings Analysis</heading><sect2><heading>Overview</heading><p>We use the "InterCallings Analysis" (ICA) to see (in an indented fashion)
how kernel functions call each other. </p><p>For example, the sleep_on function is described in the ICA below:</p><p><verb>|sleep_on
|init_waitqueue_entry      --
|__add_wait_queue            |   enqueuing request  
   |list_add                 |
      |__list_add          -- 
   |schedule              ---     waiting for request to be executed
      |__remove_wait_queue --   
      |list_del              |   dequeuing request
         |__list_del       -- 
 
                          sleep_on ICA
</verb></p><p>The indented ICA is followed by the functions' locations:</p><p><itemize><item>sleep_on [kernel/sched.c]</item><item>init_waitqueue_entry [include/linux/wait.h]</item><item>__add_wait_queue</item><item>list_add [include/linux/list.h]</item><item>__list_add</item><item>schedule [kernel/sched.c]</item><item>__remove_wait_queue [include/linux/wait.h]</item><item>list_del [include/linux/list.h]</item><item>__list_del</item></itemize></p><p>Note: We do not repeat a file location if it was specified just before.</p></sect2><sect2><heading>Details</heading><p>In an ICA, a line like the following </p><p><verb> function1 -&gt; function2</verb></p><p>means that "function1" is a generic pointer to another function.
In this case "function1" points to "function2".</p><p>When we write:</p><p><verb>  function:</verb></p><p>it means that "function" is not a real function. It is a label
(typically an assembler label).</p><p>In many sections we show ''C'' code or ''pseudo-code''. The real
source files may instead use ''assembler'' or ''unstructured'' code. This
difference is for learning purposes.</p></sect2><sect2><heading>PROs of using ICA</heading><p>The advantages of using ICA (InterCallings Analysis) are many:</p><p><itemize><item>You get an overview of what happens when you call a kernel function </item><item>Function locations are indicated after the function, so ICA could also
be considered as a little ''function reference''</item><item>InterCallings Analysis (ICA) is useful in sleep/awake mechanisms, where
we can view what we do before sleeping, the proper sleeping action, and what
we'll do after waking up (after schedule).</item></itemize></p></sect2><sect2><heading>CONs of using ICA</heading><p>Some of the disadvantages of using ICA are listed below:</p><p><itemize><item>Like all theoretical models, ICA simplifies reality by leaving out many details, such
as real source code and special conditions.</item><item>Additional diagrams should be added to better represent stack conditions,
data values, and so on.</item></itemize></p></sect2></sect1></sect><sect><heading>Fundamentals</heading><sect1><heading>What is the kernel?</heading><p>The kernel is the "core" of any computer system: it is the "software" which
allows users to share computer resources.</p><p>The kernel can be thought of as the main software of the OS (Operating System),
which may also include graphics management. </p><p>For example, under Linux (like other Unix-like OSs), the XWindow environment
doesn't belong to the Linux Kernel, because it manages only graphical operations
(it uses user mode I/O to access video card devices). </p><p>By contrast, Windows environments (Win9x, WinME, WinNT, Win2K, WinXP, and
so on) are a mix between a graphical environment and kernel.</p></sect1><sect1><heading>What is the difference between User Mode and Kernel Mode?</heading><sect2><heading>Overview</heading><p>Many years ago, when computers were as big as a room, users ran their applications
with much difficulty and, sometimes, their applications crashed the computer.</p></sect2><sect2><heading>Operating modes</heading><p>To keep applications from constantly crashing the computer, newer OSs were designed
with 2 different operating modes:</p><p><enum><item>Kernel Mode: the machine operates on critical data structures, with direct
access to hardware (IN/OUT or memory-mapped I/O), memory, IRQs, DMA, and so on.</item><item>User Mode: users can run applications.</item></enum></p><p><verb>                      
               |          Applications           /|\
               |         ______________           |
               |         | User Mode  |           |  
               |         ______________           | 
               |               |                  |  
Implementation |        _______ _______           |   Abstraction
    Detail     |        | Kernel Mode |           |
               |        _______________           |
               |               |                  |
               |               |                  | 
               |               |                  |
              \|/          Hardware               |</verb></p><p>Kernel Mode "prevents" User Mode applications from damaging the system or
its features.</p><p>Modern microprocessors implement at least 2 different states in hardware.
For example, Intel processors have 4 states that determine the PL (Privilege Level). The
possible levels are 0, 1, 2, and 3, with 0 used for Kernel Mode. </p><p>A Unix OS requires only 2 privilege levels, and we will use this paradigm
as our point of reference.</p></sect2></sect1><sect1><heading>Switching from User Mode to Kernel Mode</heading><sect2><heading>When do we switch?</heading><p>Once we understand that there are 2 different modes, we have to know when
we switch from one to the other.</p><p>Typically, there are 2 points of switching:</p><p><enum><item>When calling a System Call: by calling a System Call, the task voluntarily
runs pieces of code living in Kernel Mode.</item><item>When an IRQ (or exception) arrives: an IRQ handler (or exception
handler) is called, then control returns to the task that was interrupted
as if nothing had happened.</item></enum></p></sect2><sect2><heading>System Calls</heading><p>System calls are like special functions that manage OS routines which live
in Kernel Mode.</p><p>A system call can be called when we:</p><p><itemize><item>access an I/O device or a file (like read or write)</item><item>need to access privileged information (like the pid, the scheduling policy,
or other information)</item><item>need to change execution context (like forking or executing some other
application) </item><item>need to execute a particular command (like ''chdir'', ''kill'', ''brk'',
or ''signal'')</item></itemize></p><p><verb>                                 |                |
                         -------&gt;| System Call i  | (Accessing Devices)
|                |       |       |  [sys_read()]  |
| ...            |       |       |                |
| system_call(i) |--------       |                |
|   [read()]     |               |                |
| ...            |               |                |
| system_call(j) |--------       |                |  
|   [getpid()]   |       |       |                |
| ...            |       -------&gt;| System Call j  | (Accessing kernel data structures)
|                |               |  [sys_getpid()]|
                                 |                | 
 
    USER MODE                        KERNEL MODE
 
  
                        Unix System Calls Working </verb></p><p>System calls are almost the only interface used by User Mode to talk to
low level resources (hardware). The only exception to this statement is when
a process uses the ''ioperm'' system call. In this case a device can be accessed
directly by the User Mode process (IRQs cannot be used).</p><p>NOTE: Not every ''C'' function is a system call, only some of them.</p><p>Below is the list of System Calls in Linux Kernel 2.4.17, from [
arch/i386/kernel/entry.S ]</p><p><verb>        .long SYMBOL_NAME(sys_ni_syscall)       /* 0  -  old "setup()" system call*/
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        .long SYMBOL_NAME(sys_open)             /* 5 */
        .long SYMBOL_NAME(sys_close)
        .long SYMBOL_NAME(sys_waitpid)
        .long SYMBOL_NAME(sys_creat)
        .long SYMBOL_NAME(sys_link)
        .long SYMBOL_NAME(sys_unlink)           /* 10 */
        .long SYMBOL_NAME(sys_execve)
        .long SYMBOL_NAME(sys_chdir)
        .long SYMBOL_NAME(sys_time)
        .long SYMBOL_NAME(sys_mknod)
        .long SYMBOL_NAME(sys_chmod)            /* 15 */
        .long SYMBOL_NAME(sys_lchown16)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old break syscall holder */
        .long SYMBOL_NAME(sys_stat)
        .long SYMBOL_NAME(sys_lseek)
        .long SYMBOL_NAME(sys_getpid)           /* 20 */
        .long SYMBOL_NAME(sys_mount)
        .long SYMBOL_NAME(sys_oldumount)
        .long SYMBOL_NAME(sys_setuid16)
        .long SYMBOL_NAME(sys_getuid16)
        .long SYMBOL_NAME(sys_stime)            /* 25 */
        .long SYMBOL_NAME(sys_ptrace)
        .long SYMBOL_NAME(sys_alarm)
        .long SYMBOL_NAME(sys_fstat)
        .long SYMBOL_NAME(sys_pause)
        .long SYMBOL_NAME(sys_utime)            /* 30 */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old stty syscall holder */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old gtty syscall holder */
        .long SYMBOL_NAME(sys_access)
        .long SYMBOL_NAME(sys_nice)
        .long SYMBOL_NAME(sys_ni_syscall)       /* 35 */                /* old ftime syscall holder */
        .long SYMBOL_NAME(sys_sync)
        .long SYMBOL_NAME(sys_kill)
        .long SYMBOL_NAME(sys_rename)
        .long SYMBOL_NAME(sys_mkdir)
        .long SYMBOL_NAME(sys_rmdir)            /* 40 */
        .long SYMBOL_NAME(sys_dup)
        .long SYMBOL_NAME(sys_pipe)
        .long SYMBOL_NAME(sys_times)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old prof syscall holder */
        .long SYMBOL_NAME(sys_brk)              /* 45 */
        .long SYMBOL_NAME(sys_setgid16)
        .long SYMBOL_NAME(sys_getgid16)
        .long SYMBOL_NAME(sys_signal)
        .long SYMBOL_NAME(sys_geteuid16)
        .long SYMBOL_NAME(sys_getegid16)        /* 50 */
        .long SYMBOL_NAME(sys_acct)
        .long SYMBOL_NAME(sys_umount)                                   /* recycled never used phys() */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old lock syscall holder */
        .long SYMBOL_NAME(sys_ioctl)
        .long SYMBOL_NAME(sys_fcntl)            /* 55 */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old mpx syscall holder */
        .long SYMBOL_NAME(sys_setpgid)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old ulimit syscall holder */
        .long SYMBOL_NAME(sys_olduname)
        .long SYMBOL_NAME(sys_umask)            /* 60 */
        .long SYMBOL_NAME(sys_chroot)
        .long SYMBOL_NAME(sys_ustat)
        .long SYMBOL_NAME(sys_dup2)
        .long SYMBOL_NAME(sys_getppid)
        .long SYMBOL_NAME(sys_getpgrp)          /* 65 */
        .long SYMBOL_NAME(sys_setsid)
        .long SYMBOL_NAME(sys_sigaction)
        .long SYMBOL_NAME(sys_sgetmask)
        .long SYMBOL_NAME(sys_ssetmask)
        .long SYMBOL_NAME(sys_setreuid16)       /* 70 */
        .long SYMBOL_NAME(sys_setregid16)
        .long SYMBOL_NAME(sys_sigsuspend)
        .long SYMBOL_NAME(sys_sigpending)
        .long SYMBOL_NAME(sys_sethostname)
        .long SYMBOL_NAME(sys_setrlimit)        /* 75 */
        .long SYMBOL_NAME(sys_old_getrlimit)
        .long SYMBOL_NAME(sys_getrusage)
        .long SYMBOL_NAME(sys_gettimeofday)
        .long SYMBOL_NAME(sys_settimeofday)
        .long SYMBOL_NAME(sys_getgroups16)      /* 80 */
        .long SYMBOL_NAME(sys_setgroups16)
        .long SYMBOL_NAME(old_select)
        .long SYMBOL_NAME(sys_symlink)
        .long SYMBOL_NAME(sys_lstat)
        .long SYMBOL_NAME(sys_readlink)         /* 85 */
        .long SYMBOL_NAME(sys_uselib)
        .long SYMBOL_NAME(sys_swapon)
        .long SYMBOL_NAME(sys_reboot)
        .long SYMBOL_NAME(old_readdir)
        .long SYMBOL_NAME(old_mmap)             /* 90 */
        .long SYMBOL_NAME(sys_munmap)
        .long SYMBOL_NAME(sys_truncate)
        .long SYMBOL_NAME(sys_ftruncate)
        .long SYMBOL_NAME(sys_fchmod)
        .long SYMBOL_NAME(sys_fchown16)         /* 95 */
        .long SYMBOL_NAME(sys_getpriority)
        .long SYMBOL_NAME(sys_setpriority)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old profil syscall holder */
        .long SYMBOL_NAME(sys_statfs)
        .long SYMBOL_NAME(sys_fstatfs)          /* 100 */
        .long SYMBOL_NAME(sys_ioperm)
        .long SYMBOL_NAME(sys_socketcall)
        .long SYMBOL_NAME(sys_syslog)
        .long SYMBOL_NAME(sys_setitimer)
        .long SYMBOL_NAME(sys_getitimer)        /* 105 */
        .long SYMBOL_NAME(sys_newstat)
        .long SYMBOL_NAME(sys_newlstat)
        .long SYMBOL_NAME(sys_newfstat)
        .long SYMBOL_NAME(sys_uname)
        .long SYMBOL_NAME(sys_iopl)             /* 110 */
        .long SYMBOL_NAME(sys_vhangup)
        .long SYMBOL_NAME(sys_ni_syscall)       /* old "idle" system call */
        .long SYMBOL_NAME(sys_vm86old)
        .long SYMBOL_NAME(sys_wait4)
        .long SYMBOL_NAME(sys_swapoff)          /* 115 */
        .long SYMBOL_NAME(sys_sysinfo)
        .long SYMBOL_NAME(sys_ipc)
        .long SYMBOL_NAME(sys_fsync)
        .long SYMBOL_NAME(sys_sigreturn)
        .long SYMBOL_NAME(sys_clone)            /* 120 */
        .long SYMBOL_NAME(sys_setdomainname)
        .long SYMBOL_NAME(sys_newuname)
        .long SYMBOL_NAME(sys_modify_ldt)
        .long SYMBOL_NAME(sys_adjtimex)
        .long SYMBOL_NAME(sys_mprotect)         /* 125 */
        .long SYMBOL_NAME(sys_sigprocmask)
        .long SYMBOL_NAME(sys_create_module)
        .long SYMBOL_NAME(sys_init_module)
        .long SYMBOL_NAME(sys_delete_module)
        .long SYMBOL_NAME(sys_get_kernel_syms)  /* 130 */
        .long SYMBOL_NAME(sys_quotactl)
        .long SYMBOL_NAME(sys_getpgid)
        .long SYMBOL_NAME(sys_fchdir)
        .long SYMBOL_NAME(sys_bdflush)
        .long SYMBOL_NAME(sys_sysfs)            /* 135 */
        .long SYMBOL_NAME(sys_personality)
        .long SYMBOL_NAME(sys_ni_syscall)       /* for afs_syscall */
        .long SYMBOL_NAME(sys_setfsuid16)
        .long SYMBOL_NAME(sys_setfsgid16)
        .long SYMBOL_NAME(sys_llseek)           /* 140 */
        .long SYMBOL_NAME(sys_getdents)
        .long SYMBOL_NAME(sys_select)
        .long SYMBOL_NAME(sys_flock)
        .long SYMBOL_NAME(sys_msync)
        .long SYMBOL_NAME(sys_readv)            /* 145 */
        .long SYMBOL_NAME(sys_writev)
        .long SYMBOL_NAME(sys_getsid)
        .long SYMBOL_NAME(sys_fdatasync)
        .long SYMBOL_NAME(sys_sysctl)
        .long SYMBOL_NAME(sys_mlock)            /* 150 */
        .long SYMBOL_NAME(sys_munlock)
        .long SYMBOL_NAME(sys_mlockall)
        .long SYMBOL_NAME(sys_munlockall)
        .long SYMBOL_NAME(sys_sched_setparam)
        .long SYMBOL_NAME(sys_sched_getparam)   /* 155 */
        .long SYMBOL_NAME(sys_sched_setscheduler)
        .long SYMBOL_NAME(sys_sched_getscheduler)
        .long SYMBOL_NAME(sys_sched_yield)
        .long SYMBOL_NAME(sys_sched_get_priority_max)
        .long SYMBOL_NAME(sys_sched_get_priority_min)  /* 160 */
        .long SYMBOL_NAME(sys_sched_rr_get_interval)
        .long SYMBOL_NAME(sys_nanosleep)
        .long SYMBOL_NAME(sys_mremap)
        .long SYMBOL_NAME(sys_setresuid16)
        .long SYMBOL_NAME(sys_getresuid16)      /* 165 */
        .long SYMBOL_NAME(sys_vm86)
        .long SYMBOL_NAME(sys_query_module)
        .long SYMBOL_NAME(sys_poll)
        .long SYMBOL_NAME(sys_nfsservctl)
        .long SYMBOL_NAME(sys_setresgid16)      /* 170 */
        .long SYMBOL_NAME(sys_getresgid16)
        .long SYMBOL_NAME(sys_prctl)
        .long SYMBOL_NAME(sys_rt_sigreturn)
        .long SYMBOL_NAME(sys_rt_sigaction)
        .long SYMBOL_NAME(sys_rt_sigprocmask)   /* 175 */
        .long SYMBOL_NAME(sys_rt_sigpending)
        .long SYMBOL_NAME(sys_rt_sigtimedwait)
        .long SYMBOL_NAME(sys_rt_sigqueueinfo)
        .long SYMBOL_NAME(sys_rt_sigsuspend)
        .long SYMBOL_NAME(sys_pread)            /* 180 */
        .long SYMBOL_NAME(sys_pwrite)
        .long SYMBOL_NAME(sys_chown16)
        .long SYMBOL_NAME(sys_getcwd)
        .long SYMBOL_NAME(sys_capget)
        .long SYMBOL_NAME(sys_capset)           /* 185 */
        .long SYMBOL_NAME(sys_sigaltstack)
        .long SYMBOL_NAME(sys_sendfile)
        .long SYMBOL_NAME(sys_ni_syscall)               /* streams1 */
        .long SYMBOL_NAME(sys_ni_syscall)               /* streams2 */
        .long SYMBOL_NAME(sys_vfork)            /* 190 */
        .long SYMBOL_NAME(sys_getrlimit)
        .long SYMBOL_NAME(sys_mmap2)
        .long SYMBOL_NAME(sys_truncate64)
        .long SYMBOL_NAME(sys_ftruncate64)
        .long SYMBOL_NAME(sys_stat64)           /* 195 */
        .long SYMBOL_NAME(sys_lstat64)
        .long SYMBOL_NAME(sys_fstat64)
        .long SYMBOL_NAME(sys_lchown)
        .long SYMBOL_NAME(sys_getuid)
        .long SYMBOL_NAME(sys_getgid)           /* 200 */
        .long SYMBOL_NAME(sys_geteuid)
        .long SYMBOL_NAME(sys_getegid)
        .long SYMBOL_NAME(sys_setreuid)
        .long SYMBOL_NAME(sys_setregid)
        .long SYMBOL_NAME(sys_getgroups)        /* 205 */
        .long SYMBOL_NAME(sys_setgroups)
        .long SYMBOL_NAME(sys_fchown)
        .long SYMBOL_NAME(sys_setresuid)
        .long SYMBOL_NAME(sys_getresuid)
        .long SYMBOL_NAME(sys_setresgid)        /* 210 */
        .long SYMBOL_NAME(sys_getresgid)
        .long SYMBOL_NAME(sys_chown)
        .long SYMBOL_NAME(sys_setuid)
        .long SYMBOL_NAME(sys_setgid)
        .long SYMBOL_NAME(sys_setfsuid)         /* 215 */
        .long SYMBOL_NAME(sys_setfsgid)
        .long SYMBOL_NAME(sys_pivot_root)
        .long SYMBOL_NAME(sys_mincore)
        .long SYMBOL_NAME(sys_madvise)
        .long SYMBOL_NAME(sys_getdents64)       /* 220 */
        .long SYMBOL_NAME(sys_fcntl64)
        .long SYMBOL_NAME(sys_ni_syscall)       /* reserved for TUX */
        .long SYMBOL_NAME(sys_ni_syscall)       /* Reserved for Security */
        .long SYMBOL_NAME(sys_gettid)
        .long SYMBOL_NAME(sys_readahead)        /* 225 */

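</verb></p><p>The table above maps each system call number to its kernel routine. As a minimal user-space sketch (assuming a Linux system with the GNU C library, not the kernel sources quoted in this document), the code below reaches sys_getpid in two ways: through the ordinary getpid() wrapper and through the generic syscall() function with the SYS_getpid number. The helper names raw_getpid and getpid_paths_agree are hypothetical and exist only for this illustration. Both paths should report the same PID, so getpid_paths_agree() is expected to return 1. Note that the numeric value of SYS_getpid is 20 only on i386, as in the table; other architectures may use a different number.</p><p><verb><![CDATA[
```c
#define _GNU_SOURCE          /* expose syscall() in <unistd.h> on glibc */
#include <assert.h>
#include <sys/syscall.h>     /* SYS_getpid: the system call number */
#include <sys/types.h>       /* pid_t */
#include <unistd.h>          /* getpid(), syscall() */

/* Enter the kernel through the generic syscall() entry point. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Return 1 if the library wrapper and the raw call report the same PID. */
int getpid_paths_agree(void)
{
    pid_t via_wrapper = getpid();    /* wrapper provided by the C library */
    long via_raw = raw_getpid();     /* same request, made by number */

    assert(via_raw > 0);             /* a running process always has a PID */
    return via_raw == (long)via_wrapper;
}
```
]]>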
</verb></p></sect2><sect2><heading>IRQ Event</heading><p>When an IRQ arrives, the running task is interrupted so that
the IRQ handler can run.</p><p>After the IRQ is handled, control returns exactly to the point of interruption,
as if nothing had happened.</p><p><verb>
           
              Running Task 
             |-----------|          (3)
NORMAL       |   |       | [break execution] IRQ Handler
EXECUTION (1)|   |       |     -------------&gt;|---------| 
             |  \|/      |     |             |  does   |         
 IRQ (2)----&gt;| ..        |-----&gt;             |  some   |      
             |   |       |&lt;-----             |  work   |       
BACK TO      |   |       |     |             |  ..(4). |
NORMAL    (6)|  \|/      |     &lt;-------------|_________|
EXECUTION    |___________|  [return to code]
                                    (5)
               USER MODE                     KERNEL MODE

         User-&gt;Kernel Mode Transition caused by IRQ event
     </verb></p><p>The numbered steps below refer to the sequence of events in the diagram
above:</p><p><enum><item>The process is executing.</item><item>An IRQ arrives while the task is running.</item><item>The task is interrupted to call an "Interrupt handler".</item><item>The "Interrupt handler" code is executed.</item><item>Control returns to the task in User Mode (as if nothing had happened).</item><item>The process returns to normal execution.</item></enum></p><p>Of special interest is the Timer IRQ, which arrives every TIMER ms to manage:</p><p><enum><item>Alarms</item><item>System and task counters (used by schedule to decide when to stop a process,
or for accounting)</item><item>Multitasking, based on a wake-up mechanism after TIMESLICE time.</item></enum></p></sect2></sect1><sect1><heading>Multitasking</heading><sect2><heading>Mechanism</heading><p>The key concept of modern OSs is the "Task". A Task is an application running
in memory, sharing all resources (including CPU and Memory) with other Tasks.</p><p>This "resource sharing" is managed by the "Multitasking Mechanism". The Multitasking
Mechanism switches from one task to another after a "timeslice" time. Users have
the "illusion" that they own all resources. We can also imagine a single user
scenario, where a user can have the "illusion" of running many tasks at the same
time.</p><p>To implement multitasking, each task has a "state" variable, which
can be:</p><p><enum><item>READY, ready for execution</item><item>BLOCKED, waiting for a resource</item></enum></p><p>The task state is tracked by the list the task is on: the READY list
or the BLOCKED list.</p></sect2><sect2><heading>Task Switching</heading><p>The movement from one task to another is called ''Task Switching''. Many
computers have a hardware instruction which automatically performs this operation.
Task Switching occurs in the following cases:</p><p><enum><item>After the Timeslice ends: we need to schedule a "Ready for execution" task and
give it the CPU.</item><item>When a Task has to wait for a device: we need to schedule a new task and
switch to it *</item></enum></p><p>* We schedule another task to prevent "Busy Waiting", which occurs
when we keep waiting for a device instead of performing other work.</p><p>Task Switching is managed by the "Schedule" entity.</p><p><verb> 
Timer    |           |
 IRQ     |           |                            Schedule
  |      |           |                     ________________________
  |-----&gt;|   Task 1  |&lt;------------------&gt;|(1)Chooses a Ready Task |
  |      |           |                    |(2)Task Switching       |
  |      |___________|                    |________________________|   
  |      |           |                               /|\
  |      |           |                                | 
  |      |           |                                |
  |      |           |                                |
  |      |           |                                |      
  |-----&gt;|   Task 2  |&lt;-------------------------------|
  |      |           |                                |
  |      |___________|                                |
  .      .     .     .                                .
  .      .     .     .                                .
  .      .     .     .                                .
  |      |           |                                |
  |      |           |                                |
  ------&gt;|   Task N  |&lt;--------------------------------
         |           |
         |___________| 
    
            Task Switching based on TimeSlice
 </verb></p><p>A typical Timeslice for Linux is about 10 ms.</p><p><verb>
 

 |           |            
 |           | Resource    _____________________________
 |   Task 1  |-----------&gt;|(1) Enqueue Resource request |
 |           |  Access    |(2)  Mark Task as blocked    |
 |           |            |(3)  Choose a Ready Task     |
 |___________|            |(4)    Task Switching        |
                          |_____________________________|
                                       |
                                       |
 |           |                         |
 |           |                         |
 |   Task 2  |&lt;-------------------------
 |           |  
 |           |
 |___________|
 
     Task Switching based on Waiting for a Resource
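</verb></p><p>The two diagrams above can be condensed into a small simulation. This is only a sketch under simplifying assumptions: tasks live in a fixed array, the scheduler is plain round-robin, and all identifiers (task_state, block_task, wake_task, pick_next) are hypothetical names, not the ones used by the real schedule function in [kernel/sched.c]. Calling pick_next() when the timeslice ends, or right after block_task(), corresponds to the "Chooses a Ready Task" step in both diagrams.</p><p><verb><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* The two task states named in the text (identifiers are hypothetical). */
enum task_state { TASK_READY, TASK_BLOCKED };

struct task {
    int id;
    enum task_state state;
};

/* Mark a task as waiting for a resource (step 2 of the second diagram). */
void block_task(struct task *t)
{
    t->state = TASK_BLOCKED;
}

/* The resource became available: the task may be scheduled again. */
void wake_task(struct task *t)
{
    t->state = TASK_READY;
}

/*
 * Choose the next READY task in round-robin order, starting just after
 * `current`. The current task is examined last, so it keeps running only
 * when no other task is READY. Returns the chosen index, or -1 if every
 * task is BLOCKED.
 */
int pick_next(const struct task *tasks, size_t n, size_t current)
{
    assert(n > 0 && current < n);
    for (size_t step = 1; step <= n; step++) {
        size_t i = (current + step) % n;
        if (tasks[i].state == TASK_READY)
            return (int)i;
    }
    return -1;
}
```
]]>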
 </verb></p></sect2></sect1><sect1><heading>Microkernel vs Monolithic OS</heading><sect2><heading>Overview</heading><p>So far we have looked at the so-called Monolithic OS, but there is also another
kind of OS: the ''Microkernel''.</p><p>A Microkernel OS uses Tasks not only for user mode processes, but also
for the kernel's own managers, like Floppy-Task, HDD-Task, Net-Task and so on. Some
examples are Amoeba and Mach. </p></sect2><sect2><heading>PROs and CONs of Microkernel OS </heading><p>PROS:</p><p><itemize><item>The OS is simpler to maintain because each Task manages a single kind of operation.
So if you want to modify networking, you modify Net-Task (ideally, provided
no structural update is needed).</item></itemize></p><p>CONS:</p><p><itemize><item>Performance is worse than in a Monolithic OS, because each request adds 2*TASK_SWITCH
time (one switch to enter the specific Task, one to leave it).</item></itemize></p><p>My personal opinion is that Microkernels are a good didactic example (like
Minix) but they are not ''optimal'', so not really suitable. Linux uses a few
Tasks, called "Kernel Threads", to implement a small microkernel-like structure (like
kswapd, which is used to retrieve memory pages from mass storage). In this
case there are no problems with performance because swapping is a very slow
job.</p></sect2></sect1><sect1><heading>Networking</heading><sect2><heading>ISO OSI levels</heading><p>The ISO-OSI standard describes a network architecture with the following levels:</p><p><enum><item>Physical level (example: Ethernet)</item><item>Data-link level (examples: PPP and Ethernet)</item><item>Network level (examples: IP, and X.25)</item><item>Transport level (examples: TCP, UDP)</item><item>Session level (SSL)</item><item>Presentation level (FTP binary-ascii coding)</item><item>Application level (applications like Netscape)</item></enum></p><p>The first 2 levels listed above are often implemented in hardware. The higher
levels are in software (or firmware for routers).</p><p>Many protocols are used by an OS: one of these is TCP/IP (the most important one,
living at levels 3 and 4).</p></sect2><sect2><heading>What does the kernel do?</heading><p>The kernel doesn't know anything (only addresses) about the first 2 levels
of ISO-OSI.</p><p>In RX it:</p><p><enum><item>Manages the handshake with low level devices (like an Ethernet card or modem),
receiving "frames" from them.</item><item>Builds TCP/IP "packets" from "frames" (like Ethernet or PPP ones).</item><item>Converts ''packets'' into ''sockets'', passing them to the right application
(using the port number), or</item><item>Forwards packets to the right queue.</item></enum></p><p><verb>frames         packets              sockets
NIC ---------&gt; Kernel ----------&gt; Application
                  |    packets
                  --------------&gt; Forward
                        - RX - </verb></p><p>In the TX stage it:</p><p><enum><item>Converts sockets or </item><item>Queues data into TCP/IP ''packets''</item><item>Splits ''packets'' into ''frames'' (like Ethernet or PPP ones)</item><item>Sends ''frames'' using HW drivers</item></enum></p><p><verb>sockets       packets                     frames
Application ---------&gt; Kernel ----------&gt; NIC
              packets     /|\    
Forward  -------------------
                        - TX -  

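</verb></p><p>The RX decision described earlier in this section (pass the packet to the application selected by its port number, or forward it) can be sketched as follows. This is an illustration under assumed, simplified types: struct packet, struct socket_entry, rx_dispatch and the RX_* verdicts are hypothetical names, and the RX_DROP case (a local packet with no listening application) is an assumption that the text above does not cover. It does not reflect how the kernel's networking code is actually organized.</p><p><verb><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

enum rx_verdict { RX_DELIVER, RX_FORWARD, RX_DROP };

struct packet {
    unsigned int dst_addr;     /* destination address carried by the packet */
    unsigned short dst_port;   /* destination port, used to find the socket */
};

struct socket_entry {
    unsigned short port;       /* port an application is bound to */
    int app_id;                /* stand-in for the owning application */
};

/*
 * Decide what to do with a received packet.
 * On RX_DELIVER, *app_id_out is set to the application that owns the port.
 */
enum rx_verdict rx_dispatch(const struct packet *pkt, unsigned int local_addr,
                            const struct socket_entry *socks, size_t nsocks,
                            int *app_id_out)
{
    assert(pkt != NULL && app_id_out != NULL);

    if (pkt->dst_addr != local_addr)
        return RX_FORWARD;         /* not for this host: forwarding queue */

    for (size_t i = 0; i < nsocks; i++) {
        if (socks[i].port == pkt->dst_port) {
            *app_id_out = socks[i].app_id;
            return RX_DELIVER;     /* found the application by port number */
        }
    }
    return RX_DROP;                /* assumption: nobody listens on the port */
}
```
]]>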
</verb></p></sect2></sect1><sect1><heading>Virtual Memory</heading><sect2><heading>Segmentation</heading><p>Segmentation is the first method used to solve memory allocation problems: it
allows you to compile source code without caring where the application will
be placed in memory. As a matter of fact, this feature helps application developers
to work independently of the OS and also of the hardware.</p><p><verb>     
            |       Stack        |
            |          |         |
            |         \|/        |
            |        Free        | 
            |         /|\        |     Segment &lt;---&gt; Process    
            |          |         |
            |        Heap        |
            | Data uninitialized |
            |  Data initialized  |
            |       Code         |
            |____________________|  
 
                   Segment  
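</verb></p><p>As a rough sketch of how a program can be compiled without knowing where it will be loaded, the code below translates an offset inside a segment into a memory address by adding the segment base. The struct segment layout, the seg_translate name and the limit check are assumptions made for this example; they do not come from the kernel sources. For a segment with base 0x4000 and limit 0x1000, offset 0x10 maps to address 0x4010, while offset 0x1000 is rejected.</p><p><verb><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct segment {
    uint32_t base;    /* where the segment starts in memory */
    uint32_t limit;   /* segment size in bytes */
};

/*
 * Translate an offset inside `seg` into a memory address.
 * Returns 0 and stores the address in *addr on success,
 * or -1 if the offset lies outside the segment.
 */
int seg_translate(const struct segment *seg, uint32_t offset, uint32_t *addr)
{
    assert(seg != NULL && addr != NULL);

    if (offset >= seg->limit)
        return -1;                 /* access beyond the end of the segment */

    *addr = seg->base + offset;    /* the program itself only knows `offset` */
    return 0;
}
```
]]>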
</verb></p><p>We can say that a segment is the logical entity of an application, or the
image of the application in memory.</p><p>When programming, we don't care where our data is put in memory; we only
care about the offset inside our segment (our application).</p><p>Usually a Segment is assigned to each Process and vice versa. In Linux this
is not true: Linux uses only 4 segments for both the Kernel and all Processes.</p><sect3><heading>Problems of Segmentation</heading><p><verb> 
                                 ____________________
                          -----&gt;|                    |-----&gt;
                          | IN  |     Segment A      | OUT
 ____________________     |     |____________________|   
|                    |____|     |                    |   
|     Segment B      |          |     Segment B      |
|                    |____      |                    |   
|____________________|    |     |____________________|   
                          |     |     Segment C      |   
                          |     |____________________|
                          -----&gt;|     Segment D      |-----&gt; 
                            IN  |____________________| OUT 
 
                     Segmentation problem

</verb></p><p>In the diagram above, we want processes A and D to leave memory and
process B to enter. As we can see, there is enough free space for B in total, but we cannot split B
into 2 pieces, so we CANNOT load it (out of memory).</p><p>This problem occurs because pure segments are contiguous
areas (because they are logical areas) and cannot be split.</p></sect3></sect2><sect2><heading>Pagination</heading><p><verb> 
             ____________________
            |     Page 1         |
            |____________________|
            |     Page 2         |
            |____________________| 
            |      ..            |     Segment &lt;---&gt; Process    
            |____________________|
            |     Page n         |
            |____________________|
            |                    |
            |____________________|
            |                    |
            |____________________|  
 
                   Segment  
 </verb></p><p>Pagination splits memory into "n" pieces, each one with a fixed
length.</p><p>A process may be loaded into one or more Pages. When its memory is freed, all
of its pages are freed (see Segmentation Problem, before).</p><p>Pagination is also used for another important purpose, "Swapping". If a page
is not present in physical memory, accessing it generates an EXCEPTION that
makes the Kernel look for the page in storage memory. This mechanism allows
the OS to load more applications than physical memory alone would allow.</p><sect3><heading>Pagination Problem</heading><p><verb>             ____________________
   Page   X |     Process Y      |
            |____________________|
            |                    |
            |       WASTE        |
            |       SPACE        |
            |____________________|  
   
              Pagination Problem
 </verb></p><p>In the diagram above, we can see what is wrong with the pagination policy:
when Process Y is loaded into Page X, ALL memory space of the Page is allocated,
so the remaining space at the end of the Page is wasted.</p></sect3></sect2><sect2><heading>Segmentation and Pagination</heading><p>How can we solve the segmentation and pagination problems? By using both policies together.</p><p><verb> 
                                  |      ..            |
                                  |____________________|
                            -----&gt;|      Page 1        |
                            |     |____________________|
                            |     |      ..            |
 ____________________       |     |____________________|
|                    |      |----&gt;|      Page 2        |
|      Segment X     |  ----|     |____________________|
|                    |      |     |       ..           |
|____________________|      |     |____________________|
                            |     |       ..           |
                            |     |____________________|
                            |----&gt;|      Page 3        |
                                  |____________________|
                                  |       ..           |
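</verb></p><p>To put numbers on the diagram above, the following sketch computes how many
fixed-size pages a segment needs and how much space is left unused in its last page. The
4096-byte page size is only an example value, and ''pages_needed'' and ''wasted_bytes'' are
hypothetical helper names, not kernel functions.</p><p><verb>#include &lt;stddef.h&gt;

#define PAGE_SIZE 4096u   /* example page size, in bytes */

/* Number of pages needed to hold a segment of seg_size bytes (rounded up). */
size_t pages_needed(size_t seg_size)
{
    return (seg_size + PAGE_SIZE - 1) / PAGE_SIZE;
}

/* Bytes left unused at the end of the last page. */
size_t wasted_bytes(size_t seg_size)
{
    return pages_needed(seg_size) * PAGE_SIZE - seg_size;
}

/* Example: a 10000-byte segment needs 3 pages and wastes 2288 bytes. */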
 </verb></p><p>Process X, identified by Segment X, is split into 3 pieces and each one
is loaded into a page.</p><p>We no longer have:</p><p><enum><item>The segmentation problem: we allocate per Page, so we also free per Page, and
we manage free space in an optimized way.</item><item>The pagination problem: only the last page of each Task wastes space, and we can decide to use
very small pages, for example 4096 bytes long (losing at most 4096*N_Tasks
bytes), and manage them with hierarchical paging (using 2 or 3 levels of paging).</item></enum></p><p><verb> 
 

                          |         |           |         |
                          |         |   Offset2 |  Value  |
                          |         |        /|\|         |
                  Offset1 |         |-----    | |         |
                      /|\ |         |    |    | |         |
                       |  |         |    |   \|/|         | 
                       |  |         |    ------&gt;|         |
                      \|/ |         |           |         |
 Base Paging Address ----&gt;|         |           |         |
                          | ....... |           | ....... |
                          |         |           |         |    
 
                     Hierarchical Paging</verb></p></sect2></sect1></sect><sect><heading>Linux Startup</heading><p>The first C code of the Linux kernel is executed from the ''startup_32:''
asm label:</p><p><verb>|startup_32:
   |start_kernel
      |lock_kernel
      |trap_init
      |init_IRQ
      |sched_init
      |softirq_init
      |time_init
      |console_init 
      |#ifdef CONFIG_MODULES 
         |init_modules 
      |#endif 
      |kmem_cache_init 
      |sti 
      |calibrate_delay 
      |mem_init
      |kmem_cache_sizes_init
      |pgtable_cache_init
      |fork_init
      |proc_caches_init 
      |vfs_caches_init
      |buffer_init
      |page_cache_init
      |signals_init 
      |#ifdef CONFIG_PROC_FS 
        |proc_root_init 
      |#endif 
      |#if defined(CONFIG_SYSVIPC) 
         |ipc_init
      |#endif 
      |check_bugs      
      |smp_init
      |rest_init
         |kernel_thread
         |unlock_kernel
         |cpu_idle</verb></p><p><itemize><item>startup_32 [arch/i386/kernel/head.S]</item><item>start_kernel [init/main.c]</item><item>lock_kernel [include/asm/smplock.h]</item><item>trap_init [arch/i386/kernel/traps.c]</item><item>init_IRQ [arch/i386/kernel/i8259.c]</item><item>sched_init [kernel/sched.c]</item><item>softirq_init [kernel/softirq.c]</item><item>time_init [arch/i386/kernel/time.c]</item><item>console_init [drivers/char/tty_io.c]</item><item>init_modules [kernel/module.c]</item><item>kmem_cache_init [mm/slab.c]</item><item>sti [include/asm/system.h]</item><item>calibrate_delay [init/main.c]</item><item>mem_init [arch/i386/mm/init.c]</item><item>kmem_cache_sizes_init [mm/slab.c]</item><item>pgtable_cache_init [arch/i386/mm/init.c]</item><item>fork_init [kernel/fork.c]</item><item>proc_caches_init </item><item>vfs_caches_init [fs/dcache.c]</item><item>buffer_init [fs/buffer.c]</item><item>page_cache_init [mm/filemap.c]</item><item>signals_init [kernel/signal.c]</item><item>proc_root_init [fs/proc/root.c]</item><item>ipc_init [ipc/util.c]</item><item>check_bugs [include/asm/bugs.h]</item><item>smp_init [init/main.c]</item><item>rest_init</item><item>kernel_thread [arch/i386/kernel/process.c]</item><item>unlock_kernel [include/asm/smplock.h]</item><item>cpu_idle [arch/i386/kernel/process.c]</item></itemize></p><p>The last function, ''rest_init'', does the following:</p><p><enum><item>launches the kernel thread ''init''</item><item>calls unlock_kernel</item><item>makes the kernel run the cpu_idle routine, which is the idle loop executed
when nothing is scheduled</item></enum></p><p>In fact the start_kernel procedure never ends: it executes the cpu_idle
routine endlessly.</p><p>A description of ''init'', the first Kernel Thread, follows:</p><p><verb>|init
   |lock_kernel
   |do_basic_setup
      |mtrr_init
      |sysctl_init
      |pci_init
      |sock_init
      |start_context_thread
      |do_initcalls
         |(*call())-&gt; kswapd_init
   |prepare_namespace
   |free_initmem
   |unlock_kernel
   |execve</verb></p></sect><sect><heading>Linux Peculiarities</heading><sect1><heading>Overview</heading><p>Linux has some peculiarities that distinguish it from other OSs. These
peculiarities include:</p><p><enum><item>Pagination only</item><item>Softirq</item><item>Kernel threads</item><item>Kernel modules</item><item>''Proc'' directory</item></enum></p><sect2><heading>Flexibility Elements</heading><p>Points 4 and 5 give system administrators enormous flexibility in configuring the system
from user mode, allowing them to work around even critical kernel bugs
or specific problems without having to reboot the machine. For example, if you
needed to change something on a big server and you didn't want to reboot,
you could prepare the kernel to talk with a module that you would write. </p></sect2></sect1><sect1><heading>Pagination only</heading><p>Linux doesn't use segmentation to distinguish Tasks from each other; it
uses pagination. (Only 2 segments are used for all Tasks: CODE and DATA/STACK.)</p><p>We can also say that an interTask page fault never occurs, because each
Task uses its own set of Page Tables. These tables
cannot point to the same physical addresses.</p><sect2><heading>Linux segments</heading><p>Under the Linux kernel only 4 segments exist: </p><p><enum><item>Kernel Code [0x10]</item><item>Kernel Data / Stack [0x18]</item><item>User Code [0x23]</item><item>User Data / Stack [0x2b]</item></enum></p><p>[syntax is ''Purpose [Segment]'']</p><p>Under Intel architecture, the segment registers used are:</p><p><itemize><item>CS for Code Segment</item><item>DS for Data Segment</item><item>SS for Stack Segment</item><item>ES for Alternative Segment (for example used to make a memory copy between
2 different segments)</item></itemize></p><p>So, every Task uses 0x23 for code and 0x2b for data/stack.</p></sect2><sect2><heading>Linux pagination</heading><p>Linux uses up to 3 levels of paging, depending on the architecture.
Under Intel only 2 levels are supported. Linux also supports Copy on Write
mechanisms (please see Chapter 10 for more information).</p></sect2><sect2><heading>Why don't interTask address conflicts exist?</heading><p>The answer is very simple: interTask address conflicts cannot exist
because the Linear -&gt; physical mapping is done by "Pagination",
so the kernel just needs to assign physical pages in a unique fashion.</p></sect2><sect2><heading>Do we need to defragment memory?</heading><p>No. Page assignment is a dynamic process. We need a page only when a Task
asks for it, so we pick one from the free pages in an ordered fashion.
When we want to release the page, we only have to add it to the free pages
list.</p></sect2><sect2><heading>What about Kernel Pages?</heading><p>Kernel pages have a problem: they can be allocated in a dynamic fashion
but we cannot have a guarantee that they are allocated in a contiguous area,
because linear kernel space is equivalent to physical kernel space.</p><p>For the Code Segment there is no problem. Boot code is allocated at boot time
(so we have a fixed amount of memory to allocate), and for modules we only have
to allocate a memory area large enough to contain the module code.</p><p>The real problem is the stack segment, because each Task uses some kernel
stack pages. Stack segments must be contiguous (according to the stack definition),
so we have to establish a maximum limit for each Task's stack size. If
we exceed this limit, bad things happen: we overwrite kernel mode process data
structures.</p><p>The structure of the Kernel helps us, because kernel functions are never:</p><p><itemize><item>recursive</item><item>nested (calling one another) more than N levels deep.</item></itemize></p><p>Once we know N, and we know the average stack space used by local variables in kernel
functions, we can estimate a stack limit.</p><p>If you want to try the problem out, you can create a module with a function
that calls itself many times. After a certain number of calls, the kernel
module will hang in the page fault exception handler (typically because of a write
to a read-only page).</p></sect2></sect1><sect1><heading>Softirq</heading><p>When an IRQ comes, task switching is deferred until later to get better
performance. Some jobs (that would otherwise have to be done right after the IRQ
and that could take a lot of CPU in interrupt time, like building up a TCP/IP packet)
are queued and done at scheduling time (once a time-slice ends).</p><p>In recent kernels (2.4.x) the softirq mechanism is handed to a kernel_thread:
''ksoftirqd_CPUn'', where n stands for the number of the CPU executing the kernel_thread
(in a uniprocessor system ''ksoftirqd_CPU0'' uses PID 3).</p><sect2><heading>Preparing Softirq</heading></sect2><sect2><heading>Enabling Softirq</heading><p>''cpu_raise_softirq'' is a routine that wakes up the ''ksoftirqd_CPU0''
kernel thread, to let it manage the enqueued job.</p><p><verb>|cpu_raise_softirq
   |__cpu_raise_softirq
   |wakeup_softirqd
      |wake_up_process</verb></p><p><itemize><item>cpu_raise_softirq [kernel/softirq.c]</item><item>__cpu_raise_softirq [include/linux/interrupt.h]</item><item>wakeup_softirqd [kernel/softirq.c]</item><item>wake_up_process [kernel/sched.c]</item></itemize></p><p>The ''__cpu_raise_softirq'' routine sets the right bit in the vector describing
pending softirqs.</p><p>''wakeup_softirqd'' uses ''wake_up_process'' to wake up the ''ksoftirqd_CPU0''
kernel thread.</p></sect2><sect2><heading>Executing Softirq</heading><p>TODO: describing data structures involved in softirq mechanism.</p><p>When the kernel thread ''ksoftirqd_CPU0'' has been woken up, it executes
the queued jobs.</p><p>The code of ''ksoftirqd_CPU0'' is (main endless loop):</p><p><verb>for (;;) {
   if (!softirq_pending(cpu)) 
      schedule();
   __set_current_state(TASK_RUNNING);
   while (softirq_pending(cpu)) { 
      do_softirq(); 
      if (current-&gt;need_resched) 
         schedule();
   }
   __set_current_state(TASK_INTERRUPTIBLE);
}</verb></p><p><itemize><item>ksoftirqd [kernel/softirq.c]</item></itemize></p></sect2></sect1><sect1><heading>Kernel Threads</heading><p>Even though Linux is a monolithic OS, a few ''kernel threads'' exist to
do  housekeeping work. </p><p>These Tasks don't utilize USER memory; they share KERNEL memory. They also
operate at the highest privilege level (RING 0 on an i386 architecture) like any other
piece of kernel mode code.</p><p>Kernel threads are created by the ''kernel_thread [arch/i386/kernel/process]''
function, which calls the ''clone'' [arch/i386/kernel/process.c] system
call from assembler (a ''fork''-like system call):</p><p><verb>int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
        long retval, d0;
 
        __asm__ __volatile__(
                "movl %%esp,%%esi\n\t"
                "int $0x80\n\t"         /* Linux/i386 system call */
                "cmpl %%esp,%%esi\n\t"  /* child or parent? */
                "je 1f\n\t"             /* parent - jump */
                /* Load the argument into eax, and push it.  That way, it does
                 * not matter whether the called function is compiled with
                 * -mregparm or not.  */
                "movl %4,%%eax\n\t"
                "pushl %%eax\n\t"               
                "call *%5\n\t"          /* call fn */
                "movl %3,%0\n\t"        /* exit */
                "int $0x80\n"
                "1:\t"
                :"=&amp;a" (retval), "=&amp;S" (d0)
                :"0" (__NR_clone), "i" (__NR_exit),
                 "r" (arg), "r" (fn),
                 "b" (flags | CLONE_VM)
                : "memory");
        return retval;
}</verb></p><p>Once called, we have a new Task (usually with a very low PID number, like
2, 3, etc.) waiting for a very slow resource, like a swap or USB event. Kernel threads
are only used for very slow resources, because otherwise the task switching overhead would be too high.</p><p>Below is a list of the most common kernel threads (from the ''ps x'' command):</p><p><verb>PID      COMMAND
 1        init
 2        keventd
 3        kswapd
 4        kreclaimd
 5        bdflush
 6        kupdated
 7        kacpid
67        khubd
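</verb></p><p>The kernel_thread routine shown earlier adds CLONE_VM to the clone flags,
so the new Task shares memory with its creator. The effect of that flag can be tried from user
space with the glibc ''clone()'' wrapper. This is only an analogy, not the kernel code path, and
''child_fn'' and ''run_shared_child'' are hypothetical names used for this sketch. It assumes Linux
with glibc.</p><p><verb>#define _GNU_SOURCE
#include &lt;sched.h&gt;      /* clone(), CLONE_VM */
#include &lt;signal.h&gt;     /* SIGCHLD */
#include &lt;stdlib.h&gt;     /* malloc(), free() */
#include &lt;sys/types.h&gt;  /* pid_t */
#include &lt;sys/wait.h&gt;   /* waitpid() */

#define CHILD_STACK_SIZE (64 * 1024)

/* Runs in the child; arg points to a variable owned by the parent. */
static int child_fn(void *arg)
{
    *(int *)arg = 42;
    return 0;
}

/*
 * Create a child with CLONE_VM, wait for it, and return the value it
 * stored in the shared variable, or -1 on error.
 */
int run_shared_child(void)
{
    int shared = 0;
    char *stack = malloc(CHILD_STACK_SIZE);
    pid_t pid;

    if (stack == NULL)
        return -1;

    /* The stack grows downwards on i386, so pass the top of the buffer. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                CLONE_VM | SIGCHLD, &amp;shared);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    if (waitpid(pid, NULL, 0) == -1) {
        free(stack);
        return -1;
    }

    free(stack);
    return shared;   /* the child's write is visible: same address space */
}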
</verb></p><p>The ''init'' kernel thread is the first process created, at boot time. It
starts all other User Mode Tasks (from the file /etc/inittab), like console daemons,
tty daemons and network daemons (''rc'' scripts).</p><sect2><heading>Example of Kernel Threads: kswapd [mm/vmscan.c].</heading><p>''kswapd'' is created by ''clone() [arch/i386/kernel/process.c]''</p><p>Initialisation routines:</p><p><verb>|do_initcalls
   |kswapd_init
      |kernel_thread
         |syscall clone (in assembler)</verb></p><p>do_initcalls [init/main.c]</p><p>kswapd_init [mm/vmscan.c]</p><p>kernel_thread [arch/i386/kernel/process.c]</p></sect2></sect1><sect1><heading>Kernel Modules</heading><sect2><heading>Overview</heading><p>Linux Kernel modules are pieces of code (examples: fs, net, and hw drivers)
running in kernel mode that you can add at runtime.</p><p>The core of Linux cannot be modularized: scheduling, interrupt management,
the core network code, and so on.</p><p>Under "/lib/modules/KERNEL_VERSION/" you can find all the modules installed
on your system.</p></sect2><sect2><heading>Module loading and unloading</heading><p>To load a module, type the following:</p><p><verb>insmod MODULE_NAME parameters

example: insmod ne io=0x300 irq=9</verb></p><p>NOTE: You can use modprobe in place of insmod if you want the parameters to be
found automatically (for example when using a PCI driver, or if you have specified
the parameters in the /etc/conf.modules file).</p><p>To unload a module, type the following:</p><p><verb> rmmod MODULE_NAME</verb></p></sect2><sect2><heading>Module definition</heading><p>A module always contains:</p><p><enum><item>an "init_module" function, executed by the insmod (or modprobe) command </item><item>a "cleanup_module" function, executed by the rmmod command</item></enum></p><p>If the module does not use these function names, you need to add 2 macros to specify
which functions will act as the module's init and exit routines:</p><p><enum><item>module_init(FUNCTION_NAME)</item><item>module_exit(FUNCTION_NAME)</item></enum></p><p>NOTE: a module can "see" a kernel variable only if it has been exported (with
the macro EXPORT_SYMBOL).</p></sect2><sect2><heading>A useful trick for adding flexibility to your kernel</heading><p><verb>// kernel sources side
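void (*foo_function_pointer)(void *);
EXPORT_SYMBOL(foo_function_pointer); // make the pointer visible to modules
 
// ... later, inside the kernel routine you want to hook:
if (foo_function_pointer)
  (foo_function_pointer)(parameter);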
  
 


// module side
extern void (*foo_function_pointer)(void *);

void my_function(void *parameter) {
  //My code
}
 
int init_module() {
  foo_function_pointer = &amp;my_function;
  return 0; // 0 tells insmod that loading succeeded
}

void cleanup_module() {
  foo_function_pointer = NULL;
}</verb></p><p>This simple trick gives you very high flexibility in your Kernel,
because the "my_function" routine is executed only once you have loaded the module.
This routine can do everything you want: for example, the ''rshaper'' module,
which controls the bandwidth of input traffic from the network, works in this
manner.</p><p>Notice that the whole module mechanism is possible thanks to some global
variables exported to modules, such as list heads (allowing you to extend the
list as much as you want). Typical examples are fs and generic devices (char,
block, net, telephony). You have to prepare the kernel to accept your new module;
in some cases you have to create an infrastructure (like the telephony one, which
was recently created) to be as standard as possible.</p></sect2></sect1><sect1><heading>Proc directory</heading><p>Proc fs is located in the /proc directory, a special directory
that allows you to talk directly with the kernel.</p><p>Linux uses the ''proc'' directory to support direct kernel communication.
This is necessary in many cases: for example, when you want to see the main process
data structures, enable the ''proxy-arp'' feature on one interface and not on
others, change the max number of threads, or debug some
bus state, like ISA or PCI, to know what cards are installed and what I/O addresses
and IRQs are assigned to them.</p><p><verb>|-- bus
|   |-- pci
|   |   |-- 00
|   |   |   |-- 00.0
|   |   |   |-- 01.0
|   |   |   |-- 07.0
|   |   |   |-- 07.1
|   |   |   |-- 07.2
|   |   |   |-- 07.3
|   |   |   |-- 07.4
|   |   |   |-- 07.5
|   |   |   |-- 09.0
|   |   |   |-- 0a.0
|   |   |   `-- 0f.0
|   |   |-- 01
|   |   |   `-- 00.0
|   |   `-- devices
|   `-- usb
|-- cmdline
|-- cpuinfo
|-- devices
|-- dma
|-- dri
|   `-- 0
|       |-- bufs
|       |-- clients
|       |-- mem
|       |-- name
|       |-- queues
|       |-- vm
|       `-- vma
|-- driver
|-- execdomains
|-- filesystems
|-- fs
|-- ide
|   |-- drivers
|   |-- hda -&gt; ide0/hda
|   |-- hdc -&gt; ide1/hdc
|   |-- ide0
|   |   |-- channel
|   |   |-- config
|   |   |-- hda
|   |   |   |-- cache
|   |   |   |-- capacity
|   |   |   |-- driver
|   |   |   |-- geometry
|   |   |   |-- identify
|   |   |   |-- media
|   |   |   |-- model
|   |   |   |-- settings
|   |   |   |-- smart_thresholds
|   |   |   `-- smart_values
|   |   |-- mate
|   |   `-- model
|   |-- ide1
|   |   |-- channel
|   |   |-- config
|   |   |-- hdc
|   |   |   |-- capacity
|   |   |   |-- driver
|   |   |   |-- identify
|   |   |   |-- media
|   |   |   |-- model
|   |   |   `-- settings
|   |   |-- mate
|   |   `-- model
|   `-- via
|-- interrupts
|-- iomem
|-- ioports
|-- irq
|   |-- 0
|   |-- 1
|   |-- 10
|   |-- 11
|   |-- 12
|   |-- 13
|   |-- 14
|   |-- 15
|   |-- 2
|   |-- 3
|   |-- 4
|   |-- 5
|   |-- 6
|   |-- 7
|   |-- 8
|   |-- 9
|   `-- prof_cpu_mask
|-- kcore
|-- kmsg
|-- ksyms
|-- loadavg
|-- locks
|-- meminfo
|-- misc
|-- modules
|-- mounts
|-- mtrr
|-- net
|   |-- arp
|   |-- dev
|   |-- dev_mcast
|   |-- ip_fwchains
|   |-- ip_fwnames
|   |-- ip_masquerade
|   |-- netlink
|   |-- netstat
|   |-- packet
|   |-- psched
|   |-- raw
|   |-- route
|   |-- rt_acct
|   |-- rt_cache
|   |-- rt_cache_stat
|   |-- snmp
|   |-- sockstat
|   |-- softnet_stat
|   |-- tcp
|   |-- udp
|   |-- unix
|   `-- wireless
|-- partitions
|-- pci
|-- scsi
|   |-- ide-scsi
|   |   `-- 0
|   `-- scsi
|-- self -&gt; 2069
|-- slabinfo
|-- stat
|-- swaps
|-- sys
|   |-- abi
|   |   |-- defhandler_coff
|   |   |-- defhandler_elf
|   |   |-- defhandler_lcall7
|   |   |-- defhandler_libcso
|   |   |-- fake_utsname
|   |   `-- trace
|   |-- debug
|   |-- dev
|   |   |-- cdrom
|   |   |   |-- autoclose
|   |   |   |-- autoeject
|   |   |   |-- check_media
|   |   |   |-- debug
|   |   |   |-- info
|   |   |   `-- lock
|   |   `-- parport
|   |       |-- default
|   |       |   |-- spintime
|   |       |   `-- timeslice
|   |       `-- parport0
|   |           |-- autoprobe
|   |           |-- autoprobe0
|   |           |-- autoprobe1
|   |           |-- autoprobe2
|   |           |-- autoprobe3
|   |           |-- base-addr
|   |           |-- devices
|   |           |   |-- active
|   |           |   `-- lp
|   |           |       `-- timeslice
|   |           |-- dma
|   |           |-- irq
|   |           |-- modes
|   |           `-- spintime
|   |-- fs
|   |   |-- binfmt_misc
|   |   |-- dentry-state
|   |   |-- dir-notify-enable
|   |   |-- dquot-nr
|   |   |-- file-max
|   |   |-- file-nr
|   |   |-- inode-nr
|   |   |-- inode-state
|   |   |-- jbd-debug
|   |   |-- lease-break-time
|   |   |-- leases-enable
|   |   |-- overflowgid
|   |   `-- overflowuid
|   |-- kernel
|   |   |-- acct
|   |   |-- cad_pid
|   |   |-- cap-bound
|   |   |-- core_uses_pid
|   |   |-- ctrl-alt-del
|   |   |-- domainname
|   |   |-- hostname
|   |   |-- modprobe
|   |   |-- msgmax
|   |   |-- msgmnb
|   |   |-- msgmni
|   |   |-- osrelease
|   |   |-- ostype
|   |   |-- overflowgid
|   |   |-- overflowuid
|   |   |-- panic
|   |   |-- printk
|   |   |-- random
|   |   |   |-- boot_id
|   |   |   |-- entropy_avail
|   |   |   |-- poolsize
|   |   |   |-- read_wakeup_threshold
|   |   |   |-- uuid
|   |   |   `-- write_wakeup_threshold
|   |   |-- rtsig-max
|   |   |-- rtsig-nr
|   |   |-- sem
|   |   |-- shmall
|   |   |-- shmmax
|   |   |-- shmmni
|   |   |-- sysrq
|   |   |-- tainted
|   |   |-- threads-max
|   |   `-- version
|   |-- net
|   |   |-- 802
|   |   |-- core
|   |   |   |-- hot_list_length
|   |   |   |-- lo_cong
|   |   |   |-- message_burst
|   |   |   |-- message_cost
|   |   |   |-- mod_cong
|   |   |   |-- netdev_max_backlog
|   |   |   |-- no_cong
|   |   |   |-- no_cong_thresh
|   |   |   |-- optmem_max
|   |   |   |-- rmem_default
|   |   |   |-- rmem_max
|   |   |   |-- wmem_default
|   |   |   `-- wmem_max
|   |   |-- ethernet
|   |   |-- ipv4
|   |   |   |-- conf
|   |   |   |   |-- all
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   |-- default
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   |-- eth0
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   |-- eth1
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   `-- lo
|   |   |   |       |-- accept_redirects
|   |   |   |       |-- accept_source_route
|   |   |   |       |-- arp_filter
|   |   |   |       |-- bootp_relay
|   |   |   |       |-- forwarding
|   |   |   |       |-- log_martians
|   |   |   |       |-- mc_forwarding
|   |   |   |       |-- proxy_arp
|   |   |   |       |-- rp_filter
|   |   |   |       |-- secure_redirects
|   |   |   |       |-- send_redirects
|   |   |   |       |-- shared_media
|   |   |   |       `-- tag
|   |   |   |-- icmp_echo_ignore_all
|   |   |   |-- icmp_echo_ignore_broadcasts
|   |   |   |-- icmp_ignore_bogus_error_responses
|   |   |   |-- icmp_ratelimit
|   |   |   |-- icmp_ratemask
|   |   |   |-- inet_peer_gc_maxtime
|   |   |   |-- inet_peer_gc_mintime
|   |   |   |-- inet_peer_maxttl
|   |   |   |-- inet_peer_minttl
|   |   |   |-- inet_peer_threshold
|   |   |   |-- ip_autoconfig
|   |   |   |-- ip_conntrack_max
|   |   |   |-- ip_default_ttl
|   |   |   |-- ip_dynaddr
|   |   |   |-- ip_forward
|   |   |   |-- ip_local_port_range
|   |   |   |-- ip_no_pmtu_disc
|   |   |   |-- ip_nonlocal_bind
|   |   |   |-- ipfrag_high_thresh
|   |   |   |-- ipfrag_low_thresh
|   |   |   |-- ipfrag_time
|   |   |   |-- neigh
|   |   |   |   |-- default
|   |   |   |   |   |-- anycast_delay
|   |   |   |   |   |-- app_solicit
|   |   |   |   |   |-- base_reachable_time
|   |   |   |   |   |-- delay_first_probe_time
|   |   |   |   |   |-- gc_interval
|   |   |   |   |   |-- gc_stale_time
|   |   |   |   |   |-- gc_thresh1
|   |   |   |   |   |-- gc_thresh2
|   |   |   |   |   |-- gc_thresh3
|   |   |   |   |   |-- locktime
|   |   |   |   |   |-- mcast_solicit
|   |   |   |   |   |-- proxy_delay
|   |   |   |   |   |-- proxy_qlen
|   |   |   |   |   |-- retrans_time
|   |   |   |   |   |-- ucast_solicit
|   |   |   |   |   `-- unres_qlen
|   |   |   |   |-- eth0
|   |   |   |   |   |-- anycast_delay
|   |   |   |   |   |-- app_solicit
|   |   |   |   |   |-- base_reachable_time
|   |   |   |   |   |-- delay_first_probe_time
|   |   |   |   |   |-- gc_stale_time
|   |   |   |   |   |-- locktime
|   |   |   |   |   |-- mcast_solicit
|   |   |   |   |   |-- proxy_delay
|   |   |   |   |   |-- proxy_qlen
|   |   |   |   |   |-- retrans_time
|   |   |   |   |   |-- ucast_solicit
|   |   |   |   |   `-- unres_qlen
|   |   |   |   |-- eth1
|   |   |   |   |   |-- anycast_delay
|   |   |   |   |   |-- app_solicit
|   |   |   |   |   |-- base_reachable_time
|   |   |   |   |   |-- delay_first_probe_time
|   |   |   |   |   |-- gc_stale_time
|   |   |   |   |   |-- locktime
|   |   |   |   |   |-- mcast_solicit
|   |   |   |   |   |-- proxy_delay
|   |   |   |   |   |-- proxy_qlen
|   |   |   |   |   |-- retrans_time
|   |   |   |   |   |-- ucast_solicit
|   |   |   |   |   `-- unres_qlen
|   |   |   |   `-- lo
|   |   |   |       |-- anycast_delay
|   |   |   |       |-- app_solicit
|   |   |   |       |-- base_reachable_time
|   |   |   |       |-- delay_first_probe_time
|   |   |   |       |-- gc_stale_time
|   |   |   |       |-- locktime
|   |   |   |       |-- mcast_solicit
|   |   |   |       |-- proxy_delay
|   |   |   |       |-- proxy_qlen
|   |   |   |       |-- retrans_time
|   |   |   |       |-- ucast_solicit
|   |   |   |       `-- unres_qlen
|   |   |   |-- route
|   |   |   |   |-- error_burst
|   |   |   |   |-- error_cost
|   |   |   |   |-- flush
|   |   |   |   |-- gc_elasticity
|   |   |   |   |-- gc_interval
|   |   |   |   |-- gc_min_interval
|   |   |   |   |-- gc_thresh
|   |   |   |   |-- gc_timeout
|   |   |   |   |-- max_delay
|   |   |   |   |-- max_size
|   |   |   |   |-- min_adv_mss
|   |   |   |   |-- min_delay
|   |   |   |   |-- min_pmtu
|   |   |   |   |-- mtu_expires
|   |   |   |   |-- redirect_load
|   |   |   |   |-- redirect_number
|   |   |   |   `-- redirect_silence
|   |   |   |-- tcp_abort_on_overflow
|   |   |   |-- tcp_adv_win_scale
|   |   |   |-- tcp_app_win
|   |   |   |-- tcp_dsack
|   |   |   |-- tcp_ecn
|   |   |   |-- tcp_fack
|   |   |   |-- tcp_fin_timeout
|   |   |   |-- tcp_keepalive_intvl
|   |   |   |-- tcp_keepalive_probes
|   |   |   |-- tcp_keepalive_time
|   |   |   |-- tcp_max_orphans
|   |   |   |-- tcp_max_syn_backlog
|   |   |   |-- tcp_max_tw_buckets
|   |   |   |-- tcp_mem
|   |   |   |-- tcp_orphan_retries
|   |   |   |-- tcp_reordering
|   |   |   |-- tcp_retrans_collapse
|   |   |   |-- tcp_retries1
|   |   |   |-- tcp_retries2
|   |   |   |-- tcp_rfc1337
|   |   |   |-- tcp_rmem
|   |   |   |-- tcp_sack
|   |   |   |-- tcp_stdurg
|   |   |   |-- tcp_syn_retries
|   |   |   |-- tcp_synack_retries
|   |   |   |-- tcp_syncookies
|   |   |   |-- tcp_timestamps
|   |   |   |-- tcp_tw_recycle
|   |   |   |-- tcp_window_scaling
|   |   |   `-- tcp_wmem
|   |   `-- unix
|   |       `-- max_dgram_qlen
|   |-- proc
|   `-- vm
|       |-- bdflush
|       |-- kswapd
|       |-- max-readahead
|       |-- min-readahead
|       |-- overcommit_memory
|       |-- page-cluster
|       `-- pagetable_cache
|-- sysvipc
|   |-- msg
|   |-- sem
|   `-- shm
|-- tty
|   |-- driver
|   |   `-- serial
|   |-- drivers
|   |-- ldisc
|   `-- ldiscs
|-- uptime
`-- version
</verb></p><p>The directory also contains one entry per task, named after its PID (this gives
you access to all Task information, such as the path of the binary file, memory used,
and so on).</p><p>The interesting point is that you can not only see kernel values (for example,
information about any task or about the network options enabled in your TCP/IP stack),
but you can also modify some of them, typically those under the /proc/sys
directory:</p><p><verb>/proc/sys/ 
          acpi
          dev
          debug
          fs
          proc
          net
          vm
          kernel</verb></p><sect2><heading>/proc/sys/kernel</heading><p>Below are very important and well-known kernel values, ready to be modified:</p><p><verb>overflowgid
overflowuid
random
threads-max // Max number of threads, typically 16384
sysrq // kernel hack: you can view instant register values and more
sem
msgmnb
msgmni
msgmax
shmmni
shmall
shmmax
rtsig-max
rtsig-nr
modprobe // modprobe file location
printk
ctrl-alt-del
cap-bound
panic
domainname // domain name of your Linux box
hostname // host name of your Linux box
version // date info about kernel compilation
osrelease // kernel version (i.e. 2.4.5)
ostype // Linux!</verb></p></sect2><sect2><heading>/proc/sys/net</heading><p>This can be considered the most useful proc subdirectory. It allows you
to change very important settings for your network kernel configuration.</p><p><verb>core
ipv4
ipv6
unix
ethernet
802</verb></p><sect3><heading>/proc/sys/net/core</heading><p>Listed here are general net settings, like "netdev_max_backlog" (typically
300), the maximum length of the queue of received network packets. This value can limit your network
bandwidth when receiving packets, because Linux has to wait until scheduling time to
flush the buffers (due to the bottom half mechanism), that is, about 1000/HZ ms.</p><p><verb>  300    *        100             =     30 000
packets     HZ(Timeslice freq)         packets/s
 
30 000   *       1000             =      30 M
packets     average (Bytes/packet)   throughput Bytes/s</verb></p><p>If you want to get higher throughput, you need to increase netdev_max_backlog,
by typing:</p><p><verb>echo 4000 &gt; /proc/sys/net/core/netdev_max_backlog</verb></p><p>Note: Beware of the HZ value: on some architectures (like alpha or
arm-tbox) it is 1000, so you can have 300 MBytes/s of average throughput.</p></sect3><sect3><heading>/proc/sys/net/ipv4</heading><p>"ip_forward" enables or disables IP forwarding in your Linux box. This is
a generic setting for all devices; you can also set it separately for each device you choose.</p><sect4><heading>/proc/sys/net/ipv4/conf/interface</heading><p>I think this is the most useful /proc entry, because it allows you to change
some net settings to support wireless networks (see <url url="http://bertolinux.fatamorgana.com/wireless/english" name="Wireless-HOWTO"></url> for more information).</p><p>Here are some examples of when you could use this setting:</p><p><itemize><item>"forwarding", to enable IP forwarding for your interface</item><item>"proxy_arp", to enable the proxy ARP feature. For more, see the Proxy ARP HOWTO under
<url url="http://www.linuxdoc.org" name="Linux Documentation Project"></url> and <url url="http://bertolinux.fatamorgana.com/wireless/english" name="Wireless-HOWTO"></url> for proxy ARP use in wireless networks.</item><item>"send_redirects", to prevent the interface from sending ICMP_REDIRECT (as before, see
<url url="http://bertolinux.fatamorgana.com/wireless/english" name="Wireless-HOWTO"></url> for more).</item></itemize></p></sect4></sect3></sect2></sect1></sect><sect><heading>Linux Multitasking</heading><sect1><heading>Overview</heading><p>This section analyzes the data structures and the mechanism used to manage
the multitasking environment under Linux.</p><sect2><heading>Task States</heading><p>A Linux Task can be in one of the following states (according to [include/linux/sched.h]):</p><p><enum><item>TASK_RUNNING, it means that it is in the "Ready List"</item><item>TASK_INTERRUPTIBLE, task waiting for a signal or a resource (sleeping)</item><item>TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping), it is in
some "Wait Queue"</item><item>TASK_ZOMBIE, terminated task whose parent has not yet collected its exit status</item><item>TASK_STOPPED, task stopped (for example, while being debugged)</item></enum></p></sect2><sect2><heading>Graphical Interaction</heading><p><verb>       ______________     CPU Available     ______________
      |              |  ----------------&gt;  |              |
      | TASK_RUNNING |                     | Real Running |  
      |______________|  &lt;----------------  |______________|
                           CPU Busy
            |   /|\       
Waiting for |    | Resource  
 Resource   |    | Available             
           \|/   |      
    ______________________                     
   |                      |
   | TASK_INTERRUPTIBLE / |
   | TASK_UNINTERRUPTIBLE |
   |______________________|
 
                     Main Multitasking Flow</verb></p></sect2></sect1><sect1><heading>Timeslice</heading><sect2><heading>PIT 8253 Programming</heading><p>Every 10 ms (depending on the HZ value) an IRQ0 arrives, which drives the multitasking
environment. This signal comes from the PIC 8259 (on 386+ architectures), which is connected
to the PIT 8253, clocked at 1.19318 MHz.</p><p><verb>    _____         ______        ______        
   | CPU |&lt;------| 8259 |------| 8253 |
   |_____| IRQ0  |______|      |___/|\|
                                    |_____ CLK 1.19318 MHz
          
// From include/asm/param.h
#ifndef HZ 
#define HZ 100 
#endif
 
// From include/asm/timex.h
#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
 
// From include/linux/timex.h
#define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */
 
// From arch/i386/kernel/i8259.c
outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */ 
outb_p(LATCH &amp; 0xff , 0x40); /* LSB */
outb(LATCH &gt;&gt; 8 , 0x40); /* MSB */
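</verb></p><p>The divisor arithmetic above can be checked in user space. This is only a sketch: the helper names (pit_latch, latch_lsb, latch_msb, tick_period_us) are invented for this example and are not kernel functions; only CLOCK_TICK_RATE and the LATCH formula come from the listing. Calling pit_latch(100) gives the value that would be split into the two bytes written to port 0x40.</p><p><verb><![CDATA[
```c
/* PIT input clock in Hz, as in the listing above. */
#define CLOCK_TICK_RATE 1193180UL

/* Divisor programmed into the PIT for a given tick rate (rounded to nearest). */
unsigned long pit_latch(unsigned long hz)
{
    return (CLOCK_TICK_RATE + hz / 2) / hz;
}

/* Low byte of the divisor: the first value written to port 0x40. */
unsigned char latch_lsb(unsigned long latch)
{
    return (unsigned char)(latch & 0xff);
}

/* High byte of the divisor: the second value written to port 0x40. */
unsigned char latch_msb(unsigned long latch)
{
    return (unsigned char)((latch >> 8) & 0xff);
}

/* Tick period, in microseconds, that the rounded divisor really produces. */
double tick_period_us(unsigned long hz)
{
    return 1e6 * (double)pit_latch(hz) / (double)CLOCK_TICK_RATE;
}
```
]]>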
 </verb></p><p>So we program the 8253 (PIT, Programmable Interval Timer) with LATCH = (1193180 + HZ/2) / HZ
= 11932 when HZ=100 (default); the exact ratio 1193180/100 = 11931.8 is rounded because the divisor must be an integer. LATCH is the frequency divisor.</p><p>LATCH = 11932 makes the 8253 output a frequency of 1193180 / 11932,
which is about 100 Hz, so the period is about 10 ms.</p><p>So Timeslice = 1/HZ.</p><p>On each Timeslice we temporarily interrupt the current process
(without task switching) and do some housekeeping work, after which we
return to the previous process.</p></sect2><sect2><heading>Linux Timer IRQ ICA</heading><p><verb>Linux Timer IRQ
IRQ 0 [Timer]
 |  
\|/
|IRQ0x00_interrupt        //   wrapper IRQ handler
   |SAVE_ALL              ---   
      |do_IRQ                |   wrapper routines
         |handle_IRQ_event  ---
            |handler() -&gt; timer_interrupt  // registered IRQ 0 handler
               |do_timer_interrupt
                  |do_timer  
                     |jiffies++;
                     |update_process_times  
                     |if (--counter &lt;= 0) { // if time slice ended then
                        |counter = 0;        //   reset counter           
                        |need_resched = 1;   //   prepare to reschedule
                     |}
         |do_softirq
         |while (need_resched) { // if necessary
            |schedule             //   reschedule
            |handle_softirq
         |}
   |RESTORE_ALL
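</verb></p><p>The counter handling in the call tree above can be tried out in user space. This is a simplified sketch, not kernel code: "struct task" is a stand-in holding only the two fields involved, and timer_tick and ticks_until_resched are hypothetical helper names. With HZ=100, a task starting with counter = 6 would ask for a reschedule after 6 ticks, that is about 60 ms.</p><p><verb><![CDATA[
```c
/* Stand-in for the two task fields touched by the timer interrupt. */
struct task {
    int counter;       /* ticks left in the current time slice */
    int need_resched;  /* set to 1 when schedule() should be called */
};

/* User-space copy of the tick counter; the kernel one is global too. */
unsigned long jiffies;

/* What one timer IRQ does to the running task, following the call tree. */
void timer_tick(struct task *current)
{
    jiffies++;
    if (--current->counter <= 0) {   /* time slice ended */
        current->counter = 0;        /* reset counter */
        current->need_resched = 1;   /* prepare to reschedule */
    }
}

/* Count how many ticks a task with the given slice runs before a
 * reschedule is requested. */
int ticks_until_resched(int initial_counter)
{
    struct task t = { initial_counter, 0 };
    int ticks = 0;

    while (!t.need_resched) {
        timer_tick(&t);
        ticks++;
    }
    return ticks;
}
```
]]>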
 </verb></p><p>Functions can be found under:</p><p><itemize><item>IRQ0x00_interrupt, SAVE_ALL [include/asm/hw_irq.h]</item><item>do_IRQ, handle_IRQ_event [arch/i386/kernel/irq.c]</item><item>timer_interrupt, do_timer_interrupt [arch/i386/kernel/time.c]</item><item>do_timer, update_process_times [kernel/timer.c]</item><item>do_softirq [kernel/softirq.c]</item><item>RESTORE_ALL, while loop [arch/i386/kernel/entry.S]</item></itemize></p><p>Notes:</p><p><enum><item>Function "IRQ0x00_interrupt" (like the other IRQ0xXY_interrupt functions) is directly
pointed to by the IDT (Interrupt Descriptor Table, similar to the Real Mode Interrupt
Vector Table, see Chapter 11 for more), so EVERY interrupt coming to the processor
is managed by an "IRQ0x#NR_interrupt" routine, where #NR is the interrupt
number. We refer to it as the "wrapper irq handler".</item><item>Wrapper routines are executed, like "do_IRQ" and "handle_IRQ_event" [arch/i386/kernel/irq.c].</item><item>After this, control is passed to the official IRQ routine (pointed to by "handler()"),
previously registered with "request_irq" [arch/i386/kernel/irq.c],
in this case "timer_interrupt" [arch/i386/kernel/time.c].</item><item>The "timer_interrupt" [arch/i386/kernel/time.c] routine is executed
and, when it ends,</item><item>control returns to some assembler routines [arch/i386/kernel/entry.S].</item></enum></p><p>Description: </p><p>To manage multitasking, Linux (like every other Unix) uses a ''counter''
variable to keep track of how much CPU time the task has left in its time slice. So, on each IRQ
0, the counter is decremented (point 4) and, when it reaches 0, we need to
switch task to manage timesharing (in point 4 the "need_resched" variable is set to
1; then, in point 5, the assembler routines check "need_resched" and call, if needed,
"schedule" [kernel/sched.c]).</p></sect2></sect1><sect1><heading>Scheduler</heading><p>The scheduler is the piece of code that chooses which Task has to be executed
at a given time.</p><p>Any time the running task has to change, the scheduler selects a candidate. Below is
the ''schedule [kernel/sched.c]'' function.</p><p><verb>|schedule
   |do_softirq // manages post-IRQ work
   |for each task
      |calculate counter
   |prepare_to_switch // does nothing on i386
   |switch_mm // change Memory context (change CR3 value)
   |switch_to (assembler)
      |SAVE ESP
      |RESTORE future_ESP
      |SAVE EIP
      |push future_EIP *** push parameter as we did a call 
         |jmp __switch_to (it does some TSS work) 
         |__switch_to()
          ..
         |ret *** ret from call using future_EIP in place of call address
      new_task
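</verb></p><p>The "for each task, calculate counter" step can be pictured with a small user-space sketch. It rests on assumptions that are not taken from the tree above: the candidate is simply the runnable task with the most ticks left, and when every runnable task has used up its slice all counters are recalculated as half the old value plus a per-task "priority". The struct, its fields and pick_next are hypothetical names, and the real schedule() weighs more factors than this.</p><p><verb><![CDATA[
```c
#include <stddef.h>

/* Only two of the task states are needed for this sketch. */
enum { TASK_RUNNING = 0, TASK_INTERRUPTIBLE = 1 };

struct task {
    int state;     /* TASK_RUNNING means the task is in the ready list */
    int counter;   /* ticks left in the current time slice */
    int priority;  /* ticks granted at each recalculation (assumed field) */
};

/* Return the index of the task to run next, or -1 if nothing is runnable. */
int pick_next(struct task *tasks, size_t n)
{
    for (int pass = 0; pass < 2; pass++) {
        int best = -1;

        /* Look for the runnable task with the largest remaining slice. */
        for (size_t i = 0; i < n; i++) {
            if (tasks[i].state != TASK_RUNNING || tasks[i].counter <= 0)
                continue;
            if (best < 0 || tasks[i].counter > tasks[best].counter)
                best = (int)i;
        }
        if (best >= 0)
            return best;

        /* All runnable tasks used up their slice: recalculate every
         * counter, sleeping tasks included, then look once more. */
        for (size_t i = 0; i < n; i++)
            tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
    }
    return -1;
}
```
]]>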
</verb></p></sect1><sect1><heading>Bottom Half, Task Queues, and Tasklets</heading><sect2><heading>Overview</heading><p>In classic Unix, when an IRQ comes (from a device), Unix performs a "task switch"
to interrogate the task that requested the device.</p><p>To improve performance, Linux can postpone non-urgent work until later,
to better manage high-speed events.</p><p>This feature has been managed since kernel 1.x by the "bottom half" (BH). The irq
handler "marks" a bottom half, to be executed later, at scheduling time.</p><p>In the latest kernels there is a "task queue" that is more dynamic than BH,
and there is also a "tasklet" to manage multiprocessor environments.</p><p>BH schema is:</p><p><enum><item>Declaration</item><item>Mark</item><item>Execution</item></enum></p></sect2><sect2><heading>Declaration</heading><p><verb>#define DECLARE_TASK_QUEUE(q) LIST_HEAD(q)
#define LIST_HEAD(name) \
   struct list_head name = LIST_HEAD_INIT(name) 
struct list_head { 
   struct list_head *next, *prev; 
};
#define LIST_HEAD_INIT(name) { &amp;(name), &amp;(name) } 
 
      ''DECLARE_TASK_QUEUE'' [include/linux/tqueue.h, include/linux/list.h] </verb></p><p>The "DECLARE_TASK_QUEUE(q)" macro is used to declare a structure named "q" that manages
a task queue.</p></sect2><sect2><heading>Mark</heading><p>Here is the ICA schema for the "mark_bh" [include/linux/interrupt.h]
function:</p><p><verb>|mark_bh(NUMBER)
   |tasklet_hi_schedule(bh_task_vec + NUMBER)
      |insert into tasklet_hi_vec
         |__cpu_raise_softirq(HI_SOFTIRQ) 
            |soft_active |= (1 &lt;&lt; HI_SOFTIRQ)
 
                   ''mark_bh''[include/linux/interrupt.h]</verb></p><p>For example, when an IRQ handler wants to "postpone" some work, it calls
"mark_bh(NUMBER)", where NUMBER is a previously declared BH (see the section above).</p></sect2><sect2><heading>Execution</heading><p>We can see this being called from the "do_IRQ" [arch/i386/kernel/irq.c]
function:</p><p><verb>|do_softirq
   |h-&gt;action(h)-&gt; softirq_vec[TASKLET_SOFTIRQ]-&gt;action -&gt; tasklet_action
      |tasklet_vec[0].list-&gt;func
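</verb></p><p>The mark-then-execute scheme can be modelled in a few lines of user-space C. This is a sketch under assumptions: the toy_ functions, count_runs and tasklet_runs are hypothetical names, and the real do_softirq also deals with per-CPU state and interrupt masking, which are left out here. The idea shown is only that "marking" sets one bit in a mask and "execution" later calls h-&gt;action(h) for every bit found set.</p><p><verb><![CDATA[
```c
/* Softirq numbers; each one owns a bit in the pending mask. */
enum { HI_SOFTIRQ = 0, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, TASKLET_SOFTIRQ, NR_SOFTIRQS };

struct softirq_action {
    void (*action)(struct softirq_action *);
    void *data;
};

static struct softirq_action softirq_vec[NR_SOFTIRQS];
static unsigned int softirq_active;   /* one pending bit per softirq */

/* Declaration: register the function to run for softirq number nr. */
void toy_open_softirq(int nr, void (*action)(struct softirq_action *), void *data)
{
    softirq_vec[nr].action = action;
    softirq_vec[nr].data = data;
}

/* Mark: remember that softirq nr has work to do (cheap, done in the IRQ). */
void toy_raise_softirq(int nr)
{
    softirq_active |= 1u << nr;
}

/* Execution: run every marked handler once; return how many ran. */
int toy_do_softirq(void)
{
    unsigned int active = softirq_active;
    int ran = 0;

    softirq_active = 0;
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        struct softirq_action *h = &softirq_vec[nr];

        if ((active & (1u << nr)) && h->action) {
            h->action(h);
            ran++;
        }
    }
    return ran;
}

/* Example handler: counts its runs in the int that h->data points to. */
int tasklet_runs;
void count_runs(struct softirq_action *h)
{
    ++*(int *)h->data;
}
```
]]>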
         
</verb></p><p>"h-&gt;action(h);" is the function that has been previously queued.</p></sect2></sect1><sect1><heading>Very low level routines</heading><p>set_intr_gate</p><p>set_trap_gate</p><p>set_task_gate (not used).</p><p>(*interrupt)[NR_IRQS](void) = { IRQ0x00_interrupt, IRQ0x01_interrupt,
..}</p><p>NR_IRQS = 224 [kernel 2.4.2]</p></sect1><sect1><heading>Task Switching</heading><sect2><heading>When does Task switching occur?</heading><p>Now we'll see how the Linux Kernel switches from one task to another.</p><p>Task Switching is needed in many cases, such as the following:</p><p><itemize><item>when the TimeSlice ends, we need to give access to some other task</item><item>when a task decides to access a resource, it sleeps waiting for it, so we have to
choose another task</item><item>when a task waits for a pipe, we have to give access to another task, which
would write to the pipe</item></itemize></p></sect2><sect2><heading>Task Switching</heading><p><verb>                           TASK SWITCHING TRICK
#define switch_to(prev,next,last) do {                                  \
        asm volatile("pushl %%esi\n\t"                                  \
                     "pushl %%edi\n\t"                                  \
                     "pushl %%ebp\n\t"                                  \
                     "movl %%esp,%0\n\t"        /* save ESP */          \
                     "movl %3,%%esp\n\t"        /* restore ESP */       \
                     "movl $1f,%1\n\t"          /* save EIP */          \
                     "pushl %4\n\t"             /* restore EIP */       \
                     "jmp __switch_to\n"                                \
                     "1:\t"                                             \
                     "popl %%ebp\n\t"                                   \
                     "popl %%edi\n\t"                                   \
                     "popl %%esi\n\t"                                   \
                     :"=m" (prev-&gt;thread.esp),"=m" (prev-&gt;thread.eip),  \
                      "=b" (last)                                       \
                     :"m" (next-&gt;thread.esp),"m" (next-&gt;thread.eip),    \
                      "a" (prev), "d" (next),                           \
                      "b" (prev));                                      \
} while (0)</verb></p><p>The trick is here:</p><p><enum><item>''pushl %4'', which puts future_EIP onto the stack</item><item>''jmp __switch_to'', which executes the ''__switch_to'' function but, unlike
a ''call'', makes it return to the value pushed in point 1 (so to the new Task!)</item></enum></p><p><verb>      U S E R   M O D E                 K E R N E L     M O D E

 |          |     |          |       |          |     |          |
 |          |     |          | Timer |          |     |          |
 |          |     |  Normal  |  IRQ  |          |     |          |
 |          |     |   Exec   |------&gt;|Timer_Int.|     |          |
 |          |     |     |    |       | ..       |     |          |
 |          |     |    \|/   |       |schedule()|     | Task1 Ret|
 |          |     |          |       |_switch_to|&lt;--  |  Address |
 |__________|     |__________|       |          |  |  |          |
                                     |          |  |S |          | 
Task1 Data/Stack   Task1 Code        |          |  |w |          |
                                     |          | T|i |          |
                                     |          | a|t |          |
 |          |     |          |       |          | s|c |          |
 |          |     |          | Timer |          | k|h |          |
 |          |     |  Normal  |  IRQ  |          |  |i |          | 
 |          |     |   Exec   |------&gt;|Timer_Int.|  |n |          |
 |          |     |     |    |       | ..       |  |g |          |
 |          |     |    \|/   |       |schedule()|  |  | Task2 Ret|
 |          |     |          |       |_switch_to|&lt;--  |  Address |
 |__________|     |__________|       |__________|     |__________|
 
Task2 Data/Stack   Task2 Code        Kernel Code  Kernel Data/Stack</verb></p></sect2></sect1><sect1><heading>Fork</heading><sect2><heading>Overview</heading><p>Fork is used to create another task. We start from a Task Parent, and we
copy many data structures to Task Child.</p><p><verb> 
                               |         |
                               | ..      |
         Task Parent           |         |
         |         |           |         |
         |  fork   |----------&gt;|  CREATE |   
         |         |          /|   NEW   |
         |_________|         / |   TASK  |
                            /  |         |
             ---           /   |         |
             ---          /    | ..      |
                         /     |         |
         Task Child     / 
         |         |   /
         |  fork   |&lt;-/
         |         |
         |_________|
              
                       Fork SysCall</verb></p></sect2><sect2><heading>What is not copied</heading><p>The newly created Task (''Task Child'') is almost identical to its Parent (''Task
Parent''); there are only a few differences:</p><p><enum><item>obviously the PID</item><item>the child's ''fork()'' returns 0, while the parent's ''fork()'' returns the PID
of the Task Child, so that the two can be distinguished in User Mode</item><item>All child data pages are marked ''READ + EXECUTE'', no ''WRITE'' (the parent's mappings
of the same pages are write-protected in the same way) so, when a write request comes, a ''Page
Fault'' exception is generated, which creates a new independent page: this
mechanism is called ''Copy on Write'' (see Chapter 10 for more).</item></enum></p></sect2><sect2><heading>Fork ICA</heading><p><verb>|sys_fork 
   |do_fork
      |alloc_task_struct 
         |__get_free_pages
       |p-&gt;state = TASK_UNINTERRUPTIBLE
       |copy_flags
       |p-&gt;pid = get_pid    
       |copy_files
       |copy_fs
       |copy_sighand
       |copy_mm // should manage CopyOnWrite (I part)
          |allocate_mm
          |mm_init
             |pgd_alloc -&gt; get_pgd_fast
                |get_pgd_slow
          |dup_mmap
             |copy_page_range
                |ptep_set_wrprotect
                   |clear_bit // set page to read-only              
          |copy_segments // For LDT
       |copy_thread
          |childregs-&gt;eax = 0  
          |p-&gt;thread.esp = childregs // child fork returns 0
          |p-&gt;thread.eip = ret_from_fork // child starts from fork exit
       |retval = p-&gt;pid // parent fork returns child pid
       |SET_LINKS // insertion of task into the list pointers
       |nr_threads++ // Global variable
       |wake_up_process(p) // Now we can wake up just created child
       |return retval
              
               fork ICA
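</verb></p><p>Seen from User Mode, the two return values set up in the tree above (childregs-&gt;eax = 0 for the child, retval = p-&gt;pid for the parent) look like this. This is a minimal sketch using the standard POSIX calls fork, waitpid and _exit; fork_and_wait is a hypothetical helper name chosen for this example.</p><p><verb><![CDATA[
```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately with the given code.
 * Returns the exit status collected by the parent, or -1 on error.
 */
int fork_and_wait(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed, no child was created */

    if (pid == 0)
        _exit(code);        /* child: fork() returned 0 */

    /* parent: fork() returned the PID of the child */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```
]]>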
 </verb></p><p><itemize><item>sys_fork [arch/i386/kernel/process.c]</item><item>do_fork [kernel/fork.c]</item><item>alloc_task_struct [include/asm/processor.h]</item><item>__get_free_pages [mm/page_alloc.c]</item><item>get_pid [kernel/fork.c]</item><item>copy_files </item><item>copy_fs</item><item>copy_sighand</item><item>copy_mm</item><item>allocate_mm</item><item>mm_init</item><item>pgd_alloc -&gt; get_pgd_fast [include/asm/pgalloc.h]</item><item>get_pgd_slow</item><item>dup_mmap [kernel/fork.c]</item><item>copy_page_range [mm/memory.c]</item><item>ptep_set_wrprotect [include/asm/pgtable.h]</item><item>clear_bit [include/asm/bitops.h]</item><item>copy_segments [arch/i386/kernel/process.c]</item><item>copy_thread</item><item>SET_LINKS [include/linux/sched.h]</item><item>wake_up_process [kernel/sched.c]</item></itemize></p></sect2><sect2><heading>Copy on Write</heading><p>To implement Copy on Write, Linux:</p><p><enum><item>Marks all copied pages as read-only, causing a Page Fault when a Task tries
to write to them.</item><item>Has the Page Fault handler create a new page for the Task that caused the exception.</item></enum></p><p><verb> 
 | Page 
 | Fault 
 | Exception
 |
 |
 -----------&gt; |do_page_fault
                 |handle_mm_fault
                    |handle_pte_fault 
                       |do_wp_page        
                          |alloc_page      // Allocate a new page
                          |break_cow
                             |copy_cow_page // Copy old page to new one
                             |establish_pte // reconfig Page Table pointers
                                |set_pte
                            
              Page Fault ICA
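</verb></p><p>The visible effect of Copy on Write can be observed from User Mode: after a fork, a write done by the child never shows up in the parent. The sketch below only demonstrates that result with standard POSIX calls; it cannot show the page fault itself, and the buffer and the function name parent_value_after_child_write are hypothetical names for this example.</p><p><verb><![CDATA[
```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* About one page of data, shared between parent and child until written. */
static char buffer[4096];

/*
 * Let a child overwrite its view of the buffer, then return the first
 * byte as the parent still sees it, or -1 on error.
 */
int parent_value_after_child_write(void)
{
    int status;
    pid_t pid;

    memset(buffer, 'P', sizeof buffer);   /* touch the page before forking */

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write lands in a private copy of the page. */
        memset(buffer, 'C', sizeof buffer);
        _exit(buffer[0] == 'C' ? 0 : 1);
    }

    /* Parent: wait for the child, then read our own copy. */
    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return buffer[0];
}
```
]]>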
 </verb></p><p><itemize><item>do_page_fault [arch/i386/mm/fault.c] </item><item>handle_mm_fault [mm/memory.c]</item><item>handle_pte_fault </item><item>do_wp_page</item><item>alloc_page [include/linux/mm.h]</item><item>break_cow [mm/memory.c]</item><item>copy_cow_page</item><item>establish_pte</item><item>set_pte [include/asm/pgtable-3level.h]</item></itemize></p></sect2></sect1></sect><sect><heading>Linux Memory Management</heading><sect1><heading>Overview</heading><p>Linux uses segmentation + pagination, which simplifies notation. </p><sect2><heading>Segments</heading><p>Linux uses only 4 segments:</p><p><itemize><item>2 segments (code and data/stack) for KERNEL SPACE from [0xC000 0000]
(3 GB) to [0xFFFF FFFF] (4 GB)</item><item>2 segments (code and data/stack) for USER SPACE from [0] (0 GB)
to [0xBFFF FFFF] (3 GB)</item></itemize></p><p><verb>                               __
   4 GB---&gt;|                |    |
           |     Kernel     |    |  Kernel Space (Code + Data/Stack)
           |                |  __|
   3 GB---&gt;|----------------|  __
           |                |    |
           |                |    |
   2 GB---&gt;|                |    |
           |     Tasks      |    |  User Space (Code + Data/Stack)
           |                |    |
   1 GB---&gt;|                |    |
           |                |    |
           |________________|  __| 
 0x00000000
          Kernel/User Linear addresses
 </verb></p></sect2></sect1><sect1><heading>Specific i386 implementation</heading><p>Again, Linux implements Pagination using 3 Levels of Paging, but in i386
architecture only 2 of them are really used:</p><p><verb> 
   ------------------------------------------------------------------
   L    I    N    E    A    R         A    D    D    R    E    S    S
   ------------------------------------------------------------------
        \___/                 \___/                     \_____/ 
 
     PD offset              PF offset                 Frame offset 
     [10 bits]              [10 bits]                 [12 bits]       
          |                     |                          |
          |                     |     -----------          |        
          |                     |     |  Value  |----------|---------
          |     |         |     |     |---------|   /|\    |        |
          |     |         |     |     |         |    |     |        |
          |     |         |     |     |         |    | Frame offset |
          |     |         |     |     |         |   \|/             |
          |     |         |     |     |---------|&lt;------            |
          |     |         |     |     |         |      |            |
          |     |         |     |     |         |      | x 4096     |
          |     |         |  PF offset|_________|-------            |
          |     |         |       /|\ |         |                   |
      PD offset |_________|-----   |  |         |          _________|
            /|\ |         |    |   |  |         |          | 
             |  |         |    |  \|/ |         |         \|/
 _____       |  |         |    ------&gt;|_________|   PHYSICAL ADDRESS 
|     |     \|/ |         |    x 4096 |         |
| CR3 |--------&gt;|         |           |         |
|_____|         | ....... |           | ....... |
                |         |           |         |    
 
               Page Directory          Page File

                       Linux i386 Paging
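</verb></p><p>The 10 + 10 + 12 bit split drawn above is easy to reproduce with shifts and masks. This is a user-space sketch: pd_index, pf_index, frame_offset and physical_address are hypothetical helper names chosen to match the labels in the diagram, and the frame number passed to physical_address stands for the "Value" read from the second-level table.</p><p><verb><![CDATA[
```c
#include <stdint.h>

#define PAGE_SHIFT   12   /* low 12 bits: offset inside the 4096-byte frame */
#define PGDIR_SHIFT  22   /* top 10 bits: index into the Page Directory */
#define PTRS_PER_PTE 1024 /* entries in a second-level table (10 bits) */

/* PD offset: which Page Directory entry the address uses. */
unsigned int pd_index(uint32_t linear)
{
    return linear >> PGDIR_SHIFT;
}

/* PF offset: which entry of the second-level table the address uses. */
unsigned int pf_index(uint32_t linear)
{
    return (linear >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}

/* Frame offset: position of the byte inside the selected frame. */
unsigned int frame_offset(uint32_t linear)
{
    return linear & ((1u << PAGE_SHIFT) - 1);
}

/* Physical address = frame number x 4096 + frame offset. */
uint32_t physical_address(uint32_t frame_number, uint32_t linear)
{
    return (frame_number << PAGE_SHIFT) | frame_offset(linear);
}
```
]]>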
 

</verb></p></sect1><sect1><heading>Memory Mapping</heading><p>Linux manages access control with pagination only, so different Tasks will
have the same segment addresses, but a different CR3 (the register used to store
the Page Directory address), pointing to different page entries.</p><p>In User mode a task cannot go beyond the 3 GB limit (0xC0000000), so only
the first 768 page directory entries are meaningful (768*4MB = 3GB).</p><p>When a Task enters Kernel Mode (through a system call or an IRQ) the other 256
page directory entries become important, and they point to the same page files
as in all other Tasks (which are the same as the Kernel's).</p><p>Note that Kernel (and only kernel) Linear Space maps directly onto Kernel Physical
Space, so:</p><p><verb> 
            ________________ _____                    
           |Other KernelData|___  |  |                |
           |----------------|   | |__|                |
           |     Kernel     |\  |____|   Real Other   |
  3 GB ---&gt;|----------------| \      |   Kernel Data  |
           |                |\ \     |                |
           |              __|_\_\____|__   Real       |
           |      Tasks     |  \ \   |     Tasks      |
           |              __|___\_\__|__   Space      |
           |                |    \ \ |                |
           |                |     \ \|----------------|
           |                |      \ |Real KernelSpace|
           |________________|       \|________________|
      
           Logical Addresses          Physical Addresses
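</verb></p><p>The fixed 3 GB shift between kernel linear and physical addresses comes down to one subtraction. The kernel has its own macros for this (__pa and __va); the functions below are only user-space stand-ins with hypothetical names, written to check the numbers used in this section (768 user entries, 256 kernel entries).</p><p><verb><![CDATA[
```c
#include <stdint.h>

#define PAGE_OFFSET  0xC0000000u  /* start of kernel linear space (3 GB) */
#define PGDIR_SHIFT  22           /* each Page Directory entry maps 4 MB */
#define PTRS_PER_PGD 1024         /* entries in the Page Directory */

/* Kernel linear address to physical address: translate 3 GB down. */
uint32_t kernel_virt_to_phys(uint32_t linear)
{
    return linear - PAGE_OFFSET;
}

/* Physical address to kernel linear address: translate 3 GB up. */
uint32_t kernel_phys_to_virt(uint32_t physical)
{
    return physical + PAGE_OFFSET;
}

/* Non-zero when the linear address belongs to kernel space. */
int is_kernel_address(uint32_t linear)
{
    return linear >= PAGE_OFFSET;
}

/* Page Directory entries that map user space (3 GB / 4 MB). */
unsigned int user_pgd_entries(void)
{
    return PAGE_OFFSET >> PGDIR_SHIFT;
}

/* Remaining entries, shared by every task, that map kernel space. */
unsigned int kernel_pgd_entries(void)
{
    return PTRS_PER_PGD - user_pgd_entries();
}
```
]]>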
 </verb></p><p>Linear Kernel Space corresponds to Physical Kernel Space translated 3
GB down (in fact the page tables are something like { "00000000", "00000001" },
so they perform no virtualization; they only report the physical addresses they
derive from the linear ones).</p><p>Notice that there is no "address conflict" between Kernel and User
spaces because we can manage physical addresses with Page Tables.</p></sect1><sect1><heading>Low level memory allocation</heading><sect2><heading>Boot Initialization</heading><p>We start from kmem_cache_init (launched by start_kernel [init/main.c]
at boot up).</p><p><verb>|kmem_cache_init
   |kmem_cache_estimate
</verb></p><p>kmem_cache_init [mm/slab.c]</p><p>kmem_cache_estimate</p><p>Now we continue with mem_init (also launched by start_kernel [init/main.c])</p><p><verb>|mem_init
   |free_all_bootmem
      |free_all_bootmem_core</verb></p><p>mem_init [arch/i386/mm/init.c]</p><p>free_all_bootmem [mm/bootmem.c]</p><p>free_all_bootmem_core</p></sect2><sect2><heading>Run-time allocation</heading><p>Under Linux, when we want to allocate memory, for example during the "copy_on_write"
mechanism (see Chapter 10), we call:</p><p><verb>|copy_mm 
   |allocate_mm = kmem_cache_alloc
      |__kmem_cache_alloc
         |kmem_cache_alloc_one
            |alloc_new_slab
               |kmem_cache_grow
                  |kmem_getpages
                     |__get_free_pages
                        |alloc_pages
                           |alloc_pages_pgdat
                              |__alloc_pages
                                 |rmqueue   
                                 |reclaim_pages
</verb></p><p>Functions can be found under:</p><p><itemize><item>copy_mm [kernel/fork.c]</item><item>allocate_mm [kernel/fork.c]</item><item>kmem_cache_alloc [mm/slab.c]</item><item>__kmem_cache_alloc </item><item>kmem_cache_alloc_one</item><item>alloc_new_slab</item><item>kmem_cache_grow</item><item>kmem_getpages</item><item>__get_free_pages [mm/page_alloc.c]</item><item>alloc_pages [mm/numa.c]</item><item>alloc_pages_pgdat</item><item>__alloc_pages [mm/page_alloc.c]</item><item>rmqueue</item><item>reclaim_pages [mm/vmscan.c]</item></itemize></p><p>TODO: Understand Zones</p></sect2></sect1><sect1><heading>Swap</heading><sect2><heading>Overview</heading><p>Swap is managed by the kswapd daemon (kernel thread).</p></sect2><sect2><heading>kswapd</heading><p>Like other kernel threads, kswapd has a main loop that waits to be woken up.</p><p><verb>|kswapd
   |// initialization routines
   |for (;;) { // Main loop
      |do_try_to_free_pages
      |recalculate_vm_stats
      |refill_inactive_scan
      |run_task_queue
      |interruptible_sleep_on_timeout // we sleep for a new swap request
   |}</verb></p><p><itemize><item>kswapd [mm/vmscan.c]</item><item>do_try_to_free_pages</item><item>recalculate_vm_stats [mm/swap.c]</item><item>refill_inactive_scan [mm/vmscan.c]</item><item>run_task_queue [kernel/softirq.c]</item><item>interruptible_sleep_on_timeout [kernel/sched.c]</item></itemize></p></sect2><sect2><heading>When do we need swapping?</heading><p>Swapping is needed when we have to access a page that is not in physical
memory.</p><p>Linux uses the ''kswapd'' kernel thread for this purpose. When the
Task receives a page fault exception we do the following:</p><p><verb> 
 | Page Fault Exception
 | caused by all these conditions: 
 |   a-) User page 
 |   b-) Read or write access 
 |   c-) Page not present
 |
 |
 -----------&gt; |do_page_fault
                 |handle_mm_fault
                    |pte_alloc 
                       |pte_alloc_one
                          |__get_free_page = __get_free_pages
                             |alloc_pages
                                |alloc_pages_pgdat
                                   |__alloc_pages
                                      |wakeup_kswapd // We wake up kernel thread kswapd
   
                   Page Fault ICA
 </verb></p><p><itemize><item>doentpageentfault entarch/i386/mm/fault.cent </item><item>handleentmmentfault entmm/memory.cent</item><item>pteentalloc</item><item>pteentallocentone entinclude/asm/pgalloc.hent</item><item>ententgetentfreeentpage entinclude/linux/mm.hent</item><item>ententgetentfreeentpages entmm/pageentalloc.cent</item><item>allocentpages entmm/numa.cent</item><item>allocentpagesentpgdat</item><item>ententallocentpages</item><item>wakeupentkswapd entmm/vmscan.cent</item></itemize></p></sect2></sect1></sect><sect><heading>Linux Networking</heading><sect1><heading>How Linux networking is managed?</heading><p>There exists a device driver for each kind of NIC. Inside it, Linux will
ALWAYS call a standard high level routing: "netifentrx entnet/core/dev.cent",
which will controls what 3 level protocol the frame belong to, and it will
call the right 3 level function (so we'll use a pointer to the function to
determine which is right).</p></sect1><sect1><heading>TCP example</heading><p>We'll see now an example of what happens when we send a TCP packet to Linux,
starting from the ''netif_rx [net/core/dev.c]'' call.</p><sect2><heading>Interrupt management: "netif_rx"</heading><p><verb>|netif_rx
   |__skb_queue_tail
      |qlen++
      |* simple pointer insertion *    
   |cpu_raise_softirq
      |softirq_active(cpu) |= (1 &lt;&lt; NET_RX_SOFTIRQ) // set bit NET_RX_SOFTIRQ in the BH vector
 </verb></p><p>Functions:</p><p><itemize><item>__skb_queue_tail [include/linux/skbuff.h]</item><item>cpu_raise_softirq [kernel/softirq.c]</item></itemize></p></sect2><sect2><heading>Post Interrupt management: "net_rx_action"</heading><p>Once IRQ handling has ended, we need to follow the next part of the frame's
life and examine what NET_RX_SOFTIRQ does.</p><p>Next, ''net_rx_action [net/core/dev.c]'' is called, as registered
by "net_dev_init [net/core/dev.c]".</p><p><verb>|net_rx_action
   |skb = __skb_dequeue (the exact opposite of __skb_queue_tail)
   |for (ptype = first_protocol; ptype &lt; max_protocol; ptype++) // Determine 
      |if (skb-&gt;protocol == ptype)                               // what is the network protocol
         |ptype-&gt;func -&gt; ip_rcv // according to ''struct ip_packet_type [net/ipv4/ip_output.c]''
 
    **** NOW WE KNOW THAT PACKET IS IP ****
         |ip_rcv
            |NF_HOOK (ip_rcv_finish)
               |ip_route_input // search from routing table to determine function to call
                  |skb-&gt;dst-&gt;input -&gt; ip_local_deliver // according to previous routing table check, destination is local machine
                     |ip_defrag // reassembles IP fragments
                        |NF_HOOK (ip_local_deliver_finish)
                           |ipprot-&gt;handler -&gt; tcp_v4_rcv // according to ''tcp_protocol [net/ipv4/protocol.c]''
 
     **** NOW WE KNOW THAT PACKET IS TCP ****
                           |tcp_v4_rcv   
                              |sk = __tcp_v4_lookup 
                              |tcp_v4_do_rcv
                                 |switch(sk-&gt;state) 

     *** Packet can be sent to the task which uses relative socket ***
                                 |case TCP_ESTABLISHED:
                                    |tcp_rcv_established
                                       |__skb_queue_tail // enqueue packet to socket
                                       |sk-&gt;data_ready -&gt; sock_def_readable 
                                          |wake_up_interruptible
                                

     *** Packet still has to go through the 3-way TCP handshake ***
                                 |case TCP_LISTEN:
                                    |tcp_v4_hnd_req
                                       |tcp_v4_search_req
                                       |tcp_check_req
                                          |syn_recv_sock -&gt; tcp_v4_syn_recv_sock
                                       |__tcp_v4_lookup_established
                                 |tcp_rcv_state_process

                    *** 3-Way TCP Handshake ***
                                    |switch(sk-&gt;state)
                                    |case TCP_LISTEN: // We received SYN
                                       |conn_request -&gt; tcp_v4_conn_request
                                          |tcp_v4_send_synack // Send SYN + ACK
                                             |tcp_v4_synq_add // set SYN state
                                    |case TCP_SYN_SENT: // we received SYN + ACK
                                       |tcp_rcv_synsent_state_process
                                          |tcp_set_state(TCP_ESTABLISHED)
                                             |tcp_send_ack
                                                |tcp_transmit_skb
                                                   |queue_xmit -&gt; ip_queue_xmit
                                                      |ip_queue_xmit2
                                                         |skb-&gt;dst-&gt;output
                                    |case TCP_SYN_RECV: // We received ACK
                                       |if (ACK)
                                          |tcp_set_state(TCP_ESTABLISHED)
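                              </verb></p><p>As a hedged illustration of the protocol lookup at the top of the call tree above, the following user-space C sketch selects a layer 3 handler through a function pointer. The names frame, packet_type, handlers, ip_handler, arp_handler and dispatch are hypothetical stand-ins chosen for this example, not kernel code, and the real kernel keeps registered packet types in lists rather than in a fixed array. The program exits silently with status 0 if the assertions hold.</p><p><verb>
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Hypothetical, simplified stand-in for the kernel's socket buffer. */
struct frame {
    unsigned short protocol;          /* layer 3 protocol id carried by the frame */
};

/* Hypothetical, simplified stand-in for a registered packet type. */
struct packet_type {
    unsigned short type;              /* protocol this handler accepts */
    int (*func)(struct frame *f);     /* layer 3 receive routine */
};

static int ip_handler(struct frame *f)  { (void)f; return 4; }
static int arp_handler(struct frame *f) { (void)f; return 1; }

/* Assumption: a fixed table instead of the kernel's registration lists. */
static struct packet_type handlers[] = {
    { 0x0800, ip_handler },           /* 0x0800 is the Ethernet type for IPv4 */
    { 0x0806, arp_handler },          /* 0x0806 is the Ethernet type for ARP */
};

/* Walk the table and call the first handler whose type matches the frame. */
static int dispatch(struct frame *f)
{
    size_t i;

    for (i = 0; i &lt; sizeof(handlers) / sizeof(handlers[0]); i++)
        if (handlers[i].type == f-&gt;protocol)
            return handlers[i].func(f);
    return -1;                        /* no handler registered: drop the frame */
}

int main(void)
{
    struct frame ip = { 0x0800 };
    struct frame unknown = { 0x1234 };

    assert(dispatch(&amp;ip) == 4);
    assert(dispatch(&amp;unknown) == -1);
    return 0;
}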
                              </verb></p><p>Functions can be found under:</p><p><itemize><item>net_rx_action [net/core/dev.c]</item><item>__skb_dequeue [include/linux/skbuff.h]</item><item>ip_rcv [net/ipv4/ip_input.c]</item><item>NF_HOOK -&gt; nf_hook_slow [net/core/netfilter.c]</item><item>ip_rcv_finish [net/ipv4/ip_input.c]</item><item>ip_route_input [net/ipv4/route.c]</item><item>ip_local_deliver [net/ipv4/ip_input.c]</item><item>ip_defrag [net/ipv4/ip_fragment.c]</item><item>ip_local_deliver_finish [net/ipv4/ip_input.c]</item><item>tcp_v4_rcv [net/ipv4/tcp_ipv4.c]</item><item>__tcp_v4_lookup</item><item>tcp_v4_do_rcv</item><item>tcp_rcv_established [net/ipv4/tcp_input.c]</item><item>__skb_queue_tail [include/linux/skbuff.h]</item><item>sock_def_readable [net/core/sock.c]</item><item>wake_up_interruptible [include/linux/sched.h]</item><item>tcp_v4_hnd_req [net/ipv4/tcp_ipv4.c]</item><item>tcp_v4_search_req</item><item>tcp_check_req</item><item>tcp_v4_syn_recv_sock</item><item>__tcp_v4_lookup_established</item><item>tcp_rcv_state_process [net/ipv4/tcp_input.c]</item><item>tcp_v4_conn_request [net/ipv4/tcp_ipv4.c]</item><item>tcp_v4_send_synack</item><item>tcp_v4_synq_add</item><item>tcp_rcv_synsent_state_process [net/ipv4/tcp_input.c]</item><item>tcp_set_state [include/net/tcp.h]</item><item>tcp_send_ack [net/ipv4/tcp_output.c]</item></itemize></p><p>Description:</p><p><itemize><item>First we determine the protocol type (IP, then TCP).</item><item>NF_HOOK (function) is a wrapper routine that first passes the packet through the network
filter (for example, a firewall), then calls ''function''.</item><item>Then we handle the 3-way TCP handshake, which consists of:</item></itemize></p><p><verb>SERVER (LISTENING)                       CLIENT (CONNECTING)
                           SYN 
                   &lt;-------------------
 
 
                        SYN + ACK
                   -------------------&gt;

 
                           ACK 
                   &lt;-------------------

                    3-Way TCP handshake
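</verb></p><p>The three steps in the diagram above can be sketched as a small state machine. This is a simplified user-space illustration under stated assumptions: the enum names and the on_receive function are hypothetical, only four states are modelled, and the listening side is folded into a single per-connection state, whereas the kernel tracks pending requests separately from the listening socket.</p><p><verb>
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Hypothetical reduced state set; the kernel defines more TCP states. */
enum state { LISTEN, SYN_SENT, SYN_RECV, ESTABLISHED };
enum segment { SYN, SYN_ACK, ACK };

/* Return the next state for a received segment; unexpected segments are ignored. */
static enum state on_receive(enum state s, enum segment seg)
{
    switch (s) {
    case LISTEN:   return seg == SYN     ? SYN_RECV    : s; /* server got SYN, answers SYN + ACK */
    case SYN_SENT: return seg == SYN_ACK ? ESTABLISHED : s; /* client got SYN + ACK, answers ACK */
    case SYN_RECV: return seg == ACK     ? ESTABLISHED : s; /* server got the final ACK */
    default:       return s;
    }
}

int main(void)
{
    enum state server = LISTEN;
    enum state client = SYN_SENT;            /* the client has already sent its SYN */

    server = on_receive(server, SYN);        /* step 1: SYN reaches the server */
    assert(server == SYN_RECV);
    client = on_receive(client, SYN_ACK);    /* step 2: SYN + ACK reaches the client */
    assert(client == ESTABLISHED);
    server = on_receive(server, ACK);        /* step 3: ACK reaches the server */
    assert(server == ESTABLISHED);
    return 0;
}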
</verb></p><p><itemize><item>In the end, we only have to call "tcp_rcv_established [net/ipv4/tcp_input.c]",
which gives the packet to the user socket and wakes up the waiting task.</item></itemize></p></sect2></sect1></sect><sect><heading>Linux File System</heading><p>TODO</p></sect><sect><heading>Useful Tips</heading><sect1><heading>Stack and Heap</heading><sect2><heading>Overview</heading><p>Here we see how the "stack" and "heap" are laid out in memory.</p></sect2><sect2><heading>Memory allocation</heading><p><verb>
FF..        |                 | &lt;-- bottom of the stack
       /|\  |                 |   | 
 higher |   |                 |   |   stack
 values |   |                 |  \|/  growing
            |                 |
XX..        |                 | &lt;-- top of the stack [Stack Pointer]
            |                 |
            |                 |
            |                 |
00..        |_________________| &lt;-- end of stack [Stack Segment]
                 
                   Stack
</verb></p><p>Memory addresses start at 00.. (which is also where the Stack Segment
begins) and grow toward FF..</p><p>XX.. is the current value of the Stack Pointer.</p><p>The stack is used by functions for:</p><p><enum><item>function parameters</item><item>local variables</item><item>return address</item></enum></p><p>For example, for a classical function:</p><p><verb>
 |int foo_function (parameter_1, parameter_2, ..., parameter_n) {
    |variable_1 declaration;
    |variable_2 declaration;
      ..
    |variable_n declaration;
   
    |// Body function
    |dynamic variable_1 declaration;
    |dynamic variable_2 declaration;
     ..
    |dynamic variable_n declaration;
   
    |// Code is inside Code Segment, not Data/Stack segment!
    
    |return (ret-type) value; // often it is inside some register, for i386 eax register is used.
 |}
we have

          |                       |
          | 1. parameter_1 pushed | \
    S     | 2. parameter_2 pushed |  | Before 
    T     | ...................   |  | the calling
    A     | n. parameter_n pushed | /
    C     | ** Return address **  | -- Calling
    K     | 1. local variable_1   | \ 
          | 2. local variable_2   |  | After
          | .................     |  | the calling
          | n. local variable_n   | /
          |                       | 
         ...                     ...   Free
         ...                     ...   stack
          |                       |
    H     | n. dynamic variable_n | \
    E     | ...................   |  | Allocated by
    A     | 2. dynamic variable_2 |  | malloc &amp; kmalloc
    P     | 1. dynamic variable_1 | /
          |_______________________|
        
            Typical stack usage
 
Note: variables order can be different depending on hardware architecture.
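</verb></p><p>A short user-space C sketch of the difference between the two areas follows. It is only an illustration: the function name heap_copy and the buffer size are assumptions made for this example. The local buffer lives in the function's stack frame and disappears on return, while the block obtained from malloc stays valid until it is freed.</p><p><verb>
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* Returns a heap copy of text (truncated to 31 characters); the caller must free() it. */
static char *heap_copy(const char *text)
{
    char local[32];                          /* local variable: part of this call's stack frame */
    char *dynamic;

    strncpy(local, text, sizeof(local) - 1);
    local[sizeof(local) - 1] = '\0';

    dynamic = malloc(strlen(local) + 1);     /* dynamic variable: allocated on the heap */
    if (dynamic != NULL)
        strcpy(dynamic, local);
    return dynamic;                          /* 'local' is gone after return, 'dynamic' survives */
}

int main(void)
{
    char *copy = heap_copy("kernel");

    assert(copy != NULL);
    assert(strcmp(copy, "kernel") == 0);
    free(copy);
    return 0;
}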
</verb></p></sect2></sect1><sect1><heading>Application vs Process</heading><sect2><heading>Base definition</heading><p>We have to distinguish two concepts:</p><p><itemize><item>Application: the useful code we want to execute</item><item>Process: the IMAGE in memory of the application (it depends on the
memory strategy used, segmentation and/or paging).</item></itemize></p><p>A Process is often also called a Task or Thread.</p></sect2></sect1><sect1><heading>Locks</heading><sect2><heading>Overview</heading><p>There are 2 kinds of locks:</p><p><enum><item>intraCPU</item><item>interCPU</item></enum></p></sect2></sect1><sect1><heading>Copy-on-write</heading><p>Copy-on-write is a mechanism used to reduce memory usage. It postpones
memory allocation until the memory is really needed.</p><p>For example, when a task executes the "fork()" system call (to create another
task), the child still uses the same memory pages as the parent, in read-only mode.
When the new task WRITES to one of these pages, it causes an exception; the
page is then copied and the copy is marked "rw" (read, write).</p><p><verb> 
1-) Page X is shared between Task Parent and Task Child
 Task Parent
 |         | RW Access  ______
 |         |----------&gt;|Page X|    
 |_________|           |______|
                          /|\
                           |
 Task Child                | 
 |         | R Access      |  
 |         |----------------                
 |_________| 
 
 
2-) Write request from Task Child
 Task Parent
 |         | RW Access  ______
 |         |----------&gt;|Page X|    
 |_________|           |______|
                          /|\
                           |
 Task Child                | 
 |         | W Access      |  
 |         |----------------                
 |_________| 
 
 
3-) Final Configuration: Task Parent and Task Child have an independent copy of the Page, X and Y
 Task Parent
 |         | RW Access  ______
 |         |----------&gt;|Page X|    
 |_________|           |______|
              
              
 Task Child
 |         | RW Access  ______
 |         |----------&gt;|Page Y|    
 |_________|           |______|</verb></p></sect1></sect><sect><heading>80386 specific details</heading><sect1><heading>Boot procedure</heading><p><verb>bbootsect.s [arch/i386/boot]
setup.S (+video.S) 
head.S (+misc.c) [arch/i386/boot/compressed]
start_kernel [init/main.c]</verb></p></sect1><sect1><heading>80386 (and more) Descriptors</heading><sect2><heading>Overview</heading><p>Descriptors are data structures used by Intel i386+ microprocessors to virtualize
memory.</p></sect2><sect2><heading>Kind of descriptors</heading><p><itemize><item>GDT (Global Descriptor Table)</item><item>LDT (Local Descriptor Table)</item><item>IDT (Interrupt Descriptor Table)</item></itemize></p></sect2></sect1></sect><sect><heading>IRQ </heading><sect1><heading>Overview</heading><p>An IRQ is an asynchronous signal sent to the microprocessor to notify it that a requested
operation has completed.</p></sect1><sect1><heading>Interaction schema</heading><p><verb>                                 |&lt;--&gt;  IRQ(0) [Timer]
                                 |&lt;--&gt;  IRQ(1) [Device 1]
                                 | ..
                                 |&lt;--&gt;  IRQ(n) [Device n]
    _____________________________| 
     /|\      /|\          /|\
      |        |            |
     \|/      \|/          \|/
 
    Task(1)  Task(2) ..   Task(N)
              
             
             IRQ - Tasks Interaction Schema
  
</verb></p><sect2><heading>What happens?</heading><p>A typical O.S. uses many IRQ signals to interrupt normal process execution
and do some housekeeping work. So:</p><p><enum><item>IRQ(i) occurs and Task(j) is interrupted</item><item>IRQ(i)_handler is executed</item><item>control returns to the interrupted Task(j)</item></enum></p><p>Under Linux, when an IRQ arrives, first the IRQ wrapper routine (named "interrupt0x??")
is called, then the "official" IRQ(i)_handler is executed. This allows some
duties like timeslice preemption.</p></sect2></sect1></sect><sect><heading>Utility functions</heading><sect1><heading>list_entry [include/linux/list.h]</heading><p>Definition:</p><p><verb>#define list_entry(ptr, type, member) \
((type *)((char *)(ptr)-(unsigned long)(&amp;((type *)0)-&gt;member)))</verb></p><p>Meaning:</p><p>The "list_entry" macro is used to retrieve a pointer to the enclosing (parent) struct, given
only a pointer to one of its members.</p><p>Example:</p><p><verb>struct __wait_queue {
   unsigned int flags; 
   struct task_struct * task; 
   struct list_head task_list;
};
struct list_head { 
   struct list_head *next, *prev; 
};

// and with type definition:
typedef struct __wait_queue wait_queue_t;

// we'll have
wait_queue_t *out = list_entry(tmp, wait_queue_t, task_list);

// where tmp points to a list_head</verb></p><p>So, in this case, by means of the *tmp pointer [list_head] we retrieve
an *out pointer [wait_queue_t].</p><p><verb>
 ____________ &lt;---- *out [we calculate that]
|flags       |             /|\
|task *--&gt;   |              |
|task_list   |&lt;----    list_entry
|  prev * --&gt;|    |         |
|  next * --&gt;|    |         |
|____________|    ----- *tmp [we have this]
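 </verb></p><p>The same pointer arithmetic can be tried in user space. The sketch below is an illustration rather than the kernel's code: it assumes a cut-down element type, here called wait_entry, in which the task pointer is replaced by a plain integer, and it writes the macro with the standard offsetof() instead of the null-pointer cast shown above.</p><p><verb>
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical cut-down wait queue element (an int stands in for the task pointer). */
struct wait_entry {
    unsigned int flags;
    int task_id;
    struct list_head task_list;
};

/* Same idea as the kernel macro: subtract the member's offset from its address. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
    struct wait_entry w = { 1, 42, { NULL, NULL } };
    struct list_head *tmp = &amp;w.task_list;   /* all we have is the embedded member */
    struct wait_entry *out = list_entry(tmp, struct wait_entry, task_list);

    assert(out == &amp;w);                      /* we recovered the enclosing struct */
    assert(out-&gt;task_id == 42);
    return 0;
}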
 </verb></p></sect1><sect1><heading>Sleep </heading><sect2><heading>Sleep code</heading><p>Files: </p><p><itemize><item>kernel/sched.c</item><item>include/linux/sched.h</item><item>include/linux/wait.h</item><item>include/linux/list.h</item></itemize></p><p>Functions:</p><p><itemize><item>interruptible_sleep_on</item><item>interruptible_sleep_on_timeout</item><item>sleep_on</item><item>sleep_on_timeout</item></itemize></p><p>Called functions:</p><p><itemize><item>init_waitqueue_entry</item><item>__add_wait_queue</item><item>list_add</item><item>__list_add</item><item>__remove_wait_queue</item></itemize></p><p>InterCallings Analysis:</p><p><verb>|sleep_on
   |init_waitqueue_entry  --
   |__add_wait_queue        |   enqueuing request to resource list
      |list_add              |
         |__list_add        -- 
   |schedule              ---     waiting for request to be executed
      |__remove_wait_queue --   
      |list_del              |   dequeuing request from resource list
         |__list_del        -- 
 
</verb></p><p>Description:</p><p>Under Linux, each resource (ideally an object shared between many users
and many processes) has a queue to manage ALL tasks requesting it.</p><p>This queue is called a "wait queue", and it consists of many items, each of which we'll call
a "wait queue element":</p><p><verb>***   wait queue structure [include/linux/wait.h]  ***


struct __wait_queue {
   unsigned int flags; 
   struct task_struct * task; 
   struct list_head task_list;
};
struct list_head { 
   struct list_head *next, *prev; 
};</verb></p><p>Graphic working: </p><p><verb>        ***  wait queue element  ***

                             /|\
                              |
       &lt;--[prev *, flags, task *, next *]--&gt;
 
                     


                 ***  wait queue list ***  
 
          /|\           /|\           /|\                /|\
           |             |             |                  |
--&gt; &lt;--[task1]--&gt; &lt;--[task2]--&gt; &lt;--[task3]--&gt; .... &lt;--[taskN]--&gt; &lt;--
|                                                                  |
|__________________________________________________________________|
          

           
              ***   wait queue head ***

       task1 &lt;--[prev *, lock, next *]--&gt; taskN
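 </verb></p><p>The circular layout drawn above can be reproduced with a few lines of user-space C. This is a sketch under stated assumptions: the helper names link_between and unlink_between are hypothetical, and the nodes are bare list_head structures rather than full wait queue elements. It shows that the head's next pointer reaches the first element, its prev pointer reaches the last one, and the last element links back to the head.</p><p><verb>
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical helper: insert 'item' between two consecutive entries. */
static void link_between(struct list_head *item,
                         struct list_head *prev,
                         struct list_head *next)
{
    next-&gt;prev = item;
    item-&gt;next = next;
    item-&gt;prev = prev;
    prev-&gt;next = item;
}

/* Hypothetical helper: unlink whatever sits between 'prev' and 'next'. */
static void unlink_between(struct list_head *prev, struct list_head *next)
{
    next-&gt;prev = prev;
    prev-&gt;next = next;
}

int main(void)
{
    struct list_head head = { &amp;head, &amp;head };   /* empty list: the head points to itself */
    struct list_head task1, task2;

    link_between(&amp;task1, head.prev, &amp;head);     /* append task1 at the tail */
    link_between(&amp;task2, head.prev, &amp;head);     /* append task2 at the tail */
    assert(head.next == &amp;task1);                    /* next * reaches the first element */
    assert(head.prev == &amp;task2);                    /* prev * reaches the last element */
    assert(task2.next == &amp;head);                    /* the list is circular */

    unlink_between(task1.prev, task1.next);             /* remove task1 */
    assert(head.next == &amp;task2);
    return 0;
}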
   
 </verb></p><p>The "wait queue head" points to the first (with next *) and last (with prev *) elements
of the "wait queue list".</p><p>When a new element has to be added, "__add_wait_queue" [include/linux/wait.h]
is called, after which the generic routine "list_add" [include/linux/list.h]
will be executed:</p><p><verb>***   function __list_add [include/linux/list.h]  ***

// classic double link list insert
static __inline__ void __list_add (struct list_head * new,
                                   struct list_head * prev,
                                   struct list_head * next) { 
   next-&gt;prev = new; 
   new-&gt;next = next; 
   new-&gt;prev = prev; 
   prev-&gt;next = new; 
}</verb></p><p>To complete the description, we also look at the "__list_del" [include/linux/list.h]
function, called by "list_del" [include/linux/list.h] inside "__remove_wait_queue"
[include/linux/wait.h]:</p><p><verb>***   function __list_del [include/linux/list.h]  ***


// classic double link list delete
static __inline__ void __list_del (struct list_head * prev, struct list_head * next) { 
   next-&gt;prev = prev; 
   prev-&gt;next = next; 
}</verb></p></sect2><sect2><heading>Stack consideration</heading><p>A typical list (or queue) is usually managed by allocating it on the Heap
(see Chapter 10 for the Heap and Stack definitions and for where variables are allocated).
Here, by contrast, we allocate the Wait Queue data in a local variable
(on the Stack); the function is then suspended by the scheduler and, on returning
from scheduling, the local variable is discarded.</p><p><verb>  new task &lt;----|          task1 &lt;------|          task2 &lt;------|
                |                       |                       |
                |                       |                       | 
|..........|    |       |..........|    |       |..........|    | 
|wait.flags|    |       |wait.flags|    |       |wait.flags|    |
|wait.task_|____|       |wait.task_|____|       |wait.task_|____|   
|wait.prev |--&gt;         |wait.prev |--&gt;         |wait.prev |--&gt;
|wait.next |--&gt;         |wait.next |--&gt;         |wait.next |--&gt;   
|..        |            |..        |            |..        |    
|schedule()|            |schedule()|            |schedule()|     
|..........|            |..........|            |..........|    
|__________|            |__________|            |__________|     
 
   Stack                   Stack                   Stack</verb></p></sect2></sect1></sect><sect><heading>Static variables</heading><sect1><heading>Overview</heading><p>Linux is written in the ''C'' language and, like every C application, has:</p><p><enum><item>Local variables</item><item>Module variables (inside the source file and relative only to that module)</item><item>Global/Static variables present in only 1 copy (the same for all modules)</item></enum></p><p>When a Static variable is modified by a module, all other modules will
see the new value.</p><p>Static variables under Linux are very important, because they are the only
way to add new support to the kernel: they typically are pointers to the head
of a list of registered elements, which can be:</p><p><itemize><item>added</item><item>deleted</item><item>maybe modified</item></itemize></p><p><verb>                           _______      _______      _______
Global variable  -------&gt; |Item(1)| -&gt; |Item(2)| -&gt; |Item(3)|  ..
                          |_______|    |_______|    |_______|</verb></p></sect1><sect1><heading>Main variables</heading><sect2><heading>Current</heading><p><verb>                           ________________
Current ----------------&gt; | Actual process |
                          |________________|</verb></p><p>Current points to the ''task_struct'' structure, which contains all data about
a process, such as:</p><p><itemize><item>pid, name, state, counter, scheduling policy</item><item>pointers to many data structures like: files, vfs, other processes, signals...</item></itemize></p><p>Current is not a real variable; it is defined as:</p><p><verb>static inline struct task_struct * get_current(void) { 
   struct task_struct *current; 
   __asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL)); 
   return current; 
}
#define current get_current()</verb></p><p>The lines above take the value of the ''esp'' register (stack pointer), clear its low 13 bits
(the ~8191 mask) and make the result available like a variable, from which we can reach our task_struct structure.</p><p>From the ''current'' element we can directly access the kernel data structures of any other process (ready,
stopped or in any other state), for example to change its
STATE (as an I/O driver does), PID, presence in the ready list or blocked list,
etc.</p></sect2><sect2><heading>Registered filesystems</heading><p><verb>                       ______      _______      ______
file_systems  ------&gt; | ext2 | -&gt; | msdos | -&gt; | ntfs |
 [fs/super.c]         |______|    |_______|    |______|</verb></p><p>When you use a command like ''modprobe some_fs'' you add a new entry
to the file systems list, while removing the module (by using ''rmmod'') deletes it.</p></sect2><sect2><heading>Mounted filesystems</heading><p><verb>                        ______      _______      ______
mount_hash_table  ----&gt;|   /  | -&gt; | /usr  | -&gt; | /var |
[fs/namespace.c]       |______|    |_______|    |______|</verb></p><p>When you use the ''mount'' command to add a fs, the new entry is inserted
in the list, while a ''umount'' command deletes the entry.</p></sect2><sect2><heading>Registered Network Packet Type</heading><p><verb>                        ______      _______      ______ 
     ptype_all  ------&gt;|  ip  | -&gt; |  x25  | -&gt; | ipv6 |
[net/core/dev.c]       |______|    |_______|    |______|</verb></p><p>For example, if you add support for IPv6 (by loading the relevant module) a new
entry will be added to the list.</p></sect2><sect2><heading>Registered Network Internet Protocol</heading><p><verb>                          ______      _______      _______ 
inet_protocol_base -----&gt;| icmp | -&gt; |  tcp  | -&gt; |  udp  |
[net/ipv4/protocol.c]    |______|    |_______|    |_______|</verb></p><p>Other packet types also have many internal protocols in their own lists (like
IPv6).</p><p><verb>                          ______      _______      _______ 
inet6_protos -----------&gt;|icmpv6| -&gt; | tcpv6 | -&gt; | udpv6 |
[net/ipv6/protocol.c]    |______|    |_______|    |_______|</verb></p></sect2><sect2><heading>Registered Network Device</heading><p><verb>                          ______      _______      _______ 
dev_base ---------------&gt;|  lo  | -&gt; |  eth0 | -&gt; |  ppp0 |
[drivers/net/Space.c]    |______|    |_______|    |_______|</verb></p></sect2><sect2><heading>Registered Char Device</heading><p><verb>                          ______      _______      ________ 
chrdevs ----------------&gt;|  lp  | -&gt; | keyb  | -&gt; | serial |
[fs/devices.c]           |______|    |_______|    |________|</verb></p><p>''chrdevs'' is not a pointer to a real list; it is a standard vector (array).</p></sect2><sect2><heading>Registered Block Device</heading><p><verb>                          ______      ______      ________ 
bdev_hashtable ---------&gt;|  fd  | -&gt; |  hd  | -&gt; |  scsi  |
[fs/block_dev.c]         |______|    |______|    |________|</verb></p><p>''bdev_hashtable'' is a hash vector.</p></sect2></sect1></sect><sect><heading>Glossary</heading></sect><sect><heading>Links</heading><p><url url="http://www.kernel.org" name="Official Linux kernels and patches download site"></url></p><p><url url="http://jungla.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html" name="Great documentation about Linux Kernel"></url></p><p><url url="http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html" name="Official Kernel Mailing list"></url></p><p><url url="http://www.linuxdoc.org/guides.html" name="Linux Documentation Project Guides"></url></p></sect></article></linuxdoc>

