
system_call

This article covers various Linux system calls in C, providing a brief explanation and example code for each topic. This overview should help you understand how to interact with Linux system resources directly from C programs.

1. Linux System Calls

System calls are the interface between user-space applications and the Linux kernel. They allow your programs to request services from the kernel, such as file operations, process management, and inter-process communication.
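As a minimal sketch of what "requesting a service from the kernel" looks like, the function below issues a system call by number through the generic syscall() wrapper instead of the usual getpid() library function. The name raw_getpid is a hypothetical helper chosen for this illustration; most programs simply call the dedicated wrapper.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our process ID by system call number,
 * bypassing the dedicated getpid() wrapper. */
pid_t raw_getpid(void) {
    return (pid_t)syscall(SYS_getpid);
}
```

Both routes end in the same kernel entry point, so raw_getpid() and getpid() should report the same value.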

2. Using `strace`

strace is a diagnostic, debugging, and instructional utility for Linux and other Unix-like operating systems. It traces the system calls a process makes and the signals it receives. strace provides insight into how applications interact with the kernel, which can be invaluable for debugging, performance tuning, and understanding the inner workings of programs.

2.1 Basic Usage

To start tracing a program, you can use strace followed by the command you wish to run:

strace ls

This command traces the ls command, showing all system calls made by ls.

2.2 Tracing a Running Process

You can attach strace to an already running process using the -p option followed by the process ID (PID):

strace -p 1234

Replace 1234 with the actual PID of the process you want to trace.

2.3 Filtering Traced Output

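strace can generate a lot of output, making it hard to find relevant information. You can limit the output to certain system calls using the -e option. Note that on recent systems the C library usually implements open() with the openat system call, so you may need to list openat as well to see files being opened. For example, to trace only open and close system calls, you can use: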

strace -e trace=open,close <command>

2.4 Writing Output to a File

To save the output of strace to a file, use the -o option:

strace -o output.txt <command>

This command runs <command> and writes the trace output to output.txt.

2.5 Tracing Specific Events

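Besides individual system call names, the -e trace= option accepts classes of related calls: for example, trace=network for network-related calls, trace=process for process management, and trace=file for calls that take a file name as an argument (newer strace versions also accept the spellings %network, %process, and %file). strace also reports the signals a process receives. Check the man page (man strace) for a full list of what can be traced.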

2.6 Tracing Child Processes

By default, strace traces only the main process. To trace child processes created by fork or similar system calls, use the -f option:

strace -f <command>

2.7 Useful Options

-c: Provides a summary of system calls made by the program, including how many times each was called and the time spent in each call.
-d: Debug mode for strace itself, useful for diagnosing problems with strace.
-t: Prefix each line of the strace output with the time of day.
-T: Show the time spent in each system call.

2.8 Example Use Case

Suppose you have a program that's failing to open a configuration file, but you're not sure why. You can use strace to trace file operations:

strace -e trace=file <program>

This command can help you identify attempts to open files, showing both successful and failed operations, along with the paths being accessed. This can quickly lead you to the problem, such as attempting to open a non-existent file or lacking the necessary permissions.

strace is a versatile tool that, once mastered, becomes an indispensable part of the Linux programmer's and system administrator's toolkit. Its ability to reveal what a program is doing "under the hood" makes it an excellent tool for learning, debugging, and optimizing code.

3. `access`: Test the Permissions of a File

The access system call checks whether the calling process can access a file in a particular way: specifically, whether the file exists and whether it can be read, written, or executed. The function is declared in <unistd.h> and its prototype looks like this:

int access(const char *pathname, int mode);

pathname: The path to the file you want to check.
mode: Either F_OK, or a mask consisting of one or more of R_OK, W_OK, and X_OK ORed together:
R_OK: Test for read permission.
W_OK: Test for write permission.
X_OK: Test for execute permission.
F_OK: Test for the existence of the file.

When using access, you're asking the question, "Does the user running this program have the specified access to the file?" This check is based on the real UID (user ID) and GID (group ID) of the process, rather than the effective IDs. This matters in programs that run with elevated privileges (e.g., setuid programs), where the real and effective IDs can differ. access returns 0 if every requested check passes; otherwise it returns -1 and sets errno.

The following example checks for the existence of a file and then tests each permission in turn:

#include <unistd.h>
#include <stdio.h>

int main() {
    const char *filepath = "example.txt";

    // Check for the existence of the file
    if (access(filepath, F_OK) == 0) {
        printf("The file exists.\n");

        // Check for read permission
        if (access(filepath, R_OK) == 0) {
            printf("Read permission granted.\n");
        } else {
            printf("Read permission denied.\n");
        }

        // Check for write permission
        if (access(filepath, W_OK) == 0) {
            printf("Write permission granted.\n");
        } else {
            printf("Write permission denied.\n");
        }

        // Check for execute permission
        if (access(filepath, X_OK) == 0) {
            printf("Execute permission granted.\n");
        } else {
            printf("Execute permission denied.\n");
        }
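    } else {
        // F_OK fails if the file is missing, but also for other reasons
        // (e.g. no search permission on a parent directory), so report errno.
        perror("access");
    }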

    return 0;
}

This code demonstrates how to use access to check a file's permissions. It first checks that the file exists using F_OK, then checks read, write, and execute permission in turn. Treat the result as advisory: the file or its permissions can change between the access call and a later open or exec, so a program must still handle errors from those calls rather than rely on the earlier check.
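Since a check made with access can be out of date by the time the file is used, a common alternative is to attempt the operation directly and interpret errno on failure. The sketch below shows that pattern for opening a file; open_for_reading is a hypothetical helper name, and the messages it prints are illustrative.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Try to open `path` for reading. Returns a file descriptor on success,
 * or -1 after printing why the open failed. */
int open_for_reading(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        if (errno == ENOENT)
            fprintf(stderr, "%s: no such file\n", path);
        else if (errno == EACCES)
            fprintf(stderr, "%s: permission denied\n", path);
        else
            fprintf(stderr, "%s: %s\n", path, strerror(errno));
    }
    return fd;
}
```

The caller owns the returned descriptor and should close it when done.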

4. `fcntl`: Locks and File Operations

fcntl performs a range of operations on a file descriptor that's already open, such as reading or changing its flags and placing locks on the file.
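As one illustration of changing a property of an open descriptor, the sketch below reads the file status flags with F_GETFL and writes them back with O_NONBLOCK added via F_SETFL. The helper name set_nonblocking is an assumption made for this example.

```c
#include <fcntl.h>

/* Add O_NONBLOCK to the status flags of an open descriptor.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}
```

Reading the current flags first matters: passing only O_NONBLOCK to F_SETFL would clear other settable flags such as O_APPEND.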

4.1 Understanding File Locks with `fcntl`

File locks are mechanisms that allow synchronization between different processes to prevent them from concurrently modifying a file, potentially leading to data corruption. In Unix-like systems, fcntl can be used to apply advisory file locks. These locks are "advisory" because they do not prevent other processes from accessing the file unless those processes also use and check for the same locks. It's a cooperative mechanism, not enforced by the system.

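The l_type field of struct flock selects the kind of lock:
F_WRLCK: Requests a write (exclusive) lock. No other process can hold a write or read lock on the same region.
F_RDLCK: Requests a read (shared) lock. Other processes can also hold a read lock on the region, but not a write lock.
F_UNLCK: Releases a lock.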

The F_SETLKW command tells fcntl to set the lock and block until it can be acquired. F_SETLK, by contrast, returns immediately: if another process holds a conflicting lock, it returns -1 with errno set to EACCES or EAGAIN.

4.2 Example Program: File Locking with `fcntl`

This example demonstrates how to use fcntl to place a write lock on a file specified by the command-line argument. The program waits for the user to press Enter before releasing the lock.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Usage: %s <file>\n", argv[0]);
        return 1;
    }

    char* file = argv[1];
    int fd;

    printf("Opening %s\n", file);
    // Open a file descriptor.
    fd = open(file, O_WRONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    printf("Locking\n");
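    // Initialize the flock structure. Zeroing it sets l_whence = SEEK_SET (0),
    // l_start = 0, and l_len = 0, which together mean "the whole file".
    struct flock lock;
    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK; // Request a write lock
    // Place a write lock on the file, blocking until it becomes available.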
    if (fcntl(fd, F_SETLKW, &lock) == -1) {
        perror("fcntl");
        close(fd);
        return 1;
    }

    printf("Locked; press Enter to unlock... ");
    // Wait for the user to press Enter.
    getchar();

    printf("Unlocking\n");
    // Release the lock.
    lock.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLKW, &lock) == -1) {
        perror("fcntl");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}

4.3 How It Works

The program first checks for the correct usage, requiring a filename as an argument.
It then attempts to open the specified file in write-only mode.
If successful, it initializes a struct flock and requests a write lock using F_SETLKW. This call will block if the lock cannot be immediately acquired, waiting until the lock is available.
The program waits for the user to press Enter. Upon receiving input, it sets the lock type to F_UNLCK to release the lock and then closes the file descriptor.

This example demonstrates using file locks in C to coordinate access to a file between different processes. Handle potential errors, such as the file failing to open or the lock request failing, so that the program behaves predictably. Note that fcntl locks belong to the process: the kernel releases them when the process exits or closes any file descriptor that refers to the locked file.

5. `fsync` and `fdatasync`: Purging Disk Buffers

To ensure data integrity, especially in scenarios where your application maintains a log, database, or any other form of critical data storage, it's essential to have a mechanism that guarantees the data has been physically written to the storage device. In Linux, this is where fsync and fdatasync system calls come into play.

5.1 How `fsync` and `fdatasync` Work

When your application writes data to a file, the data might initially be placed in a buffer (in-memory cache) by the kernel to improve performance. However, buffered data can be lost if the system crashes or loses power before the buffer is flushed (i.e., written out to the disk). To mitigate this risk, you can use fsync or fdatasync.

fsync(int fd): Synchronizes a file's in-memory state with that on the physical disk to ensure that all modifications are written out. fsync flushes both data and metadata (like file modification times).
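fdatasync(int fd): Similar to fsync, but it skips metadata that is not needed to read the data back correctly. For example, a changed modification time alone is not flushed, whereas a changed file size is. This can reduce disk activity for applications that do not need all metadata synchronized immediately.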
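A minimal sketch of the fdatasync variant: write a record, then flush the data without forcing out unrelated metadata. The helper name append_and_datasync is hypothetical, and a short write is simply treated as an error here to keep the example brief.

```c
#include <unistd.h>

/* Write `len` bytes to `fd` and flush the data to stable storage.
 * Returns 0 on success, -1 on error or on a short write. */
int append_and_datasync(int fd, const void *buf, size_t len) {
    ssize_t n = write(fd, buf, len);
    if (n == -1 || (size_t)n != len)
        return -1;
    return fdatasync(fd);
}
```

Swapping fdatasync for fsync in the last line gives the stricter behavior that also flushes metadata such as timestamps.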

5.2 Example: Ensuring Data Integrity in a Journaling System

The following example demonstrates how to use fsync to ensure that a journal entry is physically written to disk, thus preventing data loss in the event of a system crash or power failure. Each system call is checked for errors to make the code more robust.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

const char* journal_filename = "journal.log";

void write_journal_entry(const char* entry) {
    // Open the journal file with appropriate flags and permissions
    int fd = open(journal_filename, O_WRONLY | O_CREAT | O_APPEND, 0660);
    if (fd == -1) {
        perror("Failed to open journal file");
        return;
    }

    // Write the journal entry to the file
    if (write(fd, entry, strlen(entry)) == -1) {
        perror("Failed to write entry");
        close(fd);
        return;
    }

    // Write a newline character after the entry
    if (write(fd, "\n", 1) == -1) {
        perror("Failed to write newline");
        close(fd);
        return;
    }

    // Ensure the entry is physically written to the disk
    if (fsync(fd) == -1) {
        perror("Failed to fsync");
        close(fd);
        return;
    }

    // Close the file descriptor
    close(fd);
}

int main() {
    // Example usage
    char* entry = "Sample journal entry";
    write_journal_entry(entry);
    return 0;
}

5.3 How It Works

Open the File: The file is opened (or created if it doesn't exist) with write-only access. The O_APPEND flag ensures that each write operation appends data at the end of the file.
Write the Entry: The journal entry is written to the file followed by a newline character. Each write call is checked for errors.
Synchronize: fsync is called to flush the file's data and metadata to disk. This is crucial for ensuring the durability of the journal entry.
Close the File: Finally, the file descriptor is closed.

Error handling is crucial in file operations, especially when dealing with critical data. This example includes basic error checks after each system call to handle potential failures gracefully.
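One failure mode that a simple -1 check does not cover is a partial write: write may succeed but store fewer bytes than requested, or be interrupted by a signal. The sketch below shows one common way to handle both cases with a retry loop; write_all is a hypothetical helper name, not part of the example above.

```c
#include <errno.h>
#include <unistd.h>

/* Write exactly `len` bytes, retrying after partial writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by write). */
int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: try again */
            return -1;
        }
        buf += n;               /* skip past the bytes already written */
        len -= (size_t)n;
    }
    return 0;
}
```

A journal writer could call write_all for the entry and the newline, and then call fsync once the whole record has been handed to the kernel.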

6. `getrlimit` and `setrlimit`: Resource Limits

The getrlimit and setrlimit system calls in Linux allow processes to get and set resource limits, respectively. These limits are crucial for controlling the amount of system resources a process can consume, which can prevent individual processes from exhausting system resources, thus ensuring system stability and fairness among multiple processes.

6.1 Resource Limits Overview

Each process in a Unix-like system has associated resource limits, which are constraints on the system resources that the process may consume. Examples of such resources include the maximum number of file descriptors a process can open (RLIMIT_NOFILE), the maximum size of the process's data segment, which includes the heap (RLIMIT_DATA), and the maximum size of the process stack (RLIMIT_STACK).

6.2 Using `getrlimit` and `setrlimit`

The getrlimit function retrieves the current limits for a specified resource, while setrlimit sets new limits. The resource limits are specified by two values:

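rlim_cur: The soft limit, which is the value the kernel actually enforces for the corresponding resource.
rlim_max: The hard limit, which acts as the ceiling for the soft limit. An unprivileged process may set its soft limit to any value from 0 up to the hard limit and may lower its hard limit (irreversibly). Only privileged processes (on Linux, those with the CAP_SYS_RESOURCE capability, typically root) can raise the hard limit.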
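The relationship between the two values can be sketched as a helper that changes only the soft limit and clamps the request to the hard limit, so it never needs privileges. The name set_soft_nofile is an assumption made for this illustration.

```c
#include <sys/resource.h>

/* Set the soft RLIMIT_NOFILE to `wanted`, clamped to the hard limit.
 * The hard limit itself is left untouched.
 * Returns 0 on success, -1 on error. */
int set_soft_nofile(rlim_t wanted) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;
    if (wanted > lim.rlim_max)
        wanted = lim.rlim_max;  /* the soft limit may not exceed the hard limit */
    lim.rlim_cur = wanted;
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Because the hard limit is unchanged, a process can lower its soft limit with this helper and later raise it again.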

6.3 Example: Checking and Modifying File Descriptor Limit

This example demonstrates how to use getrlimit and setrlimit to check and then modify the maximum number of file descriptors that the current process can open.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main() {
    struct rlimit limit;

    // Get the current limit on file descriptors
    if (getrlimit(RLIMIT_NOFILE, &limit) != 0) {
        perror("getrlimit failed");
        return EXIT_FAILURE;
    }

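    // rlim_t is an unsigned type, so cast explicitly for printf
    printf("Current Limits: soft = %llu, hard = %llu\n",
           (unsigned long long)limit.rlim_cur, (unsigned long long)limit.rlim_max);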

    // Attempt to increase the soft limit to the hard limit value
    limit.rlim_cur = limit.rlim_max;

    if (setrlimit(RLIMIT_NOFILE, &limit) != 0) {
        perror("setrlimit failed");
        // Not expected here: raising the soft limit up to the hard limit
        // requires no privileges (only raising the hard limit does)
    } else {
        printf("Soft limit raised to hard limit: %llu\n", (unsigned long long)limit.rlim_cur);
    }

    // Verify the change
    if (getrlimit(RLIMIT_NOFILE, &limit) != 0) {
        perror("getrlimit failed");
        return EXIT_FAILURE;
    }

    printf("Updated Limits: soft = %llu, hard = %llu\n",
           (unsigned long long)limit.rlim_cur, (unsigned long long)limit.rlim_max);

    return EXIT_SUCCESS;
}

6.4 How It Works

Retrieve Current Limits: The program first calls getrlimit for RLIMIT_NOFILE to fetch the current soft and hard limits on the number of file descriptors.
Modify the Soft Limit: It then raises the soft limit (rlim_cur) to match the hard limit (rlim_max). This is a common way to make full use of the available resources without the privileges needed to raise the hard limit. Setting the soft limit above the hard limit fails with EINVAL, and raising the hard limit itself fails with EPERM unless the process is privileged.
Verify the Change: Finally, the program calls getrlimit again to verify that the soft limit was successfully updated.

Note that changing resource limits can have significant implications for system stability and security. Increasing limits for certain resources may allow processes to consume more system resources, potentially leading to resource exhaustion. Therefore, adjustments to resource limits should be made judiciously, with a clear understanding of the implications.

7. `getrusage`: Process Statistics

The getrusage system call is a powerful tool for monitoring the resource usage of processes in Unix-like operating systems. It provides detailed statistics about the system resources that a specific process or group of processes has consumed. This information is invaluable for performance analysis, debugging, and system monitoring.

7.1 Understanding `getrusage`

Prototype:

#include <sys/resource.h>

int getrusage(int who, struct rusage *usage);

who: Specifies which process(es) to retrieve the usage information for. Common values are:
RUSAGE_SELF: To get the resource usage of the calling process.
RUSAGE_CHILDREN: To get the resource usage of all children of the calling process that have terminated and been waited for.
usage: A pointer to a struct rusage structure where the resource usage information will be stored.

The struct rusage structure contains many fields, including CPU time used, maximum resident set size, number of page faults, number of context switches, etc.

7.2 Example: Monitoring CPU Time and Page Faults

The following example demonstrates how to use getrusage to obtain and print the CPU time used by the current process and the number of minor and major page faults it has incurred.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main() {
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) == -1) {
        perror("getrusage failed");
        return EXIT_FAILURE;
    }

    // User CPU time and system CPU time
    printf("User CPU time used: %ld.%06ld sec\n",
           (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec);
    printf("System CPU time used: %ld.%06ld sec\n",
           (long)usage.ru_stime.tv_sec, (long)usage.ru_stime.tv_usec);

    // Page faults
    printf("Minor page faults: %ld\n", usage.ru_minflt);
    printf("Major page faults: %ld\n", usage.ru_majflt);

    return EXIT_SUCCESS;
}

7.3 Understanding the Output

User CPU time: The amount of time the CPU spent executing instructions in user mode (outside the kernel) on behalf of the process.
System CPU time: The amount of time the CPU spent executing in kernel mode on behalf of the process, for example while servicing system calls.
Minor page faults: Faults that the kernel serviced without any disk I/O, for example because the page was already in memory (such as in the page cache or swap cache) and only needed to be mapped into the process.
Major page faults: Faults that required I/O, such as reading the page from a file or from swap.

This example gives a snapshot of the process's resource consumption at the time getrusage is called. By calling getrusage at different points in a program's execution, you can measure the resources consumed during specific operations, which is useful for profiling and optimization.
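Measuring a specific operation comes down to taking one reading before it and one after, then subtracting. The sketch below folds user and system time into a single number to make that subtraction easy; cpu_seconds is a hypothetical helper name.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Total CPU time (user + system) consumed so far by this process,
 * in seconds. Returns -1.0 if getrusage fails. */
double cpu_seconds(void) {
    struct rusage u;
    if (getrusage(RUSAGE_SELF, &u) == -1)
        return -1.0;
    return (double)u.ru_utime.tv_sec + u.ru_utime.tv_usec / 1e6
         + (double)u.ru_stime.tv_sec + u.ru_stime.tv_usec / 1e6;
}
```

Calling cpu_seconds() before and after a block of work and taking the difference gives the CPU time that block consumed, which excludes time the process spent sleeping or waiting for I/O.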

8. `gettimeofday`: System Time

The gettimeofday system call is used to get the current time and date. Unlike time(), which provides the current time in whole seconds since the Epoch (1970-01-01 00:00:00 UTC), gettimeofday also reports microseconds. It's part of the POSIX specification, but POSIX.1-2008 marks it obsolescent in favor of clock_gettime() for newer applications. clock_gettime() offers nanosecond resolution and a choice of clocks, including a monotonic one, whereas gettimeofday reports only wall-clock time, which can jump when the system clock is adjusted (e.g., by an administrator or by NTP).

8.1 Prototype of `gettimeofday`

#include <sys/time.h>

int gettimeofday(struct timeval *tv, struct timezone *tz);

tv: Pointer to a struct timeval structure where the current time will be stored.
tz: This argument is obsolete and should generally be specified as NULL. Historically, it was used to obtain timezone information, but this usage is now deprecated.

The struct timeval structure is defined as follows:

struct timeval {
    time_t      tv_sec;  // seconds since Jan. 1, 1970
    suseconds_t tv_usec; // and microseconds
};

8.2 Example: Getting the Current Time with `gettimeofday`

Here's a simple example demonstrating how to use gettimeofday to fetch the current time with microsecond precision:

#include <stdio.h>
#include <sys/time.h>

int main() {
    struct timeval tv;
    int res;

    // Get the current time
    res = gettimeofday(&tv, NULL);
    if (res == 0) {
        printf("Current time: %ld seconds and %ld microseconds since the Epoch\n",
               (long)tv.tv_sec, (long)tv.tv_usec);
    } else {
        perror("gettimeofday failed");
        return 1;
    }

    return 0;
}

8.3 How It Works

The gettimeofday function fills in the struct timeval you provide with the current time in seconds and microseconds since the Epoch (00:00:00 UTC, January 1, 1970).
The function returns 0 on success, and -1 on failure, setting errno to indicate the error.
This example prints the current time with microsecond precision. Note that the actual resolution of the system clock may vary and may not always provide microsecond accuracy.
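A frequent use of these timestamps is computing the interval between two of them. The arithmetic has to combine both fields, as in the sketch below; elapsed_us is a hypothetical helper name.

```c
#include <sys/time.h>

/* Microseconds elapsed from `start` to `end`.
 * The result is negative if `end` is earlier than `start`. */
long long elapsed_us(const struct timeval *start, const struct timeval *end) {
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (end->tv_usec - start->tv_usec);
}
```

For example, from 1 s + 900000 µs to 3 s + 100000 µs the function returns 1200000, because the negative microsecond difference is absorbed by the seconds term.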

While gettimeofday is useful for obtaining timestamps with microsecond resolution, it measures wall-clock time, so intervals computed from it can be distorted if the system clock is adjusted. For new applications, consider clock_gettime(): use CLOCK_REALTIME for wall-clock timestamps and CLOCK_MONOTONIC for measuring elapsed time. Neither call deals with time zones; use functions such as localtime() to convert a timestamp to local time.

9. The `mlock` Family: Physical Memory Lock

The mlock family of system calls in Linux is used to control the memory locking of a process's address space. Memory locking is a mechanism that ensures pages residing in the virtual memory area of a process are not swapped out to the swap area (disk or similar storage) under any circumstances, ensuring they remain in RAM. This capability is crucial for real-time applications or those that handle sensitive information, where it's necessary to prevent delays due to page faults or to ensure that sensitive information is not written to disk.

9.1 The `mlock` Family of Functions

mlock: Locks a specified region of the process's address space, preventing those pages from being paged out.
munlock: Unlocks a specified region of the process's address space, allowing those pages to be paged out again.
mlockall: Locks all pages mapped into the address space of the calling process.
munlockall: Unlocks all pages mapped into the address space of the calling process.

9.2 Using `mlock` and `munlock`

This section gives the prototypes of mlock and munlock, followed by an example program.

9.3 Prototype

#include <sys/mman.h>

int mlock(const void *addr, size_t len);
int munlock(const void *addr, size_t len);

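addr: The starting address of the memory to lock or unlock. On Linux the address is rounded down to a page boundary, and every page that contains part of the range is affected.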
len: The length of the memory region to lock or unlock.

9.4 Example: Locking Memory to Prevent Swapping

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main() {
    const size_t size = 1024 * 1024; // 1 MB of memory
    void *buffer = malloc(size);
    if (!buffer) {
        perror("malloc failed");
        return 1;
    }

    // Initialize the memory with some data
    memset(buffer, 0, size);

    // Lock the memory to prevent swapping
    if (mlock(buffer, size) == -1) {
        perror("mlock failed");
        free(buffer);
        return 1;
    }

    printf("Memory is locked in RAM.\n");

    // Here you can work with the locked memory

    // Unlock the memory
    if (munlock(buffer, size) == -1) {
        perror("munlock failed");
    }

    free(buffer);
    return 0;
}

9.5 Important Considerations

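Permissions: An unprivileged process can lock memory only up to its RLIMIT_MEMLOCK resource limit, which can be adjusted with ulimit -l or in /etc/security/limits.conf. Privileged processes (on Linux, those with the CAP_IPC_LOCK capability, typically root) are not bound by this limit. The restriction exists because locked memory is guaranteed to stay in RAM, which could otherwise exhaust system resources.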
Use Cases: Memory locking is typically used in applications where timing is critical (real-time applications) or where it's imperative that sensitive information (e.g., cryptographic keys) not be written to disk, even in a swap area.
Resource Management: Excessive use of memory locking can negatively impact system performance by reducing the amount of RAM available for other processes and the operating system's caching mechanisms. It's crucial to lock only as much memory as necessary and to unlock it as soon as it's no longer needed.
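Before locking a large buffer, a program can compare the size against its RLIMIT_MEMLOCK soft limit, as in the sketch below. The helper name fits_memlock_limit is hypothetical, and the check is deliberately rough: it ignores memory the process has already locked, rounding to whole pages, and the exemption for privileged processes.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Returns 1 if `len` bytes fit within the soft RLIMIT_MEMLOCK limit,
 * 0 if they do not, and -1 if the limit cannot be read. */
int fits_memlock_limit(size_t len) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
        return -1;
    if (lim.rlim_cur == RLIM_INFINITY)
        return 1;               /* no limit configured */
    return len <= lim.rlim_cur;
}
```

A result of 1 does not guarantee that mlock will succeed, so the return value of mlock still needs to be checked.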

This feature should be used judiciously, keeping in mind the overall system performance and the security implications of locking sensitive information into physical memory.

10. `mprotect`: Set Memory Permissions

The mprotect system call in Linux is used to change the access permissions of pages in the virtual address space of a process. This call allows a program to control whether a region of memory is readable, writable, or executable, or some combination of these. It's particularly useful for implementing security measures, such as creating read-only data areas, placing inaccessible guard pages to catch overflows, or marking data regions as non-executable.

10.1 Prototype

#include <sys/mman.h>

int mprotect(void *addr, size_t len, int prot);

addr: The starting address of the memory region whose access permissions are to be changed. This address must be aligned to a page boundary.
len: The length of the memory region whose access permissions are to be changed.
prot: The new protection flags for the memory region, which can be a combination of:
PROT_NONE: Pages cannot be accessed.
PROT_READ: Pages can be read.
PROT_WRITE: Pages can be written.
PROT_EXEC: Pages can be executed.

10.2 Example: Changing Memory Permissions with `mprotect`

The following example demonstrates how mprotect can be used to change a memory region's permissions to read-only after initializing it, and then back to read-write.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    // Query the page size; mprotect operates on whole pages
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize == -1) {
        perror("sysconf failed");
        return 1;
    }

    // Allocate one page with mmap, which returns page-aligned memory.
    // malloc gives no such guarantee, so calling mprotect on a malloc'd
    // pointer would typically fail with EINVAL.
    char* buffer = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }

    // Initialize the memory with some data
    strcpy(buffer, "Hello, mprotect!");

    // Change the memory permissions to read-only
    if (mprotect(buffer, pagesize, PROT_READ) == -1) {
        perror("mprotect failed to set read-only");
        munmap(buffer, pagesize);
        return 1;
    }

    printf("Memory set to read-only: %s\n", buffer);

    // Attempt to write to the memory (this will cause a segmentation fault)
    // strcpy(buffer, "Write attempt"); // Uncommenting this line will crash the program

    // Change the memory permissions back to read-write
    if (mprotect(buffer, pagesize, PROT_READ | PROT_WRITE) == -1) {
        perror("mprotect failed to set read-write");
        munmap(buffer, pagesize);
        return 1;
    }

    // Now writing to the memory is possible again
    strcpy(buffer, "Now in read-write mode");
    printf("Updated buffer: %s\n", buffer);

    // Release the mapping
    munmap(buffer, pagesize);
    return 0;
}

10.3 Important Notes

Page Alignment: The address passed to mprotect must be aligned to a page boundary. If it's not, mprotect fails with EINVAL.
Security Implications: Changing memory protections affects a program's attack surface. For example, keeping writable regions non-executable makes it harder for an attacker to run injected code, while making a region both writable and executable weakens that protection.
Error Handling: Always check the return value of mprotect; on failure it returns -1 and sets errno. Separately, accessing memory in a way its current protection does not allow (e.g., writing to a read-only page) delivers SIGSEGV to the process, which by default terminates it with a segmentation fault.
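The page-alignment requirement is usually met with a little address arithmetic. The sketch below rounds an address down to its page boundary and computes the length of the page-aligned range that covers a given region; page_floor and page_span are hypothetical helper names, and the code assumes the page size is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Round `addr` down to the start of the page that contains it.
 * `pagesize` must be a power of two. */
uintptr_t page_floor(uintptr_t addr, size_t pagesize) {
    return addr & ~((uintptr_t)pagesize - 1);
}

/* Length of the page-aligned range that covers [addr, addr + len). */
size_t page_span(uintptr_t addr, size_t len, size_t pagesize) {
    uintptr_t start = page_floor(addr, pagesize);
    uintptr_t end = (addr + len + pagesize - 1) & ~((uintptr_t)pagesize - 1);
    return (size_t)(end - start);
}
```

Keep in mind that protections apply to whole pages: changing them for a range computed this way also affects any unrelated data that happens to share those pages.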

This functionality is a cornerstone of modern security and memory management techniques, allowing for dynamic control over how applications interact with their allocated memory.

11. `nanosleep`: Pause in High Precision

The nanosleep system call is used in Linux to suspend the execution of the calling thread for a specified duration, with nanosecond precision. It provides a higher precision alternative to functions like sleep or usleep, which provide second and microsecond precision, respectively. nanosleep is particularly useful in real-time programming where precise timing is crucial.

11.1 Using `nanosleep` : Prototype

#include <time.h>

int nanosleep(const struct timespec *req, struct timespec *rem);

req: A pointer to a struct timespec that specifies the desired sleep time. The timespec structure contains two fields: tv_sec (seconds) and tv_nsec (nanoseconds, in the range 0 to 999,999,999).
rem: If non-NULL, nanosleep will store the remaining time not slept if the call is interrupted by a signal handler. This allows the program to resume sleeping for the remaining time if desired.

11.2 Example: High Precision Sleep with `nanosleep`

This example demonstrates how to use nanosleep to pause the program execution for a specific duration, specified in seconds and nanoseconds.

#include <stdio.h>
#include <time.h>

int main() {
    // Specify the sleep time: 2.5 seconds
    struct timespec req = {
        .tv_sec = 2,            // 2 seconds
        .tv_nsec = 500000000L   // 500 million nanoseconds (0.5 seconds)
    };

    printf("Sleeping for 2.5 seconds...\n");

    // Sleep for the requested duration
    if (nanosleep(&req, NULL) == -1) {
        perror("nanosleep");
        return 1;
    }

    printf("Wake up!\n");

    return 0;
}

11.3 Handling Interrupts

If nanosleep is interrupted by a signal handler, it returns -1 with errno set to EINTR. You can use the rem parameter to determine how much of the requested time has not been slept, and then call nanosleep again with rem as the req parameter to complete the intended duration. This approach ensures that your program sleeps for the total specified time, even if interrupted.
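The retry approach can be sketched as a small loop that feeds the remaining time back into nanosleep until the call completes. The helper name sleep_fully is an assumption made for this illustration.

```c
#include <errno.h>
#include <time.h>

/* Sleep for the full duration in `req`, resuming after interruptions
 * by signal handlers. Returns 0 on success, or -1 on any error other
 * than EINTR (for example, an invalid tv_nsec value). */
int sleep_fully(const struct timespec *req) {
    struct timespec want = *req;
    struct timespec left;
    while (nanosleep(&want, &left) == -1) {
        if (errno != EINTR)
            return -1;
        want = left;            /* sleep only for the time still remaining */
    }
    return 0;
}
```

Each restart adds a small overhead, so a program that is interrupted very often may sleep slightly longer than requested.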

11.4 Precision and Accuracy

While nanosleep offers nanosecond precision, the actual resolution is limited by the system's timer, which may not provide nanosecond accuracy. The sleep duration might be rounded up to the nearest value supported by the system timer. Additionally, system load and scheduling behavior can affect the timing accuracy.

11.5 Use Cases

nanosleep is ideal for applications requiring precise control over timing, such as multimedia applications, scientific simulations, or any application that needs to wait for specific hardware events or responses with high precision.

This function is a powerful tool for managing precise timing and delays in Linux programs, enabling developers to implement sleep intervals with much greater accuracy than traditional sleep functions.

12. `readlink`: Reading Symbolic Links

The readlink system call is utilized in Linux and Unix-like operating systems to read the value of a symbolic link. Symbolic links are essentially shortcuts or references to other files or directories. Unlike hard links, which act as direct references to file data, symbolic links point to another entry in the filesystem by name. Reading a symbolic link means obtaining the path to which the symbolic link points.

12.1 Using `readlink`: Prototype

#include <unistd.h>

ssize_t readlink(const char *restrict path, char *restrict buf, size_t bufsize);

path: The pathname of the symbolic link.
buf: A buffer where the link's target path will be stored. This buffer will not be null-terminated automatically.
bufsize: The size of the buffer. This determines the maximum number of bytes that can be read.

12.2 Example: Reading a Symbolic Link

The following example demonstrates how to use readlink to read the target of a symbolic link and print it. This code includes handling to ensure the result is null-terminated and thus can be treated as a valid C string.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <symlink>\n", argv[0]);
        return EXIT_FAILURE;
    }

    char *symlinkPath = argv[1];
    char buf[1024]; // Buffer for the symlink target

    ssize_t len = readlink(symlinkPath, buf, sizeof(buf) - 1);
    if (len == -1) {
        perror("readlink");
        return EXIT_FAILURE;
    }

    buf[len] = '\0'; // Ensure null-termination

    printf("The symbolic link '%s' points to '%s'\n", symlinkPath, buf);

    return EXIT_SUCCESS;
}

12.3 Important Notes

Buffer Size: Provide a buffer large enough to hold the entire target path plus the null terminator you add yourself. If the buffer is too small to hold all of the link content, readlink silently truncates the result to bufsize bytes and reports no error, so a return value equal to bufsize may indicate an incomplete path.
Null Termination: readlink does not append a null terminator to the buffer. You must do this yourself, as shown in the example, by setting buf[len] = '\0'.
Return Value: On success, readlink returns the number of bytes placed in the buffer. On failure, it returns -1 and sets errno to indicate the error.
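
Because truncation is silent, one common approach is to retry with a larger buffer whenever readlink fills the buffer completely. The following is a minimal sketch of that idea; the helper name read_link_alloc and the starting size of 64 bytes are assumptions made for illustration, not part of any standard API.

```c
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper: returns a malloc'd, null-terminated copy of the
// link target, or NULL on error. The caller frees the result.
char *read_link_alloc(const char *path) {
    size_t size = 64; // Arbitrary starting size for this sketch
    char *buf = NULL;

    for (;;) {
        char *tmp = realloc(buf, size);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;

        ssize_t len = readlink(path, buf, size);
        if (len == -1) {
            free(buf);
            return NULL;
        }
        if ((size_t)len < size) {
            buf[len] = '\0'; // One byte is still free for the terminator
            return buf;
        }
        size *= 2; // Buffer was filled completely: possibly truncated, retry
    }
}
```

A caller would use it as char *target = read_link_alloc(argv[1]); and free(target) when done.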

Using readlink, you can programmatically resolve the targets of symbolic links, which can be particularly useful in scripts or programs that need to work with filesystem structures or navigate through directories that contain symbolic links.

13. `sendfile`: Fast Data Transfers

The sendfile system call is a Linux mechanism for transferring data between two file descriptors without copying it through user space. Because the kernel moves the data directly, sendfile avoids the read/write round trip through a user-space buffer, which can reduce CPU usage and increase throughput in high-performance network servers and file manipulation utilities.

13.1 Using `sendfile`: Prototype

#include <sys/sendfile.h>

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

out_fd: The file descriptor, opened for writing, that receives the data. It is typically a socket for network operations.
in_fd: The file descriptor, opened for reading, from which data is taken. It must refer to a file that supports mmap-like operations, so it cannot be a socket.
offset: A pointer to an off_t variable that specifies where reading starts in the input file. If offset is not NULL, sendfile starts reading at that offset, leaves the file offset of in_fd unchanged, and updates the variable to point just past the last byte read. If offset is NULL, sendfile reads from the current file offset of in_fd and advances that file offset accordingly.
count: The maximum number of bytes to transfer. sendfile may transfer fewer; the return value reports how many bytes were actually written to out_fd.
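
To illustrate the offset parameter, here is a small sketch that copies a byte range from one descriptor to another. The helper name send_range is hypothetical. It relies on the behavior described above: with a non-NULL offset, sendfile advances the caller's variable rather than the file offset of in_fd.

```c
#include <sys/sendfile.h>
#include <sys/types.h>

// Hypothetical helper: copies up to `count` bytes of in_fd, starting at byte
// `start`, to out_fd. Returns the number of bytes copied, or -1 on error.
ssize_t send_range(int out_fd, int in_fd, off_t start, size_t count) {
    off_t offset = start; // sendfile advances this, not in_fd's file offset
    size_t remaining = count;

    while (remaining > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, remaining);
        if (n == -1) {
            return -1;
        }
        if (n == 0) {
            break; // End of input reached before `count` bytes were copied
        }
        remaining -= (size_t)n;
    }
    return (ssize_t)(offset - start);
}
```

Since the position is tracked in a local variable, several callers could in principle send different ranges of the same open file without disturbing each other's file offset.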

13.2 Example: Copying File Content with `sendfile`

The following example demonstrates using sendfile to copy the content from one file to another. This could be part of a file copy utility or a server sending a file to a client over a socket.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/sendfile.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <source_file> <destination_file>\n", argv[0]);
        return EXIT_FAILURE;
    }

    // Open the source file for reading
    int src = open(argv[1], O_RDONLY);
    if (src == -1) {
        perror("Failed to open source file");
        return EXIT_FAILURE;
    }

    // Open the destination file for writing
    int dest = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (dest == -1) {
        perror("Failed to open destination file");
        close(src);
        return EXIT_FAILURE;
    }

    // Get the size of the source file
    struct stat stat_src;
    if (fstat(src, &stat_src) == -1) {
        perror("Failed to stat source file");
        close(src);
        close(dest);
        return EXIT_FAILURE;
    }

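    // Perform the file copy. sendfile may transfer fewer bytes than
    // requested (at most 0x7ffff000 per call), so loop until done.
    off_t offset = 0;
    while (offset < stat_src.st_size) {
        ssize_t bytesSent = sendfile(dest, src, &offset, stat_src.st_size - offset);
        if (bytesSent == -1) {
            perror("Failed to send file");
            close(src);
            close(dest);
            return EXIT_FAILURE;
        }
        if (bytesSent == 0) {
            break; // Reached end of file earlier than fstat reported
        }
    }

    printf("Copied %lld bytes from %s to %s\n", (long long)offset, argv[1], argv[2]);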

    close(src);
    close(dest);
    return EXIT_SUCCESS;
}

13.3 Notes

Efficiency: sendfile is particularly efficient for copying data between a file and a socket because the data transfer occurs entirely within the kernel, avoiding the overhead of moving data to and from user space.
Limitations: Before Linux 2.6.33, out_fd had to refer to a socket. Since Linux 2.6.33 it can be any file, which is what makes a file-to-file copy possible. The input descriptor still cannot be a socket.
Use Cases: sendfile is widely used in web servers and FTP servers for efficiently sending files over the network. It's also useful in applications that require fast file duplication or transformation processes that can work with file descriptors directly.

This direct data transfer capability makes sendfile an invaluable tool for developing high-performance file handling and network communication applications.

14. `setitimer`: Create Timers

The setitimer function in Unix-like operating systems allows a process to set a timer that can generate signals after a specified interval, offering a mechanism for periodic operations or implementing timeouts. It's especially useful in scenarios where non-blocking or asynchronous behavior is desired.

14.1 Understanding `setitimer`

The setitimer function can set three types of timers:

ITIMER_REAL: Counts down in real (wall-clock) time. Upon expiration, SIGALRM is delivered.
ITIMER_VIRTUAL: Counts down against the user-mode CPU time consumed by the process. Upon expiration, SIGVTALRM is delivered.
ITIMER_PROF: Counts down against the total CPU time consumed by the process, both in user mode and in the kernel on its behalf. Upon expiration, SIGPROF is delivered.

14.2 Using `setitimer`: Prototype

#include <sys/time.h>

int setitimer(int which, const struct itimerval *new_value, struct itimerval *old_value);

which: The timer type (ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF).
new_value: Specifies the new timer value.
old_value: If not NULL, setitimer stores the current timer value here before it is updated.
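
The old_value parameter is handy when cancelling a timer, because it tells you how much time was left. The sketch below assumes a hypothetical helper named cancel_real_timer; it disarms ITIMER_REAL by installing a zero it_value and hands back the previous remaining time.

```c
#include <stddef.h>
#include <sys/time.h>

// Hypothetical helper: disarms the ITIMER_REAL timer and reports, through
// *remaining, how much time was left. Returns 0 on success, -1 on error.
int cancel_real_timer(struct timeval *remaining) {
    struct itimerval zero = {{0, 0}, {0, 0}}; // A zero it_value disarms the timer
    struct itimerval old;

    if (setitimer(ITIMER_REAL, &zero, &old) == -1) {
        return -1;
    }
    if (remaining != NULL) {
        *remaining = old.it_value; // Time left before the cancelled expiration
    }
    return 0;
}
```

If no timer was armed, the reported remaining time is zero.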

14.3 The `itimerval` Structure

struct itimerval {
    struct timeval it_interval; // Interval for a periodic timer (zero = one-shot).
    struct timeval it_value;    // Time until the next expiration (zero disarms the timer).
};

Each timeval structure represents an amount of time as seconds (tv_sec) and microseconds (tv_usec).
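
Since intervals are often expressed in milliseconds, a small conversion helper can keep the seconds/microseconds split in one place. This is only a sketch; make_itimerval is a hypothetical name, and it assumes non-negative inputs.

```c
#include <sys/time.h>

// Hypothetical helper: builds an itimerval from millisecond values.
// interval_ms == 0 yields a one-shot timer; initial_ms == 0 would disarm it.
struct itimerval make_itimerval(long initial_ms, long interval_ms) {
    struct itimerval t;
    t.it_value.tv_sec = initial_ms / 1000;
    t.it_value.tv_usec = (initial_ms % 1000) * 1000;
    t.it_interval.tv_sec = interval_ms / 1000;
    t.it_interval.tv_usec = (interval_ms % 1000) * 1000;
    return t;
}
```

For example, make_itimerval(1500, 250) describes a timer that first fires after 1.5 seconds and then every 250 milliseconds; the result can be passed to setitimer as new_value.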

14.4 Example: Setting a Periodic Timer

Here's how to set up a periodic ITIMER_REAL timer that sends SIGALRM every 2 seconds, demonstrating basic usage of setitimer.

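#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h> // pause(), write()

void handle_sigalrm(int sig) {
    (void)sig;
    // write() is async-signal-safe; printf() is not
    const char msg[] = "Timer expired\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}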

int main() {
    struct itimerval timer;
    struct sigaction sa;

    // Set up the signal handler
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = &handle_sigalrm;
    sigaction(SIGALRM, &sa, NULL);

    // Configure the timer to expire after 2 seconds...
    timer.it_value.tv_sec = 2;
    timer.it_value.tv_usec = 0;

    // ...and every 2 seconds after that.
    timer.it_interval.tv_sec = 2;
    timer.it_interval.tv_usec = 0;

    // Start the timer
    if (setitimer(ITIMER_REAL, &timer, NULL) == -1) {
        perror("setitimer");
        return 1;
    }

    // Main loop
    while (1) {
        // Your application logic here
        pause(); // Wait for signals
    }

    return 0;
}

14.5 Notes and Best Practices

Signal Handling: Ensure that you've set up a signal handler for the timer's signal (SIGALRM, SIGVTALRM, or SIGPROF) before starting the timer.
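Accuracy and Resolution: Although setitimer accepts microseconds, a timer never expires before the requested time but may expire somewhat later, depending on the system timer resolution and the system load. On kernels without high-resolution timers, the resolution is bounded by the clock tick (for example, 10 milliseconds at 100 Hz); the tick rate is a kernel configuration option and varies between systems.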
Use in Modern Applications: For new applications, consider using timerfd (with timerfd_create, timerfd_settime, etc.) for timer functionality, especially in event-driven programs that use I/O multiplexing (select, poll, epoll). Timerfd integrates better with such models by providing a file descriptor that can be monitored for timer expiration events.
Portability: setitimer is specified by POSIX, but POSIX.1-2008 marks it obsolete in favor of the POSIX timers API (timer_create, timer_settime). Prefer that API for portable code, and timerfd for Linux-specific applications that need fine-grained timer control.

15. `sysinfo`: Retrieving System Statistics

The sysinfo system call returns a snapshot of overall system statistics: the time since boot, load averages, memory and swap usage, and the number of current processes. It is Linux-specific and offers a programmatic alternative to parsing files such as /proc/meminfo and /proc/loadavg.

15.1 Using `sysinfo`: Prototype

#include <sys/sysinfo.h>

int sysinfo(struct sysinfo *info);

info: A pointer to a struct sysinfo that the kernel fills with the current statistics.

The struct sysinfo is defined as follows:

struct sysinfo {
    long uptime;             // Seconds since boot
    unsigned long loads[3];  // 1, 5, and 15 minute load averages
    unsigned long totalram;  // Total usable main memory size
    unsigned long freeram;   // Available memory size
    unsigned long sharedram; // Amount of shared memory
    unsigned long bufferram; // Memory used by buffers
    unsigned long totalswap; // Total swap space size
    unsigned long freeswap;  // Swap space still available
    unsigned short procs;    // Number of current processes
    unsigned long totalhigh; // Total high memory size
    unsigned long freehigh;  // Available high memory size
    unsigned int mem_unit;   // Memory unit size in bytes
    // Padding omitted
};

15.2 Interpreting the Fields

Memory Units: The memory and swap fields are counted in units of mem_unit bytes. Multiply a field by mem_unit to obtain a size in bytes.
Load Averages: The loads values are fixed-point numbers scaled by 65536 (1 << 16). Divide by 65536.0 to obtain the figures reported by tools such as uptime.
Uptime: The uptime field is given in whole seconds since boot.

15.3 Key Points

Return Value: On success, sysinfo returns 0. On failure, it returns -1 and sets errno (EFAULT if info is not a valid address).
Portability: sysinfo is Linux-specific and should not be used in programs intended to be portable.
Snapshot Semantics: The values describe the system at the moment of the call. Call sysinfo again whenever you need fresh figures.

This call provides a compact way to gather system-wide statistics in your program, from simple uptime reporting to basic memory and load monitoring.
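
As a sketch of how the sysinfo call named in this section's heading might be used, the two helpers below read the total RAM and the 1-minute load average. The helper names are hypothetical. The conversions follow the conventions of struct sysinfo from <sys/sysinfo.h>: memory fields are counted in units of mem_unit bytes, and load averages are fixed-point values scaled by 65536.

```c
#include <sys/sysinfo.h>

// Hypothetical helper: total RAM in bytes, or 0 if sysinfo fails.
unsigned long long total_ram_bytes(void) {
    struct sysinfo info;
    if (sysinfo(&info) == -1) {
        return 0;
    }
    // Memory fields are counted in units of mem_unit bytes
    return (unsigned long long)info.totalram * info.mem_unit;
}

// Hypothetical helper: 1-minute load average, or -1.0 if sysinfo fails.
double load_average_1min(void) {
    struct sysinfo info;
    if (sysinfo(&info) == -1) {
        return -1.0;
    }
    // loads[] is fixed-point, scaled by 65536 (1 << 16)
    return info.loads[0] / 65536.0;
}
```

A monitoring program could call these periodically and print the results, for example with printf("%.2f\n", load_average_1min());.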

16. `uname`

The uname system call in Unix-like operating systems provides a simple way to retrieve the system's basic information, including the operating system name, version, architecture, and more. This information can be particularly useful for programs that need to adjust their behavior based on the system they're running on.

16.1 Using `uname`: Prototype

#include <sys/utsname.h>

int uname(struct utsname *buf);

buf: A pointer to a struct utsname structure that will be filled with the system information.

The struct utsname has the following fields. The array lengths are unspecified; each field holds a null-terminated string:

struct utsname {
    char sysname[];  // Operating system name (e.g., "Linux")
    char nodename[]; // Name within "some implementation-defined network"
    char release[];  // Operating system release (e.g., "4.15.0-54-generic")
    char version[];  // Operating system version
    char machine[];  // Hardware identifier (e.g., "x86_64")
#ifdef _GNU_SOURCE
    char domainname[]; // NIS or YP domain name
#endif
};

16.2 Example: Retrieving and Displaying System Information

This example demonstrates how to use uname to fetch and display the system's information:

#include <stdio.h>
#include <sys/utsname.h>

int main() {
    struct utsname unameData;

    // Fetch the system information
    if (uname(&unameData) < 0) {
        perror("uname");
        return 1;
    }

    // Display the fetched information
    printf("System Name: %s\n", unameData.sysname);
    printf("Node Name:  %s\n", unameData.nodename);
    printf("Release:    %s\n", unameData.release);
    printf("Version:    %s\n", unameData.version);
    printf("Machine:    %s\n", unameData.machine);
    // Compiled only when _GNU_SOURCE is defined (e.g., gcc -D_GNU_SOURCE)
    #ifdef _GNU_SOURCE
    printf("Domain Name: %s\n", unameData.domainname); // GNU extension
    #endif

    return 0;
}

16.3 Key Points

The uname system call fills in a struct utsname with information about the system.
This information includes the operating system name, the network node hostname, the OS release level, the OS version, and the hardware type.
The domainname field is a GNU extension and might not be present on all systems. glibc exposes it only when _GNU_SOURCE is defined before <sys/utsname.h> is included, so guard any access to it with conditional compilation to keep the code portable.
The uname command in Unix-like systems' command-line interfaces is based on the same system call and provides similar information.

Using uname, applications can identify the operating environment, making it possible to perform platform-specific operations or optimizations. This is particularly useful for portable applications that need to run across different Unix-like systems.
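
One way a program might adjust its behavior is by checking the kernel version reported in the release field. The sketch below assumes the release string starts with "major.minor", as in the "4.15.0-54-generic" example; kernel_version is a hypothetical helper name.

```c
#include <stdio.h>
#include <sys/utsname.h>

// Hypothetical helper: parses the leading "major.minor" of the release
// string reported by uname. Returns 0 on success, -1 on failure.
int kernel_version(int *major, int *minor) {
    struct utsname u;
    if (uname(&u) == -1) {
        return -1;
    }
    // Release strings look like "4.15.0-54-generic"; only the first two
    // numeric components are read here
    if (sscanf(u.release, "%d.%d", major, minor) != 2) {
        return -1;
    }
    return 0;
}
```

Where possible, testing for a feature directly (for example, by calling it and checking for ENOSYS) tends to be more robust than comparing version numbers, since distributions may backport features.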

This tutorial covers a broad range of system calls you can use to interact with the Linux kernel and manipulate various system resources. Each example provides a basic usage scenario for the corresponding system call. Remember, when working with system calls, always check the return value for errors and handle them appropriately in your real applications.

📝 Article Author : SEMRADE Tarik
🏷️ Author position : Embedded Software Engineer
