/*
 * dev_nic.h - NIC (Network Controller) generic device API definition.
 *
 * Author  Alain Greiner    (2016,2017,2018,2019,2020)
 *
 * Copyright (c) UPMC Sorbonne Universites
 *
 * This file is part of ALMOS-MKH
 *
 * ALMOS-MKH is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License as published by
 * the Free Software Foundation; version 2.0 of the License.
 *
 * ALMOS-MKH is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with ALMOS-kernel; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
 */

#ifndef _DEV_NIC_H
#define _DEV_NIC_H

#include
#include
#include
#include
#include

/****  Forward declarations  ****/

struct chdev_s;

/*****************************************************************************************
 *     Generic Network Interface Controller definition
 *
 * This device provides access to a generic Gigabit Ethernet network controller.
 * It assumes that the NIC hardware peripheral handles two packet queues for sent (TX)
 * and received (RX) packets.
 *
 * The supported protocol stack is : Ethernet / IPV4 / TCP or UDP
 *
 * 1) hardware assumptions
 *
 * The NIC device is handling two (infinite) streams of packets to or from the network.
 * It is the driver's responsibility to move the RX packets from the NIC to the RX queue,
 * and the TX packets from the TX queue to the NIC.
 *
 * As the RX and TX queues are independent, there is one NIC_RX device descriptor
 * to handle RX packets, and another NIC_TX device descriptor to handle TX packets.
 *
 * In order to improve throughput, the NIC controller can implement multiple (N) channels.
 * In this case, the channel index is defined by a hash function computed from the remote
 * IP address and port. This index is computed by the hardware for an RX packet, and is
 * computed by the kernel for a TX packet, using a specific driver function. TODO ...
 * The 2*N chdevs, and the associated server threads implementing the protocol stack,
 * are distributed in 2*N different clusters.
 *
 * 2) User API
 *
 * On the user side, ALMOS-MKH implements the POSIX socket API.
 * The kernel functions implementing the socket related syscalls are :
 * - dev_nic_socket()    : create a local socket registered in process fd_array[].
 * - dev_nic_bind()      : attach a local IP address and port to a local socket.
 * - dev_nic_listen()    : local server makes a passive open.
 * - dev_nic_connect()   : local client makes an active open to a remote server.
 * - dev_nic_accept()    : local server accepts a new remote client.
 * - dev_nic_send()      : send data on a connected socket.
 * - dev_nic_recv()      : receive data on a connected socket.
 * - dev_nic_sendto()    : send a packet to a remote (IP address/port).
 * - dev_nic_recvfrom()  : receive a packet from a remote (IP address/port).
 * - dev_nic_close()     : close a socket.
 *
 * 3) TX stream
 *
 * The internal API between the client threads and the TX server thread defines
 * the 3 following commands:
 *   . SOCKET_TX_CONNECT : request to execute the 3 steps TCP connection handshake.
 *   . SOCKET_TX_SEND    : send data to a remote socket (UDP or TCP).
 *   . SOCKET_TX_CLOSE   : request to execute the 3 steps TCP close handshake.
 *
 * - These 3 commands are blocking for the client thread, which registers the command in
 *   the socket descriptor, blocks on the BLOCKED_IO condition, and deschedules.
 * - The TX server thread is acting as a multiplexer. It scans the list of attached sockets,
 *   to handle all valid commands: one UDP packet or TCP segment per iteration.
 *   It uses the user buffer defined by the client thread, and attached to the socket
 *   descriptor, as a retransmission buffer. It blocks and deschedules on the
 *   BLOCKED_CLIENT condition when there is no more active TX command registered in any
 *   socket. It is re-activated by the first client thread registering a new TX command
 *   in the socket descriptor.
 *   It unblocks a client thread only when a command is fully completed. It signals errors
 *   to the client thread using the tx_error field in the socket descriptor.
 *
 * 4) RX stream
 *
 * The communication between the RX server thread and the client threads expecting data
 * is done through receive buffers (one private buffer per socket) that are handled
 * as single-writer / single-reader FIFOs, called rx_buf.
 * - The RX server thread is acting as a demultiplexer: it handles one TCP segment or UDP
 *   packet per iteration, and registers the data in the rx_buf of the socket matching
 *   the packet. It simply discards all packets that do not match a registered socket.
 *   When a client thread is registered in the socket descriptor, the RX server thread
 *   unblocks this client thread as soon as there is data available in rx_buf.
 *   It blocks and deschedules on the BLOCKED_ISR condition when there are no more packets
 *   in the NIC_RX queue. It is unblocked by the hardware ISR.
 * - The client thread simply accesses the rx_buf attached to the socket descriptor, and
 *   consumes the available data when the rx_buf is non empty. It blocks on the BLOCKED_IO
 *   condition, and deschedules when the rx_buf is empty.
 *
 * 5) R2T queue
 *
 * To implement the TCP "3 steps handshake" protocol, the RX server thread can directly
 * request the associated TX server thread to send control packets in the TX stream,
 * using a dedicated R2T (RX to TX) FIFO stored in the socket descriptor.
 *
 * 6) NIC driver API
 *
 * The generic NIC device "driver" API defines the following commands to the NIC driver:
 * - READABLE : returns true if at least one RX packet is available in RX queue.
 * - WRITABLE : returns true if at least one empty slot is available in TX queue.
 * - READ     : consume one packet from the RX queue.
 * - WRITE    : produce one packet to the TX queue.
 * All RX or TX packets are sent or received in standard 2 Kbytes kernel buffers,
 * that are dynamically allocated by the protocol stack.
 *
 * The actual TX and RX queue structures depend on the hardware NIC implementation,
 * and are defined in the HAL specific driver code.
 *
 * WARNING: the WTI mailboxes used by the driver to receive events from the hardware
 * (available RX packet, or available free TX slot, for a given channel), must be
 * statically allocated during the kernel initialisation phase, and must be
 * routed to the cluster containing the associated TX/RX chdev and server thread.
 *
 *****************************************************************************************/

/******************************************************************************************
 *   Various constants used by the protocol stack
 *****************************************************************************************/

#define SRC_MAC_54             0x54
#define SRC_MAC_32             0x32
#define SRC_MAC_10             0x10
#define DST_MAC_54             0x54
#define DST_MAC_32             0x32
#define DST_MAC_10             0x10

#define TCP_HEAD_LEN           20
#define UDP_HEAD_LEN           8
#define IP_HEAD_LEN            20
#define ETH_HEAD_LEN           14

#define PROTOCOL_UDP           0x11
#define PROTOCOL_TCP           0x06

#define TCP_ISS                0x10000

#define PAYLOAD_MAX_LEN        1500         // max payload for a UDP packet or a TCP segment

#define TCP_FLAG_FIN           0x01
#define TCP_FLAG_SYN           0x02
#define TCP_FLAG_RST           0x04
#define TCP_FLAG_PSH           0x08
#define TCP_FLAG_ACK           0x10
#define TCP_FLAG_URG           0x20

#define NIC_RX_BUF_SIZE        0x100000     // 1 Mbytes
#define NIC_R2T_QUEUE_SIZE     0x64         // smallest KCM size
#define NIC_CRQ_QUEUE_SIZE     0x8          // 8 * sizeof(sockaddr_t) = smallest KCM size
#define NIC_PKT_MAX_SIZE       1500         // for Ethernet
#define NIC_KERNEL_BUF_SIZE    2000         // for one ETH/IP/TCP packet

/*****************************************************************************************
 * This defines the extension for the generic NIC device.
 * The actual queue descriptor depends on the implementation.
 *
 * WARNING : for all NIC_TX and NIC_RX chdevs, the xlist rooted in the chdev
 *           ("wait_root" and "wait_lock" fields) is actually a list of sockets.
 ****************************************************************************************/

typedef struct nic_extend_s
{
    void         * queue;       /*! local pointer on NIC queue descriptor (RX or TX)    */
}
nic_extend_t;

/*****************************************************************************************
 * This enum defines the various implementations of the generic NIC peripheral.
 * This enum must be kept consistent with the define in the arch_info.h file.
 ****************************************************************************************/

typedef enum nic_impl_e
{
    IMPL_NIC_CBF =   0,
    IMPL_NIC_I86 =   1,
}
nic_impl_t;

/****************************************************************************************
 * This defines the (implementation independent) commands to access the NIC hardware.
 * These commands are registered by the NIC_TX and NIC_RX server threads in the
 * server thread descriptor, to be used by the NIC driver.
 * The buffer is always a 2 Kbytes kernel buffer, containing an Ethernet packet.
 ****************************************************************************************/

typedef enum nic_cmd_e
{
    NIC_CMD_WRITABLE = 10,   /*! test TX queue not full (for a given packet length)     */
    NIC_CMD_WRITE    = 11,   /*! put one (given length) packet to TX queue              */
    NIC_CMD_READABLE = 12,   /*! test RX queue not empty (for any packet length)        */
    NIC_CMD_READ     = 13,   /*! get one (any length) packet from RX queue              */
}
nic_cmd_t;

typedef struct nic_command_s
{
    xptr_t      dev_xp;      /*! extended pointer on NIC chdev descriptor               */
    nic_cmd_t   type;        /*! command type                                           */
    uint8_t   * buffer;      /*! local pointer on buffer (kernel or user space)         */
    uint32_t    length;      /*! number of bytes in buffer                              */
    uint32_t    status;      /*! return value (depends on command type)                 */
    uint32_t    error;       /*! return an error from the hardware (0 if no error)      */
}
nic_command_t;

/*****************************************************************************************
 * This structure defines a socket descriptor. In order to parallelize the transfers,
 * the set of all registered sockets is split in several subsets.
 * The number of subsets is the number of NIC channels.
 * The distribution key is computed from the (remote_addr/remote_port) couple.
 * This computation is done by the NIC hardware for RX packets,
 * and by the dev_nic_connect() function for the TX packets.
 *
 * A socket is attached to the NIC_TX[channel] & NIC_RX[channel] chdevs.
 * Each socket descriptor allows the TX and RX server threads to access various buffers:
 * - the user "send" buffer contains the data to be sent by the TX server thread.
 * - the kernel "receive" buffer contains the data received by the RX server thread.
 * - the kernel "r2t" buffer allows the RX server thread to make direct requests
 *   to the associated TX server (to implement the TCP 3 steps handshake).
 *
 * The synchronisation mechanism between the client threads and the server threads
 * is different for TX and RX transfers:
 *
 * 1) For a TX transfer, there can be only one client thread for a given socket,
 *    the transfer is always initiated by the local process, and all TX commands
 *    (CONNECT/SEND/CLOSE) are blocking for the client thread. The user buffer is
 *    used by TCP to handle retransmissions when required.
 *    The client thread registers the command in the thread descriptor, registers itself
 *    in the socket descriptor, unblocks the TX server thread from the BLOCKED_CLIENT
 *    condition, blocks itself on the BLOCKED_IO condition, and deschedules.
 *    When the command is completed, the TX server thread unblocks the client thread.
 *    The TX server blocks itself on the BLOCKED_CLIENT condition when there is no
 *    pending command and the R2T queue is empty. It is unblocked when a client
 *    registers a new command, or when the RX server thread registers a new request
 *    in the R2T queue.
 *    The tx_valid flip-flop is SET by the client thread to signal a valid command.
 *    It is RESET by the server thread when the command is completed: for a SEND,
 *    all bytes have been sent (UDP) or acknowledged (TCP).
 *
 * 2) For an RX transfer, there can be only one client thread for a given socket,
 *    but the transfer is initiated by the remote process, and the RECV command
 *    is not really blocking: the data can arrive before the local RECV command is
 *    executed, and the server thread does not wait to receive all requested data
 *    to deliver data to the client thread. Therefore each socket contains a receive
 *    buffer (rx_buf) handled as a single-writer / single-reader FIFO.
 *    The client thread consumes data from the rx_buf when possible. It blocks on the
 *    BLOCKED_IO condition and deschedules when the rx_buf is empty.
 *    It is unblocked by the RX server thread when new data is available in the rx_buf.
 *    The RX server blocks itself on the BLOCKED_ISR condition when the NIC_RX packets
 *    queue is empty. It is unblocked by the hardware when new packets are available.
 *
 * Note : the socket domains and types are defined in the "shared_socket.h" file.
 ****************************************************************************************/

/******************************************************************************************
 * This function returns a printable string for a given NIC command.
 ******************************************************************************************
 * @ type    : NIC command type
 *****************************************************************************************/
char * nic_cmd_str( uint32_t type );

/******************************************************************************************
 * This function returns a printable string for a given socket state.
 ******************************************************************************************
 * @ state   : socket state
 *****************************************************************************************/
char * socket_state_str( uint32_t state );

/******************************************************************************************
 * This function completes the NIC-RX and NIC-TX chdev descriptors initialisation,
 * namely the link with the implementation specific driver.
 * The func, impl, channel, is_rx, base fields have been previously initialised.
 * It calls the specific driver initialisation function, to initialise the hardware
 * device and the specific data structures when required.
 * It creates the associated server thread and allocates a WTI from the local ICU.
 * For a NIC_TX chdev, it allocates and initializes the R2T waiting queue used by the
 * NIC_RX[channel] server to send direct requests to the NIC_TX[channel] server.
 * It must be executed by a local thread.
 ******************************************************************************************
 * @ chdev    : local pointer on NIC chdev descriptor.
 *****************************************************************************************/
void dev_nic_init( struct chdev_s * chdev );


/* functions implementing the socket API */

/****************************************************************************************
 * This function implements the socket() syscall.
 * This function allocates and initializes in the calling thread cluster:
 * - a new socket descriptor, defined by the "domain" and "type" arguments,
 * - a new file descriptor, associated to this socket.
 * It registers the file descriptor in the reference process fd_array[], sets
 * the socket state to IDLE, and returns the "fdid" value.
 ****************************************************************************************
 * @ domain  : [in] socket protocol family (AF_UNIX / AF_INET)
 * @ type    : [in] socket type (SOCK_DGRAM / SOCK_STREAM).
 * @ return a file descriptor "fdid" if success / return -1 if failure.
 ***************************************************************************************/
int dev_nic_socket( uint32_t   domain,
                    uint32_t   type );

/****************************************************************************************
 * This function implements the bind() syscall.
 * It initializes the "local_addr" and "local_port" fields in the socket
 * descriptor identified by the "fdid" argument and sets the socket state to BOUND.
 * It can be called by a thread running in any cluster.
 ****************************************************************************************
 * @ fdid      : [in] file descriptor identifying the socket.
 * @ addr      : [in] local IP address.
 * @ port      : [in] local port.
 * @ return 0 if success / return -1 if failure.
 ***************************************************************************************/
int dev_nic_bind( uint32_t fdid,
                  uint32_t addr,
                  uint16_t port );

/****************************************************************************************
 * This function implements the listen() syscall.
 * It is called by a (local) server process to specify the max size of the queue
 * registering the (remote) client process connections, and sets the socket identified
 * by the "fdid" argument to LISTEN state. It applies only to sockets of type TCP.
 * It can be called by a thread running in any cluster.
 * TODO handle the "max_pending" argument...
 ****************************************************************************************
 * @ fdid        : [in] file descriptor identifying the local server socket.
 * @ max_pending : [in] max number of accepted remote client connections.
 ***************************************************************************************/
int dev_nic_listen( uint32_t fdid,
                    uint32_t max_pending );

/****************************************************************************************
 * This function implements the connect() syscall.
 * It is used by a (local) client process to connect a local socket identified by
 * the "fdid" argument, to a remote socket identified by the "remote_addr" and
 * "remote_port" arguments. It can be used for both UDP and TCP sockets.
 * It computes the nic_channel index from the "remote_addr" and "remote_port" values,
 * and initializes "remote_addr", "remote_port", "nic_channel" in the local socket.
 * It registers the socket in the two lists of clients rooted in the NIC_RX[channel]
 * and NIC_TX[channel] chdevs. It can be called by a thread running in any cluster.
 * WARNING : the clients are the socket descriptors, and NOT the thread descriptors.
 ****************************************************************************************
 * Implementation Note:
 * - For a TCP socket, it updates the "remote_addr", "remote_port", "nic_channel" fields
 *   in the socket descriptor defined by the "fdid" argument, and registers this socket
 *   in the lists of sockets attached to the NIC_TX and NIC_RX chdevs.
 *   Then, it registers a CONNECT command in the "nic_cmd" field of the client thread
 *   descriptor to request the NIC_TX server thread to execute the 3 steps handshake,
 *   and updates the "tx_client" field in the socket descriptor. It unblocks the NIC_TX
 *   server thread, blocks on the THREAD_BLOCKED_IO condition and deschedules.
 * - For a UDP socket, it simply updates "remote_addr", "remote_port", "nic_channel"
 *   in the socket descriptor defined by the "fdid" argument, and registers this socket
 *   in the lists of sockets attached to the NIC_TX and NIC_RX chdevs.
 *   Then, it sets the socket state to CONNECT, without unblocking the NIC_TX server
 *   thread, and without blocking itself.
 * TODO : the nic_channel index computation must be done by a driver specific function.
 ****************************************************************************************
 * @ fdid          : [in] file descriptor identifying the socket.
 * @ remote_addr   : [in] remote IP address.
 * @ remote_port   : [in] remote port.
 * @ return 0 if success / return -1 if failure.
 ***************************************************************************************/
int dev_nic_connect( uint32_t   fdid,
                     uint32_t   remote_addr,
                     uint16_t   remote_port );

/****************************************************************************************
 * This function implements the accept() syscall.
 * It is executed by a server process, waiting for one (or several) client process(es)
 * requesting a connection on a socket identified by the "fdid" argument.
 * This socket was previously created with socket(), bound to a local address with bind(),
 * and is listening for connections after a listen().
 * This function extracts the first connection request on the CRQQ queue of pending
 * requests, creates a new socket with the same properties as the existing socket,
 * and allocates a new file descriptor for this new socket.
 * If no pending connections are present on the queue, it blocks the caller until a
 * connection is present.
 * The new socket cannot accept more connections, but the original socket remains open.
 * It returns the new socket "fdid", and registers in the "address" and "port" arguments
 * the remote client IP address & port. It applies only to sockets of type SOCK_STREAM.
 ****************************************************************************************
 * @ fdid     : [in] file descriptor identifying the listening socket.
 * @ address  : [out] remote client IP address.
 * @ port     : [out] remote client port.
 * @ return the new socket "fdid" if success / return -1 if failure
 ***************************************************************************************/
int dev_nic_accept( uint32_t   fdid,
                    uint32_t * address,
                    uint16_t * port );

/****************************************************************************************
 * This blocking function implements the send() syscall.
 * It is used to send data stored in the user buffer, identified by the "u_buf" and
 * "length" arguments, to a connected (TCP or UDP) socket, identified by the "fdid"
 * argument.
 * The work is actually done by the NIC_TX server thread, and the synchronisation
 * between the client and the server threads uses the "tx_valid" set/reset flip-flop:
 * The client thread registers itself in the socket descriptor, registers in the queue
 * rooted in the NIC_TX[index] chdev, sets "tx_valid", unblocks the server thread, and
 * finally blocks on THREAD_BLOCKED_IO, and deschedules.
 * When the TX server thread completes the command (all data has been sent for a UDP
 * socket, or acknowledged for a TCP socket), the server thread resets "tx_valid" and
 * unblocks the client thread.
 * This function can be called by a thread running in any cluster.
 * WARNING : This implementation does not support several concurrent SEND/SENDTO commands
 * on the same socket, as only one TX thread can register in a given socket.
 ****************************************************************************************
 * @ fdid      : [in] file descriptor identifying the socket.
 * @ u_buf     : [in] pointer on buffer containing packet in user space.
 * @ length    : [in] packet size in bytes.
 * @ return number of sent bytes if success / return -1 if failure.
 ***************************************************************************************/
int dev_nic_send( uint32_t    fdid,
                  uint8_t   * u_buf,
                  uint32_t    length );

/****************************************************************************************
 * This blocking function implements the sendto() syscall.
 * It registers the "remote_addr" and "remote_port" arguments in the local socket
 * descriptor, and does the same thing as the dev_nic_send() function above,
 * but can be called on an unconnected UDP socket.
 ****************************************************************************************
 * @ fdid        : [in] file descriptor identifying the socket.
 * @ u_buf       : [in] pointer on buffer containing packet in user space.
 * @ length      : [in] packet size in bytes.
 * @ remote_addr : [in] destination IP address.
 * @ remote_port : [in] destination port.
 * @ return number of sent bytes if success / return -1 if failure.
 ***************************************************************************************/
int dev_nic_sendto( uint32_t    fdid,
                    uint8_t   * u_buf,
                    uint32_t    length,
                    uint32_t    remote_addr,
                    uint32_t    remote_port );

/****************************************************************************************
 * This blocking function implements the recv() syscall.
 * It is used to receive data that has been stored by the NIC_RX server thread in the
 * rx_buf of a connected (TCP or UDP) socket, identified by the "fdid" argument.
 * The synchronisation between the client and the server threads uses the "rx_valid"
 * set/reset flip-flop: if "rx_valid" is set, the client simply moves the available
 * data from the "rx_buf" to the user buffer identified by the "u_buf" and "length"
 * arguments, and resets the "rx_valid" flip-flop. If "rx_valid" is not set, the client
 * thread registers itself in the socket descriptor, registers in the clients queue rooted
 * in the NIC_RX[index] chdev, and finally blocks on THREAD_BLOCKED_IO, and deschedules.
 * The client thread is re-activated by the RX server, which sets the "rx_valid" flip-flop
 * as soon as data is available in the "rx_buf" (can be less than the user buffer size).
 * This function can be called by a thread running in any cluster.
 * WARNING : This implementation does not support several concurrent RECV/RECVFROM
 * commands on the same socket, as only one RX thread can register in a given socket.
 ****************************************************************************************
 * @ fdid      : [in] file descriptor identifying the socket.
 * @ u_buf     : [in] pointer on buffer in user space.
 * @ length    : [in] buffer size in bytes.
 * @ return number of received bytes if success / return -1 if failure.
 ***************************************************************************************/
int dev_nic_recv( uint32_t    fdid,
                  uint8_t   * u_buf,
                  uint32_t    length );

/****************************************************************************************
 * This blocking function implements the recvfrom() syscall.
 * It registers the "remote_addr" and "remote_port" arguments in the local socket
 * descriptor, and does the same thing as the dev_nic_recv() function above,
 * but can be called on an unconnected UDP socket.
 ****************************************************************************************
 * @ fdid        : [in] file descriptor identifying the socket.
 * @ u_buf       : [in] pointer on buffer in user space.
 * @ length      : [in] buffer size in bytes.
 * @ remote_addr : [in] remote IP address.
 * @ remote_port : [in] remote port.
 * @ return number of received bytes if success / return -1 if failure.
 ***************************************************************************************/
int dev_nic_recvfrom( uint32_t    fdid,
                      uint8_t   * u_buf,
                      uint32_t    length,
                      uint32_t    remote_addr,
                      uint32_t    remote_port );


/* Instrumentation functions */

/******************************************************************************************
 * This instrumentation function displays on the TXT0 kernel terminal the content
 * of the instrumentation registers contained in the NIC device.
 *****************************************************************************************/
void dev_nic_print_stats( void );

/******************************************************************************************
 * This instrumentation function resets all instrumentation registers contained
 * in the NIC device.
 *****************************************************************************************/
void dev_nic_clear_stats( void );


/* Functions executed by the TX and RX server threads */

/******************************************************************************************
 * This function is executed by the server thread associated to a NIC_TX[channel] chdev.
 * This TX server thread is created by the dev_nic_init() function.
 * It builds and sends UDP packets or TCP segments for all client threads registered in
 * the NIC_TX[channel] chdev. The command types are (CONNECT / SEND / CLOSE), and the
 * priority between clients is round-robin. It takes into account the requests registered
 * by the RX server thread in the R2T queue associated to the involved socket.
 * When a command is completed, it unblocks the client thread. For a SEND command, the
 * last byte must have been sent for a UDP socket, and it must have been acknowledged
 * for a TCP socket.
 * When the TX client threads queue is empty, it blocks on the THREAD_BLOCKED_CLIENT
 * condition and deschedules. It is re-activated by a client thread registering a command.
 ******************************************************************************************
 * Implementation note:
 * It executes an infinite loop in which it takes the lock protecting the clients list
 * to build a "kleenex" list of currently registered clients.
 * For each client registered in this "kleenex" list, it takes the lock protecting the
 * socket state, builds one packet/segment in a local 2 Kbytes kernel buffer, calls the
 * transport layer to add the UDP/TCP header, calls the IP layer to add the IP header,
 * calls the ETH layer to add the ETH header, and moves the packet to the NIC_TX_QUEUE.
 * Finally, it updates the socket state, and releases the socket lock.
 ******************************************************************************************
 * @ chdev    : [in] local pointer on one local NIC_TX[channel] chdev descriptor.
 *****************************************************************************************/
void dev_nic_tx_server( struct chdev_s * chdev );

/******************************************************************************************
 * This function is executed by the server thread associated to a NIC_RX[channel] chdev.
 * This RX server thread is created by the dev_nic_init() function.
 * It handles all UDP packets or TCP segments received by the sockets attached to
 * the NIC_RX[channel] chdev. It writes the received data in the socket rx_buf, and
 * unblocks the client thread waiting on a RECV command.
 * To implement the three steps handshake required by a TCP connection, it posts direct
 * requests to the TX server, using the R2T queue attached to the involved socket.
 * It blocks on the THREAD_BLOCKED_ISR condition and deschedules when the NIC_RX_QUEUE
 * is empty. It is re-activated by the NIC_RX_ISR, when the queue becomes non empty.
 ******************************************************************************************
 * Implementation note:
 * It executes an infinite loop in which it extracts one packet from the NIC_RX_QUEUE
 * of received packets, copies this packet in a local 2 Kbytes kernel buffer, checks
 * the Ethernet header, checks the IP header, and calls the relevant (TCP or UDP) transport
 * protocol, which searches a matching socket for the received packet. It copies the
 * payload to the relevant socket rx_buf when the packet is acceptable, and unblocks the
 * client thread. It discards the packet if no matching socket is found.
 ******************************************************************************************
 * @ chdev    : [in] local pointer on one local NIC_RX[channel] chdev descriptor.
 *****************************************************************************************/
void dev_nic_rx_server( struct chdev_s * chdev );

#endif	/* _DEV_NIC_H */