Part 6: Errors
Errors, errors, errors. Errors are everywhere. I recall a joke I made on social media a few years ago, where I said:
I am a good enough programmer, don’t expect errors in every program I write B)
This was a remark on the (in)famous practical record books I had during my undergrad, where I had to code a simple program that would exemplify linked lists, arrays, and whatnot, and then write an answer or provide an example output. These books also had a section for “errors encountered”.
Except, writing “no errors” on nearly every topic was suboptimal, especially when all you had to do was copy the program they had written for you and just run it.
Fast forward a few years, and the concept of errors is now more of a design question than an implementation one. When you deal with many moving parts, you realize that wrangling them all is a pain, especially when the errors you see will most probably be synchronization issues or race conditions, which will resolve by themselves if you just let them try again!
A system’s ability to handle such errors is called resiliency. In a resilient system, if something goes wrong, all the system needs to do is potentially make a few changes and try again. Of course, manual intervention may be necessary when these errors cross a threshold, but that should be a last resort.
Errors? In our program??⌗
Whenever we work with external black boxes in our code, we cannot always be sure that these external entities will work the way we intend. Fault tolerance is one of the backbones of a good computing system, and fault tolerance in a distributed system involves a lot of allowances for entities beyond our control. There may be a communication issue between computers, or a third-party API might be failing. Should we block the entire system because of it? Or should we let the malfunctioning component recover by itself?
This leads to two classes of errors:
- Retriable errors: errors that (should) eventually go away with minimal to no change in the system. Your system ideally allows a fixed number of retries for such an error, and fails once the threshold is hit.
- Fatal errors: errors that are fundamental to your component. If your component hits one, it has no option but to fail. For example, if your receiver cannot come up, that is a fatal error. If your receiver cannot send a message back to a transmitter, that is not necessarily a fatal error. Of course, in this case, your transmitter might consider the error fatal.
In order to work with errors, we will need to be able to:
- Define errors
- Classify them as fatal / non-fatal
- Retry
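Before we dive into the real implementation, here is the shape of that classification in miniature. This is a sketch only; the names here are illustrative, not the types we actually build below:

// Illustrative only: one way to model how a component should react to an error.
enum Severity {
    Retriable { max_attempts: u32 }, // try again, up to a budget
    Fatal,                           // no recovery; fail the component
}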
Diving in⌗
An error type in Rust needs to implement the Display and Debug traits (the standard library's Error trait requires both); we will also want Clone for our error. While a simple #[derive(Clone)] takes care of cloning, we shall hand-write more involved implementations for Display and Debug.
Let’s define an NwError struct (since we only have network errors as of yet) as follows:
#[derive(Clone)]
pub struct NwError {
    pub kind: ErrorKind,
    pub context: String,
}
ErrorKind is an enum, which we shall define as follows:
// Copy lets us pass the kind around by value, and PartialEq/Eq let us
// compare kinds with == and != later on
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    ErrorNone,
    TransmitterConnectError,
    TransmitterTransmitError,
    ReceiverListenError,
    ReceiverAcceptError,
    ReceiverWriteBackError,
    ReceiverReadFromSocketError,
}
If you’ve worked with enums before, you’ll know that more often than not, enums are simple integers where each integer defines what the enum stands for. They improve readability in code, but don’t really help when printed. If something says “ErrorKind 3 was encountered”, it doesn’t mean much to an end user.
Let’s build a way to display the kind of error when we Display it:
fn error_kind_to_str(ek: ErrorKind) -> String {
    // Map each error kind to a human-readable description
    let returnable = match ek {
        ErrorKind::ErrorNone => "Default: No Error",
        ErrorKind::TransmitterConnectError => "Transmitter: Connect Error",
        ErrorKind::TransmitterTransmitError => "Transmitter: Transmit Error",
        ErrorKind::ReceiverListenError => "Receiver: Listen Error",
        ErrorKind::ReceiverAcceptError => "Receiver: Accept Error",
        ErrorKind::ReceiverWriteBackError => "Receiver: Write Back Error",
        ErrorKind::ReceiverReadFromSocketError => "Receiver: Read From Socket Error",
    };
    String::from(returnable)
}
This function matches on the error kind and returns a String for the caller to use.
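As an aside, an alternative (a sketch, not what the rest of this post uses) is to implement Display on ErrorKind itself, so the text lives with the type:

use std::fmt;

// Sketch: implementing Display directly on ErrorKind would let us
// write format!("{}", kind) without a separate helper function.
impl fmt::Display for ErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            ErrorKind::ErrorNone => "Default: No Error",
            ErrorKind::TransmitterConnectError => "Transmitter: Connect Error",
            ErrorKind::TransmitterTransmitError => "Transmitter: Transmit Error",
            ErrorKind::ReceiverListenError => "Receiver: Listen Error",
            ErrorKind::ReceiverAcceptError => "Receiver: Accept Error",
            ErrorKind::ReceiverWriteBackError => "Receiver: Write Back Error",
            ErrorKind::ReceiverReadFromSocketError => "Receiver: Read From Socket Error",
        };
        write!(f, "{}", s)
    }
}

We'll stick with the helper function here, since it keeps the mapping in one obvious place.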
Now let’s implement the display and debug logic for our error:
use std::fmt;

impl fmt::Display for NwError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Propagate any formatting failure to the caller instead of panicking
        write!(
            f,
            "{}, context: {}",
            error_kind_to_str(self.kind),
            self.context
        )
    }
}

impl fmt::Debug for NwError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Error! {{ kind: {}, context: {} }}",
            error_kind_to_str(self.kind),
            self.context
        )
    }
}
These functions render the error as text whenever it is printed, e.g. through println!() with {} (Display) or {:?} (Debug).
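For instance, a hypothetical error (not one from our actual runs) would render like this:

let err = NwError {
    kind: ErrorKind::TransmitterConnectError,
    context: String::from("connection refused"),
};
println!("{}", err);   // Transmitter: Connect Error, context: connection refused
println!("{:?}", err); // Error! { kind: Transmitter: Connect Error, context: connection refused }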
Now let’s bring this ability into our transmitter and receiver.
For our receiver, we shall:
- Capture any error we see in our listening process (instead of panicking or swallowing all the time)
- Once a tokio::spawn task ends for a request, check for the error: panic for certain errors, swallow the rest
Capturing errors⌗
We shall capture the error and return it, changing the signature of listen as follows:
pub async fn listen(&mut self) -> Result<(), NwError> { /* body follows */ }
Now let’s return this error in our code as follows:
// In the listener instantiation
let listener = match TcpListener::bind(self.listening_address.clone()).await {
    Ok(v) => v,
    Err(e) => {
        return Err(NwError {
            kind: ErrorKind::ReceiverListenError,
            context: e.to_string(),
        });
    }
};
// In the socket acceptance
let (mut socket, inc_addr) = match listener.accept().await {
    Ok(stream_and_addr) => stream_and_addr,
    Err(e) => {
        let err = NwError {
            kind: ErrorKind::ReceiverAcceptError,
            context: e.to_string(),
        };
        println!("Accept error, swallowing: {}", err);
        continue;
    }
};
Note that this becomes a bit more complicated when we have to deal with errors that we encounter inside the tokio runtime, as it is an additional layer of abstraction within our logic. tokio::spawn returns a JoinHandle; awaiting it yields a Result, which contains either the value we wish to return, or a JoinError for internal errors (such as the task panicking).
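To see that layering in isolation, here is a minimal standalone sketch (not part of our receiver) of awaiting a spawned task:

// Sketch: the task returns an NwError as its *value*; the Err arm only
// fires if the task itself failed internally (panicked or was cancelled)
let handle: tokio::task::JoinHandle<NwError> = tokio::spawn(async move {
    NwError { kind: ErrorKind::ErrorNone, context: String::from("all good") }
});
match handle.await {
    Ok(task_value) => println!("task returned: {}", task_value),
    Err(join_error) => println!("task failed internally: {}", join_error),
}

With that in mind, here is the read logic inside our spawned task: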
// In the socket read functions:
// No error if connection was gracefully closed
match socket.read(&mut buf).await {
    Ok(0) => {
        // Zero bytes implies closed connection
        println!("Connection closed by {}", inc_addr);
        return NwError {
            kind: ErrorKind::ErrorNone,
            context: String::from("connection closed"),
        };
    }
    Ok(n) => {
        // code to read the data...
        if let Err(e) = socket
            .write_all(&returnable.as_bytes()[..return_size])
            .await
        {
            return NwError {
                kind: ErrorKind::ReceiverWriteBackError,
                context: e.to_string(),
            };
        }
    }
    Err(e) => {
        // Could not read from socket
        return NwError {
            kind: ErrorKind::ReceiverReadFromSocketError,
            context: e.to_string(),
        };
    }
}
Let’s break this down. Since our outer function returns a Result, in case of unrecoverable errors, we wrap the NwError inside the Err() and return it. Inside the tokio::spawn call, however, we shall simply return the NwError as a standalone struct and handle it where we await the task. Your code will now look as follows:
match tokio::spawn(async move {
    // Logic goes here
}).await {
    Ok(val) => {
        if val.kind != ErrorKind::ErrorNone {
            println!("Swallowing error: {}", val);
        }
    }
    Err(e) => {
        panic!("{}", e);
    }
}
What happens if it’s ErrorNone? We simply rerun the loop without printing anything! When we hit an error in the tokio::spawn task, we can be sure that it pertains to that connection only, and so the error can be swallowed and the loop can continue. We do something similar for the listener accept error before we spawn the task as well.
With this design, we know that if any error is thrown up to the main() function, it is panic-inducing, as recoverable errors would have been swallowed already.
For comparing two items in Rust with == or !=, make sure their type derives the PartialEq and Eq traits!
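That is why the derive on our ErrorKind enum includes them (along with Copy, since we pass the kind by value into error_kind_to_str):

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    // ... variants as defined earlier
}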
Within main⌗
Change your listener call in main as follows:
if let Err(e) = receiver.listen().await {
    // Any unrecoverable error will cause a panic
    panic!("{}", e);
}
In the transmitter⌗
The receiver isn’t the only thing that will hit an error. The transmitter might hit one too, if, for example, the receiver isn’t ready yet. Let’s add some error handling in the transmitter as well.
For the transmitter, we shall explore moving the retry logic outside the actual transmitter struct. Rewrite your function in the transmitter as follows:
pub async fn transmit(&self) -> Result<(), NwError> {
    let mut write_stream = match TcpStream::connect(self.destination.as_str()).await {
        Ok(val) => val,
        Err(e) => {
            return Err(NwError {
                kind: ErrorKind::TransmitterConnectError,
                context: e.to_string(),
            });
        }
    };
    match write_stream.write_all(self.payload.as_slice()).await {
        Ok(_) => {
            println!("Wrote data");
        }
        Err(e) => {
            return Err(NwError {
                kind: ErrorKind::TransmitterTransmitError,
                context: e.to_string(),
            });
        }
    };
    Ok(())
}
This makes the caller decide whether to retry or panic. We can do so in main:
let transmitter = nw::transmitter::Transmitter::new(target_addr, data);
let mut transmitted = false;
while !transmitted {
    match transmitter.transmit().await {
        Ok(_) => {
            transmitted = true;
        }
        Err(e) => {
            println!("Recoverable error: {}", e);
            // Retry after a timeout; use tokio's async sleep so we don't
            // block the runtime that our listener is also running on
            let two_s = time::Duration::from_secs(2);
            tokio::time::sleep(two_s).await;
        }
    };
}
Until the data is transmitted, we shall retry indefinitely (very bad practice in real life).
Testing this out⌗
Once you do a cargo build, try the following:
$ ./simple_socket_comm -c one.toml & sleep 5 && ./simple_socket_comm -c two.toml
[1] 680552
Server is listening on 127.0.0.1:9998
Recoverable error: Transmitter: Connect Error, context: Connection refused (os error 111)
Recoverable error: Transmitter: Connect Error, context: Connection refused (os error 111)
Recoverable error: Transmitter: Connect Error, context: Connection refused (os error 111)
Server is listening on 127.0.0.1:9999
Wrote data
New connection from 127.0.0.1:45070
New connection from 127.0.0.1:42442
Wrote data
Received data: Talking to myself
Received data: Talking to myself
Connection closed by 127.0.0.1:45070
Connection closed by 127.0.0.1:42442
Above, you can see the connection refused errors while the second listener isn’t up yet; the transmitter keeps retrying after a delay. Next, let us trigger a “panic”-able error (two servers trying to listen on the same port, for example):
$ ./simple_socket_comm -c one.toml & sleep 5 && ./simple_socket_comm -c one.toml
[1] 688624
Server is listening on 127.0.0.1:9998
Recoverable error: Transmitter: Connect Error, context: Connection refused (os error 111)
Recoverable error: Transmitter: Connect Error, context: Connection refused (os error 111)
Recoverable error: Transmitter: Connect Error, context: Connection refused (os error 111)
thread 'main' panicked at src/main.rs:62:9:
Receiver: Listen Error, context: Address already in use (os error 98)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
There you go! As your program evolves, so will your errors. The transmit error, for example, might get an n_retries limit, after which it might want to panic. If the listener panics on the address, you might have a way of randomly choosing a new port for the listener to listen on, and so on.
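To make that first idea concrete, here is a minimal sketch of a capped retry loop in main. The budget of 5 attempts and the panic-on-exhaustion behavior are illustrative choices, not part of the code above:

let n_retries = 5; // illustrative retry budget
let mut attempts = 0;
loop {
    match transmitter.transmit().await {
        Ok(_) => break,
        Err(e) => {
            attempts += 1;
            if attempts >= n_retries {
                // Budget exhausted: escalate the retriable error to a fatal one
                panic!("Giving up after {} attempts: {}", attempts, e);
            }
            println!("Recoverable error (attempt {}): {}", attempts, e);
            tokio::time::sleep(std::time::Duration::from_secs(2)).await;
        }
    }
}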