The distributed nature of Cassandra databases is a key feature with both technical and business benefits. Cassandra databases scale readily when an application comes under heavy load, and the distribution also guards against data loss if hardware fails in any one datacenter. A distributed design brings operational flexibility as well; for instance, a developer can tune the throughput of read and write queries independently.
The term "distributed" refers to Cassandra's ability to function across various computers and present itself to users as a single entity.
Running Cassandra as a single node serves little practical purpose, but doing so is very useful for becoming familiar with how it operates.
Cassandra is made to handle huge data workloads across several nodes.
Its architecture is predicated on the knowledge that hardware and system failures can and do happen.
By using a peer-to-peer distributed architecture across homogeneous nodes where data is spread among all nodes in the cluster, Cassandra overcomes the issue of failures.
Using the peer-to-peer gossip communication protocol, each node frequently exchanges state information about itself and about other nodes throughout the cluster.
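To make the idea concrete, here is a minimal Python sketch of gossip-style state exchange. It illustrates the principle only, not Cassandra's actual gossiper; the node addresses and the heartbeat-version map are assumptions made for the example.

```python
import random

class Node:
    """Hypothetical node that gossips an (endpoint -> heartbeat version) map."""
    def __init__(self, name):
        self.name = name
        self.version = 0
        # Known state about every endpoint, including itself.
        self.state = {name: 0}

    def tick(self):
        # A node periodically bumps its own heartbeat version.
        self.version += 1
        self.state[self.name] = self.version

    def gossip_with(self, peer):
        # Each side keeps the newest version it has seen for every endpoint.
        for endpoint in set(self.state) | set(peer.state):
            newest = max(self.state.get(endpoint, -1), peer.state.get(endpoint, -1))
            self.state[endpoint] = newest
            peer.state[endpoint] = newest

nodes = [Node(f"10.0.0.{i}") for i in range(1, 5)]
for _ in range(10):  # a few gossip rounds
    for node in nodes:
        node.tick()
        node.gossip_with(random.choice([n for n in nodes if n is not node]))

print(nodes[0].state)  # every node converges on the latest heartbeat versions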
To ensure data persistence, each node maintains a sequentially written commit log that records write activities.
Information is indexed and written to a memtable, an in-memory structure that mimics a write-back cache.
The information is written to disk in an SSTable data file whenever the memory structure is full.
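The write path described above can be sketched in a few lines of Python. This is a simplified model, not Cassandra's implementation: the class name, the flush threshold, and the in-memory "SSTable" representation are assumptions made for illustration.

```python
import json, time

class SketchStore:
    """Toy write path: append to a commit log, buffer in a memtable, flush to an 'SSTable'."""
    def __init__(self, memtable_limit=3):
        self.commit_log = []        # sequential, append-only record of writes
        self.memtable = {}          # in-memory structure, keyed by partition key
        self.sstables = []          # immutable data files on disk in real Cassandra
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        entry = {"key": key, "value": value, "ts": time.time()}
        self.commit_log.append(json.dumps(entry))  # durability first
        self.memtable[key] = entry                 # then the in-memory structure
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable is written out as an immutable, sorted data file.
        self.sstables.append(sorted(self.memtable.values(), key=lambda e: e["key"]))
        self.memtable.clear()

store = SketchStore()
for i in range(5):
    store.write(f"user:{i}", {"visits": i})
print(len(store.sstables), "flushed SSTable(s);", len(store.memtable), "row(s) still in the memtable")
```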
All writes are replicated and automatically partitioned throughout the cluster.
Cassandra periodically consolidates SSTables through a process known as compaction, which discards obsolete data that has been marked for deletion with a tombstone.
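Continuing the sketch, a toy compaction pass might look like the following. The tombstone marker and the zero-second grace period are simplifications; real clusters keep tombstones for a configurable gc_grace_seconds before purging them.

```python
import time

TOMBSTONE = object()   # stand-in for a deletion marker
GC_GRACE_SECONDS = 0   # simplified; real clusters keep tombstones much longer

def compact(sstables, now=None):
    """Merge several SSTable-like lists, keeping only the newest cell per key
    and dropping tombstones that are older than the grace period."""
    now = now or time.time()
    newest = {}
    for table in sstables:
        for cell in table:
            if cell["key"] not in newest or cell["ts"] > newest[cell["key"]]["ts"]:
                newest[cell["key"]] = cell
    merged = [
        cell for cell in newest.values()
        if not (cell["value"] is TOMBSTONE and now - cell["ts"] > GC_GRACE_SECONDS)
    ]
    return sorted(merged, key=lambda c: c["key"])

old = [{"key": "a", "value": 1, "ts": 1.0}, {"key": "b", "value": 2, "ts": 1.0}]
new = [{"key": "a", "value": 5, "ts": 2.0}, {"key": "b", "value": TOMBSTONE, "ts": 2.0}]
print(compact([old, new]))  # 'a' keeps its newest value, the deleted 'b' disappears
```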
Reliability and fault tolerance in Cassandra can be achieved by replicating a single piece of data over many nodes.
Row copies in Cassandra are known as replicas.
Cassandra assigns a hash value to each partition key.
Consistent hashing allows distribution of data across a cluster to minimize reorganization when nodes are added or removed.
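As a rough illustration of why consistent hashing minimizes reorganization, the Python sketch below uses a stand-in MD5-based hash rather than Cassandra's Murmur3 partitioner, along with hypothetical node names. Only the keys whose positions fall into the new node's slice of the ring change owners when a node is added.

```python
import bisect
import hashlib

def token(value):
    # Stand-in hash; Cassandra's default partitioner uses Murmur3 instead.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(nodes):
    ring = sorted((token(n), n) for n in nodes)
    return [t for t, _ in ring], [n for _, n in ring]

def owner(ring_tokens, ring_nodes, key):
    # The first node whose token is >= the key's token owns it (wrapping around).
    i = bisect.bisect_left(ring_tokens, token(key)) % len(ring_nodes)
    return ring_nodes[i]

keys = [f"user:{i}" for i in range(1000)]
tokens3, nodes3 = build_ring(["node-a", "node-b", "node-c"])
tokens4, nodes4 = build_ring(["node-a", "node-b", "node-c", "node-d"])

moved = sum(owner(tokens3, nodes3, k) != owner(tokens4, nodes4, k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding node-d")
```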
In a Cassandra cluster, each node is responsible not only for its primary token range but also for replicas of token ranges owned by other nodes.
A node is set up by default to keep the data it controls in a directory specified in the cassandra.yaml file.
In a production cluster deployment, you can point the commitlog_directory at a different disk drive than the data_file_directories.
The cassandra.yaml configuration file is the primary configuration file used to set a cluster's initialization parameters, table caching parameters, tuning and resource-use attributes, timeout settings, client connections, backups, and security.
Set dynamic snitch thresholds in the cassandra.yaml configuration file on each node.
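For instance, a quick way to check where a node is configured to keep its commit log and data files is to read cassandra.yaml directly. The snippet below assumes PyYAML is installed and that the file lives at the common package location shown.

```python
import yaml  # PyYAML; pip install pyyaml

# Path is an assumption; package installs often place it at /etc/cassandra/cassandra.yaml.
with open("/etc/cassandra/cassandra.yaml") as f:
    conf = yaml.safe_load(f)

# Keeping these on separate physical disks reduces contention between
# sequential commit-log writes and SSTable flushes and reads.
print("commit log: ", conf.get("commitlog_directory"))
print("data files: ", conf.get("data_file_directories"))
```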
Keyspace and table properties can be set on a per-keyspace or per-table basis from a client application using CQL.
Data replication and distribution in Cassandra go hand in hand: each table's data is identified by a primary key, whose partition key also determines which nodes store the data. Cassandra supports the concept of a replication factor (RF), which specifies how many copies of your data should exist in the database.
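As an example of setting these properties from a client, the sketch below uses the DataStax Python driver. The contact point, keyspace name, datacenter name, and replication factor are placeholders, and it assumes a reachable cluster.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])  # contact point is an example
session = cluster.connect()

# Replication factor is a per-keyspace property: here, three copies in datacenter 'dc1'.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

# Table-level properties are changed the same way, e.g. the compaction strategy.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (id uuid PRIMARY KEY, name text)
""")
session.execute("""
    ALTER TABLE demo.users
    WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
""")
cluster.shutdown()
```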
Each node in the cluster is responsible for a range of data based on the hash value.
Various repair procedures are used to guarantee the consistency of data across the cluster.
Any node in the cluster can receive client read or write requests.
A node acts as the coordinator for a client action when a client connects to it with a request.
The coordinator serves as a proxy between the client application and the nodes that own the requested data.
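Conceptually (ignoring timeouts, retries, and hinted handoff), the coordinator's job can be sketched as below. The Replica class and the acknowledgement counting are illustrative assumptions, with the consistency level expressed simply as the number of acknowledgements required.

```python
class Replica:
    """Hypothetical replica node that stores rows locally."""
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value
        return True  # acknowledgement back to the coordinator

def coordinate_write(replicas, key, value, acks_required):
    """Forward the write to every replica for the key and report success once
    enough acknowledgements arrive (e.g. 2 of 3 for QUORUM with RF=3)."""
    acks = sum(1 for r in replicas if r.write(key, value))
    return acks >= acks_required

replicas_for_key = [Replica("n1"), Replica("n2"), Replica("n3")]  # chosen from the ring
print(coordinate_write(replicas_for_key, "user:42", {"name": "Ada"}, acks_required=2))
```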
Machines of different sizes can be used to build a cluster, since the number of vnodes allotted to each machine can be adjusted.
Virtual nodes are chosen at random and are not contiguous within a cluster.
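The sketch below illustrates that effect with made-up num_tokens values, random token placement, and a deliberately simplified ownership calculation that ignores the wraparound range: a machine given more vnodes ends up owning a roughly proportional share of the ring.

```python
import random

RING_MIN, RING_MAX = -2**63, 2**63 - 1
random.seed(7)

# num_tokens per node is the knob: a larger machine can take more vnodes.
cluster = {"small-node": 8, "medium-node": 16, "big-node": 32}

vnodes = sorted(
    (random.randint(RING_MIN, RING_MAX), node)
    for node, count in cluster.items()
    for _ in range(count)
)

# Each vnode token owns the range from the previous token up to itself.
ownership = {node: 0 for node in cluster}
for i, (tok, node) in enumerate(vnodes):
    prev = vnodes[i - 1][0] if i else RING_MIN - 1  # simplification: no wraparound range
    ownership[node] += tok - prev

total = sum(ownership.values())
for node, owned in ownership.items():
    print(f"{node}: ~{owned / total:.0%} of the ring")
```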
A token is a 64-bit integer by default, resulting in a range of possible tokens from -2^63 to 2^63 - 1.
Cassandra maps every node in a cluster to one or more tokens on a continuous hash ring. In the case of a single token per node, each node is responsible for the range of values between the previous node's assigned token (exclusive) and its own token (inclusive). Vnodes use consistent hashing to distribute data without requiring new token generation and assignment.
To associate nodes with one or more tokens, Cassandra employs a consistent hashing algorithm.
To determine which node a piece of data belongs to, its token value is compared with the token range assigned to each node.
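Putting the last few points together, here is a worked sketch with made-up single-token-per-node assignments: the key's token is compared against the ring, the first node whose token is greater than or equal to it is the primary owner, and (under a simple replication strategy) the next RF-1 nodes clockwise hold the replicas.

```python
import bisect

# Hypothetical single-token-per-node assignment on the 64-bit ring.
ring = sorted([
    (-6_000_000_000_000_000_000, "node-a"),
    (-1_500_000_000_000_000_000, "node-b"),
    ( 2_000_000_000_000_000_000, "node-c"),
    ( 7_000_000_000_000_000_000, "node-d"),
])
tokens = [t for t, _ in ring]

def replicas(key_token, rf):
    # Primary owner: first node whose token >= the key's token, wrapping past the end.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

key_token = 3_141_592_653_589_793_238  # e.g. what the partitioner returned for a key
print(replicas(key_token, rf=3))       # ['node-d', 'node-a', 'node-b']
```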
Because every other node in the cluster is involved, rebuilding a dead node is quicker.