Dynamo<\/a> by Amazon. It was originally developed at Facebook; later Cassandra became an Apache project and is now one of the top-level projects at Apache. Cassandra is based on the idea of a decentralized, distributed system without a single point of failure and is designed for high data throughput and high availability. <\/p>\r\n\r\n\r\n\r\nCassandra’s Data Structure<\/h2>\r\n\r\n\r\n\r\n I decided to begin my series with Cassandra’s data structure because it is a good introduction to the general ideas behind Cassandra and a good foundation for future posts on the Cassandra Query Language and Cassandra’s distributed nature. <\/p>\r\n\r\n\r\n\r\n
I will try to give you an overview of how data is stored in Cassandra and show you some similarities to and differences from a relational database, so let’s get right to it. <\/p>\r\n\r\n\r\n\r\n
Columns, Rows and Tables<\/h3>\r\n\r\n\r\n\r\n The basic component in Cassandra’s data structure is the column<\/strong>, which classically consists of a key\/value pair. Individual columns are combined into a row<\/strong>, which is uniquely identified by a primary key<\/strong>. A row consists of one or more columns plus the primary key, which can itself consist of one or more columns. To group individual rows describing the same kind of entity into a logical unit, Cassandra defines tables<\/strong>, which are containers for similar data in row format, equivalent to relations in relational databases. <\/p>\r\n\r\n\r\n\r\n \r\nthe row data structure in Cassandra<\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\nHowever, there is a remarkable difference from the tables in relational databases. If individual columns of a row are not used when writing to the database, Cassandra does not fill them with a placeholder value such as null; the column is simply not stored at all. This is a storage space optimization, and it gives the data model of tables similarities to a multidimensional array or a nested map. <\/p>\r\n\r\n\r\n\r\n \r\ntable consisting of skinny rows<\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\nSkinny and Wide Rows<\/h3>\r\n\r\n\r\n\r\n Another special feature of the tables in Cassandra is the distinction between skinny and wide rows. So far I have only described skinny rows, i.e. rows whose primary key has no clustering columns and whose partitions hold only a few entries, in most cases only one entry per partition. <\/p>\r\n\r\n\r\n\r\n
You can imagine a partition as an isolated storage unit within Cassandra. There are typically several hundred of these partitions in a Cassandra installation. During a write or read operation the value of the partition key (for skinny rows this is the whole primary key) gets hashed. The resulting hash value can be assigned to a specific partition inside the Cassandra installation, as every partition is responsible for a certain range of hash values. I will dedicate a whole blog post to the underlying storage engine of Cassandra, so this little explanation has to suffice for now. <\/p>\r\n\r\n\r\n\r\n
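To make the hashing step a bit more tangible, here is a minimal Python sketch of the idea. It is not how Cassandra is implemented: the ring size, the four ranges and the names TOKEN_RANGES, token_for and owner_for are all made up for illustration, and MD5 only stands in for a hash function (it is not the hash Cassandra uses by default).

```python
import hashlib
from bisect import bisect_left

# Assumption: a tiny hash space split into four equal ranges.
# Each entry is (upper bound of the range, hypothetical name of its owner).
RING_SIZE = 2 ** 32
TOKEN_RANGES = [(RING_SIZE // 4 * (i + 1) - 1, f'range-{i}') for i in range(4)]


def token_for(partition_key: str) -> int:
    # Hash the key and keep the first 4 bytes as a number in [0, RING_SIZE).
    digest = hashlib.md5(partition_key.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big')


def owner_for(partition_key: str) -> str:
    # Find the first range whose upper bound is >= the token.
    token = token_for(partition_key)
    upper_bounds = [bound for bound, _ in TOKEN_RANGES]
    return TOKEN_RANGES[bisect_left(upper_bounds, token)][1]


for key in ('sensor-1', 'sensor-2', 'sensor-3'):
    print(key, token_for(key), owner_for(key))
```

The point of the sketch is only that the same key always hashes to the same value and therefore always lands in the same range, for writes as well as for reads.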
Wide rows typically have a significant number of entries per partition. These wide rows are identified by a composite key, consisting of a partition key and optional clustering keys. <\/p>\r\n\r\n\r\n\r\n \r\ntable consisting of wide rows<\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\n When using wide rows you have to pay attention to the defined limit of two billion entries per partition, which can be reached quite quickly when storing the measured values of a sensor. Once the limit is reached, no more values can be stored in that partition.<\/p>\r\n\r\n\r\n\r\n
The partition key can consist of one or more columns, just like the primary key. Therefore, to stay with the example of the sensor data, it makes sense to build the partition key from several criteria. If you simply partition by a sensor_id<\/strong>, a partition will sooner or later exceed the limit of two billion entries, depending on how much measurement data comes in. Instead, you can combine the sensor_id with the date of the measurement in the partition key, so that the data is written to a new partition every day. Of course you can make this coarser or finer as you wish (hourly, daily, weekly, monthly). <\/p>\r\n\r\n\r\n\r\nThe clustering columns are needed to sort data within a partition. A primary key without additional clustering columns is at the same time the partition key.<\/p>\r\n\r\n\r\n\r\n
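The sensor example can be sketched as a nested map in a few lines of Python. This is only a mental model under my own assumptions, not Cassandra code: the names table, insert and read_partition are hypothetical, the partition key is the pair of sensor_id and measurement date, and the measurement timestamp plays the role of the clustering column.

```python
from collections import defaultdict
from datetime import datetime

# Mental model of a wide-row table:
# partition key -> clustering key -> {column name: value}
table = defaultdict(dict)


def insert(sensor_id, measured_at, **columns):
    # Composite partition key: one partition per sensor and day.
    partition_key = (sensor_id, measured_at.date())
    # Only the columns that were actually supplied are stored.
    table[partition_key][measured_at] = dict(columns)


def read_partition(sensor_id, day):
    # Rows inside a partition come back sorted by the clustering key.
    rows = table.get((sensor_id, day), {})
    return [(ts, rows[ts]) for ts in sorted(rows)]


insert('sensor-1', datetime(2024, 1, 1, 12, 0), temperature=21.5)
insert('sensor-1', datetime(2024, 1, 1, 8, 0), temperature=19.0, humidity=40)
insert('sensor-1', datetime(2024, 1, 2, 8, 0), temperature=18.0)

for ts, columns in read_partition('sensor-1', datetime(2024, 1, 1).date()):
    print(ts, columns)
```

The two measurements from the first day share a partition and are read back in timestamp order, while the measurement from the second day ends up in a partition of its own. The row without a humidity value simply has no humidity column.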
Several tables are collected into a keyspace<\/strong>, which is the equivalent of a database in relational databases. <\/p>\r\n\r\n\r\n\r\nSummary<\/h3>\r\n\r\n\r\n\r\n In summary, the basic data structures are: <\/p>\r\n\r\n\r\n\r\n
\r\nthe column<\/strong>, consisting of a key\/value pair,<\/li>\r\nthe row<\/strong>, which is a container for related columns, identified by a primary key,<\/li>\r\nthe table<\/strong>, which is a container for rows and<\/li>\r\nthe keyspace<\/strong>, which is a container for tables.<\/li>\r\n<\/ul>\r\n\r\n\r\n\r\nI hope I was able to give you a rough overview of the data structure Cassandra uses. The next post in this series will be about the Cassandra Query Language (CQL), in which I will give you some more concrete examples of how the data structure affects data manipulation.<\/p>\r\n\r\n\r\n\r\n
Cheers,<\/p>\r\n\r\n\r\n\r\n
Leon<\/p>\r\n","protected":false},"excerpt":{"rendered":"
Hey guys, during my studies I had to analyze the NoSQL database Cassandra as a possible replacement for a regular relational database. During my research I dove really deep into the architecture and the data model of Cassandra and I figured that someone may profit from my previous research, maybe for your own evaluation process of Cassandra or just personal curiosity. … Read More<\/a><\/p>\n","protected":false},"author":7,"featured_media":2340,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[111,108,109],"tags":[],"acf":[],"yoast_head":"\nA deep dive into Apache Cassandra - Part 1: Data Structure (was not continued) - CraftCoders.app<\/title>\n \n \n \n \n \n \n \n \n \n \n \n\t \n\t \n\t \n \n \n \n\t \n\t \n\t \n