Saturday, November 22, 2014

DATABASES - SQL, CRUD AND ACID - 102


REVIEW:

In the Database 101 post we discussed how databases are used for tracking and managing objects using records. Sometimes those records are digital objects like .pdf documents stored directly in the database file. Sometimes the records in the database simply hold metadata about objects such as employees who exist outside of the database file. We also learned about how databases store the records in a file on disk but use smaller sorted indexes stored in RAM for fast searching. Indexes contain only one small piece or metadata as well as the location of the complete record on disk that the index refers to. 

In this post we will discuss two acronyms that stand for properties that databases demonstrate called CRUD and ACID.   


CRUD:

CRUD is an acronym for Create, Read, Update and Delete. These are the four basic operations that the DBMS must allow on the records in the database. It may seem obvious that if you have a collection of records for managing objects, you may want to:
1. create new records
2. read back records that you stored in the database file
3. update those records when some property of the tracked object changes
4. delete a record for an object that you no longer want in the collection.

An example of CRUD for an database of employee records would be the following. You will want to CREATE a record for each employee in the company. Issue a query to READ back all employee records who work in a particular office. UPDATE an employee record when their location moves to a new office as well as DELETE employees when the leave the company. 



SMITH indexes point to SMITH record locations
CRUD operations are performed not only on the object records in the database but also the indexes used to search those records. For example, when you update an employee record to change their office location you must also update the sorted indexes as well. When we discuss ACID next, you'll see that implementing CRUD operations can get tricky when multiple CRUD operations overlap each other in time. 




Databases files by themselves simply store records. It's the DBMS application that presents an interface to the user to display and modify those records. For relational DBMSes there is an agreed upon method or language for how the users communicates with the DBMS. One such language is SQL or Structured Query Language. As you can imagine the language has words for all four basic CRUD operations. 



Example of SQL statements
SQL language or any query language for that matter, is crucial for usability of DBMSes. Application programmers, the people who create application software via programming languages (Java, Python or maybe C++) are the users of databases and not database administrators. I say this because they want a simple method of storing and retrieving records no matter what the underlying database is (Microsoft SQL, Oracle 12c or maybe IBM Informix 10.2). By learning the SQL language they have everything they need to perform CRUD operations on database records. I have included a picture of what SQL language actually looks like. You can see that it is quite readable and understandable even for those who have never seen the language before. SQL is called a "Declarative" language as it simply declares what data you want rather than telling the DBMS how to get it.



MULTIPLE USERS AND CONCURRENCY:

If you are creating a database that only has one user who issues SQL statements one at a time, your database implementation would be quite simple. However, most DBMSes are the backend record storage for front-end applications that have multiple users modifying records at the same time. This presents several problems that ACID attempts to define and solve.





ACID:

ACID stands for Atomicity pronounced "Atom-miss-city", Consistency, Isolation and Durability. Let's look at each one-by-one. 



A brief explanation follows:


  • Atomicity - pronounced "atom-miss-city". By "atomic" we mean "as a single atom or unit". This is a method of grouping data together as single unit so that the atomic unit of data is either entered into the database or not. It's either "All or nothing" as partially updating a record in the database can have bad results in real-life. Imagine you deposit $1M dollars into your bank account but the deposit transaction only updated the database with your account number but not the deposit amount. You would be very unhappy if you deposited $1M but the database transaction to record that failed before increasing your balance. If you were the bank teller who was using the software application that allows for entering a check deposit, you need instant application feedback about the success or failure when you hit submit after entering such things as the account number and deposit amount. The teller needs confirmation that the transaction was entire transaction completed successfully or failed completely and nothing about that deposit was added to the database. When failure happens for any reason because the transaction was an all or nothing the teller is free to attempt the transaction again without fear that he or she is adding a second $1M deposit.  
  • Consistency - can be thought of as... will an update to a database record field leave the database in a good or consistent state? Imagine if you designed your US customer database to store a customer's phone number. Here in the US, we use a 10 digit phone number that consists of a 3 digit area code followed by a 3 digit local prefix and a 4 digit number to uniquely identify a phone endpoint in that locality. For example (605-475-6968). Imagine if a request to create a record with only a four digit number, 6968, came into the database. This erroneous piece of data would leave that particular record in a logically in-consistent or corrupted state.
  • Isolation - Because databases allow multiple updates to be occurring at the same time, each transaction must be isolated from the others to prevent unwanted side-effects of transactions "stepping" on one another. An example of when this can be tricky is if a database were to delete a record and before the associate index could be removed, a second transaction to find records matching a certain criteria found the index for that record and went to retrieve the record that was just deleted. This would cause an error because the two transactions stepped on one another in time.
  • Durability - Once a transaction is stored in the database it must be durable in the sense of it will exist even in the face of a computer power outage (reboot). Modern databases often store the received data in RAM due to it's speed matches that of the CPU. However, RAM is volatile if power is lost. To be durable the transaction is often written to a database log file before being acknowledged as "written" back to the application which asked to store data. 

TAKEAWAYS:

Modern DBMSs store metadata about entities that front-end applications submit using SQL. DBMSs store their data in a database file for persistence and provide abilities such as query, indexing, CRUD, ACID and concurrency. In future posts we will discuss aspects of different databases in terms of CRUD and ACID.

Friday, November 21, 2014

DATABASES - RECORDS AND INDEXES - 101


The term "database" was first used in a 1962 and coincided with storing data on disk drives as opposed to tape. To me, the term "database" referred specifically to storage, updating and retrieval of information stored in a file on disk rather than tape. Randomly accessed disk drives opened a door to a new way to manage data. This new-style of data access was very different from sequential tape access. A file full of data serves as the "base" that you can get "data" from. This "database" is often combined with a running process/application that serves as the single application process that can access & modify the database file. Humans are required to "query" this application process to "locate and retrieve" data from inside the "base" file on their behalf. We collectively refer to the application process and its associated datafile as a DBMS or DataBase Management System. Humans simply write a question referred to as a "query" for a particular piece of data in the collection and the DBMS has methods of fast search and retrieval of data the matching the query description. Oracle's 12C, Microsoft's SQL 2014 and IBM's DB2 v10.5 are specific examples of DBMSs.


COLLECTIONS OF DATA:

Collection of books
Lets face it, human beings have been collecting & tracking objects in collections long before the 1962 "database" term or the invention of disk drives which started only a few years prior, in 1957.  We can all easily think of use cases why early humans wanted to collect information about objects and search that metadata. Recall that metadata is oft called "data about data". 



HISTORICLE COLLECTIONS:

Library of Alexandria in Ancient Egypt

Glancing backward through history to the time of the "Library of Alexandria" which existed in Egypt roughly some 2300 years ago. The library of Alexandria was early mankind's attempt to gather up or "collect" all the written knowledge in the world. This first known effort to preserve an understanding of the natural world and its history. The library existed as a physical location, full of shelves to store possibly a half million writings on scrolls. Anytime you collect more objects that you can track in your head reliably, you need to develop a system to track and organize the objects for fast search and retrieval. 


MODERN-DAY LIBRARIES:



Library Card Catalog
If we jump forward to modern day libraries circa 1990, they used bibliographical records which are cards containing summarized information about the books they refer to. We refer to these "bibliographical records" simply as metadata since they are NOT the actual objects being tracked but by definition, "data about data". Specifically, metadata about the books in the library's collection. It is interesting to note that if we scanned all of the books into PDF files and stored them directly into a modern-day database then the database would not be a collection of records where each record held metadata about the book that exists in the real world but instead the database records would hold the ACTUAL objects being collected and managed. When the things being collected & managed are digital , a database often contains those digital objects. This way the digital objects are contained inside the database files. When the entities being tracked are not-digital but real-world physical objects, we give them some unique identifier (EmployeeID if we are tracking real world employees) and use database records that hold the employeeIDs as well as metadata about that specific object.



Library's Online Database
Many readers may recognize the above photo of a library card catalog. I recall searching an "author's name" catalog or "Book Title" catalog for a particular book. As I flipped through the index cards I could not help but think... Wow, what a lot of work to type up all these index cards on a typewriter and then insert them into their sorted location and keep them sorted! This is the exact type of repetitive, tedious work databases were created to do. 


THE WORK DBMSs DO:

Work Automated by DBMS
Thinking about this a bit ... so every physical book exists on only one Floor-Isle-Shelf location in the library. That being true, we can create multiple metadata catalogs. A catalog being a set of metadata cards each with only two pieces of information on them, a  location that is a unique reference to a single book in the library and the "other" piece of information the cards are sorted on. Each catalog could contain cards sorted by "author's last name" or maybe the "book's title" or even "publication year". The cards in the catalog serve as an "index" which points to the location of the physical book which the index card itself was derived from. I am using this "library of books" analogy because it is a collection or things that we should all be familiar with and it was not long ago that databases took over how books are searched and tracked in modern-day libraries. DB management systems are software application processes that accept requests for a set of records matching a particular set of criteria. The DBMS process searches the catalogs of indexes that it builds and maintains. That is the work that a database does. I think of a database as a robot that stores, indexes and retrieves particular pieces of data according to my query.

DEWEY DECIMAL SYSTEM:

Dewey Decimal Location Marker
Since I'm using a library analogy. The 1876 Dewey Decimal System may come to mind here. The DDS used a system of logical decimal numbers that could be used to point to a physical location. Since the books are physical objects, humans find it useful to group books of similar topics together to facilitate browsing books on the shelf to the left and right of the book you located. 




Dewey Decimal Top-Level Classes
When you think about lesser methods like simply giving each shelf location a number that starts at one and increases as you add library shelf locations. While this expandable system of numbered locations would allow for adding shelves to the library, it would require you to re-number for additions and leave empty shelf locations if a book was removed from the library's collection. The DDS allows for books of a similar topic to be physically located in areas of the library grouped by a classes. The whole number in front of the decimal point would represent a particular "Class" which can be further divided into "Divisions" and even further divided into "Sections". The photo above of a location that starts with 341.237  would be part of the 300 - "Social Sciences" Class, the 340 - "Law" Division and the 341 - "Law of Nations" Section. 



Dewey Decimal Number Decoder
This clever system of decimal location numbers  gives library users the ability to "browse books of similar topic" simply by going to the library isle location for a topic. 

However, when the DBMS stores digital objects directly in it's database file, there is no need for browsing. Objects are stored at a byte offset from the start of the database file. If the digital records are each 100 bytes long and you want to retrieve the 5th record, simply read 100 bytes of data starting 500 bytes in from the start of the database file. 



DATABASE FILE STRUCTURE:



Database File Record Offsets
The picture to the right shows three records each with only two fields. The records are stored inside a database file which is normally visualized as a long linear string of bytes but shown here instead as stacked on top of each other for display purposes. In real-life there can be more than two columns of data about each entity but for display purposes we have shown only two. This is how the records are actually stored in the database file. Just like in an Excel spreadsheet, each row contains columns of data about a single entity. 



Single index is smaller than full book record it points to on disk
Recall, that the "job" of a database is to NOT just store the records row by row in a database file but to allow humans to query for specific records in that database file. The DB file could potential contain millions of rows of records. If when the DBMS receives a query from a human asking for all the books with a "publication date" after 2012,
Last_Name index shows record location
It would take too long to simply read through each of the 1 million records describing the specific books pulling out the books that match the requested criteria (>2012). The database needs to build a digital version of a library card catalog. The DBMS can search these indexes hold a single piece of sorted metadata and the location of that specific book. Because the indexes are sorted and hold far fewer columns of data than the actual records they point to ... they can be searched much quicker than reading through the whole table of full records. Searching a catalog of index records which have been sorted by "publication year" allows the DBMS to quickly locate all the "books published after 2012".

Indexes sorted alphabetically in RAM

While the the database records are stored on disk, the indexes, due to their smaller size, can be stored in RAM. As each new book is added to or removed from the library, the DBMS will need to update each books associated index. Because the indexes are in memory this process is much faster than if they were stored on disk. The requirement of keeping the indexes in memory is one of the main reasons database servers require lots of RAM. It should be noted that the use of indexes in addition to the records themselves is a duplication of data. Having many different indexes say by First_Name, Last_Name, Hire_Date, Office_location adds to the duplication and work since every time you modify or insert a record you must also update the indexes. It should also be noted that if the database records are never modified, the indexes would never also never need to change. Performance tests are often done to determine whether adding another index will have a positive or negative performance effect on the database.




TAKEAWAYS:



Modern DBMSs either store records full of metadata about entities that exist in the real world or the database records are the actual digital objects themselves. DBMSs store there data records in a database file and give humans the ability to query for a specific set of data records matching some criteria. The DBMS will keep sorted indexes in RAM to allow for fast location and retrieval of the requested set of records called a "recordset". While the database term may have been coined in 1962 to refer to the methods of storing and retrieving digital data, the concepts such as indexing and metadata have existed for millennia. In future blog posts DATABASE 102 & 103, we will investigate the database concept further. In even more database blog posts, I will investigate the different types of databases such as relational, NoSQL and NewSQL as well as their use cases.