Information Technology Sales: November 2014

The term "database" was first used in 1962 and coincided with storing data on disk drives as opposed to tape. To me, the term "database" refers specifically to the storage, updating and retrieval of information stored in a file on disk rather than on tape. Randomly accessed disk drives opened the door to a new way of managing data, one very different from sequential tape access. A file full of data serves as the "base" that you can get "data" from. This "database" is usually paired with a running application process that serves as the only process allowed to access and modify the database file. Humans "query" this application process, which locates and retrieves data from inside the "base" file on their behalf. We collectively refer to the application process and its associated data file as a DBMS, or DataBase Management System. Humans simply write a question, referred to as a "query", describing a particular piece of data in the collection, and the DBMS uses its methods of fast search and retrieval to return the data matching the query description. Oracle Database 12c, Microsoft SQL Server 2014 and IBM DB2 10.5 are specific examples of DBMSs.

COLLECTIONS OF DATA:

Collection of books

Let's face it, human beings were collecting and tracking objects long before the term "database" appeared in 1962 or disk drives arrived only a few years earlier, in 1957. We can all easily think of reasons why early humans wanted to collect information about objects and search that metadata. Recall that metadata is often called "data about data".

HISTORICAL COLLECTIONS:

Library of Alexandria in Ancient Egypt

Glance backward through history to the time of the Library of Alexandria, which existed in Egypt some 2,300 years ago. The Library of Alexandria was early mankind's attempt to gather up or "collect" all the written knowledge in the world. It was the first known effort to preserve an understanding of the natural world and its history. The library existed as a physical location, full of shelves storing possibly half a million writings on scrolls. Any time you collect more objects than you can reliably track in your head, you need to develop a system to track and organize the objects for fast search and retrieval.

MODERN-DAY LIBRARIES:

Library Card Catalog

If we jump forward to modern-day libraries circa 1990, they used bibliographic records: cards containing summarized information about the books they refer to. We refer to these bibliographic records simply as metadata since they are NOT the actual objects being tracked but, by definition, "data about data". Specifically, they are metadata about the books in the library's collection. It is interesting to note that if we scanned all of the books into PDF files and stored them directly in a modern-day database, the database would no longer be a collection of records that each hold metadata about a book that exists in the real world. Instead, the database records would hold the ACTUAL objects being collected and managed. When the things being collected and managed are digital, a database often contains those digital objects inside the database files themselves. When the entities being tracked are not digital but real-world physical objects, we give each one a unique identifier (an EmployeeID if we are tracking real-world employees) and use database records that hold the identifier along with metadata about that specific object.

Library's Online Database

Many readers may recognize the above photo of a library card catalog. I recall searching an "Author's Name" catalog or a "Book Title" catalog for a particular book. As I flipped through the index cards I could not help but think: wow, what a lot of work to type up all these index cards on a typewriter, insert them into their sorted location and keep them sorted! This is exactly the type of repetitive, tedious work databases were created to do.

THE WORK DBMSs DO:

Work Automated by DBMS

Thinking about this a bit: every physical book exists at only one Floor-Aisle-Shelf location in the library. That being true, we can create multiple metadata catalogs. A catalog is a set of metadata cards, each with only two pieces of information on it: a location that is a unique reference to a single book in the library, and the "other" piece of information the cards are sorted on. One catalog could contain cards sorted by "author's last name", another by "book's title" and a third by "publication year". The cards in a catalog serve as an "index" that points to the location of the physical book from which each card was derived. I am using this "library of books" analogy because it is a collection of things that we should all be familiar with, and it was not long ago that databases took over how books are searched and tracked in modern-day libraries. DB management systems are software application processes that accept requests for a set of records matching a particular set of criteria. The DBMS process searches the catalogs of indexes that it builds and maintains. That is the work that a database does. I think of a database as a robot that stores, indexes and retrieves particular pieces of data according to my query.
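The card catalog idea can be sketched in a few lines of Python. This is only an illustration of the analogy, not how any particular DBMS builds its indexes. The books, the `shelf` location strings and the helper names `build_catalog` and `lookup` are all made up for the example.

```python
import bisect

# Hypothetical books: each one lives at exactly one floor-aisle-shelf location.
books = [
    {"shelf": "2-14-3", "author": "Melville", "title": "Moby-Dick", "year": 1851},
    {"shelf": "1-02-7", "author": "Austen", "title": "Emma", "year": 1815},
    {"shelf": "3-09-1", "author": "Orwell", "title": "Animal Farm", "year": 1945},
]

def build_catalog(books, field):
    """Return a sorted list of (key, shelf) 'cards' for one field."""
    return sorted((book[field], book["shelf"]) for book in books)

def lookup(catalog, key):
    """Return the shelf locations of every card whose key equals `key`."""
    keys = [card[0] for card in catalog]
    lo = bisect.bisect_left(keys, key)    # first card with this key
    hi = bisect.bisect_right(keys, key)   # one past the last card with this key
    return [shelf for _, shelf in catalog[lo:hi]]

# Two catalogs over the same books, each sorted on a different field.
by_author = build_catalog(books, "author")
by_title = build_catalog(books, "title")

print(lookup(by_author, "Orwell"))  # ['3-09-1']
print(lookup(by_title, "Emma"))     # ['1-02-7']
```

Each catalog holds only the sort key and the location, so the same three books can be found by author or by title without moving any book.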

DEWEY DECIMAL SYSTEM:

Dewey Decimal Location Marker

Since I'm using a library analogy, the 1876 Dewey Decimal System may come to mind here. The DDS uses a system of logical decimal numbers that can be used to point to a physical location. Since the books are physical objects, humans find it useful to group books on similar topics together, which makes it easy to browse the books on the shelf to the left and right of the book you located.

Dewey Decimal Top-Level Classes

Consider lesser methods, like simply giving each shelf location a number that starts at one and increases as you add shelf locations. While this expandable system of numbered locations would allow for adding shelves to the library, it would require you to renumber when books are added in the middle and would leave empty shelf locations if a book was removed from the library's collection. The DDS allows books on a similar topic to be physically located in areas of the library grouped by class. The three-digit whole number in front of the decimal point encodes a "Class" (the hundreds digit), which is further divided into "Divisions" (the tens digit) and "Sections" (the ones digit). The location in the photo above, which starts with 341.237, is part of the 300 "Social Sciences" Class, the 340 "Law" Division and the 341 "Law of Nations" Section.
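The class, division and section breakdown can be worked out with simple integer arithmetic. The sketch below assumes a call number written as a plain string such as "341.237"; the function name `decode_dewey` is made up for illustration, and the labels are only the three named in the example above.

```python
def decode_dewey(call_number):
    """Split a Dewey number such as '341.237' into class, division and section."""
    whole = int(call_number.split(".")[0])   # digits before the decimal point
    return {
        "class": (whole // 100) * 100,       # 341 -> 300
        "division": (whole // 10) * 10,      # 341 -> 340
        "section": whole,                    # 341 -> 341
    }

# Labels taken from the 341.237 example in the text.
labels = {300: "Social Sciences", 340: "Law", 341: "Law of Nations"}

for level, number in decode_dewey("341.237").items():
    print(level, number, labels[number])
```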

Dewey Decimal Number Decoder

This clever system of decimal location numbers gives library users the ability to "browse books of similar topic" simply by going to the library aisle for that topic.

However, when the DBMS stores digital objects directly in its database file, there is no need for browsing. Objects are stored at a byte offset from the start of the database file. If the digital records are each 100 bytes long and you want to retrieve the 5th record, skip the first four records (4 × 100 = 400 bytes) and read 100 bytes of data starting 400 bytes in from the start of the database file.
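A small sketch of the offset arithmetic follows. With records numbered from 1, record n starts at byte (n - 1) × 100, so the 5th record begins at byte 400 and ends just before byte 500. The in-memory `io.BytesIO` buffer stands in for a real database file, and `read_record` is a hypothetical helper, not part of any DBMS.

```python
import io

RECORD_SIZE = 100  # bytes per record, as in the example above

def read_record(db_file, n):
    """Return record number n (1-based) from a file of fixed-length records."""
    db_file.seek((n - 1) * RECORD_SIZE)   # jump straight to the record's offset
    return db_file.read(RECORD_SIZE)

# Build a stand-in "database file": ten records, each padded to 100 bytes.
db_file = io.BytesIO()
for i in range(1, 11):
    db_file.write(f"record {i}".encode("ascii").ljust(RECORD_SIZE, b" "))

fifth = read_record(db_file, 5)
print(fifth.rstrip())   # b'record 5'
print(db_file.tell())   # 500: the read started at byte 400 and consumed 100 bytes
```

No searching is involved: the record number alone is enough to compute where to read.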

DATABASE FILE STRUCTURE:

Database File Record Offsets

The picture to the right shows three records, each with only two fields. The records are stored inside a database file, which is normally visualized as a long linear string of bytes but is shown here stacked on top of each other for display purposes. In real life there can be more than two columns of data about each entity, but we have shown only two. This is how the records are actually stored in the database file. Just like in an Excel spreadsheet, each row contains columns of data about a single entity.
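To make the "long linear string of bytes" concrete, here is a rough sketch using Python's `struct` module. The two-field layout (a 4-byte ID followed by a 16-byte name) and the sample rows are assumptions chosen for illustration; real database file formats are more elaborate.

```python
import struct

# Assumed layout: a 4-byte unsigned integer ID followed by a 16-byte name field.
RECORD = struct.Struct("<I16s")

rows = [(101, "Smith"), (102, "Jones"), (103, "Garcia")]

# Pack the rows end to end, the way they would sit in the file.
data = b"".join(RECORD.pack(emp_id, name.encode("ascii")) for emp_id, name in rows)

print(RECORD.size)   # 20 bytes per record
print(len(data))     # 60 bytes for three records

# Unpack the second record by reading at its byte offset.
emp_id, raw_name = RECORD.unpack_from(data, offset=1 * RECORD.size)
print(emp_id, raw_name.rstrip(b"\x00").decode("ascii"))  # 102 Jones
```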

Single index is smaller than full book record it points to on disk

Recall that the "job" of a database is NOT just to store the records row by row in a database file but to allow humans to query for specific records in that file. The DB file could potentially contain millions of rows of records. Suppose the DBMS receives a query from a human asking for all the books with a "publication date" after 2012.

Last_Name index shows record location

It would take too long to simply read through each of the 1 million records describing the specific books, pulling out the ones that match the requested criterion (>2012). The database needs to build a digital version of a library card catalog. The DBMS can search these indexes, each entry of which holds a single piece of sorted metadata and the location of that specific book's record. Because the indexes are sorted and hold far fewer columns of data than the actual records they point to, they can be searched much more quickly than reading through the whole table of full records. Searching a catalog of index records that have been sorted by "publication year" allows the DBMS to quickly locate all the "books published after 2012".
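The "published after 2012" search can be sketched with a sorted list and a binary search. This is a simplified stand-in for a real index structure (production DBMSs typically use tree-based indexes), and the table contents, offsets and the function name `published_after` are invented for the example.

```python
import bisect

# Hypothetical table: record offset in the file -> full book record.
table = {
    0:   {"title": "Book A", "year": 2009},
    100: {"title": "Book B", "year": 2014},
    200: {"title": "Book C", "year": 2012},
    300: {"title": "Book D", "year": 2013},
}

# The index holds only (year, offset) pairs, kept sorted by year.
year_index = sorted((rec["year"], offset) for offset, rec in table.items())
years = [year for year, _ in year_index]

def published_after(year):
    """Binary-search the index for the first entry with a later year."""
    start = bisect.bisect_right(years, year)
    return [table[offset]["title"] for _, offset in year_index[start:]]

print(published_after(2012))  # ['Book D', 'Book B']
```

Only the matching records are fetched from the table; the records for 2009 and 2012 are never read.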

Indexes sorted alphabetically in RAM

While the database records are stored on disk, the indexes, due to their smaller size, can often be held in RAM. As each book is added to or removed from the library, the DBMS needs to update every index entry associated with that book. Because the indexes are in memory, this process is much faster than if they were stored only on disk. The need to keep indexes in memory is one of the main reasons database servers require lots of RAM. It should be noted that keeping indexes in addition to the records themselves is a duplication of data. Having many different indexes, say by First_Name, Last_Name, Hire_Date and Office_location, adds to the duplication and to the work, since every time you insert a record, or modify an indexed field, you must also update the indexes. It should also be noted that if the database records are never modified, the indexes never need to change. Performance tests are often done to determine whether adding another index will have a positive or negative effect on the database's performance.
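The extra work that each index adds on insert can be seen in a toy model. The sketch below counts one write for the record plus one for every index; the field names follow the ones mentioned above, while the sample employees and the `insert` helper are assumptions made for illustration.

```python
import bisect

INDEXED_FIELDS = ["first_name", "last_name", "hire_date"]

records = []                                   # stand-in for the record file
indexes = {field: [] for field in INDEXED_FIELDS}

def insert(record):
    """Append the record, then add one entry to every index."""
    position = len(records)
    records.append(record)
    for field in INDEXED_FIELDS:
        # insort keeps each index sorted as the new entry goes in
        bisect.insort(indexes[field], (record[field], position))
    return 1 + len(INDEXED_FIELDS)             # writes performed for this insert

writes = insert({"first_name": "Ada", "last_name": "Lovelace", "hire_date": "2014-11-21"})
writes += insert({"first_name": "Alan", "last_name": "Turing", "hire_date": "2013-06-01"})

print(writes)                 # 8: two record writes plus six index updates
print(indexes["hire_date"])   # [('2013-06-01', 1), ('2014-11-21', 0)]
```

Adding a fourth indexed field would raise the cost of every insert from four writes to five, which is the trade-off that index performance tests try to measure.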

TAKEAWAYS:

Modern DBMSs store either records full of metadata about entities that exist in the real world or records that are the actual digital objects themselves. DBMSs store their data records in a database file and give humans the ability to query for a specific set of data records matching some criteria. The DBMS keeps sorted indexes in RAM to allow for fast location and retrieval of the requested set of records, called a "recordset". While the term "database" may have been coined in 1962 to refer to methods of storing and retrieving digital data, concepts such as indexing and metadata have existed for millennia. In future blog posts, DATABASE 102 & 103, we will investigate the database concept further. In later database posts, I will investigate the different types of databases, such as relational, NoSQL and NewSQL, as well as their use cases.

Friday, November 21, 2014

DATABASES - RECORDS AND INDEXES - 101
