Friday, August 14, 2009

Data Collection, Sorting, Archiving and Access #1 (Thesis)

Babbling my thoughts and theories to nobody and everybody, or in other words, to anybody who might be interested in them :).

Introduction

I have collected a vast amount of data and have trouble using it effectively. Not being able to find and use specific data that I collected when I need them is a problem that bothers me and that I am constantly working to improve. This problem of mine is of course one that pretty much every person has to a varying extent, so there are numerous options and tools already created by others that try to solve my exact issues.

Considering that virtually everybody has the same problem and that so many "solutions" and tools exist to solve it, it surprised me that virtually all of the ones I checked out failed to solve exactly those problems for ME. This bothered me a lot and got me thinking about why this is the case. It didn't make sense at first glance, but after giving it more and more thought, I started to realize why most tools that are out there fail to solve the problems for me.

This is great, but it does not help me with actually solving my problem. So I kept thinking about the problem itself and learned that, while the problem appears to be the same for everybody, generally speaking, it actually is not once you look at things a bit more closely. Those differences can make a tool useless for me while it works perfectly fine for somebody else. Many of the differences can also be compensated for by tools, if the amount of data is relatively small or the available computer resources (hardware, almost every aspect of it) are abundant and can be wasted on getting you what you want in a very inefficient manner.


Once the amount of data becomes too large and/or the resources available for processing become too scarce, the option of getting away with wasting tons of them is suddenly not there anymore. At that point the tool(s) that you are using break and become unable to solve your problem. In order to find or create (or inspire somebody else to create) a tool that is truly capable of solving the same problem with its subtle (but eventually significant) differences, for all or at least for most people, we must determine the basic elements and causes of the problems, the goals that need to be accomplished and the reasons behind them.

Then we can think about ways that allow us to solve all those different problems efficiently and to accomplish the set goals satisfactorily. A one-size-fits-all approach obviously does not work. To accomplish what we want, the tool has to be able to adjust (or be adjusted by the user) to cope with the varying problems and changing facts, such as the amount and diversity of data.

I started with thinking about the root of the problem; the core and heart. I also tried to determine constants, the things that are the same for everybody, because deep inside there is always something that is shared across all the different scenarios that are possible due to the large number of things that can be different and the exponentially increasing number of possible combinations that result from it. 


Underlying Context (A Theory)

People collect data throughout their entire life. I use the word "DATA", because "information" is not broad enough in my opinion. Most DATA collected by human beings are information of some sort. Information is, per my own definition, data that are classified and structured, viewed from the perspective of each individual human separately. I will use the phrase "Classification and Structure" repeatedly in this document and refer to it as C&S or spell it out, depending on what I feel to be appropriate.


The C&S system is a collection of indexes and is needed to access data quickly. If no C&S system were in place, the data would have to be processed one by one until the data that are currently needed are found. Data are like a pile of small boxes, where each box contains a chunk of data. "C", the Classification, would be labels put on the box, with keywords that describe the content. "S", the Structure, would be stacks of boxes that share a common keyword, grouped in columns by even broader keywords, then rooms, floors, buildings, facilities etc.
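The box analogy can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample boxes and the names `find_by_scan`, `build_label_index`, `structure` and `find_by_group` are hypothetical and do not refer to any existing tool. It contrasts opening every box one by one with a lookup through a label index (the "C"), and hints at the "S" by grouping labels under a broader keyword.

```python
from collections import defaultdict

# Hypothetical sample data: each "box" holds a piece of data plus its labels ("C").
boxes = [
    {"id": 1, "content": "Titan surface radar scan", "labels": {"saturn", "moon", "titan"}},
    {"id": 2, "content": "Holiday photos 2008", "labels": {"photos", "family"}},
    {"id": 3, "content": "Enceladus plume notes", "labels": {"saturn", "moon", "enceladus"}},
]

def find_by_scan(boxes, keyword):
    """No C&S index: open every box one by one and check its labels."""
    return [box["id"] for box in boxes if keyword in box["labels"]]

def build_label_index(boxes):
    """The "C": map each label to the ids of the boxes that carry it."""
    index = defaultdict(set)
    for box in boxes:
        for label in box["labels"]:
            index[label].add(box["id"])
    return index

# The "S" (assumed layout): broader keywords group narrower ones,
# like stacks of boxes grouped into columns.
structure = {"astronomy": {"saturn", "moon"}, "private": {"photos", "family"}}

def find_by_group(index, structure, group):
    """Collect all boxes filed under any label that belongs to a broader group."""
    ids = set()
    for label in structure.get(group, set()):
        ids |= index.get(label, set())
    return sorted(ids)

index = build_label_index(boxes)
print(find_by_scan(boxes, "moon"))                   # checks every box
print(sorted(index["moon"]))                         # one lookup in the index
print(find_by_group(index, structure, "astronomy"))  # lookup via the broader keyword
```

The scan and the index lookup return the same boxes; the difference is that the scan has to touch every box, while the index only touches the entry for the requested label.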

The same data might be classified and structured for one person, but not for another. Classification and Structure are stored separately from the Data, because every human develops his own framework and classification structure, designed for his current needs, priorities and understanding of the world. This framework is constantly changing and also expanding throughout the entire lifetime of each person. It is not and will never be static and fixed.

Of course, if classification and structure can be found with the data that a person collects, it could be the case that they are identical to or partially compatible with the person's own classification system and structure. In some cases they can be adopted (replicated) as is, but in many cases it is necessary to modify the C&S found with the data before it is integrated into the existing C&S. Different terms might be used for the same thing, or the structure needs to be diversified or simplified. (See the box with my example for clarification and illustration.)


Example: Differences of the C&S used for ONE set of Data by Different People

Take a set of geological data about the moons of the planet Saturn. The data were compiled by a physicist/astronomer who is working for a university. The astronomer is a member of NASA's Cassini mission team and his full-time job is to work with the geological data collected by the probe about the moons. Since the physicist is an expert who has been working on this specific subject for years, and the subject thus makes up a huge portion of that person's life, the data related to it are structured very deeply and many scientific classifications are used that most ordinary people have never heard of. Like most NASA mission data, these geological data are published on the NASA web site and made accessible to the public.

A person interested in astronomy comes across those data and decides to "store" some parts or all of them in his own archive. The classifications used by the scientist would most likely be too specific and would be replaced by more generic ones instead. The structure would probably be flattened and simplified as well. While the physicist distinguishes moons by very specific geological properties, such as the type of material the moons are made of, or properties like "geologically active" or not, with atmosphere or without etc., the person does not, classifying all of them simply as "moon of planet Saturn".
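The example can be sketched roughly in Python. This is a minimal illustration, not a real data format: the records, the classification paths and the function `simplify` are hypothetical and were made up only to show a deep, specific C&S being flattened and replaced by one generic label.

```python
# Hypothetical classification paths a specialist might attach to the moons.
specialist_records = [
    {"name": "Titan", "path": ["Saturn", "moons", "with atmosphere", "geologically active"]},
    {"name": "Enceladus", "path": ["Saturn", "moons", "icy", "geologically active"]},
    {"name": "Mimas", "path": ["Saturn", "moons", "icy", "geologically inactive"]},
]

def simplify(record, keep_levels=2, generic_label="moon of planet Saturn"):
    """Flatten a deep classification path and replace the specific labels with a generic one."""
    return {
        "name": record["name"],
        "path": record["path"][:keep_levels],  # flattened structure ("S")
        "labels": [generic_label],             # generic classification ("C")
    }

# The hobbyist's copy of the same data, filed under his own simpler C&S.
hobbyist_records = [simplify(record) for record in specialist_records]
for record in hobbyist_records:
    print(record["name"], record["path"], record["labels"])
```

The data about each moon stay the same; only the C&S attached to them changes, which is why the same set of data can look deeply structured to one person and almost flat to another.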

A Bio-Technical Marvel

The data with the corresponding C&S are stored in the human brain (memory). The human brain is a marvelous archiving system with a huge storage capacity. But the capacity is not limitless. It has two different sets where it can store the data, and they work differently. The first set contains recently collected data as well as data that were collected a long time ago but used recently. Access to those data is very fast and they are ready to use in an instant, if needed. The second set, with the largest capacity, holds data that were collected a long time ago and have not been used for a while either.

This second set is the long-term memory, the human's data archive and vault. Most data stored here take some time to access, and many even require that the right trigger (Tag, Specific Classification) is used to find the data at all. Usually, data in the short and mid-term memory that were not used for a while are "moved" to the archive of the long-term memory and then on to the deep areas of the archive, the ones that require the use of the right triggers to access them again. The brain automatically determines the relevance of each piece of data and drops the ones it believes to be no longer relevant.

Everything Has Flaws

This marvel of biological engineering has some serious flaws though. Classification of data happens based on the current C&S system used by the person. Data are also altered (mostly shrunk) over the course of time and during their moves between or within the two memory sets. It is virtually impossible to determine now if and when a set of data might be required again in the future. Vital data might get dropped or shrunk and thus be unavailable or useless once we need them again, or the classification system might have changed so much that it has become hard or even impossible to find and deliberately use the right triggers to access data that are archived deep in the long-term memory.

Humans are perfectly aware of this problem and have used various methods to compensate for those limitations and shortcomings. Things, physical objects, that can act as a trigger to access data in the distant future can be kept and stored somewhere and used when needed. Those objects might even contain some or all of the data themselves. When writing was invented, for example, documents could be created that contain all the data, so the object would not just be a trigger; it would hold the entire data itself and allow a complete reconstruction if the data that were kept in a person's memory were erased, shrunk or altered. Storing just the data is not enough, though, if the amount of data becomes very large. Accessing a specific item when needed is then anywhere from difficult to virtually impossible.

The Catch 

The C&S must be stored outside the brain, together with the data in order to be able to access data again in a reasonable amount of time. The C&S attached to the data can also be stored like a written document outside the human brain of course. 

There is a problem though. The C&S system of a person changes over time. It is possible that it has changed so much that the C&S used in the past and stored with the data has become virtually incompatible with the C&S system currently in use. That would make the stored data as inaccessible as if no C&S had been stored with them in the first place.

The C&S of data in the short, mid-term and, to a limited extent, also the long-term memory is updated by the human brain automatically if a change in the C&S system of a person occurs. This update must also be done, manually, for the C&S of the data that were stored "offline" as backup. Updating it every time there is a change would be inefficient and consume too much time. It should be done, though, before the number and type of changes become too large to be applied to the "offline" archive without loss.
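One way to picture such a periodic update is sketched below in Python. It is only an illustration under stated assumptions: the archive layout, the label names and the function `apply_changes` are hypothetical. The idea is to collect label changes over time and apply them to the "offline" archive in one batch, following chains of renames so that several accumulated changes are resolved in a single pass.

```python
def apply_changes(archive, renames):
    """Return a copy of the archive with all stored labels brought up to date.

    `renames` maps an old label to its current label. Chains (a -> b -> c)
    are followed, and a label that was already visited stops the walk so
    that an accidental cycle cannot loop forever.
    """
    def resolve(label):
        seen = set()
        while label in renames and label not in seen:
            seen.add(label)
            label = renames[label]
        return label

    return {item: {resolve(label) for label in labels}
            for item, labels in archive.items()}

# Hypothetical "offline" archive: item name -> labels stored with it years ago.
offline_archive = {
    "notes_1999.txt": {"www", "homepage"},
    "scan_0042.png": {"ascii art"},
}

# Changes to the personal C&S collected since the last update (assumed examples).
pending_renames = {"www": "web", "homepage": "website", "website": "web"}

updated = apply_changes(offline_archive, pending_renames)
print(updated)
```

Renames are the easy case. Changes that split one label into several or that restructure whole groups cannot be replayed this mechanically, which is where waiting too long leads to the loss mentioned above.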

A Look at Human History

Humans created archives with data that are relevant and/or vital for society as a whole. Those archives are typically administered by people whose job it is to keep the C&S system used for the archive up to date and, most importantly, to help other people who would like to use data from that archive to find them. They do so by assisting with the matching of that person's C&S system to the one used by the archive. An exact match is preferable, but even a partial match helps, because then only a smaller set of data has to be processed one after another to find the item that matches.

A New Age

Now we have entered the digital age, and the access, replication and storage of vast amounts of data become easier, while the amount of data that is created or made accessible grows rapidly at the same time. The limitations of physical archives, in terms of the storage space needed to keep them and the time and cost of creating the documents, are many times smaller when things are done electronically. Thanks to the rapid advancements in technology, this reduction of limitations continues at an exponential rate.

People can now create vast archives that contain data of interest to them on a scale unheard of in human history. The climax would be reached if ALL DATA in existence were accessible to ANYBODY in an instant at ANY TIME, which is of course only hypothetical, because this will NEVER happen. As long as words such as privacy, intellectual property and secret have any meaning in human society, ALL DATA will never become accessible to ANYBODY. Some data will always be restricted to one individual person or to a limited group of people. This means that there cannot be just a single data archive for everything without the need for repeated replication.

Back to Reality

The reality is that some data that are accessible today might not be accessible in the future, for different reasons. So if a person accesses data today and deems them relevant and important enough to be able to access again in the future, this person will replicate the data and save them somewhere where he has control over them (add, edit and delete) and access to them (read) whenever needed.

Even if people believe that access will remain available in the future and that replication is not required for a large amount of data, the problem of the C&S of those data remains. The creation of a general C&S system for this huge and centralized digital data archive cannot substitute for the C&S system that an individual has in place. It would be way too complex and detailed and thus useless as an index for fast access.


The C&S system of a person only has the size and dimensions that the individual needs today. Refinements are made in areas where more data are collected and used, such as when a person becomes a specialist in something. Structures are simplified if the data become less important and/or less frequently used, for example when a person changes his career or when a once important hobby becomes less interesting and fades away, maybe because something else is pursued now instead.

Does this make sense to anybody but myself? Are my assumptions and speculations making sense or am I missing the mark entirely? Voice your thoughts and opinions via the comments section below, openly or anonymously.


Carsten aka Roy/SAC
