Big data comes with a lot of new terminology that is sometimes hard to understand. Therefore we have created an extensive big data glossary that should give some insight. Some of the definitions refer to a corresponding blog post. Of course this big data glossary is not 100% complete, so please let us know if there is any terminology missing that you would like to see included.

A
Aggregation – a process of searching, gathering and presenting data
Algorithms – mathematical formulas or procedures that can perform certain analyses on data
Analytics – the discovery of insights in data
Anomaly detection – the search for data items in a dataset that do not match a projected pattern or expected behaviour. Anomalies are also called outliers, exceptions, surprises or contaminants, and they often provide critical and actionable information.
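As an illustration, one common anomaly-detection rule flags any value that lies more than a chosen number of standard deviations from the mean; a minimal sketch in Python (the sensor readings and threshold are made up for the example):

```python
from statistics import mean, stdev

def find_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

readings = [10, 11, 9, 10, 12, 11, 10, 98]  # 98 is the planted anomaly
print(find_outliers(readings, threshold=2.0))  # [98]
```

Real systems use more robust techniques (moving windows, seasonal baselines, model-based scores), but the principle of measuring distance from expected behaviour is the same.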
Anonymization – making data anonymous; removing all data points that could lead to the identification of a person
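One simplified approach is to replace identifying fields with a one-way hash (strictly speaking this is pseudonymization, since the same input always maps to the same hash); a hypothetical sketch, with the record and field names invented for illustration:

```python
import hashlib

def anonymize(record, pii_fields=("name", "email")):
    """Replace personally identifiable fields with a truncated one-way hash."""
    clean = dict(record)
    for field in pii_fields:
        if field in clean:
            clean[field] = hashlib.sha256(clean[field].encode()).hexdigest()[:12]
    return clean

row = {"name": "Jane Doe", "email": "jane@example.com", "age": 34}
print(anonymize(row))  # name and email are hashed, age is untouched
```

True anonymization also requires removing or generalizing quasi-identifiers (age, postcode, etc.) that could re-identify a person in combination.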
Application – computer software that enables a computer to perform a certain task
Artificial Intelligence – developing intelligent machines and software that are capable of perceiving their environment, taking corresponding action when required and even learning from those actions

B
Behavioural Analytics – analytics that informs about the how, why and what instead of just the who and when; it looks at humanized patterns in the data
Big Data Scientist – someone who is able to develop the algorithms to make sense out of big data
Big data startup – a young company that has developed new big data technology
Biometrics – the identification of humans by their characteristics
Brontobytes – approximately 1,000 yottabytes; the size of the digital universe of tomorrow. A brontobyte is a 1 followed by 27 zeros.
Business Intelligence – the theories, methodologies and processes to make data understandable

C
Classification analysis – a systematic process for obtaining important and relevant information about data, also called metadata: data about data
Cloud computing – a distributed computing system over a network, used for storing data off-premises
Clustering analysis – the process of identifying objects that are similar to each other and clustering them in order to understand the differences as well as the similarities within the data
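A naive one-dimensional k-means pass illustrates the idea: points are repeatedly assigned to their nearest cluster centre and the centres are recalculated (the data and number of clusters are illustrative):

```python
def kmeans_1d(points, k=2, iterations=10):
    """Naive 1-D k-means: assign points to the nearest centre, then update centres."""
    centres = points[:k]  # seed the centres with the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # move each centre to the mean of its cluster (keep it if the cluster is empty)
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return clusters

print(kmeans_1d([1, 2, 3, 50, 51, 52], k=2))  # [[1, 2, 3], [50, 51, 52]]
```

Production clustering works on many dimensions and uses smarter seeding and convergence checks, but the assign-then-update loop is the core of the technique.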
Cold data storage – storing old data that is hardly used on low-power servers; retrieving the data will take longer
Comparative analysis – a step-by-step procedure of comparisons and calculations to detect patterns within very large data sets
Complex structured data – data composed of two or more complex, complicated and interrelated parts that cannot be easily interpreted by structured query languages and tools
Computer-generated data – data generated by computers, such as log files
Concurrency – performing and executing multiple tasks and processes at the same time
Correlation analysis – the analysis of data to determine whether there is a relationship between variables and whether that relationship is negative (-1.00) or positive (+1.00)
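The coefficient behind those -1.00 to +1.00 values is usually Pearson's r; a small sketch that computes it from scratch (the example series are made up):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient: +1 perfect positive, -1 perfect negative."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfectly positive)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (perfectly negative)
```

Remember that correlation measures linear association only and does not imply causation.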
Customer Relationship Management – managing the sales and business processes; big data will affect CRM strategies

D
Dashboard – a graphical representation of the analyses performed by the algorithms
Data aggregation tools – tools that transform scattered data from numerous sources into a single new source
Data analyst – someone who analyses, models, cleans or processes data
Database – a digital collection of data stored via a certain technique
Database-as-a-Service – a database hosted in the cloud on a pay-per-use basis, for example Amazon Web Services
Database Management System – software for collecting, storing and providing access to data
Data centre – a physical location that houses the servers for storing data
Data cleansing – the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency
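A minimal sketch of what cleansing might look like in practice, assuming simple tuple records; it trims whitespace, normalizes case and drops the duplicates that normalization reveals:

```python
def cleanse(records):
    """Trim whitespace, normalize case, and drop duplicate rows."""
    seen, clean = set(), []
    for rec in records:
        normalized = tuple(field.strip().lower() for field in rec)
        if normalized not in seen:
            seen.add(normalized)
            clean.append(normalized)
    return clean

raw = [("Alice ", "NYC"), ("alice", "nyc"), ("Bob", "LA ")]
print(cleanse(raw))  # [('alice', 'nyc'), ('bob', 'la')]
```

Real cleansing pipelines add validation rules, type coercion and error correction on top of this kind of normalization.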
Data custodian – someone who is responsible for the technical environment necessary for data storage
Data ethical guidelines – guidelines that help organizations be transparent with their data, ensuring simplicity, security and privacy
Data feed – a stream of data, such as a Twitter feed or RSS feed
Data marketplace – an online environment for buying and selling data sets
Data mining – the process of finding certain patterns or information in data sets
Data modelling – the analysis of data objects using data modelling techniques to create insights from the data
Data set – a collection of data
Data virtualization – a data integration process carried out in order to gain more insights; usually it involves databases, applications, file systems, websites, big data techniques, etc.
De-identification – the same as anonymization; ensuring a person cannot be identified through the data
Discriminant analysis – the cataloguing of data; distributing data into groups, classes or categories. A statistical analysis used where certain groups or clusters in the data are known upfront, using that information to derive the classification rule.
Distributed File System – a system that offers simplified, highly available access to storing, analysing and processing data
Document Store Databases – document-oriented databases that are especially designed to store, manage and retrieve documents, also known as semi-structured data

E
Exploratory analysis – finding patterns within data without standard procedures or methods; a means of discovering the data and finding the data set's main characteristics
Exabytes – approximately 1,000 petabytes or 1 billion gigabytes. Today we create one exabyte of new information globally on a daily basis.
Extract, Transform and Load (ETL) – a process used in databases and data warehousing: extracting data from various sources, transforming it to fit operational needs and loading it into the database
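A toy ETL run might look as follows, using an in-memory SQLite table as a stand-in for the warehouse (the source rows and schema are invented for the example):

```python
import sqlite3

# Extract: raw source rows (here an in-memory list standing in for a CSV or API).
source = [("2024-01-01", "42.50"), ("2024-01-02", "17.00")]

# Transform: parse the amounts into numbers so they fit the target schema.
rows = [(day, float(amount)) for day, amount in source]

# Load: insert the transformed rows into the warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (day TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", rows)

total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 59.5
```

Production ETL tools add scheduling, error handling and incremental loads, but the three phases remain the same.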
F
Failover – switching automatically to a different server or node should one fail
Fault-tolerant design – a system designed to continue working even if certain parts fail

G
Gamification – using game elements in a non-game context; very useful for creating data, which is why it has been coined the friendly scout of big data
Graph Databases – databases that use graph structures (a finite set of ordered pairs or certain entities), with edges, properties and nodes, for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbouring element.
Grid computing – connecting different computer systems from various locations, often via the cloud, to reach a common goal

H
Hadoop – an open-source framework built to enable the processing and storage of big data across a distributed file system
HBase – an open-source, non-relational, distributed database running in conjunction with Hadoop
HDFS – Hadoop Distributed File System; a distributed file system designed to run on commodity hardware
High-Performance Computing (HPC) – using supercomputers to solve highly complex and advanced computing problems

I
In-memory – a database management system that stores data in main memory instead of on disk, resulting in very fast processing, storing and loading of the data
Internet of Things – ordinary devices that are connected to the internet at any time, anywhere, via sensors

J
Juridical data compliance – relevant when you use cloud solutions and the data is stored in a different country or continent; be aware that data stored in a different country must comply with the laws of that country

K
KeyValue Databases – databases that store data with a primary key, a uniquely identifiable record, which makes the data easy and fast to look up. The data stored in a KeyValue database is normally some kind of primitive of the programming language.
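The idea can be sketched with a minimal in-memory store, where a dictionary plays the role of the key-indexed storage (the class and key names are illustrative):

```python
class KeyValueStore:
    """Minimal in-memory key-value store: every record is looked up by its key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:1001", "Alice")
print(store.get("user:1001"))  # Alice
```

Real key-value databases (Redis, DynamoDB and the like) add persistence, replication and expiry, but the lookup model is this simple.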
L
Latency – a measure of the time delay in a system
Legacy system – an old system, technology or computer system that is no longer supported
Load balancing – distributing workload across multiple computers or servers in order to achieve optimal results and utilization of the system
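One common load-balancing strategy is round-robin, where incoming requests are handed to each server in turn; a toy sketch (the server names are made up):

```python
from itertools import cycle

# Round-robin: cycle through the available servers endlessly.
servers = cycle(["server-a", "server-b", "server-c"])

def dispatch(request):
    """Assign the request to the next server in the rotation."""
    return next(servers), request

for req in ["req1", "req2", "req3", "req4"]:
    print(dispatch(req))  # req4 wraps around to server-a
```

Other strategies weight servers by capacity or route by least active connections; round-robin is simply the easiest to illustrate.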
Location data – GPS data describing a geographical location
Log file – a file automatically created by a computer to record events that occur while it is operational

M
Machine2Machine data – data created when two or more machines communicate with each other
Machine data – data created by machines via sensors or algorithms
Machine learning – part of artificial intelligence where machines learn from what they are doing and become better over time
MapReduce – a software framework for processing vast amounts of data
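The programming model can be sketched as a map phase that emits (word, 1) pairs and a reduce phase that sums them, the classic word-count example (this is a single-process illustration, not the distributed framework itself):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big insights"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 1, 'insights': 1}
```

In a real MapReduce cluster the map tasks run in parallel on many machines and the framework shuffles all pairs with the same key to the same reducer.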
Massively Parallel Processing (MPP) – using many different processors (or computers) to perform certain computational tasks at the same time
Metadata – data about data; it gives information about what the data is about
MongoDB – an open-source NoSQL database
Multi-Dimensional Databases – databases optimized for online analytical processing (OLAP) applications and for data warehousing
MultiValue Databases – a type of NoSQL and multidimensional database that understands three-dimensional data directly. They are primarily giant strings that are perfect for manipulating HTML and XML strings directly.