Datafloq

Data and Technology Insights

Understanding Data Fusion: The Challenges and A Way Forward

Madhusudan Therani / 5 min read.
October 15, 2018

Data fusion, the art of merging information from multiple sources to create sophisticated models, presents significant potential benefits for a range of industries. This is a fundamental problem in many domains as one tries to bootstrap a data economy.

Bringing together heterogeneous datasets, however, raises several conceptual and technical questions, particularly around choosing among existing analytical approaches and developing appropriate evaluation metrics. Understanding these challenges, along with the complexities surrounding the different kinds of heterogeneity in the data, is key to developing viable data fusion approaches. The boundary between generic and domain-specific approaches to data fusion is yet to be defined.

For starters, the following are the main dimensions of data heterogeneity to consider when discussing datasets:

  • Semantic Heterogeneity: Do the different data sets refer to the same phenomena? Do they complement or conflict with one another? Do they share the same event source?
  • Temporal Heterogeneity: Data sources may be static or dynamic. Static data sets are snapshots of a phenomenon at a point in time. Dynamic data may be streaming data that reflects a phenomenon continuously or at longer time intervals.
  • Spatial Heterogeneity: Data may reflect spatial effects or capture spatial nuances in one, two or three dimensions. Do the different data sets share similar abstractions?
  • Modelling Heterogeneity: Most of the data gathered comes from sensors and devices that capture analogue and digital phenomena based on an underlying model. The nature of this abstraction matters when fusing data, because it shapes what one should expect from the fused result. Are the underlying assumptions compatible?
  • Infrastructural Heterogeneity: Large data sets may not be captured in full due to storage, power or bandwidth limitations, and data may be corrupted due to infrastructure issues or incomplete due to operational and systems issues. How do you fuse data on an ongoing basis under such conditions?

Knowing these various properties (and the fact that different properties will become relevant in different scenarios), anyone looking to fuse data must first consider the questions they wish to answer in order to determine the most appropriate approach to merging their data. Let’s use the example of a retailer wishing to develop various counts (time-series datasets) to illustrate the point.

Many retailers want to put together detailed views of their customers and the performance of their stores/products. Those retailers might gather consumer data from a multitude of sources, such as:

  • WiFi hotspots within the outlet
  • A beacon-based system
  • A digital POS system that tracks transactions in-store
  • In-store video cameras
  • Human mobility data from a third party

With the data from those sources, they might want to answer questions regarding store activity in the recent past, like:

  • How many customers came near the store?
  • How many came into the store?
  • How many were window shoppers?
  • How many actually bought something?
  • How many are repeat customers? How many buy each time they come into the store?
  • How many did not find what they were looking for?
  • How many were price-shopping? Comparing products?
  • How much time was spent by each customer?
  • How many were disappointed by the customer service?
  • Which products really sold well versus those that did not?
  • How many visitors show up in one source and not the other? Should one count visits or the number of people per day?
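
The last question in the list can be made concrete with a small sketch. It assumes, purely for illustration, that two sources (labelled "wifi" and "beacon" here) emit identifiers that can be compared directly, which is often not true in practice. The data, source names and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: (source, hashed_device_id, day). All values are made up.
observations = [
    ("wifi", "a1", "2018-10-01"),
    ("wifi", "a1", "2018-10-01"),   # same device seen twice on the same day
    ("wifi", "b2", "2018-10-01"),
    ("beacon", "a1", "2018-10-01"),
    ("beacon", "c3", "2018-10-01"),
]

def ids_by_source(rows):
    """Group the distinct identifiers seen by each source."""
    seen = defaultdict(set)
    for source, device_id, _day in rows:
        seen[source].add(device_id)
    return seen

def visits_and_people(rows, source):
    """Return (number of visits, number of distinct people) for one source."""
    source_rows = [r for r in rows if r[0] == source]
    return len(source_rows), len({r[1] for r in source_rows})

seen = ids_by_source(observations)
only_wifi = seen["wifi"] - seen["beacon"]     # seen by WiFi but not by beacons
only_beacon = seen["beacon"] - seen["wifi"]   # seen by beacons but not by WiFi
in_both = seen["wifi"] & seen["beacon"]       # seen by both sources

print("only wifi:", only_wifi, "only beacon:", only_beacon, "both:", in_both)
print("wifi (visits, people):", visits_and_people(observations, "wifi"))
```

Even in this toy case the visit count and the people count differ for the same source, so the choice between them has to follow from the question being asked.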

Depending on the nature of the outlet, different questions listed above will take priority, and each of the five possible data sources may be fused in different ways depending on the question being answered.

A generic department store, for instance, may have different needs from a restaurant which may be different from a shoe store. Stores co-located in a mall will require a different approach than those located in independent areas. There are even differing spatial constraints between stores that factor into the data fusion process.


And the above applies merely to understanding what has happened in a sliver of time in the past. Building something predictive, to answer questions about the future such as how many customers will come to the store or how many customers a store owner should reach, brings with it even more challenges.

The rabbit hole goes deeper. Those five aforementioned data streams bring with them questions and hurdles of their own:

  • Some streams arrive every minute, whereas others arrive daily
  • Some data is available instantaneously, while some arrives with a delay
  • There are blackouts in the data on different streams at different times
  • There are spikes and troughs in the data
  • How does one verify the data’s accuracy?
  • How does one validate the fused model?
  • If decisions are to be made on the model, what is an acceptable margin of error? What are the risks involved with evaluating different data trade-offs during the fusion process?
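
As a rough illustration of the first few hurdles, the sketch below brings a per-minute stream down to the daily granularity of a second stream and records how much of each day was actually observed. It assumes blackout minutes are explicitly present in the feed as missing values (None); a real feed may simply omit them. The data and names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-minute counts from one source; None marks a blackout minute.
minute_counts = [
    (datetime(2018, 10, 1, 9, 0), 4),
    (datetime(2018, 10, 1, 9, 1), None),
    (datetime(2018, 10, 1, 9, 2), 6),
    (datetime(2018, 10, 2, 9, 0), 5),
]
# Hypothetical daily totals from a second source.
daily_counts = {"2018-10-01": 12, "2018-10-02": 5}

def to_daily(rows):
    """Aggregate minute-level rows to (daily total, fraction of minutes observed)."""
    totals = defaultdict(int)
    observed = defaultdict(int)
    expected = defaultdict(int)
    for ts, count in rows:
        day = ts.date().isoformat()
        expected[day] += 1
        if count is not None:
            totals[day] += count
            observed[day] += 1
    return {day: (totals[day], observed[day] / expected[day]) for day in expected}

daily_from_minutes = to_daily(minute_counts)
for day, (total, coverage) in sorted(daily_from_minutes.items()):
    print(day, total, f"coverage={coverage:.2f}", "other source:", daily_counts.get(day))
```

Keeping the coverage alongside the total makes it possible to tell a genuine disagreement between sources apart from a gap caused by a blackout.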

Layer upon layer of complexity arises just for relatively straightforward counts. The above says nothing about subsets of counts such as weekdays vs weekends or one store vs another, or about fusing data related to other attributes like demographics, income, and the like.

All of these considerations are exacerbated by the fact that data is not always available concurrently. More often than not, it becomes available incrementally. How, then, can anyone combine multiple sources of data in a way that helps explain the past and predict the future?

Building data fusion approaches from the ground up into a data processing platform is essential. From a conceptual perspective, adopting a Bayesian worldview is essential, because its sound mathematical formalisms allow incremental assimilation and propagation of updates and inferences across the complete data stack. Data-driven thinking has to pervade the architectural design of the data platform. Current approaches treat architecting data processing pipelines as an activity independent of the design and use of data: the classic split between data engineering and data science. Retrofitting Bayesian data views onto frameworks that are primarily non-Bayesian is cumbersome, to say the least. Current approaches are ad hoc and highly heuristic, and one does not really know when they will work and when they will not. Building data teams with an underlying Bayesian vision is still in its early days.
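
To make "incremental assimilation" concrete, here is a minimal sketch of one textbook Bayesian update, the Gamma-Poisson conjugate pair, applied to daily visitor counts. It is an illustration under strong assumptions and not a method described in this article: it treats every count as an observation of the same underlying daily rate, which ignores the heterogeneity discussed above. The class name, prior values and counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GammaBelief:
    """Gamma(shape, rate) belief about a Poisson rate, e.g. visitors per day."""
    shape: float
    rate: float

    def update(self, counts):
        """Conjugate update: add the observed total to shape and the number of days to rate."""
        counts = list(counts)
        return GammaBelief(self.shape + sum(counts), self.rate + len(counts))

    @property
    def mean(self):
        """Posterior mean of the daily rate."""
        return self.shape / self.rate

prior = GammaBelief(shape=10.0, rate=0.5)    # hypothetical prior, mean 20 visitors/day
after_first = prior.update([18, 25])         # counts that arrive first
after_second = after_first.update([22])      # counts that arrive later
batch = prior.update([18, 25, 22])           # the same data assimilated in one go

print(after_second, batch, round(after_second.mean, 2))
```

The sequential and batch results coincide, which is the property that lets data be assimilated as it becomes available rather than reprocessed from scratch.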

Data fusion as it stands today is not a well-defined problem. Every data platform, however big or small, faces this issue. Developing a viable, extensible approach even for your own organisation requires your technology team to stay informed about the constraints of data fusion, understand its implications for your products and services, and invest in the underlying R&D activities with rigour to forge a path forward.

For more information, please reach out to me via https://near.co

Categories: Big Data
Tags: consumer data, Data, Data analytics, Data integration, data sources

About Madhusudan Therani

Madhu is responsible for developing Near's ambient intelligence platform and associated products. He leads the engineering and data science efforts at Near based out of Silicon Valley.

He is a seasoned tech entrepreneur, a former academic, and has been building software and hardware for the past couple of decades. He has a proven track record of working on large scale data analysis, machine learning, text analysis, and decision-making models in a variety of domains including engineering design, product lifecycle management, online search and computational advertising. He is an alumnus of Carnegie-Mellon Univ., with interests in real-world applications of AI and Robotics.

On the personal front, Madhu is a movie buff, loves travelling and playing a round of golf or two in his free time.
