ACHIEVING DATA QUALITY VIA AI

05 Nov 2019

By Danyo Dimitrov

Quality is King

Danyo Dimitrov takes a close look at data quality and presents a real-life Artificial Intelligence (AI) solution.

In modern market research, data collection is becoming more and more digitalized. Agencies often form a wide variety of data blends in order to best fit their objectives and budgets. Regardless of the data source, data quality always has to be verified in order to obtain meaningful insights. The task of keeping poor performers, speedsters, over- and under-reporters and all sorts of other quality deviations at bay is much harder when so many data sources are merged and information flows at such a fast pace. As with many other modern riddles, technology poses new problems, but it also provides the answers.

Data Quality – a Definition

Arguably the simplest description of perfect survey data would be data where every respondent has given an honest answer to every question asked, reflecting their actual opinion on the matter while paying attention to the exact wording of each question. Having all these factors simultaneously verified and accounted for in every single research survey database is the “unattainable research beauty standard” we are all after. Leaving irony and complaining aside, this is what we need to deliver, and this is what our clients expect of us, isn’t it?
When thinking about verified, clean data, one’s standpoint always sculpts the definition differently. From the client standpoint, high-quality data generates enough weight for all specific sub-divisions of the target, thus producing insights for the full array of audiences surveyed. Agencies view high-quality data as whatever their clients perceive it to be, while also ensuring no questionable responses creep their way into the data. The third definition comes from the sample provider point of view. Providers obviously want to maximize the interviews their clients, most often the agencies, accept as valid, and minimize the counts discarded as “bad data”. For providers, “bad data” not only reduces revenue but, most importantly, seriously damages their business image and trustworthiness.
Regardless of the differences in perception, it is clear that all stakeholders need to keep data quality in mind in any research project. All of the definitions listed above, however, need to be covered when data quality is tackled, at both the conceptual and the technological level.

Factors Affecting Data Quality

The mix of variables in any research project, from putting together the questionnaire to choosing the scripting tool and selecting the sample providers, adds an extra layer of complexity as to where things can go wrong and generate poor-quality data. Below are some of the main causes and what makes each of them a problem, luckily a solvable one.

Sample Sources

Historically, the first steps towards automated data quality checks started with verifying uniqueness when blending different sample sources. It became obvious very early on that using the IP address as the sole, or even the main, identifier is not effective enough, especially in parts of the world where internet service providers are not legally obliged to tie IPs to specific users persistently. Still, bigger sample sizes or more complex target audiences often require using more than one provider. A respondent may have an account in more than one local online panel, or be reached by several providers to take part in the study. Considering the various ways to contact a potential survey participant, there is a strong chance that someone enters the same study more than once, which is clearly not advisable, either from a participant-experience or a quality point of view. Still, these cases involve actual people responding to survey questions. The problem becomes much bigger when digital (automated) fake data generators, the so-called survey farms, are taken into consideration. Their ways of bypassing data quality controls become more and more elaborate, making them a very complex target to tackle.
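To make the uniqueness idea concrete, here is a minimal sketch of combining several identifiers rather than relying on the IP address alone. The field names, the fingerprinting scheme and the deduplication rule are illustrative assumptions, not the mechanism of any specific tool.

```python
# Minimal sketch of multi-signal duplicate detection across blended sample
# sources. Field names and the hashing scheme are illustrative only.
import hashlib
from dataclasses import dataclass

@dataclass
class Entry:
    ip: str
    machine_id: str       # e.g. a device fingerprint supplied by the survey platform
    panel_member_id: str  # provider-specific respondent identifier

def fingerprint(entry: Entry) -> str:
    """Combine several identifiers so no single one decides uniqueness."""
    raw = f"{entry.ip}|{entry.machine_id}"
    return hashlib.sha256(raw.encode()).hexdigest()

def deduplicate(entries: list[Entry]) -> list[Entry]:
    """Keep only the first occurrence of each combined fingerprint."""
    seen: set[str] = set()
    unique = []
    for e in entries:
        fp = fingerprint(e)
        if fp not in seen:
            seen.add(fp)
            unique.append(e)
    return unique
```

In practice more signals (browser characteristics, panel account links, behavioural patterns) would feed the fingerprint, which is exactly why a single identifier such as the IP address is not enough on its own.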

Questionnaire Design and Visualization

It is commonly known that visually appealing surveys are much more engaging for the respondent, make the experience more enjoyable and deliver less biased data. What is often overlooked, however, are the types of questions used and the sequence they appear in. Long grids, plain radio buttons and the same question type repeated in a row all tend to bring attention and engagement levels down. Formulating questions where the answer is too obvious, especially in the pre-screening section, is also a sure path to poor data and distortion.

Survey Length

Another commonly known fact is that people’s attention can be held for only so long in a survey. A lengthy questionnaire, especially in combination with the same question types in sequence, can exhaust the patience of even the most tolerant respondent. In that respect, it is crucial to keep the number of questions within a reasonable range, but also to present the topic to the respondent in a visually appealing way.

Lower Incidence Rate

Interviewees are almost always reward-motivated. Be it a points system or another incentive, it is clear that a purely altruistic desire to help is not the only drive for people to invest time from their day in a survey. With that in mind, they are not happy when they get screened out, which leads to an intrinsic drive to “trick the system” by over- or under-reporting, testing their luck at getting through the screening section of the questionnaire.

Cross-Cultural Awareness

Although in essence this topic relates to questionnaire design, it is so significant that it can drastically change or completely derail any attempt to automate quality checks in a survey. One needs to be aware of many details when fine-tuning the automated process: it starts with simpler things like nuances of language and expression, and extends to the inner drive to match expectations, variations in which topics can and cannot be discussed openly, the ways likability is expressed, and many others. Unfortunately, clients are sometimes so desperate to cover the “poor data quality” risk that they set the controls too strictly and end up with biased data, which does not properly reflect the actual opinion landscape in the given market.

Too Much Data Quality?

If we circle back to preserving data quality, a logical question comes up: with so many variables in play, isn’t this far too complex a problem to handle? On the one hand, yes, there are multiple levels at which cues of poor data can be missed, especially with the volume of live surveys, client requirements and targeted output contemporary market research operates with. On the other hand, no: clearly identifying the problematic areas gets half the work done. The second half of the solution comes from having the right mix of technology, a sufficient number of cases to work with, and the experience required to set the controls right. One issue that is not often discussed in our field is that pursuing data quality too vigorously is just as bad as leaving poor data in. There are methodologies the JTN panels have been exposed to that cut off too much data. The “too much” comes from removing data entries based on mere doubts; playing it too safe, in a way. This leads to interesting outliers being taken out, so the insights and overall analysis become too normalized and no longer correspond well with reality.

How JTN is Doing it Differently

To avoid confusion at this stage: striving for data quality is crucial, but too much data quality is not good either. So where is the balance? It comes from the methodology blend, and this is where the focus of JTN Research’s development in the field lies. Our tool Field Detective (FD) is the result of more than a decade of work in that direction, and it is constantly tested and re-adjusted to best reflect the current situation and requirements. FD has three major components in its methodology blend.

Tracking the Right Raw Data Fields

Depending on the sample source these fields vary, but points such as a reliable Geo IP check, machine ID, completion timings and others definitely have to be present in real time. That whole set has to be stored in a way that does not overload the technological infrastructure, yet remains easily accessible for the remaining two components.
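As an illustration only, a compact record for the signals named above might look like the sketch below. The field names and the JSON-line serialization are assumptions for this sketch, not the actual Field Detective schema.

```python
# Illustrative structure for the raw quality signals mentioned above
# (Geo IP result, machine ID, completion timing). Field names are assumptions.
import json
from dataclasses import dataclass, asdict

@dataclass
class QualitySignal:
    respondent_id: str
    survey_id: str
    geo_ip_country: str   # outcome of the Geo IP lookup
    machine_id: str       # device/browser fingerprint
    started_at: float     # Unix timestamps, so completion time is derivable
    completed_at: float

    @property
    def completion_seconds(self) -> float:
        return self.completed_at - self.started_at

def to_compact_record(signal: QualitySignal) -> str:
    """Serialize to a small JSON line: cheap to store, easy to query later."""
    record = asdict(signal)
    record["completion_seconds"] = signal.completion_seconds
    return json.dumps(record, separators=(",", ":"))
```

Keeping each record this small is one way to satisfy the constraint in the paragraph above: lightweight enough not to strain the infrastructure, yet immediately usable by the other two components.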

Historic Reference and Supply Ranking

Every survey participation and its outputs are recorded and, again, kept readily available. These range from usual participation patterns and any repetitiveness in the day and time a person usually devotes to surveys, to the devices used for participation, and a few more. With an online panel there are obviously far more points of historic reference, and the research industry has already noticed that, hence non-panel sources are limited to a relatively small portion of the overall survey database. Whenever external providers of any type are used, FD has a ranking system in place that constantly monitors data outputs and adjusts each provider’s rank according to the data quality results demonstrated. A similar ranking that relies on internal panel metrics is applied to the JTN proprietary panel supply. Overall, this is an advanced automated feature. It cannot be qualified as AI in itself, but rather as a component of the AI.
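A minimal sketch of what such a moving rank could look like is below, assuming an exponentially smoothed acceptance rate per provider. The smoothing factor and the formula are illustrative assumptions, not the actual FD ranking logic.

```python
# A minimal sketch of a supplier ranking that moves with observed data quality.
# The smoothing factor and the exponentially weighted acceptance rate are
# assumptions for illustration only.
class SupplierRank:
    def __init__(self, initial_rank: float = 0.9, smoothing: float = 0.05):
        self.rank = initial_rank    # share of interviews expected to pass QA
        self.smoothing = smoothing  # how quickly recent batches move the rank

    def update(self, accepted: int, rejected: int) -> float:
        """Blend the latest batch's acceptance rate into the running rank."""
        total = accepted + rejected
        if total == 0:
            return self.rank
        batch_rate = accepted / total
        self.rank = (1 - self.smoothing) * self.rank + self.smoothing * batch_rate
        return self.rank

# Example: a batch with many rejected completes nudges the provider's rank down.
rank = SupplierRank()
print(rank.update(accepted=160, rejected=40))  # ~0.895
```

The point of a continuously updated rank is that poorly performing sources gradually receive less of the allocation, without a human having to re-evaluate every provider on every project.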

AI in Full Swing

This piece of the puzzle is what makes the real difference when it comes to making real decisions and evaluating progress and potential. The AI module brings together the two components described above and develops alternative scenarios for both the prefield (feasibility check) and fieldwork stages. Relying on a complex hierarchical algorithm that takes into account all data quality, survey-related and other characteristics, it predictively delivers the best cases for the sales or project manager to work with. Using these, the person decides how best to structure the sample to avoid poor data and hit the pre-set target within an optimal time frame, without using too much panel resource.
The AI module is also crucial when the issue of survey farms is addressed. It greatly outperforms any of our previous digital algorithms when it comes to cross-referencing our internal respondent IP databases with verified external Geo IP pools and reputation tracking. In the ongoing battle with survey farms, the FD AI is the assistant that keeps the gates safe.
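The hierarchical algorithm itself is not published, so the toy sketch below only illustrates the general idea of ranking alternative sample blends by predicted quality, feasibility and fieldwork time. The weights, the fields and the scoring formula are assumptions for illustration.

```python
# Toy illustration of ranking alternative sample blends for a project manager.
# The weights and scoring formula are assumptions; the article describes a far
# more complex hierarchical algorithm inside Field Detective.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    expected_quality: float  # predicted share of completes passing QA (0-1)
    feasibility: float       # predicted chance of hitting target counts (0-1)
    days_to_complete: float

def score(s: Scenario, max_days: float = 14.0) -> float:
    """Higher is better: quality and feasibility up, fieldwork time down."""
    speed = max(0.0, 1.0 - s.days_to_complete / max_days)
    return 0.5 * s.expected_quality + 0.3 * s.feasibility + 0.2 * speed

scenarios = [
    Scenario("own panel only", 0.95, 0.70, 12),
    Scenario("own panel + external provider", 0.90, 0.95, 7),
]
best = max(scenarios, key=score)
print(best.name)  # the blend offered to the manager first
```

However the real scoring is done, the output is the same in spirit: a short list of candidate scenarios that the sales or project manager can act on, rather than a single opaque decision.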
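In the spirit of that cross-referencing, a simplified check might look like the sketch below: compare the declared country with the Geo IP result and consult a reputation blocklist. The in-memory lookup tables stand in for real Geo IP and reputation services and are assumptions for this sketch.

```python
# Simplified cross-check: Geo IP mismatch or a flagged IP fails the entry.
# The lookup dict and the blocklist set stand in for external services.
def passes_ip_checks(
    ip: str,
    declared_country: str,
    geo_ip_lookup: dict[str, str],
    flagged_ips: set[str],
) -> bool:
    """Return False when the entry looks like a farm or a geo mismatch."""
    if ip in flagged_ips:             # known bad reputation
        return False
    resolved = geo_ip_lookup.get(ip)  # country resolved from the IP
    if resolved is None or resolved != declared_country:
        return False
    return True

# Example usage with toy data (documentation-range IPs)
geo = {"203.0.113.7": "DK"}
flagged = {"198.51.100.23"}
print(passes_ip_checks("203.0.113.7", "DK", geo, flagged))    # True
print(passes_ip_checks("198.51.100.23", "DK", geo, flagged))  # False
```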

Humans are Still Vital

The multiple benefits of a completely automated feature may suggest it has reached a stage where human involvement is no longer needed. Luckily for us humans, we are still vital to making this AI process useful. The truth of the matter is that the input data coming from the commissioning client is very often incomplete, projected metrics sometimes do not match actual ones, and many other unpredictable factors come into play at the last moment. This is why the Field Detective AI provides the human expert with options, which they evaluate based on their best understanding of the case at hand. This human touch is the final, unquantifiable “pinch of salt” that makes the data quality recipe a successful one.
