Preparing your data, a must | To advise

Data collection is therefore one of the first steps. If this is not done correctly, the whole solution will suffer. Initially, the data is never “clean”, so it must be cleaned. But “there are a host of difficulties that can be faced in working with this raw material and getting something out of it,” warns Sébastien Duguay, solution architect at Videns Analytics.

Nobody is safe from this type of error, not even NASA! Sébastien Duguay examined the data from the Cassini-Huygens mission to Saturn. Even there, he found missing values and inconsistent formats.

“So don’t be embarrassed to have poor quality data that you have to spend time on. Even within large-scale projects led by a solid team, we should not assume that we will have exemplary data quality,” concludes Sarah Legendre-Bilodeau, CEO of Videns Analytics.


There are two types of errors when collecting data: human errors and technical errors.

The error that Sébastien Duguay most often encounters is intentional human error. “Often the data must be formatted in a certain way, but if the format has not been well thought out or communicated, users will find a way to circumvent these processes to make their work easier,” says the expert.

Sarah Legendre-Bilodeau agrees. According to her, this often happens in operational systems. For example, in insurance, when an employee meets with a client and has a lot of information to enter, they may be tempted to cut corners to save time.

There are also unintentional errors. People can misapply the established rules simply because they do not understand them. Errors of forgetfulness or inattention also occur.

Technical errors include mistakes in programming, methods, software versions or the parameters used.


Several categories of errors come up again and again. The first concerns data types: the data is not in the expected format and is therefore misclassified. You also have to be careful with format conversions.

Sarah Legendre-Bilodeau remembers a company that used the branch number when collecting data. Unfortunately, this identifier was fed into the solution as a numeric variable, which skewed all the results.
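The branch number problem can be illustrated with a minimal Python sketch. The field names (`branch_number`, `sales`) and the sample records are hypothetical and are not taken from the company in question. The idea is simply to keep the identifier as text and use it only as a grouping label.

```python
# Hypothetical records: branch_number is an identifier, not a quantity.
records = [
    {"branch_number": "007", "sales": 120.0},
    {"branch_number": "012", "sales": 80.0},
    {"branch_number": "007", "sales": 100.0},
]


def sales_by_branch(rows):
    """Total sales per branch, treating the branch number as a label."""
    totals = {}
    for row in rows:
        # Keep the identifier as text: no arithmetic, and leading zeros survive.
        branch = str(row["branch_number"])
        totals[branch] = totals.get(branch, 0.0) + row["sales"]
    return totals


totals = sales_by_branch(records)
print(totals)
```

Treating the same column as a number would suggest that branch 012 is somehow "larger" than branch 007, which has no meaning.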

A second category of errors concerns data ranges. Examples include dates in the future where that is impossible, or restaurants rated six stars out of five, reports Sébastien Duguay. It is important to detect these aberrations before integrating them into your solution.
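As an illustration, here is a small Python sketch of a range check for the two examples given, future dates and ratings above five stars. The field names (`visit_date`, `stars`) and the sample rows are assumptions made for the example.

```python
from datetime import date


def find_range_errors(rows, today=None):
    """Return (row index, message) pairs for values outside their valid range."""
    today = today or date.today()
    errors = []
    for i, row in enumerate(rows):
        if row["visit_date"] > today:
            errors.append((i, "visit_date is in the future"))
        if not 0 <= row["stars"] <= 5:
            errors.append((i, "stars outside 0-5"))
    return errors


# Hypothetical sample: the second row has a future date and a six-star rating.
rows = [
    {"visit_date": date(2024, 1, 10), "stars": 4},
    {"visit_date": date(2031, 5, 1), "stars": 6},
]
errors = find_range_errors(rows, today=date(2025, 1, 1))
print(errors)
```

Reporting the offending rows, rather than silently dropping them, makes it possible to trace each aberration back to its source.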

Another big problem is duplicates. “We often see this in data migration projects involving CRMs, which are important systems in large organizations,” reports Sarah Legendre-Bilodeau. Often, this type of migration is carried out because users are not satisfied with the old system, but if the data is not cleaned, users will certainly not be satisfied with the new system either.
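A minimal sketch of deduplication before a migration might look like the following. It assumes, purely for illustration, that customers are matched on a normalized e-mail address. Real CRM data usually needs a richer matching rule.

```python
def normalize_email(email):
    """Lower-case and trim an e-mail address so trivial variants compare equal."""
    return email.strip().lower()


def deduplicate(customers):
    """Keep the first record seen for each normalized e-mail address."""
    unique = {}
    for customer in customers:
        key = normalize_email(customer["email"])
        unique.setdefault(key, customer)  # later duplicates are ignored
    return list(unique.values())


# Hypothetical CRM extract with one duplicate hidden by case and whitespace.
customers = [
    {"name": "A. Tremblay", "email": "a.tremblay@example.com"},
    {"name": "Alex Tremblay", "email": " A.Tremblay@Example.com "},
    {"name": "B. Roy", "email": "b.roy@example.com"},
]
cleaned = deduplicate(customers)
print(len(cleaned))
```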

Finally, there may be errors in alignment and construction. In terms of construction, it is important to ensure that the independent variables used to explain the situation are not built from the very thing you are trying to predict. In large companies, we often do not know how the explanatory variables are constructed, because we do not know which team they come from. Sarah Legendre-Bilodeau says that when the model looks too good, there’s a problem, especially when the data comes from customers.
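One rough way to spot a variable built from the target is to flag features that are almost perfectly correlated with it. The sketch below is only a heuristic, not a method described by the experts. The feature names and the 0.98 threshold are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sd_x * sd_y)


def suspicious_features(features, target, threshold=0.98):
    """Names of features whose correlation with the target looks too good."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical data: "refund_issued" is only known once the outcome has happened.
target = [0, 1, 0, 1, 1]
features = {
    "refund_issued": [0, 1, 0, 1, 1],
    "age": [25, 40, 31, 52, 38],
}
flagged = suspicious_features(features, target)
print(flagged)
```

A flagged feature is not proof of leakage, but it is a good reason to ask the team that produced it how it was built.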

Prepare your data

To prepare your data, Sébastien Duguay offers some advice. The first is to delete irrelevant data. “Obviously you have to have the cleanest possible data, but that’s also a security consideration. Sometimes we have data that identifies individuals by name, but we don’t need it to develop the model. It is therefore better to delete it. It will also help to avoid mistakes!” he says.
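In code, this can be as simple as keeping an explicit list of the columns the model needs. The column names below are hypothetical.

```python
def keep_columns(rows, needed):
    """Return copies of the rows containing only the needed columns."""
    return [{column: row[column] for column in needed} for row in rows]


# Hypothetical raw extract that includes identifying fields.
raw = [
    {"name": "A. Tremblay", "phone": "555-0100", "age": 41, "premium": 820.0},
    {"name": "B. Roy", "phone": "555-0101", "age": 29, "premium": 640.0},
]
model_data = keep_columns(raw, needed=["age", "premium"])
print(model_data)
```

Using an allowlist rather than a list of columns to drop means that a new identifying field added upstream does not slip into the model by accident.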

A second step is to deduplicate your data, then correct structural errors at the point of collection. “Prevention is always better than trying to correct the problem,” maintains the expert.

It is also essential to manage missing data. The goal is not to delete the row, because missing data is often information in itself, says Sarah Legendre-Bilodeau. “You have to understand why the data is missing: is it a customer segment that has an aversion to sharing data? It is information, especially in marketing!” She recommends keeping and transforming missing values, because they often bring a lot of value to the model.
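One common way to keep and transform missing values is to add an indicator column and then fill the gap. The sketch below assumes a hypothetical `income` field and uses the median as the fill value. Both choices are illustrative, not a recommendation from the experts.

```python
from statistics import median


def add_missing_indicator(rows, field):
    """Flag missing values in a new column, then fill them with the median."""
    observed = [row[field] for row in rows if row[field] is not None]
    if not observed:
        raise ValueError(f"no observed values for {field!r}")
    fill_value = median(observed)
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input untouched
        new_row[f"{field}_missing"] = row[field] is None
        if row[field] is None:
            new_row[field] = fill_value
        result.append(new_row)
    return result


# Hypothetical survey data where one customer declined to share their income.
survey = [{"income": 50000}, {"income": None}, {"income": 70000}]
prepared = add_missing_indicator(survey, "income")
print(prepared)
```

The `income_missing` flag lets the model learn from the fact that a value was withheld, which would be lost if the row were deleted.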

It is also important to study outliers in order to identify their source and prevent new ones from occurring.
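As a starting point for studying outliers, the sketch below flags values with the interquartile range (IQR) rule. This is one common convention among several, and the 1.5 multiplier and sample values are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one value far from the rest.
measurements = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(measurements)
print(outliers)
```

Flagging is only the first step. Each flagged value still has to be traced back to its source to decide whether it is an entry error or a genuine extreme.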

Sarah Legendre-Bilodeau recommends doing a lot of descriptive statistics in order to catch possible problems early, even when we are in a hurry to get to the model.
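A quick descriptive summary of each column often surfaces the problems described above. Here is a minimal sketch for one numeric field. The `premium` field and the sample rows are hypothetical.

```python
from statistics import mean


def describe(rows, field):
    """Basic descriptive statistics for one numeric field."""
    values = [row[field] for row in rows if row[field] is not None]
    return {
        "count": len(rows),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Hypothetical policy data with one missing premium.
policies = [{"premium": 100.0}, {"premium": None}, {"premium": 300.0}]
summary = describe(policies, "premium")
print(summary)
```

An unexpected minimum, maximum or missing count is usually the first visible sign of a type, range or collection error.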

Practical advice

  • Resist the urge to fit a model right from the start;
  • Work on your textual data;
  • Keep an eye on qualitative variables;
  • Beware of over-performing predictive models;
  • Understand the data update process;
  • Think about development and deployment.
