Cleaning Up Bad Data

Why and How to Clean Agronomic Data

There is a lot of talk about rescue treatments when a crop is in danger of severe loss due to an under-application of herbicide, a lack of fertility, or an impending infestation. That same conversation applies to agronomic data.

Plenty of challenges come with managing data: defining when one crop year ends and the next begins, establishing boundaries, and, topping the list, trying to work with inconsistent, inaccurate, or ‘dirty’ data.

While dirty data often refers to mistakes made when manually entering it into a spreadsheet or other software, the concept can also be applied to data that’s simply not set up correctly, recorded correctly, or done with consistency. The whole point of collecting data is to use it to make better agronomic decisions. If that data isn’t accurate or seems questionable, it can’t be used confidently or to its fullest potential.

But there is hope for this data.

Rescue Treatments
The good news is, cleaning up – or applying a ‘rescue treatment’ to – dirty data may be an option in some situations. The bad – or maybe I should say, challenging – news is, there are a couple of things to keep in mind with any rescue treatment.

First, they do not always work. There are times when the data’s quality simply does not meet the standard necessary to be confident in it. Especially when dealing with items like test plots or field trials, the accuracy of a guidance system’s point data and the ability to line up base layers can be critical. If one dataset can’t be overlaid on the other, the digital copy of the trial may have little value for analytics or decision-making.

Second, they do not always apply to everything. Unfortunately there is no “cure-all” when it comes to dealing with farm data. There are times when the problem is hardware related. Every now and then the problem lies with the software or in file corruption. Other times, the problem lies solely with the operator. In those cases, it takes a change in the actual method or practice by which the data is collected.

While rescue treatments won’t be a slam-dunk for every dirty data situation, there are a few possibilities.

Rescue Situation #1: Harvest Data
There are many situations that may require the cleaning up of yield data. Many of these, however, require an understanding of how a combine’s sensors send information to the display to calculate and document yield. Understanding how the movement of grain within the combine can create a delay in documentation may provide valuable insight when interpreting low-yielding areas. These low-yielding areas may actually be data errors that can easily be cleaned up and need no further agronomic attention.

In a situation where a few passes of harvest data seem out of place, a tool like Yield Editor could be a lifesaver, helping clean up some of those gaps.

A free tool available through USDA ARS, Yield Editor gives the ability to visualize passes and make edits to the dataset. The system accepts two file formats – Ag Leader Advanced Text and John Deere GreenStar Text. If working with data in a format other than these, it must first be converted.

This software works to delete overlaps and re-adjust shifts in flow delays. Flow delays are the gap in time between the logging of the GPS point by the combine’s receiver and the logging of the weight by the mass flow sensor.
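To make the flow-delay idea concrete, here is a minimal sketch of the correction it describes: the flow reading for a given spot is logged several records after the GPS point for that spot, so the fix is to re-pair each position with the flow value logged a fixed number of intervals later. The record layout and delay value are assumptions for illustration, not Yield Editor’s actual internals.

```python
# Hypothetical flow-delay correction on (lat, lon, flow) records.
# Grain takes time to travel from the header to the mass flow sensor,
# so each GPS point is re-paired with the flow value logged
# `delay_intervals` records later.

def shift_flow_delay(points, delay_intervals):
    """Return records where each position carries the flow value
    logged `delay_intervals` records after it."""
    shifted = []
    for i in range(len(points) - delay_intervals):
        lat, lon, _ = points[i]                   # keep the position...
        _, _, flow = points[i + delay_intervals]  # ...take the later flow value
        shifted.append((lat, lon, flow))
    return shifted

# Toy pass: the first two points show zero flow only because the grain
# hadn't reached the sensor yet.
pass_data = [(40.0000, -93.0, 0.0), (40.0001, -93.0, 0.0),
             (40.0002, -93.0, 5.1), (40.0003, -93.0, 5.3),
             (40.0004, -93.0, 5.2)]
print(shift_flow_delay(pass_data, 2))
```

Note that the last `delay_intervals` positions drop out, which is one reason edits like this trade away some records, as discussed below.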

Remember that, in the end, every time data is removed or re-adjusted within a dataset, there’s risk of losing additional attributes, such as machine data. Modified or manipulated data also removes some of the realities that might actually exist in the field. In some cases, that extra pass may have been very necessary to recover additional grain that is otherwise not taken into account, or perhaps that “zero” value was actually an area that was planted but produced no crop.

Rescue Situation #2: Corrupt Data Files
If the issue with a data file is that it seems corrupt or is unable to be uploaded and extracted, there are a few things to look at.

First, make sure the software being used supports the file type. Some software does not support newer (or older) generations of given product lines, though there are some programs which can help with file conversion.

Second, make sure that the file structure is appropriate. If the software is having trouble reading the file, a couple things may be going on.

The lowest-hanging-fruit “issue” – and fix – may be that the file being imported simply needs to be zipped.
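For that simple fix, a folder of raw monitor files can be zipped with a couple of lines of Python. The folder name here is an assumption for illustration.

```python
import os
import shutil

# Hypothetical folder of raw monitor files; the name is an assumption.
os.makedirs("harvest_2023", exist_ok=True)

# Zip the whole folder so import tools that expect a .zip will accept it.
archive_path = shutil.make_archive("harvest_2023", "zip", root_dir="harvest_2023")
print(archive_path)
```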

Another area to evaluate is the file structure. A GreenStar 3 (or 2630) file, for example, is simply a collection of folders (profile/RCD/EIC) with a file found within the (in this case) EIC folder named global.ver that serves as the “index” file. Think of this file as the “Table of Contents”. In some cases, the index file is missing for whatever reason and all it takes to correct it is to create and save a replacement file in the right place in the file structure. This scenario is very generalized, but true of many machine file formats.
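As a hedged sketch of that check, the snippet below walks a card’s folder tree looking for an EIC folder and its global.ver “index” file. The nesting and the root folder name are assumptions based on the description above, and because the file’s actual contents vary, the sketch only reports whether the index is present rather than fabricating one.

```python
from pathlib import Path

def check_index(card_root):
    """Report whether a GreenStar-style folder tree contains its
    'index' file (global.ver inside an EIC folder)."""
    root = Path(card_root)
    eic_dirs = [p for p in root.rglob("EIC") if p.is_dir()]
    if not eic_dirs:
        return "no EIC folder found - file structure looks wrong"
    if any((d / "global.ver").is_file() for d in eic_dirs):
        return "index file present"
    return "global.ver missing - restore it from a known-good card"

# "GS3_card" is a placeholder for the card's root folder.
print(check_index("GS3_card"))
```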

Rescue Situation #3: Post-Calibration of Data
The best way to ensure the accuracy and quality of a dataset is to make sure all machines and sensors have been calibrated prior to the pass. Of course, that doesn’t always happen, or isn’t always feasible. In those cases, virtually all software has the built-in ability to “post-calibrate” a dataset. 

What that means is that the documentation logs a total – all bushels of grain or pounds of seed added up. If the actual number is known and doesn’t match that total, it’s simple to “clean” the dataset by giving the software the known final result and telling it to re-calculate all data points so that the values match.
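The arithmetic behind that re-calculation is a simple rescale: multiply every logged value by the ratio of the known total to the logged total. This is a minimal sketch, with invented values; real software applies the adjustment per-load or per-region, not always uniformly.

```python
# Minimal post-calibration sketch: scale every logged value by the ratio
# of the known "true" total (e.g. a scale ticket) to the monitor's total.
# The numbers below are invented for illustration.

def post_calibrate(values, true_total):
    factor = true_total / sum(values)
    return [v * factor for v in values]

logged = [180.0, 195.0, 210.0, 165.0]       # monitor total = 750
calibrated = post_calibrate(logged, 800.0)  # scale ticket says 800
print(round(sum(calibrated), 1))  # 800.0
```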

Finally, each software package “post-calibrates” differently. For example, a few actually eliminate “zero” values in all datasets as they are imported. This is important to know prior to attempting to export and “clean” the data, because some of the shifts that could be made in Yield Editor have the potential to over-accentuate inaccuracies rather than improve them. Knowing what the software’s algorithms are doing behind the scenes will go a long way in helping make the best decision on how to generate quality datasets for future decision-making.

Looking Ahead

There are also some situations where data can’t be rescued. Yield Editor and post-calibration can only do so much.

In those cases, it might be best to cut losses and move on. While this isn’t what most want to hear, try thinking of it as an opportunity to learn and think ahead to better set up the data in the future.


About the Author
Jacob Maurer has been working as an agronomist for 10 years. At RDO Equipment Co., he works with team members and growers in the Midwest region, focused on education in areas including data management and precision spraying. He’s a frequent speaker at universities and industry events, and a regular contributor. Connect with him on Twitter @RDOJacobM.

Interested in more about precision agriculture and data management? Read Jacob’s article about taking a smaller approach to big data.

Learn more about precision agriculture offerings from RDO Equipment Co. Browse listings of available used agriculture equipment for sale or find options by visiting your local RDO Equipment Co. store.