News & Updates

Welcome to CANalitica’s News & Updates page

Here, you’ll find the latest information on our projects, industry insights, and exciting developments that keep us at the forefront of advanced data analytics, AI integration, and modeling solutions.

Jan 6, 2025 – Weather Data Transformation

CANalitica demonstrates real benefits using simple data transformation techniques. The source is freely available data based on the Global Summary of the Day (GSOD) weather dataset provided by the National Centers for Environmental Information (NCEI) in Asheville, NC (https://www.ncei.noaa.gov/). Converting the GSOD data from 1950 to 2024 from CSV to Parquet achieved the following results:

  1. Compression & Reduced Storage
    • Parquet is a columnar format that supports efficient compression.
    • Significant file-size reduction compared to CSV: size on disk fell from 36 GB to 2.3 GB (about 94% smaller), lowering storage costs.
    • The number of files dropped from over 570,000 CSV files to just over 500 Parquet files.
  2. Faster Reads & Lower Query Costs
    • Because Parquet is columnar, query engines only read the columns they need, which can greatly reduce I/O.
    • Queries often run much faster, and on services that bill by the amount of data scanned, you pay less per query.
  3. Efficient Compression & Encoding
    • Parquet applies column-level compression and encoding, so repetitive data (e.g., station IDs, location fields) compresses very well.
    • CSV is a row-oriented text format with no column-aware compression, so it’s usually larger and more costly to process.
  4. Schema Enforcement
    • Parquet embeds a schema, making data types and column consistency explicit.
    • CSV files can be prone to schema drift and type inconsistencies (e.g., a column that occasionally contains strings instead of numbers).
  5. Better Handling of Complex Data
    • Parquet can store nested structures (arrays, maps), which are awkward to handle in CSV.
    • Even if your current dataset is flat, Parquet’s flexibility can be handy for future data requirements.

To download the data (~2.3 GB), you may use the following links:

Overview of Available Data: The weather data files are organized by continent and contain historical weather information. Available continents include:

Africa
Antarctica
Asia
Australia
Europe
North-America
South-America

Depending on your browser settings, the file may download automatically, or you may be prompted to choose a location to save the file.

The files are in .parquet format, a columnar storage format optimized for data processing tasks. To work with them, you’ll need software that supports Parquet, such as Apache Spark, pandas with PyArrow, or other data analysis tools.

The data described here are intended for free and unrestricted use in research, education, and other non-commercial activities.