Bulk Data Onboarding

This guide will help you understand how Harmonic’s bulk data is structured, organized, and best utilized in your data environment.

Written by Harmonic Team
Updated over a month ago

Harmonic Bulk Share offers access to our entire universe of company and people records, refreshed weekly.

Supported Formats

Harmonic supports the following Bulk Share formats:

  • BigQuery

  • Snowflake

  • Google Cloud Storage Bucket

  • AWS S3 Bucket

Overview

Harmonic maintains two of the most comprehensive datasets available, with detailed information on millions of organizations and people worldwide. Our data is sourced from numerous reliable channels and undergoes rigorous processing to ensure accuracy and usefulness.

Core Data Architecture

Harmonic maintains two interconnected datasets that form the foundation of our intelligence:

Companies

45M+ total records

Our company dataset provides a comprehensive view of organizations worldwide. About 27M of these companies are “surfaceable” in Harmonic Console, meaning they meet our data quality thresholds for venture-backable organizations and user-facing applications.

People

200M+ profiles

Our people dataset contains professional profiles that are linked to companies through current and historic employment relationships, providing context about team composition, professional movements, and industry expertise.

Understanding Data Quality, Visibility, and Freshness

We maintain different visibility levels across product offerings for company data based on quality and completeness:

| Aspect | Surfaceable Companies (~27M) | Additional Companies (~18M) |
| --- | --- | --- |
| Data Quality | Core fields (see Quality Criteria for Surfacing below) | May have partial information |
| Verification | Verified through multiple signals | Limited verification |
| Usage | Shown in console & bulk data share | Shown in bulk data share (not in console) |
| Freshness/updates | Frequently refreshed (see Data Freshness) | Updated as new information becomes available |

Quality Criteria for Surfacing

Companies must meet specific criteria to be surfaced in our console:

  1. Verified company name

  2. At least one canonical identifier (e.g., website URL or LinkedIn profile)

  3. Connection to at least one professional in our people dataset

🥷🏼 Special Case: Stealth Companies

While most companies require canonical identifiers, stealth companies are handled differently:

  • Verified through founder relationships rather than traditional identifiers

  • Included in surfaced dataset despite lacking some standard markers

  • Particularly valuable for early-stage investment tracking

Data Freshness

  • Full dataset refresh occurs weekly

  • Updates run Saturday afternoon through Sunday morning

  • Process is atomic - entire dataset is replaced

  • Both companies and people data update together
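A pipeline can guard against reading mid-refresh with a simple time check. The sketch below is only illustrative: the guide says "Saturday afternoon through Sunday morning" without exact hours or a time zone, so the noon boundaries and the use of UTC are assumptions to adjust with your account team.

```python
from datetime import datetime, time, timezone

# Assumed boundaries: the guide does not give exact hours or a time zone,
# so noon-to-noon in UTC is a placeholder, not a documented schedule.
WINDOW_START = time(12, 0)  # Saturday 12:00 (assumed)
WINDOW_END = time(12, 0)    # Sunday 12:00 (assumed)


def in_refresh_window(moment: datetime) -> bool:
    """Return True if `moment` falls inside the assumed weekly refresh window."""
    moment = moment.astimezone(timezone.utc)
    weekday = moment.weekday()  # Monday is 0, Saturday is 5, Sunday is 6
    if weekday == 5:
        return moment.time() >= WINDOW_START
    if weekday == 6:
        return moment.time() < WINDOW_END
    return False
```

A scheduled job could call `in_refresh_window(datetime.now(timezone.utc))` and postpone its run when the result is true.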


Getting started with Harmonic Bulk Data

We recommend BigQuery and Snowflake as primary platforms for their integrated capabilities, and maintain full support for S3 and GCS to accommodate custom data pipelines.

Snowflake

Set up:

  1. Provide both your Snowflake Region (e.g., AWS_US_EAST_1) and Snowflake Account Locator to Harmonic

  2. Ensure your Snowflake instance is set up to run queries

  3. Once provisioned, look for an inbound share called PUBLIC

After your Snowflake share is set up:

Working with tables

  • The PUBLIC share provides direct access to company and people tables without data copying

  • Both companies and people tables are immediately queryable

  • Data refreshes are handled automatically by Harmonic

Tips

  • Consider materializing commonly-used views

  • Use time travel for point-in-time analysis

  • Take advantage of zero-copy cloning for testing

Google BigQuery

Set up:

  1. Create a user email address or service account in your GCP project

  2. Provide the email address to Harmonic

  3. Ensure your GCP project is set up to run BigQuery queries

  4. Once provisioned, access the database using identifier: innate-empire-283902

    1. Navigate to the public dataset

    2. Access the two available tables: companies and people

After your BigQuery share is set up:

Working with tables

  • Two main tables are available in the public dataset: companies and people

Tips

  • Filter for surfaced companies first when possible (these meet our quality criteria)

  • Use table partitioning for date-based queries

  • Test queries on sample data before running large analyses
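For a first test query, it may help to build the fully qualified table name from the identifiers in the setup steps and keep the query small. The helper below is a hypothetical sketch that only produces a SQL string; the column names you pass in are up to you and are not taken from Harmonic's schema.

```python
PROJECT = "innate-empire-283902"  # identifier from the setup steps above
DATASET = "public"
TABLES = ("companies", "people")


def sample_query(table: str, columns: list[str], limit: int = 100) -> str:
    """Build a small exploratory query against one of the two shared tables."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table}")
    if not columns or limit <= 0:
        raise ValueError("need at least one column and a positive limit")
    selected = ", ".join(columns)
    return f"SELECT {selected} FROM `{PROJECT}.{DATASET}.{table}` LIMIT {limit}"
```

Selecting only the columns you need matters here because BigQuery bills by the columns scanned, and a LIMIT clause alone does not necessarily reduce that.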

Google Cloud Storage Bucket (GCS)

Set up:

  1. Provide Harmonic with the email address of the user who will receive access to the bulk share

  2. Once provisioned, access the bucket:

    • Companies files (JSONL and Parquet formats)

    • People files (JSONL and Parquet formats)

After gaining bucket access:

Working with files

  • Files are split into manageable chunks

  • Choose between JSONL and Parquet formats based on your processing needs

Tips

  • Process files in parallel for faster ingestion

  • Maintain file order during processing

  • Consider implementing checkpoints for large ingestion jobs
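One way to combine parallel processing with checkpoints is sketched below. It is a minimal illustration, not Harmonic's recommended implementation: the `process` callback, the checkpoint file, and the file names are all hypothetical, and downloading from the bucket is left to whatever client you already use.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable


def load_checkpoint(path: Path) -> set[str]:
    """Return the set of file names already processed, or an empty set."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()


def ingest(files: Iterable[str], process: Callable[[str], None],
           checkpoint: Path, workers: int = 4) -> list[str]:
    """Process pending files in parallel, recording each success on disk."""
    done = load_checkpoint(checkpoint)
    pending = [name for name in files if name not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, name): name for name in pending}
        for future in as_completed(futures):
            future.result()  # re-raise any worker error before checkpointing
            done.add(futures[future])
            # Written from the main thread only, so no lock is needed.
            checkpoint.write_text(json.dumps(sorted(done)))
    return pending
```

If a run fails partway through, calling `ingest` again with the same checkpoint path skips the files that already succeeded. Because each weekly refresh replaces every file, the checkpoint should be cleared when a new refresh lands.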

Amazon Web Services S3 Bucket (AWS)

Set up:

  1. Provide your AWS account ID to Harmonic

  2. Once provisioned, access the bucket harmonic-data-shares via the AWS console, or programmatically via the bucket ARN, arn:aws:s3:::harmonic-data-shares. The bucket will contain:

    • Companies files (JSONL and Parquet formats)

    • People files (JSONL and Parquet formats)

After your S3 access is configured:

Working with files

  • Files are organized by type (companies/people)

  • Both JSONL and Parquet formats are available

  • Weekly updates replace all files

Tips

  • Implement error handling for large file processing

  • Plan for complete refresh cycles rather than incremental updates


FAQs

Q: What's the difference between the 27M and 45M company dataset numbers?

A: Our console displays 27M companies that meet minimum criteria: having a name, at least one canonical identifier (website or LinkedIn), and at least one person attached. The full dataset of 45M includes companies with less complete data. You can filter for the surfaced 27M companies by requiring these fields.
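As a rough illustration of such a filter, the sketch below checks the three criteria on a single company record. The field names (`name`, `website_url`, `linkedin_url`, `people`) are hypothetical placeholders, not Harmonic's actual schema, so map them to the real columns before use.

```python
def is_surfaceable(company: dict) -> bool:
    """Approximate the surfacing criteria on one company record.

    All field names below are assumed for illustration only.
    """
    has_name = bool(company.get("name"))
    has_identifier = bool(
        company.get("website_url") or company.get("linkedin_url")
    )
    has_person = bool(company.get("people"))
    return has_name and has_identifier and has_person
```

A filter like this would exclude stealth companies, which are surfaced without a website or LinkedIn identifier, so treat the result as an approximation of the 27M set.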

Q: Do you recommend starting with the full dataset or the surfaceable companies?

A: Start with the surfaceable companies (27M) as they have higher data quality and completeness. Once you've established your processing pipeline, you can expand to include the additional companies based on your needs.

Q: How do you handle stealth companies in the data?

A: Stealth companies are a special case - they may lack traditional identifiers (website/LinkedIn) but are verified through founder relationships. They're included in the surfaced dataset despite not meeting standard criteria.

Q: How stable are the company and person IDs?

A: For companies, we recommend using domains as unique identifiers for most reliable tracking across updates. When a company ID is updated (such as in merger or acquisition cases), we don't currently maintain a link to the previous ID. Domains serve as stable canonical identifiers for company entities.

Q: How can we identify newly discovered companies?

A: initialized_date is the date we first created the entry in our database. Use this field when scanning for the most recently discovered companies.
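A scan for recent entries might look like the sketch below. It assumes `initialized_date` arrives as an ISO 8601 string (for example `2024-05-01` or a full timestamp); that format is an assumption, so check the actual values in your share first.

```python
from datetime import date
from typing import Iterable, Iterator


def discovered_since(companies: Iterable[dict], cutoff: date) -> Iterator[dict]:
    """Yield companies whose initialized_date is on or after `cutoff`."""
    for company in companies:
        raw = company.get("initialized_date")
        if not raw:
            continue  # skip records without the field
        # Assumes an ISO 8601 string; only the leading YYYY-MM-DD is used.
        if date.fromisoformat(raw[:10]) >= cutoff:
            yield company
```

For example, `discovered_since(records, date.today() - timedelta(days=7))` (with `timedelta` imported from `datetime`) would return companies first created in the last week.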

Q: What's the recommended way to join company and people data?

A: Join on the company.ID present in both datasets. Be aware that a company may have multiple associated people, and a person may have multiple company associations through their employment history.
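The many-to-many shape of this join can be sketched in memory as follows. The record layout is hypothetical: it assumes each person carries an `experience` list whose entries hold a `company_id`, which may not match the real field names.

```python
from collections import defaultdict
from typing import Iterable


def people_by_company(people: Iterable[dict]) -> dict:
    """Index people by every company ID in their employment history.

    The `experience` and `company_id` field names are assumed.
    """
    index = defaultdict(list)
    for person in people:
        # A set avoids listing a person twice for repeat stints at one company.
        company_ids = {job["company_id"] for job in person.get("experience", [])}
        for company_id in company_ids:
            index[company_id].append(person)
    return index
```

Looking up `index[company_id]` then returns every person with a current or past role at that company, and a single person can appear under several companies.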

Q: How do the weekly updates work?

A: Updates run on weekends, typically Saturday afternoon through Sunday morning. The process replaces all files atomically - avoid processing during this window.

Q: How should we handle the weekly refresh in our data pipelines?

A: Design your pipeline to handle complete dataset replacements rather than incremental updates. Avoid processing during the weekend update window (Saturday afternoon through Sunday morning), and consider maintaining a copy of the previous dataset until new processing completes.
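One possible way to keep the previous copy is to process each refresh into a staging directory and swap it into place only when it is complete. The sketch below assumes all three directories live on the same filesystem; the directory names are hypothetical.

```python
import os
import shutil
from pathlib import Path


def promote(staging: Path, live: Path, previous: Path) -> None:
    """Swap a fully processed staging directory into place, keeping one backup."""
    if previous.exists():
        shutil.rmtree(previous)      # drop the backup from two refreshes ago
    if live.exists():
        os.replace(live, previous)   # last week's data becomes the backup
    os.replace(staging, live)        # the new refresh becomes live
```

Readers of `live` see either last week's data or the new data, and `previous` remains available for rollback until the next refresh is promoted.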

Q: Will the weekly refresh continue automatically?

A: Yes, it runs automatically throughout your term with Harmonic. While bulk updates happen weekly, Harmonic’s API can help with real-time lookups of the most current data.

Q: How are the files organized? (S3 & GCS only)

A: Files are split into 100 segments following a naming pattern of 000-of-100 through 099-of-100. The distribution of records across these files is random, allowing for parallel processing. Both JSONL and Parquet formats maintain identical content.
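The segment labels can be generated rather than listed by hand, which makes it straightforward to split them across workers. This sketch covers only the `NNN-of-100` part of the name; the surrounding prefix and file extension are not specified here and would need to be taken from the bucket listing.

```python
def segment_labels(total: int = 100) -> list[str]:
    """Return labels "000-of-100" through "099-of-100" for the default total."""
    return [f"{index:03d}-of-{total}" for index in range(total)]


def assign(labels: list[str], workers: int) -> list[list[str]]:
    """Deal the labels round-robin into one list per worker."""
    return [labels[worker::workers] for worker in range(workers)]
```

Since records are distributed randomly across files, any assignment of segments to workers should give roughly even workloads.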

Q: Do the JSONL and Parquet files contain the same data? (S3 & GCS only)

A: Yes, they contain identical data. The format choice is just for your processing preference.

Your Harmonic account team is available to help you understand how best to implement Harmonic's data for your specific needs! Reach out with any questions.
