
Bulk Data Overview

This guide will help you understand how Harmonic’s bulk data is structured, organized, and best utilized in your data environment.

Written by Harmonic Team
Updated this week

Harmonic Bulk Share offers access to our entire universe of company and people records, updated weekly. In addition, we now offer premium add-ons for investor data and deal data that provide even deeper insight into the funding ecosystem.

Supported Formats

Harmonic supports the following Bulk Share formats:

  • BigQuery

  • Snowflake

  • AWS S3

Overview

Harmonic maintains comprehensive datasets with detailed information on millions of organizations and people worldwide. Our data is sourced from numerous reliable channels and undergoes rigorous processing to ensure accuracy and usefulness.

Core Data Architecture

There are four interconnected datasets that form the foundation of our intelligence:

Companies
(49M+ total records)

Provides a comprehensive view of organizations worldwide. About 27M of these companies are “surfaceable” in Harmonic Console, meaning they meet our data quality thresholds for venture-backable organizations.

People
(193M+ profiles)

Contains professional profiles that are linked to companies through current and historic employment relationships.

Investors (Add-On)*

Provides detailed profiles of investment firms and individual investors, tracking investment activity and preferences by sector, stage, and geography. Includes portfolio performance metrics such as follow-on rates and unicorn investments, plus AUM data where available.

Deal Data (Add-On)*

Captures comprehensive funding round information, including announcement dates, amounts, and round types. Contains valuation data when available, lead investor and full participant information, and source verification links.

*If you're interested in adding Investors or Deal Data datasets to your Bulk Share subscription, please reach out to your Harmonic account manager for pricing and implementation details.

Understanding Data Quality, Visibility, and Freshness

We maintain different visibility levels across product offerings for company data based on quality and completeness:

| Aspect | Surfaceable Companies (~30M) | Additional Companies (~19M) |
| --- | --- | --- |
| Data Quality | Core fields (see Quality Criteria for Surfacing below) | May have partial information |
| Verification | Verified through multiple signals | Limited verification |
| Usage | Shown in console & bulk data share | Shown in bulk data share only (not in console) |
| Freshness/Updates | Frequently refreshed (see Data Freshness) | Updated as new information becomes available |

Quality Criteria for Surfacing

Companies must meet specific criteria to be surfaced in our console:

  1. Verified company name

  2. At least one canonical identifier (e.g., website URL or LinkedIn profile)

  3. Connection to at least one professional in our people dataset
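Taken together, these criteria can be sketched as a simple record filter. This is a hedged illustration only: the field names (name, website_url, linkedin_url, person_count) are assumptions, not the actual bulk share schema, so check your share's columns before adapting it.

```python
def is_surfaceable(company: dict) -> bool:
    """Approximate the surfacing criteria above for a company record.

    Field names here are illustrative -- consult the bulk share schema
    for the real column names.
    """
    has_name = bool(company.get("name"))
    # At least one canonical identifier (website or LinkedIn)
    has_identifier = bool(company.get("website_url") or company.get("linkedin_url"))
    # Connected to at least one person in the people dataset
    has_person = company.get("person_count", 0) > 0
    return has_name and has_identifier and has_person

companies = [
    {"name": "Acme", "website_url": "https://acme.example", "person_count": 12},
    {"name": "Mystery Co", "person_count": 3},  # no canonical identifier
]
surfaceable = [c for c in companies if is_surfaceable(c)]
```

Note that stealth companies (described below) are an exception: they can be surfaced through founder relationships even without a canonical identifier.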

🥷🏼 Special Case: Stealth Companies

While most companies require canonical identifiers, stealth companies are handled differently:

  • Verified through founder relationships rather than traditional identifiers

  • Included in surfaced dataset despite lacking some standard markers

  • Particularly valuable for early-stage investment tracking

Data Freshness

  • Full dataset refresh occurs weekly

  • Updates run Saturday afternoon through Sunday morning

  • Process is atomic - entire dataset is replaced

  • Both companies and people data update together


Getting started with Harmonic Bulk Data

We recommend BigQuery and Snowflake as primary platforms for their integrated capabilities, and maintain full support for S3 to accommodate custom data pipelines.

Snowflake

Set up:

  1. Provide both your Snowflake Region (e.g., AWS_US_EAST_1) and Snowflake Account Locator to Harmonic

  2. Ensure your Snowflake instance is set up to run queries

  3. Once provisioned, look for an inbound share called PUBLIC

After your Snowflake share is set up:

Working with tables

  • The PUBLIC share provides direct access to company and people tables without data copying

  • Both companies and people tables are immediately queryable

  • Data refreshes are handled automatically by Harmonic

Tips

  • Consider materializing commonly-used views

  • Use time travel for point-in-time analysis

  • Take advantage of zero-copy cloning for testing
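As an example of the time-travel tip, a point-in-time query can be built with Snowflake's AT clause. A minimal sketch, assuming a fully qualified table name (HARMONIC.PUBLIC.COMPANIES is illustrative; substitute the names from your inbound share):

```python
def point_in_time_query(table: str, as_of: str) -> str:
    """Build a Snowflake time-travel query string that reads a table
    as it existed at the given timestamp.

    The table name is an assumption -- use the database/schema from
    your own PUBLIC share.
    """
    return (
        f"SELECT * FROM {table} "
        f"AT(TIMESTAMP => '{as_of}'::TIMESTAMP_LTZ)"
    )

sql = point_in_time_query("HARMONIC.PUBLIC.COMPANIES", "2024-06-01 00:00:00")
```

Run the resulting string through your Snowflake client of choice; time-travel retention on shared data depends on your account configuration.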

Google BigQuery

Set up:

  1. Create a user email address or service account in your GCP project

  2. Provide the email address to Harmonic

  3. Ensure your GCP project is set up to run BigQuery queries

  4. Once provisioned, access the shared data using the project identifier: innate-empire-283902

    1. Navigate to the public dataset

    2. Access the two available tables: companies and people

After your BigQuery share is set up:

Working with tables

  • Two main tables are available in the public dataset: companies and people

Tips

  • Filter for surfaced companies first when possible (these meet our quality criteria)

  • Use table partitioning for date-based queries

  • Test queries on sample data before running large analyses
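Following the first tip above, a query can filter for surfaced-quality companies before anything else. A sketch that builds such a BigQuery SQL string (the dataset and column names are assumptions; the project ID comes from the setup steps above):

```python
PROJECT = "innate-empire-283902"

def surfaced_companies_query(limit: int = 1000) -> str:
    """Build a BigQuery SQL string that keeps only companies meeting the
    surfacing criteria (name plus at least one canonical identifier).

    Dataset and column names are illustrative -- verify them against
    your provisioned share.
    """
    return (
        f"SELECT name, website_url, linkedin_url\n"
        f"FROM `{PROJECT}.public.companies`\n"
        f"WHERE name IS NOT NULL\n"
        f"  AND (website_url IS NOT NULL OR linkedin_url IS NOT NULL)\n"
        f"LIMIT {limit}"
    )

query = surfaced_companies_query(limit=100)
```

Pass the string to the BigQuery client or console; adding a small LIMIT first, as here, follows the "test on sample data" tip.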

Amazon Web Services S3 Bucket (AWS)

Set up:

  1. Provide your AWS account ID to Harmonic

  2. Once provisioned, access the harmonic-data-shares bucket via the AWS console, or programmatically via its ARN: arn:aws:s3:::harmonic-data-shares. The bucket will contain:

    • Companies files (JSONL and Parquet formats)

    • People files (JSONL and Parquet formats)

After your S3 access is configured:

Working with files

  • Files are organized by type (companies/people)

  • Both JSONL and Parquet formats are available

  • Weekly updates replace all files

Tips

  • Implement error handling for large file processing

  • Plan for complete refresh cycles rather than incremental updates
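The error-handling tip matters at this scale: one malformed line in a multi-gigabyte JSONL file shouldn't abort a whole run. A minimal sketch of per-line error handling (record shapes are illustrative):

```python
import json

def parse_jsonl(lines):
    """Parse JSONL records, collecting malformed lines instead of raising.

    Returns (records, errors) where errors holds (line_number, message)
    pairs for later inspection.
    """
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((lineno, str(exc)))
    return records, errors

good, bad = parse_jsonl(['{"name": "Acme"}', "not json", '{"name": "Beta"}'])
```

The same pattern applies when streaming objects directly from S3: log the failures, keep the run going, and review the error list afterward.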


FAQs

Q: What's the difference between the 27M and 45M company dataset numbers?

A: Our console displays 27M companies that meet minimum criteria: having a name, at least one canonical identifier (website or LinkedIn), and at least one person attached. The full dataset of 45M includes companies with less complete data. You can filter for the surfaced 27M companies by requiring these fields.

Q: Do you recommend starting with the full dataset or the surfaceable companies?

A: Start with the surfaceable companies (27M), as they have higher data quality and completeness. Once you've established your processing pipeline, you can expand to include the additional companies based on your needs.

Q: How do you handle stealth companies in the data?

A: Stealth companies are a special case - they may lack traditional identifiers (website/LinkedIn) but are verified through founder relationships. They're included in the surfaced dataset despite not meeting standard criteria.

Q: How stable are the company and person IDs?

A: For companies, we recommend using domains as unique identifiers for most reliable tracking across updates. When a company ID is updated (such as in merger or acquisition cases), we don't currently maintain a link to the previous ID. Domains serve as stable canonical identifiers for company entities.

Q: How can we identify newly discovered companies?

A: The initialized_date field records when we first created the entry in our database. Use this field when scanning for the most recently discovered companies.

Q: What's the recommended way to join company and people data?

A: Join on the company ID, which is present in both datasets. Be aware that a company may have multiple associated people, and a person may have multiple company associations through their employment history.
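The many-to-many nature of this join can be sketched in plain Python. This is a hedged illustration: the field names (company_id, company_ids) stand in for whatever your schema actually calls them.

```python
def join_people_to_companies(companies, people):
    """Attach people to companies via the shared company ID.

    One company can have many people; one person can link to several
    companies through employment history. Field names are illustrative.
    """
    by_id = {c["company_id"]: {**c, "people": []} for c in companies}
    for person in people:
        for cid in person.get("company_ids", []):
            if cid in by_id:
                by_id[cid]["people"].append(person["name"])
    return by_id

companies = [{"company_id": 1, "name": "Acme"}]
people = [
    {"name": "Ada", "company_ids": [1]},
    {"name": "Grace", "company_ids": [1, 2]},  # 2 is outside this sample
]
joined = join_people_to_companies(companies, people)
```

In a warehouse, the same shape is an ordinary JOIN that you should expect to fan out on both sides.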

Q: How do I connect investors data with companies and deals?
A: Investors data can be joined with companies using the entity_urn field, which serves as a unique identifier across our datasets. For deal data, you can use the company_urn field to connect funding rounds to specific companies, and the investors and lead_investors fields to link to investor entities.
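A sketch of that three-way linkage in Python, under loud assumptions: the URN string format and record shapes below are invented for illustration; only the field names (company_urn, investors, lead_investors) come from the answer above.

```python
def link_deals(deals, companies_by_urn, investors_by_urn):
    """Resolve each funding round's company and investors via URNs.

    Record shapes and URN formats here are hypothetical -- only the
    joining field names follow the bulk share description.
    """
    linked = []
    for deal in deals:
        linked.append({
            "round": deal["round_type"],
            "company": companies_by_urn.get(deal["company_urn"], {}).get("name"),
            "investors": [
                investors_by_urn.get(urn, {}).get("name", urn)
                for urn in deal.get("investors", [])
            ],
        })
    return linked

companies = {"urn:example:company:1": {"name": "Acme"}}
investors = {"urn:example:investor:9": {"name": "Example Ventures"}}
deals = [{
    "round_type": "SERIES_A",
    "company_urn": "urn:example:company:1",
    "investors": ["urn:example:investor:9"],
}]
linked = link_deals(deals, companies, investors)
```

Falling back to the raw URN when an investor is missing (as the list comprehension does) keeps unresolved references visible rather than silently dropping them.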

Q: What insights can I gain from the investors data that aren't available in the standard company data?
A: The investors dataset provides detailed investment patterns including sector, geography, and stage preferences over time. It also includes portfolio performance metrics like follow-on rates and unicorn investments that enable deeper analysis of investor strategies and success rates.

Q: How comprehensive is the deal data coverage?
A: Our deal data covers funding rounds across the global startup ecosystem, with particularly strong coverage in North America, Europe, and major Asian markets. Each funding record includes key details such as amount, date, round type, and participating investors when this information is publicly available.

Q: How do the weekly updates work?

A: Updates run on weekends, typically Saturday afternoon through Sunday morning. The process replaces all files atomically - avoid processing during this window.

Q: How should we handle the weekly refresh in our data pipelines?

A: Design your pipeline to handle complete dataset replacements rather than incremental updates. Avoid processing during the weekend update window (Saturday afternoon through Sunday morning), and consider maintaining a copy of the previous dataset until new processing completes.
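The "keep a copy of the previous dataset" advice amounts to a blue/green directory swap. A minimal sketch, assuming a local staging directory that has already been fully processed and validated:

```python
import shutil
from pathlib import Path

def swap_in_new_dataset(staging: Path, live: Path, previous: Path) -> None:
    """Promote a fully processed staging dataset to live, keeping the
    old live copy as a fallback.

    Paths and validation are up to your pipeline; call this only after
    the new data has passed your checks.
    """
    if previous.exists():
        shutil.rmtree(previous)   # drop the oldest fallback copy
    if live.exists():
        live.rename(previous)     # current data becomes the fallback
    staging.rename(live)          # new data goes live atomically (same filesystem)
```

Directory renames are atomic on a single filesystem, so readers never see a half-replaced dataset; for object stores, the equivalent is versioned prefixes plus a pointer update.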

Q: Will the weekly refresh continue automatically?

A: Yes, it runs automatically throughout your term with Harmonic. While bulk updates happen weekly, Harmonic’s API can help with real-time lookups of the most current data.

Q: How are the files organized? (S3 only)

A: Files are split into 100 segments following a naming pattern of 000-of-100 through 099-of-100. The distribution of records across these files is random, allowing for parallel processing. Both JSONL and Parquet formats maintain identical content.
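The 000-of-100 naming makes it easy to enumerate all segments up front and fan them out to workers. A sketch of the key generation (the exact prefix and extension layout is an assumption; list the bucket to confirm the real key pattern):

```python
def segment_names(prefix: str, fmt: str = "jsonl", segments: int = 100):
    """Generate the 000-of-100 through 099-of-100 segment file names.

    Prefix and extension are illustrative -- verify against the actual
    S3 keys in harmonic-data-shares.
    """
    return [f"{prefix}-{i:03d}-of-{segments}.{fmt}" for i in range(segments)]

names = segment_names("companies")
```

Because records are distributed randomly across segments, each name can be handed to an independent worker with no ordering concerns.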

Q: Do the JSONL and Parquet files contain the same data? (S3 only)

A: Yes, they contain identical data. The format choice is just for your processing preference.

Your Harmonic account team is available to help you understand how best to implement Harmonic's data for your specific needs! Reach out with any questions.
