Harmonic Bulk Share offers access to our entire universe of company and people records, updated weekly.
Supported Formats
Harmonic supports the following Bulk Share formats:
BigQuery
Snowflake
Google Cloud Storage Bucket
AWS Bucket
Overview
Harmonic maintains two of the most comprehensive datasets available, with detailed information on millions of organizations and people worldwide. Our data is sourced from numerous reliable channels and undergoes rigorous processing to ensure accuracy and usefulness.
Core Data Architecture
Harmonic maintains two interconnected datasets that form the foundation of our intelligence:
| Dataset | Records | Description |
| --- | --- | --- |
| Companies | 45M+ total records | Our company dataset provides a comprehensive view of organizations worldwide. About 27M of these companies are "surfaceable" in Harmonic Console, meaning they meet our data quality thresholds for venture-backable organizations. |
| People | 200M+ profiles | Our people dataset contains professional profiles linked to companies through current and historic employment relationships, providing context about team composition, professional movements, and industry expertise. |
Understanding Data Quality, Visibility, and Freshness
We maintain different visibility levels across product offerings for company data based on quality and completeness:
| Aspect | Surfaceable Companies (~27M) | Additional Companies (~18M) |
| --- | --- | --- |
| Data quality | Core fields present (see Quality Criteria for Surfacing below) | May have partial information |
| Verification | Verified through multiple signals | Limited verification |
| Usage | Shown in console and bulk data share | Shown in bulk data share only (not in console) |
| Freshness/updates | Frequently refreshed (see Data Freshness) | Updated as new information becomes available |
Quality Criteria for Surfacing
Companies must meet specific criteria to be surfaced in our console:
Verified company name
At least one canonical identifier (e.g., website URL, LinkedIn profile)
Connection to at least one professional in our people dataset
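The criteria above can be expressed as a simple predicate over a company record. This is an illustrative sketch: the field names (`name`, `website_url`, `linkedin_url`, `people`) are hypothetical stand-ins, not the actual bulk share schema.

```python
def is_surfaceable(company: dict) -> bool:
    """Sketch of the surfacing criteria: a verified name, at least one
    canonical identifier, and at least one linked person.
    Field names here are illustrative, not the actual schema."""
    has_name = bool(company.get("name"))
    has_identifier = bool(company.get("website_url") or company.get("linkedin_url"))
    has_person = len(company.get("people", [])) > 0
    return has_name and has_identifier and has_person

records = [
    {"name": "Acme", "website_url": "https://acme.com", "people": ["p1"]},
    {"name": "Ghost Co", "people": []},
]
surfaceable = [c for c in records if is_surfaceable(c)]
```

The same three checks can be applied as filters when querying the bulk tables to reproduce the console's surfaceable subset.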
🥷🏼 Special Case: Stealth Companies
While most companies require canonical identifiers, stealth companies are handled differently:
Verified through founder relationships rather than traditional identifiers
Included in surfaced dataset despite lacking some standard markers
Particularly valuable for early-stage investment tracking
Data Freshness
Full dataset refresh occurs weekly
Updates run Saturday afternoon through Sunday morning
Process is atomic - entire dataset is replaced
Both companies and people data update together
Getting started with Harmonic Bulk Data
We recommend BigQuery and Snowflake as primary platforms for their integrated capabilities, and maintain full support for S3 and GCS to accommodate custom data pipelines.
Snowflake
Set up:
Provide both your Snowflake Region (e.g., AWS_US_EAST_1) and your Snowflake Account Locator to Harmonic
Ensure your Snowflake instance is set up to run queries
Once provisioned, look for an inbound share called PUBLIC
After your Snowflake share is set up:
Working with tables
The PUBLIC share provides direct access to the company and people tables without data copying
Both companies and people tables are immediately queryable
Data refreshes are handled automatically by Harmonic
Tips
Consider materializing commonly-used views
Use time travel for point-in-time analysis
Take advantage of zero-copy cloning for testing
Google BigQuery
Set up:
Create a user email address or service account in your GCP project
Provide the email address to Harmonic
Ensure your GCP project is set up to run BigQuery queries
Once provisioned, access the database using the identifier innate-empire-283902
Navigate to the public dataset
Access the two available tables: companies and people
After your BigQuery share is set up:
Working with tables
Two main tables are available in the public dataset: companies and people
Tips
Filter for surfaced companies first when possible (these meet our quality criteria)
Use table partitioning for date-based queries
Test queries on sample data before running large analyses
Google Cloud Storage Bucket (GCS)
Set up:
Provide Harmonic with the email address of the user that will receive access to the bulk share
Once provisioned, access the bucket:
Companies files (jsonl & parquet format)
People files (jsonl & parquet format)
After gaining bucket access:
Working with files
Files are split into manageable chunks
Choose between JSONL and Parquet formats based on your processing needs
Tips
Process files in parallel for faster ingestion
Maintain file order during processing
Consider implementing checkpoints for large ingestion jobs
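The tips above (parallel ingestion with checkpoints) can be sketched for the JSONL chunks. This is a local-file illustration, not a GCS client integration; the `.done` marker-file scheme is one possible checkpoint design, not a Harmonic convention:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_chunk(path: Path, checkpoint_dir: Path) -> int:
    """Parse one JSONL chunk, skipping chunks already checkpointed."""
    marker = checkpoint_dir / (path.name + ".done")
    if marker.exists():
        return 0  # already ingested on a previous run
    with path.open() as f:
        count = sum(1 for line in f if line.strip() and json.loads(line))
    marker.touch()  # record completion so a rerun skips this chunk
    return count

def ingest(chunks: list[Path], checkpoint_dir: Path) -> int:
    """Process chunks in parallel; records are randomly distributed
    across files, so any chunk can be handled independently."""
    checkpoint_dir.mkdir(exist_ok=True)
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(lambda p: process_chunk(p, checkpoint_dir), chunks))
```

On an interrupted run, re-invoking `ingest` resumes from the unprocessed chunks only.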
Amazon Web Services S3 Bucket (AWS)
Set up:
Provide your AWS account ID to Harmonic
Once provisioned, access the bucket harmonic-data-shares via the AWS console, or programmatically via the bucket ARN, arn:aws:s3:::harmonic-data-shares. The bucket will contain:
Companies files (jsonl & parquet format)
People files (jsonl & parquet format)
After your S3 access is configured:
Working with files
Files are organized by type (companies/people)
Both JSONL and Parquet formats are available
Weekly updates replace all files
Tips
Implement error handling for large file processing
Plan for complete refresh cycles rather than incremental updates
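Since weekly updates replace all files, a swap-on-complete layout is one way to plan for full refresh cycles. A minimal local sketch, assuming you stage the new download next to the live copy (directory names are illustrative):

```python
import shutil
from pathlib import Path

def apply_full_refresh(staging: Path, live: Path) -> None:
    """Promote a fully-downloaded dataset only once it is complete,
    keeping the previous copy as a fallback until the swap is done."""
    previous = live.with_suffix(".previous")
    if previous.exists():
        shutil.rmtree(previous)     # drop the copy from two refreshes ago
    if live.exists():
        live.rename(previous)       # keep the old dataset as a fallback
    staging.rename(live)            # promote the new dataset
```

Downstream jobs always read from `live`, so a failed download leaves the prior week's data untouched.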
FAQs
Q: What's the difference between the 27M and 45M company dataset numbers?
A: Our console displays 27M companies that meet minimum criteria: having a name, at least one canonical identifier (website or LinkedIn), and at least one person attached. The full dataset of 45M includes companies with less complete data. You can filter for the surfaced 27M companies by requiring these fields.
Q: Do you recommend starting with the full dataset or the surfaceable companies?
A: Start with the surfaceable companies (27M) as they have higher data quality and completeness. Once you've established your processing pipeline, you can expand to include the additional companies based on your needs.
Q: How do you handle stealth companies in the data?
A: Stealth companies are a special case - they may lack traditional identifiers (website/LinkedIn) but are verified through founder relationships. They're included in the surfaced dataset despite not meeting standard criteria.
Q: How stable are the company and person IDs?
A: For companies, we recommend using domains as unique identifiers for most reliable tracking across updates. When a company ID is updated (such as in merger or acquisition cases), we don't currently maintain a link to the previous ID. Domains serve as stable canonical identifiers for company entities.
Q: How can we identify newly discovered companies?
A: initialized_date is the date we first created the entry in our database. Use this field when scanning for the most recently discovered companies.
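A scan for newly discovered companies might look like the following sketch, which assumes `initialized_date` is available as an ISO-format date string (the record shape is illustrative):

```python
from datetime import date, timedelta

def recently_discovered(companies, days=7, today=None):
    """Return companies first seen within the last `days` days,
    filtering on the initialized_date field (ISO date strings assumed)."""
    today = today or date.today()
    cutoff = (today - timedelta(days=days)).isoformat()
    # ISO dates sort lexicographically, so string comparison is safe here
    return [c for c in companies if c.get("initialized_date", "") >= cutoff]

rows = [
    {"name": "New Co", "initialized_date": "2024-06-05"},
    {"name": "Old Co", "initialized_date": "2020-01-01"},
]
recent = recently_discovered(rows, days=7, today=date(2024, 6, 8))
```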
Q: What's the recommended way to join company and people data?
A: Join on the company ID present in both datasets. Be aware that a company may have multiple associated people, and a person may have multiple company associations through their employment history.
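The many-to-many shape of that join can be sketched in memory. Field names here (`id`, `company_ids`, `name`) are hypothetical placeholders for the actual ID columns in the two datasets:

```python
def join_people_to_companies(companies, people):
    """Sketch of joining on the shared company ID. A company can have
    many people, and a person can appear under several companies
    through their employment history. Field names are illustrative."""
    by_id = {c["id"]: {**c, "people": []} for c in companies}
    for person in people:
        for company_id in person.get("company_ids", []):
            if company_id in by_id:
                by_id[company_id]["people"].append(person["name"])
    return by_id

companies = [{"id": 1, "name": "Acme"}]
people = [
    {"name": "Ada", "company_ids": [1, 2]},  # two associations via history
    {"name": "Bo", "company_ids": [1]},
]
joined = join_people_to_companies(companies, people)
```

At bulk-share scale you would express the same join in SQL or a dataframe engine rather than in memory.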
Q: How do the weekly updates work?
A: Updates run on weekends, typically Saturday afternoon through Sunday morning. The process replaces all files atomically - avoid processing during this window.
Q: How should we handle the weekly refresh in our data pipelines?
A: Design your pipeline to handle complete dataset replacements rather than incremental updates. Avoid processing during the weekend update window (Saturday afternoon through Sunday morning), and consider maintaining a copy of the previous dataset until new processing completes.
Q: Will the weekly refresh continue automatically?
A: Yes, it runs automatically throughout your term with Harmonic. While bulk updates happen weekly, Harmonic’s API can help with real-time lookups of the most current data.
Q: How are the files organized? (S3 & GCS only)
A: Files are split into 100 segments following a naming pattern of 000-of-100 through 099-of-100. The distribution of records across these files is random, allowing for parallel processing. Both JSONL and Parquet formats maintain identical content.
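To enumerate the expected chunk files for a download manifest, the 000-of-100 pattern can be generated directly. The exact prefix and separator are assumptions; only the `000-of-100` through `099-of-100` numbering comes from the answer above:

```python
def chunk_names(prefix: str, ext: str, segments: int = 100) -> list[str]:
    """Expected file names for the 100-way split, e.g.
    companies-000-of-100.jsonl ... companies-099-of-100.jsonl.
    Prefix and separator are illustrative; the NNN-of-100 pattern
    matches the documented naming."""
    return [f"{prefix}-{i:03d}-of-{segments}{ext}" for i in range(segments)]
```

Comparing this list against the bucket listing is a cheap completeness check before ingestion.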
Q: Do the JSONL and Parquet files contain the same data? (S3 & GCS only)
A: Yes, they contain identical data. The format choice is just for your processing preference.
Your Harmonic account team is available to help you understand how best to implement Harmonic's data for your specific needs! Reach out with any questions.