Harmonic Bulk Share offers access to our entire universe of company and people records, updated weekly. We also offer premium add-ons for investors data and deal data that provide even deeper insight into the funding ecosystem.
Supported Formats
Harmonic supports the following Bulk Share formats:
BigQuery
Snowflake
AWS S3
Overview
Harmonic maintains comprehensive datasets with detailed information on millions of organizations and people worldwide. Our data is sourced from numerous reliable channels and undergoes rigorous processing to ensure accuracy and usefulness.
Core Data Architecture
Four interconnected datasets form the foundation of our intelligence:

| Dataset | Description |
| --- | --- |
| Companies | Provides a comprehensive view of organizations worldwide. About 27M of these companies are "surfaceable" in Harmonic Console, meaning they meet our data quality thresholds for venture-backable organizations. |
| People | Contains professional profiles that are linked to companies through current and historic employment relationships. |
| Investors (Add-On)* | Provides detailed profiles of investment firms and individual investors, with comprehensive tracking of investment activity and preferences by sector, stage, and geography. Includes portfolio performance metrics like follow-on rates and unicorn investments, plus AUM data where available. |
| Deal Data (Add-On)* | Captures comprehensive funding round information, including announcement dates, amounts, and round types. Contains valuation data when available, plus lead and participating investor information, along with source verification links. |
*If you're interested in adding Investors or Deal Data datasets to your Bulk Share subscription, please reach out to your Harmonic account manager for pricing and implementation details.
Understanding Data Quality, Visibility, and Freshness
We maintain different visibility levels for company data across our product offerings, based on quality and completeness:
| Aspect | Surfaceable Companies (~27M) | Additional Companies (~18M) |
| --- | --- | --- |
| Data Quality | Core fields present (see Quality Criteria for Surfacing below) | May have partial information |
| Verification | Verified through multiple signals | Limited verification |
| Usage | Shown in console & bulk data share | Shown in bulk data share only (not in console) |
| Freshness/updates | Frequently refreshed (see Data Freshness) | Updated as new information becomes available |
Quality Criteria for Surfacing
Companies must meet specific criteria to be surfaced in our console:
Verified company name
At least one canonical identifier (i.e., a website URL or LinkedIn profile)
Connection to at least one professional in our people dataset
🥷🏼 Special Case: Stealth Companies
While most companies require canonical identifiers, stealth companies are handled differently:
Verified through founder relationships rather than traditional identifiers
Included in surfaced dataset despite lacking some standard markers
Particularly valuable for early-stage investment tracking
Data Freshness
Full dataset refresh occurs weekly
Updates run Saturday afternoon through Sunday morning
The process is atomic: the entire dataset is replaced in one operation
Both companies and people data update together
Getting started with Harmonic Bulk Data
We recommend BigQuery and Snowflake as primary platforms for their integrated capabilities, and maintain full support for S3 to accommodate custom data pipelines.
Snowflake
Set up:
Provide both your Snowflake Region (e.g., `AWS_US_EAST_1`) and Snowflake Account Locator to Harmonic
Ensure your Snowflake instance is set up to run queries
Once provisioned, look for an inbound share called `PUBLIC`
After your Snowflake share is set up:
Working with tables
The `PUBLIC` share provides direct access to company and people tables without data copying
Both companies and people tables are immediately queryable
Data refreshes are handled automatically by Harmonic
Tips
Consider materializing commonly-used views
Use time travel for point-in-time analysis
Take advantage of zero-copy cloning for testing
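If you query the share programmatically, a minimal sketch with `snowflake-connector-python` might look like the following. All connection parameters are placeholders, `HARMONIC` stands in for whatever database name you assign when creating a database from the inbound `PUBLIC` share, and the `COMPANIES` table name is an assumption about the share's schema:

```python
import snowflake.connector

# Placeholders: substitute your own account locator, credentials, and
# warehouse. "HARMONIC" is the database you created from the PUBLIC share.
conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT_LOCATOR",   # in your region, e.g. AWS_US_EAST_1
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    warehouse="YOUR_WAREHOUSE",
    database="HARMONIC",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    # Queries run directly against the shared tables; no data is copied
    # into your account. "COMPANIES" is an assumed table name.
    cur.execute("SELECT COUNT(*) FROM COMPANIES")
    print("companies:", cur.fetchone()[0])
finally:
    conn.close()
```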
Google BigQuery
Set up:
Create a user email address or service account in your GCP project
Provide the email address to Harmonic
Ensure your GCP project is set up to run BigQuery queries
Once provisioned, access the database using the identifier `innate-empire-283902`
Navigate to the `public` dataset
Access the two available tables: `companies` and `people`
After your BigQuery share is set up:
Working with tables
Two main tables are available in the `public` dataset: `companies` and `people`
Tips
Filter for surfaced companies first when possible (these meet our quality criteria)
Use table partitioning for date-based queries
Test queries on sample data before running large analyses
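As a quick smoke test, here is a minimal sketch with the `google-cloud-bigquery` client, using only the project, dataset, and table names from the steps above (no column names are assumed):

```python
from google.cloud import bigquery

# Authenticates as the service account or user you shared with Harmonic.
client = bigquery.Client()

query = "SELECT COUNT(*) AS total FROM `innate-empire-283902.public.companies`"
for row in client.query(query).result():  # .result() waits for completion
    print("total companies:", row.total)
```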
Amazon Web Services S3 Bucket (AWS)
Set up:
Provide your AWS account ID to Harmonic
Once provisioned, access the bucket `harmonic-data-shares` via the AWS console or programmatically via the bucket ARN, `arn:aws:s3:::harmonic-data-shares`. The bucket will contain:
Companies files (JSONL & Parquet format)
People files (JSONL & Parquet format)
After your S3 access is configured:
Working with files
Files are organized by type (companies/people)
Both JSONL and Parquet formats are available
Weekly updates replace all files
Tips
Implement error handling for large file processing
Plan for complete refresh cycles rather than incremental updates
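A minimal sketch of listing and downloading files with `boto3`. The bucket name comes from the setup steps above; the `companies/` prefix reflects the files-organized-by-type description and may differ from your actual bucket layout:

```python
import boto3

s3 = boto3.client("s3")

# List company files; the "companies/" prefix is an assumption -- list the
# bucket without a prefix first to confirm how keys are laid out.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="harmonic-data-shares", Prefix="companies/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])

# Download a single file once you know its key:
# s3.download_file("harmonic-data-shares", "<key>", "local.parquet")
```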
FAQs
Q: What's the difference between the 27M and 45M company dataset numbers?
A: Our console displays 27M companies that meet minimum criteria: having a name, at least one canonical identifier (website or LinkedIn), and at least one person attached. The full dataset of 45M includes companies with less complete data. You can filter for the surfaced 27M companies by requiring these fields.
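A hedged sketch of that filter in BigQuery, reusing the client from the BigQuery section. The column names (`name`, `website_url`, `linkedin_url`) are assumptions; check your share's schema, and add the people-linkage condition using whatever field expresses it in your tables:

```python
from google.cloud import bigquery

client = bigquery.Client()
# Column names below are assumed -- substitute the actual fields in your share.
query = """
    SELECT COUNT(*) AS surfaced
    FROM `innate-empire-283902.public.companies`
    WHERE name IS NOT NULL
      AND (website_url IS NOT NULL OR linkedin_url IS NOT NULL)
"""
for row in client.query(query).result():
    print("surfaced companies:", row.surfaced)
```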
Q: Do you recommend starting with the full dataset or the surfaceable companies?
A: Start with the surfaceable companies (27M), as they have higher data quality and completeness. Once you've established your processing pipeline, you can expand to include the additional companies based on your needs.
Q: How do you handle stealth companies in the data?
A: Stealth companies are a special case - they may lack traditional identifiers (website/LinkedIn) but are verified through founder relationships. They're included in the surfaced dataset despite not meeting standard criteria.
Q: How stable are the company and person IDs?
A: For companies, we recommend using domains as unique identifiers for most reliable tracking across updates. When a company ID is updated (such as in merger or acquisition cases), we don't currently maintain a link to the previous ID. Domains serve as stable canonical identifiers for company entities.
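For example, a sketch of domain-based tracking across two weekly JSONL snapshots; the `website_domain` and `id` field names are assumptions, and the file names are placeholders:

```python
import json

def load_snapshot(path):
    """Read one JSONL snapshot file into a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def index_by_domain(companies):
    """Key records by domain so a company can be matched across snapshots
    even if its internal ID changed (e.g., after a merger)."""
    return {c["website_domain"]: c for c in companies if c.get("website_domain")}

previous = index_by_domain(load_snapshot("companies_week_1.jsonl"))
current = index_by_domain(load_snapshot("companies_week_2.jsonl"))

# Domains whose company ID changed between refreshes:
changed = [d for d in previous.keys() & current.keys()
           if previous[d]["id"] != current[d]["id"]]
```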
Q: How can we identify newly discovered companies?
A: `initialized_date` is the date we first created the entry in our database. Use this field when scanning for the most recently discovered companies.
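For instance, to pull the most recently discovered companies (only the table path and the `initialized_date` field from this answer are assumed to exist):

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT *
    FROM `innate-empire-283902.public.companies`
    ORDER BY initialized_date DESC
    LIMIT 100
"""
newest = list(client.query(query).result())
```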
Q: What's the recommended way to join company and people data?
A: Join on the company ID present in both datasets. Be aware that a company may have multiple associated people, and a person may have multiple company associations through their employment history.
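A hedged sketch of that join in BigQuery, which also surfaces the many-people-per-company relationship. The exact column names (`id` on `companies`, `company_id` on `people`) are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()
# `c.id` and `p.company_id` are assumed column names -- adjust to your schema.
query = """
    SELECT c.id AS company_id, COUNT(*) AS people_count
    FROM `innate-empire-283902.public.companies` AS c
    JOIN `innate-empire-283902.public.people` AS p
      ON p.company_id = c.id
    GROUP BY company_id
    LIMIT 100
"""
rows = client.query(query).result()
```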
Q: How do I connect investors data with companies and deals?
A: Investors data can be joined with companies using the `entity_urn` field, which serves as a unique identifier across our datasets. For deal data, you can use the `company_urn` field to connect funding rounds to specific companies, and the `investors` and `lead_investors` fields to link to investor entities.
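A sketch of those joins, assuming the add-on tables are exposed as `deals` and `investors` alongside `companies` in the same dataset. The table names and the repeated `investors` field are assumptions; the URN fields come from this answer:

```python
from google.cloud import bigquery

client = bigquery.Client()
# "deals" as a table name and "investors" as a repeated (array) field are
# assumptions -- verify both against your provisioned share.
query = """
    SELECT d.company_urn, investor_urn
    FROM `innate-empire-283902.public.deals` AS d,
         UNNEST(d.investors) AS investor_urn
    JOIN `innate-empire-283902.public.companies` AS c
      ON d.company_urn = c.entity_urn
    LIMIT 100
"""
rows = client.query(query).result()
```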
Q: What insights can I gain from the investors data that aren't available in the standard company data?
A: The investors dataset provides detailed investment patterns including sector, geography, and stage preferences over time. It also includes portfolio performance metrics like follow-on rates and unicorn investments that enable deeper analysis of investor strategies and success rates.
Q: How comprehensive is the deal data coverage?
A: Our deal data covers funding rounds across the global startup ecosystem, with particularly strong coverage in North America, Europe, and major Asian markets. Each funding record includes key details such as amount, date, round type, and participating investors when this information is publicly available.
Q: How do the weekly updates work?
A: Updates run on weekends, typically Saturday afternoon through Sunday morning. The process replaces all files atomically - avoid processing during this window.
Q: How should we handle the weekly refresh in our data pipelines?
A: Design your pipeline to handle complete dataset replacements rather than incremental updates. Avoid processing during the weekend update window (Saturday afternoon through Sunday morning), and consider maintaining a copy of the previous dataset until new processing completes.
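A sketch of that pattern for file-based pipelines: stage the new snapshot, keep last week's copy as a fallback, and swap only after a sanity check (the paths and the completeness check are placeholders):

```python
import os
import shutil

STAGING, LIVE, PREVIOUS = "data/staging", "data/live", "data/previous"

def promote_snapshot():
    """Swap a fully downloaded snapshot into place, keeping the old copy."""
    if not os.listdir(STAGING):              # crude completeness check
        raise RuntimeError("staging snapshot is empty; aborting swap")
    if os.path.exists(PREVIOUS):
        shutil.rmtree(PREVIOUS)              # drop the oldest fallback copy
    if os.path.exists(LIVE):
        os.rename(LIVE, PREVIOUS)            # keep last week as fallback
    os.rename(STAGING, LIVE)                 # promote the new snapshot
```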
Q: Will the weekly refresh continue automatically?
A: Yes, it runs automatically throughout your term with Harmonic. While bulk updates happen weekly, Harmonic’s API can help with real-time lookups of the most current data.
Q: How are the files organized? (S3 only)
A: Files are split into 100 segments following a naming pattern of 000-of-100 through 099-of-100. The distribution of records across these files is random, allowing for parallel processing. Both JSONL and Parquet formats maintain identical content.
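A sketch of fetching the segments in parallel with `boto3`; the 000-of-100 through 099-of-100 pattern is from this answer, while the surrounding key naming is an assumption, so list the bucket first to confirm it:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "harmonic-data-shares"
s3 = boto3.client("s3")  # boto3 clients are safe to share across threads

def fetch_segment(i: int) -> None:
    # Assumed key layout -- confirm the real pattern with a bucket listing.
    key = f"companies/companies_{i:03d}-of-100.jsonl"
    s3.download_file(BUCKET, key, f"/tmp/companies_{i:03d}.jsonl")

# Records are randomly distributed, so segments can be fetched independently.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fetch_segment, range(100)))
```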
Q: Do the JSONL and Parquet files contain the same data? (S3 only)
A: Yes, they contain identical data. The format choice is just for your processing preference.
Your Harmonic account team is available to help you understand how best to implement Harmonic's data for your specific needs! Reach out with any questions.