S3

Availability: Airbyte Cloud, Airbyte OSS, PyAirbyte
Support Level: Airbyte Connector
Connector Version: 4.7.4
CDK Version: 3.9.6

This page contains the setup guide and reference information for the Amazon S3 source connector.

Info: Using cloud storage may incur egress costs. Egress refers to data that is transferred out of the cloud storage system, such as when you download files or access them from a different location. For detailed information on egress costs, please consult the Amazon S3 pricing guide.

Prerequisites

Access to the S3 bucket containing the files to replicate.
For private buckets, an AWS account with the ability to grant permissions to read from the bucket.

Setup guide

Step 1: Set up Amazon S3

If you are syncing from a private bucket, you need to authenticate the connection. This can be done either by using an IAM User (with AWS Access Key ID and Secret Access Key) or an IAM Role (with Role ARN). Begin by creating a policy with the necessary permissions:

Create a Policy

Log in to your Amazon AWS account and open the IAM console.
In the IAM dashboard, select Policies, then click Create Policy.
Select the JSON tab, then paste the following JSON into the Policy editor (be sure to substitute in your bucket name):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{your-bucket-name}/*",
                "arn:aws:s3:::{your-bucket-name}"
            ]
        }
    ]
}

Note: At this time, object-level permissions alone are not sufficient to successfully authenticate the connection. Please ensure you include the bucket-level permissions as provided in the example above.

Give your policy a descriptive name, then click Create policy.
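
If you need the same policy for more than one bucket, the JSON from this step can be generated instead of edited by hand. The following is a minimal sketch in Python; the helper name build_read_policy and the bucket name my-example-bucket are illustrative assumptions, not part of the connector or of any AWS tooling.

```python
import json


def build_read_policy(bucket_name: str) -> dict:
    """Return the read-only policy from this guide for one bucket (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Object-level read plus bucket-level listing, as required above.
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }


# "my-example-bucket" is a placeholder; substitute your own bucket name.
policy = build_read_policy("my-example-bucket")
print(json.dumps(policy, indent=4))
```

The printed document can be pasted into the JSON tab of the policy editor.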

Option 1: Using an IAM User

In the IAM dashboard, click Users. Select an existing IAM user or create a new one by clicking Add users.
If you are using an existing IAM user, click the Add permissions dropdown menu and select Add permissions. If you are creating a new user, you will be taken to the Permissions screen after selecting a name.
Select Attach policies directly, then find and check the box for your new policy. Click Next, then Add permissions.
After successfully creating your user, select the Security credentials tab and click Create access key. You will be prompted to select a use case and add optional tags to your access key. Click Create access key to generate the keys.

Caution: Your Secret Access Key will only be visible once upon creation. Be sure to copy and store it securely for future use.

For more information on managing your access keys, please refer to the official AWS documentation.

Option 2: Using an IAM Role (Most secure)

Note: S3 authentication using an IAM role is not supported on the OSS platform.

Note: S3 authentication using an IAM role must be enabled by a member of the Airbyte team. If you'd like to use this feature, please contact the Sales team for more information.

In the IAM dashboard, click Roles, then Create role.
Choose the AWS account trusted entity type.
Set up a trust relationship for the role. This allows the Airbyte instance's AWS account to assume this role. You will also need to specify an external ID, a shared secret known to both Airbyte and the role you're creating, which is used to prevent the "confused deputy" problem. The external ID should be your Airbyte workspace ID, which can be found in the URL of your workspace page. Edit the trust relationship policy to include the external ID:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::094410056844:user/delegated_access_user"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "{your-airbyte-workspace-id}"
                }
            }
        }
    ]
}

Complete the role creation and note the Role ARN.
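
The trust policy can be templated in the same way, so that the external ID is filled in programmatically. This is a sketch only: the helper name build_trust_policy and the placeholder workspace ID are assumptions for illustration, while the principal ARN is the one shown in the policy above.

```python
import json

# Principal ARN copied from the trust policy shown in this guide.
AIRBYTE_PRINCIPAL = "arn:aws:iam::094410056844:user/delegated_access_user"


def build_trust_policy(workspace_id: str) -> dict:
    """Return the trust policy with the external ID set (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": AIRBYTE_PRINCIPAL},
                "Action": "sts:AssumeRole",
                # The role can only be assumed when this external ID is supplied.
                "Condition": {"StringEquals": {"sts:ExternalId": workspace_id}},
            }
        ],
    }


# Placeholder value; use the workspace ID from your workspace URL.
trust_policy = build_trust_policy("workspace-id-placeholder")
print(json.dumps(trust_policy, indent=4))
```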

Step 2: Set up the Amazon S3 connector in Airbyte

Log in to your Airbyte Cloud account, or navigate to your Airbyte Open Source dashboard.
In the left navigation bar, click Sources. In the top-right corner, click + New source.
Find and select S3 from the list of available sources.
Enter the name of the Bucket containing your files to replicate.
Add a stream
1. Choose the File Format
2. In the Format box, use the dropdown menu to select the format of the files you'd like to replicate. The supported formats are CSV, Parquet, Avro and JSONL. Toggling the Optional fields button within the Format box will allow you to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the File Format section below.
3. Give a Name to the stream
4. (Optional) Enter the Globs, which dictate which files are synced. Each glob is a path pattern (not a regular expression) that Airbyte matches against file paths to select the files to replicate. If you are replicating all the files within your bucket, use ** as the pattern. For more precise pattern matching options, refer to the Globs section below.
5. (Optional) Modify the Days To Sync If History Is Full value. This gives you control of the lookback window that we will use to determine which files to sync if the state history is full. Details are in the State section below.
6. (Optional) If you want to enforce a specific schema, you can enter an Input schema. By default, this value is set to {} and will automatically infer the schema from the file(s) you are replicating. For details on providing a custom schema, refer to the User Schema section.
7. (Optional) Select the Schemaless option, to skip all validation of the records against a schema. If this option is selected the schema will be {"data": "object"} and all downstream data will be nested in a "data" field. This is a good option if the schema of your records changes frequently.
8. (Optional) Select a Validation Policy to tell Airbyte how to handle records that do not match the schema. You may choose to emit the record anyway (fields that aren't present in the schema may not arrive at the destination), skip the record altogether, or wait until the next discovery (which will happen in the next 24 hours).
To authenticate your private bucket:
- If using an IAM role, enter the AWS Role ARN.
- If using IAM user credentials, fill the AWS Access Key ID and AWS Secret Access Key fields with the appropriate credentials.

All other fields are optional and can be left empty. Refer to the S3 Provider Settings section below for more information on each field.
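
For readers who configure the connector programmatically rather than through the UI, the same settings can be written as a configuration object. The sketch below is illustrative only: the top-level keys (bucket, streams, aws_access_key_id, aws_secret_access_key) are taken from the Config fields reference on this page, while the keys inside each stream (name, globs, format, filetype) are assumptions that should be checked against the connector's published spec. All values are placeholders.

```python
import json

# Hypothetical configuration for a single CSV stream.
config = {
    "bucket": "my-example-bucket",
    # IAM user credentials; with an IAM role you would set "role_arn" instead.
    "aws_access_key_id": "<your-access-key-id>",
    "aws_secret_access_key": "<your-secret-access-key>",
    "streams": [
        {
            # Nested key names below are assumptions, not confirmed by this page.
            "name": "table_files",
            "globs": ["**/*.csv"],
            "format": {"filetype": "csv"},
        }
    ],
}

print(json.dumps(config, indent=2))
```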

Supported sync modes

The Amazon S3 source connector supports the following sync modes:

Feature	Supported?
Full Refresh Sync	Yes
Incremental Sync	Yes
Replicate Incremental Deletes	No
Replicate Multiple Files (pattern matching)	Yes
Replicate Multiple Streams (distinct tables)	Yes
Namespaces	No

Supported streams

There are no predefined streams. The streams are based on the contents of your bucket.

File Compressions

Compression	Supported?
Gzip	Yes
Zip	Yes
Bzip2	Yes
Lzma	No
Xz	No
Snappy	No

Please let us know any specific compressions you'd like to see support for next!

Globs

(tl;dr -> path pattern syntax using wcmatch.glob. GLOBSTAR and SPLIT flags are enabled.)

This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:

Referencing many files with just one pattern, e.g. ** would indicate every file in the bucket.
Referencing future files that don't exist yet (and therefore don't have a specific path).

You must provide a path pattern. You can also provide many patterns split with | for more complex directory layouts.

Each path pattern is a reference from the root of the bucket, so don't include the bucket name in the pattern(s).

Some example patterns:

** : match everything.
**/*.csv : match all files with specific extension.
myFolder/**/*.csv : match all csv files anywhere under myFolder.
*/** : match everything at least one folder deep.
*/*/*/** : match everything at least three folders deep.
**/file.*|**/file : match every file called "file" with any extension (or no extension).
x/*/y/* : match all files that sit in folder x -> any folder -> folder y.
**/prefix*.csv : match all csv files with specific prefix.
**/prefix*.parquet : match all parquet files with specific prefix.

Let's look at a specific example, matching the following bucket layout:

myBucket
    -> log_files
    -> some_table_files
        -> part1.csv
        -> part2.csv
    -> images
    -> more_table_files
        -> part3.csv
    -> extras
        -> misc
            -> another_part1.csv

We want to pick up part1.csv, part2.csv and part3.csv (excluding another_part1.csv for now). We could do this a few different ways:

We could pick up every csv file called "partX" with the single pattern **/part*.csv.
To be a bit more robust, we could use the dual pattern some_table_files/*.csv|more_table_files/*.csv to pick up relevant files only from those exact folders.
We could achieve the above in a single pattern by using the pattern *table_files/*.csv. This could however cause problems in the future if new unexpected folders started being created.
We can also recursively wildcard, so adding the pattern extras/**/*.csv would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".

As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
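
To make the matching rules concrete, the bucket layout above can be checked against each candidate pattern. The sketch below is an approximation written with the Python standard library only; it is not wcmatch itself and ignores other wcmatch features (character classes, hidden-file handling, and so on). The helper names glob_to_regex, matches and select are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> str:
    """Translate a simplified glob into a regex (approximation of GLOBSTAR rules)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole folders
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # anything, including folder separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # anything within a single path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")  # one character within a path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


def matches(path: str, patterns: str) -> bool:
    """Return True if the path matches any of the |-separated patterns."""
    return any(re.fullmatch(glob_to_regex(p), path) for p in patterns.split("|"))


# File paths from the example layout, relative to the bucket root.
bucket_paths = [
    "some_table_files/part1.csv",
    "some_table_files/part2.csv",
    "more_table_files/part3.csv",
    "extras/misc/another_part1.csv",
]


def select(patterns: str) -> list:
    return [p for p in bucket_paths if matches(p, patterns)]


for patterns in (
    "**/part*.csv",
    "some_table_files/*.csv|more_table_files/*.csv",
    "*table_files/*.csv",
    "extras/**/*.csv",
):
    print(patterns, "->", select(patterns))
```

Under this approximation, the first three patterns select part1.csv, part2.csv and part3.csv, and the last one selects only extras/misc/another_part1.csv.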

State

To perform incremental syncs, Airbyte syncs files from oldest to newest. Each synced file (up to 10,000 files) is added as an entry in the "history" section of the connection's state message. Once the history is full, the oldest entries are dropped, and Airbyte only reads files that were last modified between the date of the newest file in history and Days To Sync If History Is Full days prior.
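
The lookback window described above amounts to a simple date calculation. The sketch below is a simplified model for illustration, not the connector's implementation: the history is represented as a plain dictionary of file path to last-modified time, and the function name lookback_window_start is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def lookback_window_start(history: dict, days_to_sync_if_history_is_full: int) -> datetime:
    """Return the earliest last-modified time still considered once history is full."""
    newest = max(history.values())  # date of the newest file in history
    return newest - timedelta(days=days_to_sync_if_history_is_full)


# Illustrative history with two entries (a real history holds up to 10,000).
history = {
    "some_table_files/part1.csv": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "some_table_files/part2.csv": datetime(2024, 1, 10, tzinfo=timezone.utc),
}

window_start = lookback_window_start(history, days_to_sync_if_history_is_full=3)
print(window_start.isoformat())
```

With a newest entry dated 10 January and a value of 3 days, the window starts on 7 January, so a file last modified before that date would not be read in this model.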

User Schema

Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix.

Note: Without a provided schema, all columns of a CSV file will be inferred as strings.

Inference will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:

You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the _ab_additional_properties map.
Your initial dataset is quite small (in terms of number of records), and you think the automatic type inference from this sample might not be representative of the data in the future.
You want to purposely define types for every column.
You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the _ab_additional_properties map.

Or any other reason! The schema must be provided as valid JSON as a map of {"column": "datatype"} where each datatype is one of:

string
number
integer
object
array
boolean
null

For example:

{"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
{"username": "string", "friends": "array", "information": "object"}

Note: The S3 source connector used to infer schemas from all the available files and then merge them to create a superset schema. Starting from version 2.0.0, schema inference is based on the first file found only. The first file considered is the oldest one written to the prefix.

S3 Provider Settings

AWS Access Key ID: One half of the required credentials for accessing a private bucket.
AWS Secret Access Key: The other half of the required credentials for accessing a private bucket.
Endpoint: An optional parameter that enables the use of non-Amazon S3 compatible services. If you are using the default Amazon service, leave this field blank.
Start Date: An optional parameter that marks a starting date and time in UTC for data replication. Any files that have not been modified since this specified date/time will not be replicated. Use the provided datepicker (recommended) or enter the desired date programmatically in the format YYYY-MM-DDTHH:mm:ssZ. Leaving this field blank will replicate data from all files that have not been excluded by the Path Pattern and Path Prefix.
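
When setting the Start Date programmatically, the value needs to be a UTC timestamp in the YYYY-MM-DDTHH:mm:ssZ format described above. A small sketch of producing such a string in Python follows; the helper name to_start_date and the sample date are illustrative.

```python
from datetime import datetime, timezone


def to_start_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Sample value only; choose the cutoff that suits your data.
start_date = to_start_date(datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc))
print(start_date)
```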

File Format Settings

CSV

Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported so please ensure that this process happens consistently over time.

Header Definition: How headers will be defined. User Provided assumes the CSV does not have a header row and uses the headers you provide. Autogenerated assumes the CSV does not have a header row, and the CDK generates headers of the form f{i}, where i is the column index starting from 0. Otherwise, the default behavior is to use the header row from the CSV file. To autogenerate or provide column names for a CSV that does have headers, set the "Skip rows before header" option so that the header row is ignored.
Delimiter: Even though CSV is an acronym for Comma Separated Values, it is used more generally as a term for flat file data that may or may not be comma separated. The delimiter field lets you specify which character acts as the separator. To use tab-delimiters, you can set this value to \t. By default, this value is set to ,.
Double Quote: This option determines whether two quotes in a quoted CSV value denote a single quote in the data. Set to True by default.
Encoding: Some data may use a different character set (typically when different alphabets are involved). See the list of allowable encodings here. By default, this is set to utf8.
Escape Character: An escape character can be used to prefix a reserved character and ensure correct parsing. A commonly used character is the backslash (\). For example, given the following data:

Product,Description,Price
Jeans,"Navy Blue, Bootcut, 34\"",49.99

The backslash (\) is used directly before the second double quote (") to indicate that it is not the closing quote for the field, but rather a literal double quote character that should be included in the value (in this example, denoting the size of the jeans in inches: 34" ).

Leaving this field blank (default option) will disallow escaping.

False Values: A set of case-sensitive strings that should be interpreted as false values.
Null Values: A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.
Quote Character: In some cases, data values may contain instances of reserved characters (like a comma, if that's the delimiter). CSVs can handle this by wrapping a value in defined quote characters so that on read it can parse it correctly. By default, this is set to ".
Skip Rows After Header: The number of rows to skip after the header row.
Skip Rows Before Header: The number of rows to skip before the header row.
Strings Can Be Null: Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.
True Values: A set of case-sensitive strings that should be interpreted as true values.
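
The effect of the Delimiter, Quote Character and Escape Character settings can be tried out locally before configuring a stream. The sketch below uses Python's built-in csv module as a stand-in, which is an assumption: the connector's own parser may differ in details. It parses the jeans example from the Escape Character description and also shows the f{i} naming used for autogenerated headers.

```python
import csv
import io

# The sample data from the Escape Character description above.
raw = 'Product,Description,Price\nJeans,"Navy Blue, Bootcut, 34\\"",49.99\n'

rows = list(
    csv.reader(
        io.StringIO(raw),
        delimiter=",",    # Delimiter setting
        quotechar='"',    # Quote Character setting
        escapechar="\\",  # Escape Character setting
    )
)

header, record = rows
print(record)

# Autogenerated headers follow the f{i} pattern, with i starting from 0.
autogenerated = [f"f{i}" for i in range(len(record))]
print(autogenerated)
```

The escaped quote ends up inside the Description value, and the comma inside the quoted field is not treated as a delimiter.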

Parquet

Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. At the moment, partitioned parquet datasets are unsupported. The following settings are available:

Convert Decimal Fields to Floats: Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.
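
The precision loss mentioned above is easy to reproduce. This sketch uses Python's decimal module with an arbitrary sample value; it illustrates the general behavior of floating-point conversion and makes no claim about how the connector performs the conversion internally.

```python
from decimal import Decimal

# Arbitrary high-precision sample value.
exact = Decimal("1234567890.123456789012345678")

as_float = float(exact)                # conversion to a 64-bit float
round_trip = Decimal(repr(as_float))   # what survives the conversion

print("original:  ", exact)
print("via float: ", round_trip)
print("lossless:  ", round_trip == exact)
```

A 64-bit float carries roughly 15 to 17 significant decimal digits, so the trailing digits of the sample value do not survive the round trip.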

Avro

The Avro parser uses the Fastavro library. The following settings are available:

Convert Double Fields to Strings: Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision, because there can be a loss of precision when handling floating point numbers.

JSONL

There are currently no options for JSONL parsing.

Document File Type Format (Experimental)

Warning: The Document File Type Format is currently an experimental feature and not subject to SLAs. Use at your own risk.

The Document File Type Format is a special format that allows you to extract text from Markdown, TXT, PDF, Word and PowerPoint documents. If selected, the connector will extract text from the documents and output it as a single field named content. The document_key field will hold a unique identifier for the processed file, which can be used as a primary key. The content field will contain Markdown formatting converted from the original file format. Each file matching the defined glob pattern needs to be a Markdown (.md), plain text (.txt), PDF (.pdf), Word (.docx) or PowerPoint (.pptx) file.

One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.

Parsing via Unstructured.io Python Library

This connector utilizes the open source Unstructured library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the Unstructured docs and you can learn about other Unstructured tools and services at www.unstructured.io.

Reference

Config fields reference

Property name	Type
streams	array<object>
bucket	string
start_date	string
aws_access_key_id	string
role_arn	string
aws_secret_access_key	string
endpoint	string
region_name	string
dataset	string
path_pattern	string
format	object
schema	string
provider	object

Changelog

Version	Date	Pull Request	Subject
4.7.4	2024-08-10	43667	Update dependencies
4.7.3	2024-08-03	43083	Update dependencies
4.7.2	2024-07-27	42814	Update dependencies
4.7.1	2024-07-20	42205	Update dependencies
4.7.0	2024-07-16	41934	Update to 3.5.1 CDK
4.6.3	2024-07-13	41934	Update dependencies
4.6.2	2024-07-10	41503	Update dependencies
4.6.1	2024-07-09	40067	Update dependencies
4.6.0	2024-06-26	39573	Improve performance: update to Airbyte CDK 2.0.0
4.5.17	2024-06-06	39214	[autopull] Upgrade base image to v1.2.2
4.5.16	2024-05-29	38674	Avoid error on empty stream when running discover
4.5.15	2024-05-20	38252	Replace AirbyteLogger with logging.Logger
4.5.14	2024-05-09	38090	Bump python-cdk version to include CSV field length fix
4.5.13	2024-05-03	37776	Update `airbyte-cdk` to fix the `discovery` command issue
4.5.12	2024-04-11	37001	Update airbyte-cdk to flush print buffer for every message
4.5.11	2024-03-14	36160	Bump python-cdk version to include CSV tab delimiter fix
4.5.10	2024-03-11	35955	Pin `transformers` transitive dependency
4.5.9	2024-03-06	35857	Bump poetry.lock to upgrade transitive dependency
4.5.8	2024-03-04	35808	Use cached AWS client
4.5.7	2024-02-23	34895	Run incremental syncs with concurrency
4.5.6	2024-02-21	35246	Fixes bug that occurred when creating CSV streams with tab delimiter.
4.5.5	2024-02-18	35392	Add support filtering by start date
4.5.4	2024-02-15	35055	Temporarily revert concurrency
4.5.3	2024-02-12	35164	Manage dependencies with Poetry.
4.5.2	2024-02-06	34930	Bump CDK version to fix issue when SyncMode is missing from catalog
4.5.1	2024-02-02	31701	Add `region` support
4.5.0	2024-02-01	34591	Run full refresh syncs concurrently
4.4.1	2024-01-30	34665	Pin moto & CDK version
4.4.0	2024-01-12	33818	Add IAM Role Authentication
4.3.1	2024-01-04	33937	Prepare for airbyte-lib
4.3.0	2023-12-14	33411	Bump CDK version to auto-set primary key for document file streams and support raw txt files
4.2.4	2023-12-06	33187	Bump CDK version to hide source-defined primary key
4.2.3	2023-11-16	32608	Improve document file type parser
4.2.2	2023-11-20	32677	Only read files with ".zip" extension as zipped files
4.2.1	2023-11-13	32357	Improve spec schema
4.2.0	2023-11-02	32109	Fix docs; add HTTPS validation for S3 endpoint; fix coverage
4.1.4	2023-10-30	31904	Update CDK
4.1.3	2023-10-25	31654	Reduce image size
4.1.2	2023-10-23	31383	Add handling NoSuchBucket error
4.1.1	2023-10-19	31601	Base image migration: remove Dockerfile and use the python-connector-base image
4.1.0	2023-10-17	31340	Add reading files inside zip archive
4.0.5	2023-10-16	31209	Add experimental Markdown/PDF/Docx file format
4.0.4	2023-09-18	30476	Remove streams.*.file_type from source-s3 configuration
4.0.3	2023-09-13	30387	Bump Airbyte-CDK version to improve messages for record parse errors
4.0.2	2023-09-07	28639	Always show S3 Key fields
4.0.1	2023-09-06	30217	Migrate inference error to config errors and avoid sentry alerts
4.0.0	2023-09-05	29757	New version using file-based CDK
3.1.11	2023-08-30	29986	Add config error for conversion error
3.1.10	2023-08-29	29943	Add config error for arrow invalid error
3.1.9	2023-08-23	29753	Feature parity update for V4 release
3.1.8	2023-08-17	29520	Update legacy state and error handling
3.1.7	2023-08-17	29505	v4 StreamReader and Cursor fixes
3.1.6	2023-08-16	29480	update Pyarrow to version 12.0.1
3.1.5	2023-08-15	29418	Avoid duplicate syncs when migrating from v3 to v4
3.1.4	2023-08-15	29382	Handle legacy path prefix & path pattern
3.1.3	2023-08-05	29028	Update v3 & v4 connector to handle either state message
3.1.2	2023-07-29	28786	Add a codepath for using the file-based CDK
3.1.1	2023-07-26	28730	Add human readable error message and improve validation for encoding field when it empty
3.1.0	2023-06-26	27725	License Update: Elv2
3.0.3	2023-06-23	27651	Handle Bucket Access Errors
3.0.2	2023-06-22	27611	Fix start date
3.0.1	2023-06-22	27604	Add logging for file reading
3.0.0	2023-05-02	25127	Remove ab_additional column; Use platform-handled schema evolution
2.2.0	2023-05-10	25937	Add support for Parquet Dataset
2.1.4	2023-05-01	25361	Parse nested avro schemas
2.1.3	2023-05-01	25706	Remove minimum block size for CSV check
2.1.2	2023-04-18	25067	Handle block size related errors; fix config validator
2.1.1	2023-04-18	25010	Refactor filter logic
2.1.0	2023-04-10	25010	Add `start_date` field to filter files based on `LastModified` option
2.0.4	2023-03-23	24429	Call `check` with a little block size to save time and memory.
2.0.3	2023-03-17	24178	Support legacy datetime format for the period of migration, fix time-zone conversion.
2.0.2	2023-03-16	24157	Return empty schema if `discover` finds no files; Do not infer extra data types when user defined schema is applied.
2.0.1	2023-03-06	23195	Fix datetime format string
2.0.0	2023-03-14	23189	Infer schema based on one file instead of all the files
1.0.2	2023-03-02	23669	Made `Advanced Reader Options` and `Advanced Options` truly `optional` for `CSV` format
1.0.1	2023-02-27	23502	Fix error handling
1.0.0	2023-02-17	23198	Fix Avro schema discovery
0.1.32	2023-02-07	22500	Speed up discovery
0.1.31	2023-02-08	22550	Validate CSV read options and convert options
0.1.30	2023-01-25	21587	Make sure spec works as expected in UI
0.1.29	2023-01-19	21604	Handle OSError: skip unreachable keys and keep working on accessible ones. Warn a customer
0.1.28	2023-01-10	21210	Update block size for json file format
0.1.27	2022-12-08	20262	Check config settings for CSV file format
0.1.26	2022-11-08	19006	Add virtual-hosted-style option
0.1.24	2022-10-28	18602	Wrap errors into AirbyteTracedException pointing to a problem file
0.1.23	2022-10-10	17800	Deleted `use_ssl` and `verify_ssl_cert` flags and hardcoded to `True`
0.1.23	2022-10-10	17991	Fix pyarrow to JSON schema type conversion for arrays
0.1.22	2022-09-28	17304	Migrate to per-stream state
0.1.21	2022-09-20	16921	Upgrade pyarrow
0.1.20	2022-09-12	16607	Fix for reading jsonl files containing nested structures
0.1.19	2022-09-13	16631	Adjust column type to a broadest one when merging two or more json schemas
0.1.18	2022-08-01	14213	Add support for jsonl format files.
0.1.17	2022-07-21	14911	"decimal" type added for parquet
0.1.16	2022-07-13	14669	Fixed bug when extra columns appeared to be non-present in master schema
0.1.15	2022-05-31	12568	Fixed possible case of files being missed during incremental syncs
0.1.14	2022-05-23	11967	Increase unit test coverage up to 90%
0.1.13	2022-05-11	12730	Fixed empty options issue
0.1.12	2022-05-11	12602	Added support for Avro file format
0.1.11	2022-04-30	12500	Improve input configuration copy
0.1.10	2022-01-28	8252	Refactoring of files' metadata
0.1.9	2022-01-06	9163	Work-around for web-UI, `backslash - t` converts to `tab` for `format.delimiter` field.
0.1.7	2021-11-08	7499	Remove base-python dependencies
0.1.6	2021-10-15	6615 & 7058	Memory and performance optimisation. Advanced options for CSV parsing.
0.1.5	2021-09-24	6398	Support custom non Amazon S3 services
0.1.4	2021-08-13	5305	Support of Parquet format
0.1.3	2021-08-04	5197	Fixed bug where sync could hang indefinitely on schema inference
0.1.2	2021-08-02	5135	Fixed bug in spec so it displays in UI correctly
0.1.1	2021-07-30	4990	Fixed documentation url in source definition
0.1.0	2021-07-30	4990	Created S3 source connector
