# Data requirements

## How to deliver the data

Supported file formats are listed [here](/getting-started/processing-documents/input-of-files.md#supported-file-formats).

For scans, use a resolution of at least 150 DPI to get good-quality OCR. Lower resolutions reduce OCR accuracy.
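As a rough illustration of the 150 DPI guideline, the sketch below estimates the effective DPI of a scanned page from its pixel dimensions. It assumes the page is A4 (8.27 × 11.69 inches); the function names `scan_dpi` and `meets_min_dpi` are hypothetical helpers for this example and are not part of Metamaze.

```python
MIN_DPI = 150  # guideline from the text above

# Assumption: the scanned page is A4 (8.27 x 11.69 inches).
A4_WIDTH_IN = 8.27
A4_HEIGHT_IN = 11.69


def scan_dpi(width_px, height_px, page_width_in=A4_WIDTH_IN, page_height_in=A4_HEIGHT_IN):
    """Return the lower of the horizontal and vertical DPI of a scanned page."""
    return min(width_px / page_width_in, height_px / page_height_in)


def meets_min_dpi(width_px, height_px, minimum=MIN_DPI):
    """Return True when the scan resolution reaches the minimum DPI."""
    return scan_dpi(width_px, height_px) >= minimum


# A 2480 x 3508 px image of an A4 page is roughly a 300 DPI scan.
print(meets_min_dpi(2480, 3508))  # True
# An 827 x 1169 px image of an A4 page is roughly a 100 DPI scan.
print(meets_min_dpi(827, 1169))   # False
```

If the physical page size is different (for example US Letter), pass the actual dimensions in inches instead of relying on the A4 defaults.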

Smartphone photos of documents are supported, but poor-quality photos (for example blurry, skewed, or badly lit ones) can reduce OCR accuracy. It is sometimes useful to preprocess them using a smartphone scanning app such as CamScanner.

### Training data

For document classification, you can provide a list of files, each with its correct document class or type.

For entity extraction, it is best to label the data in Metamaze.

## Data requirements needed for document classification

Document type prediction is usually a very accurate step in the process. Still, the amount of labelled data needed depends on the amount of variation in your data.

### Text based document classification - document types

To classify a document type based on text (for example a pay slip, invoice, loan agreement, ...), you typically need at least 20 examples per document type.

If the document types are very similar, the model needs more examples to learn how to distinguish them. For example, distinguishing second-hand car purchase orders from new car purchase orders might require more data.
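The rule of thumb of at least 20 examples per document type can be checked before uploading. The following is a minimal sketch, assuming you have one label per training document in a plain Python list; `underrepresented_types` is a hypothetical helper, not a Metamaze function.

```python
from collections import Counter

MIN_EXAMPLES_PER_TYPE = 20  # rule of thumb from the text above


def underrepresented_types(labels, minimum=MIN_EXAMPLES_PER_TYPE):
    """Map each document type below the minimum to the number of examples still missing."""
    counts = Counter(labels)
    return {doc_type: minimum - n for doc_type, n in counts.items() if n < minimum}


# Illustrative label list: one entry per training document.
labels = ["invoice"] * 25 + ["pay slip"] * 12 + ["loan agreement"] * 20
print(underrepresented_types(labels))  # {'pay slip': 8}
```

For document types that are hard to tell apart, consider passing a higher `minimum`, since 20 is only a typical lower bound.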

### Text based document classification - interpretative

For tasks such as sentiment analysis or priority estimation, which have a wide variety of input cases, a custom data requirements exercise is needed, depending on the output classes. These tasks can quickly require more than 100 documents per class.

### Image based document classification

When you need to classify a document based on visual content instead of text, please contact a Metamaze ML Engineer.

## Data requirements for entity extraction

Data requirements depend on the problem you are trying to solve. The machine learning models learn from context, so the more the context varies in production, the more data you need to annotate. Conversely, the more **relevant** context you have, the easier the task is for the algorithm.

It is best to upload training data that is **as similar to production data as possible**. So if you want to build a production model that works for only 5 suppliers in one language, then only upload data from those 5 suppliers in that language. If you want to build a production model that needs to work for any supplier (e.g. thousands of different ones) in any language, it is best to upload the highest diversity: different suppliers, different languages, ...
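One way to check how closely a training set matches production is to compare which supplier and language combinations each contains. The sketch below assumes every document is described by a dictionary with `supplier` and `language` keys; this metadata layout and the `missing_coverage` helper are illustrative assumptions, not a Metamaze format.

```python
def missing_coverage(training_docs, production_docs):
    """Return (supplier, language) pairs that occur in production but not in training."""
    seen_in_training = {(d["supplier"], d["language"]) for d in training_docs}
    seen_in_production = {(d["supplier"], d["language"]) for d in production_docs}
    return sorted(seen_in_production - seen_in_training)


# Illustrative metadata only; supplier names are made up.
training_docs = [
    {"supplier": "Acme", "language": "nl"},
    {"supplier": "Acme", "language": "fr"},
    {"supplier": "Globex", "language": "nl"},
]
production_docs = [
    {"supplier": "Acme", "language": "nl"},
    {"supplier": "Globex", "language": "fr"},
    {"supplier": "Initech", "language": "nl"},
]
print(missing_coverage(training_docs, production_docs))
# [('Globex', 'fr'), ('Initech', 'nl')]
```

Any pair in the result is a combination the model will meet in production without having seen it during training, which suggests where to collect more examples.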

The following properties make extraction **harder** for the model and therefore increase the amount of data needed:

* Every document is unique: no recurring templates in documents
* Many bad quality scans: these are not used as training data, so if the source data contains a lot of them, you need more source data to retain the same number of good quality scans
* Lots of interpretation needed due to subtle differences
* Little context to learn from
* No standardisation of terminology
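The effect of bad quality scans on the amount of source data can be estimated with simple arithmetic: if a fraction of the source documents is unusable, the source set must be larger by a factor of 1 / (1 - fraction). The sketch below assumes you can estimate that fraction from a sample; `source_docs_needed` is a hypothetical helper.

```python
import math


def source_docs_needed(target_usable, bad_scan_fraction):
    """Return how many source documents are needed to retain `target_usable` good scans."""
    if not 0 <= bad_scan_fraction < 1:
        raise ValueError("bad_scan_fraction must be in the range [0, 1)")
    return math.ceil(target_usable / (1 - bad_scan_fraction))


# To keep 500 usable documents when 25% of the scans are too poor to use:
print(source_docs_needed(500, 0.25))  # 667
```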

A couple of examples:

| Example                                                                                                                                                                                                                                                                                                  | Minimal data required |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------- |
| <p>Annual accounting reports</p><ul><li>Fairly standardised, as mandated by law</li><li>Context is fairly stable: some companies will have more or fewer lines, but they are always in the same order.</li><li>Context is small: only keywords to learn from instead of whole sentences</li></ul> | >50 docs              |
| <p>Financial prospectuses</p><ul><li>No structure at all: free form text like legal contracts</li><li>Very long documents (>50 pages) of which most is irrelevant</li><li>No standardisation and varied terminology</li><li>Subtle nuances in entity types (e.g. interest rate vs coupon rate)</li></ul> | >500 docs             |
| Technical documentation - standardised fact sheet                                                                                                                                                                                                                                                        | >100 docs             |
| <p>Car purchase documents</p><ul><li>Every car salesman has a different layout, with >1000 different layouts</li><li>Non-standardised terminology (e.g. saldo, te betalen, ...)</li><li>Little context: mainly keywords but no long text</li></ul> | >500 docs             |
| Computer created, standard forms with simple standard fields (one template)                                                                                                                                                                                                                              | >10 docs              |

