Verics: Grouping by file type

Everything you need to know about your Verics report


In this section, we group the data found in your documents by the type of file in which it appeared.

This gives an overall view of the volume of information exposed on the web, shows how that data is distributed across the types of documents analysed, and helps identify which publication channels are most affected.

Volume of data exposed

Here we offer you an overview of the volume of data exposed in the files of the web domain.

To produce this graph, we have taken all the data we have analysed and grouped it by the type of file in which it appeared.

The number associated with each type of document is the sum of the data items found in the following categories:

  • Leaks: Data that constitutes a breach of the GDPR:
    • Names and surnames of individuals
    • Personal emails
    • DNIs
  • Sensitive: Data that may represent a breach of the regulation when combined with other data:
    • GPS coordinates
    • Addresses
  • Style: Data that reveals which tools were used to create and/or manipulate the files:
    • Office tools
    • Photo cameras
    • Printer and photocopier models
    • ….
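
The grouping described above can be illustrated with a minimal sketch. It assumes each finding is recorded as a (file type, category) pair; the `findings` list, the category keys and the function name are hypothetical and do not reflect how Verics stores or processes its data internally.

```python
from collections import Counter

# Hypothetical findings: one (file_type, category) pair per data item found.
findings = [
    ("pdf", "leak"),
    ("pdf", "style"),
    ("docx", "leak"),
    ("docx", "leak"),
    ("jpg", "sensitive"),
    ("jpg", "style"),
]

# Assumed keys for the three categories described in the list above.
CATEGORIES = {"leak", "sensitive", "style"}


def volume_by_file_type(findings):
    """Return the total number of data items found per file type."""
    totals = Counter()
    for file_type, category in findings:
        # Only items belonging to one of the three categories are counted.
        if category in CATEGORIES:
            totals[file_type] += 1
    return dict(totals)


volumes = volume_by_file_type(findings)
```

Each value in `volumes` would correspond to the number shown for that file type in the graph, that is, the sum over the three categories.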

Distribution of the above data

Using the same information that generates the visualization of the volume of data exposed, this graph shows how the severity of that data is distributed across the types of file analysed.

With this sub-grouping, we can identify which publication channels are causing the greatest number of leaks that do not comply with the GDPR.
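
As a rough sketch of this sub-grouping, the same kind of findings can be counted per file type and per category, and the file type with the most leaks picked out. The data, the category keys and the function names are illustrative assumptions, not the actual Verics implementation.

```python
from collections import Counter, defaultdict

# Hypothetical findings: one (file_type, category) pair per data item found.
findings = [
    ("pdf", "leak"),
    ("pdf", "style"),
    ("docx", "leak"),
    ("docx", "leak"),
    ("jpg", "sensitive"),
    ("jpg", "style"),
]


def severity_distribution(findings):
    """Count data items per category within each file type."""
    counts = defaultdict(Counter)
    for file_type, category in findings:
        counts[file_type][category] += 1
    return {file_type: dict(c) for file_type, c in counts.items()}


def file_type_with_most_leaks(distribution):
    """Return the file type whose 'leak' count is highest."""
    return max(distribution, key=lambda ft: distribution[ft].get("leak", 0))


distribution = severity_distribution(findings)
worst = file_type_with_most_leaks(distribution)
```

In this made-up data set, `worst` points to the publication channel (file type) that would deserve attention first.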

Details of the data

The table summarizes the data represented in the previous graphs.

Here you can explore the details of what we have found, share them with your client, perform other types of analysis, or simply save them to keep your own data history.
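
If you keep your own data history, one simple option is to store each table as dated CSV rows. The sketch below assumes the table has one row per file type and one column per category; the column names, the sample figures and the `table_to_csv` helper are hypothetical and are not an export format defined by Verics.

```python
import csv
import io

# Hypothetical table rows: counts per category for each file type.
rows = [
    {"file_type": "pdf", "leak": 12, "sensitive": 0, "style": 30},
    {"file_type": "jpg", "leak": 3, "sensitive": 5, "style": 8},
]


def table_to_csv(rows, report_date):
    """Serialize the table to CSV text, tagging every row with the report date."""
    buffer = io.StringIO()
    fieldnames = ["report_date", "file_type", "leak", "sensitive", "style"]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({"report_date": report_date, **row})
    return buffer.getvalue()


csv_text = table_to_csv(rows, "2024-01-31")
```

Appending the output of successive reports to the same file would make it possible to compare how the exposure per file type evolves over time.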


Depending on the type of company or institution that owns the domains you have analysed, a lot of information can be extracted just by looking at how these data exposures are grouped and which types of documents make up the website.

Here are some of the most representative examples.

Public institutions

These websites are characterised by a very large number of documents.
The number of documents from office applications (LibreOffice, Word, Excel, PowerPoint…) and PDFs is much greater than the number of graphics files, given how administrations tend to share information.

If the publication channels for office documents are not looked after, these documents usually contain a large amount of data in the leak (GDPR breach) and style categories. However, as is to be expected, they do not contain sensitive data such as coordinates or addresses.

The origin of PDF documents is much more varied, because the range of tools for generating and publishing them is much wider than the range of office tools. As a result, there is no clear pattern in the nature of the data exposed in PDF documents.

Big companies

These sites also present a very large number of documents, but the pattern in the type of files we find is very different.

Unlike public administrations, here there are many more file types:

  • PDF: Publication channels are usually centralized, which keeps the information included in public files homogeneous across the website. The advantage of this approach is that there is only a “single” point of failure when it comes to exposing data, and the publication processes are very thoroughly controlled. The disadvantage is that if this point fails, the exposure affects the whole website (as in the example of the graph).
  • Graphics (JPG, PNG…): Large private companies invest much more money in their corporate image, and this is reflected in the mix of file types that make up the website.
  • HTML: Communication strategies are normally based on creating content in the company's own blog, or as articles within its content manager.

Small companies

These sites exist simply to provide an online presence.

They are characterised by a very basic set of file types, dominated by:

  • Graphics (JPG, PNG…): Usually obtained from image banks. The data that we classify as leaks relates to the authors of these photographs. In this context the exposure is deliberate, as it provides the author with very inexpensive advertising.
  • HTML: WordPress is the dominant content manager for these websites, so it makes sense that HTML files are the predominant information containers.
