<p><em>ipfs-search.com blog: project blog for ipfs-search.com. Feed generated by Jekyll on 2023-06-05.</em></p>
<h1 id="bump-in-the-road">Bump in the road</h1>
<p><em>2023-06-02 · <a href="https://blog.ipfs-search.com/bump-in-the-road">https://blog.ipfs-search.com/bump-in-the-road</a></em></p>
<p>Alas. We are shutting down ipfs-search, with a full stop on June 7th. For now. The upkeep is too much to carry, and we have not found the necessary support yet. We don’t know when we will go live again; there is no certainty at this point. Only reflection, with a sense of pride in what we built and achieved over the years, and hope for a relaunch.</p>
<h3 id="our-journey">Our journey</h3>
<p><a href="http://Ipfs-search.com">ipfs-search.com</a> started in 2016 as an idea of Mathijs to index his ebook collection using IPFS. This quickly grew into a crawler that “sniffed” the updates of IPFS, collecting what was being shared on the entire network by everybody. Add metadata extraction, a search API and a simple frontend, and ipfs-search was born.</p>
<p><img src="/assets/images/2023-06-02-bump-in-the-road/2023-06-02-First-frontend.png" alt="Untitled" /></p>
<p>Later, Aad from <a href="http://RedPencil.io">RedPencil.io</a> joined to help with frontend improvements, hosting, and fundraising, and over the years <a href="http://Ipfs-search.com">ipfs-search.com</a> grew into a usable search engine for IPFS, one that took a stance against collecting users’ personal data or ranking search results based on advertisements. In fact, there are no advertisements and no user-targeted biases in the search results. We store nothing about our users!</p>
<p>In the end, the index was growing by a whopping 1 million CIDs per day, and with the help of an <a href="https://nlnet.nl/NGI0/">NLNet/NGI0 grant</a> we created the third iteration of the frontend: complete with search filters, a well-designed, mobile-friendly layout, and file viewers for almost anything, including e-books, images, videos, and even a music player with playlists.</p>
<p>We received a <a href="https://github.com/filecoin-project/devgrants/blob/master/open-grant-proposals/ipfs-search-scale-out.md">Filecoin Dev Grant</a> to make the search engine scalable for high user traffic, for which we had to overcome several <a href="https://blog.ipfs-search.com/challenge-accepted/">bumps in the road</a>, and finally succeeded. The search engine is now ready for its next stages, and offers a great way to access content on IPFS with an unparalleled index of data collected over 7 years.</p>
<p><img src="/assets/images/2023-06-02-bump-in-the-road/2023-06-02-Newest-frontend.png" alt="Untitled" /></p>
<p>There are some problems too: while we had packed in a lot of awesome technical features, the search is unbiased and therefore does not really target any user segment. We had no marketing department to gather more users, and unmoderated access to the millions of random files shared on IPFS does not generate viral adoption of a search engine. Without advertising or user targeting as a business model, there was no immediate way to monetize the frontend as a product at all, let alone enough to work on usability features. We did not want to sell out on our principles, and further funding proved a lot harder to find than expected. The crypto winter and the global economic turmoil did not help in that regard. In the meantime, the costs in servers and person-hours kept piling up.</p>
<h3 id="hope-for-the-future">Hope for the future</h3>
<p>So, as we are running into the red, we have decided to shut down, albeit with great regret, until we find a way to make this sustainable. There are a lot of people expressing their support and even helping to fund us through <a href="https://opencollective.com/ipfs-search">OpenCollective</a>, and we truly hate to feel like we are letting them down. Fortunately, there are some beacons of hope on the horizon!</p>
<p>First of all, we have found a new partner to take care of hosting ipfs-search.com: <a href="https://www.notion.so/a6ef0ea4ea404079a2e4e2d051d95e6d?pvs=21">DCent</a>. They have generously offered to grant us their hardware while we find new support, and we hope we can migrate there soon. Besides this, we are exploring several funding options, for which we may have to develop novel ways to apply the technology or invent new features that make it attractive to future users.</p>
<h3 id="epilogue">Epilogue</h3>
<p>We hit a bump in the road, and have to shut down. But we hope it does not affect our users and supporters for too long, and we are making good plans to restart. We are looking forward to the next phase of this project and are proud of the efforts that have brought us this far, so keep an eye on our feeds. And if you have any ideas or questions on how to help, contact us at info@ipfs-search.org.</p>
<p><em>Frido Emans</em></p>
<h1 id="the-crossroads">The Crossroads: ipfs-search.com’s Fight for Survival</h1>
<p><em>2023-05-23 · <a href="https://blog.ipfs-search.com/crossroads">https://blog.ipfs-search.com/crossroads</a></em></p>
<h1 id="introduction">Introduction</h1>
<p>Since 2016, we at <a href="http://ipfs-search.com/">ipfs-search.com</a> have been committed to building a neutral, privacy-friendly, open-source search for the Web3 community. However, today we find ourselves at a critical crossroads. Financial challenges are threatening our ability to continue to operate, risking an imminent shutdown of our public services. We wanted to take this moment to reflect on our journey, explain our current situation, and outline our next steps.</p>
<figure>
<img alt="Crossroads" src="/assets/images/2023-05-23-crossroads/crossroads.jpg" />
<figcaption>Standing at the crossroads, ready to forge a path forward. CC BY-SA 2.0 <a href="https://www.flickr.com/photos/laenulfean/5943132296">Carsten Tolkmit</a></figcaption>
</figure>
<h1 id="our-journey-and-impact">Our Journey and Impact</h1>
<p>Our story began in 2016, as a hobby project with a vision for a more decentralized and democratized internet. We saw the potential of the Interplanetary File System (IPFS) to disrupt traditional web paradigms and provide a genuinely open, resilient, and privacy-focused infrastructure for information. Recognizing that IPFS’s potential would be unattainable without an effective search and discovery tool, we embarked on creating the first search engine specifically designed for IPFS, giving birth to ipfs-search.com.</p>
<p>As pioneers in this space, we committed to creating a search engine rooted in principles of privacy, neutrality, and transparency – ideals often overlooked by traditional search engines. Our aim has always been to make information accessible to all, unbiased, and free from the control of any single entity.</p>
<p>Throughout the years, we’ve had the pleasure of receiving support from various sources. We received backing from the EU commission’s <a href="https://nlnet.nl/discovery/">NGI0 Search and Discovery fund</a> through the <a href="https://nlnet.nl/">NLNet Foundation</a>, an organization dedicated to promoting a networked world unrestricted by commercial or political monopolies. We also faced difficult times, including a previous shutdown, after which Aad from <a href="https://redpencil.io/">redpencil.io</a> graciously stepped up and offered to support our hosting temporarily.</p>
<p>This support enabled us to evolve from a single server to a proper cluster, enhancing our service. However, as our infrastructure grew, so did the associated costs. Operating costs have become prohibitive for our small initiative, despite our significant personal investments and dedication to this cause. The financial commitment required for maintaining our services has been coupled with our personal financial needs, exacerbated by the onset of the crypto winter and a wider economic downturn.</p>
<p>Nevertheless, our journey has been filled with remarkable achievements. From <a href="https://blog.ipfs-search.com/challenge-accepted/">handling 1000 hits/s on our API endpoints</a> in 2022, showcasing our readiness to scale with IPFS and handle large-scale integration as the first search and discovery platform within the IPFS ecosystem, to contributing to the broader Web3 movement, we’ve left an indelible mark on the digital landscape.</p>
<p>We’ve proudly fostered a more open and inclusive digital world through our open-source commitment, enabling the community to freely use, study, share and improve our work. But as we now stand at a critical crossroads, we reflect with pride on our accomplishments and look forward with resolve to surmount the challenges that lie before us. Our mission of providing a truly neutral, open-source search for Web3 remains unshakeable.</p>
<h1 id="the-current-situation">The Current Situation</h1>
<p>Without additional funding, we will need to shut down our public APIs in the coming weeks. Site search is already suspended, but API access remains—for now. This is a consequence of the balancing act between our commitment to the mission and the stark reality of operational costs and personal sustenance needs.</p>
<figure>
<img alt="Screenshot of our frontend being shutdown, with a banner instead of search." src="/assets/images/2023-05-23-crossroads/screenshot.png" />
<figcaption>Site search is already suspended, but API access remains—for now.</figcaption>
</figure>
<h1 id="the-road-ahead">The Road Ahead</h1>
<p>However, we view this not as the end of our journey, but as a challenging bend in the road. Our vision of a truly democratized internet, where information is accessible to all and uninfluenced by political or commercial interests, remains strong. We’re actively looking for solutions, seeking new funding sources, and exploring every possible avenue to continue our mission.</p>
<p>We need your help to navigate this. If you believe in what we do, please <a href="https://twitter.com/intent/tweet?text=%F0%9F%9A%A8URGENT%3A%20ipfs-search.com%2C%20trusted%20%23Web3%20search%20since%202016%2C%20is%20down%20due%20to%20financial%20challenges.%20Help%20safeguard%20the%20future%20of%20open%2C%20unbiased%20search%20for%20%23IPFS.%20Spread%20the%20word%20%26%20show%20your%20support%20at%20https%3A%2F%2Fopencollective.com%2Fipfs-search%20%23SaveIPFSSearch%20">share</a> our situation within your community and consider supporting us directly through <a href="https://opencollective.com/ipfs-search">OpenCollective</a>.</p>
<figure>
<img alt="Misty view of curvy mountain road." src="/assets/images/2023-05-23-crossroads/misty-future.jpg" />
<figcaption>We view this as a challenging bend in the road.</figcaption>
</figure>
<h1 id="stay-in-touch">Stay in Touch</h1>
<p>We are committed to transparency and will continue to share updates here on the blog and on <a href="https://mastodon.social/@ipfssearch">Mastodon</a> and <a href="https://twitter.com/SearchIpfs">Twitter</a>. We’re always open to your questions and suggestions, so feel free to reach out to us at <a href="mailto:info@ipfs-search.com">info@ipfs-search.com</a>.</p>
<p>Thank you for your understanding and unwavering support.</p>
<p>Sincerely,</p>
<p>The <a href="http://ipfs-search.com/">ipfs-search.com</a> Team<br />
Mathijs de Bruin<br />
Frido Emans<br />
Aad Versteden</p>
<h1 id="searching-web-3-at-web-scale">Searching Web 3 at Web Scale</h1>
<p><em>2023-04-18 · <a href="https://blog.ipfs-search.com/searching-at-scale">https://blog.ipfs-search.com/searching-at-scale</a></em></p>
<h1 id="introduction">Introduction</h1>
<p>In 2021 we set ourselves the ambition of handling 1000 hits/s on our API endpoints, to demonstrate that we are ready to scale with IPFS and can handle large-scale integration as the first and only search and discovery platform within the IPFS ecosystem.</p>
<figure>
<img alt="Source: https://messari.io/report/state-of-filecoin-q4-2022" src="https://cdn.sanity.io/images/2bt0j8lu/production/e992bf46793e79d3a5bad6ade311cba1752b027d-1920x1080.png?w=900&fit=max&auto=format&dpr=3" />
<figcaption>Source: <a href="https://messari.io/report/state-of-filecoin-q4-2022">https://messari.io/report/state-of-filecoin-q4-2022</a></figcaption>
</figure>
<p>This is the second post in a 2-post miniseries where we explain the challenges we faced scaling up, and how they were eventually overcome. In <a href="https://blog.ipfs-search.com/challenge-accepted/">the previous post</a> we dealt mainly with keeping response times low as we scaled our cluster to 33 nodes. This post describes how we built a realistic benchmark, the problems we faced scaling up to 73 nodes, and how those were overcome by completely restructuring our indexes.</p>
<p>Finally, we can say that our platform can handle well over 1300 hits/s, with &lt;150ms response times for 95% of requests, equivalent to serving 3100 unique users.</p>
<figure>
<img src="/assets/images/2023-04-18-searching-at-scale/Untitled.png" />
</figure>
<figure>
<img src="/assets/images/2023-04-18-searching-at-scale/Untitled%201.png" />
</figure>
<h1 id="building-a-real-world-benchmark">Building a real-world benchmark</h1>
<h2 id="caching-in-opensearch">Caching in OpenSearch</h2>
<p>As with any real-world web application, our search engine and its backend, OpenSearch, rely heavily on caching. In particular, the <a href="https://lucene.apache.org/core/">Lucene</a> indexes on which OpenSearch is built use <a href="https://dzone.com/articles/use-lucene%E2%80%99s-mmapdirectory">memory-mapped files</a>, transparently allowing the OS kernel to keep all or parts of the search index in RAM. This means that Lucene can read files as if they were already in memory. It also means that there is <em>simply no way to switch caching off</em>. On top of this, OpenSearch has caches for requests, for shard data, and for sorting and aggregation (field data), which use heap memory. Hence the recommendation to allocate no more than about half of RAM to OpenSearch/Elasticsearch’s heap, the remainder being used by the OS’s VFS (<a href="https://en.wikipedia.org/wiki/Virtual_file_system">virtual file system</a>) to cache memory-mapped files.</p>
<p>This is particularly relevant when designing a benchmark for a large database or a search engine. If we were simply to repeat a single request, or a small number of requests, we would not really be benchmarking our search engine — we would be testing the performance of its caches instead!</p>
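<p>To make the pitfall concrete, here is a toy sketch (plain JavaScript, not our actual stack) of a “backend” with a request cache: hammering it with one repeated query measures only the cache, while varied queries actually exercise the backend.</p>

```javascript
// Toy illustration of the cache pitfall. Repeating one query measures the
// cache; only varied queries force the backend to do the expensive work.
function makeBackend() {
  const cache = new Map();
  let backendHits = 0; // how often we actually did the expensive work

  return {
    search(q) {
      if (!cache.has(q)) {
        backendHits++; // expensive path: index lookup, scoring, ...
        cache.set(q, `results for ${q}`);
      }
      return cache.get(q);
    },
    get backendHits() {
      return backendHits;
    },
  };
}
```

<p>Firing 100 identical requests at this backend does the expensive work exactly once; 100 distinct queries do it 100 times. A benchmark of the first kind tells you nothing about the backend’s capacity.</p>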
<h2 id="creating-benchmarks-from-real-world-traffic">Creating benchmarks from real-world traffic</h2>
<p>In order to circumvent this problem, we decided to use actual API requests to model ‘virtual users’. While <a href="https://github.com/ipfs-search/ipfs-search-deployment/blob/main/roles/vendor/nginx/templates/nginx.conf.j2#L32">we do not store any identifiable information</a> on our users (yay, no pesky GDPR banners!), we do in fact log all requests. So we <a href="https://github.com/ipfs-search/ipfs-search-benchmark/blob/main/logtobatches.js">wrote a script</a> and processed some 6 months of log data into visits: the browsing experiences of virtual users, based on actual user journeys through our API and frontend.</p>
<p>The result is a 240 MB JSON blob and a short JavaScript file to be used with Grafana’s load tester <a href="https://k6.io/">k6</a>. You can check out our <a href="https://github.com/ipfs-search/ipfs-search-benchmark">repo</a> to see exactly what we’ve done!</p>
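<p>The grouping logic at the heart of such a script can be sketched as follows; the record shape (a hashed client key, a timestamp and a path) and the 30-minute session gap are illustrative assumptions here, not our actual log format.</p>

```javascript
// Group anonymized request-log records into "visits": sequences of requests
// from the same (hashed) client with gaps of at most 30 minutes between them.
const SESSION_GAP_MS = 30 * 60 * 1000;

function logToVisits(records) {
  // Bucket records per client.
  const byClient = new Map();
  for (const rec of records) {
    if (!byClient.has(rec.client)) byClient.set(rec.client, []);
    byClient.get(rec.client).push(rec);
  }

  // Walk each client's records in time order, starting a new visit
  // whenever the gap to the previous request exceeds the threshold.
  const visits = [];
  for (const recs of byClient.values()) {
    recs.sort((a, b) => a.ts - b.ts);
    let current = null;
    let lastTs = -Infinity;
    for (const rec of recs) {
      if (rec.ts - lastTs > SESSION_GAP_MS) {
        current = { requests: [] };
        visits.push(current);
      }
      current.requests.push(rec.path);
      lastTs = rec.ts;
    }
  }
  return visits;
}
```

<p>Each resulting visit is then replayed request-by-request by one virtual user in the load test.</p>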
<h3 id="choosing-k6">Choosing k6</h3>
<p>We just love everything in Grafana’s stack, particularly how <a href="https://grafana.com/blog/2021/04/20/grafana-loki-tempo-relicensing-to-agplv3/">they’re AGPL</a>, like us! But we chose k6 because it’s extremely efficient at handling a large number of parallel sockets, using Golang’s goroutines for fully non-blocking parallel performance while using a JS VM (<a href="https://github.com/dop251/goja">Goja</a>) for implementing/scripting the actual tests. This ensures that the machine running the tests is almost never the bottleneck, and hence (at this scale) we don’t have to worry about coordinating load tests across multiple machines.</p>
<h3 id="early-results">Early results</h3>
<p>With the tests we created, we can simply select the number of Virtual Users (VUs), or <a href="https://github.com/ipfs-search/ipfs-search-benchmark/blob/main/k6loadtest.js#L35">specify ramping in stages</a>, to perform tests yielding results like this:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>running (5m30.0s), 0000/2000 VUs, 5129 complete and 1604 interrupted iterations
default ✓ [======================================] 2000 VUs  5m0s

     ✗ is status 200
      ↳  79% — ✓ 173720 / ✗ 44841

   ✗ checks.........................: 79.48% ✓ 173720 ✗ 44841
     data_received..................: 516 MB 1.6 MB/s
     data_sent......................: 35 MB  107 kB/s
     http_req_blocked...............: avg=47.9ms  min=0s       med=250ns   max=21.83s   p(90)=400ns   p(95)=491ns
     http_req_connecting............: avg=34.1ms  min=0s       med=0s      max=15.54s   p(90)=0s      p(95)=0s
   ✓ http_req_duration..............: avg=1.25s   min=0s       med=1.54ms  max=1m0s     p(90)=7.31ms  p(95)=99.49ms
       { expected_response:true }...: avg=8ms     min=320.42µs med=1.52ms  max=54.87s   p(90)=5.22ms  p(95)=7.86ms
   ✗ http_req_failed................: 20.35% ✓ 45003  ✗ 176055
     http_req_receiving.............: avg=48.46µs min=0s       med=22.25µs max=103.96ms p(90)=54.67µs p(95)=66.8µs
     http_req_sending...............: avg=30.5µs  min=0s       med=26.5µs  max=17.16ms  p(90)=48.81µs p(95)=57.68µs
     http_req_tls_handshaking.......: avg=12.72ms min=0s       med=0s      max=21.67s   p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=1.25s   min=0s       med=1.47ms  max=1m0s     p(90)=7.18ms  p(95)=99.4ms
     http_reqs......................: 221058 669.854297/s
     iteration_duration.............: avg=1m11s   min=27.92ms  med=50.11s  max=5m26s    p(90)=2m55s   p(95)=3m52s
     iterations.....................: 5129   15.541997/s
     vus............................: 1606   min=1606 max=2000
     vus_max........................: 2000   min=2000 max=2000
</code></pre></div></div>
<p>This tells us that 221K requests were performed in 5m30s at an average rate of 670/s, of which 20% failed, probably due to servers hitting capacity limits. The average request duration was over 1s, but 95% of requests were served within 100ms.</p>
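<p>Those headline figures follow directly from the raw counters; a quick sanity check (all numbers taken from the k6 summary above):</p>

```javascript
// Recompute k6's derived rates from the raw counters in the summary above.
const totalReqs = 221058;
const testSeconds = 5 * 60 + 30; // running (5m30.0s)
const failed = 45003;
const succeeded = 176055;

const rps = totalReqs / testSeconds; // ~670/s, matching http_reqs
const failRate = failed / (failed + succeeded); // ~20%, matching http_req_failed

console.log(rps.toFixed(1) + "/s, " + (failRate * 100).toFixed(1) + "% failed");
```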
<h2 id="in-depth-statistics-and-visualisations">In-depth statistics and visualisations</h2>
<p>Having a short ASCII summary of a single test is cute, but that doesn’t tell us what we’re after. We need to know what happens to our machines, to our cluster, as we scale it up and… as it breaks. If it does, we need to know <em>how</em> it breaks, figure out <em>why</em>, remediate it and <em>confirm</em> that in fact we did.</p>
<p>In order to do that, we got <a href="https://k6.io/docs/results-output/real-time/influxdb-grafana/">k6 to write metrics to InfluxDB</a> and created a dashboard visualising the results in Grafana, both of which we had set up prior to this scale-out to investigate latency issues, as discussed in our <a href="https://blog.ipfs-search.com/challenge-accepted/">previous post</a>.</p>
<figure>
<img alt="Overview of our benchmarking dashboard." src="/assets/images/2023-04-18-searching-at-scale/Untitled%202.png" />
<figcaption>Overview of our benchmarking dashboard.</figcaption>
</figure>
<figure>
<img alt="This is what it looks like when we hit peak capacity." src="/assets/images/2023-04-18-searching-at-scale/Untitled%203.png" />
<figcaption>This is what it looks like when we hit peak capacity.</figcaption>
</figure>
<figure>
<img alt="It is often the maxing out of CPU on 1 or 2 servers which causes the entire cluster to take increasingly longer lunches." src="/assets/images/2023-04-18-searching-at-scale/Untitled%204.png" />
<figcaption>
It is often the maxing out of CPU on 1 or 2 servers which causes the entire cluster to take <a href="https://hitchhikers.fandom.com/wiki/Lig_Lury_Jr">increasingly longer lunches</a>.
</figcaption>
</figure>
<h1 id="not-the-scaling-we-expected">Not the scaling we expected</h1>
<p>As soon as we had the tests set up, we started plugging in servers. Over the past year we had been improving our <a href="https://github.com/ipfs-search/ipfs-search-deployment/">Ansible deployment stack</a> to fully automatically install, configure and set up <a href="https://www.hetzner.com/dedicated-rootserver/matrix-ax">Hetzner bare metal</a> boxes, so we could deploy any number of nodes in about 30 minutes.</p>
<figure>
<img src="/assets/images/2023-04-18-searching-at-scale/Untitled%205.png" />
<figcaption>Overview of all the (cold, with cleared frontend cache) benchmarks we've performed.</figcaption>
</figure>
<p>However, as we added nodes and thus capacity, we observed that not only did the number of requests per second fail to go up, but peak request durations actually skyrocketed!</p>
<p>Specifically, with 33 nodes we were peaking at around 700 RPS with a peak request duration of around 900ms. With 42 nodes we hit 750 RPS at about 1s. At 59 nodes we were again around 700 RPS, with request durations over 3s. Something was definitely wrong!</p>
<p>As you may notice from the screenshot, we tried any number of settings tweaks, upgraded OpenSearch, tweaked our API and even reinstalled our servers. One key observation kept returning: the same 5 or so nodes were handling about 10x the IOPS of the other nodes. It turned out that somehow the cluster had decided that these 5 nodes (despite, or perhaps due to, our myriad of shards) would handle a much greater share of the traffic, and they were causing a bottleneck in our cluster.</p>
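<p>One way to make such an imbalance visible is to tally shard placement per node. The sketch below parses the plain-text output of OpenSearch’s <code>GET _cat/shards</code> API, whose default columns end with the node name; the sample rows used to exercise it are made up, not from our cluster.</p>

```javascript
// Tally shards per node from `GET _cat/shards` plain-text output.
// Default columns are: index shard prirep state docs store ip node,
// so the node name is the last whitespace-separated field on each line.
// UNASSIGNED shards have no node and are skipped.
function shardsPerNode(catOutput) {
  const counts = {};
  for (const line of catOutput.trim().split("\n")) {
    const cols = line.trim().split(/\s+/);
    if (cols.includes("UNASSIGNED")) continue; // no node assigned
    const node = cols[cols.length - 1];
    counts[node] = (counts[node] || 0) + 1;
  }
  return counts;
}
```

<p>A heavily skewed tally here, combined with per-node IOPS monitoring, is exactly the kind of hotspot signature we were seeing.</p>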
<figure>
<img alt="IOPS in progress for all of our nodes. This is an indicator of the degree to which IO exhaustion is a bottleneck, particularly on NVMe-based setups (like ours). Note how most nodes are not even mentioned here, a few have ~10 IOPS in progress, and then there are a few with ~100 in progress." src="/assets/images/2023-04-18-searching-at-scale/Untitled%206.png" />
<figcaption>IOPS in progress for all of our nodes. This is an indicator of the degree to which IO exhaustion is a bottleneck, particularly on NVMe-based setups (like ours). Note how most nodes are not even mentioned here, few have ~10 IOPS in progress and then there’s a few with ~100 in progress.</figcaption>
</figure>
<p>By this time, we had been trying for well over a year to meet our 1,000 hits/s benchmark. We had fully expected to meet the mark simply by plugging in more servers. Yet, we were forced to acknowledge that a much deeper overhaul was necessary.</p>
<h2 id="we-get-by-with-a-little-help-from-our-friends">We get by with a little help from our friends</h2>
<p>By this point, we were despairing and decided to ask for help. Thus far, we had been assisted by <a href="https://dataforest.ai/">DataForest</a> with practical aspects of our deployment setup, like installing Grafana and migrating to OpenSearch. Although we had no budget left, they decided to help us and use our particular (and apparently rare) problem as a case study. They gave us extensive and concrete recommendations on how to further optimize our cluster. We owe them a great debt of gratitude and respect, especially considering that they operate from a country at war, Ukraine.</p>
<p>In addition, we opted for a free trial with <a href="https://opster.com/">Opster</a> and, despite our honesty about our limited budget, they volunteered to take an in-depth look into our issues. They, too, were quite surprised by our cluster’s odd behaviour, allocating so much load to just a few of the nodes. It might not entirely be an accident that they published an article ‘<a href="https://opster.com/guides/opensearch/opensearch-operations/opensearch-hotspots/">OpenSearch Hotspots – Load Balancing, Data Allocation and How to Avoid Hotspots</a>’ shortly after assisting us… Regardless, we can’t express enough gratitude for the amount of real-world knowledge about Elastic and OpenSearch they freely put out there.</p>
<p>Both of these parties gave us roughly similar recommendations, among which:</p>
<ul>
<li>Put similar data in the same index.</li>
<li>Only index fields which you’re using.</li>
</ul>
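<p>To give an idea of what the second recommendation looks like in practice, here is a minimal mapping sketch (the field names are merely examples from our index, not our exact configuration): <code>"dynamic": false</code> stops new fields from being indexed automatically, while fields that are stored but never searched or aggregated on can be kept with <code>"index": false</code> and <code>"doc_values": false</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "dynamic": false,
  "properties": {
    "cid": { "type": "keyword" },
    "content": { "type": "text" },
    "ipfs_tika_version": {
      "type": "keyword",
      "index": false,
      "doc_values": false
    }
  }
}
</code></pre></div></div>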
<p>Their recommendations did not point to a single root cause; it seems our problem did not, in fact, have a single clear solution. A basic fact about <a href="https://en.wikipedia.org/wiki/Complex_system">complex systems</a>, confirmed.</p>
<h1 id="rethinking-our-index">Rethinking our index</h1>
<p>It became clear that we had to dig deeper. Despite our despair, we could not give up, not with the amount of time already invested. A challenge is not a great challenge if one can be certain of making it!</p>
<h2 id="splitting-our-index">Splitting our index</h2>
<p>Unsure what caused our problems, we took a wild gamble and decided to do what we had been postponing for years: splitting our index! This is like open brain surgery for search engines: it literally affects every single part of our stack.</p>
<p>We were certain that having 4 huge indexes (files, directories, invalids and partials) was not an optimal solution, and we were sure splitting was going to give performance improvements of <em>some</em> kind. But it is truly challenging to re-index close to 800 million documents. Just a single typo, and you have to do it all over again. Just a small coding mistake, and you’re losing data. Just one of 73 servers crashing, and you get to start again.</p>
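<p>The migration itself can be driven by OpenSearch’s <code>_reindex</code> API, issued as <code>POST _reindex?wait_for_completion=false</code> so it runs as a background task. A sketch of the kind of request involved, one per target index (the index names and mime-type filter here are illustrative, not our exact configuration):</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "source": {
    "index": "ipfs_files",
    "query": {
      "terms": {
        "metadata.Content-Type": ["application/pdf", "application/epub+zip"]
      }
    }
  },
  "dest": {
    "index": "ipfs_documents"
  }
}
</code></pre></div></div>

<p>Any mistake in the query or the destination mapping means running the whole thing again; hence our caution.</p>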
<h3 id="categorising-documents">Categorising documents</h3>
<p>And not just that… how exactly were we going to group our documents? Documents, audio, images, videos, directories and ‘other’, like we have in our <a href="https://ipfs-search.com/">frontend</a>? But what on Earth is the definition of a ‘document’!?</p>
<figure>
<img alt="List of categories in our frontend." src="/assets/images/2023-04-18-searching-at-scale/Untitled%207.png" />
<figcaption>List of categories in our frontend.</figcaption>
</figure>
<p>In order to make informed decisions about this, we decided to query our dataset for statistics, based on our ‘working’ definition of content types from the frontend. How many items of each category did we have? What sort of fields were present for various types and categories?</p>
<h3 id="field-statistics">Field statistics</h3>
<p>Hence, we produced <a href="https://github.com/ipfs-search/ipfs-search/tree/mapping_v10/docs/indices/fields">extensive statistics</a> using <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html">scripted metrics</a>, as that’s the only way to gather statistics on unindexed fields:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Init</span>
<span class="n">state</span><span class="o">.</span><span class="na">fields</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">HashMap</span><span class="o">();</span>
<span class="c1">// Map</span>
<span class="kt">void</span> <span class="nf">iterateHashMap</span><span class="o">(</span><span class="nc">String</span> <span class="n">prefix</span><span class="o">,</span> <span class="nc">HashMap</span> <span class="n">input</span><span class="o">,</span> <span class="nc">HashMap</span> <span class="n">output</span><span class="o">)</span> <span class="o">{</span>
<span class="n">input</span><span class="o">.</span><span class="na">forEach</span><span class="o">((</span><span class="n">key</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span> <span class="o">-></span> <span class="o">{</span>
<span class="nc">String</span> <span class="n">fieldName</span> <span class="o">=</span> <span class="n">prefix</span> <span class="o">+</span> <span class="n">key</span><span class="o">;</span>
<span class="k">if</span> <span class="o">(</span><span class="n">value</span> <span class="k">instanceof</span> <span class="nc">Map</span><span class="o">)</span> <span class="o">{</span>
<span class="n">iterateHashMap</span><span class="o">(</span><span class="n">fieldName</span> <span class="o">+</span> <span class="sc">'.'</span><span class="o">,</span> <span class="n">value</span><span class="o">,</span> <span class="n">output</span><span class="o">);</span>
<span class="k">return</span><span class="o">;</span>
<span class="o">}</span>
<span class="k">if</span> <span class="o">(</span><span class="n">output</span><span class="o">.</span><span class="na">containsKey</span><span class="o">(</span><span class="n">fieldName</span><span class="o">))</span> <span class="o">{</span>
<span class="n">output</span><span class="o">[</span><span class="n">fieldName</span><span class="o">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="o">;</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="n">output</span><span class="o">[</span><span class="n">fieldName</span><span class="o">]</span> <span class="o">=</span> <span class="mi">1</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">});</span>
<span class="o">}</span>
<span class="n">iterateHashMap</span><span class="o">(</span><span class="err">''</span><span class="o">,</span> <span class="n">params</span><span class="o">[</span><span class="err">'</span><span class="n">_source</span><span class="err">'</span><span class="o">],</span> <span class="n">state</span><span class="o">.</span><span class="na">fields</span><span class="o">);</span>
<span class="c1">// Combine</span>
<span class="n">state</span><span class="o">.</span><span class="na">fields</span>
<span class="c1">// Reduce</span>
<span class="nc">HashMap</span> <span class="n">output</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">HashMap</span><span class="o">();</span>
<span class="n">states</span><span class="o">.</span><span class="na">forEach</span><span class="o">(</span><span class="n">field</span> <span class="o">-></span> <span class="o">{</span>
<span class="n">field</span><span class="o">.</span><span class="na">forEach</span><span class="o">((</span><span class="n">fieldName</span><span class="o">,</span> <span class="n">count</span><span class="o">)</span> <span class="o">-></span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">output</span><span class="o">.</span><span class="na">containsKey</span><span class="o">(</span><span class="n">fieldName</span><span class="o">))</span> <span class="o">{</span>
<span class="n">output</span><span class="o">[</span><span class="n">fieldName</span><span class="o">]</span> <span class="o">+=</span> <span class="n">count</span><span class="o">;</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="n">output</span><span class="o">[</span><span class="n">fieldName</span><span class="o">]</span> <span class="o">=</span> <span class="n">count</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">})</span>
<span class="o">});</span>
<span class="k">return</span> <span class="n">output</span><span class="o">;</span>
</code></pre></div></div>
<p>This translates into the following OpenSearch DSL query to get a list of field names with occurrence counts:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"match_all"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"size"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
</span><span class="nl">"aggs"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"aggs"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"scripted_metric"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"init_script"</span><span class="p">:</span><span class="w"> </span><span class="s2">"state.fields = new HashMap();"</span><span class="p">,</span><span class="w">
</span><span class="nl">"map_script"</span><span class="p">:</span><span class="w"> </span><span class="s2">"void iterateHashMap(String prefix, HashMap input, HashMap output) { for (entry in input.entrySet()) { String fieldName = prefix + entry.getKey(); if (entry.getValue() instanceof Map) { iterateHashMap(fieldName + '.', entry.getValue(), output); } else { if (output.containsKey(fieldName)) { output[fieldName] += 1; } else { output[fieldName] = 1; } } }}iterateHashMap('', params['_source'], state.fields);"</span><span class="p">,</span><span class="w">
</span><span class="nl">"combine_script"</span><span class="p">:</span><span class="w"> </span><span class="s2">"state.fields"</span><span class="p">,</span><span class="w">
</span><span class="nl">"reduce_script"</span><span class="p">:</span><span class="w"> </span><span class="s2">"HashMap output = new HashMap();for (fields in states) { for (field in fields.entrySet()) { String fieldName = field.getKey(); Integer count = field.getValue(); if (output.containsKey(fieldName)) { output[fieldName] += count; } else { output[fieldName] = count; } }}return output;"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The result was a staggering amount of information, which we proceeded to sort out. We categorised each and every field: should it be copied, removed as a duplicate, or simply not indexed at all?</p>
<figure>
<img alt="Field statistics per type." src="/assets/images/2023-04-18-searching-at-scale/Untitled%208.png" />
<figcaption>Field statistics per type. <a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/sharding/content-types.pdf">Full dataset</a>.</figcaption>
</figure>
<figure>
<img alt="Document count per data type." src="/assets/images/2023-04-18-searching-at-scale/Untitled%209.png" />
<figcaption>Document count per data type. <a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/sharding/content-types.pdf">Full dataset</a>.</figcaption>
</figure>
<figure>
<img alt="Mime types in our index." src="/assets/images/2023-04-18-searching-at-scale/Untitled%2010.png" />
<figcaption>Mime types in our index. <a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/sharding/content-types.pdf">Full dataset</a>.</figcaption>
</figure>
<h3 id="mapping-all-the-things">Mapping All the Things</h3>
<p>With painstaking work and difficult decisions, we finally managed to arrive at a suitable mapping <a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/sharding/content-types.pdf">from mime types to indexes</a> as well as <a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/sharding/field_stats.pdf">which fields to index and how</a>. Specifically, for some fields we chose to copy the data to another field but retain the source document intact. For others, we simply eliminated the source field (see ‘Deduplicating fields’ below).</p>
<p>This resulted in monsters of mappings, such as the following (for documents):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"dynamic"</span><span class="p">:</span><span class="w"> </span><span class="s2">"strict"</span><span class="p">,</span><span class="w">
</span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"cid"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
</span><span class="nl">"term_vector"</span><span class="p">:</span><span class="w"> </span><span class="s2">"with_positions_offsets"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"content:character-count"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"first-seen"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"strict_date_time"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"last-seen"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"strict_date_time"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"ipfs_tika_version"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"language"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"confidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"language"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"rawScore"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"double"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"references"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"hash"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"parent_hash"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"size"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"long"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"urls"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"object"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"metadata"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dynamic"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"mime:type"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"mime:subtype"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"X-TIKA:Parsed-By"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:title"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:creator"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:contributor"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:creator"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"meta:last-author"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:creator"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"article:author"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:creator"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:identifier"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"xmpMM:DocumentID"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"xmpMM:DerivedFrom:DocumentID"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"xmpMM:DerivedFrom:InstanceID"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"Content Identifier"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:language"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:description"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:subject"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:description"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"meta:keyword"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:description"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dc:publisher"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dcterms:created"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date_optional_time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"ignore_malformed"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dcterms:modified"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date_optional_time"</span><span class="p">,</span><span class="w">
</span><span class="nl">"ignore_malformed"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"w:comments"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"content"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"xmpTPg:NPages"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"short"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"og:site_name"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"og_type"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"doi"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"pdf:docinfo:custom:IEEE Publication ID"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"pdf:docinfo:custom:IEEE Issue ID"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"pdf:docinfo:custom:IEEE Article ID"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"WPS-JOURNALDOI"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"doc_values"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"copy_to"</span><span class="p">:</span><span class="w"> </span><span class="s2">"metadata.dc:identifier"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The final mappings for other types will soon be published in our <a href="https://ipfs-search.readthedocs.io/en/latest/indices/README.html">docs</a>. Until then, you may have a look at our <a href="https://github.com/ipfs-search/ipfs-search/tree/mapping_v10/docs/indices/v11%20WIP">WIP branch</a>.</p>
<p>In the end, we indexed the following content types:</p>
<ul>
<li>(Compressed) archives: ZIP, tarballs, etc.</li>
<li>(Textual) Documents: Text, HTML, Word, PDF, PowerPoint, etc.</li>
<li>Images</li>
<li>Videos</li>
<li>Audio</li>
<li>Directories</li>
<li>Data: JSON, binary blobs, etc.</li>
<li>Unknown: files with no extracted metadata whatsoever</li>
<li>Other: Anything not in any of the aforementioned categories</li>
</ul>
<h2 id="data-cleanup">Data cleanup</h2>
<h3 id="hashing-out-12-billion-links">Hashing out 12 billion links!</h3>
<p>During earlier development on the crawler, the suspicion arose that some of the documents in our index were ‘slightly’ larger than others. While implementing Redis caching for our indexer, we discovered that some documents had thousands of links pointing to them, whereas most have just one or a few.</p>
<p>It wasn’t until we wrote our <a href="https://github.com/ipfs-search/ipfs-search-linksplitter/tree/main">linksplitter</a> that we fully realized how bad the problem was! In hindsight, it makes perfect sense: millions of Wikipedia articles with regular updates create new references to the same documents all the time. Without knowing it, we had been searching through a whopping 12,193,745,087 links! No wonder our search was slow!</p>
<p>At this point we merely split out the links to optimise search performance, discarding all but the last 8 references during reindexing. But like you, we can’t wait to bring this dataset to the world, preferably loading it into and serving it from an actual graph database.</p>
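The trimming step above can be sketched in a few lines. This is a hedged illustration of the idea only, not the actual linksplitter (which is a separate service); the `references` field name and document shape are assumptions for the example.

```python
MAX_REFERENCES = 8  # the cutoff mentioned above

def trim_references(doc: dict, limit: int = MAX_REFERENCES) -> dict:
    """Return a copy of the document keeping at most `limit` references,
    preferring the last (most recently seen) ones."""
    refs = doc.get("references", [])
    if len(refs) > limit:
        # Build a new dict so the original document is left untouched.
        doc = {**doc, "references": refs[-limit:]}
    return doc

# A document with 1000 inbound links gets trimmed down to the last 8.
doc = {"cid": "bafy-example", "references": [{"parent": f"Qm{i}"} for i in range(1000)]}
trimmed = trim_references(doc)
```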
<p><strong><em>We have a graph of links on IPFS going back to 2019!</em></strong></p>
<figure>
<img alt="I couldn’t help but explore a little what (a tiny fraction of) IPFS’ content graph looks like." src="/assets/images/2023-04-18-searching-at-scale/Untitled%2011.png" />
<figcaption>I couldn’t help but explore a little what (a tiny fraction of) IPFS’ content graph looks like.</figcaption>
</figure>
<h3 id="deduplicating-fields">Deduplicating fields</h3>
<p>Many fields were duplicated due to a lack of clear metadata standards in our metadata extractor, which is based on <a href="https://tika.apache.org/">Apache Tika</a>. Fixing this meant carefully looking at our data, scrutinizing Apache Tika’s <a href="https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0">developer documentation</a> on metadata keys, and writing an intimidating ‘Painless’ script (Elastic/OpenSearch’s <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-painless.html">built-in scripting language</a>) to <a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/painless/harmonize_values.painless">harmonize fields</a>. In the process, we also created a shell-script <a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/painless/upload_painless.sh">‘Painless’ uploader</a> to make uploading ‘Painless’ scripts less … painful.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">harmonizeField</span><span class="o">(</span><span class="nc">HashMap</span> <span class="n">ctx</span><span class="o">,</span> <span class="nc">String</span> <span class="n">srcFieldName</span><span class="o">,</span> <span class="nc">String</span> <span class="n">dstFieldName</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">ctx</span><span class="o">.</span><span class="na">containsKey</span><span class="o">(</span><span class="n">srcFieldName</span><span class="o">))</span> <span class="o">{</span>
<span class="nc">ArrayList</span> <span class="n">srcValues</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">[</span><span class="n">srcFieldName</span><span class="o">];</span>
<span class="k">if</span> <span class="o">(</span><span class="n">ctx</span><span class="o">.</span><span class="na">containsKey</span><span class="o">(</span><span class="n">dstFieldName</span><span class="o">))</span> <span class="o">{</span>
<span class="nc">ArrayList</span> <span class="n">dstValues</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">[</span><span class="n">dstFieldName</span><span class="o">];</span>
<span class="k">if</span> <span class="o">(</span><span class="n">srcValues</span> <span class="o">==</span> <span class="n">dstValues</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// src and dst values are equal, remove src</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="n">srcFieldName</span><span class="o">);</span>
<span class="o">}</span>
<span class="k">return</span><span class="o">;</span>
<span class="o">}</span>
<span class="n">ctx</span><span class="o">[</span><span class="n">dstFieldName</span><span class="o">]</span> <span class="o">=</span> <span class="n">srcValues</span><span class="o">;</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="n">srcFieldName</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="nc">String</span> <span class="n">nestedKey</span> <span class="o">=</span> <span class="err">'</span><span class="n">metadata</span><span class="err">'</span><span class="o">;</span>
<span class="nc">Map</span> <span class="n">remapFields</span> <span class="o">=</span> <span class="o">[</span>
<span class="o">...</span>
<span class="err">'</span><span class="nl">w:</span><span class="n">comments</span><span class="err">'</span><span class="o">:</span> <span class="err">'</span><span class="nl">w:</span><span class="nc">Comments</span><span class="err">'</span><span class="o">,</span>
<span class="err">'</span><span class="n">comment</span><span class="err">'</span><span class="o">:</span> <span class="err">'</span><span class="nl">w:</span><span class="nc">Comments</span><span class="err">'</span><span class="o">,</span>
<span class="err">'</span><span class="nc">Comments</span><span class="err">'</span><span class="o">:</span> <span class="err">'</span><span class="nl">w:</span><span class="nc">Comments</span><span class="err">'</span><span class="o">,</span>
<span class="err">'</span><span class="no">JPEG</span> <span class="nc">Comment</span><span class="err">'</span><span class="o">:</span> <span class="err">'</span><span class="nl">w:</span><span class="nc">Comments</span><span class="err">'</span><span class="o">,</span>
<span class="err">'</span><span class="nc">Exif</span> <span class="nl">SubIFD:</span><span class="nc">User</span> <span class="nc">Comment</span><span class="err">'</span><span class="o">:</span> <span class="err">'</span><span class="nl">w:</span><span class="nc">Comments</span><span class="err">'</span><span class="o">,</span>
<span class="err">'</span><span class="nc">User</span> <span class="nc">Comment</span><span class="err">'</span><span class="o">:</span> <span class="err">'</span><span class="nl">w:</span><span class="nc">Comments</span><span class="err">'</span><span class="o">,</span>
<span class="o">...</span>
<span class="o">];</span>
<span class="k">if</span> <span class="o">(!</span><span class="n">ctx</span><span class="o">.</span><span class="na">containsKey</span><span class="o">(</span><span class="n">nestedKey</span><span class="o">))</span> <span class="k">return</span><span class="o">;</span>
<span class="nc">HashMap</span> <span class="n">nestedCtx</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">[</span><span class="n">nestedKey</span><span class="o">];</span>
<span class="k">if</span> <span class="o">(</span><span class="n">nestedCtx</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="k">return</span><span class="o">;</span>
<span class="k">for</span> <span class="o">(</span><span class="n">entry</span> <span class="n">in</span> <span class="n">remapFields</span><span class="o">.</span><span class="na">entrySet</span><span class="o">())</span> <span class="o">{</span>
<span class="nc">String</span> <span class="n">srcFieldName</span> <span class="o">=</span> <span class="n">entry</span><span class="o">.</span><span class="na">getKey</span><span class="o">();</span>
<span class="nc">String</span> <span class="n">dstFieldName</span> <span class="o">=</span> <span class="n">entry</span><span class="o">.</span><span class="na">getValue</span><span class="o">();</span>
<span class="n">harmonizeField</span><span class="o">(</span><span class="n">nestedCtx</span><span class="o">,</span> <span class="n">srcFieldName</span><span class="o">,</span> <span class="n">dstFieldName</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<h3 id="re-hashing-document-ids">Re-hashing document IDs</h3>
<p>During critical reflection along the lines of “why are these 5 darn servers taking all our load!??? 🤯”, we figured that one potential cause could be an unequal distribution of documents among shards, due to our use of <a href="https://docs.ipfs.tech/concepts/content-addressing/">IPFS/IPLD CID</a>s as document identifiers. You see, all CIDs start with one of just a few possible byte prefixes.</p>
<p>If the underlying index doesn’t re-hash them, many documents could end up clustered together in sub-optimal ways and/or the shard distribution could get badly skewed. As we weren’t able to find conclusive evidence in OpenSearch’s code as to whether custom (vs. generated) document IDs are hashed, we decided to re-hash them ourselves using SHA1.</p>
<p>As a bonus, this allowed us to do one other thing we’d been wanting to do: adding a protocol identifier to our documents, paving the way for future support of other content-addressed protocols (e.g. <code class="language-plaintext highlighter-rouge">ipfs://bafy...</code> instead of <code class="language-plaintext highlighter-rouge">bafy...</code>).</p>
<p>Luckily, someone published a <a href="https://github.com/sektorcap/sha-painless/blob/master/sha1.painless">SHA1 implementation for Painless</a>, which we gratefully made use of!</p>
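The idea itself is simple; a minimal sketch (in Python rather than Painless, and with a hypothetical CID) of deriving an evenly distributed document ID from a protocol-prefixed CID:

```python
import hashlib

def document_id(cid: str, protocol: str = "ipfs") -> str:
    """Derive a document ID by SHA1-hashing the protocol-prefixed CID.

    SHA1 output is uniformly distributed, so IDs no longer share the
    common byte prefixes that plain CIDs have.
    """
    return hashlib.sha1(f"{protocol}://{cid}".encode("utf-8")).hexdigest()

# Hypothetical CID, for illustration only.
doc_id = document_id("bafybeiexamplecid")
```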
<h3 id="other-painless-stuff">Other ‘Painless’ stuff</h3>
<p>While we were undertaking the huge effort of Reindexing All the Things, we decided to also implement some other ‘nice to haves’:</p>
<ul>
<li><a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/painless/crop_content.painless">Cropping body content to 1 MB</a> (some were as large as 10 MB!).</li>
<li><a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/painless/split_mime.painless">Splitting mime types</a> into their constituent type, subtype and parameters.</li>
<li><a href="https://github.com/ipfs-search/ipfs-search/blob/mapping_v10/docs/indices/painless/content_size.painless">Adding character counts (size)</a> for body content.</li>
</ul>
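For the MIME-type splitting in particular, the transformation looks roughly like this. A hedged Python sketch of what the linked Painless script does, not the script itself:

```python
def split_mime(content_type: str) -> dict:
    """Split a MIME string like 'text/html; charset=UTF-8' into its
    constituent type, subtype, and parameters."""
    head, *raw_params = content_type.split(";")
    type_, _, subtype = head.strip().partition("/")
    parameters = {}
    for param in raw_params:
        key, _, value = param.strip().partition("=")
        if key:
            parameters[key] = value
    return {"type": type_, "subtype": subtype, "parameters": parameters}
```

Storing `type` and `subtype` as separate keyword fields lets queries filter on e.g. all `text` documents without resorting to the wildcard matches shown below.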
<h2 id="re-index-from-hell">Re-index From Hell</h2>
<p>With the mapping and all our scripts snugly fitted into an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipeline</a>, we were ready to start re-indexing, writing horrendous queries such as this one to segment our documents by MIME type along the way:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"source"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ipfs_files_v9"</span><span class="p">,</span><span class="w">
</span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"bool"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"range"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"first-seen"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"gte"</span><span class="p">:</span><span class="w"> </span><span class="mi">2023</span><span class="p">,</span><span class="w">
</span><span class="nl">"lt"</span><span class="p">:</span><span class="w"> </span><span class="mi">2024</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"yyyy"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"should"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/x-web-markdown*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/x-rst*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/x-log*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/x-asciidoc*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/troff*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/plain*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/html*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"message/rfc822*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"message/news*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"image/vnd.djvu*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/xhtml+xml*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/x-tika-ooxml*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/x-tika-msoffice*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/x-tex*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/x-mobipocket-ebook*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/x-fictionbook+xml*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/x-dvi*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.sun.xml.writer.global*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.openxmlformats-officedocument.wordprocessingml.document*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.openxmlformats-officedocument.presentationml.presentation*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.oasis.opendocument.text*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.ms-powerpoint*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.ms-htmlhelp*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.ms-excel*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.sun.xml.draw*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/rtf*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/postscript*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/pdf*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/msword5*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/msword2*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/msword*"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"wildcard"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"metadata.Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/epub+zip*"</span><span class="w"> </span><span class="p">}}</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"minimum_should_match"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"dest"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="nl">"pipeline"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ipfs_files_cleanup_v11"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<h3 id="year-by-year">Year by year</h3>
<p>This in and of itself was already a tedious process, but then we started having stability issues and operations started to crash unpredictably. With over 300 million documents to re-index (the rest being invalids and partials, not requiring re-indexing), we couldn’t risk losing all of our progress. So, as you can see, we started indexing documents by year.</p>
<figure>
<img src="/assets/images/2023-04-18-searching-at-scale/Untitled%2012.png" />
</figure>
<h3 id="10-documents-at-a-time">10 documents at a time</h3>
<p>As you might have gathered from some of the subtle hints above, <em>some</em> of the documents in our index were truly humongous. Some have well over 10 MB in links/references, some have up to 10 MB in body content (full-text indexed!). Knowing that OpenSearch’s bulk indexing buffers are 100 MB, this turned out to be quite problematic.</p>
<p>While the point of our re-index was exactly to get rid of these huge documents, doing so meant processing them first, which, without jumping through even more horrible hoops, meant indexing nearly a billion documents in batches of 10 so as not to overflow OpenSearch’s buffer.</p>
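To make those tiny batches concrete, here is a sketch of a small-batch reindex request body; the index names are made up for the example, though the cleanup pipeline name is the one from the snippet above:

```typescript
// Sketch of a small-batch reindex request. Setting "size" in "source"
// limits the number of documents per scroll batch, so a run of huge
// documents can't overflow the 100 MB bulk indexing buffer.
// Index names here are illustrative.
interface ReindexRequest {
  source: { index: string; size: number };
  dest: { index: string; pipeline?: string };
}

function smallBatchReindex(from: string, to: string, batch = 10): ReindexRequest {
  return {
    source: { index: from, size: batch },
    dest: { index: to, pipeline: "ipfs_files_cleanup_v11" },
  };
}

const req = smallBatchReindex("ipfs_files_2019", "ipfs_files_2019_v11");
```

With batches this small, a billion documents means on the order of a hundred million scroll round-trips, which is why the whole process was so painfully slow.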
<h3 id="out-of-file-descriptors">Out of file descriptors!?</h3>
<p>And then, during the process we increasingly experienced random nodes disappearing. By now, we were used to quite a bit of 💩 from OpenSearch. But it got to the point where we were literally unable to complete simple indexing operations without a node disappearing, killing the re-index <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html">scroll</a> in the process.</p>
<p>We got these weird and uncommon exceptions from OpenSearch, telling us there were not enough <a href="https://www.baeldung.com/linux/limit-file-descriptors">file descriptors</a>. What the!?? So, we dug deeper…</p>
<p>It turned out it was not OpenSearch. Not <a href="https://github.com/ipfs/kubo">Kubo</a>. Not our crawler but… <a href="https://www.influxdata.com/time-series-platform/telegraf/">Telegraf</a>, Influx’ metrics-collecting daemon, that was eating our file descriptors. And not just in any way: it was doing so tediously slowly, adding 1 FD per second, creating a problem so slow to emerge that it took months to manifest.</p>
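For the curious: a leak like this shows up when you sample a process’s open descriptors over time. A minimal Node sketch (Linux-only, since it reads <code>/proc</code>; watching a daemon like Telegraf would mean substituting its PID):

```typescript
import { readdirSync } from "node:fs";

// On Linux, /proc/<pid>/fd contains one entry per open file descriptor.
// Sampling this count over time reveals a slow leak such as Telegraf's
// +1 FD per second. "self" reads our own process; substitute a PID to
// watch another daemon.
function openFdCount(pid: number | "self" = "self"): number {
  return readdirSync(`/proc/${pid}/fd`).length;
}

console.log(`open file descriptors: ${openFdCount()}`);
```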
<figure>
<img src="/assets/images/2023-04-18-searching-at-scale/Untitled%2013.png" />
</figure>
<p>Once Telegraf was identified as the culprit, though, it was easy enough to find the <a href="https://github.com/ipfs-search/ipfs-search-deployment/commit/3cf83e30914e700d0e946922c04126f8f0fbd1e5#diff-6baffab2b19ba2a37c98f89821efdd34eff3117990b639d749e98fc2d18a8144R5279">malignant code</a>: a plugin logging ethernet statistics, added to diagnose the scaling behaviour discussed earlier in this post (in order to exclude ethernet ring buffer overflows).</p>
<h3 id="open-source-is-awesome">Open Source is Awesome</h3>
<p>Being good FOSS citizens, we created an <a href="https://github.com/influxdata/telegraf/issues/12813">issue</a> on Influx’ Telegraf repo, which was <em>reviewed that same day</em>. The next day they already had a PR ready, complete with an artifact allowing us to verify that the issue was indeed resolved. Within 6 days, Influx released an updated version.</p>
<figure>
<img src="/assets/images/2023-04-18-searching-at-scale/Untitled%2014.png" />
</figure>
<p>This is incredible! We ❤️ 💚 💙 💖 Open Source! And… great work <a href="https://www.influxdata.com/">InfluxDB</a>, you got good things going on! 👀</p>
<p>As soon as this was resolved, our cluster was rock solid again and we <em>finally</em> managed to Reindex All the Things. Ready for testing!</p>
<h2 id="re-sharding-all-the-shards">Re-sharding All the Shards</h2>
<p>Except, not really. When you’re building an index in OpenSearch/Elasticsearch, you kind of have to guesstimate the number of shards. The general recommendation is that a single shard should be between 10 and 50 GB in size, ideally 20-30 GB. Yet, there’s no reliable way to know the size of an index ahead of … indexing.</p>
<p>Of course we estimated the size of shards, using each index’s fraction of total documents times the total size (~15 TB) of our files index. But as we discovered, documents vary wildly in size and fields, and some of our estimates were way off.</p>
<p>Eventually, we had to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-shrink-index.html">shrink</a> (merge shards) some of our indexes and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html">split</a> shards on some others, until finally we brought all of them within the desirable range. Mind you, our cluster only handles these kinds of processes well for one index at a time, and they can take up to 24h to complete.</p>
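To make the guesstimate concrete: aiming at the middle of that range, the desired shard count is just the index size divided by a target shard size. A sketch (the 25 GB target is our reading of the recommendation, not an official constant):

```typescript
// Aim for ~25 GB per shard, the middle of the recommended 10-50 GB range.
// Note that shrink requires the new shard count to be a factor of the
// original, and split a multiple of it, so the result may need rounding
// to a count the API actually accepts.
const GB = 1024 ** 3;
const TARGET_SHARD_BYTES = 25 * GB;

function desiredShards(indexBytes: number): number {
  return Math.max(1, Math.round(indexBytes / TARGET_SHARD_BYTES));
}
```

For example, a 250 GB index would want 10 shards, while a tiny 1 GB index needs just 1.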
<p>After which we set up replication and waited another half a day for the cluster to balance. And then, only then, were we ready to actually use our indexes.</p>
<h1 id="rewriting-our-api-server">Rewriting our API server</h1>
<p>But wait… we just went from searching 2 indexes (files and directories) to searching 9 of them! However we approach this, it means a profound change to the way our queries function. How are we going to integrate that into our <a href="https://github.com/ipfs-search/ipfs-search-api/tree/master/server">vanilla JS API server</a>, most of which has not been touched in a year? Particularly, how are we going to make sure that we’re not missing out on relevant search results just because we made a silly typo?</p>
<figure>
<img alt="We can only abuse a memo so many times without giving some credit…" src="/assets/images/2023-04-18-searching-at-scale/Untitled%2015.png" />
<figcaption>We can only abuse a memo so many times without giving some credit…</figcaption>
</figure>
<h2 id="typing-all-the-things">Typing All the Things!</h2>
<p>Our solution was to Rewrite it in <del>Rust</del> TypeScript. Simply put, we have a lot of literals and a lot of code to rewrite/migrate; our API server really hasn’t gotten the love it deserves, pending a full rewrite like this. Type inference allows us to reason abstractly over types, such that if our code isn’t right, it simply won’t compile.</p>
<figure>
<img src="/assets/images/2023-04-18-searching-at-scale/Untitled%2016.png" />
</figure>
<p>For example, in the API server we’ve created types for:</p>
<ul>
<li><a href="https://github.com/ipfs-search/ipfs-search-api/blob/rewrite/packages/server/src/search/documentfields.ts">Document fields</a>, with field name literals.</li>
<li><a href="https://github.com/ipfs-search/ipfs-search-api/blob/rewrite/packages/server/src/search/source.ts">Document’s source</a>, as a subset of document fields.</li>
<li><a href="https://github.com/ipfs-search/ipfs-search-api/blob/rewrite/packages/server/src/search/queryfields.ts">Query fields</a> (including boosts and highlights), again restrained by document fields.</li>
</ul>
<p>Thanks to this approach, it becomes literally impossible to refer to non-existing fields or missing data because of a typo in a field name (except, of course, where the literals are defined). We already caught several bugs in our older API code that we were not previously aware of.</p>
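A minimal sketch of the idea (the field names are illustrative; the real definitions live in the linked files):

```typescript
// Field names as string literal types: the single place where literals
// are defined, from which everything else is derived.
const documentFields = ["cid", "size", "metadata.Content-Type"] as const;
type DocumentField = (typeof documentFields)[number];

// Query fields are constrained to known document fields, so a typo such
// as "metdata.Content-Type" is a compile error, not a silent empty result.
interface QueryField {
  field: DocumentField;
  boost?: number;
}

const contentType: QueryField = { field: "metadata.Content-Type", boost: 2 };
```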
<p>We also created a new common <a href="https://github.com/ipfs-search/ipfs-search-api/tree/rewrite/packages/types">types</a> library, in which we’ve implemented types for:</p>
<ul>
<li><a href="https://github.com/ipfs-search/ipfs-search-api/blob/rewrite/packages/types/src/searchquery.ts">Search queries</a>.</li>
<li><a href="https://github.com/ipfs-search/ipfs-search-api/blob/rewrite/packages/types/src/doctypes.ts">Document types</a>.</li>
<li><a href="https://github.com/ipfs-search/ipfs-search-api/blob/rewrite/packages/types/src/searchresult.ts">Search results</a>.</li>
</ul>
<p>The shared types between client and server allow for much stronger consistency in implementation. This will help us and you, power user that you are, to talk to our service in predictable and reliable ways.</p>
<h2 id="searching-for-subtypes-through-our-new-indexes">Searching for subtypes through our new indexes</h2>
<p>As a bonus of this great rewrite, users will soon have access to a <code class="language-plaintext highlighter-rouge">subtype</code> field in addition to <code class="language-plaintext highlighter-rouge">type</code> in queries and results, based on our newly generated indexes. This will have zero resource impact for us (rather the opposite) and will allow you to query directly for:</p>
<ul>
<li>Archives.</li>
<li>Audio.</li>
<li>Data.</li>
<li>Documents.</li>
<li>Images.</li>
<li>Videos.</li>
<li>Unknowns, and</li>
<li>the illustrious ‘Other’.</li>
</ul>
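In TypeScript terms, the list above boils down to a small union type. A hypothetical query shape might look like this (the exact parameter names are our sketch and may differ once the rewrite is merged):

```typescript
// Hypothetical sketch of a query using the new subtype field; the
// final API's parameter names may differ.
type Subtype =
  | "archive" | "audio" | "data" | "document"
  | "image" | "video" | "unknown" | "other";

interface SearchQuery {
  query: string;
  type?: "file" | "directory";
  subtype?: Subtype;
}

const q: SearchQuery = { query: "gangnam", type: "file", subtype: "video" };
```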
<h2 id="monorepos-for-js-hipsters-️">Monorepos for JS hipsters 〰️</h2>
<p>Like all the fashionable kids (and some of our <a href="https://research.google/pubs/pub45424/">Goliath competitors</a>) these days, we decided to rock a proper monorepo, with our client, our server and types as separate packages bundled snugly together. To orchestrate it all, we opted for <a href="https://lerna.js.org/">Lerna</a>, the now-not-so-hip-anymore wrapper around the increasingly hyped <a href="https://nx.dev/">Nx</a> build system.</p>
<figure>
<img alt="This guy’s using Lerna. He’s hip." src="/assets/images/2023-04-18-searching-at-scale/Untitled%2017.png" />
<figcaption>This guy’s using Lerna. He’s hip. (Shamelessly gleaned from <a href="http://www.slidedeck.io/Swiip/industrial-javascript">Matthieu Lux’s presentation</a>.)</figcaption>
</figure>
<p>This not only earned us street cred and swag around places where espresso is served so fashionably bitter it turns your cheeks concave, it <em>also</em> allows us to:</p>
<ul>
<li>Publish everything to NPM at once.</li>
<li>Keep versions in sync.</li>
<li>Perform end-to-end integration testing from client to server.</li>
<li>Rapidly iterate on the API, ensuring consistency without managing tons of repos.</li>
</ul>
<p>Isn’t JavaScript, I mean ECMAScript, I mean TypeScript, I mean Node, I mean NPM, native ESM, I mean ALL OF THIS JUNK TOOLING which gets replaced every 3 DOODLING MONTHS AMAZING? Hipster 💩, yes. We’re into it!</p>
<p>Now, we can do all the things Perl people were doing in the 90’s. Except, with <a href="https://prettier.io/">Prettier</a>, our code doesn’t look like <a href="https://en.wikipedia.org/wiki/Larry_Wall">Larry Wall</a> <a href="https://tumble.philadams.net/post/120728564/1987-larry-wall-falls-asleep-and-hits-larry">fell asleep on his keyboard</a> once.</p>
<p>Anyways, as with all our stuff, the <a href="https://github.com/ipfs-search/ipfs-search-api/tree/rewrite">Source is Out There</a>(tm) and soon, arguably, merged to main and published to NPM (which is not at all like <a href="https://www.cpan.org/">CPAN</a> and <em>definitely</em> not as well designed!).</p>
<figure>
<img alt="Larry Wall" src="https://upload.wikimedia.org/wikipedia/commons/b/b3/Larry_Wall_YAPC_2007.jpg" />
<figcaption><a href="https://en.wikipedia.org/wiki/Larry_Wall">Larry Wall</a> was the Original Hipster.</figcaption>
</figure>
<h2 id="ready-for-testing">Ready for testing!</h2>
<p>So now, without further ado, we are really ready for testing!</p>
<p>That <a href="https://www.goodreads.com/quotes/1398-i-love-deadlines-i-love-the-whooshing-noise-they-make">sound</a> you hear when the deadline’s passed by, yes, we heard it. A couple of times. Not to mention the sound of 💸 we could have made while we were making the first and only search engine for IPFS more awesome.</p>
<p>But, lo and behold…</p>
<h1 id="1300-hits-wow-uau-">1300 hits/s! Wow! Uau! 😮💥</h1>
<p>“Uau”, that’s what Portuguese people say. And as it happens, I live there. So that’s what I said.</p>
<p>Remember that graph we started with? Noticed the part where, with the same number of nodes, we suddenly jumped up? And where request duration plummeted? No, I am not making a reference to the disgraceful state of the climate or the economy. <em>There are good things happening in this world.</em></p>
<p>Our search engine got incredibly fast, for one thing. We hit well over 1300 hits per second, 30% more than we expected, with only 75% of the 100 nodes we estimated we’d need. That’s the equivalent of serving over 3000 users so fast they will not even know they were waiting.</p>
<p>Soon(tm), because although our goal has been achieved, QED and all, there is still a bit of cleanup to do!</p>
<div style="display: flex; flex-wrap: nowrap;">
<figure style="width: 30vb; text-align: center;">
<img alt="Requests per second shooting up like El Niño off the coast of Peru." src="/assets/images/2023-04-18-searching-at-scale/Untitled%2018.png" width="164" style="width: 164px;" />
<figcaption>Requests per second shooting up like <a href="https://mobile.twitter.com/LeonSimons8/status/1646180075669209091">El Niño off the coast of Peru.</a></figcaption>
</figure>
<figure style="width: 30vb; text-align: center;">
<img alt="Request durations dropping like the value of Bored Apes." src="/assets/images/2023-04-18-searching-at-scale/Untitled%2019.png" width="116" style="width: 116px;" />
<figcaption>Request durations dropping <a href="https://www.glossy.co/fashion/2022-was-the-year-of-the-nft-reality-check/">like the value of Bored Apes</a> after bored rich monkeys realized they paid for the proof of having paid for something.</figcaption>
</figure>
</div>
<h2 id="wrapping-things-up">Wrapping things up</h2>
<p>For one, we are not yet indexing new content until we’ve refactored our <a href="https://www.notion.so/ipfs-search-com-roadmap-23-3ddab684f8ba4f14a3777dda893e1ed0">crawler</a>. Only then can we add what’s been indexed since we ran this test. And only then can we throw away our old index, freeing up space and scaling our cluster back down to what we currently need. In the full awareness that…</p>
<h1 id="we-are-ready-for-it">We Are Ready For It!</h1>
<p>Bring it on! Users of the world, unite! Come, <a href="https://ipfs-search.com/">seek with us</a> the Interplanetary Filesystem and thou shalt <a href="https://ipfs-search.com/#/search/detail/video/QmPwRWz5mxvJDCk2d6MtXfZwYHPZqdDMoGHhkvsDN3o5Jh?q=gangnam&page=1&type=video&last_seen=Any">find</a>!</p>
<figure>
<a href="https://ipfs-search.com/#/search/detail/video/QmPwRWz5mxvJDCk2d6MtXfZwYHPZqdDMoGHhkvsDN3o5Jh?q=gangnam&page=1&type=video&last_seen=Any">
<img alt="We are ready!" src="/assets/images/2023-04-18-searching-at-scale/Untitled%2020.png" />
</a>
<figcaption>
<a href="https://ipfs-search.com/#/search/detail/video/QmPwRWz5mxvJDCk2d6MtXfZwYHPZqdDMoGHhkvsDN3o5Jh?q=gangnam&page=1&type=video&last_seen=Any">Yes!</a>
(Please, don’t tell me that it buffers… There’s <a href="https://fiatjaf.com/d5031e5b.html">NOTHING WRONG WITH THE DHT</a>! Eh!? Eh??) Anyways, <a href="https://n0.computer/blog/a-new-direction-for-iroh/">Iroh</a> is here to fix it all. 👋🙏
</figcaption>
</figure>Mathijs de BruinThis is the second post in a 2-post miniseries where we explain the challenges we faced scaling up, and how they were eventually overcome. In this second post we describe how we built a realistic benchmark. It details the problems which we faced scaling up to 73 nodes and how they were overcome by completely restructuring our indexes.1000 hits/s? Challenge accepted!2023-04-03T00:00:00-05:002023-04-03T00:00:00-05:00https://blog.ipfs-search.com/challenge-accepted<h1 id="introduction">Introduction</h1>
<p>In fall 2021 we started the ambitious work of seeing whether <a href="http://ipfs-search.com">ipfs-search.com</a> could truly handle web-scale traffic. Through the grapevine, we’d heard how a well known search engine might be interested in searching IPFS. Searching IPFS is what we’ve been doing since 2016, so we said “challenge: accepted”.</p>
<p>The same grapevine told us that this search engine handles about 1000 requests per second. At the time, we were handling about 0.1 requests per second: quite a difference. However, our statistics were showing early signs of exponential growth, in which case 4 orders of magnitude really doesn’t take that long.</p>
<p>Being passive on the internet means explosive growth will overwhelm you. Our success might well be the cause of our demise. Or worse; as with many start-ups we might feel forced to give up <a href="https://blog.ipfs-search.com/breaking-the-silent-consent/">our ideals</a> under market pressures, just to survive.</p>
<p>This is the first part in a two-post blog miniseries, where we describe how indeed we managed to surpass our ambitions of handling 1000 requests per second.</p>
<figure>
<img alt="Traffic growth over 2022." src="/assets/images/2023-04-03-challenge-accepted/api_requests.png" />
<figcaption>Traffic growth over 2022.</figcaption>
</figure>
<figure>
<img alt="Index growth over 2022." src="/assets/images/2023-04-03-challenge-accepted/documents_per_index.png" />
<figcaption>Index growth over 2022.</figcaption>
</figure>
<h1 id="its-elastic-right">It’s elastic, right?</h1>
<p>We were running Elastic (currently <a href="https://www.theregister.com/2021/04/13/aws_renames_elasticsearch_fork_opensearch/">OpenSearch</a>, as <a href="https://blog.opensource.org/the-sspl-is-not-an-open-source-license/">Elastic isn’t Open Source</a> anymore), a document store specifically designed to scale and handle gigantic datasets. After Google’s publication in the early 2000s of <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a>, the smart folks behind Elasticsearch (amongst others) built a FOSS (Free and Open Source) search index on top of it, in theory allowing limitless scaling. However…</p>
<h2 id="theory-doesnt-work-in-practice">Theory doesn’t work in practice.</h2>
<p>Early benchmarks suggested that a single node was able to handle 10 queries per second. Which, again in theory, suggested that merely scaling out our cluster from 4 to 100 servers ought to do it. But alas, it wasn’t so easy.</p>
<p>As soon as we scaled our cluster from 4 up to 30 nodes, average response times shot up to over half a second! Mind you, these are averages — it implies that some of our users had to wait for several seconds for search results.</p>
<figure>
<img alt="Response times over 2021." src="/assets/images/2023-04-03-challenge-accepted/response_time.png" />
<figcaption>Response times over 2021.</figcaption>
</figure>
<h2 id="the-internet-is-impatient">The Internet is impatient!</h2>
<p>Unlike visitors of your local library, users have strong expectations when it comes to looking for information on the internet. Wait more than 200 ms and a website is experienced as slow. Wait more than 1 second and you’ll start interrupting the user’s flow. More than a few seconds and users will leave, never to return again. (<a href="https://ux.stackexchange.com/questions/100316/loading-time-and-user-expectations">Reference</a>) It doesn’t matter how many queries per second we can serve, it’ll be useless if we’re serving them too slowly!</p>
<h2 id="endless-fidgeting-with-knobs">Endless fidgeting with knobs</h2>
<p>Like any large and complex machine, Elastic/OpenSearch has a large number of configuration options which one can spend a lifetime tuning. Sadly, it seems that few experts in the field have bothered to share detailed knowledge. As soon as one leaves the ‘safe’ territory of the Proof of Concept, one enters the domain of the Tech Consultant. Search being our core activity, this is a potentially endless sinkhole of funds, which we do not have in the first place!</p>
<figure>
<img alt="Control panel, knobs and dials." src="/assets/images/2023-04-03-challenge-accepted/Control-2-1200x766.jpg" />
<figcaption>Source: <a href="https://flashbak.com/the-control-panel-archive-the-tactile-beauty-of-buttons-meters-knobs-and-dials-406888/">https://flashbak.com/the-control-panel-archive-the-tactile-beauty-of-buttons-meters-knobs-and-dials-406888/</a></figcaption>
</figure>
<p>Rather than Outsourcing All The Things, we ended up becoming the consultants ourselves. Which is one of the reasons it took us over a year to learn how to overcome these obstacles, with the end result being that we now have all the knowledge in-house. (We did get some help, but more towards the practical side of the implementation.)</p>
<p>Over time, we tried:</p>
<ol>
<li><a href="https://github.com/ipfs-search/ipfs-search-deployment/blob/main/docs/architecture/sharding.pdf">Increasing the number of index shards</a> and…</li>
<li>… <a href="https://github.com/ipfs-search/ipfs-search-deployment/blob/main/docs/architecture/sharding%20reconsiderations%206-2-23.pdf">decreasing them again</a>.</li>
<li>Increasing the number of replicas and…</li>
<li>… decreasing them again.</li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_unset_or_increase_the_refresh_interval">Tuning our refresh interval</a>.</li>
<li>Implementing <a href="https://github.com/ipfs-search/ipfs-search/pull/217">batching/bulk reads</a> and <a href="https://github.com/ipfs-search/ipfs-search/pull/201">writes</a> for our crawler.</li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_indexing_buffer_size">Tuning our index buffer size</a>.</li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html#_search_rounded_dates">Searching rounded dates</a>.</li>
<li>Upgrading from ElasticSearch to OpenSearch and…</li>
<li>Upgrading OpenSearch again.</li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html">Reindexing</a> All The Things, several times.</li>
<li>Ensuring persistent (keepalive) connections for search clients.</li>
<li><a href="https://www.outcoldman.com/en/archive/2017/07/13/elasticsearch-explaining-merge-settings/">Tuning max_merge_count</a> to prevent index throttling.</li>
<li>Reducing our crawling rate.</li>
<li>Enabling <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#shard-allocation-awareness">shard location awareness</a> and…</li>
<li>… disabling it again.</li>
<li>Tuning <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-shard-routing.html#search-concurrency-and-parallelism">max_concurrent_shard_requests</a> in search queries.</li>
<li>Enabling <code class="language-plaintext highlighter-rouge">_local</code> <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-shard-routing.html#shard-and-node-preference">shard preference</a> in queries and…</li>
<li>… disabling it again.</li>
<li>Setting per-shard search API <code class="language-plaintext highlighter-rouge">timeout</code>.</li>
<li>Set <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html">translog</a> durability to async.</li>
<li><a href="https://www.exratione.com/2018/03/elasticsearch-adjusting-merge-settings-to-make-frequent-updates-less-painful/">Tuning <code class="language-plaintext highlighter-rouge">reclaim_deletes_weight</code></a>.</li>
</ol>
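To give a flavour of what a few of those knobs look like in an index-settings body (the setting names are real OpenSearch/Elasticsearch settings; the values are examples of what we experimented with, not a recommendation):

```typescript
// Example index-settings body combining several knobs from the list
// above. Values are illustrative only.
const indexSettings = {
  index: {
    refresh_interval: "30s",           // trade result freshness for indexing throughput
    number_of_replicas: 1,             // fewer replicas = less write amplification
    translog: { durability: "async" }, // acknowledge writes before fsync
  },
};
```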
<h3 id="resources">Resources</h3>
<ul>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#tune-for-indexing-speed">Tune for indexing speed</a> by Elasticsearch</li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html">Tune for search speed</a> by Elasticsearch</li>
<li><a href="https://www.outcoldman.com/en/archive/2017/07/13/elasticsearch-explaining-merge-settings/">How to avoid index throttling, deep dive in segments merging</a> by Denis Gladkikh</li>
<li><a href="https://www.exratione.com/2018/03/elasticsearch-adjusting-merge-settings-to-make-frequent-updates-less-painful/">Adjusting Merge Settings to Make Frequent Updates Less Painful</a> by Reason</li>
<li><a href="https://www.alibabacloud.com/blog/alibaba-cloud-elasticsearch-performance-optimization_597092">Elasticsearch Performance Optimization</a> by Alibaba</li>
<li><a href="https://repost.aws/knowledge-center/opensearch-indexing-performance">How can I improve the indexing performance[…]?</a> by Amazon</li>
</ul>
<h2 id="reading-and-writing-but-not-at-the-same-time">Reading and writing, but not at the same time!</h2>
<p>It turns out there was not a single factor which could be clearly identified as the ‘root cause’ of our issue; rather, a number of factors were colluding. However, we discovered there was resource contention between our crawler’s indexing and our search queries. This is also why many of our measures focused on improving search through improving indexing performance. In the end, implementing asynchronous/bulk reads and writes significantly increased the stability of our cluster, reducing both the variance in response times and the average.</p>
<p>It did become clear though, as would be expected, that performing crawling in bulk and asynchronously was a major factor in getting our response times under control. And so in summer ‘22 it finally seemed we were ready to continue scaling, but…</p>
<figure>
<img alt="A glimpse of one of the monitoring dashboards which we developed along the process." src="/assets/images/2023-04-03-challenge-accepted/all_the_stats.png" />
<figcaption>A glimpse of one of the monitoring dashboards which we developed along the process.</figcaption>
</figure>
<h1 id="our-benchmark-havent-started-yet">Our benchmark hadn’t even started yet!</h1>
<p>Throughout the ‘minor’ delay and distraction of finally getting these darn response times under control, we went waaaay overboard creating extremely insightful monitoring dashboards. We implemented deep-reaching monitoring functionality in our crawler, of all components.</p>
<p>But we hadn’t yet managed to scale our cluster beyond 33 nodes! Nor develop or run our actual benchmark! Want to learn how we achieved this?</p>
<p>Continue reading <a href="https://blog.ipfs-search.com/searching-at-scale/">our second post</a>.</p>Mathijs de BruinIn fall 2021 we started the ambitious work of seeing whether [ipfs-search.com](http://ipfs-search.com) could truly handle web-scale traffic. Through the grapevine, we’d heard how a well known search engine might be interested in searching IPFS. Searching IPFS is what we do since 2016, so we said “challenge: accepted”.Decentralised search: from dream to reality2022-09-26T00:00:00-05:002022-09-26T00:00:00-05:00https://blog.ipfs-search.com/Decentralised%20search:%20from%20dream%20to%20reality<h1 id="decentralised-search-from-dream-to-reality">Decentralised search: from dream to reality</h1>
<p>At the beginning of May 2022, distributed web specialists from <a href="http://redpencil.io/">redpencil.io</a> and <a href="http://ipfs-search.com">ipfs-search.com</a> conducted an experiment to run a fully distributed search index at ipfs-search.com. The experiment was created and performed by Aad Versteden, together with Elena Poelman, and was based on the research by professor <a href="https://pietercolpaert.be/">Pieter Colpaert</a> of Ghent University.</p>
<p>In our short talk, Aad Versteden, also co-founder and CEO of redpencil.io, shares the general concept of decentralization, the design, and procedure of the experiment as well as his insights for the future of distributed search.</p>
<p><strong>ZM: How did <a href="http://redpencil.io/">redpencil.io</a> come to work with ipfs-search.com? What attracted you to them?</strong></p>
<p>AV: We showed interest in ipfs-search.com because discovery is a cornerstone of new web technologies. So, when we noticed that their services had gone down, we reached out to see if we could help. We work towards opening up the internet, and having a golden tool like ipfs-search.com disappear was not something we were going to ignore.</p>
<p>What’s more, we have chosen the distributed search route because running servers on this scale wouldn’t be financially feasible forever. As <a href="http://ipfs-search.com/">ipfs-search.com</a> grows, I think there will be a funding gap that we won’t be able to cover (the search is growing faster than <a href="http://redpencil.io/">redpencil.io</a>). The other team members there don’t have much experience with <a href="https://en.wikipedia.org/wiki/Linked_data">Linked Data</a> technologies. So it seems that there is scope for some breakthroughs.</p>
<p>As redpencil.io works on distributed knowledge with tangible applications, it made sense that we execute this experiment.</p>
<p><strong>Could you provide a little overview of the entire system in which the experiment was carried out?</strong></p>
<p>IPFS itself allows you to share resources through their network and lets you share what you have downloaded with others, peer-to-peer. The concept is quite simple: when someone downloads a certain file and their neighbour also wants it, it can be shared directly instead of going through some centralised server. In the case of Netflix, for example, this means that if I am at my parents’ house with my brother, and we are separately watching the same Netflix series, that series will only be downloaded to our home network once.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2022-09-26-distributed-search-interview/p2p.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Michel Bakni, CC BY-SA 4.0 <a href="https://creativecommons.org/licenses/by-sa/4.0">https://creativecommons.org/licenses/by-sa/4.0</a>, via Wikimedia Common</em></td>
</tr>
</tbody>
</table>
<p>For search engines this is obviously a bit more complex. IPFS solves this for larger media files, but having a search index shared and used over the Internet is quite uncommon. There has been research in the Linked Data space on how to build resources that are shared and discoverable. If we consider the search index as a big folder of files hosted on IPFS, we find that we can reuse some of those technologies—mainly research by Professor Pieter Colpaert.</p>
<p>What they have done is to say—if we are going to have a dataset, and we want to get information out of it, we shouldn’t be running a very heavy server to do that because then we are the ones who have to pay for that server. It’s better for the end users to have a slightly higher cost per query and for us, the providers, to have a vastly lower cost. The cheapest way to do something like that is basically to say: look, here’s the data in an index, go and figure out how to reuse it.</p>
<p>Sharing the index as a whole would mean people downloading gigabytes of data to answer a query. Nobody wants to do that, and it is not feasible.</p>
<p>So, Prof. Colpaert found a way to split this data so that only what is needed to answer a query is retrieved, purely by using Linked Data technologies. There are solutions for both prefix search and full-text search, but we haven’t tried full-text search.</p>
<p><strong>What have you tried?</strong></p>
<p>We implemented prefix search. It means that we took the full 2019 and 2020 datasets from <a href="http://ipfs-search.com/">ipfs-search.com</a>, and created a split version of it. We had all the titles and looked at what letter they start with. The way it works is if someone searches for a title that starts with a ‘T’, they will be redirected to a page of the index with that letter or combination of letters. It narrows down the results so that for each letter searched—only one page is retrieved. These pages are small parts of the search index, so if my spouse also searched for the same letter(s), I would automatically provide her with my part of the index. It would not go through anyone else, it would remain on the local network.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2022-09-26-distributed-search-interview/3.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Frontend view. Courtesy of <a href="http://redpencil.io/">redpencil.io</a></em></td>
</tr>
</tbody>
</table>
<p>Prefix search allows you to search only by the beginning of a page title. It breaks the search query down into letters and creates a sort of container for each letter’s results, narrowing the results down until you get all records from the index that start with, say, “The sta”. This is great progress, but it is closer to a library index than to the search engines we are familiar with nowadays.</p>
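<p>The splitting described above can be sketched in a few lines. This is a toy model with made-up names, not the actual Linked Data fragmentation used in the experiment (which published TTL pages on IPFS):</p>

```python
from collections import defaultdict

def build_prefix_pages(titles, max_page_size=100, prefix_len=1):
    """Toy prefix partitioner: bucket titles by their first letter(s);
    split oversized buckets again on a longer prefix ('t' -> 'th' -> ...)."""
    pages = {}
    buckets = defaultdict(list)
    for title in titles:
        buckets[title[:prefix_len].lower()].append(title)
    for prefix, bucket in buckets.items():
        longest = max(len(t) for t in bucket)
        if len(bucket) <= max_page_size or prefix_len >= longest:
            pages[prefix] = sorted(bucket)  # small enough: emit as one page
        else:
            pages.update(build_prefix_pages(bucket, max_page_size, prefix_len + 1))
    return pages

pages = build_prefix_pages(["The Stars", "The State", "Trees", "Maps"], max_page_size=2)
# A query for "The sta" only needs the page whose key prefixes it ("th" here),
# so each client fetches one small page instead of the whole index.
```

<p>In the real setup each page is a file on IPFS, so clients that fetched a page automatically re-serve it to their peers.</p>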
<p><strong>So what exactly did this experiment involve?</strong></p>
<p>Our experiment consisted of taking the <a href="http://ipfs-search.com/">ipfs-search.com</a> database, titles, and some identifiers so that we knew where to find these resources, partitioning it to enable this type of search using known technologies, publishing the full dataset on IPFS, and then building a frontend hosted on IPFS. If someone wanted a full search index, they could <a href="https://docs.ipfs.tech/how-to/pin-files/">pin</a> that folder to have it locally available. This is useful in cases where someone wants to host it to make it easily accessible.</p>
<p>Some of our benchmarks show a great user experience; for others, it was more a proof of concept than a usable tool. For example, when we hosted it on one node, it was quite slow at times. With 3 nodes, the content was already faster to access: it took a few seconds to get to the first page, and then it would go on quickly. With 4 nodes, we needed a second to download the first page, and subsequent pages took about 250 milliseconds. Of course, for already searched keywords, the results appear faster, so you can see them as they are discovered. The more people use the index, the faster it becomes.</p>
<p><strong>What downsides did this approach have?</strong></p>
<p>Well, it’s a fully distributed search index in the sense that the index itself is shared. It’s a bit strange that this is even possible, and a bit strange that it actually works, haha.</p>
<p>However, the search index is built centrally by one entity that says: this is the index, you should trust us. The same way as it is with <a href="http://ipfs-search.com">ipfs-search.com</a>. Suboptimal but this is the reality for now.<br />
The other downside is updating the index: every month it fills with new pages, so the data you cached and shared with others no longer carries any value. That’s a bit problematic, but improvements are very feasible.</p>
<p>Another one is the fact that we didn’t build a full-text search; on ipfs-search.com, you very often search for a topic rather than a title you already know. A full-text index would be more useful for end users.</p>
<p><strong>If we imagine a fully distributed search engine, what would it look like in practice?</strong></p>
<p>When you search a query, you have a need for certain parts of a large database. What happens now is that <a href="http://elastic.co">Elasticsearch</a>, which is used server-side at ipfs-search.com, gets a set of results, and to compute that, it will need to use parts of its index. It will combine them and come up with 50 results that might be of interest to the user.</p>
<p>In the semantic web, where the idea that everything should be decentralised and discoverable is prevalent, the approach would be different. It would be to take the search index and cut it into a million pieces that the user can retrieve.</p>
<p>Imagine you view an image via ipfs-search.com. The image will sit in your cache for a while and then be forgotten. But if someone else asks for the same image in the meantime, you can offer it to them.</p>

<p>The same happens with all the tiny pieces of the index you downloaded: as long as they are cached, you can share them with other peers and effectively host them.</p>

<p>If ipfs-search.com ceases to exist, the index remains alive, without being pinned anywhere, and will still be available through the peers that are using it. If enough people hold its bits and pieces, with some luck we will still have a full index.</p>

<p>It’s worth mentioning that even if some pieces go missing, that only means they were of no interest: no one was looking for them.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2022-09-26-distributed-search-interview/centralised.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>CC BY-SA 3.0 <a href="http://creativecommons.org/licenses/by-sa/3.0/">http://creativecommons.org/licenses/by-sa/3.0/</a>, via Wikimedia Commons</em></td>
</tr>
</tbody>
</table>
<p>It is also not a stretch to imagine that a user will trust and choose certain indexes. For example, a user decides to trust ipfs-search.com and also their university’s search engine, and wants to combine the information gathered by both. It is possible to create a space where people can search through whichever indexes they want. If that’s possible, it’s also possible to have a distributively constructed search. And it is not only about trust, because sometimes you want to look something up via a source you don’t trust so much.</p>
<p>When we did the experiment, we found it exceptional that we could have something working without a huge central database providing the search: something that can be commoditised and done by people… Go back 15 years, and it would have been a threat to some major industries.</p>
<p><strong>So, there is hope?</strong></p>
<p>There is hope. A lot of it. If people want their communities to find stuff, and they don’t want to contaminate other communities with it, then we can build a distributed search. There will be a lot of research on human behaviour to be done and experiments like “does it explode today or not?”.</p>
<p>But I think it’s feasible and the technologies we have today are a good start. It’s extremely promising. But also we need to be very realistic – this is not something that is going to replace the main search engine within five years or so, because there will be a lack of functionality. If there is full-on research into it, then yes, totally. But this is not what is happening right now.</p>
<p><strong>Do you and your team have any plans regarding running an expanded version of the current experiment?</strong></p>
<p>If possible, we should move towards larger datasets. We see exponential growth in the search index, and we also noticed that the way we now build the search index can’t keep up with the growing database; it gradually becomes slower. It was a first experiment, and we know how to counter these issues. Great results for a proof of concept.</p>
<p>We’ll have to see what the performance impact is of running across nodes at some point and what the impact is for full-text search, but we are very much in the game.</p>
<h2 id="further-readingwatching">Further reading/watching:</h2>
<ul>
<li><a href="https://github.com/redpencilio/ldes-publisher-service">https://github.com/redpencilio/ldes-publisher-service</a></li>
<li><a href="https://github.com/redpencilio/ldes-prefix-autocomplete">https://github.com/redpencilio/ldes-prefix-autocomplete</a></li>
<li><a href="https://github.com/redpencilio/fragmentation-producer-service">https://github.com/redpencilio/fragmentation-producer-service</a></li>
<li><a href="https://www.ted.com/talks/pieter_colpaert_open_data_to_create_power_for_the_many_not_the_few">Ted Talk with Pieter Colpaert</a></li>
</ul>
<h1 id="how-to-run-the-application-at-home">How to run the application at home:</h1>
<h2 id="preparing-your-ipfs-daemon">Preparing your IPFS daemon</h2>
<p>The ipfs daemon can be configured. The way the application is currently hosted, it expects <code class="language-plaintext highlighter-rouge">Access-Control-Allow-Origin</code> to be set to <code class="language-plaintext highlighter-rouge">*</code>. This means any website can request any resource over IPFS, which shouldn’t cause any harm on the public IPFS network.</p>
<h3 id="easy-route">Easy route</h3>
<p><code class="language-plaintext highlighter-rouge">ipfs config --json API.HTTPHeaders.Access-Control-Allow-Origin '["*"]'</code></p>
<h3 id="complete-route">Complete route</h3>
<p>All settings can be configured through</p>
<p><code class="language-plaintext highlighter-rouge">ipfs config edit</code></p>
<p>This guide assumes you have the IPFS companion up and running with your own gateway and that your gateway has the Access-Control-Allow-Origin set to <code class="language-plaintext highlighter-rouge">["*"]</code> as in:</p>
<p><code class="language-plaintext highlighter-rouge">{
  "API": { "HTTPHeaders": { "Access-Control-Allow-Origin": ["*"] } }
}</code></p>
<h2 id="opening-up-the-frontend">Opening up the frontend</h2>
<p>We assume you’re running the IPFS companion which redirects calls to <code class="language-plaintext highlighter-rouge">ipfs://</code> to your local daemon.</p>
<p>Visit the frontend at ipfs://ipfs/QmXiKm8Y37YyNWsX3bMNpMEHuoUCKWkWvPVUFGP2Ex9kq6</p>
<h2 id="entering-strange-information">Entering strange information</h2>
<p>The frontend allows you to browse different indexes. We’ve made the starting point of a full-text search index available, in the TTL format, at <a href="https://gateway.ipfs.io/ipfs/QmbJT8MRZnyv8gYQmcmUk8FYdgqJFwrn6634CCtxiPd3xr/1">https://gateway.ipfs.io/ipfs/QmbJT8MRZnyv8gYQmcmUk8FYdgqJFwrn6634CCtxiPd3xr/1</a>.</p>
<p>Enter the aforementioned URL in the <em>first text input.</em></p>
<p>Pick .ttl as a format and click <code class="language-plaintext highlighter-rouge">SET DATASOURCE</code>.</p>
<p><img src="/assets/images/2022-09-26-distributed-search-interview/1.png" /></p>
<p>You can verify the first page was fetched by opening your network tab. <em>Notice 1.ttl has been fetched.</em></p>
<h2 id="searching">Searching</h2>
<p>You can now enter any search query. Results are fetched live using a prefix search index.</p>
<p>As you type, the pages for each letter of the query are fetched. Sometimes the network doesn’t find its way and it takes a while to locate a specific page.</p>
<p><img src="/assets/images/2022-09-26-distributed-search-interview/2.png" /></p>
<p>And sometimes it goes very fast:</p>
<p><img src="/assets/images/2022-09-26-distributed-search-interview/3.png" /></p>Zuzanna MajerAt the beginning of May 2022, distributed web specialists from [redpencil.io](http://redpencil.io/) and [ipfs-search.com](http://ipfs-search.com) conducted an experiment to run a fully distributed search index at ipfs-search.com.Anatomy of a search engine2022-09-11T00:00:00-05:002022-09-11T00:00:00-05:00https://blog.ipfs-search.com/Anatomy-of-a-search-engine<h1 id="anatomy-of-a-search-engine">Anatomy of a search engine</h1>
<p>In previous posts, we’ve covered the development of <a href="https://blog.ipfs-search.com/NSFW-f70ee/">frontend filters</a>, described progress on <a href="https://blog.ipfs-search.com/scaling-up-the-search/">scaling up the cluster architecture</a>, and glanced at the <a href="https://blog.ipfs-search.com/breaking-the-silent-consent/">importance of web security</a>.</p>
<p>Now it is time to dive a little deeper into what ipfs-search.com, and basically any modern search engine, consists of.<br />
As this is a very complex topic, we will take the liberty here of viewing just a few selected elements.</p>
<div align="center">
<img src="/assets/images/Anatomy_of_a_search_engine/documentslastmonth.png" /></div>
<hr />
<p>Our latest statistics show that our index is growing rapidly. We store 20 TB of searchable data. Currently, every day, half a million documents are added to the index.</p>
<p>Let’s take a look at how it is done. Here are the elements responsible for catching this data, classifying it, and giving it the correct labels.</p>
<h3 id="a-network-sniffer">A network sniffer</h3>
<p>If you go through <a href="https://ipfs-search.readthedocs.io/en/latest/architecture.html">ipfs-search.com docs</a>, you can read in the documentation that our search “…sniffs the DHT gossip and indexes file and directory hashes”.</p>
<p>Sounds cool, but what does that even mean?</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/Dogs_sniffing_each_other.jpg" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Jurriaan Schulman, CC BY-SA 3.0 <a href="http://creativecommons.org/licenses/by-sa/3.0/">http://creativecommons.org/licenses/by-sa/3.0/</a>, via Wikimedia Commons</em></td>
</tr>
</tbody>
</table>
<p>When we send information over a computer network, it is broken down into smaller units, the smallest units of network communication, called data packets. The sender’s node (a node being just a device connected to a network) breaks each piece of information down into these units, and after completing their journey to the receiver’s node, they are reassembled into the original message.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/Network_packet.jpg" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em><a href="https://commons.wikimedia.org/wiki/File:Network_packet.jpg">https://commons.wikimedia.org/wiki/File:Network_packet.jpg</a>, via Wikimedia Commons</em></td>
</tr>
</tbody>
</table>
<p>Data packets are commonly monitored by sysadmins for security reasons, to search for anomalies in traffic, and perform maintenance.</p>
<p>Intercepting data packets on a computer network is called packet sniffing, and it’s a term that is normally used in information security and network diagnostics. We recognize two ways of using it, legal and not so. It’s often how our <a href="https://en.wikipedia.org/wiki/XKeyscore">governments listen in on our private communication</a> and in the past, was commonly used by hackers for identity theft — stealing credit cards, passwords, etc. (Nowadays, most communication is encrypted, but creepy organisations like the NSA <a href="https://www.wired.com/2012/03/ff-nsadatacenter/">store all of your data and are likely able to break even modern strong encryption</a>.)</p>
<p>The sniffing process is similar to tapping a phone or eavesdropping behind a door, although it requires far more than merely gathering data.</p>

<p>A sniffer itself is a piece of software (like, for example, <a href="https://www.wireshark.org/">Wireshark</a>, which provides a GUI and some helpful analysis tools) that you connect to a computer network to see the traffic.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/Wireshark_Example_Decode.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Wireshark, CC BY-SA 4.0 <a href="https://creativecommons.org/licenses/by-sa/4.0">https://creativecommons.org/licenses/by-sa/4.0</a>, via Wikimedia Commons</em></td>
</tr>
</tbody>
</table>
<h3 id="ipfs-searchcom-sniffer">ipfs-search.com sniffer</h3>
<p><a href="https://github.com/ipfs-search/ipfs-search/tree/master/components/sniffer">Our sniffer</a> does not commit any crimes though. It’s based on the existing <a href="https://github.com/libp2p/hydra-booster">Hydra-Booster,</a> “A new type of DHT (Distributed Hash Tables) node designed to accelerate the Content Resolution & Content Providing on the IPFS Network. A (cute) Hydra with one belly full of records and many heads (Peer IDs) to tell other nodes about them, charged with rocket boosters to transport other nodes to their destination faster.”</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/hydra_booster.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Hydra-booster</em></td>
</tr>
</tbody>
</table>
<p>To make it more useful for our purposes, we created a <a href="https://pkg.go.dev/github.com/ipfs-search/ipfs-search@v0.0.0-20220720103450-c3d9687780aa/components/sniffer">‘middleware’/proxy</a> around the part of IPFS/libp2p that stores which hosts have what, so that every time it learns about something new, it gets passed to our crawler infrastructure.</p>
<p>Our sniffer is currently run on a single node, where we do deduplication of sniffed content. We are upgrading our architecture to allow for distributed sniffing of new content from IPFS’s DHT.</p>
<blockquote>
<p>📢 ipfs-search.com sniffer currently uses 12 heads to process about 3000 hashes per second.</p>
</blockquote>
<h3 id="gossip">Gossip</h3>
<p>Then we just need gossip.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/gossip1.jpg" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>CC-BY-NC-SA 4.0 via <a href="http://www.slenquirer.com/2014/04/gossip-in-sl-aint-nobody-got-time-for.html">SL Enquirer</a></em></td>
</tr>
</tbody>
</table>
<p>Just as people go to a café to exchange more or less important information, in a peer-to-peer network (like libp2p/IPFS, BitTorrent, or other content-addressed storage systems) nodes tell other nodes about the content they have.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/BitTorrent_network.svg" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Scott Martin, CC BY-SA 3.0 <a href="https://creativecommons.org/licenses/by-sa/3.0">https://creativecommons.org/licenses/by-sa/3.0</a>, via Wikimedia Commons</em></td>
</tr>
</tbody>
</table>
<p>They have rather simple conversations going on, like “Where is this file? Have you seen it?”, “Which node has it?”, “It was here, but now it’s there.” etc.</p>
<blockquote>
<p>📢 So how does ipfs-search.com do content discovery? How do we know what’s on IPFS?</p>
</blockquote>
<p>To the network, we’re just a bunch of nodes: we listen to other nodes announcing what’s available. When we hear a message saying “I have this file, you can download it from me”, a small signal passes through our network, and our crawler (the infrastructure that extracts metadata) gets the file and indexes it.</p>
<p>We store the extracted metadata in our database, which lives on a cluster of several servers, each indexing and searching about 2 TB. So on one side, we have crawlers that capture, index, and extract metadata whenever the sniffer finds new content, and on the client side there is <a href="https://ipfs-search.com">ipfs-search.com</a>, our beautiful frontend. When users search for something, they talk to our database, and that is where the results of their query come from.</p>
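<p>That flow can be caricatured as a tiny in-memory pipeline. This is purely illustrative: the real system uses RabbitMQ for queuing and Elasticsearch for the index, and all names below are invented:</p>

```python
from queue import Queue

# Toy model of the pipeline: sniffer -> queue -> crawler -> index -> search.
announcements = Queue()  # stands in for RabbitMQ
index = {}               # stands in for Elasticsearch

def sniff(cid, content):
    """Sniffer: a peer announced a CID; queue it for crawling."""
    announcements.put((cid, content))

def crawl():
    """Crawler: drain the queue, 'extract metadata', and index it."""
    while not announcements.empty():
        cid, content = announcements.get()
        index[cid] = {"title": content.title(), "size": len(content)}

def search(term):
    """Search API: naive substring match over indexed titles."""
    return [cid for cid, doc in index.items() if term.lower() in doc["title"].lower()]

sniff("QmFoo", "final fantasy soundtrack")
crawl()
assert search("fantasy") == ["QmFoo"]
```

<p>The key property the sketch preserves is that indexing is driven entirely by what the sniffer overhears, not by following links as a web crawler would.</p>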
<h3 id="a-crawler">A crawler</h3>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/3458011826_ec2838a13c_o.jpg" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>CC BY-NC-SA 2.0 by <a href="https://www.flickr.com/photos/torek/3458011826">Héctor García</a></em></td>
</tr>
</tbody>
</table>
<p>A typical search engine also works with web crawlers. A crawler (sometimes called a web spider or, surprisingly, a spiderbot) is a bot, another piece of software, that visits webpages and indexes the content users have uploaded. Crawlers also keep this content up to date and can help with validating hyperlinks or HTML code.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/Anatomy_of_a_search_engine/ipfs-search-arch-inv.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Sketch of the ipfs-search.com architecture</em></td>
</tr>
</tbody>
</table>
<p>The ipfs-search.com crawler is also the component that orchestrates the process of extracting metadata from all data that is flowing through our network.</p>
<p>For this job, we use <a href="https://tika.apache.org/">Apache Tika</a>, for which we developed the highly efficient streaming <a href="https://github.com/ipfs-search/tika-extractor">tika-extractor</a>. It gets a blob of bits and bytes thrown at its server by the crawler and puts a label on it: this is a music file, that is a text file, these are an author and a title… We also made a special component that asynchronously requests data over our IPFS node, which makes this process more efficient.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"metadata"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"xmpDM:genre"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"Soundtrack"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:composer"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"Nobuo Uematsu"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"X-Parsed-By"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"org.apache.tika.parser.DefaultParser"</span><span class="p">,</span><span class="w">
</span><span class="s2">"org.apache.tika.parser.mp3.Mp3Parser"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"creator"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">""</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:album"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"</span><span class="se">\"</span><span class="s2">Final Fantasy IX</span><span class="se">\"</span><span class="s2"> Original Soundtrack, Disk 4"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:trackNumber"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"24"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:releaseDate"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"2000"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"meta:author"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">""</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:artist"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">""</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"dc:creator"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">""</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:audioCompressor"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"MP3"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"resourceName"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"24-Coca Cola TV CM 1.mp3"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"Coca Cola TV CM 1"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:audioChannelType"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"Stereo"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"MPEG 3 Layer III Version 1"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:logComment"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"eng - </span><span class="se">\n</span><span class="s2">http://www.ffdream.com"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:audioSampleRate"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"44100"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"channels"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"2"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"dc:title"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"Coca Cola TV CM 1"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"Author"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">""</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"xmpDM:duration"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"20218.76953125"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"Content-Type"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"audio/mpeg"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"samplerate"</span><span class="p">:</span><span class="w"> </span><span class="s2">"-</span><span class="se">\"</span><span class="s2">44100</span><span class="se">\"</span><span class="s2">"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"file"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
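<p>Tika reports most fields as lists of strings, as in the document above. A sketch of how a crawler might flatten such a document into simple indexable fields (the flattening rule here is an assumption for illustration, not ipfs-search’s actual mapping):</p>

```python
def flatten_metadata(doc):
    """Collapse Tika's list-valued metadata into simple fields,
    dropping empty values and keeping the first entry per key."""
    flat = {}
    for key, values in doc.get("metadata", {}).items():
        if isinstance(values, list):
            values = [v for v in values if v]  # drop "" placeholders
            if not values:
                continue
            flat[key] = values[0]
        else:
            flat[key] = values
    return flat

doc = {
    "metadata": {
        "dc:title": ["Coca Cola TV CM 1"],
        "xmpDM:composer": ["Nobuo Uematsu"],
        "Author": [""],  # empty value, dropped
        "Content-Type": ["audio/mpeg"],
    },
    "version": 2,
    "type": "file",
}
flat = flatten_metadata(doc)
# flat == {"dc:title": "Coca Cola TV CM 1", "xmpDM:composer": "Nobuo Uematsu",
#          "Content-Type": "audio/mpeg"}
```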
<h3 id="a-bitswap-protocol">A bitswap protocol</h3>
<p>It is worth mentioning that IPFS is built on a protocol called <a href="https://docs.ipfs.tech/concepts/bitswap/">Bitswap</a>, in which nodes trade data by exchanging <em>want-have</em> requests. If you want to download something, the way to get it is to have something that somebody else wants. This is how the network balances itself.</p>
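<p>A heavily simplified model of that want-have exchange (real Bitswap keeps per-peer ledgers, sessions, and wantlists; this toy shows only the two message types):</p>

```python
class Peer:
    def __init__(self, name, blocks=None):
        self.name = name
        self.blocks = dict(blocks or {})  # cid -> block data

    def handle_want_have(self, cid):
        """Reply to a want-have: do we hold this block?"""
        return cid in self.blocks

    def handle_want_block(self, cid):
        """Reply to a want-block: send the block itself, if we have it."""
        return self.blocks.get(cid)

def fetch(cid, peers):
    """Cheap want-have probes first; want-block only to a peer that has it."""
    for peer in peers:
        if peer.handle_want_have(cid):
            return peer.handle_want_block(cid)
    return None

a = Peer("A")
b = Peer("B", {"QmCat": b"meow"})
assert fetch("QmCat", [a, b]) == b"meow"
```

<p>The point of the two-step exchange is bandwidth: a want-have answer is a few bytes, so a node can probe many peers cheaply before asking any single one to send the full block.</p>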
<h3 id="summary">Summary</h3>
<p>So basically, what ipfs-search.com does is this: while nodes (all the computers connected to IPFS) talk to each other about available resources, the sniffer (another node) listens to this communication and extracts hashes. When something interesting turns up, the crawler extracts data from the hashes and indexes it.</p>
<p>Of course, there is more to it. There are other processes under the hood, such as queuing, which is done using RabbitMQ, or our <a href="https://github.com/ipfs-search/ipfs-search-api/">search API microservice</a>. We refer those interested to our <a href="https://ipfs-search.readthedocs.io/en/latest/">documentation</a>.</p>
<h3 id="taking-it-further">Taking it further</h3>
<p>In April, <a href="http://protocol.ai">Protocol Labs</a> released the first production version of the <a href="https://filecoin.io/blog/posts/introducing-the-network-indexer/">Network Indexer</a>, which makes it possible to search content-addressable data networks like IPFS and Filecoin by CID or multihash. This is a decisive step towards a goal that is also in our line of work: easier and more accessible fetching of data across the IPFS network.</p>
<p>We might be looking at the option of combining these two indexing technologies. The result could be exciting.</p>
<p>Also, we’ll be moving to a different queuing system where we can have multiple sniffers and/or have them integrated with our IPFS nodes.</p>
<p><strong>Resources:</strong></p>
<ul>
<li>
<p><a href="https://github.com/ipfs-search/ipfs-search/tree/master/components/sniffer">https://github.com/ipfs-search/ipfs-search/tree/master/components/sniffer</a></p>
</li>
<li>
<p><a href="https://github.com/ipfs-search/tika-extractor">https://github.com/ipfs-search/tika-extractor</a></p>
</li>
<li>
<p><a href="https://pkg.go.dev/github.com/ipfs-search/ipfs-search@v0.0.0-20220404092707-198591df419c/components/sniffer">https://pkg.go.dev/github.com/ipfs-search/ipfs-search@v0.0.0-20220404092707-198591df419c/components/sniffer</a></p>
</li>
<li>
<p><a href="https://github.com/libp2p/hydra-booster/commit/d8438c7b58d7f3639c22252e97873c42617cf389">https://github.com/libp2p/hydra-booster/commit/d8438c7b58d7f3639c22252e97873c42617cf389</a></p>
</li>
</ul>Zuzanna MajerOur latest statistics show that our index is growing rapidly. We store 20 TB of searchable data. Currently, every day, half a million documents are added to the index.Scaling up the search2022-07-01T00:00:00-05:002022-07-01T00:00:00-05:00https://blog.ipfs-search.com/scaling-up-the-search<p>As some of you know, we are supported by <a href="https://nlnet.nl/project/IPFS-search/">NLNet</a> through the EU’s <a href="https://www.ngi.eu/">Next Generation Internet (NGI0)</a> programme, which stimulates network research and development of the free Internet, to do the architecture for scaling up our infrastructure. We are additionally supported by the <a href="https://fil.org/">Filecoin Foundation</a>, who support the growth of the distributed web, through a <a href="https://github.com/ipfs-search/devgrants/blob/96d07f4662c10f3936163dacbf13bfc5c23b8cc6/open-grant-proposals/ipfs-search-scale-out.md">devgrant</a>, whichs helps us to actually implement the scale-out.</p>
<p>We successfully followed our plan to move step-by-step from one server to a 5-node cluster, then to 15 servers, and now we are scaling up through 30, 50 and up to 100 nodes. This puts us on the path to 1,000 hits per second: a thousand users searching something every second. We are now halfway there, running on 30 servers. The current experiment is for us to <em>learn how</em> to scale our infrastructure up to 100 nodes.</p>
<p>Right now, we have an indexing capacity of 20 TB, and we plan to have 100 TB by the end of our scale-out experiment. It is a real challenge: a typical computer stores around 1 TB, and copying that 1 TB from one computer to another can take hours.</p>
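<p>To give a feel for the numbers, here is a quick back-of-envelope calculation, assuming a sustained 1 Gbit/s link (real-world transfers are usually slower):</p>

```python
# Back-of-envelope: how long does copying 1 TB take?
# Assumes a sustained 1 Gbit/s link; real transfers are often slower.
TB = 10**12        # bytes
link_bps = 10**9   # bits per second

seconds = (TB * 8) / link_bps
hours = seconds / 3600
print(f"about {hours:.1f} hours")  # about 2.2 hours at full line rate
```

<p>And that is the optimistic case for a single copy; rebalancing a cluster moves many such copies.</p>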
<p><img src="/assets/images/3servers.png" /></p>
<hr />
<p><img src="/assets/images/5servers.png" /></p>
<h2 id="but-let-us-walk-you-through-what-have-been-going-on-in-our-headquarters-recently"><strong>But let us walk you through what has been going on in our headquarters recently</strong></h2>
<p>One of our ways to limit costs is to use physical servers instead of the very popular cloud servers. This choice is also recommended by Elasticsearch, which we use. After careful research, we chose Hetzner, a German hosting company that provides climate-neutral servers, which was also important to us. Why exactly did we decide to use bare-metal servers? We like to keep an eye on what’s going on. We are able to track temperature and delays on individual discs, and we know about every hardware failure and every unusual behaviour pattern; if we were using virtual servers, we wouldn’t know all these things. Also, the costs are about a factor of 10 lower, because we use a lot of data, memory, storage, CPU and I/O.</p>
<p>In the beginning we were indexing on one server, the most powerful server at Hetzner’s, and of course at one point it ran full. We had to shut down the indexing because we weren’t able to take in new files. All this was caused by the fact that in the previous year we had made some changes to the crawler (the part that extracts data from the hashes and indexes it) that made it about 100 times faster. So suddenly, instead of indexing 0.1 documents per second, we were indexing about 10 documents per second. The consequence was obvious: scaling up the hosting.</p>
<h3 id="-we-werent-expecting-a-totally-smooth-transition-as-we-know-that-designing-a-perfect-cluster-is-almost-impossible-at-the-beginning">🛠 We weren’t expecting a totally smooth transition, as we know that designing a perfect cluster is almost impossible at the beginning.</h3>
<p>So, when we went up to 2 servers and there were no problems, it was a great surprise. Our deployments are automated using Ansible. In the past, this allowed us to change hosting companies in about two days. It is a reasonable way to deal with multiple servers: instead of us executing a gazillion commands on every server manually and checking the results, Ansible does this for us. But the architecture (which server does what, and telling that to Ansible in the correct way) was the challenge.</p>
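<p>As a sketch of that idea, a minimal Ansible playbook might look as follows. The inventory group and role names here are hypothetical, not our actual configuration:</p>

```yaml
# Hypothetical sketch: one playbook applies the same roles to every node
# in a group, so adding servers is a matter of extending the inventory.
- hosts: search_cluster        # hypothetical inventory group
  become: true
  roles:
    - common                   # hypothetical role: users, firewall, monitoring
    - elasticsearch            # hypothetical role: install and configure ES
```

<p>The same playbook run then converges 2 or 100 servers alike, which is what makes the step-by-step growth manageable.</p>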
<div align="center">
<img src="/assets/images/graf1.png" />
</div>
<h2 id="redundancy">Redundancy</h2>
<p>Later we moved to 3 servers and reached the point where, when something breaks on one of them, the page stays up. If you design a larger server architecture, there will always be, depending on the size of your system, some number of servers that perform badly. They are guaranteed to crash at some point, and by expecting this, there is no degradation of the service as a result. However, to be safe while this is happening, we needed to prepare a fault-tolerant cluster. This means, among other things, distributing sliced parts of the data (called shards) between multiple nodes, then creating a copy of every shard and allocating it to a different node, in such a way that no shard and its copy live on the same node. The replica shard is always kept up to date with the original. This makes sure that even if some servers are down, all the data is available.</p>
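<p>The allocation rule described above can be sketched as follows. This simplified round-robin placer is purely illustrative; Elasticsearch’s real allocator also weighs disk usage, shard counts and awareness attributes:</p>

```python
# Sketch of the invariant: spread primary shards round-robin over the
# nodes, then place each replica on a different node than its primary.
def allocate(num_shards: int, nodes: list[str]) -> dict[int, dict[str, str]]:
    placement = {}
    for shard in range(num_shards):
        primary = nodes[shard % len(nodes)]
        # replica goes to the next node in the ring, never the primary's node
        replica = nodes[(shard + 1) % len(nodes)]
        placement[shard] = {"primary": primary, "replica": replica}
    return placement

layout = allocate(6, ["node-1", "node-2", "node-3"])
# no shard shares a node with its own copy
assert all(s["primary"] != s["replica"] for s in layout.values())
```

<p>With that invariant in place, losing any single node leaves at least one live copy of every shard.</p>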
<p>These shards, logical and physical divisions of an index, need to be tuned really carefully to the size of the server and size of our constantly growing data.</p>
<p>Although it comes with some disadvantages, horizontal partitioning, by reducing index size, greatly improves search performance.</p>
<h2 id="coordination-through-dedicated-master-nodes">Coordination through dedicated master nodes</h2>
<p>We also introduced the distinction between data and master nodes. Master nodes take care of allocating shards of our data to particular servers and making sure the servers know about each other. They also maintain information such as shard location (which node each shard is on) and the index mapping, and perform health checks. We have to adjust the number of data nodes to the growing architecture in order to maintain cluster stability, but the number of master nodes always stays at 3.</p>
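<p>A sketch of that role split, with hypothetical node names, and the quorum arithmetic behind the fixed count of 3 masters:</p>

```python
# Sketch of the role split: a fixed trio of dedicated masters plus a
# growing pool of data nodes. Node names are hypothetical.
cluster = {
    "es-master-1": ["master"],
    "es-master-2": ["master"],
    "es-master-3": ["master"],
    "es-data-1": ["data"],
    "es-data-2": ["data"],
    "es-data-3": ["data"],
    "es-data-4": ["data"],
}

masters = [name for name, roles in cluster.items() if "master" in roles]
# An odd, fixed number of master-eligible nodes keeps elections simple:
# with 3 masters, a quorum of 2 survives the loss of any one of them.
quorum = len(masters) // 2 + 1
assert quorum == 2
```

<p>Data nodes can then be added or removed freely without touching the election quorum.</p>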
<h2 id="data-replication">Data replication</h2>
<p>Last but not least, we were working on data replication. IPFS Search by definition is an entry point to a lot of data, which must not be lost. We set our replication factor to 2, which means that we keep 3 copies of the data in our cluster: 1 primary and 2 replicas. In other words, even if a primary is lost, its replica can be promoted to primary until recovery.</p>
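<p>In Elasticsearch terms, that replication factor maps onto the <code>number_of_replicas</code> index setting. The sketch below shows the shape of such a settings body; the shard count is illustrative, not our actual configuration:</p>

```python
# Sketch of index settings matching a replication factor of 2:
# each shard exists as 1 primary + 2 replicas = 3 copies in total.
# (Shaped like the JSON body of an index-creation request; the
# shard count is illustrative.)
settings = {
    "index": {
        "number_of_shards": 12,    # illustrative
        "number_of_replicas": 2,   # replication factor from the text
    }
}

copies_per_shard = 1 + settings["index"]["number_of_replicas"]
assert copies_per_shard == 3
```

<p>Note that replicas multiply storage and indexing cost by the number of copies, which is part of why shard sizing needs careful tuning.</p>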
<p>In addition to this, we make daily snapshots of our index, so that even if we accidentally delete all our data (e.g. human error or end of the World…) we keep a backup.</p>
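<p>A daily snapshot schedule like this can be expressed as an Elasticsearch snapshot lifecycle management (SLM) policy. The sketch below shows the shape of such a policy; the repository name, index pattern and retention window are hypothetical:</p>

```python
# Sketch of a daily snapshot policy in Elasticsearch SLM terms.
# Repository name, index pattern and retention are hypothetical.
daily_snapshots = {
    "schedule": "0 30 1 * * ?",            # every day at 01:30
    "name": "<daily-snap-{now/d}>",        # date-math snapshot name
    "repository": "backup_repository",     # hypothetical snapshot repo
    "config": {"indices": ["ipfs*"]},      # hypothetical index pattern
    "retention": {"expire_after": "30d"},  # keep a month of backups
}
assert "schedule" in daily_snapshots and "repository" in daily_snapshots
```

<p>Snapshots are incremental, so a daily cadence stays cheap even for a large index.</p>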
<p>So we have come a long way from 1 document to 500 documents being indexed or updated at the same time, and we’re still improving and optimizing various parts of this system. The challenge here was (and still is) finding the right way to tune the shards and keep our cluster healthy and balanced.</p>
<div align="center">
<img src="/assets/images/nodes.png" />
</div>Zuzanna MajerFor the past few months, we have been working on the search architecture to take IPFS Search from beta to web-scale productionBreaking the silent consent - closer to the free Internet, an interview with the founder of IPFS Search2022-06-03T00:00:00-05:002022-06-03T00:00:00-05:00https://blog.ipfs-search.com/breaking-the-silent-consent<h1 id="breaking-the-silent-consent---closer-to-the-free-internet-an-interview-with-the-founder-of-ipfs-search">Breaking the silent consent - closer to the free Internet, an interview with the founder of IPFS Search</h1>
<p>Online privacy and security are too rarely questioned by ordinary users. Taking them for granted comes from the fact that most people believe they have control over the information they share. Most of us live in silent consent to something that, in a non-virtual society, we would never give permission for. Not everyone has the time to use alternative tools, search engines, browsers, plug-ins, and a range of security features. Learning new things takes time, and with that, as with everything else, the primacy of convenience wins out.</p>
<p><img src="/assets/images/2021-06-03-breaking-the-silent-consent/Artboard_1.svg" /></p>
<p>Still, there is a safer future for the Internet, and it’s us, users, who should fight to bring this future into today.</p>
<p>Our guest is Mathijs de Bruin, founder, and inventor of <a href="http://ipfs-search.com/">IPFS Search</a>, a search engine indexing the open-source Interplanetary File System which describes itself as “a peer-to-peer hypermedia protocol designed to preserve and grow humanity’s knowledge by making the web upgradeable, resilient, and more open.”</p>
<p><img src="/assets/images/2021-06-03-breaking-the-silent-consent/Untitled.png" /></p>
<p><strong>ZM:</strong> <strong>Maybe before we really start this talk, tell us how this description of the project you contribute to, resonates with you?</strong></p>
<p><strong>MdB:</strong> I think what <a href="https://twitter.com/juanbenet">Juan Benet</a> is striving for is something that we have been looking for in the hacker movement for a long time.</p>
<p>So, as hacker technologists, we are the people that maintain the Internet infrastructure. For most users, the Internet is something that is just “there.” Like a black box, you put a few cables together and the magic happens. But for us, it is different, we know what’s going on inside this box.</p>
<p>Let’s remember that initially the Internet was set up as a research protocol, among research institutions, mostly by the US government. It was created to be resilient against outsiders’ attacks, specifically nuclear attacks. But it wasn’t set up to be resilient to censorship or cyberattacks.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="../assets/images/2021-06-03-breaking-the-silent-consent/Internet_map_1024.jpg" alt="Internet Map" title="image" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Internet backbone as of January 15, 2005. <a href="https://commons.wikimedia.org/w/index.php?curid=1538544">CC BY 2.5</a></em></td>
</tr>
</tbody>
</table>
<p>The way I see it is that the only reason we have free internet right now is because a lot of people that are maintaining core infrastructure have very strong morals and principles. It is not accidental that the Internet used to be an open, free protocol. Now we have mobile providers who offer <em>free</em> Internet for Facebook, YouTube, and Spotify, but not in general. And at the same time, we see that Facebooks and YouTubes of this world are applying <a href="https://www.bmj.com/content/375/bmj.n2635/rr-80">various</a> <a href="https://theintercept.com/2020/10/15/facebook-and-twitter-cross-a-line-far-more-dangerous-than-what-they-censor/">kinds</a> of <a href="https://www.eff.org/deeplinks/2020/03/ninth-circuit-private-social-media-platforms-are-not-bound-first-amendment">censorship</a> to a frightening degree. We are talking about free, democratic societies. It’s a complete erosion of civil liberties that… we didn’t really have until the advent of cyberspace, and now we are already looking to lose them. We’ve been worried about that in the hacker community for a long time.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2021-06-03-breaking-the-silent-consent/7170226404_3a9a96976d_o.jpg" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>CC BY-NC-SA 2.0 by <a href="https://www.flickr.com/photos/florianhauschild/7170226404/">Florian Hauschild</a></em></td>
</tr>
</tbody>
</table>
<p>In countries that are not pretending to be free, the Internet has been cut off or censored shamelessly. For example, <a href="https://blog.ipfs.io/24-uncensorable-wikipedia/">Turkey blocked Turkish Wikipedia</a> because of an article about state-sponsored terrorism. Spain has been blocking access to information about the <a href="https://www.trustnodes.com/2017/10/01/ipfs-distributed-technology-aids-catalonia-rubber-bullets-fired">referendum in Catalonia</a>, at one point Russia blocked 25% of the Internet because people were saying things the government didn’t like on Telegram.</p>
<p>You should also know that most of the Internet is hosted on Amazon servers. This is another topic, people think that Amazon sells books and toilet brushes, but actually, they sell Internet infrastructure – that’s their core business. And Amazon is an example of a company that doesn’t care about this freedom I mentioned above, they just want to make money. They are not apologetic about it.</p>
<p>So we have been saying for a long time, that the moment you buy into Amazon, the moment you buy into Facebook where it is OK to censor people and trace and track everything, there is no turning back. Governments, companies, and other entities… once they gain such power, they will never give it up.</p>
<p>In opposition to that, people like Juan Benet and people from the hacker community were thinking: OK, so we have torrents, where when you download a file or a film, you also upload it, and there is no official uploader nor downloader and no one can go to a single party and force them to take this file offline… this idea was behind IPFS.</p>
<p>People started to realize: wait, so if we use the same principles we used for torrents, and we use them to make a new kind of Web then censorship will become impossible. And at the same time, you don’t need these big companies anymore.</p>
<p>Imagine that every time someone views the <em>Gangnam Style</em> K-pop video on YouTube, it gets downloaded onto their computer from somewhere. It has 4,387,208,147 views. A sick amount of data for no reason: it’s the same content transferred again and again.</p>
<p><a href="https://www.youtube.com/watch?v=9bZkp7q19f0" target="_blank">
<img src="https://img.youtube.com/vi/9bZkp7q19f0/default.jpg" alt="Watch the video" width="50%" height="50%" />
</a></p>
<blockquote>
<p>Let’s make some assumptions. The video clocks in at 117 Megabytes. That means (at most) 274,286,340,432 Megabytes, or 274.3 Petabytes of data for the video file alone have been sent since this was published. If we assume a total expense of 1 cent per gigabyte (this would include bandwidth and all of the server costs), $2,742,860 has been spent on distributing this one file so far. <br />
Source: <a href="https://ipfs.io/ipfs/QmNhFJjGcMPqpuYfxL62VVB9528NXqDNMFXiqN5bgFYiZ1/its-time-for-the-permanent-web.html">HTTP is obsolete. It’s time for the distributed, permanent web</a> by kyledrake</p>
</blockquote>
<p>And now we arrived at another advantage of IPFS: If I make a video whose content is not politically correct, for example, for my government and I want to share this with the world, there is no one who could possibly take this down. Also, I don’t need to keep it on my server in my house.</p>
<p><strong>I think we’ve covered some important issues here, each of which would probably lead us to an endless and interesting discussion, but let’s focus on a few points: you started talking about how we are vulnerable online, susceptible to tracking. About censorship, privacy… What else happens when we surf the web?</strong></p>
<p>What’s been happening since 9/11 in the USA and every country aligned with the USA is that certain entities within the governments started to think that it is completely alright to forfeit fundamental human rights in general in the face of adversity, particularly terrorism. Suddenly it was alright to lock up and torture people and consider absolutely everyone a suspect. In the wider spectrum, it means that if you watch something “bad” on the Internet, you are susceptible to blackmail.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2021-06-03-breaking-the-silent-consent/13105939224_cd1a9956b7_o.jpg" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>MEP’s demonstrating support for Edward Snowden, who unveiled the government’s extensive and generally unconstitutional domestic spying programs. <a href="https://www.flickr.com/photos/greensefa/13105939224/in/photostream/">CC BY European Union</a></em></td>
</tr>
</tbody>
</table>
<p>All these companies, like Google, Amazon, Facebook, Apple, etc., are not only obligated to give any government entity all the information they have collected, but also to keep their mouths shut about it. Literally everything gets stored. The stuff you did in the past, what house you want to buy, what car you drive, when you have your period, your consumption patterns… and there is a general acceptance of that, mainly because people read too little science fiction. <a href="https://vimeo.com/161183966">Vinay Gupta</a>, a great thinker in many fields regarding where we are as a technical society, said that the problem is that the intellectual leaders of this world, people who studied literature, have completely overlooked science fiction.</p>
<p>These leaders, Gupta claims, don’t know how information propagates on the Internet, and they don’t know about certain systemic behaviours of large centralized systems that are really very important. A government that spies on and knows everything about its own citizens is a different kind of government… We live in a world where AI trained on a specific task has far exceeded human capacity.</p>
<p><strong>Knowing all that, what are the security goals of IPFS Search, and which have already been achieved?</strong></p>
<p>I think one of the things we are trying to hack is not so much technological, which is also why I have been talking politics this whole interview. I think the goal is to do for search what Wikipedia has done for encyclopedias.</p>
<p>The idea would be to have content discovery outside the platform capitalism domain, which is “We are connecting everybody with everybody, you have to go through me, and every time you go through me, you are paying with your attention, which is the most valuable token.” We want to challenge this model and make the search engine very inexpensive: first, by making sure that what we make is very close to what users need and want, and second, by making sure that there is not only one place where all these servers run. We fully expect that as the content we put into our database grows, the load on our database will also increase exponentially.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2021-06-03-breaking-the-silent-consent/4370250237_c69b4265f6_o.png" width="100%" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>By: <a href="https://www.flickr.com/photos/opensourceway/4370250237">opensource.com</a></em></td>
</tr>
</tbody>
</table>
<p>But also, when the people putting content online start providing some of the search themselves, it becomes a decentralized protocol, like Bitcoin for example. We want to set up an incentive system, possibly backed by a blockchain, possibly backed by a funky thing called zero-knowledge proofs, where you can actually make sure that a bunch of people can run a search engine, and they can do it in a way where even if it scales up, even if there are lots of people and some of them don’t play by the rules, you can still get reliable search results. This is our long-term vision.</p>
<p><strong>So again, we have a lot of people involved in the process of development. IPFS Search is based on the idea of a community project.</strong></p>
<p>Yes, we are an open-source project, our model is a bit like Wikipedia, but of course, we currently don’t have an index or a catalogue that people can edit just like that. We would like to have some users’ feedback in our actual search results, but there are some technical problems to solve first, so we prefer to focus on the search for now. As for a community contribution, if you want to change something in the user interface, you want to have a filter or suggest something, you can just propose it via our GitHub repository.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2021-06-03-breaking-the-silent-consent/Wikipedia_Community_cartoon_-_high_quality.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Giulia Forsythe, redrawn by Asiyeh Ghayour <a href="https://commons.wikimedia.org/wiki/File:Wikipedia_Community_cartoon_-_high_quality.png">under CC0 1.0</a></em></td>
</tr>
</tbody>
</table>
<p>We really love it if you want to improve our documentation or contribute ideas, that’s super welcome. But at the same time, we know that at some point we might face various kinds of censorship. And this is why we publicly share the entire index of our search engine. No search engine has ever done that. So we are not open-source only in our code, but also in our index. Similar to OpenStreetMap, we have the same license for data. It means that if somebody wants to take us down or censor us, there is nothing stopping other people from forking us, copying and pasting our entire search engine. If they take it and make a better search engine based on ours, the only thing they need to do is to share their improvements and data set. It’s a double-decentralized principle.</p>
<hr />
<blockquote>
<p>📢 Let’s summarize what we have said until now: We are focusing on having a working search engine that other people can copy and paste, and later together with other people we want to make it properly decentralized.</p>
</blockquote>
<hr />
<p><strong>Do you have any particular cooperation on the horizon right now?</strong></p>
<p>Certain niche search engines that target users who have a bit more knowledge about privacy or are into the decentralized Web, such as Brave or DuckDuckGo, are interested in having an index of the decentralized Web.</p>
<p>So, we want to see if we can handle web-scale traffic in order to start such cooperation.</p>
<p><strong>If you were to look back, what happened in the project that is worth mentioning?</strong></p>
<p>Interesting question, if I look back, indeed a lot of things have happened. In the team, our social structures, how we are working together, etc. But what’s visible is the new front-end we launched and a new, better structure, beyond basic usability, that allows us to receive contributions but also work faster, make improvements, be more agile, and closer to users.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2021-06-03-breaking-the-silent-consent/Untitled 1.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>On <a href="https://api.ipfs-search.com">https://api.ipfs-search.com</a> you can play freely with our API.</em></td>
</tr>
</tbody>
</table>
<p>One of our goals also, instead of having a normal front end, is to offer our services to the world, so people can integrate our search engine into their own websites. We already have an API, so developers can play with our files, directories, and all data and metadata.</p>
<p>What else has happened is porn. What we noticed quite quickly is that some really weird, but not illegal or frightening, stuff got published on our search engine. That’s become a bit of a problem, because we don’t primarily want to be a porn search engine, haha.</p>
<p><img src="/assets/images/2021-06-03-breaking-the-silent-consent/Untitled 2.png" /></p>
<p>To solve it, we implemented a filter (which you could technically also run in a browser) that uses AI to analyze whether a picture is porn or not. It doesn’t work perfectly, but we might also get to the point where we improve the model ourselves. What also came with that is using AI to classify our content, which means we can use it to do similar things with music, to detect the genre and group music together, or to navigate between the text files we have. It’s very interesting, because AI has gone from something abstract that only Google had the power to use, to publicly available models that you can implement yourself. It all leads to the internal discussion about applying censorship and becoming evil, but we have found a way: we only blur the pictures, and give users the choice to switch the filter off.</p>
<p><strong>What are your and your team’s plans for the future? What do you want to achieve within the next 6 months?</strong></p>
<p>We want to go to The Moon and back, haha.</p>
<p>I think in the next year, because of the way IPFS is increasing in popularity, we are going to start growing exponentially. Or IPFS will fail. That would mean that someone has done the same thing better, and we would move to their system; our infrastructure is ready for that.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><img src="/assets/images/2021-06-03-breaking-the-silent-consent/Untitled 3.png" /></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><em>Over the past year, we’ve more than tripled our index as we’re scaling up to a 50-node cluster.</em></td>
</tr>
</tbody>
</table>
<p>I don’t think that will happen, though, because there are large groups that have been actively supporting IPFS. Also, the NFT ecosystem is running on IPFS, there is a lot going on for them. So, if the amount of available content grows exponentially, it means that we have to grow our infrastructure exponentially. We need to be able to expand and have proper frontend and backend teams to also address more features and check what our users need, and try some solutions.</p>
<p>So far we have been three friends working together, like a hobby that got out of hand, but now we are looking into starting a company or a foundation. Actually, what we would like to be is something that so far doesn’t exist legally: a social enterprise, an organization that tries to make money while also guaranteeing certain non-monetary, societal goals. So it will be a personal challenge and also a challenge for us as a team.</p>
<p>It will be a very interesting and tricky year for us.</p>Zuzanna MajerOnline privacy and security are too rarely questioned by ordinary users. Taking them for granted comes from the fact that most people believe they have control over the information they share.NSFW-filter for ipfs-search.com2022-04-18T00:00:00-05:002022-04-18T00:00:00-05:00https://blog.ipfs-search.com/NSFW-f70ee<h2 id="the-problem">The problem</h2>
<p>When we upgraded the frontend for ipfs-search and, while doing so, made graphic content a lot more visible, it became immediately apparent that there was a lot of X-rated material on IPFS, and this made the browsing experience less than pleasant at times. Most search queries turned up at least some imagery of explicit scenes:</p>
<p><sub><em>It happened on a boat last Tuesday.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled.png" alt="picture" /></p>
<p><sub><em>They are white, and they are in a house. What else do you need to know?</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 1.png" alt="picture" /></p>
<p><sub><em>Fresh ideas on where to get vegetables and what to do with them.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 2.png" alt="picture" /></p>
<p><sub><em>Clearly, the girls on the right are captivated by the scene in the middle.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 3.png" alt="picture" /></p>
<h2 id="to-the-rescue-nsfwjs">To the rescue: NSFW.js</h2>
<p>Filtering this out is not a trivial matter. To do it properly, you need to classify all content automatically, and for that you need an intelligent system. Fortunately, we found <a href="http://nsfwjs.com">NSFW.js</a>, an open-source library that implements a pre-trained AI model to classify images for nudity and pornographic content, which should also work for drawings. The library claims 93% accuracy. We made it a priority to integrate this into the search engine.</p>
<p>The AI looks at an image and responds with an estimated classification for five categories: ‘porn’, ‘sexy’, ‘hentai’ (sexually explicit drawings), ‘drawing’ (non-explicit), and ‘neutral’. Each estimate comes as a number between 0 and 1, with 1 being absolute certainty that the image falls in that category and 0 being absolute certainty that it doesn’t.</p>
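<p>A classification therefore has the following shape. The scores below are made up for illustration, not real NSFW.js output:</p>

```python
# Shape of a classification as described above: five scores in [0, 1]
# that sum to (roughly) 1. Numbers are made up for illustration.
classification = {
    "porn": 0.02,
    "sexy": 0.05,
    "hentai": 0.01,
    "drawing": 0.12,
    "neutral": 0.80,
}

assert all(0.0 <= score <= 1.0 for score in classification.values())
top_category = max(classification, key=classification.get)
assert top_category == "neutral"
```

<p>Downstream code then only has to compare these scores against a threshold, rather than understand images itself.</p>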
<h2 id="architecture">Architecture</h2>
<p>For the architecture, we decided on making a <a href="https://github.com/ipfs-search/nsfw-server">microservice</a> to classify IPFS images. The first idea was to cache the results on IPFS, but after some trial and error it seemed that the benefit did not outweigh the trouble, and we decided to work instead with a simple server-side cache. While the NSFW.js library is targeted at client-side classification, it was relatively simple to integrate it into a Node/Express server behind an Nginx reverse proxy with a built-in cache.</p>
<p>The rationale for using a microservice, rather than a purely frontend-based solution, was that it could serve both the search frontend and the search crawlers and/or API: the crawlers would, in due time, attach classification metadata to the database, while the frontend could access the microservice directly as long as that metadata isn’t (yet) available, and decide whether and how to display the results from the API.</p>
<h2 id="prototype">Prototype</h2>
<p>For the first iteration, the prototype, we did nothing more than blur out images in the frontend (using CSS) if they were classified as “not suitable for work”. A simple toggle switch, with its setting stored in the browser’s local storage, turns the feature on and off. The search frontend would call the microservice for each individual image and, as long as the result was undecided (either because the request was in flight or because it returned an error), ‘assume the worst’, i.e., keep the image blurred. We implemented a tooltip displaying the classification percentages for images in the browser, so we could see what data the assessment was based on.</p>
<p><sub><em>Without blur filter (but pixelated for editorial reasons.)</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 4.png" alt="picture" /></p>
<p><sub><em>With blur filter enabled.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 5.png" alt="picture" /></p>
<p>The reason for doing this on the frontend and not yet in the crawler was to field-test the microservice without committing this information to the database, by being able to see directly which images it blurred and which it didn’t.</p>
<p>The result was already a much friendlier search engine with a lot less obnoxious feel to it. It turned out that the estimation thresholds for the categories ‘sexy’, ‘porn’ and ‘hentai’ need to be set very low, around 10-15%, or the filter starts to miss a lot of hits. As would be expected, there were some false positives, and the lower the threshold is set, the more there are. A few false negatives occur too, but not that many.</p>
<p><sub><em>False positive: Obama eating a strawberry classifies as porn with a certainty of 45%. Maybe it is that look on his face.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 6.png" alt="picture" /></p>
<p><sub><em>False positive: This guy, unabashedly exhibiting his banana; 24% certain it is pornography. (It seems that the classifier has a thing for fruit.)</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 7.png" alt="picture" /></p>
<p><sub><em>False positive: These golden lines classify 45% as porn. No comment. They aren’t even that curvy.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 8.png" alt="picture" /></p>
<p><sub><em>False negative: Only 8% certain of pornographic content, which doesn’t meet our (current) threshold. Warning: the image contains product placement.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 9.png" alt="picture" /></p>
<p><sub><em>False negative: the classifier is probably thrown off by the letters photoshopped as background layer; it is not a drawing, and it is definitely not neutral.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 10.png" alt="picture" /></p>
<p><sub><em>False negative; the internet/IPFS is overflowing with this kind of imaginative artwork. Fortunately, most of it is properly classified by NSFW.js as ‘hentai’, because these cartoons are not for kids.</em></sub></p>
<p><img src="/assets/images/2022-04-18-NSFW- f70ee/Untitled 11.png" alt="picture" /></p>
<p>Altogether, the NSFW-filter prototype worked very well, and because of this we brought it to production, so that we had a UX we could show to people with some more confidence.</p>
<h2 id="backend-integration">Backend integration</h2>
<p>The obvious downsides of having this done solely by the frontend are:</p>
<ul>
<li>You cannot add an adult filter to the search API, and simply not showing the results that surpass the threshold causes weird paging issues (e.g., if all results on a single page have a positive NSFW classification, you would see an empty page). The best we could come up with was blurring, but typically you don’t want these results at all.</li>
<li>Because IPFS is still pretty slow, the first-time classification of new content can take long; after that, the cache takes care of it.</li>
</ul>
<p>So, the second phase was to make the code more mature and connect the microservice to the backend: the crawler. We did this by adding the files’ classifications to the metadata in the search engine database, so that the API could filter on them on request.</p>
<p>To do this, we needed to add one more feature: information about which exact AI model had produced a given classification. NSFW.js ships with several models, and it cannot be ruled out that other, better ones will become available in the future, or even that we will train our own, e.g. using user feedback. <br />
So, stored data should reference the model that generated it, so that a next-generation API can make informed decisions about, for example, whether to call the microservice again for newer data. We solved this by calculating the IPFS CID of the model files (using <a href="https://github.com/ipfs/js-ipfs/tree/master/docs">js-ipfs</a>) and adding it to the classification microservice’s output.</p>
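<p>A minimal sketch of that tagging step (hypothetical field and function names; the real microservice computes an IPFS CID with js-ipfs, here a plain SHA-256 digest over the model files stands in as the model fingerprint):</p>

```python
import hashlib


def model_fingerprint(model_files):
    # Deterministic digest over the model's files, ordered by name.
    # Stand-in for the IPFS CID the real service computes with js-ipfs.
    digest = hashlib.sha256()
    for name in sorted(model_files):
        digest.update(name.encode('utf-8'))
        digest.update(model_files[name])
    return digest.hexdigest()


def tag_classification(classification, model_files):
    # Attach the model reference so a future API knows which model
    # produced these scores and whether newer data is worth fetching.
    return {**classification, 'model': model_fingerprint(model_files)}


tagged = tag_classification(
    {'porn': 0.08, 'neutral': 0.81, 'drawing': 0.05},
    {'model.json': b'...', 'weights.bin': b'...'},
)
```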
<p>Finally, we integrated the microservice API into the crawler and added an NSFW filter for search queries on the frontend. Notably, images that had been indexed before the NSFW microservice was connected to the crawler are omitted from filtered results, as could be expected.</p>
<h2 id="considerations-and-debate">Considerations and debate</h2>
<p>It is currently unknown to us how the 93% accuracy was calculated, but any AI-based classification will always produce some false positives and negatives. We considered using user feedback to improve the model, but quickly abandoned the idea because of all the complications it would bring: GDPR regulations, storage of feedback data, fighting trolls, bots and trollbots, UX design for feedback, security, QA, and so forth. Most of all, there was the already tough challenge of keeping web search neutral, unbiased and completely private while at the same time having to curate users’ opinions on sensitive, highly debatable matters.</p>
<p>The debate does not end with filtering nudity; it merely starts there. What about targeted violence, fake news, controversial symbols or politics, discrimination, et cetera? What about written documents or audio recordings, shouldn’t those be filtered too? With the resources we have now, this is too much to deal with, and it may not be urgent yet. However, with a growing user base and a search index covering ever more material, these questions are likely to come up down the line. A good set of solutions will obviously be much more complex than integrating an open source library into the system.</p>
<h2 id="bonus-bonus">Bonus bonus!</h2>
<p>As the NSFW classifier also detects drawings, we can use it to add a query parameter filter for those as well, without much effort.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We successfully dealt with the issue of content that is ‘not suitable for work’ in a straightforward way, without needing too many resources, thanks to the <a href="https://nsfwjs.com/">NSFW.js</a> library. As a result, the user experience of <a href="http://IPFS-search.com">IPFS-search.com</a> has improved considerably.</p>
<h2 id="references">References</h2>
<ol>
<li>Microservice repo - <a href="https://github.com/ipfs-search/nsfw-server">https://github.com/ipfs-search/nsfw-server</a></li>
<li>NSFW.js - <a href="https://nsfwjs.com/">https://nsfwjs.com/</a></li>
<li><a href="https://ipfs-search.com/">ipfs-search.com</a></li>
</ol>Frido EmansTo make searching more pleasant on IPFS-search.com, we implemented a filter for X-rated content.Making ipfs-search distributed2021-09-24T00:00:00-05:002021-09-24T00:00:00-05:00https://blog.ipfs-search.com/making-ipfs-search-distributed<h1 id="a-sketch-for-how-distributed-search-could-be-realized-for-the-ipfs-and-the-distributed-web">A sketch for how distributed search could be realized for the IPFS and the distributed web</h1>
<p><strong><em>Special thanks to Nina for creating this sketch</em></strong></p>
<h3 id="how-we-could-realize-distributed-search">How we could realize distributed search:</h3>
<ul>
<li>Provider nodes that wish to participate parse and index only the files they have themselves added to the dweb (DHT hashes) and that have world-readable file permissions.</li>
<li>This local index is put on an (IPFS) cluster.</li>
<li>A query can use the distributed index.</li>
<li>Initial search functionality is a basic boolean search.</li>
<li>Settings functionality anticipates tuning.</li>
<li>In the future, one can add to the search engine functionality with extensions.</li>
</ul>
<p>The guiding principle is control for users.</p>
<h2 id="sketch">Sketch</h2>
<p>The trial balloon (ballon d’essai) can consist of:</p>
<ul>
<li>A <a href="https://cluster.ipfs.io/">distributed index using an IPFS cluster</a></li>
<li>An indexer package with which content providers can index what they provide and add such an index to the distributed index, starting with indexing documents</li>
<li>A thin, separate client with which people can query the distributed index and receive results ranked relevant to the query.</li>
</ul>
<h3 id="overlay-networks">Overlay networks</h3>
<p>An IPFS node can be fingerprinted through the content it stores. An overlay network needs to offer an “anonymous” mode that only enables features known to not leak information.</p>
<ul>
<li>No local discovery.</li>
<li>No transports other than, for example, via Tor (an overlay network consisting of more than seven thousand relays to conceal a user’s location and usage from anyone conducting network surveillance or traffic analysis).</li>
<li>Private routing to make the network non-enumerable.</li>
</ul>
<h3 id="parsing">Parsing</h3>
<p>We could write a different parser for each file type, but that is not our main focus at the moment. Since a Python port of the Apache Tika library exists which, according to its documentation, supports text extraction from over 1500 file formats, we go with that, at least for now. It is slow, however, and we may reconsider in the future.</p>
<p>This parser is pointed at the root of a site or collection, parses its content (thereby creating a corpus) and adds the objects to IPFS, rather than fetching the IPFS hash table and taking it from there. Again, we wish to focus on indexing and clustering a distributed index, not on figuring out how to use the IPFS hash table (for now).</p>
<p><em>Code on this page is a first shot and should be read as pseudocode snippets.</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import os.path
from multiprocessing import Pool

import ipfsApi
from tika import parser

def tika_parser(file_path):
    # Extract text from the document
    parsed = parser.from_file(file_path)
    text = parsed.get('content')
    if text is None:
        return None
    # Normalise to utf-8 format
    safe_text = str(text).encode('utf-8', errors='ignore')
    # Escape any \ issues
    safe_text = str(safe_text).replace('\\', '\\\\').replace('"', '\\"')
    # Add hash (as filename) and content of the file to the corpus dataframe
    ...

def walkthrough(corpus_root):
    # Walk the directory structure, add each encountered object to IPFS
    # (duplicates resolve to the same hash) and collect (hash, path) pairs.
    api = ipfsApi.Client('127.0.0.1', 5001)
    paths = []
    for root, _dirs, files in os.walk(corpus_root):
        for name in files:
            file_path = os.path.join(root, name)
            added = api.add(file_path)
            paths.append((added['Hash'], file_path))
    return paths

if __name__ == '__main__':
    # path_to_root: root of the site or collection to index
    paths = walkthrough(path_to_root)
    with Pool() as pool:
        corpus = pool.map(tika_parser, [path for _hash, path in paths])
</code></pre></div></div>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="https://tika.apache.org/1.4/formats.html" title="https://tika.apache.org/1.4/formats.html">Tika Supported Document Formats</a></li>
<li><a href="https://pypi.org/project/ipfs-api/" title="https://pypi.org/project/ipfs-api/">IPFS API Bindings for Python</a></li>
</ul>
<h2 id="distributing-the-index-on-an-ipfs-cluster">Distributing the index on an IPFS Cluster</h2>
<ul>
<li>IPFS does not guarantee redundancy. We can use IPFS clustering.</li>
<li>Only popular indexes will achieve decent speeds.
<ul>
<li>We can run a few web-agent-type IPFS nodes in a cluster that pin all the indexes. Give these enough bandwidth and we have some base nodes that can act as mirrors and can also be served via HTTPS (the internet-facing demo version).</li>
<li>IPFS can replace mirror indexes with IPNS addresses. We will still need reliable hosting for these initial seeders.</li>
</ul>
</li>
</ul>
<h2 id="risks">Risks</h2>
<p>IPFS is still in alpha development. That means there are a lot of (undiscovered) bugs and vulnerabilities and the code is not stable. This could create (security) problems.</p>
<h3 id="resources-1">Resources</h3>
<ul>
<li><a href="https://github.com/ipfs/ipfs-cluster/" title="https://github.com/ipfs/ipfs-cluster/">IPFS Cluster Github</a></li>
<li><a href="https://cluster.ipfs.io/documentation/" title="https://cluster.ipfs.io/documentation/">IPFS Cluster Documentation</a></li>
<li><a href="https://cluster.ipfs.io/documentation/deployment/architecture/" title="https://cluster.ipfs.io/documentation/deployment/architecture/">IPFS Cluster Architecture overview</a></li>
</ul>
<h2 id="querying-the-index">Querying the index</h2>
<p>Our intention is to support boolean queries and phrase queries.</p>
<ul>
<li>Sanitize the query (stemming all the words, making all letters lowercase, removing punctuation)</li>
<li>Tokenise the query (split into words)</li>
<li>Look up each term in the distributed index to find which documents it appears in, and union the resulting lists</li>
</ul>
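<p>The sanitising and tokenising steps above could be sketched as follows (with a toy suffix-stripping stemmer standing in for a real one, such as Porter stemming):</p>

```python
import re


def stem(word):
    # Toy stemmer: strips a few common suffixes. A real implementation
    # would use e.g. the Porter stemming algorithm.
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def sanitize_and_tokenise(query):
    # Lowercase, strip punctuation, split into words and stem each token.
    query = re.sub(r'[\W_]+', ' ', query.lower())
    return [stem(word) for word in query.split()]


tokens = sanitize_and_tokenise('Searching, indexed files!')
```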
<h3 id="boolean-query">Boolean query</h3>
<p>For each inverted index from self and received from neighbours:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

def one_word_query(word, invertedIndex):
    pattern = re.compile('[\W_]+')
    word = pattern.sub(' ', word)
    if word in invertedIndex:
        return [filename for filename in invertedIndex[word].keys()]
    else:
        return []
</code></pre></div></div>
<p><strong>OR</strong></p>
<p><strong>Aggregate lists and union</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def free_text_query(string, invertedIndex):
    pattern = re.compile('[\W_]+')
    string = pattern.sub(' ', string)
    result = []
    for word in string.split():
        result += one_word_query(word, invertedIndex)
    return list(set(result))
</code></pre></div></div>
<p><strong>AND</strong></p>
<p>For an AND use an intersection instead of a union to aggregate the results of the single word queries.</p>
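<p>Following the same pattern, an AND query could be sketched like this, intersecting the posting lists instead of unioning them (a hypothetical helper, written against the same inverted-index shape as the snippets above):</p>

```python
import re


def and_query(string, invertedIndex):
    # AND semantics: a document matches only if it appears in the
    # posting list of every word in the query.
    pattern = re.compile(r'[\W_]+')
    words = pattern.sub(' ', string).split()
    if not words:
        return []
    postings = [set(invertedIndex.get(word, {})) for word in words]
    return list(set.intersection(*postings))


# Tiny example index: word -> {filename: [positions]}
index = {
    'ipfs':   {'a.txt': [0], 'b.txt': [3]},
    'search': {'a.txt': [1]},
}
matches = and_query('ipfs search', index)
```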
<h3 id="phrase-query">Phrase query</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def phrase_query(string, invertedIndex):
    pattern = re.compile('[\W_]+')
    string = pattern.sub(' ', string)
    listOfLists, result = [], []
    for word in string.split():
        listOfLists.append(one_word_query(word, invertedIndex))
    setted = set(listOfLists[0]).intersection(*listOfLists)
    for filename in setted:
        temp = []
        for word in string.split():
            temp.append(invertedIndex[word][filename][:])
        # Shift each word's positions back by its offset in the phrase;
        # a non-empty intersection means the words occur consecutively.
        for i in range(len(temp)):
            for ind in range(len(temp[i])):
                temp[i][ind] -= i
        if set(temp[0]).intersection(*temp):
            result.append(filename)
    return rankResults(result, string)  # rankResults: ranking helper, defined elsewhere
</code></pre></div></div>
<h2 id="yggdrasil">Yggdrasil</h2>
<p>Yggdrasil is an early-stage implementation of a fully end-to-end encrypted IPv6 network. It is lightweight, self-arranging, supported on multiple platforms and allows pretty much any IPv6-capable application to communicate securely with other Yggdrasil nodes. Yggdrasil does not require IPv6 Internet connectivity - it also works over IPv4.</p>
<p>Looking at it for its clustering and bootstrapping implementation.</p>
<h3 id="resources-2">Resources</h3>
<ul>
<li><a href="https://yggdrasil-network.github.io/">Yggdrasil</a></li>
<li><a href="https://github.com/yggdrasil-network/yggdrasil-go">Yggdrasil Github</a></li>
</ul>
<h2 id="testing">Testing</h2>
<ul>
<li>Scalability indicators
<ul>
<li>Number of hashes crawled per second, per peer, versus the number of peers</li>
<li>Number of downloaded bytes per second versus the number of peers</li>
</ul>
</li>
<li>Performance indicators
<ul>
<li>Number of hashes crawled per second versus different CPU loads/platforms</li>
<li>Throughput of a peer versus the number of crawl job queues (to determine the optimal number of crawl job queues) per platform (differentiate using agent attributes).</li>
</ul>
</li>
<li>Node failure
<ul>
<li>If automated, this may require adding data entry points in the API that are only used for testing.</li>
<li>Add test data, check that it has been added and has propagated throughout the neighbourhood.</li>
<li>Take an agent offline (check that it has gone down and is inaccessible) and verify that all the data still appears to be available.</li>
<li>Pull data manually from each data store (check there are no errors as a result) on the agent, and verify that the data is still retrievable from the system.</li>
<li>Bring the downed node back online. The data that belongs on this node begins to flow back into the node.</li>
<li>After a while, pull the data from the agent to check that data that was sent to its neighbours when it was down is stored correctly.</li>
</ul>
</li>
<li>Predictive analysis
<ul>
<li>Test for false negatives and false positives of the various classifiers with unlabelled traffic data</li>
</ul>
</li>
</ul>
<h3 id="resources-3">Resources</h3>
<ul>
<li><a href="https://www.fed4fire.eu/testbeds/grid5000/" title="https://www.fed4fire.eu/testbeds/grid5000/">Grid’5000</a></li>
</ul>ipfs-search.comA sketch of how it could be done