Elasticsearch at RTE: Blackout Prevention through Weather Prediction

About RTE

At the core of the power system, RTE (Réseau de transport d’électricité) keeps the balance between power consumption and generation. Twenty-four hours a day, seven days a week, we play a key role in directing the flow of electricity and maximizing power system efficiency for our customers and the community. We convey electricity throughout mainland France, from power generation facilities to industrial consumers who are connected to the transmission grid, and to the distribution grids, which provide the link between RTE and end users. We operate France’s high and extra-high voltage transmission system, the biggest in Europe.

Our Daily Challenge

The electrical resistance of a power line causes it to produce more heat as the current it carries increases. If this heat is not sufficiently dissipated, the metal conductor in the line may soften to the extent that it sags under its own weight between supporting structures. If the line sags too low, a flashover to nearby objects (such as trees) may occur, causing a transient increase in current. Automatic protective relays detect the excessively high current and quickly disconnect the line, with the load previously carried by the line transferred to other lines. If the other lines do not have enough spare capacity to accommodate the extra current, their overload protection will react as well, causing a cascading failure. Eventually, this can lead to a widespread power outage (blackout), like the one that occurred in the Northeastern and Midwestern United States and the Canadian province of Ontario on Thursday, August 14, 2003.

This incident had major adverse effects on the proper functioning of the regional economy, administration, public services, and more generally, on people’s daily lives. Power plants went offline to prevent damage in the case of an overload, forcing homes and businesses to limit power usage. Some areas lost water pressure because pumps lacked power, causing potential contamination of the water supply. Railroad service, airports, gas stations, and oil refineries had to interrupt service due to lack of electricity. Cellular communication devices were disrupted and cable television systems were disabled. Large numbers of factories were closed in the affected area, and others outside the area were forced to close or slow work because of supply problems and the need to conserve energy while the grid was being stabilized.

Unleashing the Power of Numerical Weather Prediction Data

The problem we are trying to solve consists of dynamically determining a line’s sag margin without violating clearance requirements. One way of solving this problem is Dynamic Line Rating (DLR). The DLR prediction model aims to answer this simple question:
What is a transmission line’s maximum instantaneous current carrying capacity after accounting for the effects of weather (temperature, wind, and solar radiation) on thermal damage and line sag?

Clearance requirements for power lines

To answer that question, we used data provided by Météo France, the French national meteorological service. This data is formatted into GRIB2 files that can be sourced from Météo France’s open data platform. GRIB is a file format for the storage and transport of gridded meteorological data, such as output from Numerical Weather Prediction (NWP) models. It is designed to be self-describing, compact, and portable across computer architectures. The GRIB standard was designed and is maintained by the World Meteorological Organization.

How Did the Elastic Stack Help Us Respond to the Challenge?

The goal of the POC was to provide end users with the easiest and most powerful access to this weather data. We found that Elasticsearch’s powerful ingest capabilities, geo indexing, and query features could help us achieve our goal efficiently and at scale, both in terms of throughput and storage size.

Let’s take a deeper look at how we built the data processing pipeline using the Elastic Stack.

Architecture: Elasticsearch and Logstash Combined with Kafka


The data processing pipeline consists of four stages:

Data pre-processing: First, we extracted the needed data from the GRIB2 file to a flat CSV file. For this, we developed custom code to wrap a CLI utility called wgrib2, provided by the Climate Prediction Center of the US National Weather Service.
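As an illustration, extracting a single weather parameter from a GRIB2 file into CSV with wgrib2 can look roughly like this (the file name and the record filter are only examples, not the actual Météo France fields we used):

%> wgrib2 arome_0025_SP1_00H.grib2 -match ":TMP:2 m above ground:" -csv temperature_2m.csv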

Data buffering: Next, we buffered each piece of data by pushing it into a Kafka topic.
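For illustration only, records could be pushed into a topic with the standard Kafka console producer (the broker address and topic name are placeholders, and a production pipeline would typically use its own producer):

%> bin/kafka-console-producer.sh --broker-list kafka1:9092 --topic weather-grib2 < temperature_2m.csv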

Data indexing: Then we read messages from the Kafka topic using the Logstash Kafka input plugin, extracted the desired fields, and built the JSON documents according to our Elasticsearch mapping specification. We also indexed the geographical location of the power lines using the geo_shape datatype.

Elasticsearch weather data mapping:

 "_all": { "enabled": false }, "_source": { "enabled": true, "excludes": ["rid", "debut", "fin", "location"] }, "properties": { "requestid": { "type": "keyword" }, "debut": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }, "fin": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }, "code": { "type": "keyword" }, "valeur": { "coerce": false, "type": "scaled_float", "scaling_factor": 1000 }, "location": { "type": "geo_shape", "tree": "quadtree", "precision": "1100m" } }

Data querying: Finally, the data could easily be accessed using an Elasticsearch query combining the weather data and the geographical location of the power lines.
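To give an idea of what such a query can look like, here is a minimal sketch combining a time range filter on the weather data with an intersection against a pre-indexed power line shape; the index, type, field, and document names below are illustrative placeholders, not the actual ones from the POC:

%> curl -XGET 'http://<es_url>:<es_port>/weather/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "debut": { "gte": "2017-11-01 00:00:00", "lte": "2017-11-01 06:00:00" } } },
        { "geo_shape": {
            "location": {
              "indexed_shape": {
                "index": "power-lines",
                "type": "line",
                "id": "line-400kv-0042",
                "path": "geometry"
              },
              "relation": "intersects"
            }
        } }
      ]
    }
  }
}'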

Key Benefits of Using the Elastic Stack and Kafka

  1. The whole pipeline is:
    • Simple to schedule
    • Stateless and therefore scalable
    • Replayable
    • Modularized into simple autonomous tasks
  2. GRIB2 records are translated to universal JSON Elasticsearch documents, enabling requests from a much wider range of clients.
    • Most client-side technologies are able to query the JSON-based RESTful API
    • Requesters are not bound to GRIB2-enabled software packages
  3. Kafka is a broker with topic partitioning and message retention, which gives us (see the sketch after this list):
    • Decoupling of data consumption from data production
    • The ability to stop ingestion and continue from the same point when restarted
    • The ability to replay the ingest phase of a pipeline repeatedly into multiple consumers, with no change to the source configuration
    • The option to deliver messages to different consumers depending on consumer group configuration and message origin
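A minimal sketch of the Kafka side of such a setup, assuming a local test broker and illustrative topic and group names, with the console consumer standing in for Logstash:

%> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 4 --topic weather-grib2
%> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic weather-grib2 --group weather-indexers --from-beginning

With four partitions, up to four consumers in the same group can share the work, while a consumer in a different group (or one replaying with --from-beginning) re-reads the retained messages independently.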

Key Facts of the Performance Test

Improved Disk Use Performance with Elasticsearch: Correctly tuning the weather data document mapping with the right index options helped us reduce the index storage size and the unit document size by 25%.

Graph: Improved Disk Use Performance with Elasticsearch
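One simple way to observe the effect of such mapping changes is to compare document count and store size before and after re-indexing with the cat indices API (the index name is a placeholder):

%> curl -XGET 'http://<es_url>:<es_port>/_cat/indices/weather?v&h=index,docs.count,store.size,pri.store.size'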

Increased Memory Stability with Elasticsearch: We found no memory leaks, even during heavy indexing. When indexing stopped, memory usage returned to a stable baseline.

Linear Scaling with Logstash: With correctly configured Kafka topic partitions, we tested the Logstash scaling option and found that ingestion scaled linearly: a single Logstash pipeline reached an ingest rate of 275,000 documents per minute, and adding a second pipeline doubled the throughput.

Geo Querying Performance with Elasticsearch: Filtering millions of documents by intersecting two complex geoshapes in two different indexes took only two seconds per request, which is a very satisfying response time.

Outlook: Providing a Robust yet Easy to Use Platform for Weather Data Access

The POC has demonstrated the feasibility of ingesting millions of records, combining them with geographical locations in a query, and sending back a ready-to-use data set to the end user, with great indexing and querying performance.

Our next challenge is to build a scalable (billions of records over multiple weather sources), easy to use, robust platform for grid experts to access weather data. The goal is to provide a powerful tool to discover, experiment with, and build new prediction models or improve the accuracy of existing ones. Eventually, these models will go into production to more accurately predict the effects of the weather on our transmission system assets and help us optimize their availability for the benefit of our customers and the community.

Using X-Pack Machine Learning and Alerting

The Elastic Stack combined with the X-Pack alerting and machine learning features can help us identify when weather prediction values, combined with other metrics, cause a line to deviate from its normal clearance range, and automatically alert on risks of thermal damage and line sag.
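As a rough illustration of the alerting side, here is a heavily simplified sketch of an X-Pack Watcher watch; the index name, field name, and threshold are hypothetical placeholders and do not represent an actual DLR model:

%> curl -XPUT 'http://<es_url>:<es_port>/_xpack/watcher/watch/line_sag_risk' -H 'Content-Type: application/json' -d '
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["dlr-predictions"],
        "body": { "query": { "range": { "predicted_clearance_m": { "lt": 8 } } } }
      }
    }
  },
  "condition": { "compare": { "ctx.payload.hits.total": { "gt": 0 } } },
  "actions": {
    "log_risk": {
      "logging": { "text": "{{ctx.payload.hits.total}} spans predicted below the clearance threshold" }
    }
  }
}'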


Akli RAHMOUN is a project manager, technical software architect, and developer at RTE.
He enjoys finding solutions that use technology to accelerate decision-making processes.

Tony BLANCHARD is an architect and trainer at ZENIKA. He has been working with the Elastic Stack since 2012 to answer architecture, technical, and prototyping questions.

Elastic Cloud Enterprise 1.1.1 released

We are happy to announce the availability of ECE 1.1.1, a maintenance release. Please see the release notes here or head straight to our downloads page to get it!

Bug fixes

ECE 1.1.1 corrects an important upgrade issue that we discovered and reported last week. Before this release, upgrading from 1.0.2 to 1.1.0 and subsequently restarting allocators or adding more capacity could lead to authentication errors in Kibana. This upgrade issue has now been fixed in 1.1.1.

In addition to this bug fix, we added a new configuration parameter, --timeout-factor, which allows you to control timeout values during installation.
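For example, on hosts where containers are slow to start, you could pass the new parameter to the installation script; the script name and value here are only an illustration, so please check the release notes for the exact semantics:

bash elastic-cloud-enterprise.sh install --timeout-factor 2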

Existing ECE installations can be live upgraded to version 1.1.1.

Elastic Cloud Enterprise 1.1.0 upgrade issues and workaround

Last week we released ECE 1.1.0 with new features and bug fixes. Since then, we’ve discovered a critical bug that leads to Kibana being unavailable post-upgrade. This bug only affects users who upgraded from 1.0.x versions of ECE to 1.1.0. 

Note: NEW installs of ECE 1.1.0 are not affected

Issue

If you have upgraded to ECE 1.1.0 and either of the following is true:

  • You have restarted allocators
  • You have increased capacity by adding hosts

then you will see Kibana authentication issues and unavailability. Monitoring will also be affected.

Immediate Workaround

As an immediate workaround, we’ve created a script that can fix this issue on your deployment, which is available for download. You can run this script anywhere that has access to the coordinator node on port 12400.

The script requires two pieces of information:

  1. NODE_IP which is the coordinator’s IP address
  2. AUTH which is the password for the admin console root user

Note: The software package jq is required on the host running this script for it to run properly.

Here’s what it looks like:

[user@localhost ~]$ curl -O https://download.elastic.co/cloud/fix-ece-1.1.0-upgrade.sh
[user@localhost ~]$ chmod a+x ./fix-ece-1.1.0-upgrade.sh
[user@localhost ~]$ ./fix-ece-1.1.0-upgrade.sh
Please enter IP address of the coordinator:<your ip here>
Please enter root password to admin console:<your password here>

The script will test connectivity, and if successful, make the changes. If not, it will let you know. After that has happened, it will prompt you to remove frc-services-forwarders-services-forwarder from every host. 

This is a manual step, which you can do with this command on every host running ECE:

docker rm -f frc-services-forwarders-services-forwarder

Once this is removed, ECE will automatically restart the above service with the patched configuration.

Official patched version

We are working on a 1.1.1 release that addresses this situation, targeting early next week. It will allow you to upgrade from 1.1.0 or any older version.

If you are planning to upgrade your existing deployment to 1.1.0, we strongly recommend holding off until we release 1.1.1.

We apologize for the inconvenience this may have caused and look forward to releasing 1.1.1 ASAP. Please contact us if you have any questions about this.

Why am I seeing bulk rejections in my Elasticsearch cluster?

Elasticsearch supports a wide range of use-cases across our user base, and more and more of these rely on fast indexing to quickly get large amounts of data into Elasticsearch. Even though Elasticsearch is fast and indexing performance is continually improved, it is still possible to overwhelm it. At that point you typically see parts of bulk requests getting rejected. In this blog post we will look at the causes and how to avoid them.

This is the second installment in a series of blog posts where we look at and discuss your common questions. The first installment discussed and provided guidelines around “How many shards one should aim to have in an Elasticsearch cluster?”

What happens when a bulk indexing request is sent to Elasticsearch?

Let’s start at the beginning and look at what happens behind the scenes when a bulk indexing request is sent to Elasticsearch.

When a bulk request arrives at a node in the cluster, it is, in its entirety, put on the bulk queue and processed by the threads in the bulk thread pool. The node that receives the request is referred to as the coordinating node as it manages the life of the request and assembles the response. This can be a node dedicated to just coordinating requests or one of the data nodes in the cluster.

A bulk request can contain documents destined for multiple indices and shards. The first processing step is therefore to split it up based on which shards the documents need to be routed to. Once this is done, each bulk sub-request is forwarded to the data node that holds the corresponding primary shard, and it is there enqueued on that node’s bulk queue. If there is no more space available on the queue, the coordinating node will be notified that the bulk sub-request has been rejected.

The bulk thread pool processes requests from the queue and documents are forwarded to replica shards as part of this processing. Once the sub-request has completed, a response is sent to the coordinating node.

Once all sub-requests have completed or been rejected, a response is created and returned to the client. It is possible, and even likely, that only a portion of the documents within a bulk request might have been rejected.

The reason Elasticsearch is designed with request queues of limited size is to protect the cluster from being overloaded, which increases stability and reliability. If there were no limits in place, clients could very easily bring a whole cluster down through bad or malicious behaviour. The limits that are in place have been set based on our extensive experience supporting Elasticsearch for different types of use-cases.

When using the HTTP interface, requests that result in at least a partial rejection will return response code 429, ‘Too Many Requests’. The same principle applies when the transport protocol is used, although the protocol and interface are naturally different. Applications and clients may report these errors back to the user in different ways, and some may even attempt to handle this automatically by retrying any rejected documents.
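To make this concrete, here is a sketch of a small bulk request and a partially rejected response; the index name, document bodies, IDs, and the abbreviated rejection reason are illustrative:

%> curl -XPOST 'http://<es_url>:<es_port>/_bulk' -H 'Content-Type: application/x-ndjson' -d '
{ "index": { "_index": "my-index", "_type": "doc", "_id": "1" } }
{ "field": "value 1" }
{ "index": { "_index": "my-index", "_type": "doc", "_id": "2" } }
{ "field": "value 2" }
'
{
  "took": 32,
  "errors": true,
  "items": [
    { "index": { "_index": "my-index", "_type": "doc", "_id": "1", "status": 201 } },
    { "index": { "_index": "my-index", "_type": "doc", "_id": "2", "status": 429,
        "error": { "type": "es_rejected_execution_exception",
                   "reason": "rejected execution ... [bulk, queue capacity = 200, ...]" } } }
  ]
}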

How can we test this in practice?

In order to illustrate the practical impact of this behaviour, we devised a simple test where we use our benchmarking tool Rally to run bulk indexing requests against a few Elastic Cloud clusters with varying numbers of data nodes. Configuration and instructions on how to run Rally are available in this gist.
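The gist contains the exact configuration; in general, running Rally against an existing remote cluster looks roughly like this (the host and track name are placeholders):

%> esrally --pipeline=benchmark-only --target-hosts=<es_host>:9200 --track=geonames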

The same indexing workload was run against three different Elastic Cloud clusters. We indexed with one replica shard configured wherever possible. The clusters consisted of one, two, and three data nodes respectively, with each data node having 8GB RAM (4GB heap for Elasticsearch, 4GB native memory). By invoking the GET /_nodes/thread_pool API, we could see that each data node by default had a fixed bulk thread pool size of two with a queue size of 200:

%> curl -XGET http://<es_url>:<es_port>/_nodes/thread_pool
  "bulk": {
    "type": "fixed",
    "min": 2,
    "max": 2,
    "queue_size": 200
  }

During the test we indexed into a varying number of shards (2, 4, 8, 16, and 32) using a varying number of concurrent clients (8, 16, 24, 32, 48, and 64) for each cluster. For every combination of shard and client count we indexed 6.4 million documents with a batch size of 100 documents and another 6.4 million documents with a batch size of 200 documents. This means that in total we attempted to index 384 million documents per cluster.

For this test we treat the clusters as a black box and perform the analysis from the client’s perspective. To limit the scope, we will also not look at the impact of various configurations on performance, as that is quite a large topic on its own.

All the generated, detailed metrics were sent to a separate Elastic Cloud instance for analysis using Kibana. For each request, Rally measures how many of the documents in the bulk request were rejected and how many were successful. Based on this data we can classify each request as successful, partially rejected, or fully rejected. A few requests also timed out, and these have been included for completeness.

Unlike Beats and Logstash, Rally does not retry failed indexing requests, so each run executed the same number of requests, but the final number of documents indexed varied from run to run depending on the volume of rejections.

How does bulk rejection frequency depend on shard count, client count, and data node count?

Bulk rejections occur when the bulk queues fill up. The number of queue slots that get used depends both on the number of concurrent requests and the number of shards being indexed into. To measure this correlation, we added a calculated metric, client shard concurrency, to each run. This is defined as the number of shards being indexed into multiplied by the number of concurrent indexing threads, and indicates how many queue slots would be needed to hold all bulk sub-requests. For example, 24 concurrent clients indexing into 8 shards gives a client shard concurrency of 192.

In the graph below, we show how the percentage of requests that result in partial or full rejections, depends on the client shard concurrency for the three different clusters.

Graph: Percentage of requests resulting in partial or full rejections as a function of client shard concurrency, per cluster size

For clusters with one or two nodes, we can see that bulk rejections start appearing when the client shard concurrency level is somewhere between 192 and 256. This makes sense, as each node has a bulk queue size of 200. The cluster with three nodes is able to handle an even higher level of client shard concurrency without any bulk rejections appearing.

Once we get over this limit, we start seeing partial bulk rejections, where at least one sub-request has managed to get queued and processed. A relatively small portion of requests also results in full rejections as the concurrency level increases, especially for the single node cluster.

When we compare the single and two node clusters, we can see that the percentage of fully successful requests increases slightly and that there are fewer full rejections. This is expected, as the total bulk queue across the cluster is twice as large and requests are sent to all data nodes. Even so, the two node cluster does not appear able to handle twice the client shard concurrency of the single node cluster. This is likely because the distribution of sub-requests is not perfect and because the introduction of replica shards has made each indexing operation require more work and therefore take longer. It is also important to note that all partial rejections are treated as equal in this graph. The number of rejected documents is not shown and does indeed vary depending on the cluster size, but we will shortly look at that in greater detail.

When we go to three data nodes, we see a more marked improvement: requests complete without any rejections even at high levels of concurrency, and full rejections appear only at the highest concurrency levels.

If we instead plot the average portion of rejected documents per request as a function of shard and client count for the three clusters, we get the following graphs.

Graph: Average portion of rejected documents per request as a function of shard and client count, per cluster size

Here we can see that the percentage of rejected events grows with increased concurrency levels for all cluster sizes. We can also see that rejection levels drop across the board as we add more data nodes, which is expected.

Earlier we saw that partial rejections started at approximately the same concurrency level for both the one and two node clusters. If we now look at these graphs, we can see that the portion of rejected documents grows faster for the single node cluster than for the one with two data nodes. This means that even though we saw a similar level of partially rejected requests, the larger cluster had more documents indexed per request.

Can’t I just get around this by increasing the bulk queue size?

One of the most common reactions when faced with bulk rejections is to increase the size of the bulk queue. Why not set it to a really large value so you do not have to worry about this again?

Increasing the size of the queue is not likely to improve the indexing performance or throughput of your cluster. Instead it would just make the cluster queue up more data in memory, which is likely to result in bulk requests taking longer to complete. The more bulk requests there are in the queue, the more precious heap space will be consumed. If the pressure on the heap gets too large, it can cause a lot of other performance problems and even cluster instability.

Adjusting the queue sizes is therefore strongly discouraged, as it is like putting a temporary band-aid on the problem rather than actually fixing the underlying issue. So what else can we do to improve the situation?

Can coordinating only nodes help?

By introducing coordinating only nodes, the data nodes will be able to focus on processing sub-requests, as the request itself will not take up a slot on their bulk queue. This is generally good, but the actual benefit of this arrangement is likely to vary from use-case to use-case. In many use cases it makes relatively little difference, and we see lots of successful indexing-heavy use cases that do not use dedicated coordinating nodes.

What conclusions can we draw?

As always, there is not necessarily any good one-size-fits-all solution, and the way to address bulk rejections will vary from use-case to use-case. If you see bulk rejections, try to understand why they are taking place and whether it is a single node or the whole cluster that is affected.
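A quick way to see whether rejections are concentrated on one node or spread across the cluster is the cat thread pool API, for example:

%> curl -XGET 'http://<es_url>:<es_port>/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected,completed'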

If the cluster is unable to cope with the load, ensure that all nodes are sharing the load evenly. If this does not help, it may be necessary to scale the cluster out or up. This will increase capacity and make it less likely that queues are filled up. Increasing the bulk queue size is only likely to postpone the problems, and may actually make them worse.

Also remember that a rejected request does not always mean that all of its documents were unsuccessful. Make sure you inspect the full response and retry the appropriate documents. Logstash and Beats already do this by default.

We hope this has given you a better understanding of how bulk rejections work. If you have any further questions, there are many ways to engage with us, including through our forum.