Borys Minaiev

Solving Jigsaw Puzzle with bare Rust

bminaiev — Wed, 30 Nov 2022 21:45:13 GMT

I participate in a lot of programming competitions and usually, if you show good results, organizers send you some prizes. Typically it is just t-shirts, but at some point you have too many of them, so you have to make a bed cover out of them. But sometimes prizes are more interesting. This time Google HashCode organizers sent a jigsaw puzzle to everyone who advanced to the finals. That was cool, so I decided to solve it.

It went pretty well until I got into trouble. I finished a big part of the puzzle, but all the remaining pieces were completely white. I divided all of them into groups by their shape, but for each specific position you still usually need to try all pieces from two or three groups. It was insanely hard for me to find each next correct piece, and it also didn't bring much joy. I found only several new pieces over the next couple of evenings and decided to stop. I was thinking "I am a software engineer! I shouldn't solve it by hand! I should write a program, which solves it!". At that point, I took a picture on my phone and put all the pieces back in the box, so I could use the table for other stuff.

Several months later...

The difference between the idea of writing a program for solving a puzzle, and actually writing a program, is quite big, but later I had some free time, so I really decided to do it. Back then I didn't actually know how hard it would be and if it was actually possible. But I had a rough plan.

Crop/reshape the picture. We want to be able to check if two pieces fit together, so we need to know their exact shapes. But when you take a picture, far objects are smaller in terms of pixels, so we need to fix it.
Detect pieces. We need to separate pieces from the background (and also from other pieces, if they are connected).
Detect borders. When you fit pieces, you actually don't care about the full piece, you only need to fit the borders.
Detect corners. This was an optional point of the plan. From a human standpoint, you think about the pieces as something, which has four sides, and you fit one side of the piece with another side of another piece. But from a computer standpoint, it can try all possible fitting positions.
Fit pairs of pieces. For each pair of pieces, for each pair of sides, I wanted to try to fit them, and calculate some score, which shows how well they fit.
Find the solution. If we know how well pieces fit, we can forget about shapes and think just about a graph problem, where for each piece we need to find the best position (and rotation) in a grid to maximize the overall fit function.
Show the solution. When positions in a grid are known, we need to come back to pieces with shapes and show them. Need to find the correct position and rotation for each piece, so they all fit together. It is also possible that one piece fits well with all the neighbors in terms of the score function, but there is no way to actually put it on a surface to achieve that score function with all neighbors at the same time.

Now, when I actually wrote down this plan, I realized how many details there are. But when I thought about it, it didn't look that scary.

To make things even more fun, I decided to not use any existing smart libraries for computer vision and other stuff, and just work with an image as with a two-dimensional array of (r, g, b) values.

Reshaping the picture

Luckily I had a square table in the photo, so I could use its corners as base points for reshaping. Obviously, there are some programs, which could reshape the picture automatically, but I only started working on this project, had a lot of energy, and decided to implement it myself (don't repeat my mistakes).

My idea was to make a GUI, which lets you pick four corners of a table with a mouse, and then do reshaping. I wanted to put corners of the table into corners of a new picture, and stretch everything else. It is easy to say "stretch", but how do you do this in terms of pixels?

Let's say we decided that the new picture will have a size 1000x1000. Then we can connect table corners via segments, split them into 1000 equal parts, and then connect points from opposite sides. Each quadrilateral will correspond to one pixel in a new picture. In the case of a 50x50 picture, it looks like this.

To calculate a color for each new pixel, we can iterate over all pixels inside the quadrilateral, and take an average. It is a little bit more complicated because some pixels could be partially inside several different quadrilaterals, but we can give them a weight proportional to the area of the intersection. We need to do some computational geometry, but in the end, we get a result, which seems pretty reasonable!

But...

This method works pretty well, except for one "but". It doesn't do what we need! From this transformation, we need one property. If we choose 4 points [p1, p2, p3, p4] and the distance (in real life) between p1 and p2 is the same as between p3 and p4, it should be the same in the generated picture. This property holds for corners of the table. But it doesn't work for other points.

Funny enough, I didn't realize the problem until I implemented all other parts of the algorithm and started testing it on real pieces, trying to fit pairs found by the algorithm, and didn't succeed.

The issue comes from the fact that we split segments into 1000 equal parts. If we take a segment and look at its midpoint (calculated in terms of pixels), it will not be at the same distance from the two ends in the physical world. It will be closer to the end that was closer to the camera when the photo was taken.

Ok, this method is bad, but how to reshape it correctly?

The transformation, which we need, is called homography, and I found a nice post, which explains it. OpenCV even has a function getPerspectiveTransform for it. But part of the challenge was to not use such libraries, so I implemented it myself.

Let me briefly explain the idea. We say that when we take a photo, all points are projected to some plane (by drawing a line between it and a camera, and looking where it intersects the plane). We don't know the actual camera location, but we want to estimate it based on the fact that we know the positions of specific points (in our case table corners). After that, we can take other pixels from the photo, and calculate their actual physical location.

We can express this transformation as multiplication by some 3x3 matrix with unknown coefficients. One of the coefficients could be explicitly set to 1.0, as we don't care about the scaling factor. We have 4 corners and for each of them we can write down 2 linear equations (for x and y coordinates). Overall 8 equations and 8 unknown variables, so we can find a unique solution via Gaussian elimination. I skipped some details, but it should be possible to find good explanations about it on the internet.
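To make the "8 equations, 8 unknowns" part concrete, here is a minimal sketch of the estimation step. It is not the code from the repository: the names `solve_linear`, `find_homography` and `apply_homography` are made up for this illustration, and it assumes that the bottom-right coefficient of the matrix is the one fixed to 1.0.

```rust
/// Solves the n x n system `a * x = b` by Gaussian elimination with partial
/// pivoting. Returns None if the matrix looks singular.
fn solve_linear(mut a: Vec<Vec<f64>>, mut b: Vec<f64>) -> Option<Vec<f64>> {
    let n = b.len();
    for col in 0..n {
        // Use the row with the largest absolute value in this column as pivot.
        let pivot = (col..n).max_by(|&i, &j| a[i][col].abs().total_cmp(&a[j][col].abs()))?;
        if a[pivot][col].abs() < 1e-12 {
            return None;
        }
        a.swap(col, pivot);
        b.swap(col, pivot);
        for row in 0..n {
            if row != col {
                let factor = a[row][col] / a[col][col];
                for k in col..n {
                    a[row][k] -= factor * a[col][k];
                }
                b[row] -= factor * b[col];
            }
        }
    }
    Some((0..n).map(|i| b[i] / a[i][i]).collect())
}

/// Finds a row-major 3x3 matrix h (with h[8] = 1) that maps src[i] to dst[i].
fn find_homography(src: [(f64, f64); 4], dst: [(f64, f64); 4]) -> Option<[f64; 9]> {
    let mut a = Vec::new();
    let mut b = Vec::new();
    for i in 0..4 {
        let (x, y) = src[i];
        let (u, v) = dst[i];
        // u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), multiplied out.
        a.push(vec![x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y]);
        b.push(u);
        // v = (h3*x + h4*y + h5) / (h6*x + h7*y + 1), multiplied out.
        a.push(vec![0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y]);
        b.push(v);
    }
    let h = solve_linear(a, b)?;
    let mut res = [1.0_f64; 9];
    res[..8].copy_from_slice(&h);
    Some(res)
}

/// Maps a pixel of the photo to its position in the reshaped picture.
fn apply_homography(h: &[f64; 9], (x, y): (f64, f64)) -> (f64, f64) {
    let w = h[6] * x + h[7] * y + h[8];
    ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)
}
```

For a table that looks like a trapezoid in the photo, the pixel midpoint of a side edge is mapped noticeably past the middle of the reshaped picture, which is exactly the effect the equal-parts method misses.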

Correctly reshaped picture (do you see the difference?):

Pieces detection

Great, we reshaped the picture, what is next? We need to separate pieces from the background. Again, there are probably some CV libraries, which can do it out of the box, but we want to do it ourselves.

The worst thing you can do when approaching problems like this is to start writing something smart and complicated. Usually, it will not work, and you will lose a lot of time. As in all other parts of this post, I suggest thinking about how you'd approach such a problem yourself before reading my solution.

As all the pieces I cared about were white, and the background was dark, I decided whether a pixel is part of a piece with a simple formula:

fn is_piece_color(color: Color32) -> bool {
    // The channels are u8, so widen them before summing to avoid overflow.
    color.r() as u32 + color.g() as u32 + color.b() as u32 >= 500
}

I got this result:

I joined all connected pixels into groups, tweaked a constant a little bit, and got this picture:

Obviously, it is not ideal. Some pieces were merged together and some didn't contain all the pixels, but it was good enough to start implementing the next stages of the algorithm and test things.
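Joining connected pixels into groups can be done with a plain breadth-first search. The sketch below assumes that the result of the color check is already stored in a 2D boolean mask and that pixels are connected through their four direct neighbours; `connected_components` is a name invented for this example.

```rust
use std::collections::VecDeque;

/// Labels 4-connected groups of `true` cells.
/// Returns (labels, number of groups); background cells get None.
fn connected_components(mask: &[Vec<bool>]) -> (Vec<Vec<Option<usize>>>, usize) {
    let h = mask.len();
    let w = if h == 0 { 0 } else { mask[0].len() };
    let mut labels: Vec<Vec<Option<usize>>> = vec![vec![None; w]; h];
    let mut count = 0usize;
    for sy in 0..h {
        for sx in 0..w {
            if !mask[sy][sx] || labels[sy][sx].is_some() {
                continue;
            }
            // Start a new group and flood it with BFS.
            labels[sy][sx] = Some(count);
            let mut queue = VecDeque::from([(sx, sy)]);
            while let Some((x, y)) = queue.pop_front() {
                // wrapping_sub turns "-1" into a huge value, which fails the bounds check.
                let neighbours = [
                    (x.wrapping_sub(1), y),
                    (x + 1, y),
                    (x, y.wrapping_sub(1)),
                    (x, y + 1),
                ];
                for (nx, ny) in neighbours {
                    if nx < w && ny < h && mask[ny][nx] && labels[ny][nx].is_none() {
                        labels[ny][nx] = Some(count);
                        queue.push_back((nx, ny));
                    }
                }
            }
            count += 1;
        }
    }
    (labels, count)
}
```

Groups that are too small (noise) or too big (several merged pieces) can then be filtered by their pixel count.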

Borders detection

We want to split a set of pixels from one piece into inner points and the border. It is very easy. We say the pixel is on the border if there is another pixel near it, which doesn't belong to this piece. After implementing this check, I got a picture:
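As a small sketch of this check, assume one piece is stored as a set of integer pixel coordinates and "near" means the eight surrounding pixels. Both choices, as well as the name `border_pixels`, are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Returns the pixels of `piece` that have at least one of the 8 surrounding
/// pixels outside of the piece. The result is sorted to make it deterministic.
fn border_pixels(piece: &HashSet<(i32, i32)>) -> Vec<(i32, i32)> {
    let mut border: Vec<(i32, i32)> = piece
        .iter()
        .copied()
        .filter(|&(x, y)| {
            (-1..=1).any(|dx| (-1..=1).any(|dy| !piece.contains(&(x + dx, y + dy))))
        })
        .collect();
    border.sort();
    border
}
```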

When we know a set of pixels lying on the border, we need to put them in some reasonable order. Intuitively we just want to go around the piece, and write down all the pixels, but in reality, it was hard to come up with some reasonable algorithm, which does it. We need this order to be able to match different pieces together. If we have it, we can traverse two borders at the same time, and check that pixels are not too far from each other.

More formally we want to find such an order of pixels that the distance between each pair of neighboring (in that order) pixels is quite small (e.g. less than 3 pixels). Finding such a permutation for a general graph is NP-hard, but we can somehow use special properties of our graph to simplify the task.

I implemented some greedy algorithm, which just always went to the closest not visited pixel. And then applied some local optimizations to improve generated permutation. Mostly it found good solutions:

But sometimes something didn't go well:

Again, we don't need a 100% working solution (at least in the testing stage), we can improve it later if it becomes the bottleneck. Btw, if somebody knows a good algorithm for this task — please let me know :)
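The greedy part of this ordering could look roughly like the sketch below. It only covers the "always go to the closest not visited pixel" step, without the local optimizations, and uses a quadratic scan for simplicity. The name `order_border_greedy` is invented for this example.

```rust
/// Orders border pixels by repeatedly jumping to the closest unvisited one,
/// starting from the first pixel of the input.
fn order_border_greedy(border: &[(i32, i32)]) -> Vec<(i32, i32)> {
    if border.is_empty() {
        return Vec::new();
    }
    // Squared Euclidean distance is enough for comparisons.
    let dist2 = |a: (i32, i32), b: (i32, i32)| (a.0 - b.0).pow(2) + (a.1 - b.1).pow(2);
    let mut used = vec![false; border.len()];
    let mut order = vec![border[0]];
    used[0] = true;
    while order.len() < border.len() {
        let last = *order.last().unwrap();
        // Linear scan over unvisited pixels, so the whole loop is O(n^2).
        let next = (0..border.len())
            .filter(|&i| !used[i])
            .min_by_key(|&i| dist2(last, border[i]))
            .unwrap();
        used[next] = true;
        order.push(border[next]);
    }
    order
}
```

On a clean border this walks around the piece, but on a noisy one it can skip a few pixels and come back to them at the very end, which is the kind of failure mentioned above.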

Corners detection

Great, we detected a border of a piece, and put all pixels of it in, for example, clockwise order. Now we want to split this border into four parts, where each part corresponds to one side of a piece. I hope intuitively it should be easy to understand what side is, but it is hard (at least for me) to formally define it.

Why do we actually need to detect separate sides?

First, it makes matching easier. When we try to fit one piece with another, we only need to try 4x4=16 possible ways.
Second, it makes the later graph task more discrete. For example, we can say that we need to put pieces in a grid, and for each piece choose one of 4 possible rotations. After that, we know which pieces are connected and by which sides.

So, how to detect corners? Again, I encourage you to think about this problem before reading my solution!

How are corners different from other points of the border? One difference is that if you look at several pixels before the corner, and several pixels after, the direction changes a lot. But it is also true for some other points... Also, it is hard to express "changes a lot" in the algorithm. You will probably need to choose some constants like how many pixels to take before and after, and what angle is big enough. And maybe for different pieces, you will need to use different constants.

Another algorithm, which I tried, worked like this. Let's first calculate the center of mass of the piece (you can actually see the blue dot in the pictures above). Then for each pixel from the border calculate the distance to the center. Then we say the pixel is probably a corner if it is further from the center than the pixels nearby. It already works pretty well:
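A sketch of this candidate search, assuming the border is already ordered and the center is given. How many neighbours to look at is a tunable constant, called `window` here; that parameter and the function name are assumptions made for the example.

```rust
/// Returns indices of border points that are strictly further from `center`
/// than every other border point within `window` steps along the cyclic border.
fn corner_candidates(border: &[(f64, f64)], center: (f64, f64), window: usize) -> Vec<usize> {
    let n = border.len();
    let dist: Vec<f64> = border
        .iter()
        .map(|p| (p.0 - center.0).hypot(p.1 - center.1))
        .collect();
    (0..n)
        .filter(|&i| {
            (1..=window).all(|d| {
                // The border is a closed loop, so indices wrap around.
                let before = dist[(i + n - d % n) % n];
                let after = dist[(i + d) % n];
                dist[i] > before && dist[i] > after
            })
        })
        .collect()
}
```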

This algorithm detects all corners, but sometimes also some additional points (not too many of them, which is good). We can iterate over all possible quadruples of detected points and choose the one that is best by some metric. We only need to come up with a good metric. Intuitively, corners should form a rectangle, so we can choose the most "rectangular" four points. I did two things:

For every three consecutive points, I calculated the angle between them and compared it with 90 degrees.
Divided the longest distance between consecutive points by the shortest distance. The idea is to exclude options, where we have two very close points.

Then I multiplied two values and found the minimum score. Probably there are better/easier ways of doing this, but I just played with formulas until they worked well on real examples:
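One possible way to turn these two ideas into a number is sketched below. The exact formula from the project is not given here, so the combination used in `rectangle_score` (one plus the summed deviation from 90 degrees, multiplied by the side ratio) is only an assumption, as are the function names.

```rust
/// Lower is better; a perfect square scores (about) 1.0.
/// `q` must list the four points in order along the border.
fn rectangle_score(q: [(f64, f64); 4]) -> f64 {
    let mut angle_penalty = 0.0_f64;
    let mut sides = [0.0_f64; 4];
    for i in 0..4 {
        let (prev, cur, next) = (q[(i + 3) % 4], q[i], q[(i + 1) % 4]);
        let (ax, ay) = (prev.0 - cur.0, prev.1 - cur.1);
        let (bx, by) = (next.0 - cur.0, next.1 - cur.1);
        // Angle at `cur` between the directions to its two neighbours.
        let angle = (ax * by - ay * bx).atan2(ax * bx + ay * by).abs().to_degrees();
        angle_penalty += (angle - 90.0).abs();
        sides[i] = bx.hypot(by);
    }
    let longest = sides.iter().cloned().fold(f64::MIN, f64::max);
    let shortest = sides.iter().cloned().fold(f64::MAX, f64::min);
    (1.0 + angle_penalty) * (longest / shortest)
}

/// Tries every quadruple of candidates (kept in border order) and returns
/// the indices of the most "rectangular" one.
fn best_four(points: &[(f64, f64)]) -> Option<[usize; 4]> {
    let n = points.len();
    let mut best: Option<(f64, [usize; 4])> = None;
    for a in 0..n {
        for b in a + 1..n {
            for c in b + 1..n {
                for d in c + 1..n {
                    let score = rectangle_score([points[a], points[b], points[c], points[d]]);
                    if best.map_or(true, |(s, _)| score < s) {
                        best = Some((score, [a, b, c, d]));
                    }
                }
            }
        }
    }
    best.map(|(_, idx)| idx)
}
```

Since the candidate search returns only a handful of points per piece, trying all quadruples is cheap.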

Fitting borders

Okay, as we detected the sides of each piece, now it should be pretty straightforward to say if two pieces fit or not. We just put them nearby, iterate over pixels from both sides at the same time, and check that the i-th pixel from one border is near the i-th pixel of the border of another piece.

Wait, put them nearby? How?

We need to apply some transformation (rotation and shift) to one piece such that the metric we try to improve (sum of distances for corresponding pixels) is the best. How to find such transformation? We can try to fit the corners of one piece with corners from another piece and hope other points will also fit nicely.

There was also a problem connected to the fact that we don't detect pieces on the photo ideally. The actual border of the piece is not very white on a photo, so we don't mark it as part of the piece, and our pieces are smaller than in reality. And when we try to fit two pieces, which should fit in real life, borders don't match exactly. I first tried to fix it by applying an additional shift and saying the corresponding pixels should have a distance similar to this shift:

But it didn't work very well. Some pairs of pieces that shouldn't fit at all still got good scores: even after the shift their borders overlapped a lot, so the distance between corresponding pixels was negative, but equal to the shift by absolute value. I only knew the absolute value of the distance, and I didn't come up with a way to distinguish a good positive distance from a very bad negative one.

A good fix, which helped, was to just enlarge each piece by two pixels, and then try to match borders exactly, without a shift.

Back to the trick, where we chose needed transformation based on the corners. Sometimes we don't detect corners precisely (maybe off by a couple of pixels). But when we use a slightly incorrectly detected corner, it changes the transformation quite a bit, and we receive not-so-good results:

Obviously, we can rotate the bottom piece a little bit to make it fit more nicely, but how do we tell that to the algorithm?

I estimated the transformation by corners first and then tried to change it a little bit with local optimizations. For example, we can calculate the score, shift the piece a little bit in a random direction, and check if the score improved. If not, we shift the piece back, but if it improves, we try to do it again. And we can also try to rotate the piece in the same way. After applying this technique I got a good result:
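Here is a simplified sketch of such a refinement loop. To keep it deterministic it tries a fixed set of small moves (left, right, up, down, rotate both ways) instead of random directions, and halves the step when nothing helps. The `Transform` type, the step sizes and the score (sum of distances between corresponding points) are assumptions made for this example.

```rust
#[derive(Clone, Copy)]
struct Transform {
    dx: f64,
    dy: f64,
    angle: f64,
}

impl Transform {
    /// Rotates the point around the origin and then shifts it.
    fn apply(&self, (x, y): (f64, f64)) -> (f64, f64) {
        let (s, c) = self.angle.sin_cos();
        (x * c - y * s + self.dx, x * s + y * c + self.dy)
    }
}

/// Sum of distances between transformed `moving` points and `fixed` points.
fn score(t: &Transform, moving: &[(f64, f64)], fixed: &[(f64, f64)]) -> f64 {
    moving
        .iter()
        .zip(fixed)
        .map(|(&p, &q)| {
            let (x, y) = t.apply(p);
            (x - q.0).hypot(y - q.1)
        })
        .sum()
}

/// Keeps any small move that lowers the score; shrinks the step otherwise.
fn refine(mut t: Transform, moving: &[(f64, f64)], fixed: &[(f64, f64)]) -> Transform {
    let moves: [(f64, f64, f64); 6] = [
        (1.0, 0.0, 0.0),
        (-1.0, 0.0, 0.0),
        (0.0, 1.0, 0.0),
        (0.0, -1.0, 0.0),
        (0.0, 0.0, 0.01),
        (0.0, 0.0, -0.01),
    ];
    let mut best = score(&t, moving, fixed);
    let mut step = 1.0_f64;
    // The iteration cap is only a safety net.
    for _ in 0..10_000 {
        if step < 1e-4 {
            break;
        }
        let mut improved = false;
        for &(mx, my, ma) in moves.iter() {
            let candidate = Transform {
                dx: t.dx + mx * step,
                dy: t.dy + my * step,
                angle: t.angle + ma * step,
            };
            let s = score(&candidate, moving, fixed);
            if s < best {
                best = s;
                t = candidate;
                improved = true;
            }
        }
        if !improved {
            step /= 2.0;
        }
    }
    t
}
```

The starting `Transform` would come from matching the detected corners, so the loop only has to correct an error of a few pixels.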

The last part

Great, we are almost there! Congratulations if you read till this point!

If for each pair of pieces we can calculate how well they fit, we can build a graph on it. And apply some algorithms to it to find the best possible overall fitting. I didn't know a good algorithm, which solves it perfectly, so I wrote a very easy greedy approach. Just try to fit pieces with the best possible scores, until you connect all of them into one figure.

Obviously, it is not perfect, but after all the work done before that moment, I wanted to see some real result, even if it will not give the absolutely correct result. And what do you think I got?

Well, this looks very incorrect! But it also looks a little bit promising. You can see that some pairs of pieces actually fit pretty nicely.

Anyway, this post is already pretty long, and there is a lot more to cover in this story, so I decided to split it into two parts. See you in part 2!

If you want some spoilers, you can check the GitHub repository with the source code and images: https://github.com/BorysMinaiev/jigsaw-puzzle-solver

If you know Russian and liked this post, consider subscribing to my Telegram channel: https://t.me/bminaiev_blog

Online Point Location Algorithm

bminaiev — Mon, 10 Oct 2022 14:05:46 GMT

Round 3 of Meta Hacker Cup 2022 finished a couple of days ago. Today I'll talk about the hardest problem from that round, which was solved only by 3 people. The problem is quite classical, so basically after reading the problem statement, experienced competitive programmers already know what they need to write, but the amount of the code is pretty big, and almost nobody likes geometry problems, so this probably explains why there are so few accepted solutions.

Picture from the initial problem statement

Let me describe the main part of the problem. You are given n polygons, which don't intersect (or even touch) each other, but one polygon could be fully enclosed in another. You need to answer q queries one by one. Each query is one point, and you need to find the smallest (by area) polygon (out of those given in the initial input), which contains this point. The total number of vertices in all polygons is around one million, and the total number of queries is also around one million. So you can't just check all pairs of queries and polygons.

The classical algorithm for this task is doing a sweep-line by X coordinate and maintaining a set of all edges, which contain the current X coordinate. The set itself should be sorted by Y coordinate. You also need to store all the revisions of this set, so when you receive a query (x, y), you can find a revision for a specific x, and find a lower bound for y in the set. Because you want to efficiently store revisions, you can't just use a built-in std::set, you need to use a persistent treap or a similar data structure. You can find the details of this algorithm (a simplified offline version of it) in cp-algorithms. But even without reading this article, you probably already understood why there were so few accepted solutions to the problem from Hacker Cup.

I knew this algorithm before the round, but I don't like writing treaps during the contests, so I didn't manage to implement it in time (thankfully, solving all other problems was enough to advance to the Finals). After the round, I talked to Pavel, and he showed me another algorithm, which could be used for this problem. I didn't know it before, it has worse time complexity (O(log^2 n) per query, O(n log n) memory), but I think it is easier to implement. I don't think it is very well known. The only place, I could find, where it is mentioned, is in this paper, so I decided to share it here.

Main idea

Note. We say our polygons don't include the boundary, so if the query point lies exactly on the boundary of some polygon, we return the parent of that polygon. It is possible to include boundaries, but it requires some care.

So we have a point, and we want to find in which polygon it is located. Let's draw a ray from a point to the left, and find the first edge, which it intersects. There could be two cases:

Our point is inside the polygon, which contains this edge.
Our point is just outside of the polygon, which contains this edge. In this case, our point is inside the "parent" polygon of the one, which we just found.

How do we determine which case we are in? Let's say we stored the vertices of our polygons in counter-clockwise order. Then if the edge goes from top to bottom, the point is inside the polygon. If it goes from bottom to top, the point is outside of it.

We ignore edges, which have the same y coordinate of ends.
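Before making it fast, the idea can be checked with a brute-force version that looks at every edge. The sketch below is only a reference implementation under simplifying assumptions: it uses floating point for the intersection, does not handle points lying exactly on an edge, and does not try to resolve ties when the ray passes exactly through a vertex. The name `first_edge_to_left` and the tuple-based `Point` are invented for this example.

```rust
type Point = (i64, i64);

/// Finds the first non-horizontal edge hit by a ray going left from `p`.
/// Returns the id of the polygon owning that edge and whether `p` is inside
/// that polygon. Polygons must be given in counter-clockwise order.
fn first_edge_to_left(polygons: &[Vec<Point>], p: Point) -> Option<(usize, bool)> {
    let mut best: Option<(f64, usize, bool)> = None;
    for (id, poly) in polygons.iter().enumerate() {
        for i in 0..poly.len() {
            let (fr, to) = (poly[i], poly[(i + 1) % poly.len()]);
            let (lo, hi) = if fr.1 < to.1 { (fr, to) } else { (to, fr) };
            // Skip horizontal edges and edges that don't cross the line y = p.y.
            if lo.1 == hi.1 || p.1 < lo.1 || p.1 >= hi.1 {
                continue;
            }
            let t = (p.1 - lo.1) as f64 / (hi.1 - lo.1) as f64;
            let x = lo.0 as f64 + t * (hi.0 - lo.0) as f64;
            if x < p.0 as f64 && best.map_or(true, |(bx, _, _)| x > bx) {
                // An edge going from top to bottom has the interior on its right.
                best = Some((x, id, fr.1 > to.1));
            }
        }
    }
    best.map(|(_, id, inside)| (id, inside))
}
```

When the second component is false, the answer is the parent of the returned polygon. A slow version like this is also handy for stress-testing the fast one.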

How do we efficiently find the first edge to the left of the point? Let's first discuss how to efficiently find all edges that intersect the line y=C. We can maintain a segment tree by Y coordinate. Each node of the segment tree corresponds to some segment min_y..max_y of y coordinates. In the node, we store a vector of all edges that fully cover this range of y coordinates. Each edge is added to O(log n) nodes, so overall O(n log n) memory is used.

Segment Tree

Here is a basic type for storing edges.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Segment {
    fr: Point,
    to: Point,
    polygon_id: usize,
}

impl Segment {
    pub fn get_lower_higher(&self) -> (Point, Point) {
        if self.fr.y < self.to.y {
            (self.fr, self.to)
        } else {
            (self.to, self.fr)
        }
    }
}

And let's start implementing the main PointLocation structure. The y coordinates of the points could be arbitrarily large, so we need to compress coordinates (and store them in the all_y field).

pub struct PointLocation {
    all_y: Vec<i64>, // sorted, deduplicated y coordinates (i64 is assumed here)
    tree_nodes: Vec<Vec<Segment>>,
    ...
}

impl PointLocation {
    // vertices should be specified in ccw order
    pub fn new(polygons: &[Vec<Point>]) -> Self {
        let mut all_y: Vec<_> = polygons
            .iter()
            .flat_map(|poly| poly.iter().map(|p| p.y))
            .collect();
        all_y.sort();
        all_y.dedup();
        let tree_nodes_cnt = all_y.len().next_power_of_two() * 2;
        let mut res = Self {
            all_y,
            tree_nodes: vec![vec![]; tree_nodes_cnt],
            ...
        };
        for (polygon_id, polygon) in polygons.iter().enumerate() {
            for i in 0..polygon.len() {
                let segment = Segment {
                    fr: polygon[i],
                    to: polygon[if i + 1 == polygon.len() { 0 } else { i + 1 }],
                    polygon_id,
                };
                res.add_segment(0, 0, res.all_y.len() - 1, &segment);
            }
        }
        ...
        res
    }

    fn add_segment(&mut self, tree_v: usize, l: usize, r: usize, segment: &Segment) {
        let min_y = self.all_y[l];
        let max_y = self.all_y[r];
        let (lower, higher) = segment.get_lower_higher();
        if lower.y <= min_y && higher.y >= max_y {
            self.tree_nodes[tree_v].push(segment.clone());
        } else if lower.y >= max_y || higher.y <= min_y {
            return;
        } else {
            let m = (l + r) >> 1;
            self.add_segment(tree_v * 2 + 1, l, m, segment);
            self.add_segment(tree_v * 2 + 2, m, r, segment);
        }
    }
}

To iterate over all edges, which intersect some y=C, we need to just recursively go down by segment tree as we usually do, and list all edges in each node, which we visit.

Comparator

Instead of iterating over all edges, which intersect y=C, we want to efficiently find the closest one to the left. We can sort all edges in each node of the segment tree by the X coordinate, and then do a binary search to find the closest one. If we do that, the overall complexity for each query will be O(log^2 n), as we need to run a binary search on O(log n) segment tree nodes. How to do a binary search? We just want to know, if the point is to the left or to the right of the edge. This is done by a simple vector (cross) product:

impl Segment {
    pub fn cmp_p(&self, p: Point) -> Ordering {
        let (lower, higher) = self.get_lower_higher();
        if p.y < lower.y || p.y > higher.y {
            return Ordering::Equal;
        }
        Point::vect_mul(&lower, &higher, &p).cmp(&0)
    }
}

There is a special case for a point with too big or too small y coordinate. During the binary search, it should never happen as we only do it over the edges, which intersect our y coordinate. But it will help us later.
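The `Point` type and `Point::vect_mul` are not shown in this post. A minimal version that is consistent with how `cmp_p` uses them could look like the sketch below; the i64 coordinates and the free function `cmp_edge_point` are assumptions made for this example.

```rust
use std::cmp::Ordering;

// Deriving Ord compares by x first and then by y.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Point {
    x: i64,
    y: i64,
}

impl Point {
    /// Cross product of vectors (b - a) and (c - a): positive if `c` lies to
    /// the left of the directed line a -> b, negative if it lies to the right,
    /// zero if all three points are on one line.
    /// i64 is enough while coordinate differences stay below 2^31.
    fn vect_mul(a: &Point, b: &Point, c: &Point) -> i64 {
        (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)
    }
}

/// Less means that `p` is to the right of the edge (the edge is to the left of `p`).
fn cmp_edge_point(lower: Point, higher: Point, p: Point) -> Ordering {
    Point::vect_mul(&lower, &higher, &p).cmp(&0)
}
```

With this sign convention, "the edge is to the left of the point" corresponds to `Ordering::Less`, which is the condition the binary search looks for.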

To do a binary search, we should sort all edges inside each node of the segment tree. But what comparator should we use? They should be sorted in the order of increasing X coordinates, but it is a little bit tricky. We have an invariant that each edge stored in the node covers at least segment y_min..y_max of that node. For each edge, we can compute an x coordinate of an intersection of this edge and y_min and then sort by that x.

Comparing segments is not as easy as you can think...

But computing this x coordinate requires floating point arithmetic, which should be avoided whenever possible. So instead we can write comparator differently. When we want to compare segment (p1, p2) with (p3, p4), it is always possible to find one point of [p1, p2, p3, p4], such that the y coordinate of that point is covered by another segment. When we found this point, we can compare it with another segment using the same comparator as in the binary search.

It is possible to prove that this comparator and the floating point one return the same result.

The final version of the comparator looks like this:

impl Ord for Segment {
    fn cmp(&self, other: &Self) -> Ordering {
        self.cmp_p(other.fr)
            .then_with(|| self.cmp_p(other.to))
            .then_with(|| other.cmp_p(self.fr).reverse())
            .then_with(|| other.cmp_p(self.to).reverse())
    }
}

then_with tries the next possibility if the previous comparator returned Ordering::Equal. It is the moment when our special case from cmp_p helps.

Point location

The only thing left is to implement the searching logic. But it is very similar to regular segment tree implementation. I implemented it without recursion just to speed it up a little bit.

pub fn locate_point(&self, p: Point) -> Option<usize> {
    let mut segment: Option<Segment> = None;
    let mut tree_v = 0;
    let (mut l, mut r) = (0, self.all_y.len() - 1);
    loop {
        let min_y = self.all_y[l];
        let max_y = self.all_y[r];
        if p.y < min_y || p.y > max_y {
            break;
        }
        if let Some(idx) = binary_search_last_true(0..self.tree_nodes[tree_v].len(), |i| {
            self.tree_nodes[tree_v][i].cmp_p(p) == Ordering::Less
        }) {
            let new_segment = self.tree_nodes[tree_v][idx];
            if segment.is_none() || segment.unwrap().cmp(&new_segment) == Ordering::Less {
                segment = Some(new_segment);
            }
        }
        if l + 1 < r {
            let m = (l + r) >> 1;
            let mid_y = self.all_y[m];
            if p.y < mid_y {
                tree_v = tree_v * 2 + 1;
                r = m;
            } else {
                tree_v = tree_v * 2 + 2;
                l = m;
            }
        } else {
            break;
        }
    }
    segment.and_then(|segment| {
        if segment.fr.y < segment.to.y {
            self.parents[segment.polygon_id]
        } else {
            Some(segment.polygon_id)
        }
    })
}
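The helper `binary_search_last_true` is not defined in this post. A possible implementation with the same call shape is sketched below; it assumes that the predicate is true on a prefix of the range and false on the rest.

```rust
use std::ops::Range;

/// Returns the largest index in `range` for which `pred` holds, or None if it
/// holds nowhere. `pred` must be true on a (possibly empty) prefix of the
/// range and false afterwards.
fn binary_search_last_true<F>(range: Range<usize>, mut pred: F) -> Option<usize>
where
    F: FnMut(usize) -> bool,
{
    let (mut lo, mut hi) = (range.start, range.end);
    // Invariant: pred holds on range.start..lo and fails on hi..range.end.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if pred(mid) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if lo == range.start {
        None
    } else {
        Some(lo - 1)
    }
}
```

The prefix property holds here because the edges in each node are sorted from left to right, so all edges to the left of the point come first.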

One thing, which I haven't covered is self.parents. For each i, it stores the smallest polygon self.parents[i], which contains polygon i. How do we compute it? It is very easy. We can just run the locate_point function on the leftmost point of that polygon. The only tricky moment is that we need to call it in the correct order to make sure that we already computed parents for all polygons to the left of it.

impl PointLocation {
    // vertices should be specified in ccw order
    pub fn new(polygons: &[Vec<Point>]) -> Self {
        ...
        for node in res.tree_nodes.iter_mut() {
            node.sort();
        }
        let mut polygons_left_points: Vec<_> = polygons
            .iter()
            .enumerate()
            .map(|(id, poly)| (poly.iter().min().unwrap(), id))
            .collect();
        polygons_left_points.sort();
        for (&p, polygon_id) in polygons_left_points.into_iter() {
            res.parents[polygon_id] = res.locate_point(p);
        }
        res
    }
}

Conclusion

I think this algorithm is easier to implement than a treap-based solution. The only tricky moment is comparators. But they are basically the same for both solutions. Everything else is just a regular segment tree code. The full code for the HackerCup problem could be found here.

It has O(n log^2 n) complexity, but in fact, it is pretty fast. My code solves the full dataset of the HackerCup problem under 5s on my laptop even in a single thread.

If you read till this point and know Russian, consider subscribing to my Telegram channel with similar posts: https://t.me/bminaiev_blog

#cp

#geometry

ICFPC 2022

bminaiev — Wed, 07 Sep 2022 20:21:35 GMT

For the third year in a row, I participated in the ICFP Contest as a member of RGBTeam with Roman Udovichenko and Gennady Korotkevich. We won it last year and hopefully did a good job this year as well. The final results are not published yet, but we were in the first place when the scoreboard was frozen (2 hours before the end of the contest). Obviously, some teams could have submitted better solutions during the freeze, so the final results could be a little different.

Frozen scoreboard (2 hours before the end)

One of the differences between the ICFP Contest and other competitions is that the problem is prepared by different people every year, so you get a unique experience each time. This year's task was quite straightforward (say, compared to two years ago), but still quite interesting and challenging. So thanks a lot to the organizers for preparing it!

Problem

The detailed problem statement could be found here, but let me briefly describe it. You were given a picture, which you need to draw by doing some operations. Initially, you start from one empty rectangle (actually, that was not true for some of the tasks, but more on that later), and can do some operations:

Split a rectangle into two by a line, which is parallel to the X or Y axis.
Split a rectangle into four by two lines parallel to the X and Y axis.
Merge two adjacent rectangles. It is only allowed if the merged figure is still a rectangle.
Fully color one rectangle.
Swap two rectangles. It is only allowed if both rectangles have the same size.

When you merge or split rectangles, previous rectangles are destroyed and you can't refer to them anymore.

Each type of operation has some cost associated with it. Also, the cost of the operation is divided by the area of the rectangle it is applied to (in the case of a merge operation — the area of the biggest rectangle). So coloring the whole canvas is very cheap (because we divide by the big area), and coloring one pixel is very-very expensive.

This cost function seems controversial, but probably the idea was to force participants to use big rectangles to create pictures, which are still similar to a very detailed picture.

After you have done all operations, the generated picture is compared to the target one pixel by pixel. The difference between two pixels is calculated as the Euclidean distance between their RGB values.
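In code, the per-pixel difference described here could be written as in the sketch below. It covers only the plain Euclidean distance over three channels; any extra channels or scaling factors from the official statement are left out, and the function names are made up for this example.

```rust
/// Euclidean distance between two RGB colors.
fn pixel_distance(a: [u8; 3], b: [u8; 3]) -> f64 {
    let mut sum = 0.0_f64;
    for c in 0..3 {
        let d = a[c] as f64 - b[c] as f64;
        sum += d * d;
    }
    sum.sqrt()
}

/// Sum of per-pixel distances between two pictures of the same size,
/// both stored as flat slices of pixels.
fn similarity(generated: &[[u8; 3]], target: &[[u8; 3]]) -> f64 {
    generated
        .iter()
        .zip(target)
        .map(|(&g, &t)| pixel_distance(g, t))
        .sum()
}
```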

To calculate the final score, you need to sum the cost of all operations and the differences between all pixels. A lower score is better. Here is an example of the target picture and picture produced by some operations with a total score of 17139.

Task 15. Score = 17139

In total there were 40 different tasks, and the overall score is just a sum of scores for each task.

Initial approach

This problem could be solved by dynamic programming. For each state (x_left, y_bottom, width, height) we can calculate the optimal scoring function. Transitions are simple:

Fully color this rectangle.
Split rectangle by line into two. In this case, you need to calculate the sum of scores for two smaller dp states and the cost of the split operation.

There are O(N^4) states in this dp, where N is the width/height of the picture (which was 400). When we fully color the rectangle, we need to calculate the sum of distances for all pixels in this rectangle, which can be done in O(N^2). So naively this algorithm requires O(N^4) memory and does O(N^6) operations, which are quite big numbers for N=400.
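Here is a toy version of this dp, just to show the shape of the recursion. It makes several simplifying assumptions that are not part of the contest rules: the picture is grayscale, a rectangle is always painted with its average value, and an operation costs a hypothetical base cost divided by the rectangle area. `RectDp` and its fields are names invented for this example.

```rust
use std::collections::HashMap;

struct RectDp<'a> {
    target: &'a [Vec<f64>], // target[row][column], grayscale for brevity
    color_base: f64,        // hypothetical base cost of a color operation
    split_base: f64,        // hypothetical base cost of a split operation
    memo: HashMap<(usize, usize, usize, usize), f64>,
}

impl<'a> RectDp<'a> {
    /// Best score for the rectangle with corner (x, y), width w and height h.
    fn solve(&mut self, x: usize, y: usize, w: usize, h: usize) -> f64 {
        if let Some(&known) = self.memo.get(&(x, y, w, h)) {
            return known;
        }
        let area = (w * h) as f64;
        // Transition 1: paint the whole rectangle with its average value.
        let mut sum = 0.0_f64;
        for r in y..y + h {
            for c in x..x + w {
                sum += self.target[r][c];
            }
        }
        let avg = sum / area;
        let mut diff = 0.0_f64;
        for r in y..y + h {
            for c in x..x + w {
                diff += (self.target[r][c] - avg).abs();
            }
        }
        let mut best = self.color_base / area + diff;
        // Transition 2: split by a vertical line after k columns...
        for k in 1..w {
            let cand = self.split_base / area
                + self.solve(x, y, k, h)
                + self.solve(x + k, y, w - k, h);
            best = best.min(cand);
        }
        // ...or by a horizontal line after k rows.
        for k in 1..h {
            let cand = self.split_base / area
                + self.solve(x, y, w, k)
                + self.solve(x, y + k, w, h - k);
            best = best.min(cand);
        }
        self.memo.insert((x, y, w, h), best);
        best
    }
}
```

Even this toy version shows the problem: the memo table has one entry per (x, y, w, h), and every entry scans its whole rectangle.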

But we can optimize this algorithm by reducing the N. Instead of running it on the initial picture, we can first split the picture into blocks of size 10x10, for each block calculate the average color, and then run the algorithm on the picture of size 40x40. Of course, it will not find the optimal answer, and will only draw rectangles with coordinates, which are multiple of 10, but at least it could finish in a reasonable amount of time.
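The block averaging step is simple. A sketch, assuming the picture is stored as rows of RGB values in f64 and that leftover rows and columns that don't fill a whole block are dropped (`downsample` is a name chosen for this example):

```rust
/// Replaces every `block` x `block` square of pixels by its average color.
fn downsample(image: &[Vec<[f64; 3]>], block: usize) -> Vec<Vec<[f64; 3]>> {
    let h = image.len() / block;
    let w = if h == 0 { 0 } else { image[0].len() / block };
    let mut res: Vec<Vec<[f64; 3]>> = vec![vec![[0.0; 3]; w]; h];
    for by in 0..h {
        for bx in 0..w {
            for y in by * block..(by + 1) * block {
                for x in bx * block..(bx + 1) * block {
                    for c in 0..3 {
                        res[by][bx][c] += image[y][x][c];
                    }
                }
            }
            for c in 0..3 {
                res[by][bx][c] /= (block * block) as f64;
            }
        }
    }
    res
}
```

With block = 10 a 400x400 picture becomes 40x40, and every coordinate found by the dp has to be multiplied by 10 when the operations are written out.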

And if you are ready to wait, you can use blocks of smaller size instead of 10x10. You can also optimize performance a little bit by not calculating scores for some transitions/states. For example, if you have considered splitting the rectangle into two and know the best possible score, and now trying to check if just coloring the whole rectangle could lead to a better score. You can start iterating through all pixels in this rectangle and calculate the sum of distances with the target picture. But if the sum already exceeds the best score from splitting, you can stop there, and not consider this move. Or you can do some kind of estimations like "consider 10% of random pixels, approximate the score, only fully recalculate if it is good enough". Such optimizations could speed things up a little bit, but not dramatically (you still can't run dp on the whole 400x400 picture).
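The early-exit trick for the coloring transition could look like this sketch. It assumes the pixels of the rectangle are given as a flat slice and that `limit` is the best score already known from splitting; `color_cost_bounded` is a name invented for this example.

```rust
/// Sums the distances between `color` and every pixel, but gives up and
/// returns None as soon as the partial sum exceeds `limit`.
fn color_cost_bounded(pixels: &[[f64; 3]], color: [f64; 3], limit: f64) -> Option<f64> {
    let mut total = 0.0_f64;
    for px in pixels {
        let d2: f64 = (0..3).map(|c| (px[c] - color[c]).powi(2)).sum();
        total += d2.sqrt();
        if total > limit {
            // Painting can no longer beat the best split, stop early.
            return None;
        }
    }
    Some(total)
}
```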

Scoring function

As I already mentioned, the cost of an operation is calculated by a slightly odd formula: you pay an amount proportional to the inverse of the rectangle's area. Let's say we just want to draw a single pixel (x, y) in the middle of the canvas. The naive way to do this looks like this:

Split the canvas by (x, y).
Split the top-right part by (x+1, y+1).
Pixel (x, y) is a separate rectangle now, so we can color it.

The third operation costs a lot as the area of the colored rectangle is just 1.

Instead, we can have a longer sequence of moves, but which doesn't require operations with small rectangles:

Split by (x, y).
Fully color the top-right rectangle.
Split the top-right rectangle by (x+1).
Color the right part of it white.
Split the left part by (y+1).
Color the top part of it white.

There is a problem with the second approach. If something was already drawn in the top-right corner, it will be cleared by our operations. So we can't use it naively to draw all rectangles. But we can always sort all rectangles we want to draw in a way that drawing each new rectangle doesn't cause any problems to all the previous ones.

In this approach, we don't even need to do operations 3-6. We can always assume there will be some later rectangle, which will recolor the top-right corner correctly.

To summarize, our algorithm will look like this:

Split the whole picture into rectangles, where each rectangle will be colored into one color.
Sort rectangles by some magic comparator.
For each rectangle in order, color it and the area to the top-right of it into one color. This is done by splitting the whole canvas by the left-bottom corner of the rectangle, coloring one part, and merging all parts into one big rectangle back.

Local optimizations

ICFP Contest is usually 72 hours long, and the first 24 hours are usually called lightning division. There is a special "prize" for being the first on the leaderboard after the first 24 hours, so you want to have some working solution for this moment.

After the first ~20 hours, we had a solution similar to what is described above. It used some dynamic programming, and the trick to color rectangles. It gave us a pretty decent score, but still quite far from first place.

As I already wrote in a previous post (sorry, in Russian), local optimizations are a really powerful technique. It is quite common in marathon-style contests, that you generate some reasonable solution first, and then try to iteratively change it a little bit to improve the score.

To be able to do local optimizations, we need to represent our solution in a way that small changes to the solution don't lead to a completely different resulting picture. For example, if we represent the solution as just a sequence of applied operations and try to add/delete some operations, it will not work very well as inserting one operation could completely change the meaning of the next operations.

Instead, we can represent our solution as just a list of bottom-left corners of rectangles with their colors. This list is sorted in an order in which rectangles are drawn.

We can start by picking a random corner from the list and moving it by 1 in a random direction. If we get a solution with a better score, we leave the corner in its new place; otherwise, we move it back. We repeat this operation in a loop while the score improves.
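A minimal sketch of this loop is below. Everything here is illustrative: score stands in for "render the solution and compare it with the target", next_random is any caller-supplied random source, and colors are left out to keep the example short.

```rust
/// Repeatedly moves one corner by one pixel and keeps the move only if the
/// score gets strictly better (lower). `score` and `next_random` are
/// hypothetical parameters supplied by the caller.
fn hill_climb(
    corners: &mut Vec<(i32, i32)>,
    size: i32,
    iters: usize,
    mut next_random: impl FnMut() -> u64,
    score: impl Fn(&[(i32, i32)]) -> f64,
) -> f64 {
    const DIRS: [(i32, i32); 4] = [(1, 0), (-1, 0), (0, 1), (0, -1)];
    let mut best = score(corners);
    if corners.is_empty() {
        return best;
    }
    let n = corners.len() as u64;
    for _ in 0..iters {
        // One random number picks both the corner and the direction.
        let r = next_random();
        let i = (r % n) as usize;
        let (dx, dy) = DIRS[((r / n) % 4) as usize];
        let old = corners[i];
        let moved = (old.0 + dx, old.1 + dy);
        if moved.0 < 0 || moved.1 < 0 || moved.0 >= size || moved.1 >= size {
            continue; // stay inside the canvas
        }
        corners[i] = moved;
        let cur = score(corners);
        if cur < best {
            best = cur; // keep the improvement
        } else {
            corners[i] = old; // otherwise move the corner back
        }
    }
    best
}
```

In a real solver most of the time goes into score, so it pays to recompute only the part of the picture affected by the moved corner.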

This alone works pretty well. One of the reasons is explained by how we build our initial solution. We used a dp, which doesn't have one-pixel precision, instead, all coordinates are rounded to the closest block size. So if the target picture contains some "border" between two objects, our dp probably guessed this border incorrectly by a couple of pixels.

We can also try to delete existing corners or add new ones in our local optimizations. But adding new corners doesn't work well, because we need to guess a good position for a corner, and the probability of it is pretty low.

Color picking

Another optimization that we made before the 24-hour mark is picking a better color for each "color" operation. Initially in dynamic programming, when we want to fully color a rectangle, we just used an average of all target pixels lying in that rectangle. But this is not optimal.

Let's consider a simple case of the rectangle consisting of one pixel with color (255, 0, 0) and 10 pixels with color (0, 0, 0). For such a case, we can only consider the first coordinate (red) as others are equal to zero. The weighted average of the red component is (255 + 0 * 10) / 11 = 23. And the cost of this rectangle is 10 * 23 + 1 * (255 - 23) = 462.

Instead, we can just use color (0, 0, 0) for this rectangle. And the score for this color will be 10 * 0 + 1 * 255 = 255. Which is much better than 462!

Finding the best color in the general case is actually a known problem called Geometric median. Wiki suggests there is a Weiszfeld's algorithm, which can be used for solving it. In the end, some of our solutions used it, but we started with a simpler algorithm, which is Local optimizations (again).

Let's start with a weighted average color (which is a good approximation to a correct answer), and then try to change one of the (red, green, blue) components by one in one of two directions. After changing, we recalculate the cost, and, if it is better than before, continue with a new color. If we tried all 6 possible changes and none of them led to a better answer, we stop.
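Here is a small sketch of that search, assuming the cost of a color is the sum of Euclidean distances to the target pixels. The names cost and pick_color are made up for this example.

```rust
/// Sum of Euclidean distances from `color` to every pixel
/// (the distance metric is an assumption of this sketch).
fn cost(pixels: &[[i32; 3]], color: [i32; 3]) -> f64 {
    pixels
        .iter()
        .map(|p| {
            let sq: i32 = (0..3).map(|c| (p[c] - color[c]).pow(2)).sum();
            (sq as f64).sqrt()
        })
        .sum()
}

/// Starts from the per-channel average and keeps stepping to one of the
/// 6 neighbouring colors while that strictly lowers the cost.
/// Assumes `pixels` is non-empty.
fn pick_color(pixels: &[[i32; 3]]) -> [i32; 3] {
    let n = pixels.len() as i32;
    let mut color = [0i32; 3];
    for c in 0..3 {
        color[c] = pixels.iter().map(|p| p[c]).sum::<i32>() / n;
    }
    let mut best = cost(pixels, color);
    loop {
        let mut improved = false;
        for c in 0..3 {
            for delta in [-1, 1] {
                let mut next = color;
                next[c] += delta;
                if next[c] < 0 || next[c] > 255 {
                    continue; // stay inside the valid channel range
                }
                let cur = cost(pixels, next);
                if cur < best {
                    best = cur;
                    color = next;
                    improved = true;
                }
            }
        }
        if !improved {
            return color; // none of the 6 changes helped
        }
    }
}
```

On the example above (one red pixel and ten black ones) the search starts at (23, 0, 0) with cost 462 and walks down the red channel to (0, 0, 0) with cost 255.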

This algorithm is very easy to code, but actually works pretty fast and, I believe, always finds the best color.

Local optimization improvements

To show how well local optimizations work, let's just look at one example. Here is a target picture for test 21:

Target picture

Here is a solution generated by dynamic programming with block_size = 6. Score = 22087.

Dynamic programming. Score = 22087.

And here is the same solution after local optimizations. The score is 14653.

After local optimizations. Score = 14653.

Maybe the difference in the generated pictures is not that big, but the difference in scores is really huge! The best answer found for this test during the whole competition is 12567, which is not that far from the 14653 found by 10 seconds of local optimizations.

The results of the lightning division are currently not published yet, but I am pretty confident that we won it by a huge margin (thanks again to local optimizations!).

Merging

After the lightning division finished, there was a small modification to the problem statement, and new tests were added. The difference was that in new tests the initial canvas is not empty. Instead, it was split into squares of different colors.

Example of a new task (32)

I am not sure what the expected way of handling the new cases was, but when we saw this modification, we almost instantly decided that we just needed to merge all squares into one big rectangle, clear it, and draw the picture as we did in the previous tests.

Merging all squares into one seemed like not a very hard task. We can just merge each line separately, and then merge lines together. So we coded this pretty fast and got some scores.

We also wrote a tool to show the top scores for each test. It looked like this:

I should also mention that all new tests had target pictures that exactly matched the target pictures of previously known tests. For example, the target picture for test 32, which is shown above, is exactly the same as for test 9. At some point, we noticed that team CowDay had scores on the new tasks that were much lower than what we spent on merging the squares alone (not counting the additional cost of actually drawing the picture after merging). There were two possible explanations:

They do not merge squares, and instead, use them to draw a target picture somehow.
They learned how to pay less for merging.

We checked the difference between their scores for new tests and corresponding old tests. And for both tests where the initial canvas is split into 20x20 squares, the difference was 19877, which could not be a coincidence. So probably they merged squares more efficiently (we spent 28451 on merging), but how?

After quite some time of thinking, we realized that when merging squares together, it is sometimes important to use the "cut" operation!

As I already mentioned, the cost of operations is proportional to the inverse of the rectangle area. So we want to reduce the number of operations with small rectangles. When we naively merge each line separately, we start each line by merging two small squares, which costs a lot. Instead, we can do something like:

Merge the first 7 rows separately.
Merge them into one big 7x20 block.
Split this 7x20 block into 20 blocks of size 7x1.
For each 7x1 block add 13 more cells, so now we have 20 columns of size 20x1.
Merge all 20 columns together.

This is how it works for a 16x16 case.

This way requires more operations than the naive one, but it only uses 7 very costly operations instead of 20. The cost of this way is 19872, which almost matched the difference the CowDay team had in their scores, so it looked like we had found the same way as them.

After some time we noticed that team Unagi does something even more clever with the cost of 17571.

In the end, we realized we can generalize our solution to use several steps:

Build first A rows.
Build first B columns.
Build next C rows.
Build next D columns.
...

Values (A, B, C, D, ...) could be computed with dynamic programming. The state of the dynamic programming is (we built X first rows, and Y first columns, which are represented by two rectangles). There are two ways of representing a state as a union of two rectangles, so we also need to add an additional bit, which indicates it, to the state. And transitions are just adding some amount of new rows or columns.

Our final method for 16x16 looked like this:

Generating this sequence of operations looks like a nightmare, and I am very happy to have teammates who did this instead of me!

More optimizations

We also tried a number of other optimizations. Some of them improved the score, but not very dramatically.

We used Simulated annealing instead of local optimizations in the end.
We tried 4 possible rotations of the picture. In our solution, the top-right corner of the picture is very special, and drawing rectangles there cost a lot. So if we rotate a picture such that the top-right corner doesn't contain complicated objects, it costs less.
We tried to use the swap operation to move complicated parts of the picture far from the top-right corner. We did this mostly manually and improved the scores on a couple of tests. But based on other teams' writeups, we underestimated the importance of this operation, and it could be used much more efficiently.
We had a really nice visualization tool, which also supported some manual modifications of the solutions.

Infrastructure

It is not uncommon in the ICFP Contest to use a lot of external computational resources: most algorithms of this kind, simulated annealing for example, find better solutions when given more time.

But we decided to just use our 3 laptops to run our code. I think it forces you to write smarter algorithms, not rely on big servers :)

In terms of programming languages, I used Rust and my teammates used C++. We also had some tools written in Python.

Final thoughts

I enjoyed this year's contest. There were a couple of technical issues during the contest, but they were all resolved. The problem was a little bit straightforward (e.g. our solutions didn't change much after the 24h mark), but still pretty interesting. Thanks a lot to the organizers!

Looking forward to the ICFPC 2023!

#icfpc

#contests

O(N^2). Part 2: prefetching

bminaiev — Mon, 22 Aug 2022 14:25:53 GMT

I continue the series on getting AC with simple solutions whose asymptotic complexity is much worse than what the problem authors intended.

This time the problem is CF814F. Its statement is fairly simple. There is an array a of 10^5 numbers, each at most 2⋅10^4. We need to answer 10^5 queries (l, r). For each query we need to count the numbers from 1 to 10^5 that are coprime with a[l] * a[l + 1] * ... * a[r].

The intended solution is some kind of sqrt decomposition with complicated asymptotics. We, on the other hand, will write a simple solution that runs in O(N^2).

First idea

For each number x from 1 to 2⋅10^4, compute a bitmask mask[x] of length 10^5 in which the i-th bit is set if gcd(x, i) != 1. Then, to answer a query (l, r), we compute mask[a[l]] | mask[a[l + 1]] | ... | mask[a[r]] and count the set bits in the resulting mask. Those are the numbers that are not coprime with the product, so the answer is 10^5 minus that count.
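As a rough illustration of this first idea (not the code from the actual submission), the sketch below packs each mask into u64 words and answers one query by OR-ing masks together. The names build_mask and count_coprime are hypothetical, and the masks are rebuilt on the fly rather than precomputed.

```rust
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Bitset over 0..=limit packed into u64 words;
/// bit i (for i >= 1) is set when gcd(x, i) != 1.
fn build_mask(x: usize, limit: usize) -> Vec<u64> {
    let mut mask = vec![0u64; limit / 64 + 1];
    for i in 1..=limit {
        if gcd(x, i) != 1 {
            mask[i / 64] |= 1u64 << (i % 64);
        }
    }
    mask
}

/// Number of i in 1..=limit that are coprime with every value in `values`,
/// and therefore with their product.
fn count_coprime(values: &[usize], limit: usize) -> usize {
    let mut acc = vec![0u64; limit / 64 + 1];
    for &v in values {
        // OR the mask of each value into the accumulator.
        for (a, m) in acc.iter_mut().zip(build_mask(v, limit)) {
            *a |= m;
        }
    }
    let not_coprime: usize = acc.iter().map(|w| w.count_ones() as usize).sum();
    limit - not_coprime
}
```

For example, count_coprime(&[2, 3], 10) counts 1, 5 and 7.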

In exactly this form the solution is O(N^3) and will naturally be slow. One could add divide & conquer and a few optimizations for small primes and get AC, but we will take a different route.

Second idea

The mask idea is neat, but the asymptotics are too high. Instead we can use a sweep line. We iterate over the left border l of the queries from largest to smallest, and for each number x we maintain first[x]: the smallest position >= l such that gcd(x, a[first[x]]) != 1. Then, to answer a query (l, r), we count the indices x with first[x] <= r. This is a simple loop that optimizes well and runs fast.

All that remains is to update the first array quickly when the pointer l moves. We just look at mask[a[l]] and, for every index x where the bit is set, assign first[x] = l.

mask[x]

Let's discuss how to store mask[x]. First, it takes quite a lot of memory. The array has length 2⋅10^4, and each element needs 10^5 bits. In total that is already about 250 MB, while the memory limit in the problem is 256 MB.

Second, if we want to update first quickly, reading the i-th bit of mask[x] must be very fast (and use operations that can be sped up with simd). In the simplest case, reading the i-th bit of a bitset requires bit shifts, which are unlikely to get optimized.

There are in fact simd instructions that can set specific elements of one array from another based on the bits set in a mask. But for that the compiler has to figure out how to transform bits from the bitset into the format these instructions need. If you know how to help the compiler do this without writing the instructions by hand, tell me :)

If instead we store mask[x] as a plain array of bool/u8, the code that updates first vectorizes well. But then it takes 8 times more memory and no longer fits into 256 MB.

Let's not store all elements of mask, only the first several (for example 2400, which is 2400 * 10^5 bytes = 240 MB, so we still fit into the ML). If we need some mask[a * b], we can simply compute mask[a] | mask[b].

How do we find, for a specific x, a decomposition x = a * b such that a and b are not too large? Look at the largest prime divisor of x (call it p). If we have a precomputed mask for both p and x/p, all is well. Otherwise p is very large, and mask[p] consists only of bits of the form p*i, so such a mask can be rebuilt from scratch in time proportional to N/p.
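The main loop further down relies on a largest_prime table. One possible way to fill it is a simple sieve, sketched here. The function name largest_prime_table is an assumption, not taken from the original solution.

```rust
/// largest_prime[x] = largest prime divisor of x (0 for x < 2).
fn largest_prime_table(limit: usize) -> Vec<usize> {
    let mut largest = vec![0usize; limit + 1];
    for p in 2..=limit {
        if largest[p] == 0 {
            // No smaller prime has marked p, so p itself is prime.
            for multiple in (p..=limit).step_by(p) {
                // Primes are visited in increasing order, so the last
                // write to each cell comes from its largest prime divisor.
                largest[multiple] = p;
            }
        }
    }
    largest
}
```

With limit = 2⋅10^4 the table is tiny compared to the masks and takes negligible time to build.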

Implementation

As last time, we will move the hot pieces of code into separate functions and enable simd optimizations for them. For example, the function that updates the first array from a mask looks like this:

#[target_feature(enable = "avx2")]
unsafe fn update_first(first: &mut [u32], mask: &[bool], new_value: u32) {
    for i in 0..first.len() {
        if mask[i] {
            first[i] = new_value;
        }
    }
}

Computing the answer looks like this:

#[target_feature(enable = "avx2")]
unsafe fn calc_res(first: &[u32], r: u32) -> i32 {
    first.iter().map(|x| (*x <= r) as i32).sum()
}

And the main part of the solution looks like this:

for l in (0..n).rev() {
    let cur = a[l];

    unsafe {
        let l = l as u32;
        if cur < mask.len() {
            update_first(&mut first, &mask[cur], l)
        } else {
            let p = largest_prime[cur];

            if p < mask.len() {
                update_first(&mut first, &mask[cur / p], l);
                update_first(&mut first, &mask[p], l);
            } else {
                update_first(&mut first, &mask[cur / p], l);
                for i in (p..first.len()).step_by(p) {
                    first[i] = l;
                }
            }
        }
    }

    for query in queries[l].iter() {
        res[query.id] = max_v as i32 - unsafe { calc_res(&first, query.r as u32) };
    }
}

The first array has type [u32] rather than [usize], because usize is 64 bits, so fewer elements fit into one simd register and the loops run slower. That is why we have to add casts from one type to the other :(

Unfortunately, in exactly this form the solution is slow. On a random test it runs in 1.7 s locally and in 4.4 s on the Codeforces servers (the TL in the problem is 3 s).

Profiling

Why is there such a big difference between the running time locally and on CF?

If we could run arbitrary programs on the CF servers, we could try perf or some other tool to see what exactly is slow. But we can't.

Instead, let's just try to minimize an example that shows the problem. We measure separately how long update_first and calc_res take. We will measure speed with another small function:

#[inline(never)]
fn measure<F>(f: &mut F)
where
    F: FnMut(),
{
    let start = Instant::now();
    const MAX_MILLIS: u128 = 1000;
    let mut iters = 0;
    while start.elapsed().as_millis() < MAX_MILLIS {
        iters += 1;
        f();
    }
    println!(
        "{} iters, av.time: {}mcs",
        iters,
        start.elapsed().as_secs_f64() / (iters as f64) * 1000.0 * 1000.0
    );
}

Let's generate random arrays and see how long the functions take:

pub fn main() {
    let mut rnd = Random::new(787788);
    const N: usize = 100_000;
    let mut first = vec![0; N];
    let mask = gen_vec(N, |_| rnd.gen_bool());
    measure(&mut || unsafe { update_first(&mut first, &mask, 123) });
}

Result:

70355 iters, av.time: 14.213780456257549mcs

Similarly for calc_res:

pub fn main() {
    let mut rnd = Random::new(787788);
    const N: usize = 100_000;
    let first = gen_vec(N, |_| rnd.gen_u32(100));
    let mut hash = 0;
    measure(&mut || unsafe {
        hash += calc_res(&first, 50);
    });
}

In this case we also accumulate a number hash, so that the compiler cannot simply skip the code inside calc_res.

The result is:

115493 iters, av.time: 8.658555583455275mcs

In the worst case our solution makes 10^5 iterations, each with 2 calls to update_first and 1 call to calc_res, which in theory should take 10^5 * (2 * 14.2 + 8.65) mcs = 3.7 s, and even less on random data. Yet the run on CF for some reason took 4.4 s.

The secret is fairly simple: in the small test we run update_first on the same mask array every time, while on real data we take a different one each time, which most likely is not in the cache.

To check this hypothesis, we can rewrite the test a bit:

pub fn main() {
    let mut rnd = Random::new(787788);
    const N: usize = 100_000;
    let mut mask = vec![vec![false; N]; 2400];
    for it in 0..mask.len() {
        for i in 0..N {
            mask[it][i] = rnd.gen_bool();
        }
    }
    let mut first = vec![0; N];
    measure(&mut || unsafe { update_first(&mut first, &mask[rnd.gen_usize(mask.len())], 123) });
}

This version runs roughly 1.75 times slower:

40216 iters, av.time: 24.866071638153972mcs

Improving

In theory, when a program accesses consecutive array elements, the cpu hardware prefetcher should notice the pattern and load the next elements into the cache before they are needed. But in this case, for some reason, that doesn't happen.

But we can load the needed elements into the cache by hand. Let's write a small helper function:

fn prefetch(ptr: *const i8) {
    unsafe {
        core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T0 }>(ptr);
    }
}

The _mm_prefetch function takes not only an address but also the cache level to load the data into. We will load into all levels at once.

Let's rewrite update_first using prefetch. The simplest version can just prefetch the data located at some offset from the elements currently being processed:

#[target_feature(enable = "avx2")]
unsafe fn update_first(first: &mut [u32], mask: &[bool], new_value: u32) {
    const SHIFT: usize = 1024;
    for i in 0..first.len() {
        if mask[i] {
            first[i] = new_value;
        }
        prefetch(mask.as_ptr().add(i + SHIFT) as *const i8);
    }
}

But this version works much worse than the original. At the very least, that is because the update of first used to vectorize well, while prefetch is now issued for every single byte. That is actually unnecessary, since the processor loads a whole 64-byte cache line at once. So it is enough to call it only every 64th iteration.

To keep the loop well vectorized, we can split it into blocks. After processing each block, we prefetch the next block into the cache.

#[target_feature(enable = "avx2")]
unsafe fn update_first(first: &mut [u32], mask: &[bool], new_value: u32) {
    let n = first.len();
    let mask = &mask[..n];
    const SHIFT: usize = 1024;
    for start in (0..n).step_by(SHIFT) {
        for i in start..min(start + SHIFT, n) {
            if mask[i] {
                first[i] = new_value;
            }
        }
        for it in 0..SHIFT / 64 {
            prefetch(mask.as_ptr().add(start + SHIFT + it * 64) as *const i8);
        }
    }
}

This code is already faster:

51547 iters, av.time: 19.399798669175702mcs

But for some reason it is still slower than when the same mask array is used every time :(

But even this (plus a few other small optimizations) is enough to get AC on the current tests!

#rust

#simd

#cp

#prefetch

Google AI4Code, or my first time on Kaggle

bminaiev — Thu, 11 Aug 2022 21:55:16 GMT

Yesterday a three-month machine learning competition from Google finally ended. The task was to write a program that understands the relationship between Python code and the comments to it. The input to your program is a Python notebook consisting of code cells and text cells. The code cells are given in the same order as in the original notebook, while the text cells are randomly shuffled. The goal of your program is to restore the original order as accurately as possible.

Before this competition I had no real machine learning experience, and I thought it was a great chance to try something new. In the end it turned out to be fairly fun and interesting, but very exhausting (you want a better result, nothing works, and you have to try lots of things).

Testing happens in two stages. During the competition itself you can submit your code and learn how many points it scores on a hidden dataset. The hidden dataset was collected from publicly available Kaggle notebooks, so in theory you could train your model on them and get a perfect result. That is why, over the next three months, the organizers will collect a new dataset and retest all solutions on it.

So the official results are still three months away, but in the preliminary ones I am in 25th place out of 1000+ teams. On the one hand, that's pretty good for a first time. On the other hand, it could of course have been much better, and judging by the leaderboard the solution can clearly be improved a lot.

What to write about?

Quite a lot of interesting stories happened over the past three months, and fitting them all into one post clearly won't work (or nobody will read it to the end). So I will try to cover some part in this post, and then maybe write a few separate stories (leave likes and tell me what you'd like to read about!).

Bad setup

I spent a very large amount of time simply on making development convenient. Kaggle lets you create jupyter notebooks and even run them for free. They even let you use their gpus (30+ hours per week). That's neat, because you can start doing something right away, but in reality it is terribly inconvenient.

First, everything is slow. You run a notebook cell and wait a second for it to execute. On the one hand you can put up with it, but on the other it really slows the process down significantly.
Second, once the notebook grows beyond about 20 cells, it becomes impossible to use. You start running cells in the wrong order, some invariants of global variables break, and then you debug for a very long time.
Third, there is no proper way to save intermediate data somewhere on disk. There is some temporary storage while the notebook is running, but if you want to reuse the data you have to download/save it somewhere separately. By default a notebook can run for at most 9 hours, and then it shuts down.
Fourth, there is no proper way to version and store code. Say I have code that generates a model and code that evaluates it. On Kaggle these must be two separate notebooks so that the second can use the model produced by the first. But if I want to reuse part of the code (for example, the part that reads data or computes statistics), I have to copy it.
Fifth, the gpus that Kaggle provides for free, P100, are not all that great.

All in all, if I could travel in time and give myself one piece of advice, it would clearly be "don't use Kaggle to run code".

Better setup?

I hardly arrived at the ideal setup, but in the end it looked like this:

All the code lives in a github repository.

Shared code lives in .py files.
Each separate experiment has its own .ipynb file (preferably with no more than ~10 cells).
To import .py files from an .ipynb, you can use these magic commands:

%load_ext autoreload
%autoreload 2

I mostly edited and wrote new code locally (the latency is much better than on Kaggle!).
To run the code I rented a server at https://jarvislabs.ai/

A server with an A6000 gpu (which has 48 GB of memory instead of 16 GB on Kaggle) costs ~1$/hour. In total I spent around 200$, but I wasn't really trying to save money.
They have persistent storage, so if you turn the server off and then on again, the data is still there.
You can log in to the server over ssh. It turns out ssh has a wonderful -A flag, which forwards your local ssh agent to the server you connect to. For example, if you want to clone a private github repository on the server but don't want to add a key from that server to your github profile, -A is exactly what you need.

When you do need to transfer something large to or from the server, you can use scp, but it is slow (apparently because the servers are somewhere in India), whereas the Google Drive CLI works well. The security of this solution is somewhat questionable, though.
I didn't figure out a decent way to submit my solution to Kaggle. You have to submit a single notebook, while in the repository the code is spread across several files. If you naively concatenate them into one file, you can end up with functions that share a name or with redundant imports. In the end I fixed all of that by hand, but there is most likely a better way.

Plots!

In general I really love visualizing data and believe it is usually the path to success.

This time I used https://wandb.ai/ and it is really cool! It makes it very easy to log data during experiments and then visualize it and build dashboards.

For example, the plots with the results of different models looked something like this:

It is actually interactive and you can make sense of it, honest!

If you zoom in, it looks like this:

Each line on this plot is a separate model under test, and the value is the model's average score on X notebooks. We would get the "real" score of a model by looking at the value at X=+infinity, but testing takes quite long, so we want to be able to tell sooner whether one model turned out better than another.

Looking at such a plot you can see that testing on, say, only 100 notebooks works poorly and the data is too noisy. But after 300 notebooks, if one model shows a better result than another, the real result will most likely be the same.

Wandb lets you build plots interactively, i.e. while the program is running. This saves a lot of time when you accidentally launch something wrong and can notice right away that the score is far too low even on the first notebooks.

And sometimes you can notice that the plot of one model exactly matches the plot of another and realize that something clearly went wrong in the experiment (for example, we are for some reason testing a different model).

In the next episode?

What the final solution was.
How I visualized data and got nothing useful out of it :(
How many times I spent ages hunting bugs because I am used to normal programming languages rather than Python.
How I tried to do something more complicated, but nothing worked :(

#kaggle

If you've read this far, subscribe to my channel https://t.me/bminaiev_blog

Generating polyominoes

bminaiev — Thu, 28 Jul 2022 17:31:08 GMT

In a recent Open Cup, problem J required, as a subtask, generating all possible side-connected cell figures of at most four cells. During the contest I never came up with a way to write this code such that I would actually want to write it (and could finish it in 5-10 minutes).

After the contest I seem to have realized that it's not that scary, so let's discuss the options.

Hardcode

Back during the contest I thought that there aren't that many different figures of 4 cells, so one could just draw them all on paper and then put a constant array into the code. Most likely such a solution would even be the shortest. But it has an obvious drawback: you can accidentally forget a figure.

As an exercise I later tried to do this. If you're not too lazy, you can test your attentiveness and figure out whether I forgot something :)

Maybe better with code?

The most basic solution that came to my mind looked like this. Any figure of 4 cells must fit into a 4x4 square. So we can:

iterate over all possible colorings of a 4x4 square (there are only 2^16 of them)
check that the figure inside consists of at most 4 cells
check that the figure is connected

The problem with this method is that we generate the same figure several times. For example, the 1x4 rectangle is generated 4 times (once as each separate row).

To get rid of this problem, we can bring each figure to a canonical form (for example, by shifting it left and up as far as possible) and put all figures into a hash table.

Another way is to check that the generated figure has at least one filled cell in the first row and in the first column. Then we can do without the hash table.

A few implementation details. We store the filled cells as a single int. Cell (row, col) is considered filled if the (row*4+col)-th bit of this int is set.
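With this encoding, the canonical form mentioned above (shift left and up as far as possible) takes only a few bit operations. The sketch below is an illustration for the fixed 4x4 case. The name normalize is hypothetical.

```rust
const SIDE: u32 = 4;

/// Shifts a figure stored as a 16-bit mask (bit row*4+col) up and left
/// until it touches both the first row and the first column.
fn normalize(mut mask: u32) -> u32 {
    if mask == 0 {
        return 0; // an empty figure has nothing to shift
    }
    let first_row = (1u32 << SIDE) - 1; // bits 0..=3
    let first_col = 0x1111u32; // bits 0, 4, 8, 12
    while (mask & first_row) == 0 {
        mask >>= SIDE; // move every cell one row up
    }
    while (mask & first_col) == 0 {
        // Column 0 is empty, so no bit can wrap into the previous row.
        mask >>= 1; // move every cell one column left
    }
    mask
}
```

Two masks describe the same figure (up to translation) exactly when their normalized values are equal, so the normalized mask can serve as the hash table key.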

We also assume that we already have a Dsu data structure that maintains a collection of disjoint sets. Dsu has an operation unite that takes two elements. If they are already in the same set, it returns false; otherwise it merges the sets and returns true.
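For completeness, a minimal Dsu with exactly this interface could look like the sketch below. It is not the library version I actually use, and it only does path compression, which is plenty for 16 elements.

```rust
/// Minimal disjoint-set union: path compression only, no union by rank.
struct Dsu {
    parent: Vec<usize>,
}

impl Dsu {
    /// Creates n singleton sets {0}, {1}, ..., {n - 1}.
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Returns the representative of the set containing v.
    fn find(&mut self, v: usize) -> usize {
        if self.parent[v] != v {
            let p = self.parent[v];
            let root = self.find(p);
            self.parent[v] = root; // path compression
        }
        self.parent[v]
    }

    /// Returns false if a and b are already in the same set;
    /// otherwise merges the two sets and returns true.
    fn unite(&mut self, a: usize, b: usize) -> bool {
        let (a, b) = (self.find(a), self.find(b));
        if a == b {
            return false;
        }
        self.parent[a] = b;
        true
    }
}
```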

The code comes out roughly like this:

fn gen_figures_mask(max_cnt: usize) {
    let id = |row: usize, col: usize| row * max_cnt + col;

    let first_row_mask = (1 << max_cnt) - 1;
    let first_col_mask = (0..max_cnt).map(|row| 1 << id(row, 0)).sum::<i32>();

    for mask in 1i32..(1 << (max_cnt * max_cnt)) {
        if mask.count_ones() as usize > max_cnt
            || (mask & first_row_mask) == 0
            || (mask & first_col_mask) == 0
        {
            continue;
        }
        let mut dsu = Dsu::new(max_cnt * max_cnt);
        let mut num_comps = mask.count_ones() as usize;
        for r in 0..max_cnt {
            for c in 0..max_cnt {
                let id1 = id(r, c);
                if (1 << id1) & mask != 0 {
                    if r + 1 < max_cnt && ((1 << id(r + 1, c)) & mask) != 0 {
                        if dsu.unite(id1, id(r + 1, c)) {
                            num_comps -= 1;
                        }
                    }
                    if c + 1 < max_cnt && ((1 << id(r, c + 1)) & mask) != 0 {
                        if dsu.unite(id1, id(r, c + 1)) {
                            num_comps -= 1;
                        }
                    }
                }
            }
        }
        if num_comps == 1 {
            // TODO: return figure
        }
    }
}

All in all, the code is not very long, but it contains quite a lot of bit tricks and various +-1s where it is easy to make a mistake. Plus, it is not very clear in what format the answer should be returned.

Breadth-first search

Another possible way is to write a bfs over figures. We start with a figure of one cell. Then each time we add a new cell next to an existing one. Such figures are automatically connected.

But with this approach we may generate some figures several times. To prevent this, we store them in a hash table. Each figure is stored as a vector of cells sorted lexicographically.

But the problem of identical figures can still arise. For example, we started with the cell (0, 0) and then added either the cell (1, 0) or the cell (-1, 0). In both cases we got the same figure. To avoid this, we can require that the cell (0, 0) is always present in the figure and that it is lexicographically smaller than all the others.

Suppose we already have a Point structure that stores a pair (x, y), and the + operation is defined on it.
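Point is assumed to exist, so here is a hedged sketch of what it might look like. The `i32` field type, the exact derive list and the two `Add` impls are my guesses; the only things taken from the text are the fields x and y, the constant `Point::ZERO`, the + operation, and the lexicographic comparison.

```rust
use std::ops::Add;

/// A grid cell. The derived Ord compares x first and then y,
/// which gives exactly the lexicographic order the text relies on.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

impl Point {
    pub const ZERO: Point = Point { x: 0, y: 0 };
}

impl Add for Point {
    type Output = Point;

    fn add(self, other: Point) -> Point {
        Point { x: self.x + other.x, y: self.y + other.y }
    }
}

// Adding two references, which is what a loop over `fig.0.iter()` and
// `SHIFTS_4.iter()` would need in order to write `p + shift`.
impl<'a, 'b> Add<&'b Point> for &'a Point {
    type Output = Point;

    fn add(self, other: &'b Point) -> Point {
        *self + *other
    }
}
```

With this ordering, the check `new_point > Point::ZERO` rejects both (-1, 0) and (0, -1), so (0, 0) always stays the smallest cell of the figure.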

Then the code will be roughly like this:

#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Figure(Vec<Point>);

impl Figure {
    pub fn new(mut pts: Vec<Point>) -> Self {
        pts.sort();
        Self(pts)
    }
}

const SHIFTS_4: [Point; 4] = [
    Point { x: 0, y: 1 },
    Point { x: 0, y: -1 },
    Point { x: 1, y: 0 },
    Point { x: -1, y: 0 },
];

pub fn gen_figures(max_cnt: usize) -> Vec<Figure> {
    let start = Figure::new(vec![Point::ZERO]);
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(start.clone());
    queue.push_back(start);
    while let Some(fig) = queue.pop_front() {
        if fig.0.len() == max_cnt {
            continue;
        }
        for p in fig.0.iter() {
            for shift in SHIFTS_4.iter() {
                let new_point = p + shift;
                if new_point > Point::ZERO && !fig.0.contains(&new_point) {
                    let next_figure = Figure::new([fig.0.clone(), vec![new_point]].concat());
                    if !seen.contains(&next_figure) {
                        seen.insert(next_figure.clone());
                        queue.push_back(next_figure);
                    }
                }
            }
        }
    }
    seen.into_iter().collect()
}

This code clearly has less bit magic, and it is harder to make a mistake in it.

Bonus

The first method was fast enough because 2^16 is not that much. But if we needed to generate figures with more cells, it would run significantly longer.

The second method creates a separate vector for each figure and puts it into a hash table, which is also not the best solution in terms of speed.

In fact, there is a way to generate polyominoes that runs in time proportional to their number and does not require a hash table. Coming up with it is left as an exercise.

#cp

#rust

If you have read this far, subscribe to my channel https://t.me/bminaiev_blog

Local Optimizations Is All You Need

bminaiev — Tue, 19 Jul 2022 16:11:21 GMT

Recently maroonrk said that the problem ARC142E was not accepted into an AtCoder Grand Contest (only into a Regular one) because the organizers were afraid that some heuristic solutions might pass. I could not calmly walk past such a statement, so I urgently had to squeeze some hack through on this problem.

The problem is as follows. There are two arrays a and b (up to 100 elements each, values up to 100), and a list of pairs (i, j). We need to increase elements of a so that for each pair (i, j) from the list at least one of two conditions holds:

a[i] >= b[i] && a[j] >= b[j].
a[i] >= b[j] && a[j] >= b[i].

Each time we increase a[i] by 1, we pay one coin. We need to minimize the number of coins spent.

General idea

Suppose we somehow learned in what order the a[i] must be sorted at the end. Then restoring the optimal a[i] themselves is fairly easy. Let the sorting information be given as an array rank such that rank[i] < rank[j] means a[i] < a[j]. Then if there is a constraint on the pair (i, j), it is optimal to make a[i] at least min(b[i], b[j]) and a[j] at least max(b[i], b[j]).
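The rule above can be sketched as a standalone function. This is only an illustration of the stated rule, not the code from the actual submission: the name `cost_for_rank`, the integer `rank` with distinct values, and the `(usize, usize)` pair format are all assumptions made for the example.

```rust
/// Cost of satisfying every pair when the final order of `a` is fixed
/// by `rank` (assumed to hold distinct values): the lower-ranked element
/// of a pair is raised to at least min(b[i], b[j]), the higher-ranked
/// one to at least max(b[i], b[j]).
pub fn cost_for_rank(
    a: &[usize],
    b: &[usize],
    pairs: &[(usize, usize)],
    rank: &[usize],
) -> usize {
    let mut target = a.to_vec();
    for &(i, j) in pairs {
        let (lo, hi) = if rank[i] < rank[j] { (i, j) } else { (j, i) };
        let small = b[i].min(b[j]);
        let big = b[i].max(b[j]);
        target[lo] = target[lo].max(small);
        target[hi] = target[hi].max(big);
    }
    // Values are only ever raised, so each difference is non-negative.
    target.iter().zip(a).map(|(t, s)| t - s).sum()
}
```

For a = [4, 1], b = [3, 5] and the single pair (0, 1), the order rank = [0, 1] costs 4 coins and the order rank = [1, 0] costs 3, so the choice of rank really does matter.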

The simplest version: generate a random permutation rank, build the answer from it, and compute the cost. Repeat while there is time left. This solution gets a few AC, but mostly WA:

Local optimizations

Local optimizations are a very cool technique that is often used in marathon-style problems, but sometimes it can be applied in regular contests too. The idea is to slightly modify the current solution and keep only those changes that improve the final result.

For permutations, a typical example of a local optimization is to swap two adjacent elements. Another possible optimization is to take a random element and move it to a random place. That is the one we will use.

To keep the implementation simple, we make the rank array not a permutation but an array of doubles. Then the main part of the solution looks like this:

let calc_score = |rank: &[f64]| -> usize {
    let mut cur_values = init_values.to_vec();
    let mut cur_res = 0;
    for &(fr, to) in all_restrictions.iter() {
        let need = max(need[fr], need[to]);
        if rank[fr] >= rank[to] && cur_values[fr] < need {
            cur_res += need - cur_values[fr];
            cur_values[fr] = need;
        }
    }
    cur_res
};

let mut rnd = Random::new(787788);
let mut rank = gen_vec(n, |_| rnd.gen_double());
let mut best_res = calc_score(&rank);

let start_time = Instant::now();
while start_time.elapsed().as_millis() < 1000 {
    let pos = rnd.gen(0..n);
    let new_val = rnd.gen_double();
    let old_val = rank[pos];
    rank[pos] = new_val;
    let new_res = calc_score(&rank);
    if new_res <= best_res {
        best_res = new_res;
    } else {
        rank[pos] = old_val;
    }
}

This solution already works much better:

During upsolving on AtCoder you can see the test names; in this case they look like this (only part of the tests is shown here):

On the one hand, this solution passes almost all random tests, but on the other hand there are still many "maximal" tests that get WA. But there is one encouraging point: some maximal test did pass:

Optimizations

Often when writing local optimizations you need to be able to quickly recompute the answer after each change. Right now we do it in O(n^2), because we have to look at every edge. But since we change the rank of only one element, the answer can be recomputed in O(n).

The optimization process itself is fairly boring, so we will not describe it here.

But it is important to understand that it is best to start with a simple solution to check that local optimizations work at all, and only then start optimizing.

After the optimizations the result was this:

After changing the rand seed, the total result stayed the same, but the breakdown by tests was different:

From this one could conclude that there is a chance :)

The idea was that some constant-factor optimizations could let the solution check more permutations and eventually get AC.

After fairly long optimizations the result was roughly this:

And depending on the rand seed, a different test fails...

The last optimization

Local optimizations are good, but it is even better when the initial solution is already generated by a good heuristic. I wanted to somehow take into account that if the initial value of an element is already large, then its rank should most likely be larger too (but not always).

In the end, code like this worked:

fn gen_initial_ranks(rnd: &mut Random, start: &[usize], need: &[usize]) -> Vec<usize> {
    let n = start.len();

    let mx = rnd.gen(10..200i32);
    let a = rnd.gen(-mx..mx);
    let b = rnd.gen(-mx..mx);
    let mut scores = gen_vec(n, |pos| {
        (
            pos,
            start[pos] as i32 * a + need[pos] as i32 * b + rnd.gen(0..mx * mx),
        )
    });
    scores.sort_by_key(|(_, y)| *y);
    let mut ranks = vec![0; n];
    for i in 0..n {
        ranks[scores[i].0] = i;
    }
    ranks
}

After that the solution passes with a fairly large margin in time:

Conclusion

It is better to write proper solutions. But when the correct solution does not come to mind, it is important to also be able to squeeze through something the problem authors did not intend.

#cp

#optimizations

Reproducible benchmarks

bminaiev — Wed, 13 Jul 2022 17:22:35 GMT

Recently I wanted to understand how best to write a function that takes two arrays of type u8 and returns the first position where they differ.

For simplicity we assume the arrays have the same size. The simplest version in Rust looks like this:

pub fn mismatch(s: &[u8], t: &[u8]) -> Option<usize> {
    s.iter().zip(t.iter()).position(|(x, y)| x != y)
}

But, as can be seen in Compiler Explorer, this code is not vectorized and honestly checks the characters one by one.

How can we do better?

From a theoretical point of view, everything is fairly simple. Why compare one byte at a time if we can compare several at once? For example, in C++ we could cast char * to int * and compare the arrays as arrays of ints. Essentially, this way we compare 4 bytes at a time. Once we have found the first differing group of four, a separate pass can find which of the 4 bytes differs.

We also need to handle the end of the array separately if the length is not divisible by 4.
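The "compare 4 bytes at a time" idea can be written in safe Rust without any pointer casts. This is a sketch only: `mismatch_words` is a name made up for the example, and it is not the version benchmarked later in the post.

```rust
/// Finds the first position where `s` and `t` differ by first skipping
/// equal 4-byte words and then scanning the remaining bytes one by one.
pub fn mismatch_words(s: &[u8], t: &[u8]) -> Option<usize> {
    const WORD: usize = 4;
    let len = s.len().min(t.len());

    let mut offset = 0;
    while offset + WORD <= len {
        // Pack 4 bytes from each array into a u32 and compare them at once.
        let a = u32::from_ne_bytes([s[offset], s[offset + 1], s[offset + 2], s[offset + 3]]);
        let b = u32::from_ne_bytes([t[offset], t[offset + 1], t[offset + 2], t[offset + 3]]);
        if a != b {
            break; // the mismatch is somewhere inside this word
        }
        offset += WORD;
    }

    // Byte-by-byte pass: covers the differing word, or the tail when
    // the length is not divisible by 4.
    (offset..len).find(|&i| s[i] != t[i])
}
```

For example, `mismatch_words(b"abcdefghij", b"abcdefghiX")` skips two full words and finds the difference in the 2-byte tail at position 9.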

In fact, processors can handle more than 4 bytes at a time. ymm registers can hold 256 bits at once (8 times more than an i32!). So we can split the original arrays into blocks of 32 bytes, compare them using simd, and once we find a differing block, find the needed byte naively.

chunks_exact

Rust has a special module with intrinsics for using simd directly, but the resulting code is rather hard to read (and not very pleasant to write either).

It would be much better if the compiler could figure out by itself that simd can be used, and do it clearly better than we would by hand (taking into account which instructions the current processor knows and how fast they are). Unfortunately, the compiler optimizes neither the code above nor its C++ analogue, std::mismatch.

But sometimes you can write the code a little differently so that the compiler understands exactly how to optimize it. In this case, slice has a function chunks_exact that splits the array into pieces of the same size (plus some tail). After that the compiler can understand that all pieces have the same size and can be compared using simd instructions.

The final code looks like this:

pub fn mismatch_fast(s: &[u8], t: &[u8]) -> Option<usize> {
    let len = s.len();

    const CHUNK_SIZE: usize = 32;
    let offset = s
        .chunks_exact(CHUNK_SIZE)
        .zip(t.chunks_exact(CHUNK_SIZE))
        .position(|(c1, c2)| c1 != c2)
        .unwrap_or(len / CHUNK_SIZE)
        * CHUNK_SIZE;

    s[offset..]
        .iter()
        .zip(t[offset..].iter())
        .position(|(c1, c2)| c1 != c2)
        .map(|x| x + offset)
}

If you look at the generated assembly, everything seems as expected: there is a loop that first loads data from one array into the ymm0 register, then xors it with data from the other array, checks whether the result is 0, advances by 32 bytes in both arrays, and moves on to the next iteration:

A normal person's loop

Measuring speed

One of the recommended ways to measure the performance of Rust code is cargo bench.

I wrote a test:

#![feature(test)]
extern crate test;

#[inline(never)]
pub fn mismatch(s: &[u8], t: &[u8]) -> Option<usize> {
    s.iter().zip(t.iter()).position(|(x, y)| x != y)
}

#[inline(never)]
pub fn mismatch_fast(s: &[u8], t: &[u8]) -> Option<usize> {
    let len = s.len();

    const CHUNK_SIZE: usize = 32;
    let offset = s
        .chunks_exact(CHUNK_SIZE)
        .zip(t.chunks_exact(CHUNK_SIZE))
        .position(|(c1, c2)| c1 != c2)
        .unwrap_or(len / CHUNK_SIZE)
        * CHUNK_SIZE;

    s[offset..]
        .iter()
        .zip(t[offset..].iter())
        .position(|(c1, c2)| c1 != c2)
        .map(|x| x + offset)
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    fn gen_inputs() -> (Vec<u8>, Vec<u8>) {
        const LEN: usize = 200_000;
        const CHANGED_POSITION: usize = 100_500;

        let s: Vec<u8> = (0..LEN).map(|x| (x % 10) as u8).collect();
        let mut t = s.clone();
        t[CHANGED_POSITION] = 1;
        assert_ne!(s, t);
        (s, t)
    }

    #[bench]
    fn simple(b: &mut Bencher) {
        let (s, t) = gen_inputs();
        b.iter(|| {
            mismatch(&s, &t).unwrap();
        });
    }

    #[bench]
    fn fast(b: &mut Bencher) {
        let (s, t) = gen_inputs();
        b.iter(|| {
            mismatch_fast(&s, &t).unwrap();
        });
    }
}

Running it:

$ cargo bench --quiet

running 2 tests
test tests::fast   ... bench:      48,570 ns/iter (+/- 423)
test tests::simple ... bench:      52,240 ns/iter (+/- 3,557)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 1.05s

Hmm, both versions take about 50 µs, but it should have been 32 times faster, shouldn't it? Well, okay, not 32 times, since simd instructions probably take a bit longer than ordinary comparisons, but not by that much!

Okay, let's run perf on the code that was supposed to be fast:

$ perf record cargo bench --quiet fast
...
$ perf report

I see roughly this there:

But where did that beautiful simple loop that we saw in Compiler Explorer go?

Learning to look at assembly properly

Looking at assembly in perf output is fun, of course, but there must be some simpler way?

There used to be cargo asm for this, which seems to be no longer maintained and works poorly with workspaces. But a tool called cargo-show-asm has appeared. That is what we will use.

$ cargo asm test_diff::mismatch_fast

In principle, this is very similar to the code that was in Compiler Explorer. But why did we see a different one in perf?

In fact, cargo bench and cargo asm use different compilation options (RUSTFLAGS), which can lead to such effects. Let's remove RUSTFLAGS entirely:

$ RUSTFLAGS="" cargo bench --quiet 

running 2 tests
test tests::fast   ... bench:       2,989 ns/iter (+/- 77)
test tests::simple ... bench:      52,195 ns/iter (+/- 579)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 1.72s

Hmm, now the fast version is really fast, but what was so bad in RUSTFLAGS?

In fact, my ~/.cargo/config says to optimize for the local processor. So this behavior can be reproduced like this:

$ RUSTFLAGS="-C target-cpu=native -O" cargo bench --quiet 

running 2 tests
test tests::fast   ... bench:      48,568 ns/iter (+/- 3,422)
test tests::simple ... bench:      52,410 ns/iter (+/- 1,354)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 1.10s

It is very suspicious that when we give the compiler more freedom, it starts generating slower code. Also, in Compiler Explorer we passed target-cpu=native too, and it generated normal code.

But let's still look at the code that is generated during cargo bench. They say you can simply pass --emit=asm in RUSTFLAGS, so let's do that:

$ RUSTFLAGS="-C target-cpu=native -O --emit=asm" cargo bench --quiet 

running 2 tests
test tests::fast   ... bench:       2,577 ns/iter (+/- 42)
test tests::simple ... bench:      53,892 ns/iter (+/- 2,685)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 2.76s

and then in target/release/deps/test_diff-af77c0bfbd72c9e5.s we can find the already familiar loop:

Wait, stop, this is the fast version, isn't it? We expected to see the wall of code that we saw in the perf output. Hmm, and judging by the test result, it really runs fast (even faster than when we passed an empty RUSTFLAGS). So it turns out that because we wanted to look at the asm and passed --emit=asm, everything started working faster?

In fact, --emit=asm implicitly forces the codegen-units=1 flag, so we can pass it directly and get the same effect:

$ RUSTFLAGS="-C target-cpu=native -O -C codegen-units=1" cargo bench --quiet 

running 2 tests
test tests::fast   ... bench:       2,501 ns/iter (+/- 164)
test tests::simple ... bench:      53,446 ns/iter (+/- 592)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 1.12s

Most likely it is also passed in Compiler Explorer, which is why we saw the fast version of the code there.

More optimizations

They say that the higher the optimization level you compile the code with, the faster it will run. But in fact, if we pass opt-level=3:

$ RUSTFLAGS="-C target-cpu=native -O -C codegen-units=1 -C opt-level=3" cargo bench --quiet 

running 2 tests
test tests::fast   ... bench:      49,108 ns/iter (+/- 970)
test tests::simple ... bench:      53,382 ns/iter (+/- 693)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 0.55s

it becomes slow again.

By the way, you can pass it in Compiler Explorer too and see what the optimizations lead to.

Conclusion

#rust

#simd

O(N^2)

bminaiev — Fri, 08 Jul 2022 19:47:18 GMT

Every time I see the constraint N≤10^5 in competitive programming problems, I want to check whether computers have become fast enough for O(N^2) to fit in the time limit.

I have not really seen tutorials on how to use SIMD in competitions, so I decided to write down my experience somewhere.

When I read problem F from Edu131, Time Limit = 6.5 s looked too attractive...

The problem

Leaving out the details, it can be reformulated as follows. There are two arrays, alive (whether the point is currently added to the set) and cnt (the number of alive points to the right at distance at most d), each of size 2·10^5. We need to be able to add points to the set and remove them from it. That is:

Update alive
On some segment to the left of the current point, add 1 to cnt or subtract 1 from it.

After each operation we need to compute the sum of cnt[i] * (cnt[i] - 1) / 2 over all alive points.

Baseline

Let me say right away that I am too lazy to do proper reproducible benchmarks, so the measurement error may be fairly large. Sometimes the measurements are from my laptop and sometimes from a run on CodeForces (where everything runs ~2 times slower and the environment is different). Also, all code examples are in Rust, but everything should be understandable.

We will test on the case where, for all i from 1 to N, point i is added to the set and all cnt values to the left of i need to be updated. This seems to be the maximal test possible in the original problem.

The simplest way to process one query looks roughly like this:

alive[query] = !alive[query];
let delta = if alive[query] { 1 } else { -1 };
for c in cnt[seg_start[query]..query].iter_mut() {
    *c += delta;
}
let res = cnt
    .iter()
    .zip(alive.iter())
    .map(|(&cnt, &alive)| if alive { cnt * (cnt - 1) / 2 } else { 0 })
    .sum();

Locally it runs in ~11.1 s; on CodeForces this would be about 20 seconds, so it is definitely not an option.

Simple optimizations

Some people believe that an if in the hottest part of the code is always bad, and that replacing it with, say, a multiplication makes everything much faster. In fact this is not so, and in our case everything would only get slower. At least on this test alive is always true, so the branch predictor predicts this branch very well, and it hardly affects the speed.

Division, however, is very bad, so if we do not divide by two inside map but do it only once at the end, the test runs in ~8.5 s.
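Moving the division out of the loop is safe because cnt * (cnt - 1) is always even, so nothing is lost by summing first and halving once. A small sketch of both variants follows; the function names are invented for the example, and `i64` is assumed for cnt as in the helper functions shown later.

```rust
/// Baseline: divides by two for every alive element.
pub fn total_divide_each(alive: &[bool], cnt: &[i64]) -> i64 {
    cnt.iter()
        .zip(alive.iter())
        .map(|(&c, &a)| if a { c * (c - 1) / 2 } else { 0 })
        .sum()
}

/// Same value, but with a single division after the sum.
/// Each term c * (c - 1) is even, so the sum is even as well.
pub fn total_divide_once(alive: &[bool], cnt: &[i64]) -> i64 {
    let doubled: i64 = cnt
        .iter()
        .zip(alive.iter())
        .map(|(&c, &a)| if a { c * (c - 1) } else { 0 })
        .sum();
    doubled / 2
}
```

On alive = [true, false, true, true] and cnt = [3, 10, 0, 5] both functions return 13.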

Perf

Knowing how to use perf can help a lot in finding problematic pieces of code. But quite often perf outputs some nonsense, and you need to know how to fix that. Basic rules for using perf with Rust:

In the Cargo.toml of the workspace or project, be sure to add:

[profile.release]
debug = 1

Otherwise it will not show the source code lines that correspond to the asm.

Add to ~/.cargo/config:

[build]
rustflags = "-C force-frame-pointers=yes"

Perhaps this way perf will better understand which function was called from where and will build better stack traces. It should not particularly affect speed.

perf record must be run with the -g flag so that stack traces are recorded.
You can also add --call-graph dwarf to perf record, but for me perf report takes a long time to start after that for some reason.

So, in perf for the current version there are two hot spots:

Updating cnt

Computing the current result

What conclusions can we draw right away?

Judging by the use of ymm registers and the letter p (packed) in the names of some instructions, there is already SIMD here! The compiler is smart (?)
Recomputing the result clearly takes more time than updating cnt (the left column shows the percentage of time the program spent on that line).

Simplifying (life for the processor, not the code)

How can we improve the computation of the result?

First, we can recompute it only for the part of the array where it changed (a 2x optimization on our test!).

Second, we can compute only the difference between the old result and the new one; then there is no need for multiplications in the innermost loop.

The resulting code is clearly not the easiest to understand:

alive[query] = !alive[query];

let delta0 = if alive[query] { 0 } else { -1 };
let delta = if alive[query] { 1 } else { -1 };

res += cnt[seg_start[query]..query]
    .iter()
    .zip(alive[seg_start[query]..query].iter())
    .map(|(&cnt, &alive)| if alive { cnt + delta0 } else { 0 })
    .sum::<i64>()
    * 2
    * delta;
    
res += delta * cnt[query] * (cnt[query] - 1);
    
for c in cnt[seg_start[query]..query].iter_mut() {
    *c += delta;
}

But the main ideas are fairly simple:

Keep one global variable res across all queries.
Handle the contribution of the current point separately.
(Up to +-1) add to the answer the sum of cnt over the alive points to the left, because that is how much cnt[i] * (cnt[i] - 1) changes when cnt[i] changes by 1.
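The "up to +-1" part can be made explicit. Writing c for the value of cnt[i] before the update, the change of c(c - 1) in the two cases is:

```latex
\begin{aligned}
\text{adding a point } (c \to c + 1):\quad & (c+1)\,c - c\,(c-1) = 2c,\\
\text{removing a point } (c \to c - 1):\quad & (c-1)(c-2) - c\,(c-1) = -2\,(c-1).
\end{aligned}
```

Both cases have the form 2 * delta * (c + delta0), with delta = 1, delta0 = 0 when adding and delta = -1, delta0 = -1 when removing, which matches the constants in the code above.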

This solution runs locally in ~4.7 s, which is very encouraging. But on CodeForces it does not fit even in 15 s, what a pity.

What is wrong with CodeForces?

When you try to optimize something, it is useful to extract the important piece of code into a separate function that depends on nothing else and has a clear interface. In our case we extract two functions:

pub fn add_const(arr: &mut [i64], delta: i64) {
    for val in arr.iter_mut() {
        *val += delta;
    }
}

pub fn calc_res(alive: &[bool], cnt: &[i64], delta0: i64) -> i64 {
    cnt.iter()
        .zip(alive.iter())
        .map(|(&cnt, &alive)| if alive { cnt + delta0 } else { 0 })
        .sum()
}

After that you can look at them in Compiler Explorer. For example, you can notice that add_const uses xmm registers (usually a good sign), while calc_res does not. That is, calc_res uses no SIMD magic at all. But why?

By default the Rust compiler is very conservative about which instructions it uses. This is needed so that a program compiled on one computer can run on another. Even if your processor is brand new and supports a bunch of cool fast instructions, by default Rust will use the old and proven ones instead.

If you are sure the program will run on the same hardware it is compiled on, you can pass -C target-cpu=native on the compile command line. You can do this in Compiler Explorer too and see that calc_res now uses xmm/ymm registers.

But the problem is that we cannot change the flags our program is compiled with on CodeForces. However, Rust lets you tell the compiler from inside the code to use the fancy instructions. If at run time it turns out that they are not available, the program will crash somehow, so such code automatically becomes unsafe. Roughly like this:

#[target_feature(enable = "avx2")]
pub unsafe fn add_const(arr: &mut [i64], delta: i64) {
    for val in arr.iter_mut() {
        *val += delta;
    }
}

#[target_feature(enable = "avx2")]
pub unsafe fn calc_res(alive: &[bool], cnt: &[i64], delta0: i64) -> i64 {
    cnt.iter()
        .zip(alive.iter())
        .map(|(&cnt, &alive)| if alive { cnt + delta0 } else { 0 })
        .sum()
}
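If you are not sure what the judge's CPU supports, the unsafe call can be guarded at run time with the standard `is_x86_feature_detected!` macro. This is a sketch of that pattern rather than what was submitted; the names `calc_res_scalar`, `calc_res_avx2` and `calc_res_checked` are made up for the example.

```rust
/// Plain version, compiled with the default (conservative) instruction set.
fn calc_res_scalar(alive: &[bool], cnt: &[i64], delta0: i64) -> i64 {
    cnt.iter()
        .zip(alive.iter())
        .map(|(&c, &a)| if a { c + delta0 } else { 0 })
        .sum()
}

/// Same body, but the compiler is allowed to use AVX2 inside it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn calc_res_avx2(alive: &[bool], cnt: &[i64], delta0: i64) -> i64 {
    cnt.iter()
        .zip(alive.iter())
        .map(|(&c, &a)| if a { c + delta0 } else { 0 })
        .sum()
}

/// Safe wrapper: uses the AVX2 version only when the CPU reports support.
pub fn calc_res_checked(alive: &[bool], cnt: &[i64], delta0: i64) -> i64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked on the line above.
            return unsafe { calc_res_avx2(alive, cnt, delta0) };
        }
    }
    calc_res_scalar(alive, cnt, delta0)
}
```

The check has a small cost, so in a hot loop it is better to do it once per query (or once per program) rather than per element.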

The same problem exists on other judging systems, but you need to be careful and use only the extensions that are actually available there. For example, on Yandex Contest progress stopped at

#[target_feature(enable = "sse2")]

The version of the code that uses avx2 already runs in 7.7 s in a custom run on CF, rather than more than 15 s as before! Let me remind you that the TL in the problem is 6.5 s, so only a little bit is left.

64/32

Notice that cnt always fits in 32 bits, so we can use [i32]. Unfortunately, the sum of the elements no longer fits in i32, so we must not forget to add many casts to i64 throughout the code.

This code already runs in about 6.5 s in a custom run on CF. Perhaps the reason is that more 32-bit numbers fit into one xmm/ymm register. Or perhaps CF still uses 32-bit something and that somehow matters? But the fact remains: the optimization really helps.

Unfortunately, these 6.5 s from the custom run do not account for reading the input, writing the output, and random variation in running time from test to test, so we need to optimize a bit more.

The last optimization

I still wanted to make alive of type [i32] and replace the if with a multiplication, but that still only slowed the program down.

But in fact, instead of multiplication we can use bit operations. For example, say that alive stores either 0 or -1, and do an & with it when computing the sum. The final version looks something like this:

#[target_feature(enable = "avx2")]
pub unsafe fn calc_res(alive: &[i32], cnt: &[i32], delta0: i32) -> i64 {
    cnt.iter()
        .zip(alive.iter())
        .map(|(&cnt, &alive)| (alive & (cnt + delta0)) as i64)
        .sum()
}

It runs in ~6.2 s and should get AC reliably.

Conclusion

Most likely the target audience of this text is 1 person, and that is me, but at least I wrote it down :)

You can discuss it here or on CF.

#rust

#cp

#simd